Sources

A source is one location KDBL Context Lake (K-Lake) indexes: an S3 bucket, an SMB share, an NFS export. This page covers the full lifecycle — adding, listing, configuring, and removing sources — across the UI, CLI, and API.

The source model

Every source has:

| Field | Description |
| --- | --- |
| source_id | Stable, unique identifier within your tenant. Use a URI-style name such as s3://my-bucket or smb://nas.corp/finance. |
| protocol | One of s3, smb, smbfs, nfs. |
| config | Protocol-specific connection settings. See the protocol sections below. |
| Credentials | Provided once at creation. Stored encrypted at rest. |
| enabled | When false, workers stop crawling this source. Defaults to true. |
| bulk_ingest | When true, uses the optimized first-crawl write path. Defaults to true. |
| meta_caps | Set of optional enrichments to gather (S3 tags, NTFS / NFSv4 ACLs, xattrs). |

Sources are tenant-scoped. Users in other tenants cannot see or address them.

Adding sources

S3

S3-compatible object stores including AWS S3, MinIO, Wasabi, and on-prem gateways.

Required: bucket name. Optional: endpoint_url, region, force_path_style (for MinIO-style gateways).

Credentials: access key ID + secret access key, or leave blank to use ambient credentials (IRSA, environment variables on the worker pod).

CLI:

echo "<secret-access-key>" | kdbl-control \
  --api-url "$KDBL_URL" --api-token "$KDBL_TOKEN" \
  source add-s3 \
  --source-id 's3://my-bucket' \
  --bucket my-bucket \
  --region us-east-1 \
  --access-key-id AKIA... \
  --secret-access-key-stdin

API:

curl -X POST -H "Authorization: Bearer $KDBL_TOKEN" \
     -H "Content-Type: application/json" \
     "$KDBL_URL/api/sources" \
     -d '{
       "source_id": "s3://my-bucket",
       "protocol": "s3",
       "config": { "bucket": "my-bucket", "region": "us-east-1" },
       "secret": { "access_key_id": "AKIA...", "secret_access_key": "..." }
     }'
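
For a MinIO-style gateway, the optional fields listed above would presumably travel in the same config object. The sketch below is an illustration under stated assumptions, not a confirmed payload: the helper name minio_source_payload is hypothetical, the placement of endpoint_url and force_path_style inside config (and the boolean type of force_path_style) is inferred from the field list rather than documented here, and the secret object is left out on the assumption that ambient credentials apply.

```shell
# minio_source_payload BUCKET ENDPOINT_URL
# Prints a JSON body for POST /api/sources (hypothetical helper).
# ASSUMPTION: endpoint_url and force_path_style live inside "config".
# Values are interpolated as-is, so they must not contain quotes or backslashes.
minio_source_payload() {
  cat <<EOF
{
  "source_id": "s3://$1",
  "protocol": "s3",
  "config": {
    "bucket": "$1",
    "endpoint_url": "$2",
    "force_path_style": true
  }
}
EOF
}
```

The output can be piped to the same curl call as above by replacing the inline body with -d @- so that curl reads the payload from stdin.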

SMB (userspace)

For SMB / CIFS shares accessed without a kernel mount.

Required: server, share. Optional: domain.

Credentials: username + password.

CLI:

echo "<password>" | kdbl-control \
  --api-url "$KDBL_URL" --api-token "$KDBL_TOKEN" \
  source add-smb \
  --source-id 'smb://nas.corp/finance' \
  --server nas.corp --share finance \
  --domain CORP \
  --username svc-indexer \
  --password-stdin

SMBFS (kernel mount)

For SMB / CIFS shares mounted via the kernel CIFS client. Higher throughput than userspace SMB for large shares.

Required: server, share. Optional: domain, vers (defaults to 3.1.1), max_channels, extra_opts, backup_intent (defaults to false).

CLI:

echo "<password>" | kdbl-control \
  --api-url "$KDBL_URL" --api-token "$KDBL_TOKEN" \
  source add-smbfs \
  --source-id 'smbfs://nas.corp/finance' \
  --server nas.corp --share finance \
  --username svc-indexer \
  --password-stdin

Backup-operator access (bypass file ACLs)

By default, the extractor sees only the files that the share's per-file ACLs grant to the configured account, so indexing everything would otherwise mean re-permissioning the data. To avoid that, enable backup-operator intent and grant the service account the backup privilege on the NAS:

  • What it does: adds backupuid=/backupgid= to the CIFS mount, so every file open carries FILE_OPEN_FOR_BACKUP_INTENT. The server honours it (bypassing per-file ACLs) only if the account holds SeBackupPrivilege, that is, if it is a member of the server's Backup Operators group (or the vendor equivalent on NetApp / EMC Isilon / HPE). No file ACL changes are required.
  • How to enable: pass --backup-intent to source add-smbfs, tick "Backup operator access" in the New Source form, or set "backup_intent": true in the config JSON. To flip it on an existing source without re-adding it, use the toggle (API mode), which re-mounts within ~30 s:

kdbl-control --api-url "$KDBL_URL" --api-token "$KDBL_TOKEN" \
  source set-backup-intent --source-id 'smbfs://nas.corp/finance' --enabled true
  • Fallback (never-regress): if the worker's kernel doesn't understand the option, it logs a warning and remounts without it, so a source that mounts today never breaks. If the account lacks the privilege on the server, the server simply applies the normal ACL check (ACL-readable files stay readable).
  • Not on userspace SMB: the userspace smb backend cannot send backup intent; the API rejects backup_intent there with a pointer to smbfs. If a NAS hard-rejects backup intent for non-privileged accounts, leave the flag off.

CLI (new source with backup intent enabled):

echo "<password>" | kdbl-control \
  --api-url "$KDBL_URL" --api-token "$KDBL_TOKEN" \
  source add-smbfs \
  --source-id 'smbfs://nas.corp/finance' \
  --server nas.corp --share finance \
  --username svc-backup \
  --backup-intent \
  --password-stdin

NFS

NFSv3 and NFSv4 exports mounted into the worker.

Required: server, export (must start with /). Optional: vers (defaults to 4.2), sec (defaults to sys), nconnect (defaults to 16), extra_opts.

CLI:

kdbl-control \
  --api-url "$KDBL_URL" --api-token "$KDBL_TOKEN" \
  source add-nfs \
  --source-id 'nfs://nas.corp/export/data' \
  --server nas.corp \
  --export /export/data
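
Because the export must start with /, a script that adds many NFS sources may want to check the path before calling the CLI. A minimal sketch of such a guard; the function name valid_nfs_export is hypothetical, and it checks only the leading slash, which is the one rule stated above:

```shell
# valid_nfs_export PATH
# Succeeds when PATH begins with "/", otherwise prints a message to stderr
# and returns 1. It does not contact the server or check that the export exists.
valid_nfs_export() {
  case $1 in
    /*) return 0 ;;
    *)  echo "export must start with /: $1" >&2
        return 1 ;;
  esac
}
```

A caller could chain it in front of the command above, for example valid_nfs_export "$export" && kdbl-control ... source add-nfs ... --export "$export".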

Listing sources

UI: the Sources page lists every source in your tenant with file count, bytes, and last crawl time.

CLI:

kdbl-control --api-url "$KDBL_URL" --api-token "$KDBL_TOKEN" source list

API:

curl -H "Authorization: Bearer $KDBL_TOKEN" "$KDBL_URL/api/sources"

Enabling and disabling

A disabled source stays in the registry but workers stop crawling it. Re-enabling resumes from the next crawl trigger.

UI: toggle the Enabled switch on the source detail page.

CLI:

kdbl-control source disable --source-id 's3://my-bucket'
kdbl-control source enable  --source-id 's3://my-bucket'

API:

curl -X POST -H "Authorization: Bearer $KDBL_TOKEN" \
     -H "Content-Type: application/json" \
     "$KDBL_URL/api/sources/<urlencoded-source-id>/enabled" \
     -d '{"enabled": false}'
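
The <urlencoded-source-id> placeholder means the source ID has to be percent-encoded before it goes into the path, since IDs such as s3://my-bucket contain : and /. The following is a minimal sketch of one way to do that in plain POSIX shell. The helper name urlencode is hypothetical, and it is only intended for ASCII IDs like those on this page.

```shell
# urlencode STRING
# Percent-encodes every character except the RFC 3986 unreserved set
# (letters, digits, "-", ".", "_", "~"). The body runs in a subshell so the
# working variables do not leak into the caller.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    case $c in
      [a-zA-Z0-9.~_-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;   # "'c" yields the character code
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

# Example: build the path segment for the endpoints on this page.
# id=$(urlencode 's3://my-bucket')     # s3%3A%2F%2Fmy-bucket
# curl ... "$KDBL_URL/api/sources/$id/enabled" ...
```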

Hybrid search (lexical vs hybrid)

Search has two stacks: lexical (full-text — always on) and hybrid, which adds a dense embedding arm + reranking on top. Hybrid is more accurate but costs more: embedding compute at ingest and vector index storage. You can run a tenant or an individual source lexical-only to keep that cost off, and turn hybrid on later.

The effective setting is resolved as source override ?? tenant default ?? off (the first value that is set wins), and the result is then ANDed with whether an embedder is deployed (KDBL_EMBED_ENDPOINT). New tenants default to lexical-only.
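
The resolution rule can be illustrated with a small shell function. This is a sketch of the logic as described here, not the product's implementation: the function name resolve_hybrid is hypothetical, an empty string stands in for "not set" (null), and a non-empty third argument stands in for "an embedder is deployed".

```shell
# resolve_hybrid SOURCE_OVERRIDE TENANT_DEFAULT EMBED_ENDPOINT
# SOURCE_OVERRIDE and TENANT_DEFAULT are "true", "false", or "" (not set).
# Prints "hybrid" or "lexical".
resolve_hybrid() (
  override=$1
  tenant_default=$2
  embed_endpoint=$3
  # First value that is set wins: source override, then tenant default, then off.
  setting=${override:-${tenant_default:-false}}
  # Hybrid only takes effect when an embedder endpoint is configured.
  if [ "$setting" = true ] && [ -n "$embed_endpoint" ]; then
    echo hybrid
  else
    echo lexical
  fi
)
```

For instance, a source with no override in a hybrid-enabled tenant resolves to hybrid only when the third argument is non-empty, while an explicit false override keeps the source lexical regardless of the tenant default.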

# Tenant default (tenant-admin / cluster-admin):
curl -X PATCH -H "Authorization: Bearer $KDBL_TOKEN" -H "Content-Type: application/json" \
     "$KDBL_URL/api/tenants/<slug>/hybrid" -d '{"enable_hybrid": true}'

# Per-source override (true/false to set, null to inherit the tenant default):
kdbl-control source search enable  --source-id 's3://my-bucket'
kdbl-control source search disable --source-id 's3://my-bucket'
kdbl-control source search inherit --source-id 's3://my-bucket'
kdbl-control source search show    --source-id 's3://my-bucket'

Timing: the toggle is not instantaneous for ingest. Queries honour the setting immediately, but new ingest picks up the change only within ~30 s, because workers read the resolved setting per task when work is dispatched. If you flip a source and immediately force a re-extract, the in-flight task can still use the previous setting. Wait ~30 s after toggling before (re-)extracting if you need the new setting to apply to that ingest.
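
In a script, the wait can be made explicit. The sketch below assumes that a crawl is how the re-extract is kicked off and that kdbl-control is already configured for your API; the function name and the KDBL_CLI and SETTLE_SECONDS variables are hypothetical conveniences, not part of the CLI.

```shell
# Hypothetical knobs: override the CLI binary or the settle time if needed.
KDBL_CLI=${KDBL_CLI:-kdbl-control}
SETTLE_SECONDS=${SETTLE_SECONDS:-30}

# enable_hybrid_then_crawl SOURCE_ID
# Turns on the per-source hybrid override, waits for workers to pick up the
# new setting, then triggers a crawl. Stops early if the toggle fails.
enable_hybrid_then_crawl() {
  "$KDBL_CLI" source search enable --source-id "$1" || return 1
  sleep "$SETTLE_SECONDS"
  "$KDBL_CLI" crawl --source-id "$1"
}
```

Calling enable_hybrid_then_crawl 's3://my-bucket' then runs the two commands about 30 seconds apart. The fixed sleep is the simplest option; it trades a short delay for not having to poll worker state.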

Disabling hybrid is non-destructive: existing embeddings are retained (and are harmless to lexical results) until you explicitly reclaim them with kdbl-control source search reclaim --source-id …, which NULLs them in the background. Re-enabling later needs a re-crawl/backfill to rebuild vectors. See /capacity (or kdbl-control capacity) for the retained-vector bytes and the per-stack storage split.

Adjusting metadata enrichment

Use source meta-caps (CLI) or the source detail page (UI) to choose which optional enrichments are gathered: S3 tags, NTFS ACLs, NFSv4 ACLs, extended attributes. Start narrow, because every additional cap adds work per file.

kdbl-control source meta-caps --source-id 's3://my-bucket' --caps s3_tags

You can also enqueue a backfill to retroactively enrich files that were indexed before a cap was added:

kdbl-control source backfill-meta --source-id 's3://my-bucket'

Triggering crawls

UI: click Crawl on the source detail page. Optionally narrow with a path prefix.

CLI:

kdbl-control crawl --source-id 's3://my-bucket'
kdbl-control crawl --source-id 's3://my-bucket' --prefix 'reports/2026/'

API:

curl -X POST -H "Authorization: Bearer $KDBL_TOKEN" \
     "$KDBL_URL/api/sources/<urlencoded-source-id>/crawl"

Removing sources

Removing a source deletes its registry entry and its indexed files from the catalog. This is not recoverable: to get back to a populated state, re-add the source and re-crawl it.

UI: Delete action on the source detail page (requires confirmation).

CLI:

kdbl-control source remove --source-id 's3://my-bucket'

API:

curl -X DELETE -H "Authorization: Bearer $KDBL_TOKEN" \
     "$KDBL_URL/api/sources/<urlencoded-source-id>"