Sources¶
A source is one location KDBL Context Lake (K-Lake) indexes: an S3 bucket, an SMB share, an NFS export. This page covers the full lifecycle — adding, listing, configuring, and removing sources — across the UI, CLI, and API.
The source model¶
Every source has:
| Field | Description |
|---|---|
| source_id | Stable, unique identifier within your tenant. Use a URI-style name such as s3://my-bucket or smb://nas.corp/finance. |
| protocol | One of s3, smb, smbfs, nfs. |
| config | Protocol-specific connection settings. See the protocol sections below. |
| Credentials | Provided once at creation. Stored encrypted at rest. |
| enabled | When false, workers stop crawling this source. Defaults to true. |
| bulk_ingest | When true, uses the optimized first-crawl write path. Defaults to true. |
| meta_caps | Set of optional enrichments to gather (S3 tags, NTFS / NFSv4 ACLs, xattrs). |
Sources are tenant-scoped. Users in other tenants cannot see or address them.
Adding sources¶
S3¶
S3-compatible object stores including AWS S3, MinIO, Wasabi, and on-prem gateways.
Required: bucket name. Optional: endpoint_url, region, force_path_style (for MinIO-style gateways).
Credentials: access key ID + secret access key, or leave blank to use ambient credentials (IRSA, environment variables on the worker pod).
CLI:
echo "<secret-access-key>" | kdbl-control \
--api-url "$KDBL_URL" --api-token "$KDBL_TOKEN" \
source add-s3 \
--source-id 's3://my-bucket' \
--bucket my-bucket \
--region us-east-1 \
--access-key-id AKIA... \
--secret-access-key-stdin
API:
curl -X POST -H "Authorization: Bearer $KDBL_TOKEN" \
-H "Content-Type: application/json" \
"$KDBL_URL/api/sources" \
-d '{
"source_id": "s3://my-bucket",
"protocol": "s3",
"config": { "bucket": "my-bucket", "region": "us-east-1" },
"secret": { "access_key_id": "AKIA...", "secret_access_key": "..." }
}'
SMB (userspace)¶
For SMB / CIFS shares accessed without a kernel mount.
Required: server, share. Optional: domain.
Credentials: username + password.
CLI:
echo "<password>" | kdbl-control \
--api-url "$KDBL_URL" --api-token "$KDBL_TOKEN" \
source add-smb \
--source-id 'smb://nas.corp/finance' \
--server nas.corp --share finance \
--domain CORP \
--username svc-indexer \
--password-stdin
SMBFS (kernel mount)¶
For SMB / CIFS shares mounted via the kernel CIFS client. Higher throughput than userspace SMB for large shares.
Required: server, share. Optional: domain, vers (defaults to 3.1.1), max_channels, extra_opts, backup_intent (defaults to false).
CLI:
echo "<password>" | kdbl-control \
--api-url "$KDBL_URL" --api-token "$KDBL_TOKEN" \
source add-smbfs \
--source-id 'smbfs://nas.corp/finance' \
--server nas.corp --share finance \
--username svc-indexer \
--password-stdin
Backup-operator access (bypass file ACLs)¶
By default the extractor sees only the files that the share's per-file ACLs grant to the configured account, so indexing everything would otherwise mean re-permissioning the data. Instead, enable backup-operator intent and grant the service account the backup privilege on the NAS:
- What it does: adds backupuid=/backupgid= to the CIFS mount, so every file open carries FILE_OPEN_FOR_BACKUP_INTENT. The server honours it (bypassing per-file ACLs) only if the account holds SeBackupPrivilege, i.e. is a member of the server's Backup Operators group (or the vendor equivalent on NetApp / EMC Isilon / HPE). No file ACL changes are required.
- How to enable: pass --backup-intent to source add-smbfs, tick "Backup operator access" in the New Source form, or set "backup_intent": true in the config JSON. To flip it on an existing source without re-adding it, use the toggle (API mode), which re-mounts within ~30 s:
kdbl-control --api-url "$KDBL_URL" --api-token "$KDBL_TOKEN" \
source set-backup-intent --source-id 'smbfs://nas.corp/finance' --enabled true
The userspace smb backend cannot send backup intent; the API rejects backup_intent there with a pointer to smbfs. If a NAS hard-rejects backup intent for non-privileged accounts, leave the flag off.
To add a new source with backup intent enabled from the start:
echo "<password>" | kdbl-control \
--api-url "$KDBL_URL" --api-token "$KDBL_TOKEN" \
source add-smbfs \
--source-id 'smbfs://nas.corp/finance' \
--server nas.corp --share finance \
--username svc-backup \
--backup-intent \
--password-stdin
NFS¶
NFSv3 and NFSv4 exports mounted into the worker.
Required: server, export (must start with /). Optional: vers (defaults to 4.2), sec (defaults to sys), nconnect (defaults to 16), extra_opts.
CLI:
kdbl-control \
--api-url "$KDBL_URL" --api-token "$KDBL_TOKEN" \
source add-nfs \
--source-id 'nfs://nas.corp/export/data' \
--server nas.corp \
--export /export/data
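The one hard constraint above, that export must start with /, is cheap to check before a request leaves your machine. The following is a minimal sketch that validates the export path and prints a JSON body modelled on the S3 API example. It assumes the API accepts protocol nfs with config keys named server and export, which this page does not confirm; nfs_source_payload is a hypothetical helper, not part of kdbl-control.

```shell
# Hypothetical helper: validate an NFS export path and print a request body.
# Assumption: the API takes the same shape as the S3 example, with
# "protocol": "nfs" and config keys "server" and "export".
# Values are assumed to contain no quotes or backslashes (no JSON escaping).
nfs_source_payload() (
  server=$1
  export_path=$2
  case $export_path in
    /*) ;;  # ok: starts with a slash
    *)
      echo "export must start with /: $export_path" >&2
      exit 1
      ;;
  esac
  printf '{"source_id":"nfs://%s%s","protocol":"nfs","config":{"server":"%s","export":"%s"}}\n' \
    "$server" "$export_path" "$server" "$export_path"
)

nfs_source_payload nas.corp /export/data
```

If the assumed shape holds, the output could be piped into the POST shown for S3 by passing -d @- to curl, which reads the body from standard input.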
Listing sources¶
UI: the Sources page lists every source in your tenant with file count, bytes, and last crawl time.
CLI:
API:
Enabling and disabling¶
A disabled source stays in the registry but workers stop crawling it. Re-enabling resumes from the next crawl trigger.
UI: toggle the Enabled switch on the source detail page.
CLI:
kdbl-control source disable --source-id 's3://my-bucket'
kdbl-control source enable --source-id 's3://my-bucket'
API:
curl -X POST -H "Authorization: Bearer $KDBL_TOKEN" \
-H "Content-Type: application/json" \
"$KDBL_URL/api/sources/<urlencoded-source-id>/enabled" \
-d '{"enabled": false}'
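The <urlencoded-source-id> placeholder matters because source IDs such as s3://my-bucket contain ":" and "/", which would otherwise be read as part of the URL path. A minimal sketch of percent-encoding in plain POSIX shell follows; urlencode is a hypothetical helper name, and the function assumes ASCII-only source IDs.

```shell
# Hypothetical helper: percent-encode a string for use as one URL path segment.
# Assumption: input is ASCII; multi-byte characters are not handled.
urlencode() (
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}         # everything after the first character
    c=${s%"$rest"}      # the first character itself
    case $c in
      [A-Za-z0-9._~-]) out="$out$c" ;;                  # unreserved: keep as-is
      *) out="$out$(printf '%%%02X' "'$c")" ;;          # everything else: %HH
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

encoded=$(urlencode 's3://my-bucket')
echo "/api/sources/$encoded/enabled"
# prints /api/sources/s3%3A%2F%2Fmy-bucket/enabled
```

The encoded value can then be substituted for the placeholder in the curl call, for example "$KDBL_URL/api/sources/$encoded/enabled".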
Hybrid search (lexical vs hybrid)¶
Search has two stacks: lexical (full-text — always on) and hybrid, which adds a dense embedding arm + reranking on top. Hybrid is more accurate but costs more: embedding compute at ingest and vector index storage. You can run a tenant or an individual source lexical-only to keep that cost off, and turn hybrid on later.
The effective setting is source override ?? tenant default ?? off, then ANDed with whether an embedder is deployed (KDBL_EMBED_ENDPOINT). New tenants default to lexical-only.
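The resolution rule can be illustrated with a small shell function. This is a sketch of the logic as described here, not the server's implementation: resolve_search_stack is a hypothetical name, an empty string stands in for "not set" (null), and the third argument is a plain true/false flag standing in for "an embedder is deployed".

```shell
# Hypothetical illustration of: source override ?? tenant default ?? off,
# then ANDed with whether an embedder is deployed.
# Arguments: SOURCE_OVERRIDE TENANT_DEFAULT EMBEDDER_DEPLOYED
#   each "true" or "false"; the first two may be "" to mean "not set".
resolve_search_stack() (
  override=$1
  tenant=$2
  embedder=$3
  # ${var:-fallback} falls through only when var is empty, so an explicit
  # "false" on the source still wins over a tenant default of "true".
  setting=${override:-${tenant:-false}}
  if [ "$setting" = true ] && [ "$embedder" = true ]; then
    echo hybrid
  else
    echo lexical
  fi
)

resolve_search_stack ""    true true    # inherits tenant default: hybrid
resolve_search_stack false true true    # source override wins:   lexical
resolve_search_stack true  true false   # no embedder deployed:   lexical
```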
# Tenant default (tenant-admin / cluster-admin):
curl -X PATCH -H "Authorization: Bearer $KDBL_TOKEN" -H "Content-Type: application/json" \
"$KDBL_URL/api/tenants/<slug>/hybrid" -d '{"enable_hybrid": true}'
# Per-source override (true/false to set, null to inherit the tenant default):
kdbl-control source search enable --source-id 's3://my-bucket'
kdbl-control source search disable --source-id 's3://my-bucket'
kdbl-control source search inherit --source-id 's3://my-bucket'
kdbl-control source search show --source-id 's3://my-bucket'
Timing — the toggle is not instantaneous for ingest. Queries honour the setting immediately. New ingest, however, only picks up the change within ~30 s — workers read the resolved setting per task when work is dispatched. So if you flip a source and immediately force a re-extract, the in-flight task can still use the previous setting. Wait ~30 s after toggling before (re-)extracting if you need the new setting to apply to that ingest.
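In a script, the wait can be made explicit between the toggle and the next ingest. The sketch below is one way to do that, using a crawl as a stand-in for whatever re-extract you trigger; enable_hybrid_then_crawl and the KDBL_CONTROL override variable are hypothetical, and the 30-second default simply mirrors the figure above.

```shell
# Hypothetical wrapper: enable hybrid on a source, wait for workers to pick
# up the new setting, then trigger a crawl.
# KDBL_CONTROL lets you substitute the binary (e.g. "echo" for a dry run).
enable_hybrid_then_crawl() (
  source_id=$1
  settle=${2:-30}                       # seconds to wait after the toggle
  cli=${KDBL_CONTROL:-kdbl-control}
  "$cli" source search enable --source-id "$source_id" || exit 1
  sleep "$settle"
  "$cli" crawl --source-id "$source_id"
)

# Dry run: print the commands instead of executing them, with no wait.
KDBL_CONTROL=echo
enable_hybrid_then_crawl 's3://my-bucket' 0
```

Unset KDBL_CONTROL and drop the second argument to run the real commands with the default wait.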
Disabling hybrid is non-destructive: existing embeddings are retained (and harmless to lexical results) until you explicitly reclaim them (kdbl-control source search reclaim --source-id …), which NULLs them in the background. Re-enabling later needs a re-crawl/backfill to rebuild vectors. See /capacity (or kdbl-control capacity) for the retained-vector bytes and the per-stack storage split.
Adjusting metadata enrichment¶
Use source meta-caps (CLI) or the source detail page (UI) to choose which optional enrichments are gathered: S3 tags, NTFS ACLs, NFSv4 ACLs, extended attributes. Start narrow — every additional cap adds work per file.
You can also enqueue a backfill to retroactively enrich files that were indexed before a cap was added:
Triggering crawls¶
UI: click Crawl on the source detail page. Optionally narrow with a path prefix.
CLI:
kdbl-control crawl --source-id 's3://my-bucket'
kdbl-control crawl --source-id 's3://my-bucket' --prefix 'reports/2026/'
API:
curl -X POST -H "Authorization: Bearer $KDBL_TOKEN" \
"$KDBL_URL/api/sources/<urlencoded-source-id>/crawl"
Removing sources¶
Removing a source deletes its registry entry and its indexed files from the catalog. This is not recoverable — re-add and re-crawl to get back to a populated state.
UI: Delete action on the source detail page (requires confirmation).
CLI:
API: