Directory enrichment & identity correlation¶

When a caller signs in, your IdP (Entra / Azure AD, Google, Okta, Keycloak) hands KDBL Context Lake (K-Lake) a set of claims: an object id, an email, a UPN, some group ids. When K-Lake indexes a file, the file carries a different kind of identity entirely — a POSIX uid/gid, or an NTFS / Active Directory SID. These two namespaces rarely match on their own. Directory enrichment is the layer that ties them together, so security trimming can decide whether a given caller is allowed to see a given file.

This page covers the correlation model, the four edge sources that build it, the CLI for configuring and running them, the automatic worker refresh, and — candidly — what is and isn't manageable across CLI, API, and UI.

The correlation problem¶

A caller's token yields refs like oid:<issuer>:<guid>, email:alice@corp.com, upn:alice@corp.com. A file's ACL yields refs like sid::S-1-5-21-… or posixgid:<source>:5000. Trimming authorizes a caller for a file only when the caller's refs and the file's grant refs land on the same principal. Directory enrichment's job is to discover the edges that make that true — that the Entra group oid:…:<guid> is the on-prem sid::S-1-5-21-…, that the POSIX group posixgid:<source>:5000 is name:CORP:finance.

The principal_ref namespace¶

Every identity, whether caller-side or file-side, is normalized to a principal_ref string, generally of the form <kind>:<scope>:<value>; the table below gives the exact form for each kind. Emails and names are lowercased. SIDs are globally unique, so they use the empty scope (sid::<SID>).

Kind	Form	Origin
`oid`	`oid:<issuer>:<guid>`	OIDC subject or group object id; issuer-scoped
`email`	`email:<addr>`	Token email claim, or AD `mail`
`upn`	`upn:<userPrincipalName>`	Token UPN claim, or AD `userPrincipalName` (suffix-rewritten)
`sid`	`sid::<SID>`	NTFS / AD ACL; Entra `securityIdentifier` / `onPremisesSecurityIdentifier`
`posixuid`	`posixuid:<source>:<uid>`	NFS / POSIX file owner
`posixgid`	`posixgid:<source>:<gid>`	NFS / POSIX file group
`name`	`name:<directory>:<value>`	Name-based IdP groups (Okta / Keycloak); resolved POSIX names
`nfs4who`	`nfs4who:<source>:<who>`	NFSv4 named ACE principal (e.g. `finance@corp`)
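To make the table concrete, here is a minimal sketch of parsing a principal_ref string into its parts. The `PrincipalRef` type and `parse_ref` helper are hypothetical names, not part of K-Lake. The sketch assumes that `email` and `upn` refs carry no scope segment (as their forms in the table suggest) and that a scope never contains a colon.

```python
from typing import NamedTuple


class PrincipalRef(NamedTuple):
    kind: str
    scope: str
    value: str


# Assumption: these kinds carry no scope segment, as the table's forms suggest.
UNSCOPED_KINDS = {"email", "upn"}
# Per the text above, emails and names are lowercased.
LOWERCASED_KINDS = {"email", "name"}


def parse_ref(ref: str) -> PrincipalRef:
    """Split a principal_ref string into (kind, scope, value)."""
    kind, sep, rest = ref.partition(":")
    if not sep or not rest:
        raise ValueError(f"not a principal_ref: {ref!r}")
    if kind in UNSCOPED_KINDS:
        scope, value = "", rest
    else:
        # Assumption: the scope itself contains no ':' character.
        scope, sep, value = rest.partition(":")
        if not sep or not value:
            raise ValueError(f"missing scope or value: {ref!r}")
    if kind in LOWERCASED_KINDS:
        value = value.lower()
    return PrincipalRef(kind, scope, value)
```

With this sketch, `parse_ref("sid::S-1-5-21-…")` yields an empty scope, while `parse_ref("name:CORP:Finance")` keeps the `CORP` scope and lowercases the value.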

The alias graph and fail-closed confidence¶

Correlation is stored as an alias graph: bidirectional edges a ⇄ b between two interned principals. At trim time, K-Lake takes the caller's refs and expands them across the alias graph to reach every principal they're equivalent to, then checks those against the file's grants.

Each edge carries a confidence: high or medium.

Confidence	Meaning	Used for grant expansion?
`high`	Exact identity key (SID⇄objectId, rewritten UPN, `mail`, declared mapping)	Yes
`medium`	Fuzzy fallback (e.g. `sAMAccountName` / display-name match)	No — recorded only

This is deliberately fail-closed: only high-confidence edges feed grant expansion. A medium edge is materialized and visible, but never authorizes anyone on its own. An operator who has verified a fuzzy correlation can promote it, but nothing fuzzy grants access by default.
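The fail-closed rule can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the `AliasGraph` class and `is_authorized` function are hypothetical, the real graph lives in the database, and expansion is assumed to follow high-confidence edges transitively.

```python
from collections import defaultdict, deque


class AliasGraph:
    """Toy alias graph: bidirectional edges, each tagged with a confidence."""

    def __init__(self):
        self._edges = defaultdict(list)  # ref -> [(neighbour, confidence), ...]

    def add_alias(self, a, b, confidence="high"):
        if confidence not in ("high", "medium"):
            raise ValueError(f"unknown confidence: {confidence!r}")
        self._edges[a].append((b, confidence))
        self._edges[b].append((a, confidence))

    def expand(self, refs):
        """Return every ref reachable from `refs` over high-confidence edges."""
        seen = set(refs)
        queue = deque(seen)
        while queue:
            current = queue.popleft()
            for neighbour, confidence in self._edges.get(current, ()):
                # Fail-closed: medium edges are stored but never followed.
                if confidence == "high" and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        return seen


def is_authorized(graph, caller_refs, grant_refs):
    """True when the expanded caller refs overlap the file's grant refs."""
    return not graph.expand(caller_refs).isdisjoint(grant_refs)
```

In this model a `medium` edge between `upn:bob@corp.com` and `name:CORP:bob` is visible in the graph, yet a file granted to `name:CORP:bob` stays hidden from Bob until someone promotes the edge to `high`.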

Correlation strategies¶

K-Lake builds the alias graph from four edge sources. You can use any combination; pick by what your environment can offer.

Strategy	Bridges	When to use
Declared mappings	Any ref ⇄ any ref	You know specific equivalences and want them deterministic and immediate
POSIX names	`posixuid` / `posixgid` ⇄ `name:<dir>:<…>`	NFS / POSIX sources where a name-based IdP (Okta, Keycloak) emits group names
Entra Graph	`oid:<iss>:<guid>` ⇄ `sid::<SID>`	The file ACLs' SIDs and the caller's Entra groups belong to the same (or AD-synced) tenant
AD / LDAP	`sid::<SID>` ⇄ `upn:` / `email:` / `name:`	The file server's AD is a separate, unlinked directory from the cloud IdP

Declared mappings¶

The deterministic, immediately usable path. You state {from, to, confidence} pairs in the tenant's oidc_config.principal_mappings; each pair becomes a bidirectional edge, high-confidence unless you mark it medium. Good for one-off equivalences you already know, or for promoting a correlation you've verified by hand.
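As an illustration, a list of declared mappings could be turned into edges roughly as follows. The `edges_from_mappings` helper is hypothetical. It assumes that an entry without a `confidence` field counts as `high`, which is one reading of the description above rather than confirmed behaviour.

```python
VALID_CONFIDENCE = ("high", "medium")


def edges_from_mappings(mappings):
    """Turn {from, to, confidence} entries into (a, b, confidence) edges.

    Each mapping yields two tuples, one per direction, because declared
    mappings are bidirectional.
    """
    edges = []
    for entry in mappings:
        # Assumption: a missing confidence field means "high".
        confidence = entry.get("confidence", "high")
        if confidence not in VALID_CONFIDENCE:
            raise ValueError(f"unknown confidence: {confidence!r}")
        edges.append((entry["from"], entry["to"], confidence))
        edges.append((entry["to"], entry["from"], confidence))
    return edges
```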

POSIX names¶

For per-file-trimmed NFS / POSIX sources, K-Lake resolves the uid/gid set captured during crawling to names (via the worker host's name-service resolution, e.g. on an AD-joined worker) and links posixgid:<source>:<gid> to name:<directory>:<group> — the ref a name-based IdP produces. This runs automatically; there's no separate CLI command.
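A rough sketch of what that resolution step amounts to, using Python's standard `grp` module as a stand-in for the worker host's name-service lookup. The function names and the `directory` argument are hypothetical, and K-Lake's actual resolver may work differently.

```python
from typing import Callable, Optional, Tuple


def resolve_group_name(gid: int) -> Optional[str]:
    """Look up a gid via the host's name service (NSS); None if unknown."""
    import grp  # Unix-only standard library module

    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return None


def posix_group_edge(
    source: str,
    gid: int,
    directory: str,
    resolver: Callable[[int], Optional[str]] = resolve_group_name,
) -> Optional[Tuple[str, str]]:
    """Build the posixgid <-> name edge for one gid, or None if unresolvable."""
    name = resolver(gid)
    if name is None:
        return None
    # Names are lowercased, matching the principal_ref normalization rule.
    return (f"posixgid:{source}:{gid}", f"name:{directory}:{name.lower()}")
```

The resolver is passed as an argument so the edge-building logic can be exercised without a real name service.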

Entra Graph discovery¶

For Entra / Azure AD, K-Lake uses Microsoft Graph (app-only / client-credentials) to discover three kinds of edge:

Groups — each group's objectId is aliased to its securityIdentifier and, when the group is AD-synced, its onPremisesSecurityIdentifier. Whichever SIDs are present become sid::<SID> edges to the caller's oid:<issuer>:<guid> group ref. This is how a file ACL'd to a group resolves to a caller's group membership.
Users — each synced user's onPremisesSecurityIdentifier is aliased to their upn: / email: ref. This bridges files a user owns (their user SID on the ACL) to the caller's UPN — the sync-native equivalent of the LDAP UPN path, but it reads the already-correct cloud UPN straight from Graph, so no --upn-rewrite is needed for a properly-synced tenant.
Membership (overage resolution) — for each SID-bearing group, K-Lake reads its transitive user members and adds a directed upn:<member> → group edge. This handles the JWT group overage: when a user belongs to more than ~200 groups, Entra omits the groups claim from the token (setting the hasgroups / _claim_sources marker) — but the caller's upn: ref still reaches their groups' grants through these pre-synced edges, with no per-request Graph call. The edges are directed, so two members of the same group never gain access to each other's personal files.
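The reason membership edges are directed can be shown with a toy traversal. This is a sketch only: `expand_refs` is a hypothetical helper, the SID placeholders are made up, and the membership edge is assumed to point at the group's `sid::` ref.

```python
from collections import defaultdict, deque


def expand_refs(start_refs, alias_edges, member_edges):
    """Expand refs over bidirectional alias edges and one-way member edges."""
    outgoing = defaultdict(set)
    for a, b in alias_edges:
        outgoing[a].add(b)
        outgoing[b].add(a)  # aliases can be walked in both directions
    for member, group in member_edges:
        outgoing[member].add(group)  # member -> group only, never back
    seen = set(start_refs)
    queue = deque(seen)
    while queue:
        current = queue.popleft()
        for nxt in outgoing.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical data: two users in the same group, each with a personal SID.
aliases = [
    ("upn:alice@corp.com", "sid::<alice-sid>"),
    ("upn:bob@corp.com", "sid::<bob-sid>"),
    ("oid:<issuer>:<group-guid>", "sid::<group-sid>"),
]
members = [
    ("upn:alice@corp.com", "sid::<group-sid>"),
    ("upn:bob@corp.com", "sid::<group-sid>"),
]
alice_reach = expand_refs({"upn:alice@corp.com"}, aliases, members)
```

Here `alice_reach` includes the group's refs but not `sid::<bob-sid>`. Had the membership edges been bidirectional, the walk would continue from the group back down to Bob and on to his personal SID.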

The app registration needs the Graph application permissions Group.Read.All, GroupMember.Read.All, and User.Read.All (all read-only) plus admin consent.

This works when the SIDs on the files belong to the same Entra tenant (or its AD-synced on-prem side). It does not bridge a SID that lives in a directory Entra has never seen — see below.

Correctly-configured-sync example. For a synced user danielle@kdbl.co.uk whose on-prem onPremisesSecurityIdentifier is S-1-5-21-…-1101: files she owns match via the user edge (sid::…-1101 ⇄ upn:danielle@kdbl.co.uk), and files granted to a group she's in (e.g. Management, on-prem SID …-1104) match via the group edge — even if her token carries no groups claim, thanks to the membership edge. No LDAP required.
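The group and user edges in this example could be derived from Graph objects along these lines. The dictionaries mimic the JSON shape of Microsoft Graph group and user resources (`id`, `securityIdentifier`, `onPremisesSecurityIdentifier`, `userPrincipalName`, `mail`); the helper names are hypothetical, and no Graph call is made here.

```python
def group_edges(issuer, group):
    """Alias a group's object id to whichever SIDs Graph reports for it."""
    oid_ref = f"oid:{issuer}:{group['id']}"
    edges = []
    for prop in ("securityIdentifier", "onPremisesSecurityIdentifier"):
        sid = group.get(prop)
        if sid:  # cloud-only groups have no on-premises SID
            edges.append((oid_ref, f"sid::{sid}", "high"))
    return edges


def user_edges(user):
    """Alias a synced user's on-premises SID to their upn: / email: refs."""
    sid = user.get("onPremisesSecurityIdentifier")
    if not sid:
        return []  # not a synced user, so there is nothing to bridge
    edges = [(f"sid::{sid}", f"upn:{user['userPrincipalName']}", "high")]
    mail = user.get("mail")
    if mail:
        edges.append((f"sid::{sid}", f"email:{mail.lower()}", "high"))
    return edges
```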

AD / LDAP correlation (unlinked directories)¶

The common hard case: the file server's AD (e.g. demo.kdbl.com, with S-1-5-21-… SIDs) is a separate directory from a cloud-only Entra tenant (kdbl.co.uk, with S-1-12-1-… SIDs). The two directories share no SIDs and use different UPN suffixes, so Graph SID-bridging alone cannot connect them — Entra has never heard of the on-prem SID.

LDAP correlation solves this. K-Lake binds to the on-prem DC over LDAPS, enumerates users and groups, and aliases each AD principal's objectSid to the caller-side ref the Entra token actually carries. The key knob is a UPN-suffix rewrite: an on-prem UPN like danielle@demo.kdbl.com is rewritten to the cloud UPN danielle@kdbl.co.uk, producing a high-confidence sid::<DEMO SID> ⇄ upn:danielle@kdbl.co.uk edge. mail, when present, is likewise high. The sAMAccountName match is a medium name fallback (recorded, never grants on its own).
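The UPN-suffix rewrite itself is a small string transformation. A minimal sketch, assuming suffixes are matched exactly and case-insensitively; `rewrite_upn` and `ldap_user_edge` are hypothetical names, not K-Lake functions.

```python
def rewrite_upn(upn, rewrites):
    """Replace the UPN suffix when it matches a configured from=to rewrite."""
    local, sep, suffix = upn.rpartition("@")
    if not sep:
        return upn  # not a user@suffix string; leave it untouched
    # Assumption: suffix matching ignores case; unmatched suffixes pass through.
    return f"{local}@{rewrites.get(suffix.lower(), suffix)}"


def ldap_user_edge(object_sid, upn, rewrites):
    """Build the high-confidence sid <-> upn edge for one AD user."""
    return (f"sid::{object_sid}", f"upn:{rewrite_upn(upn, rewrites)}", "high")
```

With `rewrites = {"demo.kdbl.com": "kdbl.co.uk"}`, `rewrite_upn("danielle@demo.kdbl.com", rewrites)` returns `danielle@kdbl.co.uk`, which is the UPN the Entra token carries.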

Because the worker binds over LDAPS, the deployment must trust the DC's CA in the worker's trust store.

Choosing a correlation strategy¶

Same / AD-synced Entra tenant → Entra Graph (set-graph).
Separate, unlinked on-prem AD vs cloud-only Entra → AD / LDAP with --upn-rewrite (set-ldap), or declared mappings for the handful of groups you care about. Graph alone will not connect them.
NFS / POSIX with a name-based IdP → POSIX names (automatic) plus declared mappings for anything the names don't cover.

CLI¶

All directory CLI commands run in direct-DB mode: they need direct database access (--postgres-url or KDBL_POSTGRES_URL) and, where secrets are involved, the master key (KDBL_MASTER_KEY). They are not proxied through the API, because the API process does not hold the master key. The non-secret config (the graph / ldap / principal_mappings blocks, minus the credentials) can also be set through the API or the Directory correlation card on the tenant detail page — see Managing directory enrichment.

Secrets are always read from an environment variable, never a flag — a deliberate design choice so credentials stay out of argv, shell history, and logs.

Discover Entra groups once: `directory sync-graph`¶

Runs Graph discovery immediately and materializes all three edge sets (group SIDs, user SIDs, and membership), without storing any config. Use it to validate credentials or do a one-shot sync. The client secret comes from KDBL_GRAPH_CLIENT_SECRET.

export KDBL_GRAPH_CLIENT_SECRET='<app-client-secret>'
kdbl-control --postgres-url "$KDBL_POSTGRES_URL" \
  directory sync-graph \
  --tenant <tenant-slug> \
  --entra-tenant <azure-ad-tenant-id> \
  --client-id <app-registration-client-id>

Store Entra config for automatic discovery: `directory set-graph`¶

Stores the non-secret config in oidc_config.graph and encrypts the client secret at rest using KDBL_MASTER_KEY. After this, the worker runs Graph discovery for the tenant automatically — you don't run sync-graph again.

export KDBL_GRAPH_CLIENT_SECRET='<app-client-secret>'
export KDBL_MASTER_KEY='<base64-master-key>'
kdbl-control --postgres-url "$KDBL_POSTGRES_URL" \
  directory set-graph \
  --tenant <tenant-slug> \
  --entra-tenant <azure-ad-tenant-id> \
  --client-id <app-registration-client-id>

Store AD/LDAP config: `directory set-ldap`¶

Stores the non-secret config in oidc_config.ldap and encrypts the bind password (from KDBL_LDAP_BIND_PASSWORD) at rest. --name-scope sets the scope segment of the name:<directory>:<value> fallback refs (typically the NetBIOS domain). --upn-rewrite is repeatable; each value takes the form from=to.

export KDBL_LDAP_BIND_PASSWORD='<ad-bind-password>'
export KDBL_MASTER_KEY='<base64-master-key>'
kdbl-control --postgres-url "$KDBL_POSTGRES_URL" \
  directory set-ldap \
  --tenant <tenant-slug> \
  --url 'ldaps://dc.demo.kdbl.com:636' \
  --bind-user 'DEMO\svc-kdbl' \
  --base-dn 'DC=demo,DC=kdbl,DC=com' \
  --name-scope DEMO \
  --upn-rewrite demo.kdbl.com=kdbl.co.uk

Once stored, the worker runs LDAP correlation automatically. To run it immediately, use sync-ldap below.

Run AD/LDAP correlation once: `directory sync-ldap`¶

Runs LDAP correlation immediately using the tenant's stored config (decrypting the stored secret with KDBL_MASTER_KEY). Useful right after set-ldap to materialize edges without waiting for the next refresh.

export KDBL_MASTER_KEY='<base64-master-key>'
kdbl-control --postgres-url "$KDBL_POSTGRES_URL" \
  directory sync-ldap \
  --tenant <tenant-slug>
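Because secrets travel through the environment, a wrapper script can keep every credential out of argv. The following is a sketch, assuming `kdbl-control` is on PATH and reads KDBL_POSTGRES_URL in place of the `--postgres-url` flag as described above; the function names are hypothetical.

```python
import os
import subprocess


def build_sync_ldap(tenant, postgres_url, master_key, base_env=None):
    """Return (argv, env) for a one-shot sync-ldap run.

    Only the tenant slug appears on the command line; the database URL and
    master key are placed in the child's environment.
    """
    argv = ["kdbl-control", "directory", "sync-ldap", "--tenant", tenant]
    env = dict(os.environ if base_env is None else base_env)
    env["KDBL_POSTGRES_URL"] = postgres_url
    env["KDBL_MASTER_KEY"] = master_key
    return argv, env


def run_sync_ldap(tenant, postgres_url, master_key):
    """Run the command and raise CalledProcessError on a non-zero exit."""
    argv, env = build_sync_ldap(tenant, postgres_url, master_key)
    return subprocess.run(argv, env=env, check=True)
```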

There is no sync-posix command. POSIX name resolution and declared mappings run only inside the workers' automatic refresh, so there is nothing to invoke by hand: the next refresh cycle picks them up.

Automatic refresh¶

Once config has been stored with set-graph / set-ldap, the workers keep the alias graph fresh on their own, running a directory-sync refresh roughly every 10 minutes. Each refresh runs:

declared mappings + POSIX name resolution for every tenant — always.
Entra Graph and AD/LDAP discovery for every tenant that has stored config — but only when the worker has KDBL_MASTER_KEY set (it needs the key to decrypt the stored secrets).

If KDBL_MASTER_KEY is absent from the worker, declared and POSIX edges still refresh, but Graph and LDAP do not. The refresh is best-effort and idempotent: a failure on one tenant logs and retries next cycle without stalling the others.
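The gating and best-effort behaviour described above can be summarized in a short sketch. Everything here is illustrative: `refresh_once`, the `run_step` callback, and the tenant dictionary shape are hypothetical, not the worker's actual code.

```python
import logging

log = logging.getLogger("directory-sync")


def refresh_once(tenants, run_step, master_key):
    """Run one refresh cycle; return the slugs of tenants that failed."""
    failed = []
    for tenant in tenants:
        try:
            # Declared mappings and POSIX names always run.
            run_step("declared", tenant)
            run_step("posix", tenant)
            # Graph and LDAP need the master key to decrypt stored secrets.
            if master_key:
                if tenant.get("graph"):
                    run_step("graph", tenant)
                if tenant.get("ldap"):
                    run_step("ldap", tenant)
        except Exception:
            # Best-effort: log, move on, and let the next cycle retry.
            log.exception("directory sync failed for %s", tenant["slug"])
            failed.append(tenant["slug"])
    return failed
```

A caller would invoke `refresh_once` on a timer. Because each step is assumed idempotent, re-running a tenant that failed halfway through is harmless.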

Managing directory enrichment (CLI / API / UI)¶

A candid view of the management surface:

Surface	Coverage	Detail
CLI	Full	`sync-graph`, `set-graph`, `set-ldap`, `sync-ldap` — all in direct-DB mode (`--postgres-url` + `KDBL_MASTER_KEY`). The only path that can store an encrypted secret or run a one-shot sync.
API	Non-secret config	`GET`/`PATCH /api/tenants/:slug/directory` sets the non-secret `graph` / `ldap` / `principal_mappings` blocks (cluster-admin). The encrypted secrets and one-shot syncs stay CLI-only — they need `KDBL_MASTER_KEY`, which the API process does not hold.
UI	Non-secret config	The tenant detail page's Directory correlation card (cluster-admin) edits the graph / ldap / declared-mapping blocks and shows a per-block badge for whether the encrypted secret is stored yet. Secrets stay CLI-only.

In practice the two surfaces compose: set the non-secret config over the API (or in the CLI), then store the credential once with set-graph / set-ldap. The API response reports has_graph_secret / has_ldap_secret so you can tell whether a block is still waiting on its secret.
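A script could use those flags to report which blocks still need their secret. This sketch calls the GET endpoint documented on this page with Python's standard library; `fetch_directory` and `pending_secrets` are hypothetical helper names, and a block is assumed to be "waiting" when it is non-empty and its flag is false.

```python
import json
import urllib.request


def fetch_directory(base_url, slug, token):
    """GET /api/tenants/:slug/directory and return the parsed JSON body."""
    request = urllib.request.Request(
        f"{base_url}/api/tenants/{slug}/directory",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def pending_secrets(directory):
    """List the blocks that are configured but have no stored secret yet."""
    pending = []
    for block, flag in (("graph", "has_graph_secret"), ("ldap", "has_ldap_secret")):
        if directory.get(block) and not directory.get(flag, False):
            pending.append(block)
    return pending
```

For each name this returns, the remaining step is the matching CLI command: set-graph for `graph`, set-ldap for `ldap`.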

Setting non-secret config over the API¶

PATCH /api/tenants/:slug/directory merges only the blocks you send into oidc_config — it leaves issuer, audience, mcp_audience, and any sibling block untouched (unlike PATCH /api/tenants/:slug, which replaces oidc_config wholesale). Send any of graph, ldap, principal_mappings:

curl -X PATCH -H "Authorization: Bearer $KDBL_TOKEN" \
     -H "Content-Type: application/json" \
     "$KDBL_URL/api/tenants/<tenant-slug>/directory" \
     -d '{
       "graph": { "entra_tenant_id": "<entra-tenant>", "client_id": "<app-client-id>" },
       "ldap":  { "url": "ldaps://dc.demo.example:636", "bind_user": "DEMO\\svc-kdbl",
                  "base_dn": "DC=demo,DC=example", "name_scope": "DEMO",
                  "upn_rewrite": { "demo.example": "corp.example" } },
       "principal_mappings": [
         { "from": "oid:<issuer>:<group-guid>", "to": "sid::S-1-5-21-…", "confidence": "high" }
       ]
     }'

The response returns the stored blocks plus the secret-presence flags:

{ "graph": {…}, "ldap": {…}, "principal_mappings": [...],
  "has_graph_secret": false, "has_ldap_secret": false }

has_graph_secret: false here means the Graph config is in place but inert until you run set-graph to store the encrypted client secret. Read the current config back any time with GET /api/tenants/:slug/directory.
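The difference between the two PATCH endpoints' merge behaviour can be pictured as follows. This is a sketch of the semantics described above, not the server's implementation: `patch_directory` and `patch_tenant` are hypothetical, and each sent block is assumed to replace the stored block of the same name as a whole.

```python
DIRECTORY_BLOCKS = ("graph", "ldap", "principal_mappings")


def patch_directory(oidc_config, body):
    """Model of PATCH .../directory: overwrite only the blocks that were sent."""
    merged = dict(oidc_config)  # issuer, audience, etc. are carried over
    for block in DIRECTORY_BLOCKS:
        if block in body:
            merged[block] = body[block]
    return merged


def patch_tenant(oidc_config, new_oidc_config):
    """Model of PATCH /api/tenants/:slug: oidc_config is replaced wholesale."""
    return dict(new_oidc_config)
```

Sending only `{"ldap": {...}}` to the first function leaves an existing `graph` block and the `issuer` in place, whereas the second drops anything not included in the new value.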