KDBL Context Lake (K-Lake) MCP Server — Skills Reference

The K-Lake knowledge platform exposes its crawled-and-extracted content to LLM agents over the Model Context Protocol (MCP). This is the capability ("skills") reference for that server: the tools it offers, what each returns, the security model, and the usage patterns that get good answers.

Every call is security-trimmed per tenant, source, and file (enforced by the datastore's row-level security) and audited. The model only ever sees content the calling principal is authorised to see.


Connection

Endpoint: POST /mcp on kdbl-api (in-cluster: http://kdbl-api.kdbl.svc.cluster.local/mcp)
Transport: Streamable HTTP, JSON-RPC 2.0. POST-only; GET /mcp returns 405 (no SSE channel)
Protocol revision: 2025-11-25
Auth: Authorization: Bearer <token>, where the token is a tenant PAT (kdblpat_…, offline hash check, no IdP) or an OIDC access token. Tenant-scoped: a cluster-admin token is rejected (MCP tools are tenant-scoped).
Discovery: Protected-Resource Metadata at /.well-known/oauth-protected-resource
Enable: KDBL_MCP_ENABLED=true on kdbl-api; scopes kdbl.search, kdbl.read
Smoke test: kdbl-control --api-url … --api-token <PAT> mcp smoke (initialize → tools/list → tools/call)

Note: the ingress only proxies /api/, so in-cluster MCP clients must hit kdbl-api.kdbl.svc.cluster.local/mcp directly, not the public front door.
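
The three-step handshake that the smoke test walks through can be sketched from a client's point of view. This is a minimal illustration using only the Python standard library, not the kdbl-control implementation: the helper names (jsonrpc_request, build_post, smoke_messages), the client name, and the example token are assumptions made up for this sketch. It only builds the requests; sending them (for example with urllib.request.urlopen) is left to the caller.

```python
import json
import urllib.request
from typing import Optional

# In-cluster URL and protocol revision as listed in the Connection section.
MCP_URL = "http://kdbl-api.kdbl.svc.cluster.local/mcp"
PROTOCOL_REVISION = "2025-11-25"


def jsonrpc_request(request_id: int, method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return message


def build_post(url: str, token: str, message: dict) -> urllib.request.Request:
    """Wrap one JSON-RPC message in an authenticated POST (the server is POST-only)."""
    return urllib.request.Request(
        url,
        data=json.dumps(message).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            # Streamable HTTP clients advertise both response types.
            "Accept": "application/json, text/event-stream",
        },
    )


def smoke_messages(query: str) -> list:
    """The initialize -> tools/list -> tools/call sequence, as plain dicts.

    A full MCP client also sends a notifications/initialized notification
    after initialize; it is omitted here to mirror the three smoke steps.
    """
    return [
        jsonrpc_request(1, "initialize", {
            "protocolVersion": PROTOCOL_REVISION,
            "capabilities": {},
            # Hypothetical client identity, for illustration only.
            "clientInfo": {"name": "example-client", "version": "0.0.1"},
        }),
        jsonrpc_request(2, "tools/list"),
        jsonrpc_request(3, "tools/call", {
            "name": "search_content",
            "arguments": {"query": query},
        }),
    ]
```

Each message would be posted in order with build_post(MCP_URL, token, message), where token is a tenant PAT or OIDC access token.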

Result shape

Each tools/call returns both a back-compat content array (a JSON text block, plus any citations/resources) and a validated structuredContent object. Search hits carry resource_link citations; file reads can carry an embedded resource. Files are addressed by a stable URI:

kdbl://file/<url-encoded source_id>/<url-encoded key>

These URIs are resolvable via MCP resources/read (returns the same body as get_file_text), and resources/list enumerates them.
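
A client that wants to turn a (source_id, key) pair into a citation, or go back from a URI to tool arguments, could do so along these lines. The sketch assumes standard percent-encoding with every reserved character encoded, including "/" inside the key; the exact encoding the server emits is not specified here, and the source_id and key in the usage note are invented.

```python
from urllib.parse import quote, unquote

URI_PREFIX = "kdbl://file/"


def file_uri(source_id: str, key: str) -> str:
    """Build kdbl://file/<url-encoded source_id>/<url-encoded key>."""
    # safe="" also encodes "/" so the key stays a single path segment
    # (an assumption about the server's encoding).
    return f"{URI_PREFIX}{quote(source_id, safe='')}/{quote(key, safe='')}"


def parse_file_uri(uri: str) -> tuple:
    """Split a kdbl://file/ URI back into (source_id, key)."""
    if not uri.startswith(URI_PREFIX):
        raise ValueError(f"not a kdbl file URI: {uri!r}")
    # Split on the first "/" only, so a key with unencoded slashes still parses.
    encoded_source, sep, encoded_key = uri[len(URI_PREFIX):].partition("/")
    if not sep or not encoded_source or not encoded_key:
        raise ValueError(f"missing source_id or key: {uri!r}")
    return unquote(encoded_source), unquote(encoded_key)
```

For example, file_uri("s3-finance", "reports/2024/Q1 results.pdf") yields kdbl://file/s3-finance/reports%2F2024%2FQ1%20results.pdf, and parse_file_uri reverses it.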


Skills (tools)

Six tools: one for finding content, three for reading it, two for browsing the catalogue.

🔍 search_content — search (start here)

Search over extracted content the caller may see. Hybrid by default (semantic + keyword search, fused and reranked); set mode:"lexical" for keyword-only search.

Input: query (required), source_id (optional — restrict to one source), limit (default 20, max 100), mode ("hybrid" default | "lexical"), path / content_type / modified_after / modified_before (optional filters).
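
As a rough illustration of these inputs, a client could assemble the arguments object for a tools/call like this. The helper name search_arguments is hypothetical, the filter values are passed through unchanged because their formats are not specified here, and clamping the limit on the client side is a design choice of this sketch rather than documented server behaviour.

```python
from typing import Optional

DEFAULT_LIMIT = 20
MAX_LIMIT = 100
MODES = ("hybrid", "lexical")
FILTERS = ("path", "content_type", "modified_after", "modified_before")


def search_arguments(query: str, *, source_id: Optional[str] = None,
                     limit: int = DEFAULT_LIMIT, mode: str = "hybrid",
                     **filters: str) -> dict:
    """Build the `arguments` object for a search_content call."""
    if not query.strip():
        raise ValueError("query is required")
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    unknown = set(filters) - set(FILTERS)
    if unknown:
        raise ValueError(f"unknown filters: {sorted(unknown)}")
    # Clamp to the documented range instead of letting the server reject it.
    arguments = {"query": query, "limit": max(1, min(limit, MAX_LIMIT)), "mode": mode}
    if source_id is not None:
        arguments["source_id"] = source_id
    # Only send filters the caller actually set.
    arguments.update({name: value for name, value in filters.items() if value is not None})
    return arguments
```

The result is what goes under params.arguments in the JSON-RPC request, next to name: "search_content".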

Output: { hits: [ … ] }. Each hit carries:

  • source_id, key, seq: the citation + chunk locator
  • text: the matched chunk and its neighbouring chunks (a small window, expanded server-side, bounded at roughly 1800 chars). Answer from this.
  • truncated: true if the window was longer than returned (read on with get_file_window)
  • snippet: a <mark>-highlighted fragment (a locator, not the full text)
  • page_no / char_start / char_end / ts_start_ms / ts_end_ms: in-document locators (documents set page/char; audio/video set ts)
  • url: a clickable, short-lived signed link to the ORIGINAL file (when downloads are enabled). Cite this so the user can open the source and verify the grounding; null when the feature is off.
  • rank: relevance score
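
One way a client might turn a hit into a human-readable citation is sketched below. It prefers the signed url when present and otherwise falls back to the kdbl://file/ URI, then appends whichever locator the hit carries. The function names and the output format are assumptions for illustration; the field names come from the hit description above.

```python
from urllib.parse import quote


def describe_location(hit: dict) -> str:
    """Pick the most specific locator present on a hit."""
    if hit.get("page_no") is not None:
        return f"p. {hit['page_no']}"
    if hit.get("ts_start_ms") is not None:
        def clock(ms: int) -> str:
            minutes, seconds = divmod(ms // 1000, 60)
            return f"{minutes}:{seconds:02d}"
        end = hit.get("ts_end_ms")
        start_text = clock(hit["ts_start_ms"])
        return f"{start_text}-{clock(end)}" if end is not None else start_text
    if hit.get("char_start") is not None:
        return f"chars {hit['char_start']}-{hit.get('char_end')}"
    return f"chunk {hit['seq']}"


def cite(hit: dict) -> str:
    """Signed url when downloads are enabled, else the stable kdbl:// URI."""
    # Percent-encoding with safe="" is an assumption about the URI encoding.
    fallback = f"kdbl://file/{quote(hit['source_id'], safe='')}/{quote(hit['key'], safe='')}"
    link = hit.get("url") or fallback
    return f"{link} ({describe_location(hit)})"
```

A hit with truncated set to true is a signal to read on with get_file_window before answering, independent of how it is cited.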

Query semantics: the keyword arm is OR-over-terms weighted by term rarity — a chunk holding the rare decisive term out-ranks one dense in a common word, with no manual phrasing tricks. Lead with the most distinctive terms (a proper noun / rare word / specific figure), not a full sentence. There is no AND/coverage dance to manage: one query covers it, and a question phrased differently from the corpus still recalls on the words it shares. (Snippet highlighting still marks any query term.)

Modes. hybrid (default) adds a semantic (vector) arm — recall for paraphrase / no-shared-lexeme queries — fused with the keyword arm and reranked. lexical is keyword-only (no embeddings/rerank): faster, and the right choice when embeddings are disabled for the tenant/source, or to drive a multi-query agentic search (see below). Hybrid is a per-tenant / per-source toggle; if it's off (or no embedder is deployed) a mode:"hybrid" request transparently degrades to lexical.

📄 get_file_text — whole document

Retrieve a file's full extracted text (as ordered chunks) plus its extraction status. Input: source_id, key. Output: { chunks: [{seq, text, page_no, …}], status } (+ embedded resource). Bounded at 5000 chunks (truncated flag); use get_file_window for very large files.

🪟 get_file_window — bounded sliding read (the read-through tool)

Read a bounded window of a file's chunks — a cursor over the document. Input: source_id, key, start (0-based chunk index, default 0), length (default 40, max 200). Output: { chunks, total_chunks, next_cursor, window: {returned}, has_more, text }.

Use it to (a) read a large document in pieces, or (b) pull more context around a search_content hit — set start ≈ the hit's seq minus a few, then slide forward with next_cursor.
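
Both uses can be sketched as a small cursor loop. The sketch assumes a hypothetical call_tool(name, arguments) helper that returns the tool's structuredContent as a dict, and assumes next_cursor can be passed back as the next start value; stub_call_tool is an in-memory stand-in so the loop can be exercised without a server.

```python
from typing import Callable, Iterator

MAX_WINDOW = 200  # documented per-call cap on length
CallTool = Callable[[str, dict], dict]  # hypothetical transport helper


def window_start_for_hit(seq: int, lead: int = 3) -> int:
    """Start a few chunks before a search hit, never below chunk 0."""
    return max(0, seq - lead)


def read_through(call_tool: CallTool, source_id: str, key: str,
                 start: int = 0, length: int = 40,
                 max_windows: int = 50) -> Iterator[dict]:
    """Yield chunks window by window until has_more is false."""
    cursor = start
    for _ in range(max_windows):  # hard stop so a bad cursor cannot loop forever
        result = call_tool("get_file_window", {
            "source_id": source_id,
            "key": key,
            "start": cursor,
            "length": min(length, MAX_WINDOW),
        })
        yield from result["chunks"]
        if not result.get("has_more"):
            return
        cursor = result["next_cursor"]


def stub_call_tool(name: str, args: dict) -> dict:
    """In-memory stand-in for the server: a 100-chunk document."""
    chunks = [{"seq": i, "text": f"chunk {i}"} for i in range(100)]
    page = chunks[args["start"]:args["start"] + args["length"]]
    end = args["start"] + len(page)
    more = end < len(chunks)
    return {"chunks": page, "total_chunks": len(chunks),
            "has_more": more, "next_cursor": end if more else None}
```

Because read_through is a generator, a caller can stop consuming as soon as the answer is found and no further windows are fetched.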

🏷️ get_file_metadata — file facts

Metadata for one file. Input: source_id, key. Output: size, mtime, etag, storage class, owner, content type, and POSIX mode/uid/gid where captured.

📚 list_sources — browse sources

Sources the caller can access, paginated. Input: cursor, limit (default 50, max 200). Output: { sources: [{source_id, protocol, enabled, last_indexed_at}], next_cursor? }.

🗂️ list_files — browse a source

Files within a source, paginated by key. Input: source_id (required), cursor, limit (default 50, max 200). Output: { files: [...], next_cursor? }.
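
Both browse tools page the same way, so one generator can serve list_sources and list_files. This is a sketch under assumptions: call_tool(name, arguments) is a hypothetical helper returning structuredContent, the cursor is treated as an opaque value, and a missing next_cursor is taken to mean the last page. The stub server and its string cursors are invented for the example.

```python
from typing import Callable, Iterator, Optional

MAX_PAGE = 200  # documented cap for list_sources and list_files
CallTool = Callable[[str, dict], dict]  # hypothetical transport helper


def paginate(call_tool: CallTool, tool: str, items_field: str,
             arguments: Optional[dict] = None, page_size: int = 50) -> Iterator[dict]:
    """Yield every item from a cursor-paginated tool."""
    args = dict(arguments or {})
    args["limit"] = min(page_size, MAX_PAGE)
    while True:
        result = call_tool(tool, args)
        yield from result.get(items_field, [])
        cursor = result.get("next_cursor")
        if cursor is None:  # absent next_cursor: no further pages
            return
        args = {**args, "cursor": cursor}


def stub_call_tool(name: str, args: dict) -> dict:
    """In-memory stand-in for list_sources with five sources."""
    sources = [{"source_id": f"src-{i}"} for i in range(5)]
    offset = int(args.get("cursor", 0))
    page = sources[offset:offset + args["limit"]]
    result = {"sources": page}
    if offset + len(page) < len(sources):
        result["next_cursor"] = str(offset + len(page))
    return result
```

Listing files would look like paginate(call_tool, "list_files", "files", {"source_id": "..."}), with the real source_id filled in.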


Usage pattern: locate, read, answer

  1. Locate with search_content. Lead with the most distinctive terms. Don't stop after one query — if the top hits don't contain the answer, reformulate with the document's own wording (exact line-item names, section/statement titles, proper nouns, a specific figure or date) and search again; narrow with path / content_type / modified_*; or page via has_more.
  2. Read from the hit's text (it already includes neighbouring chunks). If text is truncated, or a hit looks relevant but the answer continues beyond the window, call get_file_window around the hit's seq.
  3. Answer, grounded, citing the kdbl://file/… source. If nothing answers the question after reading, say so — don't fall back to prior knowledge.

This loop is what a chat client's system prompt should enforce; see the Air-gapped AI demo for a ready-made prompt.
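
The locate, read, answer steps can also be expressed as client-side control flow. The sketch below is an illustration, not the demo's prompt: call_tool is a hypothetical helper returning structuredContent, looks_relevant stands in for the model's judgement of whether a passage answers the question, the list of queries stands in for the model's reformulations, and the window's text field is assumed to hold the window's text. The stub server and its data are made up.

```python
from typing import Callable, Iterable, Optional

CallTool = Callable[[str, dict], dict]  # hypothetical transport helper


def locate_and_read(call_tool: CallTool, queries: Iterable[str],
                    looks_relevant: Callable[[str], bool],
                    lead: int = 3) -> Optional[dict]:
    """Try each query in turn; return grounded evidence, or None if nothing answers."""
    for query in queries:
        hits = call_tool("search_content", {"query": query}).get("hits", [])
        for hit in hits:
            text = hit.get("text", "")
            if hit.get("truncated"):
                # Step 2: the hit's window was cut off, so read around its seq.
                window = call_tool("get_file_window", {
                    "source_id": hit["source_id"],
                    "key": hit["key"],
                    "start": max(0, hit["seq"] - lead),
                })
                text = window.get("text", text)
            if looks_relevant(text):
                return {"query": query, "source_id": hit["source_id"],
                        "key": hit["key"], "seq": hit["seq"], "text": text}
    # Step 3: nothing answered the question; the caller should say so.
    return None


def stub_call_tool(name: str, args: dict) -> dict:
    """Stand-in server: only the corpus's own wording finds the right file."""
    if name == "search_content":
        if args["query"] == "Total net sales":
            return {"hits": [{"source_id": "filings", "key": "10-K.pdf", "seq": 40,
                              "text": "Total net sales", "truncated": True}]}
        return {"hits": [{"source_id": "filings", "key": "press.pdf", "seq": 2,
                          "text": "Revenue grew.", "truncated": False}]}
    return {"text": "Total net sales: 123 (stub data)"}
```

Returning None rather than a best guess keeps the "say so" rule of step 3 explicit in the control flow.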

Agentic search beats single-shot — especially in lexical mode

The reformulate-and-retry loop is not a fallback; it is the way to get high recall. On a public financial-filings benchmark, single-shot lexical retrieval found the right document in the top-10 only ~40–46% of the time, vs ~60–75% for hybrid. But an agent that issues a few lexical queries — reformulating with each statement's printed wording — found the right document 75–95% of the time (scaling with model quality), matching or beating single-shot hybrid with no embeddings at all. The takeaway for client/system-prompt authors: instruct the model to treat the first result set as a clue, not an answer, and to re-query in the corpus's vocabulary before giving up. lexical mode is a first-class agentic retrieval surface — cheaper than hybrid and fully air-gappable (no embedder, no GPU).


Source links

Each search_content hit can include a url — a clickable, short-lived HS256-signed link to the original source file (the PDF/etc.), so a user can open it and verify the grounding. K-Lake doesn't store originals; the link re-fetches on demand:

GET /api/files/download?t=<signed token>   (add &dl=1 to force download vs preview)

The token (≈15 min) carries the principal it was minted for; the endpoint re-checks RLS before streaming (404 if the principal can no longer see the file). The byte fetch is delegated to the component that holds the source connectors and credentials; the API itself never holds them. Every download is audited (tool = files/download). Enabled when KDBL_DOWNLOAD_SIGNING_SECRET, KDBL_API_PUBLIC_URL, and KDBL_INTERNAL_FETCH_TOKEN are set; otherwise url is null and the route 404s.
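
A client that wants a download link rather than a preview link can add the dl=1 parameter to the url it received on a hit. A minimal sketch, assuming the url is an ordinary HTTP URL with the token in the t query parameter; the function name and the example host are invented, and the query string is re-encoded in the process, which leaves URL-safe tokens unchanged.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def force_download(url: str) -> str:
    """Return the same signed link with dl=1 set exactly once."""
    parts = urlsplit(url)
    # Keep every existing parameter (notably the signed token t), drop any old dl.
    query = [(name, value)
             for name, value in parse_qsl(parts.query, keep_blank_values=True)
             if name != "dl"]
    query.append(("dl", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The token itself is never inspected or altered; since it expires after roughly 15 minutes, links should be generated close to when the user needs them.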


Security & audit

  • Per-file trimming — every query runs in a row-level-security-scoped transaction that cannot bypass the policies, so tenant + source-ACL + per-file-grant visibility is enforced on top of the explicit tenant filter. There is no way for a tool to return content the principal can't see.
  • Audit — every call writes an audit row (principal, tool, arguments, sources, keys returned, row count, status, client IP, timestamp). Review with kdbl-control … mcp audit --tool search_content.
  • Offline auth — PATs verify offline against a stored hash with no IdP/JWKS fetch, so the server is fully usable air-gapped.

Retrieval characteristics & limits

  • Keyword search: ranking weights terms by rarity, so a rare decisive term out-ranks a common one — no AND/coverage tuning needed.
  • Hybrid (default) adds semantics. The semantic arm is an approximate nearest-neighbour search over chunk embeddings, fused with the keyword arm, diversified, and reranked. This catches paraphrase / no-shared-lexeme queries that pure keyword search misses. Hybrid is a per-tenant / per-source toggle and requires a deployed embedder; without one, or when disabled, retrieval is lexical-only and mode:"hybrid" degrades transparently. Cost/latency of each mode is reported by GET /capacity.
  • Keyword mode matches where the words appear. In mode:"lexical", a question phrased in words that never occur in the answer text won't surface it on the first try — but reformulating with the corpus's own vocabulary across a few queries recovers it (see "Agentic search" above). Or use list_files + get_file_window to read a known document directly.
  • Bounded by design. The keyword candidate scan returns a true global top-N (cap ≫ limit) before RLS/filters; get_file_text caps at 5000 chunks; get_file_window caps at 200 per call — all to keep latency and payloads bounded at scale.

Documented clients

Cloud and local MCP clients are covered in Connecting AI clients; a fully air-gapped self-hosted-LLM + chat-interface setup driving these tools is in the Air-gapped AI demo.