Air-Gapped AI with MCP — Demonstration¶

Show a self-hosted LLM answering questions grounded in the KDBL Context Lake (K-Lake) knowledge base over MCP — with zero internet. The model never leaves your network, every answer is retrieved from security-trimmed, audited content, and you can prove it by denying internet egress (or pulling the WAN cable) and watching it keep working while a cloud assistant goes dark.

This works because K-Lake is offline by design: the MCP server (/mcp), PAT auth (no IdP, no internet), and offline content extraction with locally-cached models. The demo adds three things: a self-hosted LLM, a local chat interface that drives K-Lake's MCP tools, and an egress-deny proof.

Architecture¶

   (air-gapped: egress denied / WAN cable pulled)

   Local chat interface ──model──► Self-hosted LLM (GPU-accelerated, on-prem)
       │
       └─tools─► MCP bridge ──MCP(PAT)──► K-Lake /mcp ──per-file trimming + audit──► search index
                                                                                  (kdbl://file/… citations)

Question → the chat interface → the LLM emits a tool call → an MCP bridge → K-Lake's /mcp (per-file trimmed, audited) → grounded answer with kdbl://file/… citations.

Components¶

Component	What
Self-hosted LLM	An open-weights model served locally on a GPU; exposes an OpenAI-compatible API with tool-calling.
MCP bridge	Re-exposes K-Lake's MCP tools to the chat interface.
Local chat interface	The chat surface; configured with the local model and the K-Lake tools.
The K-Lake MCP server (`/mcp`)	Already present — enable via config (below).
Egress-deny policy	The air-gap proof, applied at demo time.

Prerequisites (one-time, while still online)¶

Provision GPU capacity for the self-hosted LLM. A dedicated GPU gives the model room to run with a context window ample for retrieval.
Enable MCP on the K-Lake API and restart it. Set the master switch, the resource URI, and the scopes (kdbl.search,kdbl.read). See Enabling the server for the full env table.
Issue a tenant PAT for the demo tenant and configure the MCP bridge with it (keep the token out of source control by holding it in a secret).
Stage the LLM so its weights are cached locally; for a true air-gap, configure the serving runtime to run fully offline once the weights are in place.
(Optional) Enable clickable citations — let users open the original file from a citation to verify grounding. Once download links are configured, search_content hits carry a signed url; clicking it streams the original (re-fetched on demand, RLS re-checked, audited). See the MCP skills reference.

Deploy¶

Deploy the self-hosted LLM, the MCP bridge, and the local chat interface into your cluster. Then validate the LLM serves and tool-calls, and confirm K-Lake's MCP tools are reachable:

kdbl-control --api-url … --api-token <PAT> mcp smoke   # -> tools listed

Run the demo (chat interface)¶

Open the local chat interface in your browser.
Pick the self-hosted model. Native tool/function calling is enabled, the K-Lake knowledge tools are bound, and a system prompt drives a locate → read → answer loop that hard-grounds the model — stopping the two failure modes (answering from a thin snippet, and leaking training knowledge). See the Chat-UI QuickStart for how the setup works.
Ask, e.g. "Search the knowledge base for invoices — who's the vendor and what's the contact email? Cite the file." The model calls search_content, reads the matched chunk's text (and get_file_window if it needs more), and answers grounded, citing each fact with the hit's signed url — a clickable link to the original file (RLS re-checked + audited on click).

The air-gapped proof¶

Deny internet egress (or physically pull the WAN uplink). Apply a policy that denies public/WAN egress while allowing DNS, in-cluster, and the on-prem LAN (so the sources + platform keep working). Confirm a public host is unreachable from inside the cluster.
Ask again in the chat interface → it still answers, grounded with citations. Nothing left the network.
Show provenance: kdbl-control --api-token <PAT> mcp audit --tool search_content prints the exact row the query produced (PAT principal, sources, row_count, time).
Contrast: with the WAN cut, a cloud MCP client (e.g. Claude Desktop, see Connecting AI clients) pointed at the endpoint fails — it can't reach in, and its model is cloud-hosted. The self-hosted LLM + K-Lake keeps answering.
Restore: remove the egress policy; connectivity returns within a few seconds.

Troubleshooting¶

Tool calls show as plain text instead of being parsed → the serving runtime's tool-call parser doesn't match the model; select the parser the model expects.
The LLM fails to start with an out-of-memory / KV-cache error → lower the context window or raise the share of GPU memory allotted to the model.
The bridge / chat interface can't reach /mcp → confirm MCP is enabled on the K-Lake API and that in-cluster clients target the API service /mcp directly (the public ingress only proxies /api/).
The LLM won't restart under air-gap → put the serving runtime in offline mode (the weights are cached locally); otherwise it tries to reach the internet and the egress policy blocks it.

Scale-up note¶

For a bigger model, nodes with large unified memory can host a larger mixture-of-experts model — at the cost of a matching serving image for that hardware. The single-GPU path is the demo default.