Air-Gapped AI with MCP — Demonstration¶
Show a self-hosted LLM answering questions grounded in the KDBL Context Lake (K-Lake) knowledge base over MCP — with zero internet. The model never leaves your network, every answer is retrieved from security-trimmed, audited content, and you can prove it by denying internet egress (or pulling the WAN cable) and watching it keep working while a cloud assistant goes dark.
This works because K-Lake is offline by design: the MCP server (/mcp), PAT
auth (no IdP, no internet), and offline content extraction with locally-cached
models. The demo adds three things: a self-hosted LLM, a local chat
interface that drives K-Lake's MCP tools, and an egress-deny proof.
Architecture¶
(air-gapped: egress denied / WAN cable pulled)
Local chat interface ──model──► Self-hosted LLM (GPU-accelerated, on-prem)
│
└─tools─► MCP bridge ──MCP(PAT)──► K-Lake /mcp ──per-file trimming + audit──► search index
(kdbl://file/… citations)
Question → the chat interface → the LLM emits a tool call → an MCP bridge →
K-Lake's /mcp (per-file trimmed, audited) → grounded answer with
kdbl://file/… citations.
Components¶
| Component | What |
|---|---|
| Self-hosted LLM | An open-weights model served locally on a GPU; exposes an OpenAI-compatible API with tool-calling. |
| MCP bridge | Re-exposes K-Lake's MCP tools to the chat interface. |
| Local chat interface | The chat surface; configured with the local model and the K-Lake tools. |
The K-Lake MCP server (/mcp) |
Already present — enable via config (below). |
| Egress-deny policy | The air-gap proof, applied at demo time. |
Prerequisites (one-time, while still online)¶
- Provision GPU capacity for the self-hosted LLM. A dedicated GPU gives the model room to run with a context window ample for retrieval.
- Enable MCP on the K-Lake API and restart it. Set the master switch, the
resource URI, and the scopes (
kdbl.search,kdbl.read). See Enabling the server for the full env table. - Issue a tenant PAT for the demo tenant and configure the MCP bridge with it (keep the token out of source control by holding it in a secret).
- Stage the LLM so its weights are cached locally; for a true air-gap, configure the serving runtime to run fully offline once the weights are in place.
- (Optional) Enable clickable citations — let users open the original file
from a citation to verify grounding. Once download links are configured,
search_contenthits carry a signedurl; clicking it streams the original (re-fetched on demand, RLS re-checked, audited). See the MCP skills reference.
Deploy¶
Deploy the self-hosted LLM, the MCP bridge, and the local chat interface into your cluster. Then validate the LLM serves and tool-calls, and confirm K-Lake's MCP tools are reachable:
Run the demo (chat interface)¶
- Open the local chat interface in your browser.
- Pick the self-hosted model. Native tool/function calling is enabled, the K-Lake knowledge tools are bound, and a system prompt drives a locate → read → answer loop that hard-grounds the model — stopping the two failure modes (answering from a thin snippet, and leaking training knowledge). See the Chat-UI QuickStart for how the setup works.
- Ask, e.g. "Search the knowledge base for invoices — who's the vendor and
what's the contact email? Cite the file." The model calls
search_content, reads the matched chunk'stext(andget_file_windowif it needs more), and answers grounded, citing each fact with the hit's signedurl— a clickable link to the original file (RLS re-checked + audited on click).
The air-gapped proof¶
- Deny internet egress (or physically pull the WAN uplink). Apply a policy that denies public/WAN egress while allowing DNS, in-cluster, and the on-prem LAN (so the sources + platform keep working). Confirm a public host is unreachable from inside the cluster.
- Ask again in the chat interface → it still answers, grounded with citations. Nothing left the network.
- Show provenance:
kdbl-control --api-token <PAT> mcp audit --tool search_contentprints the exact row the query produced (PAT principal, sources, row_count, time). - Contrast: with the WAN cut, a cloud MCP client (e.g. Claude Desktop, see Connecting AI clients) pointed at the endpoint fails — it can't reach in, and its model is cloud-hosted. The self-hosted LLM + K-Lake keeps answering.
- Restore: remove the egress policy; connectivity returns within a few seconds.
Troubleshooting¶
- Tool calls show as plain text instead of being parsed → the serving runtime's tool-call parser doesn't match the model; select the parser the model expects.
- The LLM fails to start with an out-of-memory / KV-cache error → lower the context window or raise the share of GPU memory allotted to the model.
- The bridge / chat interface can't reach
/mcp→ confirm MCP is enabled on the K-Lake API and that in-cluster clients target the API service/mcpdirectly (the public ingress only proxies/api/). - The LLM won't restart under air-gap → put the serving runtime in offline mode (the weights are cached locally); otherwise it tries to reach the internet and the egress policy blocks it.
Scale-up note¶
For a bigger model, nodes with large unified memory can host a larger mixture-of-experts model — at the cost of a matching serving image for that hardware. The single-GPU path is the demo default.