Telemetry

KDBL Context Lake (K-Lake) exposes Prometheus-format metrics, structured JSON logs, and standard Kubernetes health endpoints. Together these cover alerting, dashboarding, and debugging of a running deployment.

Health endpoints

Every K-Lake service exposes:

Path	Purpose
`/healthz`	Liveness. 200 if the process is up. Wire to your liveness probe.
`/readyz`	Readiness. 200 only if dependencies (database, mounts) are reachable. Wire to your readiness probe.

Use /readyz for traffic gating; use /healthz only to restart wedged processes.
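As a rough sketch, the two endpoints map onto Kubernetes probes as shown below. The port (8080) and the timing values are assumptions for illustration only; this page does not state which port serves the health endpoints, so substitute the one your deployment uses.

```yaml
# Fragment of a container spec. Port and timings are illustrative assumptions.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080            # hypothetical: the port your service serves health on
  periodSeconds: 10
  failureThreshold: 3     # restart only after repeated failures
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080            # hypothetical, same caveat as above
  periodSeconds: 5
  failureThreshold: 2     # stop routing traffic quickly when dependencies drop
```

Keeping the liveness probe on /healthz means a database outage takes the pod out of rotation without triggering a restart loop.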

Metrics endpoint

Each service exposes Prometheus text-format metrics at /metrics on a dedicated port:

Service	Port	Notes
API	`9100`
Worker	`9200`	crawl / listing
Worker (stats sidecar)	`9201`	statistics rollup
Metadata enrichment	`9102`	optional metadata enrichment
Extractor	`9200`	content extraction (`kdbl_extract_*`)
Extractor (engine)	`9101`	content-extraction engine metrics

All metrics share the kdbl_ prefix.

Wiring it to Prometheus

There is nothing to "turn on" inside K-Lake — every service always exposes /metrics. You just need Prometheus to scrape it. How depends on your setup:

Plain Prometheus (annotation-based discovery): the deployments already carry prometheus.io/scrape: "true" / prometheus.io/port / prometheus.io/path pod annotations, so a Prometheus configured with the standard kubernetes-pods scrape job picks them up automatically — no extra config.
Prometheus Operator / kube-prometheus-stack (the common case): the pod annotations are ignored. Scraping is driven by ServiceMonitor objects, so each component needs one pointing at a Service with a named telemetry port. Ready-made ServiceMonitors for every component (API, workers, metadata enrichment, and the extractor deployments) ship with the deployment manifests — apply them into the same namespace. A minimal one looks like:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kdbl-extractor
  namespace: kdbl
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kdbl-extractor
  endpoints:
    - port: telemetry      # the named Service port, not the number
      path: /metrics
      interval: 10s

If your kube-prometheus-stack restricts ServiceMonitor discovery by label (serviceMonitorSelector), add the release label it expects — e.g. release: <your-stack-release> — to each ServiceMonitor's metadata.labels.
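For illustration, here is roughly how the pieces fit together: a Service that exposes a named telemetry port, and the same ServiceMonitor with the discovery label added. The Service name, its labels, and the pod selector are assumptions; the shipped manifests may differ. The port number comes from the extractor row of the port table above.

```yaml
# Hypothetical Service for the extractor; names and labels are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: kdbl-extractor
  namespace: kdbl
  labels:
    app.kubernetes.io/name: kdbl-extractor   # matched by the ServiceMonitor selector
spec:
  selector:
    app.kubernetes.io/name: kdbl-extractor   # assumed pod label
  ports:
    - name: telemetry      # the name the ServiceMonitor endpoint refers to
      port: 9200
      targetPort: 9200
---
# The ServiceMonitor from above, with the release label added for discovery.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kdbl-extractor
  namespace: kdbl
  labels:
    release: <your-stack-release>   # must match the stack's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kdbl-extractor
  endpoints:
    - port: telemetry
      path: /metrics
      interval: 10s
```

Note that the ServiceMonitor selector matches labels on the Service, not on the pods, so the label has to be present on the Service metadata.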

Key metrics

Throughput

Metric	Type	Labels	What it tells you
`kdbl_files_written_total`	counter	`backend`, `source_id`	Files persisted to the metadata store
`kdbl_files_inserted_total`	counter	`backend`, `source_id`	New files added
`kdbl_files_updated_total`	counter	`backend`, `source_id`	Existing files with changed metadata
`kdbl_files_unchanged_total`	counter	`backend`, `source_id`	Files seen but already up-to-date
`kdbl_bytes_indexed_total`	counter	`backend`, `source_id`	Total bytes covered by indexed files

Queue depth

Metric	Type	Labels	What it tells you
`kdbl_queue_depth`	gauge	`state` (`pending`, `running`, `done`, `failed`)	Current size of each queue partition. Watch `pending` to know when to add workers.
`kdbl_inflight_tasks`	gauge	`worker_id`	Per-worker concurrency utilization
`kdbl_tasks_total`	counter	`protocol`, `outcome`	Tasks completed, broken down by result

Latency

Metric	Type	Labels	What it tells you
`kdbl_list_seconds`	histogram	`protocol`	Time to list one directory / prefix from a source
`kdbl_sink_write_seconds`	histogram	`backend`	Time to persist one batch to the metadata store
`kdbl_meta_fetch_seconds`	histogram	`protocol`	Time to gather optional enrichments per file

Metadata enrichment

Metric	Type	Labels	What it tells you
`kdbl_meta_files_written`	counter	`backend`, `source_id`	Files with enrichment recorded
`kdbl_meta_tasks_total`	counter	`protocol`, `result` (`ok`, `failed`, `skipped`, `parked`)	Enrichment task outcomes
`kdbl_meta_queue_depth`	gauge	`state`	Enrichment queue size

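Content extraction

Emitted by the extractor deployments and scraped via their ServiceMonitors (see "Wiring it to Prometheus"). The compute metrics (kdbl_extract_plugin_seconds, kdbl_extract_pages_total, kdbl_extract_ocr_total) are how you size GPU vs CPU extractor pools.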

Metric	Type	Labels	What it tells you
`kdbl_extract_tasks_total`	counter	`protocol`, `outcome` (`ok`, `failed`, `skipped`, `unchanged`), `plugin_version`	Extraction tasks processed, by result. `unchanged` = the extractor's per-task dedup guard skipped an already-extracted file.
`kdbl_extract_skipped_unchanged_total`	counter	`source_id`	Files dropped from the extract enqueue because they were already extracted at their current version. This is the recrawl-dedup payoff — rises on a recrawl of an unchanged source; near-zero means files are changing (or the dedup isn't matching).
`kdbl_extract_seconds`	histogram	`protocol`	End-to-end extraction latency per file
`kdbl_extract_plugin_seconds`	histogram	`plugin_version`	Plugin-measured compute time per file (the GPU/CPU work itself)
`kdbl_extract_pages_total`	counter	`protocol`, `plugin_version`	Pages/segments the plugin processed
`kdbl_extract_ocr_total`	counter	`plugin_version`	Extractions where the plugin ran OCR (the heavy GPU path)
`kdbl_extract_chunks_written_total`	counter	`protocol`	Searchable content chunks persisted
`kdbl_extract_bytes_total`	counter	`protocol`	Source bytes streamed to the extractor
`kdbl_extract_queue_depth`	gauge	`state` (`pending`, `running`)	Content-extraction queue depth. The extractor HPA scales off `pending`.
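To make the scaling relationship concrete, the sketch below shows one way an HPA could consume the pending queue depth. It is not the HPA that ships with K-Lake. It assumes an external-metrics adapter (for example prometheus-adapter) already exposes kdbl_extract_queue_depth to the Kubernetes metrics API, and the object names, replica bounds, and target value are all hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kdbl-extractor            # hypothetical name
  namespace: kdbl
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kdbl-extractor          # hypothetical Deployment name
  minReplicas: 1                  # illustrative bounds
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: kdbl_extract_queue_depth   # assumes an adapter exposes this metric
          selector:
            matchLabels:
              state: pending
        target:
          type: AverageValue
          averageValue: "20"      # illustrative: pending tasks per replica
```

With an AverageValue target, the HPA divides the pending count by the current replica count, so the target reads as "pending tasks each extractor pod should carry".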

Retention

Metric	Type	Labels	What it tells you
`kdbl_retention_runs_total`	counter	`outcome`	Retention sweeps completed
`kdbl_retention_rows_deleted_total`	counter	`tenant`	Rows removed by retention

Logs

All services log JSON to stdout. Configure the verbosity with the LOG_LEVEL environment variable:

LOG_LEVEL=info       # default
LOG_LEVEL=debug      # verbose, per-task detail
LOG_LEVEL=warn,error # quiet
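In a Kubernetes deployment the variable would typically be set on the container spec. A minimal fragment, assuming a container named worker (a hypothetical name):

```yaml
# Fragment of a pod template; the container name is an assumption.
containers:
  - name: worker            # hypothetical container name
    env:
      - name: LOG_LEVEL
        value: "debug"      # verbose, per-task detail; the default is "info"
```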

Each log line carries (where applicable):

tenant_id — owning tenant
source_id — source the work was for
task_id — work-unit identifier
worker_id — emitting worker pod
dur_ms — elapsed milliseconds, on completion lines
outcome — final result tag

Ship logs into whatever aggregation stack you use (Loki, Elasticsearch, CloudWatch). Filtering by tenant_id or source_id is the fastest way to scope an investigation.

Suggested alerts

Starting points for production alerting:

Alert	Condition
Queue building	`kdbl_queue_depth{state="pending"}` rising for >15 minutes and worker CPU at limit
Failed tasks rising	`rate(kdbl_tasks_total{outcome="failed"}[5m]) > 0.1`
Sink writes slow	`histogram_quantile(0.95, rate(kdbl_sink_write_seconds_bucket[5m])) > 1`
Readiness flapping	`kube_pod_status_ready{condition="false"} == 1` for any K-Lake pod
Source unhealthy	`last_error` field returned by `/api/sources/:id/health` is non-null for >1 hour
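The PromQL-based rows above could be expressed as a PrometheusRule along these lines. This is a sketch under stated assumptions: the rule and alert names, `for` durations, and severity labels are illustrative, the kdbl namespace is taken from the ServiceMonitor example, and using deriv() to detect a "rising" queue is one possible interpretation. The worker CPU condition and the source health check are left out because they depend on metrics and endpoints outside this rule file.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kdbl-alerts               # hypothetical name
  namespace: kdbl
spec:
  groups:
    - name: kdbl.alerts
      rules:
        - alert: KdblQueueBuilding
          # deriv() > 0 is one reading of "rising"; the CPU condition is omitted.
          expr: deriv(kdbl_queue_depth{state="pending"}[15m]) > 0
          for: 15m
          labels:
            severity: warning
        - alert: KdblFailedTasksRising
          expr: rate(kdbl_tasks_total{outcome="failed"}[5m]) > 0.1
          for: 5m                 # illustrative hold time
          labels:
            severity: warning
        - alert: KdblSinkWritesSlow
          # p95 batch write latency above 1 second, per backend.
          expr: histogram_quantile(0.95, sum by (le, backend) (rate(kdbl_sink_write_seconds_bucket[5m]))) > 1
          for: 10m                # illustrative hold time
          labels:
            severity: warning
        - alert: KdblReadinessFlapping
          # "== 1" selects pods whose Ready condition is currently false.
          expr: kube_pod_status_ready{condition="false", namespace="kdbl"} == 1
          for: 5m                 # illustrative hold time
          labels:
            severity: critical
```

If your stack restricts rule discovery by label, the PrometheusRule needs the same kind of release label as the ServiceMonitors.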

Dashboarding

The basic operator dashboard is just three panels:

kdbl_queue_depth per state, stacked
rate(kdbl_files_written_total[1m]) per source_id, top 10
histogram_quantile(0.95, rate(kdbl_sink_write_seconds_bucket[5m])) per backend

Build out from there based on the sources and SLOs that matter to your tenants.
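If the panels get slow on large installations, the three queries can be precomputed as recording rules. The fragment below is a plain Prometheus rule-file sketch; the group name and the recorded series names are hypothetical and follow the common level:metric:operations naming pattern.

```yaml
groups:
  - name: kdbl-dashboard          # hypothetical group name
    rules:
      # Panel 1: queue depth per state (stack these in the panel).
      - record: kdbl:queue_depth:sum_by_state
        expr: sum by (state) (kdbl_queue_depth)
      # Panel 2: files written per second, per source.
      - record: kdbl:files_written:rate1m_by_source
        expr: sum by (source_id) (rate(kdbl_files_written_total[1m]))
      # Panel 3: p95 sink write latency per backend.
      - record: kdbl:sink_write_seconds:p95_by_backend
        expr: histogram_quantile(0.95, sum by (le, backend) (rate(kdbl_sink_write_seconds_bucket[5m])))
```

The top-10 cut for the second panel is best applied in the panel query itself, for example topk(10, kdbl:files_written:rate1m_by_source), so the recorded series stays complete. Under the Prometheus Operator the same groups would go inside a PrometheusRule object.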