
Telemetry

KDBL Context Lake (K-Lake) exposes Prometheus-format metrics, structured JSON logs, and standard Kubernetes health endpoints. Together, these cover alerting, dashboarding, and debugging for a running deployment.

Health endpoints

Every K-Lake service exposes:

| Path | Purpose |
| --- | --- |
| /healthz | Liveness. 200 if the process is up. Wire to your liveness probe. |
| /readyz | Readiness. 200 only if dependencies (database, mounts) are reachable. Wire to your readiness probe. |

Use /readyz for traffic gating; use /healthz only to restart wedged processes.
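As a sketch of that wiring, the container spec fragment below points the two probes at the endpoints above. The port name http and the timing values are assumptions for illustration; substitute the port your K-Lake service actually serves its health endpoints on.

```yaml
# Fragment of a container spec (spec.template.spec.containers[*]).
# Assumption: the health endpoints are served on a container port named "http".
livenessProbe:
  httpGet:
    path: /healthz      # failing this restarts the container
    port: http
  periodSeconds: 10     # illustrative timing, tune for your deployment
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz       # failing this removes the pod from Service endpoints
    port: http
  periodSeconds: 5      # illustrative timing
  failureThreshold: 2
```

Keeping the liveness probe on /healthz means a database or mount outage takes pods out of rotation without triggering restart loops.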

Metrics endpoint

Each service exposes Prometheus text-format metrics at /metrics on a dedicated port:

| Service | Port | Notes |
| --- | --- | --- |
| API | 9100 | |
| Worker | 9200 | crawl / listing |
| Worker (stats sidecar) | 9201 | statistics rollup |
| Metadata enrichment | 9102 | optional metadata enrichment |
| Extractor | 9200 | content extraction (kdbl_extract_*) |
| Extractor (engine) | 9101 | content-extraction engine metrics |

All metrics share the kdbl_ prefix.

Wiring it to Prometheus

There is nothing to "turn on" inside K-Lake — every service always exposes /metrics. You just need Prometheus to scrape it. How depends on your setup:

  • Plain Prometheus (annotation-based discovery): the deployments already carry prometheus.io/scrape: "true" / prometheus.io/port / prometheus.io/path pod annotations, so a Prometheus configured with the standard kubernetes-pods scrape job picks them up automatically — no extra config.

  • Prometheus Operator / kube-prometheus-stack (the common case): the pod annotations are ignored. Scraping is driven by ServiceMonitor objects, so each component needs one pointing at a Service with a named telemetry port. Ready-made ServiceMonitors for every component (API, workers, metadata enrichment, and the extractor deployments) ship with the deployment manifests — apply them into the same namespace. A minimal one looks like:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kdbl-extractor
  namespace: kdbl
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kdbl-extractor
  endpoints:
    - port: telemetry      # the named Service port, not the number
      path: /metrics
      interval: 10s

If your kube-prometheus-stack restricts ServiceMonitor discovery by label (serviceMonitorSelector), add the release label it expects — e.g. release: <your-stack-release> — to each ServiceMonitor's metadata.labels.
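For reference, a Service of the shape the ServiceMonitor above expects might look like the following sketch. The shipped manifests may already define an equivalent object; the pod selector label is an assumption, and the port number is the extractor's metrics port listed earlier.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kdbl-extractor
  namespace: kdbl
  labels:
    app.kubernetes.io/name: kdbl-extractor   # matched by the ServiceMonitor's selector
spec:
  selector:
    app.kubernetes.io/name: kdbl-extractor   # assumption: the pods carry the same label
  ports:
    - name: telemetry      # the name the ServiceMonitor's "port" field refers to
      port: 9200           # extractor metrics port
      targetPort: 9200
```

If Prometheus shows no target for a component, an unnamed or differently named Service port is a common cause, because the ServiceMonitor resolves endpoints by port name.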

Key metrics

Throughput

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| kdbl_files_written_total | counter | backend, source_id | Files persisted to the metadata store |
| kdbl_files_inserted_total | counter | backend, source_id | New files added |
| kdbl_files_updated_total | counter | backend, source_id | Existing files with changed metadata |
| kdbl_files_unchanged_total | counter | backend, source_id | Files seen but already up-to-date |
| kdbl_bytes_indexed_total | counter | backend, source_id | Total bytes covered by indexed files |

Queue depth

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| kdbl_queue_depth | gauge | state (pending, running, done, failed) | Current size of each queue partition. Watch pending to know when to add workers. |
| kdbl_inflight_tasks | gauge | worker_id | Per-worker concurrency utilization |
| kdbl_tasks_total | counter | protocol, outcome | Tasks completed, broken down by result |

Latency

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| kdbl_list_seconds | histogram | protocol | Time to list one directory / prefix from a source |
| kdbl_sink_write_seconds | histogram | backend | Time to persist one batch to the metadata store |
| kdbl_meta_fetch_seconds | histogram | protocol | Time to gather optional enrichments per file |

Metadata enrichment

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| kdbl_meta_files_written | counter | backend, source_id | Files with enrichment recorded |
| kdbl_meta_tasks_total | counter | protocol, result (ok, failed, skipped, parked) | Enrichment task outcomes |
| kdbl_meta_queue_depth | gauge | state | Enrichment queue size |

Content extraction

Emitted by the extractor deployments (scraped via their ServiceMonitors — see above). The compute metrics are how you size GPU vs CPU extractor pools.

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| kdbl_extract_tasks_total | counter | protocol, outcome (ok, failed, skipped, unchanged), plugin_version | Extraction tasks processed, by result. unchanged = the extractor's per-task dedup guard skipped an already-extracted file. |
| kdbl_extract_skipped_unchanged_total | counter | source_id | Files dropped from the extract enqueue because they were already extracted at their current version. This is the recrawl-dedup payoff: it rises on a recrawl of an unchanged source; near-zero means files are changing (or the dedup isn't matching). |
| kdbl_extract_seconds | histogram | protocol | End-to-end extraction latency per file |
| kdbl_extract_plugin_seconds | histogram | plugin_version | Plugin-measured compute time per file (the GPU/CPU work itself) |
| kdbl_extract_pages_total | counter | protocol, plugin_version | Pages/segments the plugin processed |
| kdbl_extract_ocr_total | counter | plugin_version | Extractions where the plugin ran OCR (the heavy GPU path) |
| kdbl_extract_chunks_written_total | counter | protocol | Searchable content chunks persisted |
| kdbl_extract_bytes_total | counter | protocol | Source bytes streamed to the extractor |
| kdbl_extract_queue_depth | gauge | state (pending, running) | Content-extraction queue depth. The extractor HPA scales off pending. |
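The source states only that the extractor HPA scales off the pending queue depth, not how that is wired. One possible shape, assuming an external-metrics adapter (for example prometheus-adapter) already exposes the gauge to the Kubernetes metrics API, is sketched below. The Deployment name, replica bounds, and target value are all hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kdbl-extractor             # hypothetical name
  namespace: kdbl
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kdbl-extractor           # assumption: the extractor Deployment's name
  minReplicas: 1                   # illustrative bounds
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: kdbl_extract_queue_depth   # as exposed by your metrics adapter
          selector:
            matchLabels:
              state: pending
        target:
          type: AverageValue
          averageValue: "50"       # assumption: roughly 50 pending tasks per replica
```

With an AverageValue target, the autoscaler divides the pending depth by the current replica count, so the pool grows until each replica has about the target number of queued tasks.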

Retention

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| kdbl_retention_runs_total | counter | outcome | Retention sweeps completed |
| kdbl_retention_rows_deleted_total | counter | tenant | Rows removed by retention |

Logs

All services log JSON to stdout. Configure the verbosity with the LOG_LEVEL environment variable:

LOG_LEVEL=info       # default
LOG_LEVEL=debug      # verbose, per-task detail
LOG_LEVEL=warn,error # quiet
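In a Kubernetes deployment the variable would typically be set on the container. A minimal fragment, assuming you edit the container spec of the service you are debugging:

```yaml
# Fragment of a container spec; raises one service to per-task debug logging.
env:
  - name: LOG_LEVEL
    value: "debug"
```

Changing the value triggers a rollout of that Deployment, so raise verbosity on one service at a time rather than fleet-wide.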

Each log line carries (where applicable):

  • tenant_id — owning tenant
  • source_id — source the work was for
  • task_id — work-unit identifier
  • worker_id — emitting worker pod
  • dur_ms — elapsed milliseconds, on completion lines
  • outcome — final result tag

Ship logs into whatever aggregation stack you use (Loki, Elasticsearch, CloudWatch). Filtering by tenant_id or source_id is the fastest way to scope an investigation.

Suggested alerts

Starting points for production alerting:

| Alert | Condition |
| --- | --- |
| Queue building | kdbl_queue_depth{state="pending"} rising for >15 minutes and worker CPU at limit |
| Failed tasks rising | rate(kdbl_tasks_total{outcome="failed"}[5m]) > 0.1 |
| Sink writes slow | histogram_quantile(0.95, rate(kdbl_sink_write_seconds_bucket[5m])) > 1 |
| Readiness flapping | kube_pod_status_ready{condition="false"} == 1 for any K-Lake pod |
| Source unhealthy | last_error field returned by /api/sources/:id/health is non-null for >1 hour |
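For Prometheus Operator setups, the metric-based alerts above could be expressed as a PrometheusRule along these lines. This is a sketch: the rule and alert names, the for durations, and the severity labels are assumptions, the queue alert covers only the "pending is rising" half of its condition, and the kdbl namespace is taken from the ServiceMonitor example. The source-health alert is omitted because it is driven by the API rather than a metric.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kdbl-alerts                # hypothetical name
  namespace: kdbl
spec:
  groups:
    - name: kdbl.alerts
      rules:
        - alert: KdblQueueBuilding
          # Pending depth trending upward; combine with your own CPU-limit signal.
          expr: deriv(kdbl_queue_depth{state="pending"}[15m]) > 0
          for: 15m
          labels:
            severity: warning
        - alert: KdblFailedTasksRising
          # More than 0.1 failed tasks per second, per series.
          expr: rate(kdbl_tasks_total{outcome="failed"}[5m]) > 0.1
          for: 5m
          labels:
            severity: warning
        - alert: KdblSinkWritesSlow
          # p95 batch write time above 1 second, per backend.
          expr: histogram_quantile(0.95, sum by (le, backend) (rate(kdbl_sink_write_seconds_bucket[5m]))) > 1
          for: 10m
          labels:
            severity: warning
        - alert: KdblReadinessFlapping
          # Requires kube-state-metrics; fires while a pod reports not ready.
          expr: kube_pod_status_ready{condition="false", namespace="kdbl"} == 1
          for: 5m
          labels:
            severity: critical
```

If your stack selects rules by label (ruleSelector), the object needs the same release label as the ServiceMonitors.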

Dashboarding

The basic operator dashboard is just three panels:

  1. kdbl_queue_depth per state, stacked
  2. rate(kdbl_files_written_total[1m]) per source_id, top 10
  3. histogram_quantile(0.95, rate(kdbl_sink_write_seconds_bucket[5m])) per backend
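If the dashboard is queried often, the three panel expressions above can be precomputed as Prometheus recording rules. The sketch below uses a plain rule file; the group name, evaluation interval, and recorded series names are assumptions. The first rule sums across series, which is only correct if each series reports a distinct partition; use max instead if every pod reports the same global depth.

```yaml
groups:
  - name: kdbl.dashboard           # hypothetical group name
    interval: 30s                  # illustrative evaluation interval
    rules:
      # Panel 1: queue depth per state (stack these in the panel).
      - record: state:kdbl_queue_depth:sum
        expr: sum by (state) (kdbl_queue_depth)
      # Panel 2: write rate per source; apply topk(10, ...) in the panel query.
      - record: source_id:kdbl_files_written:rate1m
        expr: sum by (source_id) (rate(kdbl_files_written_total[1m]))
      # Panel 3: p95 sink write latency per backend.
      - record: backend:kdbl_sink_write_seconds:p95_5m
        expr: histogram_quantile(0.95, sum by (le, backend) (rate(kdbl_sink_write_seconds_bucket[5m])))
```

Leaving topk out of the recorded series keeps the full per-source data available, so the panel can change its cutoff without touching the rules.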

Build out from there based on the sources and SLOs that matter to your tenants.