Telemetry¶
KDBL Context Lake (K-Lake) exposes Prometheus-format metrics, structured JSON logs, and standard Kubernetes health endpoints. Together they cover alerting, dashboarding, and debugging for a running deployment.
Health endpoints¶
Every K-Lake service exposes:
| Path | Purpose |
|---|---|
| /healthz | Liveness. 200 if the process is up. Wire to your liveness probe. |
| /readyz | Readiness. 200 only if dependencies (database, mounts) are reachable. Wire to your readiness probe. |
Use /readyz for traffic gating; use /healthz only to restart wedged processes.
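As a sketch of that wiring, the container-spec fragment below points the liveness probe at /healthz and the readiness probe at /readyz. The port name `http` and all timing values are assumptions for illustration; substitute the port your K-Lake service actually serves these paths on.

```yaml
# Fragment of a container spec (spec.template.spec.containers[]).
# Assumption: the health endpoints are served on a container port named "http".
livenessProbe:
  httpGet:
    path: /healthz
    port: http
  periodSeconds: 10      # illustrative timing
  failureThreshold: 3    # restart after ~30s of consecutive failures
readinessProbe:
  httpGet:
    path: /readyz
    port: http
  periodSeconds: 5       # illustrative timing
  failureThreshold: 2    # stop routing traffic after ~10s of failures
```

With this split, an unreachable database only takes the pod out of rotation; it does not trigger a restart, because /healthz keeps returning 200 while the process is up.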
Metrics endpoint¶
Each service exposes Prometheus text-format metrics at /metrics on a dedicated port:
| Service | Port | Notes |
|---|---|---|
| API | 9100 | |
| Worker | 9200 | crawl / listing |
| Worker (stats sidecar) | 9201 | statistics rollup |
| Metadata enrichment | 9102 | optional metadata enrichment |
| Extractor | 9200 | content extraction (kdbl_extract_*) |
| Extractor (engine) | 9101 | content-extraction engine metrics |
All metrics share the kdbl_ prefix.
Wiring it to Prometheus¶
There is nothing to "turn on" inside K-Lake — every service always exposes
/metrics. You just need Prometheus to scrape it. How depends on your setup:
- Plain Prometheus (annotation-based discovery): the deployments already carry prometheus.io/scrape: "true", prometheus.io/port, and prometheus.io/path pod annotations, so a Prometheus configured with the standard kubernetes-pods scrape job picks them up automatically, with no extra config.
- Prometheus Operator / kube-prometheus-stack (the common case): the pod annotations are ignored. Scraping is driven by ServiceMonitor objects, so each component needs one pointing at a Service with a named telemetry port. Ready-made ServiceMonitors for every component (API, workers, metadata enrichment, and the extractor deployments) ship with the deployment manifests; apply them into the same namespace. A minimal one looks like:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kdbl-extractor
  namespace: kdbl
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kdbl-extractor
  endpoints:
    - port: telemetry   # the named Service port, not the number
      path: /metrics
      interval: 10s
```
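The ServiceMonitor above selects a Service by label and refers to its port by name, so the target Service needs both. The following is a minimal sketch of such a Service for the extractor, only to show what the ServiceMonitor expects. It assumes the extractor pods carry the same app.kubernetes.io/name label, and it uses the extractor metrics port from the port table.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kdbl-extractor
  namespace: kdbl
  labels:
    app.kubernetes.io/name: kdbl-extractor   # matched by the ServiceMonitor selector
spec:
  selector:
    app.kubernetes.io/name: kdbl-extractor   # assumption: label on the extractor pods
  ports:
    - name: telemetry      # referenced by name in the ServiceMonitor endpoint
      port: 9200           # extractor metrics port
      targetPort: 9200
```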
If your kube-prometheus-stack restricts ServiceMonitor discovery by label (serviceMonitorSelector), add the release label it expects, e.g. release: <your-stack-release>, to each ServiceMonitor's metadata.labels.
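For the plain-Prometheus option described earlier in this section, note that the kubernetes-pods job is not built into Prometheus. It is a conventional scrape job that honors the prometheus.io/* pod annotations. A typical version, following the commonly used community pattern and not specific to K-Lake, looks roughly like this:

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use prometheus.io/path as the metrics path when it is set.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use the prometheus.io/port value.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```

If your Prometheus already has an equivalent job, nothing needs to change; the K-Lake pods are picked up through their annotations.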
Key metrics¶
Throughput¶
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
| kdbl_files_written_total | counter | backend, source_id | Files persisted to the metadata store |
| kdbl_files_inserted_total | counter | backend, source_id | New files added |
| kdbl_files_updated_total | counter | backend, source_id | Existing files with changed metadata |
| kdbl_files_unchanged_total | counter | backend, source_id | Files seen but already up-to-date |
| kdbl_bytes_indexed_total | counter | backend, source_id | Total bytes covered by indexed files |
Queue depth¶
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
| kdbl_queue_depth | gauge | state (pending, running, done, failed) | Current size of each queue partition. Watch pending to know when to add workers. |
| kdbl_inflight_tasks | gauge | worker_id | Per-worker concurrency utilization |
| kdbl_tasks_total | counter | protocol, outcome | Tasks completed, broken down by result |
Latency¶
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
| kdbl_list_seconds | histogram | protocol | Time to list one directory / prefix from a source |
| kdbl_sink_write_seconds | histogram | backend | Time to persist one batch to the metadata store |
| kdbl_meta_fetch_seconds | histogram | protocol | Time to gather optional enrichments per file |
Metadata enrichment¶
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
| kdbl_meta_files_written | counter | backend, source_id | Files with enrichment recorded |
| kdbl_meta_tasks_total | counter | protocol, result (ok, failed, skipped, parked) | Enrichment task outcomes |
| kdbl_meta_queue_depth | gauge | state | Enrichment queue size |
Content extraction¶
Emitted by the extractor deployments (scraped via their ServiceMonitors — see above). The compute metrics are how you size GPU vs CPU extractor pools.
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
| kdbl_extract_tasks_total | counter | protocol, outcome (ok, failed, skipped, unchanged), plugin_version | Extraction tasks processed, by result. unchanged = the extractor's per-task dedup guard skipped an already-extracted file. |
| kdbl_extract_skipped_unchanged_total | counter | source_id | Files dropped from the extract enqueue because they were already extracted at their current version. This is the recrawl-dedup payoff: it rises on a recrawl of an unchanged source; near-zero means files are changing (or the dedup isn't matching). |
| kdbl_extract_seconds | histogram | protocol | End-to-end extraction latency per file |
| kdbl_extract_plugin_seconds | histogram | plugin_version | Plugin-measured compute time per file (the GPU/CPU work itself) |
| kdbl_extract_pages_total | counter | protocol, plugin_version | Pages/segments the plugin processed |
| kdbl_extract_ocr_total | counter | plugin_version | Extractions where the plugin ran OCR (the heavy GPU path) |
| kdbl_extract_chunks_written_total | counter | protocol | Searchable content chunks persisted |
| kdbl_extract_bytes_total | counter | protocol | Source bytes streamed to the extractor |
| kdbl_extract_queue_depth | gauge | state (pending, running) | Content-extraction queue depth. The extractor HPA scales off pending. |
Retention¶
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
| kdbl_retention_runs_total | counter | outcome | Retention sweeps completed |
| kdbl_retention_rows_deleted_total | counter | tenant | Rows removed by retention |
Logs¶
All services log JSON to stdout. Configure the verbosity with the LOG_LEVEL environment variable.
Each log line carries (where applicable):
- tenant_id: owning tenant
- source_id: source the work was for
- task_id: work-unit identifier
- worker_id: emitting worker pod
- dur_ms: elapsed milliseconds, on completion lines
- outcome: final result tag
Ship logs into whatever aggregation stack you use (Loki, Elasticsearch, CloudWatch). Filtering by tenant_id or source_id is the fastest way to scope an investigation.
Suggested alerts¶
Starting points for production alerting:
| Alert | Condition |
|---|---|
| Queue building | kdbl_queue_depth{state="pending"} rising for >15 minutes and worker CPU at limit |
| Failed tasks rising | rate(kdbl_tasks_total{outcome="failed"}[5m]) > 0.1 |
| Sink writes slow | histogram_quantile(0.95, rate(kdbl_sink_write_seconds_bucket[5m])) > 1 |
| Readiness flapping | kube_pod_status_ready{condition="false"} for any K-Lake pod |
| Source unhealthy | last_error field returned by /api/sources/:id/health is non-null for >1 hour |
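If you run the Prometheus Operator, the purely metric-based rows in this table can be expressed as a PrometheusRule. The sketch below carries over two of the expressions verbatim. The rule names, `for` durations, and severity labels are assumptions to tune for your environment.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kdbl-alerts          # hypothetical name
  namespace: kdbl
spec:
  groups:
    - name: kdbl.alerts      # hypothetical group name
      rules:
        - alert: KdblFailedTasksRising
          expr: rate(kdbl_tasks_total{outcome="failed"}[5m]) > 0.1
          for: 10m           # assumption: require the condition to persist
          labels:
            severity: warning
          annotations:
            summary: K-Lake tasks are failing at more than 0.1 per second
        - alert: KdblSinkWritesSlow
          expr: histogram_quantile(0.95, rate(kdbl_sink_write_seconds_bucket[5m])) > 1
          for: 10m           # assumption
          labels:
            severity: warning
          annotations:
            summary: p95 metadata-store batch write latency is above 1s
```

Because the first expression is not aggregated, it fires per label set (for example per protocol). Wrap it in `sum(...)` if you would rather alert on the fleet-wide failure rate.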
Dashboarding¶
The basic operator dashboard is just three panels:
- kdbl_queue_depth per state, stacked
- rate(kdbl_files_written_total[1m]) per source_id, top 10
- histogram_quantile(0.95, rate(kdbl_sink_write_seconds_bucket[5m])) per backend
Build out from there based on the sources and SLOs that matter to your tenants.
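One way to write the three panels as PromQL is sketched below, packaged as a Prometheus recording-rule group so the queries can be precomputed. The aggregations and rule names are assumptions about how you might group the panels, not something K-Lake prescribes. The same `expr` values can be pasted directly into dashboard panels instead.

```yaml
groups:
  - name: kdbl-dashboard            # hypothetical group name
    rules:
      # Panel 1: queue depth per state (render as a stacked graph).
      - record: kdbl:queue_depth:sum_by_state
        expr: sum by (state) (kdbl_queue_depth)
      # Panel 2: files written per second, per source (wrap in topk(10, ...) in the panel).
      - record: kdbl:files_written:rate1m_by_source
        expr: sum by (source_id) (rate(kdbl_files_written_total[1m]))
      # Panel 3: p95 sink write latency per backend.
      - record: kdbl:sink_write_seconds:p95_by_backend
        expr: histogram_quantile(0.95, sum by (le, backend) (rate(kdbl_sink_write_seconds_bucket[5m])))
```

Keeping `le` in the `sum by` clause of the third rule matters: histogram_quantile needs the bucket boundaries to compute the percentile.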