Server metrics¶

sonda-server exposes three Prometheus endpoints. Each one answers a different question. Point your scrape jobs and alerts at the right one — they are not interchangeable.

Endpoint	What it returns	When to scrape
`GET /scenarios/{id}/metrics`	One scenario's series	Per-scenario debugging
`GET /scenarios/metrics`	Aggregate scenario data	Per-process scenario view
`GET /metrics`	Server-process RED + saturation	Operational dashboards and alerts

The URL encodes the distinction. /metrics is metrics about sonda-server — the same Prometheus convention as node_exporter or any other exporter. /scenarios/metrics and /scenarios/{id}/metrics are metrics generated by the scenarios sonda-server is running. /metrics tells you whether the server process is healthy. /scenarios/metrics tells you whether the scenarios are emitting the data you asked for. A saturated server can still serve metrics from a few healthy scenarios; a quiet /scenarios/metrics while /metrics shows zero active scenarios is "no work in flight," not "server broken." Reading the right endpoint is the difference between paging on a real outage and chasing a healthy server.

`GET /metrics`¶

Process-level RED (rate, errors, duration) plus saturation telemetry for the sonda-server process. The Helm chart's ServiceMonitor scrapes this endpoint by default.

curl http://localhost:8080/metrics

Response (text/plain; version=0.0.4; charset=utf-8)

# HELP sonda_server_active_scenarios Scenarios currently held in each operational state.
# TYPE sonda_server_active_scenarios gauge
sonda_server_active_scenarios{state="pending"} 0
sonda_server_active_scenarios{state="running"} 3
sonda_server_active_scenarios{state="paused"} 0
sonda_server_active_scenarios{state="held"} 0
sonda_server_active_scenarios{state="unresolved"} 0
# HELP sonda_server_scenarios_finished_total Scenarios that have reached the Finished state and not yet been deleted.
# TYPE sonda_server_scenarios_finished_total counter
sonda_server_scenarios_finished_total 12
# HELP sonda_server_worker_threads Configured tokio multi-thread worker count.
# TYPE sonda_server_worker_threads gauge
sonda_server_worker_threads 8
# HELP sonda_server_max_scenarios Configured scenario row cap (0 means unlimited).
# TYPE sonda_server_max_scenarios gauge
sonda_server_max_scenarios 500
# HELP sonda_server_requests_total HTTP requests served, per matched route, method, and status.
# TYPE sonda_server_requests_total counter
sonda_server_requests_total{route="/scenarios",method="POST",status="201"} 14
sonda_server_requests_total{route="/scenarios/{id}/stats",method="GET",status="200"} 482
# HELP sonda_server_request_duration_seconds HTTP request duration in seconds, per matched route and method.
# TYPE sonda_server_request_duration_seconds histogram
sonda_server_request_duration_seconds_bucket{route="/scenarios",method="POST",le="0.005"} 0
sonda_server_request_duration_seconds_bucket{route="/scenarios",method="POST",le="+Inf"} 14
sonda_server_request_duration_seconds_sum{route="/scenarios",method="POST"} 0.342
sonda_server_request_duration_seconds_count{route="/scenarios",method="POST"} 14
# HELP sonda_server_sink_errors_total Lifetime sink-write failures, summed across scenarios per sink_type.
# TYPE sonda_server_sink_errors_total counter
sonda_server_sink_errors_total{sink_type="loki"} 4
# HELP sonda_server_uptime_seconds Seconds since the server process started.
# TYPE sonda_server_uptime_seconds gauge
sonda_server_uptime_seconds 18432.4
# HELP sonda_server_build_info Build-time version and git SHA. Constant gauge always set to 1.
# TYPE sonda_server_build_info gauge
sonda_server_build_info{version="1.14.0",git_sha="58f288f47a2c2e91d8c3e8b4f1e6d8a9c2b5f3e7"} 1

The endpoint is idempotent — two back-to-back scrapes return logically equivalent bodies (counters increment, gauges may shift, but the shape and label sets stay stable).

Series reference¶

Nine series. Operator-tense one-liners and the alerts that matter follow.

`sonda_server_active_scenarios`¶

Gauge. Scenarios currently held in each operational state. Always emits all five rows even when the value is zero, so Prometheus alerting stays continuous as scenarios come and go.

Label	Values
`state`	`pending` / `running` / `paused` / `held` / `unresolved`

Finished scenarios are tracked separately by sonda_server_scenarios_finished_total. They still consume a row against --max-scenarios until DELETE /scenarios/{id} removes them.

Alert: a scenario stuck waiting on a cross-POST upstream

sonda_server_active_scenarios{state="unresolved"} > 10

When this fires, the cross-POST resolver has more than 10 downstream scenarios queued without their upstream arriving. Inspect them with GET /scenarios and look at the pending_ref block on GET /scenarios/{id}.

`sonda_server_scenarios_finished_total`¶

Counter. Scenarios that have reached the Finished state and are still resident in the server. They consume a row against --max-scenarios until you DELETE them.

Alert: finished rows not being reaped

sonda_server_scenarios_finished_total > 100

If this counter climbs without a corresponding burst of new scenarios, your automation is forgetting to DELETE /scenarios/{id} after a scenario completes.

`sonda_server_worker_threads`¶

Gauge. The tokio multi-thread worker count configured at startup — that is, the value of --workers or the cgroup-aware default std::thread::available_parallelism().

`sonda_server_max_scenarios`¶

Gauge. The configured value of --max-scenarios. Reports 0 when the cap is disabled (unlimited).

`sonda_server_requests_total`¶

Counter. HTTP requests served by the protected sub-routers, labelled by the matched route template, method, and status.

Label	Values
`route`	The axum route template, e.g. `/scenarios/{id}/stats` — not the concrete URL with the UUID.
`method`	`GET` / `POST` / `DELETE` / etc.
`status`	The numeric status code as a string.

The route label uses the route template (with literal {id} braces), not the concrete request path. This keeps cardinality bounded regardless of how many unique scenario IDs hit the server.

Alert: control-plane saturation hitting the inflight cap

rate(sonda_server_requests_total{status="429"}[5m]) > 0

Sustained 429s mean either the server cap (--max-scenarios) is hit or the inflight-request cap (--max-inflight-requests) is. Cross-reference sonda_server_active_scenarios to tell which.

Alert: payload-size limit being hit

rate(sonda_server_requests_total{status="413"}[5m]) > 0

Some client is sending bodies larger than --max-body-bytes. Either the limit is too tight for legitimate payloads or you have a misbehaving caller.

`sonda_server_request_duration_seconds`¶

Histogram. Request duration in seconds, labelled by matched route and method. Buckets are the Prometheus defaults: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, and +Inf.

Label	Values
`route`	Same matched-path template as `sonda_server_requests_total`.
`method`	Same as `sonda_server_requests_total`.
`le`	Cumulative bucket upper bound.

Alert: control-plane P99 above 1 second

histogram_quantile(0.99,
  sum by (le, route) (rate(sonda_server_request_duration_seconds_bucket[5m]))
) > 1

Use this to catch back-pressure from --max-inflight-requests. tower::limit::ConcurrencyLimitLayer does not return explicit 503s; it back-pressures via Service::poll_ready and the listener queues requests, so the symptom is "P99 climbs" rather than "errors spike."

`sonda_server_sink_errors_total`¶

Counter. Lifetime sink-write failures across every active scenario, summed per sink_type.

Label	Values
`sink_type`	The sink kind, e.g. `loki`, `http_push`, `kafka`, `otlp_grpc`, `tcp`, `stdout`.

Computed at scrape time from each scenario's total_sink_failures counter on GET /scenarios/{id}/stats. The counter resets when scenarios are deleted, which is the intended behaviour — a deleted scenario is no longer your problem.

Alert: sink errors on any sink type in the last 5 minutes

rate(sonda_server_sink_errors_total[5m]) > 0

This is the single most important alert on the page. Sustained sink errors mean some backend is unreachable or rejecting writes. Drill into individual scenarios with GET /scenarios and look at the degraded field, then inspect the offenders with GET /scenarios/{id}/stats for last_sink_error.

`sonda_server_uptime_seconds`¶

Gauge. Seconds since the server process started. Drops to a small value on restart — useful as a crash-loop indicator.

Alert: server restarted in the last 5 minutes

sonda_server_uptime_seconds < 300

`sonda_server_build_info`¶

Constant gauge, always 1. Carries the build version and git SHA as labels.

Label	Values
`version`	Crate version from `Cargo.toml`.
`git_sha`	Git SHA at build time, or `unknown` when `.git/` is absent (Docker source-tarball builds).

Use it to identify which build is running. Join it onto other queries with group_left to see whether a regression coincides with a deploy.

Authentication¶

All three metrics endpoints inherit the same SONDA_API_KEY gate as /scenarios/* and /events. When the env var or --api-key flag is set, requests must include Authorization: Bearer <key>. When the key is unset, the endpoints are public. See Authentication on Deploy as a server for the full setup.

Scraping /metrics with auth

curl -H "Authorization: Bearer my-secret-key" http://localhost:8080/metrics

For Prometheus:

prometheus.yml (with auth)

scrape_configs:
  - job_name: sonda-server
    scrape_interval: 15s
    metrics_path: /metrics
    bearer_token: my-secret-key
    static_configs:
      - targets: ["localhost:8080"]

For the Prometheus Operator, use a bearerTokenSecret on the ServiceMonitor.

Scrape isolation under load¶

The metrics endpoints sit on a separate sub-router from the control plane. --max-inflight-requests, --request-timeout, and --max-body-bytes do not apply to /metrics, /scenarios/metrics, /scenarios/{id}/metrics, or /scenarios/{id}/stats. A saturated server still returns 200 on its observability endpoints, so alerting keeps firing when you need it most. The auth gate still applies.

`GET /scenarios/metrics`¶

Aggregate scenario data — every running scenario's series fused into one Prometheus text response. This is the right endpoint for a regular Prometheus or vmagent scrape job covering every scenario the server is running. See Aggregate Prometheus scrape for the full reference, including the ?label=k:v filter.

`GET /scenarios/{id}/metrics`¶

One scenario's series in Prometheus text format. Use it when each scenario is its own logical target and you know the ID in advance — for example, a scenario loaded from a ConfigMap with a stable name. See Per-scenario metrics for the request shape and response semantics.

Where to next¶

Deploy as a server — flags and resource limits that shape /metrics values.
HTTP API reference — every endpoint and the 429 / 408 / 413 responses that sonda_server_requests_total counts.
Kubernetes — ServiceMonitor configuration for the Prometheus Operator.

Server metrics¶

GET /metrics¶

Series reference¶

sonda_server_active_scenarios¶

sonda_server_scenarios_finished_total¶

sonda_server_worker_threads¶

sonda_server_max_scenarios¶

sonda_server_requests_total¶

sonda_server_request_duration_seconds¶

sonda_server_sink_errors_total¶

sonda_server_uptime_seconds¶

sonda_server_build_info¶