Server API¶
sonda-server exposes a REST API for starting, inspecting, and stopping scenarios over HTTP.
Use it to integrate Sonda into CI pipelines, test harnesses, or dashboards without shell access.
Starting the Server¶
# Default port (8080)
cargo run -p sonda-server
# Custom port and bind address
cargo run -p sonda-server -- --port 9090 --bind 127.0.0.1
See CLI Reference: sonda-server for all sonda-server flags.
Control log verbosity with the RUST_LOG environment variable (default: info):
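For example (the crate-scoped filter in the second command is an assumption about the module path -- adjust to your build):
# Raise verbosity for everything
RUST_LOG=debug cargo run -p sonda-server
# Scope the filter to the server only (module path is an assumption)
RUST_LOG=sonda_server=debug cargo run -p sonda-server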
Press Ctrl+C for graceful shutdown -- the server signals all running scenarios to stop before exiting.
Health Check¶
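GET /health is always public (see Authentication below), so a bare curl is enough for probes and smoke tests; a healthy server answers with a success status:
curl http://localhost:8080/health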
Start a Scenario¶
Post a v2 scenario YAML or JSON body to
POST /scenarios. The server accepts both text/yaml (or application/x-yaml) and
application/json content types.
Need just one event?
POST /scenarios is for sustained emission over time. To fire a single log or metric synchronously and block until the sink ACKs, use the Single-Event API (POST /events) instead.
v2 scenarios only
The server only accepts v2 bodies (version: 2 at the top level). Legacy v1 bodies are
rejected with 400 Bad Request and a migration hint. See
Migrating v1 bodies below.
Sink URLs resolve inside the server's network
POSTed scenarios compile and run inside the sonda-server process. A sink with
url: http://localhost:<port> reaches the server container's loopback, not your
host. Use the address the server can actually see:
- In Docker Compose, use the service name -- http://victoriametrics:8428, http://loki:3100, kafka:9092.
- In Kubernetes, use the in-cluster Service DNS -- http://vmsingle:8428 for same-namespace, or http://vmsingle.monitoring.svc.cluster.local:8428 for cross-namespace.
When a POSTed scenario targets localhost, 127.0.0.1, or [::1], the server still returns
201 Created -- a loopback target is usually a mistake but occasionally legitimate, so the
scenario launches regardless. A warnings: [...] field on the response identifies the
offending sink and points here. The same message is emitted via tracing::warn! so operators
can catch it in server logs:
{
"id": "a1b2c3d4-...",
"name": "up",
"state": "running",
"warnings": [
"scenario entry 'up' sink `http_push` targets `http://localhost:8428/api/v1/write` — this host resolves to the sonda-server container's own loopback, not your host. Use a Docker Compose service name (e.g. `victoriametrics:8428`) or a Kubernetes Service DNS name instead. See docs/deployment/endpoints.md."
]
}
The warnings field is omitted entirely when no issues were detected, so existing
clients that do not know about the field continue to parse the response unchanged.
See Endpoints & networking for the full reference, for ${VAR:-default} interpolation so one file works from both paths, and for a sed one-liner covering the rewrite-before-POST fallback.
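As an illustration of the interpolation idea (the http_push sink and url field mirror the warning example above; the SINK_HOST variable name is made up):
sink:
  type: http_push
  # Set SINK_HOST=victoriametrics when POSTing to a containerized server;
  # left unset, it falls back to localhost for a local sonda run.
  url: http://${SINK_HOST:-localhost}:8428/api/v1/write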
Single-scenario body¶
curl -X POST \
-H "Content-Type: application/json" \
-d @- http://localhost:8080/scenarios <<'EOF'
{
"version": 2,
"defaults": {
"rate": 10,
"duration": "30s",
"encoder": { "type": "prometheus_text" },
"sink": { "type": "stdout" }
},
"scenarios": [
{
"id": "up",
"signal_type": "metrics",
"name": "up",
"generator": { "type": "constant", "value": 1.0 }
}
]
}
EOF
The JSON body is transcoded to YAML server-side and compiled through the same v2 pipeline as the YAML path. Any valid v2 scenario file can be posted as JSON by converting the YAML to its JSON equivalent.
The response shape depends on how many entries the compiler produces, not on the request format. A single-entry result returns the flat {"id", "name", "state"} body; anything that compiles to two or more entries (for example, a pack-backed entry that fans out) returns {"scenarios": [...]}. The state field reports the live lifecycle state at response time and takes one of "pending", "running", "paused", or "finished" (see /scenarios/{id}/stats for the full enum and the pending -> paused transition note).
Multi-scenario body¶
Post a v2 file with two or more entries under scenarios: to launch them atomically:
curl -X POST \
-H "Content-Type: text/yaml" \
--data-binary @examples/multi-scenario.yaml \
http://localhost:8080/scenarios
version: 2
defaults:
rate: 10
duration: 30s
encoder:
type: prometheus_text
sink:
type: stdout
scenarios:
- id: cpu_usage
signal_type: metrics
name: cpu_usage
generator:
type: sine
amplitude: 50
period_secs: 60
offset: 50
- id: app_logs
signal_type: logs
name: app_logs
encoder:
type: json_lines
log_generator:
type: template
templates:
- message: "Request from {ip} to {endpoint}"
field_pools:
ip: ["10.0.0.1", "10.0.0.2"]
endpoint: ["/api/v1/health", "/api/v1/metrics"]
seed: 42
curl -X POST \
-H "Content-Type: application/json" \
-d @- http://localhost:8080/scenarios <<'EOF'
{
"version": 2,
"defaults": {
"rate": 10,
"duration": "30s",
"encoder": { "type": "prometheus_text" },
"sink": { "type": "stdout" }
},
"scenarios": [
{
"id": "cpu_usage",
"signal_type": "metrics",
"name": "cpu_usage",
"generator": { "type": "constant", "value": 42.0 }
},
{
"id": "memory_usage",
"signal_type": "metrics",
"name": "memory_usage",
"generator": { "type": "constant", "value": 75.0 }
}
]
}
EOF
The response wraps each launched scenario in a scenarios array:
{
"scenarios": [
{ "id": "a1b2c3d4-...", "name": "cpu_usage", "state": "running" },
{ "id": "e5f6a7b8-...", "name": "app_logs", "state": "running" }
]
}
Each scenario gets its own ID and runs on a separate thread. You manage them individually
with GET /scenarios/{id}, DELETE /scenarios/{id}, etc.
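For example, reusing the truncated placeholder IDs from the response above:
# Inspect one scenario
curl http://localhost:8080/scenarios/a1b2c3d4-...
# Stop the other
curl -X DELETE http://localhost:8080/scenarios/e5f6a7b8-...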
Batch error handling is atomic -- if any entry in the batch fails compilation or validation, the entire request is rejected and nothing is launched:
| Condition | Status | Behavior |
|---|---|---|
| Body is not v2 (version: 2 missing) | 400 | Rejected with migration hint |
| Body parses but compile fails (unknown field, unresolved pack, etc.) | 400 | Rejected with compiler error detail |
| Empty scenarios: [] | 400 | At least one scenario required |
| Any entry fails runtime validation | 422 | Nothing launched; detail identifies the failing entry |
| All entries valid | 201 | All scenarios launched and returned |
Long-running scenarios
Omit the duration field from your scenario body (or put duration: only inside a
single entry and omit it from defaults:) to create a scenario that runs indefinitely.
Stop it later with DELETE /scenarios/{id}. The canonical run-until-stopped example is
examples/long-running-metrics.yaml
-- POST it to start, DELETE to stop; the operator owns the lifecycle. See the
tutorial Server API page for
a full start and stop example.
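As a sketch (the canonical file is examples/long-running-metrics.yaml; this minimal body simply drops duration from the defaults used earlier on this page):
version: 2
defaults:
  rate: 10
  encoder:
    type: prometheus_text
  sink:
    type: stdout
scenarios:
  - id: up
    signal_type: metrics
    name: up
    generator:
      type: constant
      value: 1.0
# no duration anywhere -> runs until DELETE /scenarios/{id}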
Phase offsets and after: chains in batch requests
Multi-scenario batches honor phase_offset, clock_group, and after: fields, just
like sonda run. This lets you create time-correlated scenarios over the API:
version: 2
defaults:
rate: 1
duration: 120s
encoder:
type: prometheus_text
sink:
type: stdout
scenarios:
- id: cpu_usage
signal_type: metrics
name: cpu_usage
phase_offset: "0s"
clock_group: alert-test
generator:
type: sequence
values: [20, 20, 95, 95, 95, 20]
repeat: true
- id: memory_usage
signal_type: metrics
name: memory_usage
phase_offset: "3s"
clock_group: alert-test
generator:
type: sequence
values: [40, 40, 88, 88, 88, 40]
repeat: true
The memory_usage scenario starts 3 seconds after cpu_usage, simulating a cascading
failure for compound alert testing.
Migrating v1 bodies¶
When you POST a pre-v2 body, the server responds with 400 Bad Request and a migration
hint in the detail field:
{
"error": "bad_request",
"detail": "body is not a v2 scenario. Sonda only accepts v2 scenario bodies (`version: 2` at the top level). Migrate this body to v2 — see docs/configuration/v2-scenarios.md for the migration guide."
}
The same hint appears for bodies that do declare version: 2 but fail to compile (unknown
fields, unresolved pack references, malformed after: clauses). In that case the detail
carries the compiler's error message instead. See
Migrating from v1 for side-by-side
shape conversions.
Pack references over HTTP
The server has no filesystem pack catalog. Bodies that reference a named pack
(pack: telegraf_snmp_interface) compile against an empty in-memory resolver and fail
with a pack-not-found error. For now, pack-backed scenarios must run via the CLI or be
expanded into per-metric entries before posting.
Error response reference¶
| Status | Condition | Detail field |
|---|---|---|
| 400 Bad Request | Body is not UTF-8, not valid JSON/YAML, missing version: 2, or fails compilation. | Parser or compiler error; v1 bodies include the migration hint. |
| 409 Conflict | The posted body sets a top-level scenario_name that matches an active scenario already in the map. | Identifies the duplicate name and lists the conflicting scenarios. See Duplicate scenario_name returns 409. |
| 422 Unprocessable Entity | Body compiles but fails runtime validation (rate: 0, zero duration, etc.). | Validation error identifying the failing entry. |
| 500 Internal Server Error | Scenario thread could not be spawned, or internal state error. | Short internal error; check server logs. |
Duplicate scenario_name returns 409¶
When a posted v2 body sets a top-level scenario_name, the server scans the active scenario map for any handle that already carries the same scenario_name and is in pending, running, or paused state. If at least one match is found the POST is rejected with 409 Conflict; nothing is launched. The contract is explicit: the operator must DELETE the conflicting scenarios first, then re-post. There is no ?force=true override -- the explicit DELETE is the only way to free the name.
Anonymous bodies (no top-level scenario_name) bypass this check entirely — two consecutive POSTs of the same anonymous body both return 201. Finished handles are considered stale and never block a new POST — once every prior cascade with the same name reaches finished state, a new cascade with the same name returns 201.
The conflict check is best-effort: it acquires a read lock, scans the active scenarios, and releases the lock before launching. Two simultaneous POSTs of the same scenario_name can both pass the check if they race within the launch window -- both will register and their Prometheus streams will collide on duplicate timestamps. Workshop-scale and sequential-operator usage are unaffected; high-concurrency callers should serialize POSTs that share a scenario_name.
The 409 body lists every active scenario contributing to the conflict so the operator knows which IDs to DELETE:
{
"error": "scenario_name 'flap-interface' is already running",
"conflicting_scenarios": [
{"id": "a1b2c3d4-...", "name": "link_status", "state": "running"}
],
"hint": "DELETE the conflicting scenarios before posting a new cascade with the same scenario_name"
}
Each conflicting_scenarios entry carries the scenario id (use it with DELETE /scenarios/{id}), the per-entry name (the runtime-launched scenario name, not the file-level scenario_name), and the live state (one of pending, running, paused). When the body produced multiple entries (multi-entry POST or pack expansion), each launched handle inherits the same file-level scenario_name and contributes one item to the array.
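Assuming the 409 body shape above, a cleanup-and-retry sketch looks like this (the YAML filename is illustrative; the endpoints are the ones documented on this page):
# If the POST is rejected with 409, free the name by deleting the listed IDs, then re-post.
post() {
  curl -s -o /tmp/resp.json -w "%{http_code}" -X POST \
    -H "Content-Type: text/yaml" \
    --data-binary @examples/flap-interface.yaml \
    http://localhost:8080/scenarios
}
if [ "$(post)" = "409" ]; then
  for id in $(jq -r '.conflicting_scenarios[].id' /tmp/resp.json); do
    curl -s -X DELETE "http://localhost:8080/scenarios/$id"
  done
  post
fi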
API Endpoints¶
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| POST | /scenarios | Start one or more scenarios from YAML/JSON body |
| GET | /scenarios | List all running scenarios |
| GET | /scenarios/{id} | Inspect a scenario: config, stats, elapsed |
| DELETE | /scenarios/{id} | Stop and remove a running scenario |
| GET | /scenarios/{id}/stats | Live stats: rate, events, gap/burst state, sink-failure counters. See Self-observability via /stats. |
| GET | /scenarios/{id}/metrics | Latest metrics in Prometheus text format |
| POST | /events | Emit one log or metric event synchronously. See Single-Event API. |
Authentication¶
You can protect scenario endpoints with API key authentication. When enabled, all /scenarios/* requests and POST /events must include a bearer token. The /health endpoint is always public, so health probes and load balancer checks work without credentials.
Enabling authentication¶
Pass an API key via the --api-key flag or the SONDA_API_KEY environment variable:
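# Via flag
cargo run -p sonda-server -- --api-key my-secret-key
# Via environment variable
SONDA_API_KEY=my-secret-key cargo run -p sonda-server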
When the server starts with a key configured, the startup log confirms that API key authentication is enabled.
No key = no auth
When neither --api-key nor SONDA_API_KEY is set, the server runs without
authentication and all endpoints are publicly accessible. This preserves full
backwards compatibility with existing deployments.
Making authenticated requests¶
Include the key in the Authorization: Bearer <key> header:
# Start a scenario (requires auth)
curl -X POST \
-H "Authorization: Bearer my-secret-key" \
-H "Content-Type: text/yaml" \
--data-binary @examples/basic-metrics.yaml \
http://localhost:8080/scenarios
# List scenarios (requires auth)
curl -H "Authorization: Bearer my-secret-key" \
http://localhost:8080/scenarios
# Health check (always public, no header needed)
curl http://localhost:8080/health
Error responses¶
Requests to protected endpoints without a valid key return 401 Unauthorized with a JSON error body:
| Condition | Response body |
|---|---|
| Missing or malformed header | {"error": "unauthorized", "detail": "missing or malformed Authorization header"} |
| Wrong key | {"error": "unauthorized", "detail": "invalid API key"} |
Protected vs. public endpoints¶
| Endpoint | Auth required |
|---|---|
| GET /health | No -- always public |
| POST /scenarios | Yes |
| GET /scenarios | Yes |
| GET /scenarios/{id} | Yes |
| DELETE /scenarios/{id} | Yes |
| GET /scenarios/{id}/stats | Yes |
| GET /scenarios/{id}/metrics | Yes |
| POST /events | Yes |
Prometheus scraping with auth
If you enable authentication, your Prometheus scrape config must include the bearer
token for /scenarios/{id}/metrics. Add a bearer_token or bearer_token_file
field to your scrape_configs entry. See Scrape Integration
below.
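A sketch of the scrape entry with a bearer token -- it mirrors the config under Scrape Integration below, and uses the my-secret-key value from the examples above; swap in bearer_token_file when the key lives on disk:
scrape_configs:
  - job_name: sonda
    scrape_interval: 15s
    metrics_path: /scenarios/<SCENARIO_ID>/metrics
    bearer_token: my-secret-key   # or bearer_token_file: /etc/prometheus/sonda-key
    static_configs:
      - targets: ["localhost:8080"]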
Kubernetes Secrets
In Kubernetes deployments, store the API key in a Secret and reference it as an environment variable in your Deployment spec:
apiVersion: v1
kind: Secret
metadata:
name: sonda-api-key
type: Opaque
stringData:
api-key: my-secret-key
env:
- name: SONDA_API_KEY
valueFrom:
secretKeyRef:
name: sonda-api-key
key: api-key
Apply the secret before deploying:
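# The manifest filename is illustrative -- use whatever file holds the Secret above
kubectl apply -f sonda-api-key-secret.yaml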
See API key authentication in the Kubernetes deployment guide for the full Helm chart setup.
Stopping a Scenario¶
DELETE /scenarios/{id} stops the scenario thread, collects final stats, and removes the
scenario from the server. After deletion, the scenario no longer appears in GET /scenarios
and its memory is freed.
curl -X DELETE http://localhost:8080/scenarios/<id>
# {"id":"<id>","status":"stopped","total_events":42}
Response codes:
| Status | Meaning |
|---|---|
| 200 OK | Scenario stopped and removed. Body includes id, status, and total_events. |
| 404 Not Found | No scenario with that ID exists (already deleted or never created). |
DELETE is not idempotent
A successful DELETE removes the scenario entirely. A second DELETE on the same ID returns 404, not 200. If your automation retries deletes, treat 404 as success.
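A retry-safe delete might look like this (a sketch that treats 404 the same as 200):
# Succeeds whether the scenario was just stopped or is already gone
status=$(curl -s -o /dev/null -w "%{http_code}" -X DELETE "http://localhost:8080/scenarios/$ID")
[ "$status" = "200" ] || [ "$status" = "404" ]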
Self-observability via /stats¶
GET /scenarios/{id}/stats returns live runner telemetry. The four sink-failure fields let external monitors spot a wedged runner without parsing logs, and you choose the threshold that counts as "degraded" for your environment.
/scenarios list¶
{
"scenarios": [
{
"id": "a1b2c3d4-...",
"name": "noisy_logs",
"state": "running",
"elapsed_secs": 184.2
}
]
}
Each entry carries id, name, state, and elapsed_secs. The state field takes one of pending, running, paused, or finished (see the state field reference below for what each value means and the transition note for pending -> paused). To see sink health, follow up with GET /scenarios/{id}/stats for the scenario you care about.
/scenarios/{id}/stats¶
{
"total_events": 3359,
"current_rate": 100.4,
"target_rate": 100.0,
"bytes_emitted": 1048576,
"errors": 12,
"uptime_secs": 184.2,
"state": "running",
"in_gap": false,
"in_burst": false,
"consecutive_failures": 4,
"total_sink_failures": 12,
"last_sink_error": "HTTP 500 from 'http://loki:3100/loki/api/v1/push'",
"last_successful_write_at": 1714694400000000000
}
| Field | Type | Meaning |
|---|---|---|
| total_events | integer | Total events emitted since the scenario started. |
| current_rate | float | Measured events per second from the runner's rate tracker. |
| target_rate | float | The rate configured in the scenario file. |
| bytes_emitted | integer | Total bytes written to the sink. |
| errors | integer | Encode or sink-write errors observed. |
| uptime_secs | float | Seconds since the scenario was launched. |
| state | string | One of pending, running, paused, finished. See the while: lifecycle diagram. |
| in_gap | bool | true while a gap window is suppressing output. |
| in_burst | bool | true while a burst window is elevating the rate. |
| consecutive_failures | integer | Sink errors observed since the most recent successful write. Resets to 0 on the next successful write. |
| total_sink_failures | integer | Lifetime sink-error count. Monotonic. |
| last_sink_error | string or null | Text of the most recent sink error, or null if none has been observed. |
| last_successful_write_at | integer or null | Wall-clock time of the most recent successful write, expressed as Unix nanoseconds. null until the first write succeeds. |
The four sink-failure fields are the runtime telemetry surface for the on_sink_error policy. When on_sink_error: warn (the default) is in effect, the runner stays alive on transient sink errors and these counters tell you what's happening; when on_sink_error: fail is set, the thread exits on the first error and state flips to finished.
pending -> paused is a reachable direct transition
A scenario carrying both after: and while: whose after: fires while the gate is closed enters paused directly, skipping running. Clients building a state-machine assertion should not assume pending always precedes running -- watch for paused from the pending state too.
Upgrading from a release without pending?
Earlier Sonda releases reported only running, paused, and finished on /scenarios/{id}/stats. The pending value is new and arrives when a scenario is waiting on after: or on the first eligible upstream tick of a while: gate. Before rolling out, grep your Prometheus recording rules and Grafana dashboards for label matchers like state=~"running|paused|finished" -- exhaustive enumerations silently drop scenarios in pending. Either add pending to the alternation (state=~"pending|running|paused|finished") or rewrite the matcher as a negation (state!="finished") so new lifecycle values surface without another patch.
Detecting a wedged sink
Compute "degraded" yourself by thresholding total_sink_failures and the staleness of last_successful_write_at. Pick a staleness window that fits your scenario's rate and your tolerance for transient blips:
# Flag a scenario as degraded when sink failures have happened and
# no write has succeeded in the last 30 seconds:
curl -sS http://localhost:8080/scenarios/$ID/stats |
jq 'select(.total_sink_failures > 0 and (.last_successful_write_at == null or (now*1e9 - .last_successful_write_at) > 30e9))'
A non-empty result means the scenario is degraded by your definition. Wire the same expression into a Kubernetes readiness probe, a Prometheus alert query, or a Grafana panel — the operator owns the threshold.
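As one way to wire it into Kubernetes, a readiness-probe sketch with the same condition inverted (it assumes curl and jq exist in the image, SCENARIO_ID is injected as an env var, and authentication is disabled -- add the Authorization header if not):
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - curl -sS "http://localhost:8080/scenarios/$SCENARIO_ID/stats" | jq -e '.total_sink_failures == 0 or (.last_successful_write_at != null and (now*1e9 - .last_successful_write_at) <= 30e9)'
  periodSeconds: 15
  failureThreshold: 3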
Scrape Integration¶
The GET /scenarios/{id}/metrics endpoint returns recent metric events in Prometheus text
exposition format. This enables pull-based integration: start a scenario via POST /scenarios,
then configure Prometheus or vmagent to scrape the endpoint directly.
scrape_configs:
- job_name: sonda
scrape_interval: 15s
metrics_path: /scenarios/<SCENARIO_ID>/metrics
static_configs:
- targets: ["localhost:8080"]
Replace <SCENARIO_ID> with the ID returned by POST /scenarios.
The endpoint accepts an optional ?limit=N query parameter (default 100, max 1000)
to control how many recent events are returned per scrape. Each scrape drains the buffer,
so events appear once per cycle. If no metrics are available yet, you get 204 No Content.
Unknown scenario IDs return 404 Not Found.
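To change the per-scrape limit, the query parameter can be carried in the scrape config's params field (a sketch extending the job above):
scrape_configs:
  - job_name: sonda
    scrape_interval: 15s
    metrics_path: /scenarios/<SCENARIO_ID>/metrics
    params:
      limit: ["500"]
    static_configs:
      - targets: ["localhost:8080"]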
Note
The server is also available as a Docker image and Helm chart for containerized deployments.