Troubleshooting

When Sonda is not working as expected, start here. This page covers the most common issues and how to resolve them, moving from general diagnostics to sink-specific and deployment problems.

First steps

Before looking at specific issues, run these quick checks.

Validate your configuration

Use --dry-run to parse and validate a scenario without emitting events:

Scenario file (my-scenario.yaml):

sonda --dry-run run my-scenario.yaml

Catalog entry (./my-catalog):

sonda --dry-run --catalog ./my-catalog run @cpu-spike

If the config is valid, Sonda prints the resolved settings and exits with code 0. If there is an error, it prints the problem to stderr and exits with code 1.
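As a sketch of how this exit code can gate a CI step, the function below runs the dry-run check over several scenario files and fails if any of them is rejected. The `validate_scenarios` name and the `SONDA_BIN` override variable are hypothetical; only the `sonda --dry-run run <file>` invocation comes from this page.

```shell
# Hypothetical helper: validate each scenario file with --dry-run.
# SONDA_BIN is an assumed override for the binary name or path.
validate_scenarios() {
    failed=0
    for file in "$@"; do
        # Exit code 0 means the scenario parsed and validated.
        if "${SONDA_BIN:-sonda}" --dry-run run "$file" >/dev/null; then
            echo "ok: $file"
        else
            echo "invalid: $file" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Usage (commented out; needs the sonda binary):
#   validate_scenarios my-scenario.yaml || exit 1
```

The loop keeps going after a failure so that one run reports every invalid file rather than only the first.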

Get diagnostic output

Use --verbose to print the resolved config at startup, then run normally. This shows exactly what Sonda parsed before it starts emitting events:

sonda --verbose run my-scenario.yaml \
  --sink http_push --endpoint http://localhost:8428/api/v1/import/prometheus

Exit codes

Code	Meaning
`0`	Success. Scenario completed or `--dry-run` validation passed.
`1`	Runtime error. Invalid config, sink unreachable, or scenario validation failure.
`2`	Argument parse error. Unknown flag or unrecognized subcommand.
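A wrapper script can turn these codes into readable messages. The following is a minimal sketch; `describe_exit` is a hypothetical helper name, and the messages paraphrase the table above.

```shell
# Hypothetical helper: translate a Sonda exit code into the meaning from the table.
describe_exit() {
    case "$1" in
        0) echo "success" ;;
        1) echo "runtime error: invalid config, unreachable sink, or failed validation" ;;
        2) echo "argument parse error: unknown flag or unrecognized subcommand" ;;
        *) echo "undocumented exit code $1" ;;
    esac
}

# Usage (commented out; needs the sonda binary):
#   sonda --dry-run run my-scenario.yaml; describe_exit "$?"
```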

A scenario stopped emitting silently

A scenario looks alive (the process is running, no error in the foreground) but no data reaches the backend. This usually means the sink is failing on every write. Sonda's default on_sink_error: warn policy keeps the runner alive through sink errors, so the symptom is degradation rather than a crash.

Confirm the diagnosis from three places:

Stderr [progress] banner. When a runner exits, the progress reporter prints a one-shot STOPPED line that includes the last sink error:
```
[progress] my_scenario  STOPPED (sink: HTTP 500 from 'http://loki:3100/loki/api/v1/push') | events: 3359 | bytes: 1.0 MB | elapsed: 18h59m
```
No parenthetical means the scenario stopped cleanly (duration expired, Ctrl+C, and so on) and the sink was healthy.
GET /scenarios/{id}/stats. A non-zero consecutive_failures with an old last_successful_write_at confirms the sink is stuck. last_sink_error carries the message:
```
curl -s http://localhost:8080/scenarios/$ID/stats | jq '.consecutive_failures, .last_sink_error'
```
See Self-observability via /stats.

total_events is not a delivery signal for batching sinks

Batching sinks (loki, http_push, remote_write, otlp_grpc, kafka) buffer events and deliver them in groups. total_events increases on every buffered write, so a rising counter is not proof that data reaches the backend. Read the delivery-health fields instead: last_successful_write_at (old or null means nothing has arrived) and consecutive_failures (non-zero means a stuck buffer). See What a stuck batching sink looks like for the full timeline.
Use the degraded field for monitoring. GET /scenarios returns a precomputed degraded: bool per scenario. It is true when the scenario has had sink failures and has not delivered in the last 30 seconds. Use it directly for a readiness probe or alert:
```
curl -sS http://localhost:8080/scenarios | jq '.scenarios[] | select(.degraded)'
```
A non-empty result means at least one scenario has stopped delivering. The same expression works as a Kubernetes readiness probe or a Prometheus alert input. If you need a staleness window other than 30 seconds, apply your own threshold to the raw fields from the per-scenario /stats endpoint: combine total_sink_failures > 0 with a staleness check on last_successful_write_at.
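A custom staleness check over the raw fields could look like the sketch below. It assumes you have already extracted total_sink_failures and last_successful_write_at from /stats and converted the timestamp to epoch seconds; the format the endpoint actually returns is not stated here, so that conversion is left to you. The `is_degraded` name is hypothetical.

```shell
# Hypothetical helper: degraded = has sink failures AND no delivery within the window.
# Arguments: total_sink_failures, last successful write (epoch seconds,
# empty or "null" if never), current time (epoch seconds), window in seconds.
is_degraded() {
    failures=$1 last_write=$2 now=$3 window=$4
    # No failures at all: treat as healthy regardless of staleness.
    [ "$failures" -gt 0 ] || return 1
    # Failures and nothing ever delivered: degraded.
    case "$last_write" in
        ''|null) return 0 ;;
    esac
    # Degraded only if the last delivery is older than the window.
    [ $((now - last_write)) -gt "$window" ]
}

# Usage with a 120-second window (values are illustrative):
#   is_degraded 3 1700000000 "$(date +%s)" 120 && echo "sink is stuck"
```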

Recovery is the default behavior. Under on_sink_error: warn, the runner keeps emitting while you fix the sink (restart Loki, repair DNS, restore the network path). Once the sink accepts a write, consecutive_failures resets to 0 and last_successful_write_at advances, so any alert threshold built on these fields clears automatically. If you want a sink failure to hard-fail the run, set on_sink_error: fail. See Sink-error policy.

Connection and delivery issues

Connection refused

You configured a network sink but Sonda reports a connection error.

Symptom	Likely cause	Fix
`connection refused` on HTTP/TCP sink	Backend is not running or not listening on the expected port	Verify the backend is up: `curl -s http://host:port/health`
`connection refused` on gRPC (OTLP)	Collector not running, or wrong port (HTTP vs gRPC)	OTLP gRPC uses port `4317`, not `4318` (HTTP). Check the collector status
DNS resolution failure	Hostname typo or DNS not configured	Test with `dig` or `nslookup`. Use the IP address to isolate DNS
Timeout with no error	Firewall blocking the port	Check firewall rules. Try `nc -zv host port` to test connectivity

Tip

Test connectivity to your backend before running Sonda. A quick curl -s http://localhost:8428/health for VictoriaMetrics or curl -s http://localhost:3100/ready for Loki confirms the backend is reachable.
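When the backend starts alongside Sonda (for example in the same Compose stack), a single probe may run too early. The sketch below retries an arbitrary probe command a fixed number of times. `wait_for_backend` is a hypothetical helper, and the attempt count and delay are arbitrary illustrative values.

```shell
# Hypothetical helper: retry a probe command until it succeeds.
# Arguments: max attempts, delay between attempts in seconds, then the probe command.
wait_for_backend() {
    attempts=$1 delay=$2
    shift 2
    i=1
    while true; do
        # Probe output is discarded; only the exit status matters.
        if "$@" >/dev/null 2>&1; then return 0; fi
        if [ "$i" -ge "$attempts" ]; then return 1; fi
        i=$((i + 1))
        sleep "$delay"
    done
}

# Usage (commented out; needs a running backend and the sonda binary):
#   wait_for_backend 10 1 curl -sf http://localhost:8428/health \
#     && sonda run my-scenario.yaml
```

Passing the probe as arguments keeps the helper independent of any one backend, so the same function works with the Loki /ready check.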

Data not appearing at the destination

Sonda runs without errors but you do not see data in your backend.

Symptom	Likely cause	Fix
No data in VictoriaMetrics	Wrong endpoint path	Use `/api/v1/import/prometheus` for `http_push`, `/api/v1/write` for `remote_write`
No data in Prometheus	Prometheus needs the remote-write receiver enabled	Start Prometheus with `--web.enable-remote-write-receiver`
Encoder/sink mismatch	Using `prometheus_text` encoder with `remote_write` sink (or vice versa)	Match encoder to sink: `remote_write` encoder with `remote_write` sink, `otlp` encoder with `otlp_grpc` sink
HTTP 400 Bad Request	Wrong `content_type` for the endpoint	Use `text/plain` for the VictoriaMetrics import endpoint
POST to `sonda-server` succeeds but no data in the backend	Sink `url: http://localhost:<port>` resolves inside the server container	Use the in-network address (Compose service name `http://victoriametrics:8428`, or Kubernetes Service DNS), or write the URL with `${VAR:-default}` so one file works from both paths. See Endpoints & networking
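The endpoint paths and the `${VAR:-default}` pattern from the table can be combined in a small shell helper. This is a sketch that uses the shell's own default expansion to illustrate the idea; `vm_endpoint` and the `VM_BASE_URL` variable are hypothetical names.

```shell
# Hypothetical helper: build the VictoriaMetrics URL for a given Sonda sink.
# VM_BASE_URL is an assumed variable; the default covers runs from the host.
vm_endpoint() {
    base="${VM_BASE_URL:-http://localhost:8428}"
    case "$1" in
        http_push)    echo "$base/api/v1/import/prometheus" ;;
        remote_write) echo "$base/api/v1/write" ;;
        *) echo "no VictoriaMetrics path listed for sink '$1'" >&2; return 1 ;;
    esac
}

vm_endpoint http_push   # prints http://localhost:8428/api/v1/import/prometheus when VM_BASE_URL is unset
```

Inside a Compose network, exporting VM_BASE_URL=http://victoriametrics:8428 switches both paths to the in-network address without editing the script.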

Batching delays

Data arrives in groups, or only appears when the scenario ends.

Symptom	Likely cause	Fix
Stdout output appears in groups	Normal OS-level buffering (~8 KB)	Expected behavior. Data flushes when the buffer fills or the scenario ends
No HTTP POST until the scenario ends	Batch threshold not reached at low rates	Lower `batch_size` (e.g., `512` for `http_push`) or increase the rate. See Sink Batching
Short scenario sends only one batch	Total data smaller than the batch threshold	All data flushes on exit. This is correct behavior for short runs

Info

At 10 events/sec with http_push at the default 4 KiB threshold, around 40 events (at roughly 100 bytes each) must accumulate before the first POST, which takes about 4 seconds. Set batch_size: 512 for faster feedback. Time-based flushing is tracked in #266.
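The arithmetic behind that estimate can be sketched as follows. The 100-byte event size is an assumption inferred from the figures above (4 KiB divided by about 40 events); real encoded sizes depend on the encoder and label set. `first_flush_estimate` is a hypothetical helper.

```shell
# Hypothetical helper: estimate how long until the first batch is sent.
# Arguments: batch threshold in bytes, events per second, bytes per encoded event.
first_flush_estimate() {
    awk -v batch="$1" -v rate="$2" -v size="$3" 'BEGIN {
        events = int((batch + size - 1) / size)   # round up to whole events
        printf "%d events, %.1f s\n", events, events / rate
    }'
}

first_flush_estimate 4096 10 100   # prints "41 events, 4.1 s"
first_flush_estimate 512 10 100    # prints "6 events, 0.6 s"
```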

Sink-specific issues

Loki

Symptom	Likely cause	Fix
`400 Bad Request` from Loki	Label names contain invalid characters	Loki labels must match `[a-zA-Z_][a-zA-Z0-9_]*`. Avoid dots, dashes, or spaces in label keys
Logs rejected in multi-tenant Loki	Missing tenant header	Add `X-Scope-OrgID` via custom headers on an `http_push` sink, or use the default tenant if Loki is in single-tenant mode
No logs visible in Grafana	Wrong label selector in Explore	Check that your Grafana query matches the labels you set in the scenario

Tip

Sonda sends logs to {url}/loki/api/v1/push. You configure only the base URL (for example, http://localhost:3100), not the full push path.
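The label rule from the table can be checked before a run. The sketch below tests candidate label keys against the same pattern with `grep -E`; `valid_label_key` is a hypothetical helper and the sample keys are made up for illustration.

```shell
# Hypothetical helper: check a label key against [a-zA-Z_][a-zA-Z0-9_]*.
valid_label_key() {
    printf '%s\n' "$1" | grep -Eq '^[a-zA-Z_][a-zA-Z0-9_]*$'
}

# Sample keys: dots, dashes, and a leading digit are rejected.
for key in job_name host.name app-id _internal 9lives; do
    if valid_label_key "$key"; then
        echo "ok:      $key"
    else
        echo "invalid: $key"
    fi
done
```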

Kafka

Symptom	Likely cause	Fix
Broker connection timeout	Wrong broker address or port	Verify the broker is reachable: `nc -zv broker-host 9092`. Check for TLS port (`9093`) vs plaintext (`9092`)
`UnknownTopicOrPartition`	Topic does not exist and auto-creation is off	Set `auto.create.topics.enable=true` on the broker, or create the topic before running Sonda
Authentication failure with SASL	Wrong mechanism, username, or password	Confirm `sasl.mechanism` matches your broker config. Confluent Cloud uses `PLAIN`, AWS MSK uses `SCRAM-SHA-512`
Data sent but unreadable	Consumer expects a different encoding	Ensure the consumer's deserializer matches Sonda's encoder (for example, `prometheus_text` produces plain text)

Warning

SASL credentials are sent in plaintext if TLS is not enabled. Sonda warns about this at startup. Always set tls.enabled: true alongside SASL in production.

Remote write

Symptom	Likely cause	Fix
HTTP 400 from backend	Wrong endpoint URL for the backend	Each backend has a specific path. See the compatible endpoints table
HTTP 403 or 401	Backend requires authentication headers	Add auth headers via `http_push` with custom `headers` instead

OTLP gRPC

Symptom	Likely cause	Fix
gRPC `INVALID_ARGUMENT`	Signal type mismatch between encoder and sink	Set `signal_type` in the sink to match your scenario: `metrics` for metric scenarios, `logs` for log scenarios
Connection refused on port 4318	Using the HTTP port instead of gRPC	OTLP gRPC uses port `4317`. Port `4318` is for OTLP HTTP
`UNAUTHENTICATED`	Collector requires an auth token	Configure the collector to accept unauthenticated connections, or use an `http_push` sink with auth headers instead

Resource issues

High memory usage

Symptom	Likely cause	Fix
Memory grows during cardinality spikes	Each unique label combination creates a new series in memory	Reduce `cardinality` in the spike config, or use shorter `for` windows
Memory grows during CSV replay	Large CSV file loaded into memory	Use smaller CSV files, or split large files into chunks
Steady memory growth over long runs	Large label sets with many static labels	Reduce the number of labels per metric. Each label adds memory per series

Info

Sonda's baseline memory footprint is roughly 5 MB. Memory scales with the number of unique series generated at the same time. For sizing guidance, see Capacity Planning — Performance baselines.
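For the CSV replay case, one way to split a large file into chunks is sketched below with standard `head`, `tail`, and `split`. It assumes the file has a single header row that every chunk should repeat; whether your replay scenario expects a header is something to confirm against your own config. `split_csv` is a hypothetical helper.

```shell
# Hypothetical helper: split a CSV into chunks of N data rows, repeating the header.
# Arguments: source file, data rows per chunk, output prefix.
split_csv() {
    src=$1 rows=$2 prefix=$3
    header=$(head -n 1 "$src")
    # Split everything after the header into temporary body files.
    tail -n +2 "$src" | split -l "$rows" - "$prefix.body."
    for part in "$prefix".body.*; do
        [ -e "$part" ] || continue   # no data rows: nothing to write
        # split names parts with suffixes aa, ab, ...; reuse them for the output.
        out="$prefix.${part##*.}.csv"
        { printf '%s\n' "$header"; cat "$part"; } > "$out"
        rm -f "$part"
    done
}

# Usage (file names are illustrative):
#   split_csv big.csv 100000 chunk   # writes chunk.aa.csv, chunk.ab.csv, ...
```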

Configuration mistakes

YAML parsing errors

Symptom	Likely cause	Fix
`v2 scenario file requires a top-level 'kind:' field`	The file is missing the `kind:` declaration	Add `kind: runnable` (for files you run) or `kind: composable` (for metric packs) at the top of the file, alongside `version: 2`. See Scenario Files.
`unknown kind '<value>': must be 'runnable' or 'composable'`	`kind:` is set to a typo or unsupported value	Use exactly `runnable` or `composable`.
`invalid type` error on a numeric field	Value is quoted as a string in YAML (for example, `rate: "10"`)	Remove quotes from numeric fields: `rate: 10`
`unknown field` error	Typo in a field name, or field placed at the wrong nesting level	Check indentation. `labels` goes at the scenario level, not inside `sink`
`missing field` error	Required field omitted	Run `sonda --dry-run` to see which field is missing

Feature flag errors

Some sinks and encoders require Cargo feature flags when building from source. Pre-built release binaries include all features.

Feature	Required for	Build command
`http`	`http_push`, `loki` sinks	`cargo build --features http -p sonda`
`remote-write`	`remote_write` encoder and sink	`cargo build --features remote-write -p sonda`
`otlp`	`otlp` encoder, `otlp_grpc` sink	`cargo build --features otlp -p sonda`
`kafka`	`kafka` sink	`cargo build --features kafka -p sonda`

Tip

Build with all features at once: cargo build --features http,remote-write,otlp,kafka -p sonda
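If you only need one sink or encoder, the table above can be expressed as a lookup. This is a sketch; `feature_for` and `build_command` are hypothetical helpers, and the function only prints the build command rather than running it.

```shell
# Hypothetical helper: map a sink or encoder name to the Cargo feature in the table.
feature_for() {
    case "$1" in
        http_push|loki) echo "http" ;;
        remote_write)   echo "remote-write" ;;
        otlp|otlp_grpc) echo "otlp" ;;
        kafka)          echo "kafka" ;;
        *) echo "no feature flag listed for '$1'" >&2; return 1 ;;
    esac
}

# Print the build command for one component without running it.
build_command() {
    feature=$(feature_for "$1") || return 1
    echo "cargo build --features $feature -p sonda"
}

build_command loki   # prints "cargo build --features http -p sonda"
```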

Container and signal handling

Sonda flushes all buffered data on clean shutdown (SIGTERM or SIGINT). If the process is killed with SIGKILL, any data still in the buffer is lost.

Symptom	Likely cause	Fix
Partial data loss in Docker	Container stopped with `docker kill` (sends SIGKILL)	Use `docker stop` instead, which sends SIGTERM and waits for graceful shutdown
Data loss in Kubernetes	Pod killed before flush completes	Set `terminationGracePeriodSeconds` to at least 5 seconds in your pod spec
No data flushed on Ctrl+C in script	Script traps signals before Sonda receives them	Ensure SIGTERM/SIGINT reach the Sonda process, for example by starting Sonda with `exec` so it replaces the wrapper shell

SIGKILL bypasses flush

kill -9 (SIGKILL) terminates Sonda immediately with no chance to flush buffered data. Use kill (SIGTERM) or Ctrl+C (SIGINT) for a clean shutdown.
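When Sonda is started from a wrapper or entrypoint script, one common way to make sure SIGTERM and SIGINT reach it is to `exec` the final command so that it replaces the shell process. The sketch below shows the pattern; `launch` is a hypothetical function name and the setup step is a placeholder.

```shell
# Hypothetical wrapper: do setup in the shell, then replace the shell with the
# target process so signals sent to this PID are delivered to it directly.
launch() {
    # Setup steps (exporting variables, creating directories) would go here.
    exec "$@"
}

# Usage as the last line of an entrypoint script (commented out; needs sonda):
#   launch sonda run my-scenario.yaml
```

Because `exec` does not return on success, nothing placed after the call runs, which is why it belongs on the last line of the script.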

Kubernetes: ensure graceful shutdown

spec:
  terminationGracePeriodSeconds: 10
  containers:
    - name: sonda
      image: ghcr.io/davidban77/sonda:latest

Docker Compose: default stop signal is SIGTERM (correct)

services:
  sonda:
    image: ghcr.io/davidban77/sonda:latest
    # docker compose stop sends SIGTERM by default. No special config needed
    stop_grace_period: 10s

Related pages:

Sinks — sink types, parameters, and retry configuration
Sink Batching — how batching affects data delivery
CLI Reference — all flags for --dry-run, --verbose, and sink options
Capacity Planning — performance baselines and infrastructure sizing