# Alert Testing
3 a.m. The pager goes off for `HighRequestLatency`. By the time you log in, latency
is back below threshold and the alert has cleared. You spend an hour reading dashboards
and find nothing -- the spike was real, but it lasted 90 seconds and your `for: 5m`
clause silently swallowed it. The alert is doing exactly what you told it to. You just
told it the wrong thing.
That whole class of problem -- `for:` durations that swallow real spikes, gap-fill rules
that fire during scrape outages, compound `A AND B` rules where the two signals never
overlap -- only shows up in production because nothing else generates the right metric
shape. Sonda does. You write the alert, run a scenario that crosses the threshold for
exactly the duration you care about, and watch whether the alert fires.
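To make that concrete, here is the shape of the rule from the anecdote, in standard Prometheus alerting-rule syntax. The expression and threshold are hypothetical; the `for: 5m` is the part under test:

```yaml
groups:
  - name: latency
    rules:
      - alert: HighRequestLatency
        # Hypothetical expression -- substitute your own metric and threshold.
        expr: http_request_duration_seconds{quantile="0.99"} > 0.5
        # Fires only if the expression stays true for 5 unbroken minutes.
        # A 90-second spike crosses the threshold and recovers before this
        # window elapses, so the rule is silently blind to it.
        for: 5m
```

The test is two scenario runs: hold the metric above the threshold for 90 seconds and confirm the alert stays silent, then hold it for six minutes and confirm it fires. The Threshold and `for:` duration page shows the scenario syntax for both runs.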
This page is the entry point. Five focused sub-pages cover the patterns; the table below maps each common alert shape to the right one.
## Pick your pattern
| You want to test... | Go to | Generator |
|---|---|---|
| A simple `>` threshold rule | Threshold and `for:` duration | `sine` |
| A short `for:` clause (≤ 30s) | Threshold and `for:` duration | `sequence` |
| A long `for:` clause (minutes) | Threshold and `for:` duration | `constant` |
| Resolution / flapping behavior | Resolution and recovery | any + `gaps` |
| Compound `A AND B` rules | Compound and correlated alerts | multi-scenario |
| Cardinality guardrails | Cardinality explosion alerts | any + `cardinality_spikes` |
| Replaying a known incident | Replaying recorded incidents | `sequence` or `csv_replay` |
The pages are written as a tour and link forward to one another, but each one stands on its own -- jump straight to the one that matches the rule you are testing.
## The tour
- Threshold and `for:` duration -- `sine` for predictable crossings, `sequence` for exact breach windows, `constant` for sustained load.
- Resolution and recovery -- gap windows that drop the metric so you can confirm the alert clears.
- Compound and correlated alerts -- `phase_offset` and `clock_group` to overlap two scenarios for `A AND B` rules (sketched just below).
- Cardinality explosion alerts -- `cardinality_spikes` for testing series-count guardrails.
- Replaying recorded incidents -- `sequence` for short patterns, `csv_replay` for production exports.
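The compound case deserves a word on *why* the overlap matters: PromQL's `and` keeps a left-hand sample only when the right-hand side has a matching sample at the same evaluation instant. A minimal sketch, with made-up metric names (`cpu_usage`, `error_rate`) standing in for your real signals:

```yaml
groups:
  - name: compound-example
    rules:
      # Hypothetical rule for illustration.
      - alert: HighCpuAndErrors
        # `and on (instance)` keeps a cpu_usage sample only when an
        # error_rate sample above the threshold exists for the same
        # instance at the same evaluation instant. Two test scenarios
        # that breach at different times leave this expression empty
        # forever, and the alert can never fire -- which is exactly
        # what phase_offset and clock_group exist to prevent.
        expr: cpu_usage > 80 and on (instance) error_rate > 1
        for: 2m
```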
## Push to a real backend
Once you can shape the alert pattern locally, push it into a real TSDB and verify the
alert fires there. The push-and-query loop -- start the backend, run the scenario,
curl the query API -- is the same one E2E Testing walks through,
with the full coverage matrix of encoder and sink combinations.
For alerting specifically, the two scenarios you will reach for first are
`examples/vm-push-scenario.yaml` (Prometheus text via `http_push`) and
`examples/remote-write-vm.yaml` (`remote_write` to VictoriaMetrics, vmagent, or
upstream Prometheus). Both land in the stack from
`examples/docker-compose-victoriametrics.yml`:
```bash
# Start the stack
docker compose -f examples/docker-compose-victoriametrics.yml up -d

# Push test data
sonda metrics --scenario examples/vm-push-scenario.yaml

# Verify the metric exists (wait ~15s for ingestion)
curl "http://localhost:8428/api/v1/query?query=cpu_usage"

# Tear down
docker compose -f examples/docker-compose-victoriametrics.yml down -v
```
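If the instant query comes back empty, a range query shows whether the data landed at all and what shape it has. This uses the standard Prometheus-compatible `query_range` endpoint that VictoriaMetrics serves on the same port; the 10-minute window is arbitrary:

```bash
# Inspect the last 10 minutes of the pushed series.
# (GNU date shown; on macOS use: date -v-10M +%s)
start=$(date -u -d '10 minutes ago' +%s)
end=$(date -u +%s)
curl "http://localhost:8428/api/v1/query_range?query=cpu_usage&start=${start}&end=${end}&step=15s"
```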
| Service | Port | Purpose |
|---|---|---|
| sonda-server | 8080 | REST API for scenario management |
| VictoriaMetrics | 8428 | Time series database |
| vmagent | 8429 | Metrics relay agent |
| Grafana | 3000 | Dashboards (auto-provisioned) |
See Docker Deployment for the full stack configuration.
**Close the loop with Alertmanager.** This stack verifies that data arrives in VictoriaMetrics, but it does not prove alerts fire. To add vmalert, Alertmanager, and a webhook receiver to the stack, see the Alerting Pipeline guide.
## Scrape model instead of push
If you prefer the Prometheus pull model, sonda-server exposes a scrape endpoint for each running scenario. Start the server and submit a scenario:
```bash
cargo run -p sonda-server -- --port 8080

# In another terminal:
curl -X POST -H "Content-Type: text/yaml" \
  --data-binary @examples/sine-threshold-test.yaml \
  http://localhost:8080/scenarios
```
The response includes a scenario ID. Configure Prometheus to scrape it:
```yaml
scrape_configs:
  - job_name: sonda
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /scenarios/<scenario-id>/metrics
```
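Before pointing Prometheus at it, you can spot-check the endpoint by hand; it should answer with the scenario's metrics in Prometheus text exposition format (substitute the ID returned when you submitted the scenario):

```bash
curl http://localhost:8080/scenarios/<scenario-id>/metrics
```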
See Server API for the full API reference.
## Quick reference
| Pattern | Generator | Example file |
|---|---|---|
| Threshold crossing | `sine` | `sine-threshold-test.yaml` |
| Sustained breach | `constant` | `constant-threshold-test.yaml` |
| Alert resolution via gap | `constant` + `gaps` | `gap-alert-test.yaml` |
| Precise `for:` duration | `sequence` | `for-duration-test.yaml` |
| Compound alert | multi-scenario | `multi-metric-correlation.yaml` |
| Cardinality explosion | any + `cardinality_spikes` | `cardinality-alert-test.yaml` |
| Periodic spike / anomaly | `spike` | `spike-alert-test.yaml` |
| Incident replay (inline) | `sequence` | `sequence-alert-test.yaml` |
| Incident replay (file) | `csv_replay` | `csv-replay-metrics.yaml` |
| Push to VictoriaMetrics | any | `vm-push-scenario.yaml` |
| Remote write | any | `remote-write-vm.yaml` |
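Every file in the table runs through the same CLI entry point as the push example above (or can be POSTed to `sonda-server` for the scrape model):

```bash
# Generate the threshold-crossing pattern from the quick-reference table.
sonda metrics --scenario examples/sine-threshold-test.yaml
```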
## Next steps
- Verifying alerts fire end-to-end? See Alerting Pipeline to run vmalert, Alertmanager, and a webhook receiver with Docker Compose.
- Validating alert rules in CI? See CI Alert Validation to catch broken rules before they reach production.
- Validating a pipeline change? See Pipeline Validation.
- Verifying recording rules? See Recording Rules.
- Browsing all example scenarios? See Example Scenarios.