Replaying recorded incidents¶

Synthetic shapes prove the alert path works in the abstract. Replay proves it would have caught the real incident. Two generators handle the replay case: sequence for short hand-crafted patterns, and csv_replay for long recordings exported from your TSDB.

Generator	Best for	Storage
`sequence`	≤ 20 values, hand-tuned	Inline in the YAML
`csv_replay`	Real incidents, long recordings	External CSV file

Hand-crafted patterns with sequence¶

The sequence generator steps through an explicit list of values, perfect for short, deterministic threshold patterns:

sonda metrics --scenario examples/sequence-alert-test.yaml

examples/sequence-alert-test.yaml (key fields)

generator:
  type: sequence
  values: [10, 10, 10, 10, 10, 95, 95, 95, 95, 95, 10, 10, 10, 10, 10, 10]
  repeat: true

With repeat: true, the pattern loops continuously. With repeat: false, the generator holds the last value after the sequence ends -- useful for "the metric pegged at 100 and never recovered" scenarios.

Production replay with csv_replay¶

For replaying real production data, the csv_replay generator reads values from a CSV file. If you have a Grafana dashboard showing the incident, see the Grafana CSV Replay guide for the full export-and-replay workflow.

sonda metrics --scenario examples/csv-replay-metrics.yaml

examples/csv-replay-metrics.yaml (key fields)

generator:
  type: csv_replay
  file: examples/sample-cpu-values.csv
  columns:
    - index: 1
      name: cpu_replay

Parameter	Default	Description
`file`	(required)	Path to the CSV file
`columns`	--	Explicit column specs. When absent, columns are auto-discovered from the header. See Generators.
`repeat`	`true`	Cycle back to the first value after reaching the end

When to use csv_replay vs sequence

Use csv_replay over sequence when you have more than ~20 values. It keeps the YAML clean and makes it easy to update the data by replacing the CSV file -- the scenario stays identical.

Exporting values from VictoriaMetrics

curl -s "http://your-vm:8428/api/v1/query_range?\
query=cpu_usage{instance='prod-01'}&\
start=$(date -d '1 hour ago' +%s)&\
end=$(date +%s)&\
step=10s" \
  | jq -r '["timestamp","cpu_percent"], (.data.result[0].values[] | [.[0], .[1]]) | @csv' \
  > incident-values.csv

Where to go from here¶

Replay closes the loop on the local-testing side. To take any of these patterns to a real backend and prove the alert fires end-to-end, head back to the Alert Testing landing page for the backend handoff, or jump straight to the Alerting Pipeline walkthrough that wires vmalert and Alertmanager into the loop.