Network automation testing
This page shows how to verify that a network automation runbook actually runs when an alert fires. You generate synthetic interface and BGP telemetry with Sonda, trigger the alert, and confirm that Ansible EDA, Prefect, or StackStorm executes the matching workflow.
Most teams discover broken automation wiring during a real outage, which is the worst possible time. The scenarios on this page let you test the wiring without waiting for a real failure.
What you need:
- The Alerting Pipeline stack running (Sonda, VictoriaMetrics, vmalert, Alertmanager).
- Familiarity with the Network Device Telemetry scenarios.
- An automation engine installed (Ansible EDA, Prefect, or StackStorm).
- curl and jq in PATH.
Build on what exists
This page extends the Alerting Pipeline guide, which covers the Sonda to VictoriaMetrics to Alertmanager chain in detail. If you have not worked through that guide yet, start there.
The full path
sonda CLI      VictoriaMetrics      vmalert        Alertmanager     Automation Engine
    |                 |                |                 |                  |
    |-- push -------->|                |                 |                  |
    |   metrics       |<-- evaluate ---|                 |                  |
    |                 |--- alert ----->|                 |                  |
    |                 |                |-- notify ------>|                  |
    |                 |                |                 |-- webhook ------>|
    |                 |                |                 |                  |
    |                 |                |                 |           trigger runbook
The first four hops are covered by the alerting pipeline guide. This page covers the last hop: receiving the Alertmanager webhook and triggering an automation workflow.
Start the alerting stack
If the stack is not already running, start it with the alerting profile:
docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting up -d
Verify all services are healthy:
See the Alerting Pipeline guide for the full service table and troubleshooting.
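One way to script the health check is a small Python probe. This is a minimal sketch, not the project's own tooling: the ports follow the curl examples on this page, while the health paths (`/health` for VictoriaMetrics and vmalert, `/-/healthy` for Alertmanager) and the names `HEALTH_URLS`, `http_ok`, and `check_stack` are assumptions made for illustration.

```python
import urllib.request
from typing import Callable, Dict

# Assumed health endpoints; the ports match the curl examples on this page.
HEALTH_URLS: Dict[str, str] = {
    "victoriametrics": "http://localhost:8428/health",
    "vmalert": "http://localhost:8880/health",
    "alertmanager": "http://localhost:9093/-/healthy",
}


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True when the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses.
        return False


def check_stack(urls: Dict[str, str] = HEALTH_URLS,
                probe: Callable[[str], bool] = http_ok) -> Dict[str, bool]:
    """Probe every service once and report which ones answered."""
    return {name: probe(url) for name, url in urls.items()}
```

Calling `check_stack()` with no arguments probes the three local services and returns a dict of service name to `True` or `False`. The `probe` argument exists so the logic can be exercised without a running stack.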
Alert rules for automation testing
The network device telemetry guide uses alert rules with production-style for: durations (30 seconds to 5 minutes). For automation testing, you want alerts to fire quickly so you get fast feedback on the wiring.
groups:
  - name: network-automation-alerts
    interval: 5s
    rules:
      - alert: InterfaceDown
        expr: interface_oper_state{job="snmp"} == 0
        for: 10s
        labels:
          severity: critical
          automation: "true"
        annotations:
          summary: "Interface {{ $labels.ifName }} is down on {{ $labels.device }}"
          description: >
            {{ $labels.ifAlias }} ({{ $labels.ifName }}) on {{ $labels.device }}
            has been operationally down for more than 10 seconds.
          runbook_url: "https://runbooks.example.com/network/interface-down"
      - alert: BGPSessionDown
        expr: bgp_session_state{job="snmp"} == 0
        for: 10s
        labels:
          severity: critical
          automation: "true"
        annotations:
          summary: "BGP session to {{ $labels.bgp_peer }} is down on {{ $labels.device }}"
          description: >
            BGP session to AS{{ $labels.bgp_asn }} ({{ $labels.bgp_peer }}) on
            {{ $labels.device }} has been down for more than 10 seconds.
          runbook_url: "https://runbooks.example.com/network/bgp-session-down"
The automation: "true" label lets you route only automation-eligible alerts to your engine, separating them from human-notification routes. The short for: 10s duration means alerts fire within 15 seconds (one evaluation interval plus the pending duration).
To use these rules with the Docker Compose stack, mount them into vmalert alongside or instead of the default rules:
docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting down -v
# Copy automation rules alongside the existing rules
cp examples/network-automation-alerts.yaml \
  examples/alertmanager/network-automation-alerts.yml
docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting up -d
Why copy the file?
The vmalert service mounts examples/alertmanager/alert-rules.yml and evaluates --rule=/rules/*.yml. Placing your file in the same directory makes it available to vmalert without modifying docker-compose-victoriametrics.yml.
Push metrics that trigger alerts
The built-in link-failover scenario produces interface_oper_state transitions that trigger InterfaceDown. To also trigger BGPSessionDown, you need a BGP metric.
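The failover scenario models a primary-link flap on an edge router (60s up, 30s down, cycling), and the backup link saturates after the primary drops. The first flap takes interface_oper_state to 0 at the one-minute mark, and the 30-second down window is long enough to satisfy the for: 10s rule. For BGP, run a separate bgp_session_state scenario alongside it; the Concurrency testing section includes one. Start with the link-failure scenario: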
# Terminal 1: run the link-failure scenario from examples/
sonda run examples/network-link-failure.yaml
Sink must target VictoriaMetrics
The example scenarios default to stdout. To push to VictoriaMetrics, change the sink in each scenario entry to http_push. The Network Device Telemetry guide shows the change. For a quick single-metric test, generate a minimal scenario with sonda new --template, change the sink to http_push, then run sonda run your-file.yaml.
Verify the alert fired in vmalert:
curl -s http://localhost:8880/api/v1/alerts \
| jq '.data.alerts[] | select(.labels.alertname == "InterfaceDown") | {state, labels}'
Then confirm Alertmanager received it:
curl -s http://localhost:9093/api/v2/alerts \
| jq '.[] | select(.labels.alertname == "InterfaceDown") | .labels'
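The same check can be automated so a test script blocks until the alert shows up. The sketch below polls the Alertmanager URL used in the curl command above; the function names `fetch_alerts` and `wait_for_alert`, the 120-second timeout, and the 5-second poll period are illustrative assumptions rather than anything Sonda ships.

```python
import json
import time
import urllib.request
from typing import Callable, List, Optional

ALERTMANAGER_ALERTS = "http://localhost:9093/api/v2/alerts"


def fetch_alerts(url: str = ALERTMANAGER_ALERTS) -> List[dict]:
    """Fetch the current alert list from Alertmanager."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_for_alert(alertname: str,
                   fetch: Callable[[], List[dict]] = fetch_alerts,
                   timeout_s: float = 120.0,
                   poll_s: float = 5.0,
                   sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Poll until an alert with the given name appears, or return None on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        for alert in fetch():
            if alert.get("labels", {}).get("alertname") == alertname:
                return alert
        if time.monotonic() >= deadline:
            return None
        sleep(poll_s)
```

A test would call `wait_for_alert("InterfaceDown")` after starting the scenario and fail if the result is `None`.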
Connect the webhook to your automation engine
Alertmanager delivers alerts as HTTP POST requests with a JSON payload. Your automation engine needs an endpoint that receives these webhooks and triggers the matching workflow.
Here is the Alertmanager webhook payload structure (simplified):
{
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "InterfaceDown",
        "device": "rtr-core-01",
        "ifName": "GigabitEthernet0/0/0",
        "severity": "critical",
        "automation": "true"
      },
      "annotations": {
        "summary": "Interface GigabitEthernet0/0/0 is down on rtr-core-01",
        "runbook_url": "https://runbooks.example.com/network/interface-down"
      },
      "startsAt": "2026-04-04T12:00:00.000Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}
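Whatever engine you pick, the receiving side has to pull the same few fields out of this payload. As a rough sketch, assuming the simplified structure shown above, a helper (hypothetically named `automation_targets`) could select the firing, automation-eligible alerts like this:

```python
from typing import Iterator, Tuple


def automation_targets(payload: dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (alertname, device, ifName) for firing alerts labelled automation="true"."""
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") != "firing":
            continue  # resolved alerts are handled separately
        if labels.get("automation") != "true":
            continue  # human-notification alerts are not ours to remediate
        # ifName is absent on BGP alerts, so fall back to an empty string.
        yield (labels.get("alertname", ""),
               labels.get("device", ""),
               labels.get("ifName", ""))
```

Iterating over the whole `alerts` list matters because Alertmanager can group several alerts into one webhook call.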
Each automation engine consumes this payload differently. Choose your engine below.
Ansible Event-Driven Automation uses rulebooks that map event sources to actions. The alertmanager event source plugin listens for webhook POST requests from Alertmanager.
Rulebook:
---
- name: Network interface remediation
  hosts: all
  sources:
    - ansible.eda.alertmanager:
        host: 0.0.0.0
        port: 5000
  rules:
    - name: Remediate interface down
      condition: >
        event.alert.labels.alertname == "InterfaceDown"
        and event.alert.status == "firing"
      action:
        run_playbook:
          name: playbooks/remediate-interface.yml
          extra_vars:
            device: "{{ event.alert.labels.device }}"
            interface: "{{ event.alert.labels.ifName }}"
Alertmanager route (add to your alertmanager.yml):
route:
  receiver: webhook  # default
  routes:
    - match:
        automation: "true"
      receiver: ansible-eda
receivers:
  - name: webhook
    webhook_configs:
      - url: http://webhook-receiver:8080
        send_resolved: true
  - name: ansible-eda
    webhook_configs:
      - url: http://eda-server:5000/endpoint
        send_resolved: true
Run the rulebook:
When InterfaceDown fires, EDA receives the webhook, matches the condition, and runs playbooks/remediate-interface.yml with the device and interface as extra vars.
Prefect can start flow runs from incoming webhooks. Create a webhook endpoint that maps Alertmanager payloads to Prefect events.
Flow definition:
from prefect import flow, get_run_logger


@flow(name="remediate-interface-down")
def remediate_interface(device: str, interface: str, alert_status: str):
    logger = get_run_logger()
    logger.info(f"Remediating {interface} on {device} (status: {alert_status})")
    # Your remediation logic here:
    # - SSH to device, check interface state
    # - Attempt bounce if admin-down
    # - Open ticket if hardware failure
    logger.info(f"Remediation complete for {interface} on {device}")
Webhook receiver (a small FastAPI app that connects Alertmanager to Prefect):
from fastapi import FastAPI, Request

from flows.remediate_interface import remediate_interface

app = FastAPI()


@app.post("/alertmanager")
async def handle_alert(request: Request):
    payload = await request.json()
    # One webhook call can carry several alerts, so handle each entry.
    for alert in payload.get("alerts", []):
        if alert["labels"].get("alertname") == "InterfaceDown":
            # The flow runs inline; the response is sent after it finishes.
            remediate_interface(
                device=alert["labels"]["device"],
                interface=alert["labels"]["ifName"],
                alert_status=alert["status"],
            )
    return {"status": "ok"}
Point Alertmanager's webhook to http://prefect-receiver:8000/alertmanager.
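To exercise only this last hop, without waiting for vmalert and Alertmanager, you can post a hand-built payload straight at the receiver. The sketch below mirrors the simplified payload shown earlier on this page; `build_payload` and `post_payload` are hypothetical helper names, and the target URL is whatever your receiver listens on.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_payload(alertname: str, device: str, if_name: str,
                  status: str = "firing") -> dict:
    """Build a minimal Alertmanager-style webhook body with a single alert."""
    return {
        "status": status,
        "alerts": [{
            "status": status,
            "labels": {
                "alertname": alertname,
                "device": device,
                "ifName": if_name,
                "severity": "critical",
                "automation": "true",
            },
            "annotations": {"summary": f"Interface {if_name} is down on {device}"},
            "startsAt": datetime.now(timezone.utc).isoformat(),
            "endsAt": "0001-01-01T00:00:00Z",
        }],
    }


def post_payload(url: str, payload: dict) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

For example, `post_payload("http://prefect-receiver:8000/alertmanager", build_payload("InterfaceDown", "rtr-core-01", "GigabitEthernet0/0/0"))` should start one flow run if the receiver is wired correctly. Passing `status="resolved"` lets you check the recovery path the same way.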
StackStorm uses sensors and rules to map events to actions. The stackstorm-alertmanager pack provides a webhook sensor for Alertmanager.
Rule definition:
---
name: remediate_interface_down
pack: network_automation
description: "Trigger interface remediation on InterfaceDown alert"
enabled: true
trigger:
  type: alertmanager.webhook
  parameters: {}
criteria:
  trigger.body.alerts[0].labels.alertname:
    type: equals
    pattern: "InterfaceDown"
  trigger.body.alerts[0].status:
    type: equals
    pattern: "firing"
action:
  ref: network_automation.remediate_interface
  parameters:
    device: "{{ trigger.body.alerts[0].labels.device }}"
    interface: "{{ trigger.body.alerts[0].labels.ifName }}"
Alertmanager route:
receivers:
  - name: stackstorm
    webhook_configs:
      - url: http://stackstorm:9102/v1/webhooks/alertmanager
        send_resolved: true
Register the rule and verify it is active:
Verify the automation triggers
With the alerting stack running and the automation engine connected, push metrics and watch the full chain execute.
Step-by-step verification
1. Confirm metrics are flowing:
curl -s "http://localhost:8428/api/v1/query?query=interface_oper_state" \
| jq '.data.result[] | {device: .metric.device, ifName: .metric.ifName, value: .value[1]}'
2. Confirm the alert is firing in vmalert:
curl -s http://localhost:8880/api/v1/alerts \
| jq '.data.alerts[] | select(.labels.alertname == "InterfaceDown") | {state, value}'
3. Confirm Alertmanager received the alert:
curl -s http://localhost:9093/api/v2/alerts \
| jq '.[] | select(.labels.alertname == "InterfaceDown")'
4. Confirm webhook delivery (check the echo server logs for the payload):
docker compose -f examples/docker-compose-victoriametrics.yml \
--profile alerting logs webhook-receiver --tail 20
5. Confirm the automation engine received the event and ran the workflow. This step depends on the engine:
Test flap handling
A single interface-down event is the easy case. The harder case is rapid alternation between firing and resolved — an interface that bounces up and down repeatedly. The automation needs to handle this without triggering a remediation storm.
What rapid alternation looks like
The link-failover scenario's flap generator produces a simple pattern: 60s up, 30s down, cycling for the duration. Real flaps are faster and less predictable.
Here is a rapid sequence that toggles every 2 to 3 seconds:
generator:
  type: sequence
  # Rapid flap: toggles every 2-3 seconds, then stays up (30 values per cycle)
  values: [1, 1, 1, 1, 1,
           0, 0,
           1, 1, 1,
           0, 0, 0,
           1, 1,
           0, 0,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  repeat: true
And a slower variant with longer down windows:
generator:
  type: sequence
  # Slow flap: 15s up, 10s down, 5s up, 10s down, 20s up
  values: [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
           0,0,0,0,0,0,0,0,0,0,
           1,1,1,1,1,
           0,0,0,0,0,0,0,0,0,0,
           1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
  repeat: true
What to validate
Use these patterns to test that the automation handles each case correctly:
| Flap pattern | Expected behavior | What to check |
|---|---|---|
| Single down (10s) with for: 10s | Alert fires once, resolves once | Runbook triggers exactly once |
| Rapid flap (2 to 3s toggles) | Alert may not fire (down duration < for:) | Runbook should NOT trigger |
| Slow flap (10s down windows) | Alert fires on each down window | Runbook triggers per event, or is deduplicated |
Tuning for: as a flap filter
The for: duration in an alert rule acts as a debounce. If the interface returns before the for: timer expires, the alert never fires. Increase for: to filter out fast flaps. Decrease it to detect brief outages. Test both extremes with Sonda to find the right balance.
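The debounce effect can be reasoned about offline before running anything. The sketch below is a deliberately simplified model, not vmalert's actual implementation: it assumes the evaluator sees only the most recent sequence value at each evaluation tick, and the function name `count_firings` and its parameters are made up for illustration.

```python
from typing import Optional, Sequence


def count_firings(values: Sequence[int], for_s: float = 10.0,
                  interval_s: float = 5.0, rate: float = 1.0,
                  offset_s: float = 0.0) -> int:
    """Count pending-to-firing transitions of `value == 0` over one pass of the sequence.

    values     -- sequence generator values (1 = up, 0 = down)
    for_s      -- the rule's for: duration in seconds
    interval_s -- the rule group's evaluation interval in seconds
    rate       -- Sonda events per second (one value per event)
    offset_s   -- time of the first evaluation tick relative to the first value
    """
    firings = 0
    pending_since: Optional[float] = None  # first tick that saw the condition true
    firing = False
    total_s = len(values) / rate
    t = offset_s
    while t < total_s:
        index = min(int(t * rate), len(values) - 1)
        if values[index] != 0:
            # Condition false at this tick: pending and firing state reset.
            pending_since, firing = None, False
        else:
            if pending_since is None:
                pending_since = t
            if not firing and t - pending_since >= for_s:
                firing = True
                firings += 1
        t += interval_s
    return firings
```

Because the evaluator only sees the state at each tick, short flaps can alias with the evaluation interval, and a down window exactly as long as for: sits on the boundary. Trying several `offset_s` values gives a feel for which patterns are clear-cut and which need to be measured against the real stack.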
Vary the timing
The sequence generator gives precise control over flap timing. At rate: 1, each value in the sequence is one second. To simulate sub-second flaps, increase the rate:
scenarios:
  - signal_type: metrics
    name: interface_oper_state
    rate: 2  # 2 events/second = each sequence value is 500ms
    generator:
      type: sequence
      values: [1, 0, 1, 0, 1, 0, 1, 1, 1, 1]
      repeat: true
To simulate longer intervals, decrease the rate:
scenarios:
  - signal_type: metrics
    name: interface_oper_state
    rate: 0.2  # 1 event every 5 seconds = each sequence value is 5s
    generator:
      type: sequence
      values: [1, 0, 0, 1, 1, 1]  # 5s up, 10s down, 15s up
      repeat: true
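Writing long runs of ones and zeros by hand is error-prone, so it can help to generate the values list from durations. This is a small illustrative helper, not part of Sonda; the name `expand_pattern` and its (state, seconds) input format are assumptions.

```python
from typing import Iterable, List, Tuple


def expand_pattern(segments: Iterable[Tuple[int, float]],
                   rate: float = 1.0) -> List[int]:
    """Expand (state, seconds) segments into one value per emitted event.

    At `rate` events per second, a segment of `seconds` length needs
    seconds * rate values.
    """
    values: List[int] = []
    for state, seconds in segments:
        count = round(seconds * rate)
        if count < 1:
            raise ValueError(
                f"{seconds}s is shorter than one event at rate {rate}")
        values.extend([state] * count)
    return values
```

For the slow-interval example above, `expand_pattern([(1, 5), (0, 10), (1, 15)], rate=0.2)` yields the same six values, which you can paste into the `values:` field of the scenario.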
Validate remediation workflows end-to-end
The final test: does the full chain work from synthetic metric to completed remediation? Here is a checklist for validating the automation workflow against Sonda-generated alerts.
Test matrix
| Test case | Sonda scenario | Expected alert | Expected automation |
|---|---|---|---|
| Interface down | @link-failover | InterfaceDown fires | Remediation playbook runs |
| Interface recovers | Let the flap cycle back to 1 | InterfaceDown resolves | Resolution handler runs (if configured) |
| BGP session down | BGP sequence (see Network Device Telemetry) | BGPSessionDown fires | BGP remediation runs |
| Rapid flap | Rapid flap sequence (above) | No alert (below for: threshold) | No automation triggers |
| Slow flap | Slow flap sequence (above) | Multiple InterfaceDown alerts | Deduplication or rate limiting works |
| Concurrent failures | Run interface + BGP scenarios together | Both alerts fire | Both workflows run without interference |
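The slow-flap row expects deduplication or rate limiting on the automation side. One common way to get that is a per-alert cooldown in the webhook receiver. The sketch below is one possible shape, not a feature of any of the engines above: the class name `CooldownGate`, the 300-second default, and the choice of identity key (alertname, device, and ifName or bgp_peer) are all assumptions.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CooldownGate:
    """Allow at most one remediation per alert identity within a cooldown window."""

    def __init__(self, cooldown_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._last_run: Dict[Tuple[str, str, str], float] = {}

    def should_run(self, labels: dict) -> bool:
        """Return True and record the run if the cooldown for this alert has passed."""
        key = (
            labels.get("alertname", ""),
            labels.get("device", ""),
            # Interface alerts carry ifName, BGP alerts carry bgp_peer.
            labels.get("ifName", labels.get("bgp_peer", "")),
        )
        now = self.clock()
        last: Optional[float] = self._last_run.get(key)
        if last is not None and now - last < self.cooldown_s:
            return False  # still cooling down: skip this remediation
        self._last_run[key] = now
        return True
```

In a receiver, you would call `gate.should_run(alert["labels"])` before starting the workflow. Running the slow flap sequence against it should then produce one remediation per cooldown window instead of one per down window.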
Resolution events
Alertmanager sends a "status": "resolved" webhook when an alert clears. The automation should handle this — for example, closing a ticket or logging the recovery.
The failover scenario produces resolution events naturally. After each 30-second down window, the flap generator returns interface_oper_state to 1, the alert clears, and Alertmanager delivers the resolved webhook. Verify the engine processes it:
# Check webhook logs for resolved status
docker compose -f examples/docker-compose-victoriametrics.yml \
--profile alerting logs webhook-receiver \
| grep -i resolved
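On the receiving side, handling resolution usually means branching on each alert's status field. A minimal sketch, assuming the payload structure shown earlier and using a hypothetical `dispatch` helper with caller-supplied handlers:

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


def dispatch(payload: dict, handlers: Dict[str, Handler]) -> List[str]:
    """Route each alert to the handler registered for its status.

    Returns the statuses that were handled, in payload order, which is
    convenient for assertions in tests.
    """
    handled: List[str] = []
    for alert in payload.get("alerts", []):
        status = alert.get("status", "")
        handler = handlers.get(status)
        if handler is None:
            continue  # no handler registered for this status
        handler(alert)
        handled.append(status)
    return handled
```

A receiver might register `{"firing": start_remediation, "resolved": close_ticket}`, where both function names are placeholders for your own logic.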
Concurrency testing
Run multiple Sonda scenarios at the same time to confirm the automation handles concurrent alerts correctly. Create a BGP session scenario file:
version: 2
kind: runnable
defaults:
  rate: 1
  duration: 120s
  encoder:
    type: prometheus_text
  sink:
    type: http_push
    url: "http://localhost:8428/api/v1/import/prometheus"
    content_type: "text/plain"
scenarios:
  - signal_type: metrics
    name: bgp_session_state
    generator:
      type: sequence
      # 10s Established, 10s down, 10s Established
      values: [1,1,1,1,1,1,1,1,1,1,
               0,0,0,0,0,0,0,0,0,0,
               1,1,1,1,1,1,1,1,1,1]
      repeat: true
    labels:
      device: rtr-core-01
      bgp_peer: "192.168.1.1"
      bgp_asn: "65001"
      job: snmp
Then run both scenarios at the same time:
# Link-failure scenario from examples/, started in the background
sonda run examples/network-link-failure.yaml &
# BGP session-down scenario, run in the foreground
sonda run bgp-session-down.yaml
Both InterfaceDown and BGPSessionDown should fire and trigger their workflows without interfering with each other.
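To turn that expectation into a check, compare the alert names Alertmanager reports against the set you expect. The helper below is a sketch that works on the JSON list returned by the /api/v2/alerts endpoint used earlier on this page; the name `missing_alerts` is hypothetical.

```python
from typing import Iterable, List, Set

EXPECTED: Set[str] = {"InterfaceDown", "BGPSessionDown"}


def missing_alerts(alerts: Iterable[dict],
                   expected: Set[str] = EXPECTED) -> List[str]:
    """Return the expected alert names that are absent from the alert list."""
    seen = {alert.get("labels", {}).get("alertname") for alert in alerts}
    return sorted(expected - seen)
```

An empty result means both alerts were present at the same time. A non-empty result names the alert whose scenario, rule, or sink needs a closer look.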
Tear down
When you are done testing, stop the alerting stack:
docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting down -v
If you copied the automation alert rules into examples/alertmanager/, clean them up:
rm examples/alertmanager/network-automation-alerts.yml
Quick reference
| Task | Command |
|---|---|
| Start alerting stack | docker compose -f examples/docker-compose-victoriametrics.yml --profile alerting up -d |
| Run failover scenario | sonda run examples/network-link-failure.yaml |
| Check vmalert for InterfaceDown | curl -s http://localhost:8880/api/v1/alerts \| jq '.data.alerts[]' |
| Check Alertmanager alerts | curl -s http://localhost:9093/api/v2/alerts \| jq '.[].labels' |
| Check webhook delivery | docker compose -f examples/docker-compose-victoriametrics.yml --profile alerting logs webhook-receiver |
| Tear down | docker compose -f examples/docker-compose-victoriametrics.yml --profile alerting down -v |
Related pages
- Network Device Telemetry — generating interface and BGP metrics with the sequence generator.
- Alerting Pipeline — full Sonda to VictoriaMetrics to Alertmanager pipeline.
- Alert Testing — generator patterns for testing alert thresholds.
- CI Alert Validation — automating alert rule validation in GitHub Actions.