Network automation testing

This page shows how to verify that a network automation runbook actually runs when an alert fires. You generate synthetic interface and BGP telemetry with Sonda, trigger the alert, and confirm that Ansible EDA, Prefect, or StackStorm executes the matching workflow.

Most teams discover broken automation wiring during a real outage, which is the worst possible time. The scenarios on this page let you test the wiring without waiting for a real failure.

What you need:

The Alerting Pipeline stack running (Sonda, VictoriaMetrics, vmalert, Alertmanager).
Familiarity with the Network Device Telemetry scenarios.
An automation engine installed (Ansible EDA, Prefect, or StackStorm).
curl and jq in PATH.

Build on what exists

This page extends the Alerting Pipeline guide. If you have not worked through that guide yet, start there: it covers the Sonda to VictoriaMetrics to Alertmanager chain in detail.

The full path

sonda CLI       VictoriaMetrics    vmalert        Alertmanager     Automation Engine
 |                 |                 |                |                  |
 |-- push -------->|                 |                |                  |
 |  metrics        |<-- evaluate ----|                |                  |
 |                 |--- alert ------>|                |                  |
 |                 |                 |-- notify ----->|                  |
 |                 |                 |                |-- webhook ------>|
 |                 |                 |                |                  |
 |                 |                 |                |     trigger runbook

The first four hops are covered by the alerting pipeline guide. This page covers the last hop: receiving the Alertmanager webhook and triggering an automation workflow.

Start the alerting stack

If the stack is not already running, start it with the alerting profile:

docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting up -d

Verify all services are healthy:

docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting ps

See the Alerting Pipeline guide for the full service table and troubleshooting.

Alert rules for automation testing

The network device telemetry guide uses alert rules with production-style for: durations (30 seconds to 5 minutes). For automation testing, you want alerts to fire quickly so you get fast feedback on the wiring.

examples/network-automation-alerts.yaml

groups:
  - name: network-automation-alerts
    interval: 5s
    rules:
      - alert: InterfaceDown
        expr: interface_oper_state{job="snmp"} == 0
        for: 10s
        labels:
          severity: critical
          automation: "true"
        annotations:
          summary: "Interface {{ $labels.ifName }} is down on {{ $labels.device }}"
          description: >
            {{ $labels.ifAlias }} ({{ $labels.ifName }}) on {{ $labels.device }}
            has been operationally down for more than 10 seconds.
          runbook_url: "https://runbooks.example.com/network/interface-down"

      - alert: BGPSessionDown
        expr: bgp_session_state{job="snmp"} == 0
        for: 10s
        labels:
          severity: critical
          automation: "true"
        annotations:
          summary: "BGP session to {{ $labels.bgp_peer }} is down on {{ $labels.device }}"
          description: >
            BGP session to AS{{ $labels.bgp_asn }} ({{ $labels.bgp_peer }}) on
            {{ $labels.device }} has been down for more than 10 seconds.
          runbook_url: "https://runbooks.example.com/network/bgp-session-down"

The automation: "true" label lets you route only automation-eligible alerts to your engine, separating them from human-notification routes. The short for: 10s duration means alerts fire within 15 seconds (one evaluation interval plus the pending duration).
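The 15-second figure is simple arithmetic over the evaluation interval and the pending duration. As a rough sketch, assuming Prometheus-style `for:` handling and ignoring ingestion and notification latency (the function name is illustrative, not part of Sonda or vmalert):

```python
import math

def fire_delay_bounds(interval_s: float, for_s: float) -> tuple[float, float]:
    """Return (best, worst) seconds between the metric changing and the alert firing.

    Assumed model: the alert turns pending at the first evaluation that sees
    the condition, and fires at the first evaluation where it has been
    pending for at least `for_s` seconds.
    """
    # Evaluations happen on a fixed grid, so the pending period is effectively
    # rounded up to a whole number of evaluation intervals.
    pending_s = math.ceil(for_s / interval_s) * interval_s
    best = pending_s                 # the change lands just before an evaluation
    worst = interval_s + pending_s   # the change lands just after an evaluation
    return best, worst

print(fire_delay_bounds(5, 10))  # (10, 15)
```

With `interval: 5s` and `for: 10s` the alert fires between 10 and 15 seconds after the metric drops, provided the condition holds for that whole time.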

To use these rules with the Docker Compose stack, mount them into vmalert alongside or instead of the default rules:

docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting down -v

# Copy automation rules alongside the existing rules
cp examples/network-automation-alerts.yaml \
  examples/alertmanager/network-automation-alerts.yml

docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting up -d

Why copy the file?

The vmalert service mounts examples/alertmanager/alert-rules.yml and evaluates --rule=/rules/*.yml. Placing your file in the same directory, with the .yml extension that the glob expects, makes it available to vmalert without modifying docker-compose-victoriametrics.yml.

Push metrics that trigger alerts

The built-in link-failover scenario produces interface_oper_state transitions that trigger InterfaceDown. To also trigger BGPSessionDown, you need a BGP metric.

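The failover scenario models a primary link flap on an edge router (60s up, 30s down, cycling). The backup link saturates after the primary drops. The first flap drives interface_oper_state to 0 at the one-minute mark and holds it there for 30 seconds, which is long enough to satisfy the for: 10s rule. For BGP, use a separate scenario such as the bgp-session-down.yaml file shown in the Concurrency testing section. Start with the failover scenario: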

# Terminal 1: run the link-failure scenario from examples/
sonda run examples/network-link-failure.yaml

Sink must target VictoriaMetrics

The example scenarios default to stdout. To push to VictoriaMetrics, change the sink in each scenario entry to http_push. The Network Device Telemetry guide shows the change. For a quick single-metric test, generate a minimal scenario with sonda new --template, change the sink to http_push, then run sonda run your-file.yaml.

Verify the alert fired in vmalert:

curl -s http://localhost:8880/api/v1/alerts \
  | jq '.data.alerts[] | select(.labels.alertname == "InterfaceDown") | {state, labels}'

Then confirm Alertmanager received it:

curl -s http://localhost:9093/api/v2/alerts \
  | jq '.[] | select(.labels.alertname == "InterfaceDown") | .labels'
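If you prefer to script this check instead of reading jq output, the same filter can be expressed in a few lines of Python. This is a minimal sketch: the function name is illustrative, and the sample document is a hand-written stand-in that only mirrors the fields the jq filter above reads (`.data.alerts[]`, `state`, `labels`).

```python
import json

def alerts_in_state(response: dict, alertname: str, state: str = "firing") -> list[dict]:
    """Return the labels of alerts that match alertname and state.

    `response` is the parsed JSON body of the vmalert /api/v1/alerts endpoint.
    """
    alerts = response.get("data", {}).get("alerts", [])
    return [
        alert.get("labels", {})
        for alert in alerts
        if alert.get("labels", {}).get("alertname") == alertname
        and alert.get("state") == state
    ]

# Hand-written sample, not captured output. To use live data, parse the body
# returned by http://localhost:8880/api/v1/alerts with json.loads instead.
sample = json.loads("""
{"data": {"alerts": [
  {"state": "firing",  "labels": {"alertname": "InterfaceDown",  "device": "rtr-core-01"}},
  {"state": "pending", "labels": {"alertname": "BGPSessionDown", "device": "rtr-core-01"}}
]}}
""")

print(alerts_in_state(sample, "InterfaceDown"))
```

An empty list means the alert is not in the requested state yet, which is a convenient condition to poll on in a test script.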

Connect the webhook to your automation engine

Alertmanager delivers alerts as HTTP POST requests with a JSON payload. Your automation engine needs an endpoint that receives these webhooks and triggers the matching workflow.

Here is the Alertmanager webhook payload structure (simplified):

Alertmanager webhook payload

{
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "InterfaceDown",
        "device": "rtr-core-01",
        "ifName": "GigabitEthernet0/0/0",
        "severity": "critical",
        "automation": "true"
      },
      "annotations": {
        "summary": "Interface GigabitEthernet0/0/0 is down on rtr-core-01",
        "runbook_url": "https://runbooks.example.com/network/interface-down"
      },
      "startsAt": "2026-04-04T12:00:00.000Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}
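Whatever engine you use, the receiving side has to pick the relevant alerts out of the `alerts` array. The following sketch shows one way to do that for the simplified payload above; the function name and the returned dictionary keys are assumptions made for illustration.

```python
def actionable_alerts(payload: dict) -> list[dict]:
    """Pick firing, automation-eligible alerts out of an Alertmanager webhook payload."""
    selected = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        # Only act on alerts that are firing and carry the automation label.
        if alert.get("status") == "firing" and labels.get("automation") == "true":
            selected.append({
                "alertname": labels.get("alertname"),
                "device": labels.get("device"),
                "interface": labels.get("ifName"),
            })
    return selected

# Trimmed copy of the example payload shown above.
example_payload = {
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {
            "alertname": "InterfaceDown",
            "device": "rtr-core-01",
            "ifName": "GigabitEthernet0/0/0",
            "severity": "critical",
            "automation": "true",
        },
    }],
}

print(actionable_alerts(example_payload))
```

Iterating over the whole array matters because Alertmanager can group several alerts into one notification.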

Each automation engine consumes this payload differently. Choose your engine below.

Ansible EDA | Prefect | StackStorm

Ansible Event-Driven Automation uses rulebooks that map event sources to actions. The alertmanager event source plugin listens for webhook POST requests from Alertmanager.

Rulebook:

rulebook-interface-down.yml

---
- name: Network interface remediation
  hosts: all
  sources:
    - ansible.eda.alertmanager:
        host: 0.0.0.0
        port: 5000
  rules:
    - name: Remediate interface down
      condition: >
        event.alert.labels.alertname == "InterfaceDown"
        and event.alert.status == "firing"
      action:
        run_playbook:
          name: playbooks/remediate-interface.yml
          extra_vars:
            device: "{{ event.alert.labels.device }}"
            interface: "{{ event.alert.labels.ifName }}"

Alertmanager route (add to your alertmanager.yml):

alertmanager.yml (automation route)

route:
  receiver: webhook  # default
  routes:
    - match:
        automation: "true"
      receiver: ansible-eda

receivers:
  - name: webhook
    webhook_configs:
      - url: http://webhook-receiver:8080
        send_resolved: true
  - name: ansible-eda
    webhook_configs:
      - url: http://eda-server:5000/endpoint
        send_resolved: true

Run the rulebook:

ansible-rulebook --rulebook rulebook-interface-down.yml -i inventory.yml

When InterfaceDown fires, EDA receives the webhook, matches the condition, and runs playbooks/remediate-interface.yml with the device and interface as extra vars.

Prefect webhooks can turn incoming HTTP requests into Prefect events that trigger flow runs. The example here takes a simpler route and uses a small receiver that translates Alertmanager payloads into flow calls.

Flow definition:

flows/remediate_interface.py

from prefect import flow, get_run_logger

@flow(name="remediate-interface-down")
def remediate_interface(device: str, interface: str, alert_status: str):
    logger = get_run_logger()
    logger.info(f"Remediating {interface} on {device} (status: {alert_status})")

    # Your remediation logic here:
    # - SSH to device, check interface state
    # - Attempt bounce if admin-down
    # - Open ticket if hardware failure
    logger.info(f"Remediation complete for {interface} on {device}")

Webhook receiver (a small FastAPI app that connects Alertmanager to Prefect):

webhook_receiver.py

from fastapi import Body, FastAPI
from flows.remediate_interface import remediate_interface

app = FastAPI()

# Declared with plain `def` so FastAPI runs the handler in a worker thread
# and the blocking flow call does not stall the event loop.
@app.post("/alertmanager")
def handle_alert(payload: dict = Body(...)):
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if labels.get("alertname") == "InterfaceDown":
            # Calling the flow function directly runs it in this process.
            remediate_interface(
                device=labels.get("device", "unknown"),
                interface=labels.get("ifName", "unknown"),
                alert_status=alert.get("status", "unknown"),
            )
    return {"status": "ok"}

Point Alertmanager's webhook to http://prefect-receiver:8000/alertmanager.

StackStorm uses sensors and rules to map events to actions. The stackstorm-alertmanager pack provides a webhook sensor for Alertmanager.

Rule definition:

rules/remediate_interface_down.yaml

---
name: remediate_interface_down
pack: network_automation
description: "Trigger interface remediation on InterfaceDown alert"
enabled: true

trigger:
  type: alertmanager.webhook
  parameters: {}

criteria:
  trigger.body.alerts[0].labels.alertname:
    type: equals
    pattern: "InterfaceDown"
  trigger.body.alerts[0].status:
    type: equals
    pattern: "firing"

action:
  ref: network_automation.remediate_interface
  parameters:
    device: "{{ trigger.body.alerts[0].labels.device }}"
    interface: "{{ trigger.body.alerts[0].labels.ifName }}"

Alertmanager route:

alertmanager.yml (StackStorm route)

receivers:
  - name: stackstorm
    webhook_configs:
      - url: http://stackstorm:9101/v1/webhooks/alertmanager
        send_resolved: true

Register the rule and verify it is active:

st2 rule create rules/remediate_interface_down.yaml
st2 rule list --pack=network_automation

Verify the automation triggers

With the alerting stack running and the automation engine connected, push metrics and watch the full chain execute.

Step-by-step verification

1. Confirm metrics are flowing:

curl -s "http://localhost:8428/api/v1/query?query=interface_oper_state" \
  | jq '.data.result[] | {device: .metric.device, ifName: .metric.ifName, value: .value[1]}'

2. Confirm the alert is firing in vmalert:

curl -s http://localhost:8880/api/v1/alerts \
  | jq '.data.alerts[] | select(.labels.alertname == "InterfaceDown") | {state, value}'

3. Confirm Alertmanager received the alert:

curl -s http://localhost:9093/api/v2/alerts \
  | jq '.[] | select(.labels.alertname == "InterfaceDown")'

4. Confirm webhook delivery (check the echo server logs for the payload):

docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting logs webhook-receiver --tail 20

5. Confirm the automation engine received the event and ran the workflow. This step depends on the engine:

Ansible EDA | Prefect | StackStorm

# Run the rulebook with verbose logging and watch for the triggered playbook
ansible-rulebook --rulebook rulebook-interface-down.yml -i inventory.yml --verbose

Look for log lines showing the condition matched and the playbook executed.

# Check Prefect flow runs
prefect flow-run ls --flow-name "remediate-interface-down"

Verify the flow run completed with the correct device and interface parameters.

# Check StackStorm execution history
st2 execution list --action=network_automation.remediate_interface

Verify the execution finished with status: succeeded.
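To exercise only the last hop without waiting for an alert to fire, you can post a hand-built payload straight to the engine's webhook endpoint. The sketch below is one way to do that with the Python standard library. The helper names and the example URL are assumptions for illustration, and the payload only mirrors the simplified structure shown earlier on this page.

```python
import json
import urllib.request
from datetime import datetime, timezone

def build_payload(alertname: str, device: str, if_name: str, status: str = "firing") -> dict:
    """Build a minimal Alertmanager-style webhook payload (simplified structure)."""
    alert = {
        "status": status,
        "labels": {
            "alertname": alertname,
            "device": device,
            "ifName": if_name,
            "severity": "critical",
            "automation": "true",
        },
        "annotations": {"summary": f"Interface {if_name} is down on {device}"},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }
    return {"status": status, "alerts": [alert]}

def post_payload(url: str, payload: dict, timeout_s: float = 5.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status

payload = build_payload("InterfaceDown", "rtr-core-01", "GigabitEthernet0/0/0")
print(json.dumps(payload, indent=2))

# Assumed endpoint; replace it with the URL your engine actually listens on:
# post_payload("http://localhost:8000/alertmanager", payload)
```

Posting the same payload with `status="resolved"` is a quick way to check the resolution path in isolation.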

Test flap handling

A single interface-down event is the easy case. The harder case is rapid alternation between firing and resolved, as when an interface bounces up and down repeatedly. The automation needs to handle this without triggering a remediation storm.
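One common way to avoid a remediation storm is a per-target cooldown in the webhook receiver. The sketch below is an illustration of that idea rather than a feature of any of the engines on this page; the class name, the key fields, and the 60-second window are all assumptions.

```python
import time

class Cooldown:
    """Allow one remediation per (alertname, device, interface) per time window."""

    def __init__(self, window_s: float, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock  # injectable so tests can use a fake clock
        self._last_run: dict[tuple, float] = {}

    def should_run(self, alertname: str, device: str, interface: str) -> bool:
        key = (alertname, device, interface)
        now = self.clock()
        last = self._last_run.get(key)
        if last is not None and now - last < self.window_s:
            return False  # still cooling down: suppress this trigger
        self._last_run[key] = now
        return True

# Demonstration with a fake clock instead of real waiting.
fake_now = [0.0]
cooldown = Cooldown(window_s=60, clock=lambda: fake_now[0])
for t in (0.0, 20.0, 45.0, 70.0):
    fake_now[0] = t
    allowed = cooldown.should_run("InterfaceDown", "rtr-core-01", "GigabitEthernet0/0/0")
    print(f"t={t:>4}s run remediation: {allowed}")
```

The suppressed attempts do not reset the timer here, so a continuously flapping interface still gets one remediation per window. Whether that is the right policy depends on the runbook.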

What rapid alternation looks like

The link-failover scenario's flap generator produces a simple pattern: 60s up, 30s down, cycling for the duration. Real flaps are faster and less predictable.

Here is a rapid sequence that toggles every 2 to 3 seconds:

Rapid flap sequence (inline in a scenario entry)

generator:
  type: sequence
  # Rapid flap: 5s up, 2-3 second toggles for 12s, then 13s up (30s cycle at rate: 1)
  values: [1, 1, 1, 1, 1,
           0, 0,
           1, 1, 1,
           0, 0, 0,
           1, 1,
           0, 0,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  repeat: true

And a slower variant with longer down windows:

Slow flap sequence

generator:
  type: sequence
  # Slow flap: 15s up, 10s down, 5s up, 10s down, 20s up
  values: [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
           0,0,0,0,0,0,0,0,0,0,
           1,1,1,1,1,
           0,0,0,0,0,0,0,0,0,0,
           1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
  repeat: true
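Typing long runs of ones and zeros by hand invites off-by-one mistakes. A small helper can expand a list of (state, seconds) segments into the `values` list instead. This is a convenience sketch, not part of Sonda; it assumes `rate: 1`, so each value lasts one second.

```python
def build_sequence(segments: list[tuple[int, int]]) -> list[int]:
    """Expand (state, seconds) pairs into one value per second."""
    values: list[int] = []
    for state, seconds in segments:
        values.extend([state] * seconds)
    return values

# The slow flap from above: 15s up, 10s down, 5s up, 10s down, 20s up.
slow_flap = build_sequence([(1, 15), (0, 10), (1, 5), (0, 10), (1, 20)])

# A Python list prints in the same form as a YAML flow sequence,
# so the line can be pasted into a scenario file.
print(f"values: {slow_flap}")
print(f"cycle length: {len(slow_flap)}s")
```

Keeping the segments as data also makes it easy to generate several variants of the same flap with different down-window lengths.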

What to validate

Use these patterns to test that the automation handles each case correctly:

Flap pattern	Expected behavior	What to check
Single down (30s) with `for: 10s`	Alert fires once, resolves once	Runbook triggers exactly once
Rapid flap (2 to 3s toggles)	Alert may not fire (down duration < `for:`)	Runbook should NOT trigger
Slow flap (10s down windows)	Boundary case for `for: 10s`: the alert fires only if the condition still holds 10s after the first evaluation that saw it, so lengthen the down windows to get one alert per window	Runbook triggers per event, or is deduplicated

Tuning for: as a flap filter

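The for: duration in an alert rule acts as a debounce. If an evaluation sees the interface back up before the for: timer expires, the pending alert is discarded and never fires. Increase for: to filter out fast flaps. Decrease it to detect brief outages. Because the rule is only checked once per evaluation interval, a down window that barely matches for: (for example 10s down with for: 10s) is a boundary case that may not fire at all. Test both extremes with Sonda to find the right balance.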
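The debounce effect is easier to reason about with a small simulation. The sketch below is a deliberately simplified model, not vmalert's implementation: it assumes one sample per second (`rate: 1`), an instant check of the latest sample every `interval_s` seconds, and Prometheus-style `for:` handling. The function name and the `offset_s` parameter (where the first evaluation lands) are illustrative.

```python
def firing_times(values, for_s=10, interval_s=5, offset_s=0):
    """Return the evaluation times (in seconds) at which the alert starts firing."""
    fired_at = []
    pending_since = None
    firing = False
    for t in range(offset_s, len(values), interval_s):
        if values[t] != 0:            # interface is up at this evaluation
            pending_since, firing = None, False
            continue
        if pending_since is None:     # first evaluation that sees it down
            pending_since = t
        if not firing and t - pending_since >= for_s:
            firing = True
            fired_at.append(t)
    return fired_at

slow_flap = [1] * 15 + [0] * 10 + [1] * 5 + [0] * 10 + [1] * 20  # 10s down windows
long_down = [1] * 15 + [0] * 30 + [1] * 15                       # one 30s down window

for offset in range(5):
    print(offset, firing_times(slow_flap, offset_s=offset), firing_times(long_down, offset_s=offset))
```

In this model the 10-second down windows never fire for any evaluation offset, because at most two evaluations land inside each window. The 30-second window fires once, 10 seconds after the first evaluation that saw the interface down.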

Vary the timing

The sequence generator gives precise control over flap timing. At rate: 1, each value in the sequence is one second. To simulate sub-second flaps, increase the rate:

scenarios:
  - signal_type: metrics
    name: interface_oper_state
    rate: 2          # 2 events/second = each sequence value is 500ms
    generator:
      type: sequence
      values: [1, 0, 1, 0, 1, 0, 1, 1, 1, 1]
      repeat: true

To simulate longer intervals, decrease the rate:

scenarios:
  - signal_type: metrics
    name: interface_oper_state
    rate: 0.2        # 1 event every 5 seconds = each sequence value is 5s
    generator:
      type: sequence
      values: [1, 0, 0, 1, 1, 1]  # 5s up, 10s down, 15s up
      repeat: true
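The relationship between `rate` and wall-clock timing is a single division: each value lasts `1 / rate` seconds. As a quick check of the comments in the two scenarios above, this sketch converts a sequence into (state, seconds) segments. The function name is illustrative.

```python
from itertools import groupby

def segment_durations(values: list[int], rate: float) -> list[tuple[int, float]]:
    """Collapse consecutive equal values into (state, duration in seconds) pairs."""
    return [
        (state, round(len(list(run)) / rate, 3))
        for state, run in groupby(values)
    ]

print(segment_durations([1, 0, 0, 1, 1, 1], rate=0.2))
# [(1, 5.0), (0, 10.0), (1, 15.0)]

print(segment_durations([1, 0, 1, 0, 1, 0, 1, 1, 1, 1], rate=2))
```

Note that with `repeat: true` the last segment of one cycle runs straight into the first segment of the next, so matching states at both ends add up to one longer window.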

Validate remediation workflows end-to-end

The final test: does the full chain work from synthetic metric to completed remediation? Here is a checklist for validating the automation workflow against Sonda-generated alerts.

Test matrix

Test case	Sonda scenario	Expected alert	Expected automation
Interface down	`@link-failover`	`InterfaceDown` fires	Remediation playbook runs
Interface recovers	Let the `flap` cycle back to 1	`InterfaceDown` resolves	Resolution handler runs (if configured)
BGP session down	BGP sequence (see Network Device Telemetry)	`BGPSessionDown` fires	BGP remediation runs
Rapid flap	Rapid flap sequence (above)	No alert (below `for:` threshold)	No automation triggers
Slow flap	Slow flap sequence (above), with down windows longer than `for:` plus one evaluation interval	Multiple `InterfaceDown` alerts	Deduplication or rate limiting works
Concurrent failures	Run interface + BGP scenarios together	Both alerts fire	Both workflows run without interference

Resolution events

Alertmanager sends a "status": "resolved" webhook when an alert clears. The automation should handle this, for example by closing a ticket or logging the recovery.

The failover scenario produces resolution events naturally. After each 30-second down window, the flap generator returns interface_oper_state to 1, the alert clears, and Alertmanager delivers the resolved webhook. Verify the engine processes it:

# Check webhook logs for resolved status
docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting logs webhook-receiver \
  | grep -i resolved
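On the receiving side, handling resolution usually comes down to branching on each alert's `status` field. The following is a minimal sketch of that dispatch; the function and callback names are hypothetical, and the callbacks stand in for whatever your engine runs on firing and on recovery.

```python
def dispatch(alert: dict, on_firing, on_resolved) -> str:
    """Route one alert from the webhook payload to the matching handler."""
    status = alert.get("status")
    labels = alert.get("labels", {})
    if status == "firing":
        on_firing(labels)
    elif status == "resolved":
        on_resolved(labels)
    else:
        return "ignored"  # unknown or missing status: do nothing
    return status

# Stand-in handlers that only record what they were called with.
remediated, closed = [], []
alerts = [
    {"status": "firing", "labels": {"alertname": "InterfaceDown", "device": "rtr-core-01"}},
    {"status": "resolved", "labels": {"alertname": "InterfaceDown", "device": "rtr-core-01"}},
]
for alert in alerts:
    print(dispatch(alert, remediated.append, closed.append))
```

Keeping the two paths separate makes it straightforward to assert in a test that a resolved webhook never triggers the remediation handler.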

Concurrency testing

Run multiple Sonda scenarios at the same time to confirm the automation handles concurrent alerts correctly. Create a BGP session scenario file:

bgp-session-down.yaml

version: 2
kind: runnable

defaults:
  rate: 1
  duration: 120s
  encoder:
    type: prometheus_text
  sink:
    type: http_push
    url: "http://localhost:8428/api/v1/import/prometheus"
    content_type: "text/plain"

scenarios:
  - signal_type: metrics
    name: bgp_session_state
    generator:
      type: sequence
      # 10s Established, 20s down, 10s Established.
      # The down window exceeds for: 10s plus one 5s evaluation interval.
      values: [1,1,1,1,1,1,1,1,1,1,
               0,0,0,0,0,0,0,0,0,0,
               0,0,0,0,0,0,0,0,0,0,
               1,1,1,1,1,1,1,1,1,1]
      repeat: true
    labels:
      device: rtr-core-01
      bgp_peer: "192.168.1.1"
      bgp_asn: "65001"
      job: snmp

Then run both scenarios at the same time:

# Run the link failure scenario from examples/ in the background
sonda run examples/network-link-failure.yaml &

# Run the BGP session scenario in the foreground of the same shell
sonda run bgp-session-down.yaml

Both InterfaceDown and BGPSessionDown should fire and trigger their workflows without interfering with each other.
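A scripted version of this check can compare the alert names present in Alertmanager against the set you expect. The sketch below assumes the input is the parsed JSON array returned by `/api/v2/alerts` (the same shape the jq filters on this page read); the function name and the sample data are illustrative.

```python
def missing_alertnames(alerts: list[dict], expected: set[str]) -> set[str]:
    """Return the expected alert names that are absent from the Alertmanager response."""
    seen = {alert.get("labels", {}).get("alertname") for alert in alerts}
    return expected - seen

# Hand-written sample, not captured output.
sample_alerts = [
    {"labels": {"alertname": "InterfaceDown", "device": "rtr-core-01"}},
    {"labels": {"alertname": "BGPSessionDown", "device": "rtr-core-01"}},
]

print(missing_alertnames(sample_alerts, {"InterfaceDown", "BGPSessionDown"}))
```

An empty set means both alerts reached Alertmanager. It does not prove that both workflows ran, so the engine-specific checks from the verification steps still apply.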

Tear down

When you are done testing, stop the alerting stack:

docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting down -v

If you copied the automation alert rules into examples/alertmanager/, clean them up:

rm -f examples/alertmanager/network-automation-alerts.yml

Quick reference

Task	Command
Start alerting stack	`docker compose -f examples/docker-compose-victoriametrics.yml --profile alerting up -d`
Run failover scenario	`sonda run examples/network-link-failure.yaml`
Check vmalert for InterfaceDown	`curl -s http://localhost:8880/api/v1/alerts | jq '.data.alerts[]'`
Check Alertmanager alerts	`curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels'`
Check webhook delivery	`docker compose -f examples/docker-compose-victoriametrics.yml --profile alerting logs webhook-receiver`
Tear down	`docker compose -f examples/docker-compose-victoriametrics.yml --profile alerting down -v`

Network Device Telemetry — generating interface and BGP metrics with the sequence generator.
Alerting Pipeline — full Sonda to VictoriaMetrics to Alertmanager pipeline.
Alert Testing — generator patterns for testing alert thresholds.
CI Alert Validation — automating alert rule validation in GitHub Actions.