Network automation testing

This page shows how to verify that a network automation runbook actually runs when an alert fires. You generate synthetic interface and BGP telemetry with Sonda, trigger the alert, and confirm that Ansible EDA, Prefect, or StackStorm executes the matching workflow.

Most teams discover broken automation wiring during a real outage, which is the worst possible time. The scenarios on this page let you test the wiring without waiting for a real failure.

What you need:

  • The Alerting Pipeline stack running (Sonda, VictoriaMetrics, vmalert, Alertmanager).
  • Familiarity with the Network Device Telemetry scenarios.
  • An automation engine installed (Ansible EDA, Prefect, or StackStorm).
  • curl and jq in PATH.

Build on what exists

This page extends the Alerting Pipeline guide, which covers the Sonda to VictoriaMetrics to Alertmanager chain in detail. If you have not worked through that guide yet, start there.

The full path

sonda CLI       VictoriaMetrics    vmalert        Alertmanager     Automation Engine
 |                 |                 |                |                  |
 |-- push -------->|                 |                |                  |
 |  metrics        |<-- evaluate ----|                |                  |
 |                 |--- alert ------>|                |                  |
 |                 |                 |-- notify ----->|                  |
 |                 |                 |                |-- webhook ------>|
 |                 |                 |                |                  |
 |                 |                 |                |     trigger runbook

The first four hops are covered by the alerting pipeline guide. This page covers the last hop: receiving the Alertmanager webhook and triggering an automation workflow.

Start the alerting stack

If the stack is not already running, start it with the alerting profile:

docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting up -d

Verify all services are healthy:

docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting ps

See the Alerting Pipeline guide for the full service table and troubleshooting.

Alert rules for automation testing

The network device telemetry guide uses alert rules with production-style for: durations (30 seconds to 5 minutes). For automation testing, you want alerts to fire quickly so you get fast feedback on the wiring.

examples/network-automation-alerts.yaml
groups:
  - name: network-automation-alerts
    interval: 5s
    rules:
      - alert: InterfaceDown
        expr: interface_oper_state{job="snmp"} == 0
        for: 10s
        labels:
          severity: critical
          automation: "true"
        annotations:
          summary: "Interface {{ $labels.ifName }} is down on {{ $labels.device }}"
          description: >
            {{ $labels.ifAlias }} ({{ $labels.ifName }}) on {{ $labels.device }}
            has been operationally down for more than 10 seconds.
          runbook_url: "https://runbooks.example.com/network/interface-down"

      - alert: BGPSessionDown
        expr: bgp_session_state{job="snmp"} == 0
        for: 10s
        labels:
          severity: critical
          automation: "true"
        annotations:
          summary: "BGP session to {{ $labels.bgp_peer }} is down on {{ $labels.device }}"
          description: >
            BGP session to AS{{ $labels.bgp_asn }} ({{ $labels.bgp_peer }}) on
            {{ $labels.device }} has been down for more than 10 seconds.
          runbook_url: "https://runbooks.example.com/network/bgp-session-down"

The automation: "true" label lets you route only automation-eligible alerts to your engine, separating them from human-notification routes. With for: 10s and a 5-second evaluation interval, an alert fires within about 15 seconds of the condition becoming true (up to one evaluation interval before the rule notices the condition, plus the pending duration).
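The 15-second figure follows from how the pending timer interacts with the evaluation interval. The sketch below is a simplified model, not vmalert's implementation: the function name is illustrative, and ingestion and query latency are ignored.

```python
import math

def time_to_fire_bounds(for_s, interval_s):
    """Return (best, worst) seconds from the condition turning true to the alert firing.

    Simplified model: the rule is evaluated every interval_s seconds, goes
    pending at the first evaluation that sees the condition, and fires at the
    first later evaluation where at least for_s seconds have passed.
    """
    evals_after_pending = math.ceil(for_s / interval_s)
    pending_time = evals_after_pending * interval_s
    best = pending_time                # condition turns true just before an evaluation
    worst = pending_time + interval_s  # condition turns true just after an evaluation
    return best, worst

# for: 10s with interval: 5s, as in the rules above
print(time_to_fire_bounds(10, 5))
```

The worst case is also the shortest outage that is guaranteed to fire under this model, which is useful when sizing the down windows in a test scenario.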

To use these rules with the Docker Compose stack, mount them into vmalert alongside or instead of the default rules:

docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting down -v

# Copy automation rules alongside the existing rules
cp examples/network-automation-alerts.yaml \
  examples/alertmanager/network-automation-alerts.yml

docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting up -d

Why copy the file?

The vmalert service mounts examples/alertmanager/alert-rules.yml and evaluates --rule=/rules/*.yml. Placing your file in the same directory makes it available to vmalert without modifying docker-compose-victoriametrics.yml.

Push metrics that trigger alerts

The built-in link-failover scenario produces interface_oper_state transitions that trigger InterfaceDown. To also trigger BGPSessionDown, you need a separate BGP metric, such as the bgp-session-down.yaml scenario defined under Concurrency testing later on this page.

The failover scenario models a primary link flap on an edge router (60s up, 30s down, cycling). The backup link saturates after the primary drops. The first flap drives interface_oper_state to 0 at the one-minute mark, and the 30-second down window is long enough to satisfy the for: 10s rule. Run the scenario:

# Terminal 1: run the link-failure scenario from examples/
sonda run examples/network-link-failure.yaml

Sink must target VictoriaMetrics

The example scenarios default to stdout. To push to VictoriaMetrics, change the sink in each scenario entry to http_push. The Network Device Telemetry guide shows the change. For a quick single-metric test, generate a minimal scenario with sonda new --template, change the sink to http_push, then run sonda run your-file.yaml.

Verify the alert fired in vmalert:

curl -s http://localhost:8880/api/v1/alerts \
  | jq '.data.alerts[] | select(.labels.alertname == "InterfaceDown") | {state, labels}'

Then confirm Alertmanager received it:

curl -s http://localhost:9093/api/v2/alerts \
  | jq '.[] | select(.labels.alertname == "InterfaceDown") | .labels'
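If you prefer to script the vmalert check rather than read jq output, the same filter can be expressed in Python. This is a minimal sketch: the helper name is illustrative, and the sample response is hand-written to match the fields the jq filter above reads (data.alerts[], state, labels).

```python
def alerts_named(response, alertname, state=None):
    """Return alerts from a vmalert /api/v1/alerts response that match alertname.

    If state is given (for example "firing"), also require that state.
    """
    alerts = response.get("data", {}).get("alerts", [])
    return [
        alert for alert in alerts
        if alert.get("labels", {}).get("alertname") == alertname
        and (state is None or alert.get("state") == state)
    ]

# Hand-written sample in the shape the jq filter above expects.
sample = {
    "status": "success",
    "data": {"alerts": [
        {"state": "firing", "labels": {"alertname": "InterfaceDown", "device": "rtr-core-01"}},
        {"state": "pending", "labels": {"alertname": "BGPSessionDown", "device": "rtr-core-01"}},
    ]},
}
print(len(alerts_named(sample, "InterfaceDown", state="firing")))
```

To run it against the live stack, load the JSON from http://localhost:8880/api/v1/alerts (for example with urllib.request and json.load) and pass the parsed object as response.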

Connect the webhook to your automation engine

Alertmanager delivers alerts as HTTP POST requests with a JSON payload. Your automation engine needs an endpoint that receives these webhooks and triggers the matching workflow.

Here is the Alertmanager webhook payload structure (simplified):

Alertmanager webhook payload
{
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "InterfaceDown",
        "device": "rtr-core-01",
        "ifName": "GigabitEthernet0/0/0",
        "severity": "critical",
        "automation": "true"
      },
      "annotations": {
        "summary": "Interface GigabitEthernet0/0/0 is down on rtr-core-01",
        "runbook_url": "https://runbooks.example.com/network/interface-down"
      },
      "startsAt": "2026-04-04T12:00:00.000Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}
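Whatever engine you use, the receiving side ends up doing the same thing: walking the alerts array and pulling out the labels the runbook needs. The sketch below shows that step in plain Python. The RemediationTarget type and extract_targets function are illustrative names, not part of any engine's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RemediationTarget:
    alertname: str
    device: str
    interface: str
    status: str

def extract_targets(payload, alertname="InterfaceDown"):
    """Pull automation-eligible targets out of an Alertmanager webhook payload."""
    targets = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if labels.get("alertname") != alertname:
            continue
        if labels.get("automation") != "true":
            continue  # not routed for automation
        if "device" not in labels or "ifName" not in labels:
            continue  # cannot remediate without knowing where
        targets.append(RemediationTarget(
            alertname=alertname,
            device=labels["device"],
            interface=labels["ifName"],
            status=alert.get("status", "unknown"),
        ))
    return targets
```

Skipping alerts with missing labels, rather than raising, keeps one malformed entry from blocking the rest of a batched notification. Whether that is the right trade-off depends on how you want failures surfaced.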

Each automation engine consumes this payload differently. Choose your engine below.

Ansible Event-Driven Automation uses rulebooks that map event sources to actions. The alertmanager event source plugin listens for webhook POST requests from Alertmanager.

Rulebook:

rulebook-interface-down.yml
---
- name: Network interface remediation
  hosts: all
  sources:
    - ansible.eda.alertmanager:
        host: 0.0.0.0
        port: 5000
  rules:
    - name: Remediate interface down
      condition: >
        event.alert.labels.alertname == "InterfaceDown"
        and event.alert.status == "firing"
      action:
        run_playbook:
          name: playbooks/remediate-interface.yml
          extra_vars:
            device: "{{ event.alert.labels.device }}"
            interface: "{{ event.alert.labels.ifName }}"

Alertmanager route (add to your alertmanager.yml):

alertmanager.yml (automation route)
route:
  receiver: webhook  # default
  routes:
    - match:
        automation: "true"
      receiver: ansible-eda

receivers:
  - name: webhook
    webhook_configs:
      - url: http://webhook-receiver:8080
        send_resolved: true
  - name: ansible-eda
    webhook_configs:
      - url: http://eda-server:5000/endpoint
        send_resolved: true

Run the rulebook:

ansible-rulebook --rulebook rulebook-interface-down.yml -i inventory.yml

When InterfaceDown fires, EDA receives the webhook, matches the condition, and runs playbooks/remediate-interface.yml with the device and interface as extra vars.

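Prefect offers hosted webhooks (a Prefect Cloud feature) that turn incoming HTTP requests into Prefect events. The example below takes a simpler, self-contained approach: a small FastAPI receiver that accepts the Alertmanager payload and calls the flow directly for each matching alert.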

Flow definition:

flows/remediate_interface.py
from prefect import flow, get_run_logger

@flow(name="remediate-interface-down")
def remediate_interface(device: str, interface: str, alert_status: str):
    logger = get_run_logger()
    logger.info(f"Remediating {interface} on {device} (status: {alert_status})")

    # Your remediation logic here:
    # - SSH to device, check interface state
    # - Attempt bounce if admin-down
    # - Open ticket if hardware failure
    logger.info(f"Remediation complete for {interface} on {device}")

Webhook receiver (a small FastAPI app that connects Alertmanager to Prefect):

webhook_receiver.py
from fastapi import FastAPI, Request
from flows.remediate_interface import remediate_interface

app = FastAPI()

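@app.post("/alertmanager")
async def handle_alert(request: Request):
    # Alertmanager batches alerts, so one POST can carry several entries.
    payload = await request.json()
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if labels.get("alertname") == "InterfaceDown":
            # Calling the flow directly runs it in-process; the HTTP response
            # is sent only after every matching flow run has finished.
            remediate_interface(
                device=labels["device"],
                interface=labels["ifName"],
                alert_status=alert["status"],
            )
    return {"status": "ok"}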

Point Alertmanager's webhook to http://prefect-receiver:8000/alertmanager.
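To exercise the receiver without waiting for an alert to fire, you can post a hand-built payload to it. The sketch below uses only the standard library. The function names are illustrative, and the payload contains just the fields shown in the simplified structure earlier on this page.

```python
import json
import urllib.request
from datetime import datetime, timezone

def build_payload(device, interface, status="firing"):
    """Build a minimal Alertmanager-style webhook body for one InterfaceDown alert."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    alert = {
        "status": status,
        "labels": {
            "alertname": "InterfaceDown",
            "device": device,
            "ifName": interface,
            "severity": "critical",
            "automation": "true",
        },
        "annotations": {"summary": f"Interface {interface} is down on {device}"},
        "startsAt": now,
        "endsAt": "0001-01-01T00:00:00Z",
    }
    return {"status": status, "alerts": [alert]}

def build_request(url, payload):
    """Wrap the payload in a JSON POST request (nothing is sent yet)."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then one call, for example urllib.request.urlopen(build_request("http://localhost:8000/alertmanager", build_payload("rtr-core-01", "GigabitEthernet0/0/0")), timeout=10). The localhost:8000 address is an assumption about where you published the receiver. The same request works against any of the three engines if you swap in that engine's webhook URL.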

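StackStorm uses sensors and rules to map events to actions. The stackstorm-alertmanager pack provides a webhook sensor for Alertmanager. The rule below inspects only the first entry in the webhook body (alerts[0]), so it assumes each notification carries a single alert.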

Rule definition:

rules/remediate_interface_down.yaml
---
name: remediate_interface_down
pack: network_automation
description: "Trigger interface remediation on InterfaceDown alert"
enabled: true

trigger:
  type: alertmanager.webhook
  parameters: {}

criteria:
  trigger.body.alerts[0].labels.alertname:
    type: equals
    pattern: "InterfaceDown"
  trigger.body.alerts[0].status:
    type: equals
    pattern: "firing"

action:
  ref: network_automation.remediate_interface
  parameters:
    device: "{{ trigger.body.alerts[0].labels.device }}"
    interface: "{{ trigger.body.alerts[0].labels.ifName }}"

Alertmanager route:

alertmanager.yml (StackStorm route)
receivers:
  - name: stackstorm
    webhook_configs:
      - url: http://stackstorm:9101/v1/webhooks/alertmanager
        send_resolved: true

Register the rule and verify it is active:

st2 rule create rules/remediate_interface_down.yaml
st2 rule list --pack=network_automation

Verify the automation triggers

With the alerting stack running and the automation engine connected, push metrics and watch the full chain execute.

Step-by-step verification

1. Confirm metrics are flowing:

curl -s "http://localhost:8428/api/v1/query?query=interface_oper_state" \
  | jq '.data.result[] | {device: .metric.device, ifName: .metric.ifName, value: .value[1]}'

2. Confirm the alert is firing in vmalert:

curl -s http://localhost:8880/api/v1/alerts \
  | jq '.data.alerts[] | select(.labels.alertname == "InterfaceDown") | {state, value}'

3. Confirm Alertmanager received the alert:

curl -s http://localhost:9093/api/v2/alerts \
  | jq '.[] | select(.labels.alertname == "InterfaceDown")'

4. Confirm webhook delivery (check the echo server logs for the payload):

docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting logs webhook-receiver --tail 20

5. Confirm the automation engine received the event and ran the workflow. This step depends on the engine.

Ansible EDA:

# Run the rulebook with --verbose so rule matches and playbook runs are logged
ansible-rulebook --rulebook rulebook-interface-down.yml -i inventory.yml --verbose

Look for log lines showing the condition matched and the playbook executed.

Prefect:

# List recent runs of the remediation flow
prefect flow-run ls --flow-name "remediate-interface-down"

Verify the flow run completed with the correct device and interface parameters.

StackStorm:

# Check StackStorm execution history
st2 execution list --action=network_automation.remediate_interface

Verify the execution finished with status: succeeded.
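The manual checks above can be turned into a scripted test by polling each stage until it reports what you expect. The helper below is a generic sketch with an illustrative name. The clock and sleep parameters exist only so the loop can be tested without real waiting.

```python
import time

def wait_until(check, timeout_s, poll_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Call check() every poll_s seconds until it returns True or timeout_s elapses.

    Returns True if check() succeeded, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A check function for step 2, for instance, would fetch http://localhost:8880/api/v1/alerts and return True once an InterfaceDown alert is in the firing state. With the link-failover scenario the first down window starts at the one-minute mark, so a timeout of around 120 seconds leaves reasonable headroom.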

Test flap handling

A single interface-down event is the easy case. The harder case is rapid alternation between firing and resolved: an interface that bounces up and down repeatedly. The automation needs to handle this without triggering a remediation storm.

What rapid alternation looks like

The link-failover scenario's flap generator produces a simple pattern: 60s up, 30s down, cycling for the duration. Real flaps are faster and less predictable.

Here is a rapid sequence that toggles every 2 to 3 seconds:

Rapid flap sequence (inline in a scenario entry)
generator:
  type: sequence
  # Rapid flap: 2-3 second down windows between seconds 5 and 17, then steady up (30 values)
  values: [1, 1, 1, 1, 1,
           0, 0,
           1, 1, 1,
           0, 0, 0,
           1, 1,
           0, 0,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  repeat: true

And a slower variant with longer down windows:

Slow flap sequence
generator:
  type: sequence
  # Slow flap: 15s up, 10s down, 5s up, 10s down, 20s up
  values: [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
           0,0,0,0,0,0,0,0,0,0,
           1,1,1,1,1,
           0,0,0,0,0,0,0,0,0,0,
           1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
  repeat: true

What to validate

Use these patterns to test that the automation handles each case correctly:

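| Flap pattern | Expected behavior | What to check |
| --- | --- | --- |
| Single down window well beyond for: (for example 30s with for: 10s) | Alert fires once, resolves once | Runbook triggers exactly once |
| Rapid flap (2 to 3s toggles) | Alert may not fire (down duration < for:) | Runbook should NOT trigger |
| Slow flap (10s down windows) | Alert fires on a down window only if the window outlasts for: plus one evaluation interval; 10s windows are borderline with for: 10s | Runbook triggers per fired alert, or is deduplicated |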
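When the engine itself has no built-in deduplication, a per-target cooldown is one simple way to keep a flapping interface from launching the runbook over and over. The class below is an illustrative sketch, not a feature of EDA, Prefect, or StackStorm. State lives in memory, so it resets when the process restarts.

```python
import time

class CooldownGuard:
    """Allow one run per key within a cooldown window."""

    def __init__(self, cooldown_s, clock=time.monotonic):
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last_run = {}  # key -> time of the last allowed run

    def should_run(self, key):
        now = self._clock()
        last = self._last_run.get(key)
        if last is not None and now - last < self._cooldown_s:
            return False  # still cooling down; suppress this trigger
        self._last_run[key] = now
        return True
```

A natural key is the tuple (alertname, device, ifName), so that one noisy interface does not suppress remediation for a different one. In a webhook handler you would call should_run(key) before starting the workflow and skip the alert when it returns False.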

Tuning for: as a flap filter

The for: duration in an alert rule acts as a debounce. If the interface returns before the for: timer expires, the alert never fires. Increase for: to filter out fast flaps. Decrease it to detect brief outages. Test both extremes with Sonda to find the right balance.
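Before running a sequence against the real stack, it can help to predict whether a given for: and evaluation interval would let it fire. The function below is a deliberately simplified model and an assumption about rule evaluation, not vmalert's code: it treats the sequence as one sample per second, checks the sample at each evaluation instant, and ignores ingestion and query latency. The phase_s parameter is where the first evaluation falls relative to the start of the sequence.

```python
def count_firings(values, for_s, interval_s, phase_s=0):
    """Count pending-to-firing transitions for the condition 'value == 0'.

    values holds one sample per second; the rule is evaluated at
    phase_s, phase_s + interval_s, ... for one pass over values.
    """
    firings = 0
    active_at = None  # evaluation time at which the condition was first seen
    firing = False
    for t in range(phase_s, len(values), interval_s):
        if values[t] == 0:
            if active_at is None:
                active_at = t
            if not firing and t - active_at >= for_s:
                firing = True
                firings += 1
        else:
            active_at = None
            firing = False
    return firings

rapid = [1] * 5 + [0] * 2 + [1] * 3 + [0] * 3 + [1] * 2 + [0] * 2 + [1] * 13
slow = [1] * 15 + [0] * 10 + [1] * 5 + [0] * 10 + [1] * 20

print([count_firings(rapid, 10, 5, p) for p in range(5)])
print([count_firings(slow, 10, 5, p) for p in range(5)])
```

Under this model, with for: 10s and a 5-second interval, the slow sequence never fires because its 10-second down windows end before the pending timer completes. The rapid sequence can fire for some phases, because consecutive evaluations may happen to land only on down samples. Treat the numbers as a guide to which cases deserve a real run, not as a substitute for one.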

Vary the timing

The sequence generator gives precise control over flap timing. At rate: 1, each value in the sequence is one second. To simulate sub-second flaps, increase the rate:

scenarios:
  - signal_type: metrics
    name: interface_oper_state
    rate: 2          # 2 events/second = each sequence value is 500ms
    generator:
      type: sequence
      values: [1, 0, 1, 0, 1, 0, 1, 1, 1, 1]
      repeat: true

To simulate longer intervals, decrease the rate:

scenarios:
  - signal_type: metrics
    name: interface_oper_state
    rate: 0.2        # 1 event every 5 seconds = each sequence value is 5s
    generator:
      type: sequence
      values: [1, 0, 0, 1, 1, 1]  # 5s up, 10s down, 15s up
      repeat: true
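Writing long value lists by hand is error-prone. The helper below is a hypothetical convenience function, not part of Sonda. It expands a list of (state, seconds) segments into the values array for a given rate.

```python
def build_sequence(segments, rate=1.0):
    """Expand (state, seconds) segments into sequence values at `rate` events/second."""
    values = []
    for state, seconds in segments:
        count = round(seconds * rate)
        if count < 1:
            raise ValueError(
                f"segment of {seconds}s is shorter than one sample at rate {rate}"
            )
        values.extend([state] * count)
    return values

# The slow flap from earlier: 15s up, 10s down, 5s up, 10s down, 20s up at rate 1
slow = build_sequence([(1, 15), (0, 10), (1, 5), (0, 10), (1, 20)])

# The low-rate example above: 5s up, 10s down, 15s up at rate 0.2
coarse = build_sequence([(1, 5), (0, 10), (1, 15)], rate=0.2)
print(len(slow), coarse)
```

Paste the resulting list into the values field of the generator. Segment lengths that are not a multiple of the sample period are rounded to the nearest sample, so pick durations that divide evenly when exact timing matters.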

Validate remediation workflows end-to-end

The final test: does the full chain work from synthetic metric to completed remediation? Here is a checklist for validating the automation workflow against Sonda-generated alerts.

Test matrix

| Test case | Sonda scenario | Expected alert | Expected automation |
| --- | --- | --- | --- |
| Interface down | @link-failover | InterfaceDown fires | Remediation playbook runs |
| Interface recovers | Let the flap cycle back to 1 | InterfaceDown resolves | Resolution handler runs (if configured) |
| BGP session down | BGP sequence (see Network Device Telemetry) | BGPSessionDown fires | BGP remediation runs |
| Rapid flap | Rapid flap sequence (above) | No alert expected (down windows are shorter than for:) | No automation triggers |
| Slow flap | Slow flap sequence (above) | InterfaceDown per down window, provided each window outlasts for: plus one evaluation interval | Deduplication or rate limiting works |
| Concurrent failures | Run interface + BGP scenarios together | Both alerts fire | Both workflows run without interference |

Resolution events

Alertmanager sends a "status": "resolved" webhook when an alert clears. The automation should handle this, for example by closing a ticket or logging the recovery.

The failover scenario produces resolution events naturally. After each 30-second down window, the flap generator returns interface_oper_state to 1, the alert clears, and Alertmanager delivers the resolved webhook. Verify the engine processes it:

# Check webhook logs for resolved status
docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting logs webhook-receiver \
  | grep -i resolved
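On the engine side, handling resolution usually comes down to branching on each alert's status field. The sketch below shows one way to express that decision. The action names ("remediate", "close", "ignore") are placeholders for whatever your workflows are called.

```python
def action_for(alert):
    """Decide what to do with one alert from an Alertmanager webhook payload."""
    labels = alert.get("labels", {})
    if labels.get("automation") != "true":
        return "ignore"      # not routed for automation
    status = alert.get("status")
    if status == "firing":
        return "remediate"   # start the remediation workflow
    if status == "resolved":
        return "close"       # close the ticket or log the recovery
    return "ignore"          # unknown status: do nothing rather than guess

firing = {"status": "firing", "labels": {"alertname": "InterfaceDown", "automation": "true"}}
resolved = {"status": "resolved", "labels": {"alertname": "InterfaceDown", "automation": "true"}}
print(action_for(firing), action_for(resolved))
```

Keeping the decision in a small pure function makes it easy to unit-test separately from the webhook transport and from the workflow it eventually calls.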

Concurrency testing

Run multiple Sonda scenarios at the same time to confirm the automation handles concurrent alerts correctly. Create a BGP session scenario file:

bgp-session-down.yaml
version: 2
kind: runnable

defaults:
  rate: 1
  duration: 120s
  encoder:
    type: prometheus_text
  sink:
    type: http_push
    url: "http://localhost:8428/api/v1/import/prometheus"
    content_type: "text/plain"

scenarios:
  - signal_type: metrics
    name: bgp_session_state
    generator:
      type: sequence
      # 10s Established, 20s down, 10s Established
      # (the down window must outlast for: 10s plus one 5s evaluation interval)
      values: [1,1,1,1,1,1,1,1,1,1,
               0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
               1,1,1,1,1,1,1,1,1,1]
      repeat: true
    labels:
      device: rtr-core-01
      bgp_peer: "192.168.1.1"
      bgp_asn: "65001"
      job: snmp

Then run both scenarios at the same time:

# Link failure scenario from examples/, started in the background
sonda run examples/network-link-failure.yaml &

# BGP session down, in the foreground of the same terminal
sonda run bgp-session-down.yaml

Both InterfaceDown and BGPSessionDown should fire and trigger their workflows without interfering with each other.
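A quick way to confirm both alerts are present at the same time is to compare the alert names Alertmanager reports against the set you expect. The sketch below works on the parsed JSON from /api/v2/alerts. The function name is illustrative, and the status.state == "active" filter reflects my understanding of the v2 response shape, so adjust it if your Alertmanager version differs.

```python
def missing_alertnames(alertmanager_alerts, expected):
    """Return the expected alert names that are not currently active."""
    active = {
        alert.get("labels", {}).get("alertname")
        for alert in alertmanager_alerts
        if alert.get("status", {}).get("state") == "active"
    }
    return set(expected) - active

# Hand-written sample: only InterfaceDown is active.
sample = [
    {"labels": {"alertname": "InterfaceDown", "device": "rtr-core-01"},
     "status": {"state": "active"}},
    {"labels": {"alertname": "BGPSessionDown", "device": "rtr-core-01"},
     "status": {"state": "suppressed"}},
]
print(missing_alertnames(sample, {"InterfaceDown", "BGPSessionDown"}))
```

An empty result means every expected alert is active. Because the two scenarios have different cycle lengths, poll for a while before concluding that one of them never fired.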

Tear down

When you are done testing, stop the alerting stack:

docker compose -f examples/docker-compose-victoriametrics.yml \
  --profile alerting down -v

If you copied the automation alert rules into examples/alertmanager/, clean them up:

rm -f examples/alertmanager/network-automation-alerts.yml

Quick reference

| Task | Command |
| --- | --- |
| Start alerting stack | docker compose -f examples/docker-compose-victoriametrics.yml --profile alerting up -d |
| Run failover scenario | sonda run examples/network-link-failure.yaml |
| Check vmalert for InterfaceDown | curl -s http://localhost:8880/api/v1/alerts \| jq '.data.alerts[]' |
| Check Alertmanager alerts | curl -s http://localhost:9093/api/v2/alerts \| jq '.[].labels' |
| Check webhook delivery | docker compose -f examples/docker-compose-victoriametrics.yml --profile alerting logs webhook-receiver |
| Tear down | docker compose -f examples/docker-compose-victoriametrics.yml --profile alerting down -v |