Network device telemetry¶

This page shows how to model SNMP-style network telemetry with Sonda. You then validate dashboards and alerts against a synthetic link failure. The page covers four tasks:

Model a router with two uplinks.
Generate the metric streams that an SNMP exporter would produce.
Simulate a primary-link failure cascade.
Run PromQL queries against the synthetic data.

The motivation: a typical lab has a couple of routers and a port-channel that never flaps. The interesting cases will not happen on demand:

A 32-bit counter wrapping at peak traffic.
A primary uplink dropping while the backup saturates.
BGP sessions toggling between Established and Idle.

Asking netops to break a production link to test a dashboard is not a strategy.

Sonda models each interface as its own metric stream with the labels snmp_exporter emits (device, ifName, ifAlias, job=snmp). PromQL written against the synthetic data is the same PromQL you deploy. rate(interface_in_octets[1m]) behaves the same way against a sawtooth-modeled counter as it does against a real SNMP poll. The dashboard you tune against the scenario is the dashboard you deploy.

What you need:

Sonda installed (Getting Started).
Familiarity with SNMP, interface counters, and operational state.

Model a network device¶

A typical network device exposes several metric families per interface, plus system-level gauges. Here is what we model for a core router (rtr-core-01) with two uplinks:

Metric	Type	Generator	Why
`interface_in_octets`	Counter	sawtooth	Monotonically increasing byte counter that resets at the period boundary, mirroring SNMP `ifInOctets`
`interface_out_octets`	Counter	sawtooth	Same pattern for egress traffic
`interface_oper_state`	Gauge	constant / sequence	1 = up, 0 = down. Toggles during failure scenarios
`interface_errors`	Counter	spike	Low baseline with periodic error bursts
`device_cpu_percent`	Gauge	sine	Smooth oscillation that represents normal CPU load
`device_memory_percent`	Gauge	sine	Memory utilization with gentle oscillation

Each interface gets its own labels (device, ifName, ifAlias, job) so the metrics are distinguishable in PromQL, the same as real SNMP-exported data.

Why these generators?¶

Sawtooth for counters. SNMP interface counters increase monotonically until they wrap at the 32-bit or 64-bit maximum. The sawtooth generator increases linearly from min to max and then resets, which is the pattern ifInOctets shows across a counter wrap. Use rate() in PromQL to derive throughput, the same as with real SNMP data.

Sine for system gauges. CPU and memory utilization on a router moves smoothly with traffic load and routing table churn. The sine generator produces that natural oscillation. Add jitter for realism.

Spike for error counters. Interface errors are typically zero, with occasional bursts during link instability or CRC failures. The spike generator holds at a baseline and periodically emits a burst — useful for testing error-rate alerts.

Sequence for state modeling. When you need precise control over a timeline, the sequence generator steps through an explicit list of values. For example, an interface goes down at second 10 and returns at second 20. This is how you script failure scenarios.
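
The four shapes can be sketched as pure functions of elapsed time. This is a minimal illustration, not Sonda's implementation: the function names, the spike parameters (interval, width, magnitude), and the exact reset behavior are assumptions made for the example.

```python
import math


def sawtooth(t: float, min_v: float, max_v: float, period: float) -> float:
    """Linear ramp from min_v toward max_v that restarts every `period` seconds."""
    return min_v + (max_v - min_v) * ((t % period) / period)


def sine(t: float, amplitude: float, period: float, offset: float) -> float:
    """Smooth oscillation centered on `offset`."""
    return offset + amplitude * math.sin(2 * math.pi * t / period)


def spike(t: float, baseline: float, magnitude: float,
          interval: float, width: float) -> float:
    """Baseline value, raised by `magnitude` for `width` seconds every `interval` seconds."""
    return baseline + magnitude if (t % interval) < width else baseline


def sequence(tick: int, values: list, repeat: bool = True) -> float:
    """Step through an explicit list, looping when `repeat` is true."""
    if repeat:
        return values[tick % len(values)]
    # Without repeat, hold the last value once the list is exhausted (assumption).
    return values[min(tick, len(values) - 1)]
```

For instance, sawtooth(150, 0.0, 500_000_000.0, 300) lands halfway up the ramp, and sequence(15, [1] * 15 + [0] * 10 + [1] * 5) returns 0, the first down tick.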

Generate baseline telemetry¶

The baseline scenario models rtr-core-01 in a healthy state: both uplinks carrying traffic, all interfaces up, steady CPU and memory.

sonda --dry-run run examples/network-device-baseline.yaml

examples/network-device-baseline.yaml (excerpt)

version: 2
kind: runnable

defaults:
  rate: 1
  duration: 120s
  encoder:
    type: prometheus_text
  sink:
    type: stdout

scenarios:
  # Interface traffic counter (sawtooth = monotonic ramp)
  - signal_type: metrics
    name: interface_in_octets
    generator:
      type: sawtooth
      min: 0.0
      max: 500000000.0
      period_secs: 300
    jitter: 1000000.0
    jitter_seed: 10
    labels:
      device: rtr-core-01
      ifName: GigabitEthernet0/0/0
      ifAlias: uplink-isp-a
      job: snmp

  # Interface operational state (1 = up)
  - signal_type: metrics
    name: interface_oper_state
    generator:
      type: constant
      value: 1.0
    labels:
      device: rtr-core-01
      ifName: GigabitEthernet0/0/0
      ifAlias: uplink-isp-a
      job: snmp

  # ... more interfaces, CPU, memory (9 scenarios total)

The full file contains 9 concurrent scenarios:

interface_in_octets and interface_out_octets for both interfaces.
interface_oper_state for both interfaces.
interface_errors for the primary uplink.
device_cpu_percent and device_memory_percent.

Run it:

sonda run examples/network-device-baseline.yaml

Sample output (interleaved from 9 threads)

interface_in_octets{device="rtr-core-01",ifAlias="uplink-isp-a",ifName="GigabitEthernet0/0/0",job="snmp"} 0 1775265944249
interface_oper_state{device="rtr-core-01",ifAlias="uplink-isp-a",ifName="GigabitEthernet0/0/0",job="snmp"} 1 1775265944250
device_cpu_percent{device="rtr-core-01",job="snmp"} 36.42 1775265944251

Each scenario runs on its own thread at 1 event per second. That is faster than most production SNMP polling intervals, but it keeps test runs short. The output interleaves across all 9 streams.
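
When you pipe the stdout output into a quick check, it can help to split each sample line into its parts. The parser below is a rough sketch for the simple lines shown above. It is not a full Prometheus text-format parser (it does not unescape label values or handle comment lines), and the function name is hypothetical.

```python
import re
from typing import Optional

# metric_name{labels} value [timestamp_ms]
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?$'
)
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Return (name, labels, value, timestamp_ms) for one sample line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = {
        lm.group("key"): lm.group("val")
        for lm in LABEL_RE.finditer(m.group("labels") or "")
    }
    ts: Optional[int] = int(m.group("ts")) if m.group("ts") else None
    return m.group("name"), labels, float(m.group("value")), ts
```

A check such as asserting that every interface_in_octets sample carries device, ifName, ifAlias, and job labels catches label typos before the data reaches a dashboard.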

Match your polling interval

Set rate: 1 for 1-second resolution. Set rate: 0.2 for a 5-second SNMP polling interval (one event every 5 seconds). The rate controls how many samples Sonda produces per second. Match it to your real collection interval for realistic dashboard testing.
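
The conversion is a reciprocal: rate equals 1 divided by the polling interval in seconds. The helpers below are hypothetical names for that arithmetic and for estimating how many samples one scenario emits over a run.

```python
def rate_for_interval(poll_interval_secs: float) -> float:
    """Events per second that reproduce one sample every `poll_interval_secs`."""
    if poll_interval_secs <= 0:
        raise ValueError("polling interval must be positive")
    return 1.0 / poll_interval_secs


def samples_emitted(rate: float, duration_secs: float) -> int:
    """Approximate number of samples one scenario emits over a run."""
    return round(rate * duration_secs)
```

With the baseline's duration of 120s, a 5-second interval (rate 0.2) yields about 24 samples per metric, so keep PromQL range windows wide enough to contain several of them.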

Which failure pattern to choose¶

Two example scenarios model a link failure with different mechanics. Pick the one that matches how you want to reason about time:

Scenario	Mechanic	Best for
`examples/network-link-failure.yaml`	`sequence` generator + `repeat: true`, aligned tick-by-tick across multiple entries	Tight, repeating cycles where every tick's value matters and failures recur on a fixed schedule
`scenarios/link-failover.yaml`	`after:` chains — each signal declares what it waits for; the compiler resolves phase offsets	Once-through causal chains where order matters (primary drops -> backup saturates -> latency rises) but the exact tick does not

Use sequence + repeat when you need hand-authored values at specific ticks and the pattern should loop. This is useful for soak testing and for dashboards that expect a steady rhythm. Use after: when the signals form a cascade and you prefer to declare a causal link. For example: "latency starts degrading when the backup saturates", rather than counting seconds across four entries. Both patterns are runnable example files in the repository. You can mix them in the same scenario when a repeating failure also triggers a cascade.

Simulate a link failover¶

The interesting part: what happens when a primary link drops? Traffic shifts to the backup path, the backup saturates as it absorbs double the load, and latency increases as the backup fills. Testing dashboards and alerts against that cascade is the whole point.

Model the cascade as a 3-signal causal chain. Each signal uses a dedicated generator. The after: field tells Sonda to delay a signal until the one it depends on crosses a threshold:

Signal	Generator	Starts when
`interface_oper_state` (primary)	`flap` — 60s up, 30s down, cycling	`t=0`
`backup_link_utilization`	`saturation` — increases 20% -> 85% over 2m	primary drops below 1 (first flap)
`latency_ms`	`degradation` — rises 5ms -> 150ms over 3m	backup utilization exceeds 70%

Sonda resolves the chain at parse time. The compiler computes a concrete phase_offset for each linked signal, so the signals emit independently but start in the right order. See the after: chain reference for the mechanics.

link-failover.yaml

version: 2
kind: runnable

defaults:
  rate: 1
  duration: 5m
  encoder:
    type: prometheus_text
  sink:
    type: stdout
  labels:
    device: rtr-edge-01
    job: network

scenarios:
  - id: interface_oper_state
    signal_type: metrics
    name: interface_oper_state
    generator:
      type: flap
      up_duration: 60s
      down_duration: 30s
    labels:
      interface: GigabitEthernet0/0/0

  - id: backup_link_utilization
    signal_type: metrics
    name: backup_link_utilization
    generator:
      type: saturation
      baseline: 20
      ceiling: 85
      time_to_saturate: 2m
    labels:
      interface: GigabitEthernet0/1/0
    after:
      ref: interface_oper_state
      op: "<"
      value: 1

  - id: latency_ms
    signal_type: metrics
    name: latency_ms
    generator:
      type: degradation
      baseline: 5
      ceiling: 150
      time_to_degrade: 3m
    labels:
      path: backup
    after:
      ref: backup_link_utilization
      op: ">"
      value: 70

Run the file:

sonda run link-failover.yaml

Use --dry-run to see the phase_offset values Sonda computed from the after: clauses:

sonda --dry-run run link-failover.yaml

Output (abridged)

[config] file: link-failover.yaml (version: 2, 3 scenarios)

[config] [1/3] interface_oper_state
    generator:      flap (up_duration: 60s, down_duration: 30s, up_value: 1, down_value: 0)
    clock_group:    chain_backup_link_utilization (auto)

[config] [2/3] backup_link_utilization
    generator:      saturation (baseline: 20, ceiling: 85, time_to_saturate: 2m)
    phase_offset:   1m
    clock_group:    chain_backup_link_utilization (auto)

[config] [3/3] latency_ms
    generator:      degradation (baseline: 5, ceiling: 150, time_to_degrade: 3m)
    phase_offset:   152.308s
    clock_group:    chain_backup_link_utilization (auto)

Validation: OK (3 scenarios)

The phase_offset: lines show the delays Sonda derived from each after: threshold. The backup starts saturating 1 minute in, when the primary first goes down. Latency begins degrading about 152 seconds in, when the backup utilization crosses 70%. All three signals share the same auto-assigned clock_group, so their timers start from the same reference.
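
The offsets in the dry-run output can be reproduced by hand if you assume the saturation generator ramps linearly from baseline to ceiling. This page does not confirm that assumption, but it matches the 152.308s reported above. The helper name time_to_cross is hypothetical.

```python
def time_to_cross(baseline: float, ceiling: float,
                  ramp_secs: float, threshold: float) -> float:
    """Seconds after a linear ramp starts until it reaches `threshold`."""
    if not min(baseline, ceiling) <= threshold <= max(baseline, ceiling):
        raise ValueError("threshold is outside the ramp range")
    return ramp_secs * (threshold - baseline) / (ceiling - baseline)


# Primary link: flap with up_duration 60s, so it first drops below 1 at t=60s.
backup_offset = 60.0

# Backup link: 20 -> 85 over 2m; latency waits until utilization exceeds 70.
latency_offset = backup_offset + time_to_cross(20, 85, 120, 70)

print(f"backup_link_utilization phase_offset: {backup_offset:.0f}s")
print(f"latency_ms phase_offset: {latency_offset:.3f}s")  # 152.308s
```

The same arithmetic suggests latency_ms would reach its ceiling about 332 seconds in (152 + 180). If the 5m duration is measured from the shared start, the run ends before latency reaches 150ms, so extend duration if a dashboard needs to show the full ramp.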

Why after: instead of aligned sequences?

You can express a link failure with the sequence generator by hand-aligning values across scenarios, which is what examples/network-link-failure.yaml does. after: is the declarative alternative: declare the causal relationship once and let the compiler do the timing math. The Scenario Files reference covers the full surface.

Label design for network metrics¶

Choosing the right labels determines whether your PromQL queries work naturally. The examples on this page use labels that mirror what real SNMP exporters produce:

Label	Purpose	Example
`device`	Hostname or FQDN of the network device	`rtr-core-01`
`ifName`	SNMP ifName (interface identifier)	`GigabitEthernet0/0/0`
`ifAlias`	Human-readable interface description	`uplink-isp-a`
`job`	Prometheus job label for scrape grouping	`snmp`

This matches the label schema used by snmp_exporter and similar tools. Dashboards and alert rules work the same way against synthetic data as they do against real SNMP-exported metrics.

Adding more interfaces

To model a device with more interfaces, duplicate a scenario entry and change the ifName and ifAlias labels. Each entry runs on its own thread, so the thread count grows linearly with the number of entries: 10 interfaces with 4 metrics each is 40 concurrent scenarios. Sonda handles this at low rates (1 per second per metric).

PromQL queries for network monitoring¶

With synthetic data flowing, you can validate the PromQL queries that power your dashboards and alerts. Here are the common network monitoring queries, ready to use with the metrics from the example scenarios.

Interface throughput¶

Derive bits per second from the octets counter:

rate(interface_in_octets{device="rtr-core-01"}[1m]) * 8

This works because the sawtooth generator produces a counter that increases monotonically between resets. rate() computes the per-second average increase over the window and compensates for resets, and multiplying by 8 converts octets to bits.
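
To sanity-check a throughput panel, compute the value the query should return from the sawtooth parameters. The sketch below uses the excerpted baseline values (min 0, max 500000000, period_secs 300) and ignores jitter. The function name is hypothetical.

```python
def sawtooth_throughput_bps(min_v: float, max_v: float, period_secs: float) -> float:
    """Bits per second implied by the slope of a sawtooth octet counter."""
    octets_per_sec = (max_v - min_v) / period_secs
    return octets_per_sec * 8


baseline_bps = sawtooth_throughput_bps(0.0, 500_000_000.0, 300)
print(f"{baseline_bps / 1e6:.2f} Mbit/s")  # 13.33 Mbit/s
```

A panel that shows a value far from roughly 13.33 Mbit/s for this interface points to a unit error in the query or to a range window that is too short for the sample rate.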

Interface state¶

Detect interfaces that are down:

interface_oper_state{device="rtr-core-01"} == 0

During the sequence-based link failure scenario (examples/network-link-failure.yaml), this returns GigabitEthernet0/0/0 for seconds 10 to 19 of each 30-second cycle.

Error rate¶

Alert on sustained interface errors:

rate(interface_errors{device="rtr-core-01"}[5m]) > 0

Traffic shift detection¶

Compare traffic ratios between interfaces to detect redistribution:

  rate(interface_in_octets{device="rtr-core-01",ifName="GigabitEthernet0/0/1"}[1m])
/
  (
    rate(interface_in_octets{device="rtr-core-01",ifName="GigabitEthernet0/0/0"}[1m])
    + rate(interface_in_octets{device="rtr-core-01",ifName="GigabitEthernet0/0/1"}[1m])
  )

Under normal conditions this ratio sits near 0.4 (ISP-B carries less traffic). During a failure on Gi0/0/0, it jumps to 1.0 — all traffic is on the backup link.

Push to a monitoring backend¶

The example scenarios write to stdout for quick iteration. To push metrics into VictoriaMetrics or Prometheus, change the sink in each scenario entry.

The options are VictoriaMetrics (HTTP push), Prometheus (remote write), and file (offline analysis).

For VictoriaMetrics, replace the sink block in each scenario:

encoder:
  type: prometheus_text
sink:
  type: http_push
  url: "http://localhost:8428/api/v1/import/prometheus"
  content_type: "text/plain"

If you use the project's Docker Compose stack:

docker compose -f examples/docker-compose-victoriametrics.yml up -d

Then change the scenario sinks to point at VictoriaMetrics and run:

sonda run examples/network-device-baseline.yaml

Verify data arrived:

curl -s "http://localhost:8428/api/v1/query?query=interface_in_octets" | jq '.data.result | length'

Use the remote write encoder and sink for native Prometheus ingestion:

encoder:
  type: remote_write
sink:
  type: remote_write
  url: "http://localhost:9090/api/v1/write"
  batch_size: 100

Remote write works with Prometheus, Thanos Receive, Cortex, Mimir, Grafana Cloud, and VictoriaMetrics.

Write to a file for offline inspection or replay:

sink:
  type: file
  path: /tmp/network-metrics.txt

Change the sink in one place

In a scenario file, the defaults: block holds the shared sink (and encoder, rate, duration, labels). Replace the sink there once, and every entry in scenarios: picks it up. Per-entry overrides still win when you need a mixed setup.

Alert rule examples¶

Here are Prometheus and VictoriaMetrics alert rules for network device monitoring. Test them against the link failure scenario to verify they fire and resolve correctly.

network-alert-rules.yaml

groups:
  - name: network-device-alerts
    interval: 10s
    rules:
      - alert: InterfaceDown
        expr: interface_oper_state{job="snmp"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Interface {{ $labels.ifName }} is down on {{ $labels.device }}"
          description: >
            {{ $labels.ifAlias }} ({{ $labels.ifName }}) on {{ $labels.device }}
            has been operationally down for more than 30 seconds.

      - alert: HighInterfaceErrorRate
        expr: rate(interface_errors{job="snmp"}[5m]) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.ifName }}"

      - alert: HighDeviceCPU
        expr: device_cpu_percent{job="snmp"} > 70
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 70% on {{ $labels.device }}"

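With the link failure scenario running, InterfaceDown goes pending during each 10-second failure window but does not fire: for: 30s requires the condition to hold continuously for 30 seconds, and the interface recovers before that. To watch the alert fire and resolve, set for: to less than the failure window (for example, 5s) or lengthen the down period in the scenario. HighDeviceCPU fires only when the CPU value from rerouting stays above 70% for the full 5 minutes.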
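
Whether a for: clause fires depends on how long the condition holds relative to the hold duration and the evaluation interval. The sketch below is a simplified model of that behavior for a single down window. It is not the Prometheus or vmalert implementation, and the function name and parameters are made up for the illustration.

```python
def alert_fires(down_start: int, down_secs: int, for_secs: int,
                eval_interval: int, horizon: int) -> bool:
    """Return True if a `for:`-guarded alert reaches the firing state.

    Simplified model: the rule is evaluated every `eval_interval` seconds
    starting at t=0, the condition holds while
    down_start <= t < down_start + down_secs, and the alert fires once the
    condition has held for `for_secs` since it was first observed.
    """
    active_since = None
    for t in range(0, horizon + 1, eval_interval):
        condition = down_start <= t < down_start + down_secs
        if not condition:
            active_since = None      # condition cleared: pending state resets
        elif active_since is None:
            active_since = t         # first true evaluation: alert goes pending
        if active_since is not None and t - active_since >= for_secs:
            return True              # held long enough: alert fires
    return False
```

Under this model, alert_fires(10, 10, 30, 10, 120) returns False because a 10-second window cannot satisfy for: 30s, while alert_fires(10, 60, 30, 10, 120) returns True. Running the numbers before a test session avoids waiting for an alert that can never fire.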

Validate alerts end-to-end

For a complete alerting pipeline test with vmalert and Alertmanager, see the Alerting Pipeline guide. The network device scenarios work as drop-in replacements for the alert testing examples in that guide.

Extend the model¶

The two example scenarios cover the most common network monitoring patterns. Here are ideas for extending them to match your environment.

More interfaces¶

Duplicate scenario entries with different ifName and ifAlias labels. For a 48-port switch, model only the uplinks and a handful of access ports. You do not need all 48 to validate the dashboards.

BGP session state¶

Use the sequence generator to model BGP session flaps:

- signal_type: metrics
  name: bgp_session_state
  rate: 1
  duration: 120s
  generator:
    type: sequence
    # 1=Established, 0=Idle (flap at second 15, recovers at 25)
    values: [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
             0,0,0,0,0,0,0,0,0,0,
             1,1,1,1,1]
    repeat: true
  labels:
    device: rtr-core-01
    bgp_peer: "192.168.1.1"
    bgp_asn: "65001"
    job: snmp
  encoder:
    type: prometheus_text
  sink:
    type: stdout
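
Typing long value lists by hand is error-prone. A small helper, hypothetical and not part of Sonda, can generate one cycle of values to paste into the values: field.

```python
def flap_sequence(up_before: int, down: int, up_after: int) -> list:
    """Build one cycle of 1/0 values: up, then down, then up again."""
    return [1] * up_before + [0] * down + [1] * up_after


# 15 ticks Established, 10 ticks Idle, 5 ticks Established (30 ticks per cycle).
values = flap_sequence(15, 10, 5)
print("values: [" + ",".join(str(v) for v in values) + "]")
```

At rate: 1 each list element is one second, so this cycle repeats every 30 seconds.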

SNMP counter wraps¶

Real 32-bit SNMP counters hold values up to 2^32 - 1 and wrap to 0 on the next increment, so they count modulo 2^32 (4,294,967,296). The sawtooth generator's max parameter models this directly:

generator:
  type: sawtooth
  min: 0.0
  max: 4294967296.0
  period_secs: 600

A 10-minute period with a high-traffic interface wrapping at the 32-bit boundary lets you test whether your rate() queries handle counter resets correctly.
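
Two numbers are worth checking before running this: the throughput the wrap period implies, and what a reset-aware increase looks like across the wrap. The sketch below models reset handling in the style of PromQL (any drop is treated as a reset to zero). It is an illustration with hypothetical function names, not Prometheus code.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit counter holds 0 .. 2**32 - 1


def implied_bps(max_v: float, period_secs: float) -> float:
    """Bits per second needed to fill the counter once per period."""
    return max_v / period_secs * 8


def increase_with_resets(samples: list) -> float:
    """Reset-aware increase over a list of counter samples.

    Any drop between consecutive samples is treated as a reset to zero,
    so the previous value is added back as a correction.
    """
    if len(samples) < 2:
        return 0.0
    correction = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            correction += prev
    return samples[-1] - samples[0] + correction


print(f"{implied_bps(COUNTER32_MODULUS, 600) / 1e6:.2f} Mbit/s")  # 57.27 Mbit/s
```

A 600-second wrap therefore corresponds to roughly 57 Mbit/s of sustained traffic. Because the reset is assumed to go to zero, the octets counted between the last sample before the wrap and the counter maximum are not recovered, so expect a small dip in the computed rate at each wrap.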

Temperature and power¶

Model environmental sensors with sine waves:

- signal_type: metrics
  name: device_temperature_celsius
  rate: 1
  duration: 120s
  generator:
    type: sine
    amplitude: 5.0
    period_secs: 3600
    offset: 45.0
  jitter: 0.5
  jitter_seed: 70
  labels:
    device: rtr-core-01
    sensor: intake
    job: snmp
  encoder:
    type: prometheus_text
  sink:
    type: stdout

Quick reference¶

Task	Command
Validate baseline scenario	`sonda --dry-run run examples/network-device-baseline.yaml`
Run baseline (stdout)	`sonda run examples/network-device-baseline.yaml`
Validate failover scenario	`sonda --dry-run run link-failover.yaml`
Run failover simulation	`sonda run link-failover.yaml`

Generators — full reference for sawtooth, sequence, sine, spike, and jitter.
Scenario Fields — multi-scenario YAML format and field reference.
Alert Testing — threshold and compound alert testing patterns.
Alerting Pipeline — full alerting path with vmalert and Alertmanager.
Example Scenarios — all example scenario files.