
Synthetic monitoring

This page shows how to run sonda-server as a long-lived synthetic monitoring source on Kubernetes. The server emits a known baseline of metrics that Prometheus scrapes. Because that baseline should always be present, you can distinguish "no data" from "data stopped arriving".

The page covers four tasks:

  • Deploy sonda-server with the included Helm chart.
  • Submit scenarios that run for hours or days.
  • Scrape the generated metrics with Prometheus.
  • Build Grafana dashboards and watchdog alerts that monitor both the data and Sonda itself.

What you need:

  • A Kubernetes cluster, local or remote.
  • kubectl and helm installed.
  • curl and jq for API calls.
  • Familiarity with Prometheus scraping and Grafana dashboards.

Set up a local Kubernetes cluster

If you already have a cluster (EKS, GKE, AKS, or a local one), skip to Deploy sonda-server.

For local testing, you need a small Kubernetes distribution. The table lists the common options:

Tool | Best for | Runs on
kind | CI pipelines, fast disposable clusters | Linux, macOS, Windows (WSL2)
k3d | k3s in Docker, built-in registry support | Linux, macOS, Windows (WSL2)
minikube | Broad driver support, add-on ecosystem | Linux, macOS, Windows (WSL2)
OrbStack | Native macOS experience, low resource usage | macOS only

kind, k3d, and minikube (with the Docker driver) need Docker or a compatible container runtime installed and running. OrbStack ships its own container runtime.

kind runs Kubernetes nodes as Docker containers. It starts in under 30 seconds and is the lightest option.

# Install (macOS/Linux)
brew install kind

# Or download the binary directly
# https://kind.sigs.k8s.io/docs/user/quick-start/#installation

# Create a cluster
kind create cluster --name sonda-lab

# Verify
kubectl cluster-info --context kind-sonda-lab

Port mapping for kind

kind does not expose container ports to the host by default. If you need NodePort access (for example, to reach Prometheus or Grafana from outside the cluster), create the cluster with a config that maps the NodePort to a host port:

kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      - containerPort: 30080
        hostPort: 30080
        protocol: TCP

Then create the cluster from that file:

kind create cluster --name sonda-lab --config kind-config.yaml

k3d wraps k3s (Rancher's lightweight Kubernetes) inside Docker. It supports port mapping through its built-in load balancer and can create a local image registry.

# Install (macOS/Linux)
brew install k3d

# Create a cluster with port mapping
k3d cluster create sonda-lab -p "8080:80@loadbalancer"

# Verify
kubectl cluster-info

minikube is the most established option. It supports Docker, Hyperkit, Hyper-V, and other drivers.

# Install (macOS/Linux)
brew install minikube

# Start with Docker driver (recommended)
minikube start --driver=docker --profile sonda-lab

# Verify
kubectl cluster-info --context sonda-lab

Windows WSL2

On Windows, install minikube inside your WSL2 distribution and use the Docker driver. Make sure Docker Desktop's WSL2 backend is enabled. The same commands apply inside the WSL2 terminal.

OrbStack provides a native macOS Kubernetes experience with low resource usage. It runs a single-node cluster that starts together with OrbStack once Kubernetes is enabled.

# Install
brew install orbstack

# If Kubernetes is not running yet, enable it in OrbStack's settings, then verify
kubectl cluster-info

When kubectl get nodes shows a Ready node, the cluster is ready.
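If you script cluster creation, you can poll for a Ready node instead of watching the output by hand. The built-in `kubectl wait --for=condition=Ready nodes --all --timeout=120s` works once nodes exist, but it errors out if no node has registered yet, so a polling loop is more forgiving right after `create cluster`. This is a minimal sketch: `has_ready_node` and `wait_for_ready_node` are hypothetical helper names, and the parsing assumes the default `kubectl get nodes --no-headers` column layout (NAME, STATUS, ROLES, AGE, VERSION).

```shell
#!/usr/bin/env bash
# Succeed (exit 0) if any node in `kubectl get nodes --no-headers` output,
# read from stdin, has STATUS (column 2) Ready or Ready,<extra>, for example
# Ready,SchedulingDisabled. NotReady does not match either test.
has_ready_node() {
  awk '$2 == "Ready" || $2 ~ /^Ready,/ { found = 1 } END { exit !found }'
}

# Poll until a node is Ready or the attempt budget runs out.
# Arguments: maximum attempts, seconds to sleep between attempts.
wait_for_ready_node() {
  local attempts="$1" delay="$2" i
  for ((i = 1; i <= attempts; i++)); do
    if kubectl get nodes --no-headers 2>/dev/null | has_ready_node; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_ready_node 30 2 && echo "cluster is ready"` waits up to about a minute.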

Deploy sonda-server

Sonda includes a Helm chart that deploys sonda-server as a Kubernetes Deployment. The chart configures health probes, a ClusterIP Service, and optional scenario injection through a ConfigMap.

helm install sonda ./helm/sonda

Wait for the pod to become ready:

kubectl get pods -l app.kubernetes.io/name=sonda -w

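You should see 1/1 Running, typically within 15 to 20 seconds once the image is pulled. The Deployment configures liveness and readiness probes against GET /health. If the liveness probe fails, Kubernetes restarts the container; while the readiness probe fails, the pod receives no traffic from the Service.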

Customizing the deployment

Override common settings with --set:

# Pin a specific image version
helm install sonda ./helm/sonda --set image.tag=0.4.0

# Or instead: custom port and resource limits
helm install sonda ./helm/sonda \
  --set server.port=9090 \
  --set resources.requests.cpu=200m \
  --set resources.limits.memory=512Mi

See Kubernetes deployment for the full chart reference.

Verify the server is healthy by port-forwarding to it:

kubectl port-forward svc/sonda 8080:8080 &
curl http://localhost:8080/health
# {"status":"ok"}

The server is ready to accept scenarios.
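kubectl port-forward runs in the background and needs a moment to bind the local port, so a curl issued immediately after it can fail with "connection refused". A small retry wrapper avoids that race. This is a generic sketch; `retry` is a hypothetical helper, not part of Sonda or kubectl.

```shell
#!/usr/bin/env bash
# Run a command until it succeeds, up to a fixed number of attempts.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
# Returns 0 on the first success, otherwise the last failing exit status.
retry() {
  local attempts="$1" delay="$2" i status=0
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    status=$?
    # Sleep between attempts, but not after the final one.
    (( i < attempts )) && sleep "$delay"
  done
  return "$status"
}
```

For example, `retry 10 1 curl -fsS http://localhost:8080/health` waits up to about 10 seconds for the port-forward. The `-f` flag makes curl treat HTTP error responses as failures, so a non-2xx answer also triggers a retry.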

Submit long-running scenarios

A long-running scenario is a scenario YAML without a duration field. It runs until you stop it with DELETE /scenarios/{id}.

examples/long-running-metrics.yaml
version: 2
kind: runnable

defaults:
  rate: 10
  encoder:
    type: prometheus_text
  sink:
    type: stdout

scenarios:
  - signal_type: metrics
    name: continuous_cpu
    generator:
      type: sine
      amplitude: 50.0
      period_secs: 60
      offset: 50.0
    labels:
      instance: api-server-01
      job: sonda

Submit it to the server:

ID=$(curl -s -X POST -H "Content-Type: text/yaml" \
  --data-binary @examples/long-running-metrics.yaml \
  http://localhost:8080/scenarios | jq -r '.id')

echo "Scenario started: $ID"

The scenario runs on a background thread inside the server. You can submit many scenarios; each one gets its own thread and scrape endpoint.

Multiple scenarios for richer coverage

Submit several scenarios with different patterns to simulate a realistic environment. For example: a sine wave for CPU, a step counter for requests, and a constant for an up gauge. Prometheus can scrape each /scenarios/{id}/metrics endpoint independently.
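One way to submit several similar scenarios is to render the YAML from a template. This page only documents the sine generator's fields, so the sketch below varies the instance label and rate of the example scenario instead of guessing fields for other generator types. `render_sine_scenario` is a hypothetical helper; its field layout is copied from examples/long-running-metrics.yaml and, like that file, it has no duration field, so each scenario runs until deleted.

```shell
#!/usr/bin/env bash
# Print a long-running sine scenario for one instance to stdout.
# Arguments: metric name, instance label value, emission rate (events/second).
render_sine_scenario() {
  local name="$1" instance="$2" rate="$3"
  cat <<EOF
version: 2
kind: runnable

defaults:
  rate: ${rate}
  encoder:
    type: prometheus_text
  sink:
    type: stdout

scenarios:
  - signal_type: metrics
    name: ${name}
    generator:
      type: sine
      amplitude: 50.0
      period_secs: 60
      offset: 50.0
    labels:
      instance: ${instance}
      job: sonda
EOF
}
```

To submit one copy, pipe it straight into the API with curl reading the body from stdin: `render_sine_scenario continuous_cpu api-server-02 10 | curl -fsS -X POST -H "Content-Type: text/yaml" --data-binary @- http://localhost:8080/scenarios`. Repeat with other instance values; because the instance labels differ, each copy shows up as a separate series.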

To verify the scenario is running:

# List all running scenarios
curl -s http://localhost:8080/scenarios | jq '.[] | {id, name, status}'

# Check live stats for your scenario
curl -s http://localhost:8080/scenarios/$ID/stats | jq .

For the full API reference, see Server API.

Scrape metrics with Prometheus

sonda-server exposes two scrape endpoints.

  • GET /scenarios/metrics is the aggregate view. Every running scenario appears in one Prometheus text response. Use ?label=k:v to filter by labels set on each scenario.
  • GET /scenarios/{id}/metrics is the per-scenario view. It returns the current value of every series for one scenario.

Both endpoints are idempotent snapshots. They return one sample per (name, labels) series with no timestamp, exactly like a node_exporter scrape.
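Because both endpoints return plain Prometheus text, you can spot-check a metric from a shell before wiring up Prometheus. This is a minimal sketch; `metric_values` is a hypothetical helper, and it assumes sample lines carry no timestamp (as stated above), so the last field on each line is the value.

```shell
#!/usr/bin/env bash
# Read Prometheus text exposition on stdin and print "<series> <value>" for
# every sample whose metric name equals $1. Comment lines (# HELP, # TYPE)
# and blank lines are skipped.
metric_values() {
  awk -v m="$1" '
    /^#/ || NF == 0 { next }
    {
      value = $NF                               # last field is the sample value
      series = $0
      sub(/[ \t]+[^ \t]+[ \t]*$/, "", series)   # drop the value from the line
      name = series
      sub(/[{].*/, "", name)                    # drop the label set, keep the name
      if (name == m) print series, value
    }'
}
```

For example, `curl -fsS http://localhost:8080/scenarios/metrics | metric_values continuous_cpu` prints one line per continuous_cpu series. Comparing the name exactly means continuous_cpu_total, if present, is not matched by accident.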

For Prometheus, vmagent, and VictoriaMetrics jobs, use the aggregate endpoint. One job covers every scenario without knowing IDs in advance, and the endpoint behaves like a normal exporter. See Aggregate Prometheus scrape for the full reference.

Aggregate scrape config

prometheus-scrape.yaml
scrape_configs:
  - job_name: sonda
    scrape_interval: 15s
    metrics_path: /scenarios/metrics
    static_configs:
      - targets: ["sonda.default.svc:8080"]

The target address uses the Kubernetes Service DNS name (sonda.<namespace>.svc). To filter the job to one device's metrics, add params: {label: ["device:srl1"]}. To AND-combine selectors, list more values in the same array; Prometheus sends each value as a repeated label query parameter.
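Before editing the scrape config, you can fetch the same filtered URL Prometheus would build from params and check what the job will see. A minimal sketch: `sonda_metrics_url` is a hypothetical helper, role:leaf below is a hypothetical label, and the selectors are assumed to be URL-safe (letters, digits, and characters such as `-`, `_`, `.`, `:`), so no percent-encoding is done.

```shell
#!/usr/bin/env bash
# Build the aggregate scrape URL with one label=key:value query parameter
# per selector. Several selectors become repeated parameters (AND-combined).
# Usage: sonda_metrics_url <base-url> [key:value ...]
sonda_metrics_url() {
  local base="$1" url sep="?" sel
  shift
  url="${base%/}/scenarios/metrics"   # tolerate a trailing slash on the base
  for sel in "$@"; do
    url+="${sep}label=${sel}"
    sep="&"
  done
  printf '%s\n' "$url"
}
```

For example, `curl -fsS "$(sonda_metrics_url http://localhost:8080 device:srl1 role:leaf)"` returns only the series that carry both labels.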

Per-scenario scrape config

If you want one scrape job per scenario, point metrics_path at the scenario ID. This works when each scenario is its own logical target:

prometheus-scrape.yaml
scrape_configs:
  - job_name: sonda-id
    scrape_interval: 15s
    metrics_path: /scenarios/<SCENARIO_ID>/metrics
    static_configs:
      - targets: ["sonda.default.svc:8080"]

Replace <SCENARIO_ID> with the UUID returned by POST /scenarios.

Prometheus ServiceMonitor

If you use the Prometheus Operator (kube-prometheus-stack), create a ServiceMonitor to auto-discover sonda-server. The Sonda Helm chart does not include a ServiceMonitor template today. Create one manually:

sonda-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sonda
  labels:
    release: prometheus  # must match your Prometheus Operator's selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: sonda
  endpoints:
    - port: http
      interval: 15s
      path: /scenarios/<SCENARIO_ID>/metrics

Apply it:

kubectl apply -f sonda-servicemonitor.yaml

One path per ServiceMonitor endpoint

Each ServiceMonitor endpoint scrapes a single path. To cover every running scenario with one entry, set path: /scenarios/metrics, the aggregate endpoint. If you need per-scenario paths, add one endpoints entry per scenario ID and regenerate the list with a script that reads GET /scenarios whenever scenarios change.
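For the per-scenario case, the endpoints list can be generated from a list of IDs. This is a minimal sketch: `render_endpoints` is a hypothetical helper, and it assumes the Service port is named http and uses the same indentation as the ServiceMonitor manifest above.

```shell
#!/usr/bin/env bash
# Read scenario IDs (one per line) on stdin and print a ServiceMonitor
# "endpoints:" block with one entry per ID. Blank lines are ignored.
# Arguments: Service port name (default http), scrape interval (default 15s).
render_endpoints() {
  local port="${1:-http}" interval="${2:-15s}" id
  echo "  endpoints:"
  # "|| [[ -n $id ]]" also handles a last line without a trailing newline.
  while IFS= read -r id || [[ -n "$id" ]]; do
    [[ -z "$id" ]] && continue
    printf '    - port: %s\n      interval: %s\n      path: /scenarios/%s/metrics\n' \
      "$port" "$interval" "$id"
  done
}
```

To feed it, extract the IDs with jq, for example `curl -fsS http://localhost:8080/scenarios | jq -r '(.scenarios? // .)[] | .id' | render_endpoints`, and append the output to the manifest header (apiVersion through the selector) before running kubectl apply. The jq filter is written to accept either a bare JSON array or an object with a scenarios array, since it is an assumption which shape your server version returns.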

Using vmagent instead of Prometheus

vmagent supports the same scrape_configs format. Point it at sonda-server with a static scrape config. If you already run the VictoriaMetrics Docker Compose stack, add sonda-server as a scrape target in the vmagent config.

Build Grafana dashboards

When Prometheus is scraping the synthetic metrics, you can visualize them in Grafana.

Sonda includes a Sonda Overview dashboard at docker/grafana/dashboards/sonda-overview.json. It shows metric values, event rates, and gap or burst indicators. Import it into any Grafana instance connected to a Prometheus-compatible datasource.

Import the included dashboard

  1. Open Grafana and go to Dashboards > Import.
  2. Upload docker/grafana/dashboards/sonda-overview.json or paste its contents.
  3. Select your Prometheus datasource when prompted.
  4. The dashboard uses template variables $datasource and $job. Set $job to sonda or the job label your scenarios use.

Build a custom panel

For a focused monitoring panel, create a new dashboard with a time series visualization. Query your synthetic metric directly:

continuous_cpu{job="sonda", instance="api-server-01"}

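continuous_cpu is a gauge, so rate() does not apply to it, and the scrape endpoints expose current values rather than Sonda's emission rate (read that from current_rate in the stats API). To confirm that samples keep arriving, add a second panel that counts the samples Prometheus stored per minute. With a 15s scrape interval it holds at about 4 per series and drops when scrapes fail:

count_over_time(continuous_cpu{job="sonda"}[1m])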

Threshold lines

Add a fixed threshold line in the Grafana panel options. For example, set it to 90 for a CPU alert threshold. This gives you a visual reference for when the sine wave crosses the alert boundary.

With dashboards in place, you can see the synthetic data flowing. The next section covers how to monitor Sonda itself.

Monitor sonda-server health

The stats API tells you whether each scenario is emitting as expected. Poll it periodically or build monitoring around it.

Health endpoint

This is the simplest check. Kubernetes already uses it for liveness and readiness probes:

curl http://localhost:8080/health
# {"status":"ok"}

Per-scenario stats

GET /scenarios/{id}/stats returns live stats. Fields include event counts, current emission rate, bytes emitted, error counts, and gap or burst state:

curl -s http://localhost:8080/scenarios/$ID/stats | jq .

Key fields to watch:

Field | What it tells you
total_events | Running count of emitted events. Should increase steadily. For batching sinks (loki, http_push, remote_write, otlp_grpc, kafka), this counts buffered writes, not deliveries; use the fields below to confirm data is actually arriving.
current_rate | Actual emission rate. Compare it against the scenario's configured rate.
errors | Error count. Should be 0 for healthy scenarios.
uptime | Time since the scenario started. Confirms it has not restarted.
last_successful_write_at | Wall-clock time (Unix nanoseconds) of the most recent successful delivery. null means nothing has ever arrived; a stale value means the sink is stuck.
consecutive_failures | Failure streak since the last successful delivery. Resets to 0 on the next successful flush. A non-zero value together with a stale last_successful_write_at is the stuck-sink signature.
total_sink_failures | Lifetime sink-error count, monotonic. Useful as a Prometheus alert input, for example increase(...[5m]).

The full reference for these fields, including last_sink_error text and the state, gap, and burst flags, is in Self-observability via /stats.
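The stuck-sink signature described in the table can be checked from a shell. This is a minimal sketch that approximates that rule; it is not Sonda's own computation of degraded, and the 30-second default only mirrors the threshold this page gives for degraded. `sink_state` is a hypothetical helper, and the current time is passed in explicitly so the check is deterministic.

```shell
#!/usr/bin/env bash
# Classify a scenario's delivery health from two /stats fields.
#   $1  last_successful_write_at (Unix ns, or "null" if never delivered)
#   $2  consecutive_failures
#   $3  current time in Unix ns
#   $4  staleness threshold in seconds (default 30)
# Prints "never", "stuck", or "ok".
sink_state() {
  local last="$1" failures="$2" now="$3" threshold="${4:-30}" age_secs
  if [[ -z "$last" || "$last" == "null" ]]; then
    echo never          # nothing has ever been delivered
    return 0
  fi
  age_secs=$(( (now - last) / 1000000000 ))
  if (( failures > 0 && age_secs > threshold )); then
    echo stuck          # failing now, and the last success is stale
  else
    echo ok
  fi
}
```

Usage, assuming both fields sit at the top level of the stats JSON: `read -r last failures < <(curl -fsS "http://localhost:8080/scenarios/$ID/stats" | jq -r '"\(.last_successful_write_at) \(.consecutive_failures)"')`, then `sink_state "$last" "$failures" "$(date +%s%N)"`. `date +%N` is a GNU extension; on macOS, pass `$(( $(date +%s) * 1000000000 ))` instead.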

If you only check one signal across the whole server, check degraded on GET /scenarios. It combines the three sink-failure fields above into a single boolean per scenario. The value is true when delivery has stalled for more than 30 seconds. The script below uses it directly.

List all scenarios

Check that all submitted scenarios are still running:

curl -s http://localhost:8080/scenarios | jq '.[] | {name, status}'

If a scenario shows status: "stopped" unexpectedly, submit it again.

Scripting a health check

Wrap the check in a script that fails when any scenario stops delivering. Read degraded from GET /scenarios rather than summing total_events: a stuck batching sink still increments total_events for buffered writes even though nothing reaches the backend.

check-sonda.sh
#!/bin/bash
set -euo pipefail
SONDA_URL="${SONDA_URL:-http://localhost:8080}"

# Pull the list once and read the precomputed degraded flag per scenario.
bad=$(curl -fsS "$SONDA_URL/scenarios" |
      jq -r '.scenarios[] | select(.degraded) | "\(.name) (\(.id))"')

if [[ -n "$bad" ]]; then
  echo "Degraded scenarios:"
  echo "$bad"
  exit 1
fi

echo "All scenarios delivering."

The non-zero exit code makes this a drop-in for a cron alert, a CI smoke step, or a Kubernetes exec probe (the probing container needs bash, curl, and jq). If you need the raw counters (per-scenario rate, failure streak, last delivery timestamp), follow up with GET /scenarios/$id/stats for each degraded ID.

Rotate scenarios

Test patterns change over time. You might start with a sine wave to validate dashboards, then switch to a sequence generator to test alert thresholds. To rotate, stop the old scenario and start a new one.

Stop and replace

# Stop the running scenario
curl -s -X DELETE http://localhost:8080/scenarios/$ID | jq .
# {"id":"...","status":"stopped","total_events":12345}

# Submit a new scenario
NEW_ID=$(curl -s -X POST -H "Content-Type: text/yaml" \
  --data-binary @examples/sequence-alert-test.yaml \
  http://localhost:8080/scenarios | jq -r '.id')

echo "New scenario: $NEW_ID"

Scrape config update required

When you replace a scenario, the new scenario gets a different UUID. If your Prometheus scrape config uses the scenario ID in metrics_path, update it to the new ID.
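The update can be scripted with sed. This is a minimal sketch: `update_scenario_id` is a hypothetical helper, it rewrites every per-scenario metrics_path in the file (so it assumes a single per-scenario job), and it assumes the new ID contains no `#`, `&`, or `\` characters, which holds for UUIDs. The aggregate path /scenarios/metrics is left untouched because it has no ID segment.

```shell
#!/usr/bin/env bash
# Point every "metrics_path: /scenarios/<id>/metrics" line in a scrape config
# at a new scenario ID. sed writes to a temp file (portable across GNU and
# BSD sed, unlike -i), then the result is copied back so the original file
# keeps its permissions.
# Arguments: config file, new scenario ID.
update_scenario_id() {
  local file="$1" new_id="$2" tmp rc=0
  tmp=$(mktemp) || return 1
  sed -E "s#(metrics_path: /scenarios/)[^/]+(/metrics)#\1${new_id}\2#" "$file" > "$tmp" \
    && cat "$tmp" > "$file" || rc=1
  rm -f "$tmp"
  return "$rc"
}
```

For example, `update_scenario_id prometheus-scrape.yaml "$NEW_ID"`, then have Prometheus reload its configuration (send it SIGHUP, or POST to /-/reload if it runs with --web.enable-lifecycle). A job on the aggregate endpoint avoids this step entirely because its path never changes.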

Scripted rotation

For scheduled rotations, wrap the stop-and-start sequence in a cron job or Kubernetes CronJob, for example to run different patterns during business hours and overnight.

rotate-scenario.sh
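#!/bin/bash
# Stop every running scenario, then submit the scenario file given as $1.
set -euo pipefail
SONDA_URL="${SONDA_URL:-http://localhost:8080}"
SCENARIO_FILE="${1:?usage: rotate-scenario.sh <scenario.yaml>}"

# Stop all running scenarios.
# (.scenarios? // .) accepts a bare JSON array or a {"scenarios": [...]} object.
ids=$(curl -fsS "$SONDA_URL/scenarios" | jq -r '(.scenarios? // .)[] | .id')
for id in $ids; do
  curl -fsS -X DELETE "$SONDA_URL/scenarios/$id" > /dev/null
done

# Start the new scenario and print the server's response
curl -fsS -X POST -H "Content-Type: text/yaml" \
  --data-binary "@$SCENARIO_FILE" \
  "$SONDA_URL/scenarios" | jq .

# Rotate to a new pattern
./rotate-scenario.sh examples/long-running-metrics.yaml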
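For the business-hours versus overnight split, a small selector can choose which file to pass to rotate-scenario.sh. A minimal sketch: `pick_scenario` is a hypothetical helper, and the 09:00 to 17:59 window and the file-to-period mapping are example choices, not recommendations.

```shell
#!/usr/bin/env bash
# Choose a scenario file by hour of day.
# Argument: hour as printed by `date +%H` (00-23, possibly with a leading zero).
pick_scenario() {
  local hour=$((10#$1))   # force base 10 so "08" and "09" are not parsed as octal
  if (( hour >= 9 && hour < 18 )); then
    echo examples/long-running-metrics.yaml
  else
    echo examples/sequence-alert-test.yaml
  fi
}
```

Usage: `./rotate-scenario.sh "$(pick_scenario "$(date +%H)")"`, scheduled at 09:00 and 18:00. The `10#` prefix matters because bash arithmetic otherwise rejects "08" and "09" as invalid octal numbers.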

Alert on Sonda itself

Synthetic monitoring is only useful if you know when it breaks. If Sonda stops emitting, your dashboards go silent, and you need to tell "Sonda died" apart from a real outage.

Detect missing synthetic data

Create an alert rule that fires when the synthetic metric disappears. The rule uses the absent() function in PromQL:

sonda-watchdog-rules.yaml
groups:
  - name: sonda-watchdog
    interval: 30s
    rules:
      - alert: SondaSyntheticDataMissing
        expr: absent(continuous_cpu{job="sonda"})
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Synthetic monitoring data missing"
          description: >
            The metric continuous_cpu from Sonda has not been seen for 2 minutes.
            Either sonda-server is down or the scenario has stopped.

The rule fires if continuous_cpu{job="sonda"} has not been scraped for 2 minutes. Adjust the for: duration to match your scrape interval and tolerance for gaps.
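When choosing the for: value, it helps to know roughly how long the rule takes to fire after the series disappears. A rough upper bound is one scrape interval (until the next scrape notices the series is gone and Prometheus marks it stale), plus one evaluation interval (until the rule sees absent() as true and goes pending), plus for: rounded up to whole evaluation intervals (the rule fires at the first evaluation after it has been pending that long). This is a back-of-the-envelope sketch; `absent_alert_worst_case` is a hypothetical helper, and Alertmanager grouping delays such as group_wait come on top.

```shell
#!/usr/bin/env bash
# Rough upper bound (seconds) from the series disappearing to the absent()
# rule reaching the firing state in Prometheus.
# Arguments: scrape interval, evaluation interval, for: duration (seconds).
absent_alert_worst_case() {
  local scrape="$1" eval_interval="$2" for_secs="$3" pending
  # Round for: up to a whole number of evaluation intervals.
  pending=$(( (for_secs + eval_interval - 1) / eval_interval * eval_interval ))
  echo $(( scrape + eval_interval + pending ))
}
```

With this page's settings (15s scrape interval, 30s rule group interval, for: 2m), `absent_alert_worst_case 15 30 120` gives 165 seconds, which is how long to wait when you delete a scenario to test the alert.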

Monitor the pod

sonda-server runs as a Kubernetes Deployment with health probes. A rule on the kube-state-metrics series kube_pod_status_ready covers pod-level failures:

- alert: SondaPodNotReady
  expr: kube_pod_status_ready{pod=~"sonda.*", condition="true"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Sonda pod is not ready"

Layer your alerting

A robust setup uses both layers:

Layer | What it catches | Alert
Pod health | Server crash, OOM kill, image pull failure | SondaPodNotReady
Metric presence | Scenario stopped, scrape misconfigured, data pipeline broken | SondaSyntheticDataMissing

The pod alert points to an infrastructure problem with sonda-server itself. The metric-absent alert fires when the data pipeline is broken anywhere between Sonda and Prometheus, which is the case synthetic monitoring exists to detect.

Testing these alerts with Sonda

You can validate these watchdog rules using the patterns from Alert Testing and Alerting Pipeline. Submit a scenario, verify the alert stays silent, then DELETE the scenario and watch the absent() alert fire.

Quick reference

Task | Command
Deploy sonda-server | helm install sonda ./helm/sonda
Submit a scenario | curl -X POST -H "Content-Type: text/yaml" --data-binary @scenario.yaml http://localhost:8080/scenarios
List running scenarios | curl http://localhost:8080/scenarios
Check scenario stats | curl http://localhost:8080/scenarios/<id>/stats
Scrape all scenarios | curl http://localhost:8080/scenarios/metrics
Scrape one scenario | curl http://localhost:8080/scenarios/<id>/metrics
Stop a scenario | curl -X DELETE http://localhost:8080/scenarios/<id>
Health check | curl http://localhost:8080/health