edge Intermediate 20 min read

Health Monitoring and Prometheus Export

Deep dive into the /health and /metrics endpoints, Prometheus export, Grafana dashboards, and Alertmanager rules for the Pyvorin Edge Agent.

Published Jun 2, 2026

Introduction

Every edge deployment needs observability. The Pyvorin Edge Agent exposes two built-in HTTP endpoints, /health and /metrics, that provide a real-time view of pipeline health, resource utilisation, and cloud sync state. This article explains the structure of both endpoints, shows how to convert the JSON output into the Prometheus exposition format, provides a complete Grafana dashboard, and supplies ready-to-use alerting rules for Prometheus and Alertmanager.

The /health Endpoint

The /health endpoint is served by _HealthHandler in edge_runtime/pyv_edge_agent/main.py. It returns a single JSON document with nested objects for agent state, system metrics, cloud queue depth, privacy configuration, and ingest adapters.


curl -s http://localhost:8080/health | python3 -m json.tool
  

A typical response looks like this:


{
  "status": "healthy",
  "timestamp": 1717000000.0,
  "metrics": {
    "cpu_percent": 12.5,
    "ram_percent": 34.0,
    "disk_percent": 45.2,
    "thermal_celsius": 42.0,
    "uptime_seconds": 86400.0,
    "timestamp": 1717000000.0
  },
  "agent": {
    "running": true,
    "buffer_count": 4,
    "readings_processed": 150000,
    "events_triggered": 23
  },
  "cloud": {
    "queue_depth": 12,
    "last_flush_time": 1716999900.0,
    "messages_sent_today": 1440,
    "endpoint": "https://api.pyvorin.com/v1/ingest"
  },
  "privacy": {
    "enabled": true,
    "rules_active": 3,
    "fields_redacted": ["patient_id"],
    "fields_hashed": ["device_uuid"]
  },
  "ingest": {
    "adapters_connected": ["simulator", "mqtt"],
    "devices_configured": 4
  }
}
  
| Key | Source | Description |
|---|---|---|
| status | EdgeAgent.is_running | "healthy" if the agent loop is active. |
| metrics | SystemMetrics.to_dict() | CPU, RAM, disk, thermal, and uptime. |
| agent.buffer_count | len(self._buffers) | Number of active ring buffers. |
| agent.readings_processed | self._readings_processed | Lifetime counter of ingested readings. |
| agent.events_triggered | self._events_triggered | Lifetime counter of fired rule events. |
| cloud.queue_depth | CloudSyncQueue.pending_count() | Items waiting for upstream upload. |
| cloud.messages_sent_today | self._cloud.messages_sent_today | Daily egress counter (resets at midnight). |
| privacy.rules_active | len(self._privacy.rules) | Number of privacy rules currently loaded. |
| ingest.adapters_connected | self._adapter_types.values() | List of active adapter type names. |
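
The fields above can also be checked programmatically, for example from a watchdog script. The following is a minimal sketch, not part of the agent: the function names and the two thresholds (max_queue_depth, max_flush_age) are hypothetical, and it assumes that last_flush_time and timestamp are both Unix epoch seconds, as the sample response suggests.

```python
import json
import time
import urllib.request
from typing import List

HEALTH_URL = "http://localhost:8080/health"  # same address as the curl example


def fetch_health(url: str = HEALTH_URL, timeout: float = 5.0) -> dict:
    """Fetch and decode the /health JSON document."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def find_problems(
    health: dict,
    max_queue_depth: int = 1000,   # hypothetical threshold
    max_flush_age: float = 600.0,  # hypothetical threshold, in seconds
) -> List[str]:
    """Return human-readable problems found in a /health document."""
    problems: List[str] = []

    if health.get("status") != "healthy":
        problems.append(f"status is {health.get('status')!r}")

    if not health.get("agent", {}).get("running", False):
        problems.append("agent loop is not running")

    cloud = health.get("cloud", {})
    depth = cloud.get("queue_depth", 0)
    if depth > max_queue_depth:
        problems.append(f"cloud queue depth {depth} exceeds {max_queue_depth}")

    last_flush = cloud.get("last_flush_time")
    if last_flush is not None:
        # Prefer the document's own timestamp so local clock skew does not matter.
        reference = health.get("timestamp", time.time())
        age = reference - last_flush
        if age > max_flush_age:
            problems.append(f"last cloud flush was {age:.0f}s ago")

    return problems
```

Passing the sample response shown earlier to find_problems returns an empty list, because the status is "healthy", the agent is running, the queue holds 12 items, and the last flush is 100 seconds older than the document timestamp.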

The /metrics Endpoint

The /metrics endpoint returns the raw output of SystemMetrics().to_dict() from edge_runtime/pyv_edge_agent/health_monitor/metrics.py. This is the lowest-overhead way to pull system telemetry because it bypasses the agent state object entirely.


curl -s http://localhost:8080/metrics | python3 -m json.tool
  

{
  "cpu_percent": 12.5,
  "ram_percent": 34.0,
  "disk_percent": 45.2,
  "thermal_celsius": 42.0,
  "uptime_seconds": 86400.0,
  "timestamp": 1717000000.0
}
  

Prometheus Metrics Export Format

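Prometheus scrapes its own text exposition format and cannot ingest the JSON that /metrics returns. You need a small bridge script that fetches /metrics and translates the dictionary into that format. The script below writes the result to a .prom file for the node_exporter textfile collector, so node_exporter must be started with --collector.textfile.directory pointing at the same directory. Each run performs a single fetch; schedule it from cron, or wrap it in a loop if you run it as a sidecar.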


#!/usr/bin/env python3
"""Prometheus bridge for Pyvorin Edge /metrics."""

import json
import urllib.request
from pathlib import Path

METRICS_URL = "http://localhost:8080/metrics"
OUTPUT_PATH = Path("/var/lib/node_exporter/textfile_collector/pyvorin_edge.prom")

PROM_TEMPLATE = """\
# HELP pyvorin_edge_cpu_percent CPU utilisation percentage.
# TYPE pyvorin_edge_cpu_percent gauge
pyvorin_edge_cpu_percent {cpu_percent}
# HELP pyvorin_edge_ram_percent RAM utilisation percentage.
# TYPE pyvorin_edge_ram_percent gauge
pyvorin_edge_ram_percent {ram_percent}
# HELP pyvorin_edge_disk_percent Disk utilisation percentage.
# TYPE pyvorin_edge_disk_percent gauge
pyvorin_edge_disk_percent {disk_percent}
# HELP pyvorin_edge_thermal_celsius SoC temperature in Celsius.
# TYPE pyvorin_edge_thermal_celsius gauge
pyvorin_edge_thermal_celsius {thermal_celsius}
# HELP pyvorin_edge_uptime_seconds System uptime in seconds.
# TYPE pyvorin_edge_uptime_seconds gauge
pyvorin_edge_uptime_seconds {uptime_seconds}
"""


def fetch():
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


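def write_prom(data: dict):
    OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    body = PROM_TEMPLATE.format(
        cpu_percent=data.get("cpu_percent", 0.0),
        ram_percent=data.get("ram_percent", 0.0),
        disk_percent=data.get("disk_percent", 0.0),
        thermal_celsius=data.get("thermal_celsius", 0.0),
        uptime_seconds=data.get("uptime_seconds", 0.0),
    )
    # Write to a temporary file and rename it into place so node_exporter
    # never reads a half-written file. The collector only parses *.prom files.
    tmp_path = OUTPUT_PATH.with_name(OUTPUT_PATH.name + ".tmp")
    tmp_path.write_text(body, encoding="utf-8")
    tmp_path.replace(OUTPUT_PATH)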


if __name__ == "__main__":
    write_prom(fetch())
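
The template above repeats three lines per metric and formats whatever value it receives, so a null reading would be written as the literal text None, which is not valid exposition format. One possible alternative, sketched below, is a table-driven renderer that skips values that are not finite numbers. The names METRIC_SPECS and render_exposition are hypothetical, and whether the agent ever returns null for a field (for example thermal_celsius on hardware without a sensor) is an assumption the source does not confirm.

```python
import math
from typing import List

# Help text and Prometheus type for each key of the /metrics JSON document.
# Uptime is exported as a gauge in this sketch.
METRIC_SPECS = {
    "cpu_percent": ("CPU utilisation percentage.", "gauge"),
    "ram_percent": ("RAM utilisation percentage.", "gauge"),
    "disk_percent": ("Disk utilisation percentage.", "gauge"),
    "thermal_celsius": ("SoC temperature in Celsius.", "gauge"),
    "uptime_seconds": ("System uptime in seconds.", "gauge"),
}


def render_exposition(data: dict, prefix: str = "pyvorin_edge_") -> str:
    """Render the known keys of `data` as Prometheus text exposition format."""
    lines: List[str] = []
    for key, (help_text, metric_type) in METRIC_SPECS.items():
        value = data.get(key)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        if not math.isfinite(value):
            continue
        name = prefix + key
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {float(value)}")
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

Skipping a missing reading leaves a gap in the time series instead of recording a misleading 0.0, which keeps threshold alerts and dashboards honest about absent data.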
  

Complete Grafana Dashboard JSON

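Import the following dashboard into Grafana. It assumes Prometheus is scraping the textfile metrics above and that a second, similar bridge polls /health and exposes cloud.queue_depth as pyvorin_edge_queue_depth. The outer "dashboard" key is the envelope used by Grafana's HTTP API (POST /api/dashboards/db); when importing through the UI, you may need to paste only the inner object.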


{
  "dashboard": {
    "id": null,
    "title": "Pyvorin Edge Health",
    "tags": ["edge", "pyvorin"],
    "timezone": "utc",
    "panels": [
      {
        "id": 1,
        "title": "CPU %",
        "type": "stat",
        "targets": [
          {
            "expr": "pyvorin_edge_cpu_percent",
            "legendFormat": "CPU"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 85}
              ]
            }
          }
        },
        "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "RAM %",
        "type": "stat",
        "targets": [
          {
            "expr": "pyvorin_edge_ram_percent",
            "legendFormat": "RAM"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 85}
              ]
            }
          }
        },
        "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0}
      },
      {
        "id": 3,
        "title": "SoC Temperature",
        "type": "stat",
        "targets": [
          {
            "expr": "pyvorin_edge_thermal_celsius",
            "legendFormat": "°C"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "celsius",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 60},
                {"color": "red", "value": 75}
              ]
            }
          }
        },
        "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0}
      },
      {
        "id": 4,
        "title": "Cloud Queue Depth",
        "type": "timeseries",
        "targets": [
          {
            "expr": "pyvorin_edge_queue_depth",
            "legendFormat": "Pending items"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "custom": {"drawStyle": "line"}
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4}
      },
      {
        "id": 5,
        "title": "Disk Usage %",
        "type": "gauge",
        "targets": [
          {
            "expr": "pyvorin_edge_disk_percent",
            "legendFormat": "Disk"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 85}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4}
      }
    ]
  }
}
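
The Cloud Queue Depth panel depends on pyvorin_edge_queue_depth, which the /metrics bridge does not produce. A minimal sketch of the second bridge's rendering step follows, assuming the /health layout shown earlier. The function name health_to_prom and the two *_total metric names are hypothetical; only pyvorin_edge_queue_depth is used elsewhere in this article.

```python
from typing import List


def health_to_prom(health: dict) -> str:
    """Render selected /health fields as Prometheus text exposition format."""
    cloud = health.get("cloud", {})
    agent = health.get("agent", {})

    # (metric name, type, help text, value)
    samples = [
        ("pyvorin_edge_queue_depth", "gauge",
         "Items waiting for upstream upload.",
         cloud.get("queue_depth")),
        # Hypothetical names for the two lifetime counters.
        ("pyvorin_edge_readings_processed_total", "counter",
         "Lifetime count of ingested readings.",
         agent.get("readings_processed")),
        ("pyvorin_edge_events_triggered_total", "counter",
         "Lifetime count of fired rule events.",
         agent.get("events_triggered")),
    ]

    lines: List[str] = []
    for name, metric_type, help_text, value in samples:
        # Skip fields that are absent or not numeric (bool is an int subclass).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "".join(line + "\n" for line in lines)
```

The returned text can be written to its own .prom file in the textfile collector directory, in the same way the /metrics bridge writes pyvorin_edge.prom.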
  

Alertmanager Rules

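The following Prometheus alerting rules trigger on resource exhaustion, thermal throttling risk, and cloud sync backlog. Prometheus evaluates them and forwards firing alerts to Alertmanager, which handles routing and notification. Save them as /etc/prometheus/alerts/pyvorin_edge.yml and list that path under rule_files in prometheus.yml.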


groups:
  - name: pyvorin_edge
    rules:
      - alert: EdgeHighCPU
        expr: pyvorin_edge_cpu_percent > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU has been above 85% for more than 5 minutes."

      - alert: EdgeHighRAM
        expr: pyvorin_edge_ram_percent > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High RAM on {{ $labels.instance }}"
          description: "RAM usage is above 90%. OOM kills are likely."

      - alert: EdgeHighThermal
        expr: pyvorin_edge_thermal_celsius > 75
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Thermal throttling risk on {{ $labels.instance }}"
          description: "SoC temperature is above 75 °C. Performance will degrade."

      - alert: EdgeDiskFull
        expr: pyvorin_edge_disk_percent > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk filling on {{ $labels.instance }}"
          description: "Disk usage is above 85%. SQLite WAL may fail to grow."

      - alert: EdgeSyncBacklog
        expr: pyvorin_edge_queue_depth > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cloud sync backlog on {{ $labels.instance }}"
          description: "More than 1000 items are queued. Check connectivity."
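
Each rule above combines a threshold with a for: duration, which means a brief spike does not page anyone. The sketch below is a simplified model of that behaviour for a single series, not Prometheus code: the alert becomes pending at the first evaluation where the expression is true, fires once it has stayed true for at least the for: duration, and resets as soon as one evaluation is false. The function name is_firing is hypothetical.

```python
from typing import Iterable, Optional, Tuple


def is_firing(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    for_seconds: float,
) -> bool:
    """Return True if the alert would be firing after the last sample.

    `samples` holds (timestamp_seconds, value) pairs in evaluation order.
    """
    active_since: Optional[float] = None
    firing = False
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # alert enters the pending state
            firing = ts - active_since >= for_seconds
        else:
            # A single false evaluation clears the pending or firing state.
            active_since = None
            firing = False
    return firing


# CPU held at 90% and evaluated once a minute for six minutes.
steady = [(t, 90.0) for t in range(0, 361, 60)]
# Same series with one dip below the threshold at the two-minute mark.
dipped = [(0, 90.0), (60, 90.0), (120, 80.0), (180, 90.0),
          (240, 90.0), (300, 90.0), (360, 90.0)]

print(is_firing(steady, threshold=85, for_seconds=300))
print(is_firing(dipped, threshold=85, for_seconds=300))
```

With EdgeHighCPU's settings (threshold 85, for: 5m), the steady series is firing by the last sample, while the dipped series is still pending because the five-minute clock restarted after the dip.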
  

SystemMetrics API Usage

If you need to collect metrics inside your own Python script rather than via HTTP, use the SystemMetrics class directly.


from pyv_edge_agent.health_monitor.metrics import SystemMetrics, MetricsSnapshot

metrics = SystemMetrics()

# Individual accessors
print(f"CPU:   {metrics.cpu_percent():.1f}%")
print(f"RAM:   {metrics.ram_percent():.1f}%")
print(f"Disk:  {metrics.disk_percent('/var/lib/pyvorin'):.1f}%")
print(f"Thermal: {metrics.thermal_celsius()}°C")
print(f"Uptime: {metrics.uptime_seconds():.0f}s")

# Full snapshot
snapshot: MetricsSnapshot = metrics.snapshot()
print(snapshot.to_dict())
  

Summary

You now have full visibility into the Edge Agent's health. The /health endpoint gives you operational state, /metrics gives you system telemetry, the Prometheus bridge converts JSON into scrapable text format, and the Grafana dashboard plus Alertmanager rules turn raw numbers into actionable alerts.