"Health Monitoring and Prometheus Export"

Introduction

Tested with: Python 3.12.3, GCC 13.3.0, Pyvorin Edge SDK 1.0.5-edge, Ubuntu 24.04 LTS (x86_64 & ARM64). Run python3 --version and gcc --version to verify your environment.

Every edge deployment needs observability. The Pyvorin Edge Agent exposes two built-in HTTP endpoints—/health and /metrics—that provide a real-time view of pipeline health, resource utilisation, and cloud sync state. This article explains the structure of both endpoints, shows how to convert the JSON output into Prometheus exposition format, provides a complete Grafana dashboard, and supplies ready-to-use Alertmanager rules.

The /health Endpoint

The /health endpoint is served by _HealthHandler in edge_runtime/pyv_edge_agent/main.py. It returns a single JSON document with nested objects for agent state, system metrics, cloud queue depth, privacy configuration, and ingest adapters.


curl -s http://localhost:8080/health | python3 -m json.tool

A typical response looks like this:


{
  "status": "healthy",
  "timestamp": 1717000000.0,
  "metrics": {
    "cpu_percent": 12.5,
    "ram_percent": 34.0,
    "disk_percent": 45.2,
    "thermal_celsius": 42.0,
    "uptime_seconds": 86400.0,
    "timestamp": 1717000000.0
  },
  "agent": {
    "running": true,
    "buffer_count": 4,
    "readings_processed": 150000,
    "events_triggered": 23
  },
  "cloud": {
    "queue_depth": 12,
    "last_flush_time": 1716999900.0,
    "messages_sent_today": 1440,
    "endpoint": "https://api.pyvorin.com/v1/ingest"
  },
  "privacy": {
    "enabled": true,
    "rules_active": 3,
    "fields_redacted": ["patient_id"],
    "fields_hashed": ["device_uuid"]
  },
  "ingest": {
    "adapters_connected": ["simulator", "mqtt"],
    "devices_configured": 4
  }
}

Key	Source	Description
`status`	`EdgeAgent.is_running`	`"healthy"` if the agent loop is active.
`metrics`	`SystemMetrics.to_dict()`	CPU, RAM, disk, thermal, and uptime.
`agent.buffer_count`	`len(self._buffers)`	Number of active ring buffers.
`agent.readings_processed`	`self._readings_processed`	Lifetime counter of ingested readings.
`agent.events_triggered`	`self._events_triggered`	Lifetime counter of fired rule events.
`cloud.queue_depth`	`CloudSyncQueue.pending_count()`	Items waiting for upstream upload.
`cloud.messages_sent_today`	`self._cloud.messages_sent_today`	Daily egress counter (resets at midnight).
`privacy.rules_active`	`len(self._privacy.rules)`	Number of privacy rules currently loaded.
`ingest.adapters_connected`	`self._adapter_types.values()`	List of active adapter type names.
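
To act on this response from a script, it can help to flatten the nested fields you care about first. The sketch below is one way to do that with only the standard library. The helpers `fetch_health`, `flatten_health`, and `is_degraded` are hypothetical names and not part of the SDK, and the 1000-item backlog limit is an illustrative value rather than a recommended default.

```python
import json
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # same address as the curl example


def fetch_health(url: str = HEALTH_URL, timeout: float = 5.0) -> dict:
    """Fetch and decode the /health JSON document."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def flatten_health(doc: dict) -> dict:
    """Pick a few nested fields out of a /health document (hypothetical helper)."""
    return {
        "status": doc.get("status"),
        "running": doc.get("agent", {}).get("running"),
        "queue_depth": doc.get("cloud", {}).get("queue_depth"),
        "adapters": doc.get("ingest", {}).get("adapters_connected", []),
    }


def is_degraded(flat: dict, max_queue: int = 1000) -> bool:
    """Treat a stopped agent or a large upload backlog as degraded."""
    if flat["status"] != "healthy" or not flat["running"]:
        return True
    depth = flat["queue_depth"]
    return depth is not None and depth > max_queue


# Trimmed copy of the sample response above, so the example runs offline.
sample = {
    "status": "healthy",
    "agent": {"running": True, "buffer_count": 4},
    "cloud": {"queue_depth": 12},
    "ingest": {"adapters_connected": ["simulator", "mqtt"]},
}
flat = flatten_health(sample)
print(flat)
print("degraded:", is_degraded(flat))
```

Against a live agent, replace `sample` with `fetch_health()`. Missing sections are tolerated by design: a document without an `agent` object is reported as degraded rather than raising a `KeyError`.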

The /metrics Endpoint

The /metrics endpoint returns the raw output of SystemMetrics().to_dict() from edge_runtime/pyv_edge_agent/health_monitor/metrics.py. This is the lowest-overhead way to pull system telemetry because it bypasses the agent state object entirely.


curl -s http://localhost:8080/metrics | python3 -m json.tool


{
  "cpu_percent": 12.5,
  "ram_percent": 34.0,
  "disk_percent": 45.2,
  "thermal_celsius": 42.0,
  "uptime_seconds": 86400.0,
  "timestamp": 1717000000.0
}
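
Because the payload carries its own `timestamp`, a consumer can check that a reading is complete and recent before using it. The following is a minimal sketch of such a check. It assumes that `timestamp` is Unix epoch seconds (the sample value suggests this, but the source does not state it), and `missing_keys`, `is_stale`, and the 60-second limit are hypothetical choices for illustration.

```python
import time

# Keys shown in the sample /metrics response above.
EXPECTED_KEYS = {
    "cpu_percent",
    "ram_percent",
    "disk_percent",
    "thermal_celsius",
    "uptime_seconds",
    "timestamp",
}


def missing_keys(payload: dict) -> set:
    """Return the expected keys that are absent from a /metrics payload."""
    return EXPECTED_KEYS - set(payload)


def is_stale(payload: dict, max_age_seconds: float = 60.0, now=None) -> bool:
    """Return True if the payload has no timestamp or is older than max_age_seconds.

    Assumes "timestamp" is Unix epoch seconds.
    """
    current = time.time() if now is None else now
    ts = payload.get("timestamp")
    return ts is None or (current - ts) > max_age_seconds


payload = {
    "cpu_percent": 12.5,
    "ram_percent": 34.0,
    "disk_percent": 45.2,
    "thermal_celsius": 42.0,
    "uptime_seconds": 86400.0,
    "timestamp": 1717000000.0,
}
print(missing_keys(payload))
print(is_stale(payload, now=1717000030.0))
```

Passing `now` explicitly keeps the check deterministic in tests; in normal use it can be omitted so the current time is used.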

Prometheus Metrics Export Format

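Prometheus scrapes only its own text exposition format, so it cannot consume the JSON returned by /metrics directly. You need a small bridge script that polls /metrics and translates the dictionary into that format. The script below performs a single poll and writes one file, so it is meant to be run repeatedly: either from a cron job (which limits you to one sample per minute) or from a sidecar loop.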


#!/usr/bin/env python3
"""Prometheus bridge for Pyvorin Edge /metrics."""

import json
import urllib.request
from pathlib import Path

METRICS_URL = "http://localhost:8080/metrics"
OUTPUT_PATH = Path("/var/lib/node_exporter/textfile_collector/pyvorin_edge.prom")

PROM_TEMPLATE = """\
# HELP pyvorin_edge_cpu_percent CPU utilisation percentage.
# TYPE pyvorin_edge_cpu_percent gauge
pyvorin_edge_cpu_percent {cpu_percent}
# HELP pyvorin_edge_ram_percent RAM utilisation percentage.
# TYPE pyvorin_edge_ram_percent gauge
pyvorin_edge_ram_percent {ram_percent}
# HELP pyvorin_edge_disk_percent Disk utilisation percentage.
# TYPE pyvorin_edge_disk_percent gauge
pyvorin_edge_disk_percent {disk_percent}
# HELP pyvorin_edge_thermal_celsius SoC temperature in Celsius.
# TYPE pyvorin_edge_thermal_celsius gauge
pyvorin_edge_thermal_celsius {thermal_celsius}
# HELP pyvorin_edge_uptime_seconds System uptime in seconds.
# TYPE pyvorin_edge_uptime_seconds counter
pyvorin_edge_uptime_seconds {uptime_seconds}
"""


def fetch():
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


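def write_prom(data: dict):
    OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    body = PROM_TEMPLATE.format(
        cpu_percent=data.get("cpu_percent", 0.0),
        ram_percent=data.get("ram_percent", 0.0),
        disk_percent=data.get("disk_percent", 0.0),
        thermal_celsius=data.get("thermal_celsius", 0.0),
        uptime_seconds=data.get("uptime_seconds", 0.0),
    )
    # Write to a temporary file and rename it into place, so a concurrent
    # scrape never reads a half-written .prom file.
    tmp_path = OUTPUT_PATH.with_name(OUTPUT_PATH.name + ".tmp")
    tmp_path.write_text(body, encoding="utf-8")
    tmp_path.replace(OUTPUT_PATH)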


if __name__ == "__main__":
    write_prom(fetch())

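If you are running the Prometheus Node Exporter, set OUTPUT_PATH to the directory that Node Exporter was started with via its --collector.textfile.directory flag. The textfile collector reads every file ending in .prom in that directory on each scrape, so the metrics appear on the next collection cycle without any further configuration.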
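
The same approach extends to fields that only /health exposes, such as cloud.queue_depth. The sketch below renders two of them as exposition text. The metric name pyvorin_edge_queue_depth is the one used by the dashboard and alert rules in this article; `health_to_prom` and the name pyvorin_edge_readings_processed_total are hypothetical and chosen here only for illustration.

```python
def health_to_prom(health: dict) -> str:
    """Render selected /health fields as Prometheus exposition text.

    Fields that are absent from the document are skipped instead of being
    reported as 0, so a missing value cannot be mistaken for a real reading.
    """
    cloud = health.get("cloud", {})
    agent = health.get("agent", {})
    lines = []

    if "queue_depth" in cloud:
        lines += [
            "# HELP pyvorin_edge_queue_depth Items waiting for upstream upload.",
            "# TYPE pyvorin_edge_queue_depth gauge",
            f"pyvorin_edge_queue_depth {cloud['queue_depth']}",
        ]

    if "readings_processed" in agent:
        # Hypothetical metric name; the _total suffix follows the usual
        # Prometheus naming convention for counters.
        lines += [
            "# HELP pyvorin_edge_readings_processed_total Lifetime count of ingested readings.",
            "# TYPE pyvorin_edge_readings_processed_total counter",
            f"pyvorin_edge_readings_processed_total {agent['readings_processed']}",
        ]

    # The exposition format expects the final line to end with a newline.
    return ("\n".join(lines) + "\n") if lines else ""


text = health_to_prom(
    {"cloud": {"queue_depth": 12}, "agent": {"readings_processed": 150000}}
)
print(text, end="")
```

The returned string can be written to a second .prom file in the same textfile directory, using the same polling schedule as the /metrics bridge.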

Complete Grafana Dashboard JSON

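Import the following dashboard into Grafana. It assumes Prometheus is scraping the textfile metrics above, plus a second job that hits /health and exposes pyvorin_edge_queue_depth via a similar bridge. The outer "dashboard" wrapper is the payload shape used by Grafana's HTTP API (POST /api/dashboards/db); if you import through the Grafana UI instead, paste only the inner object.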


{
  "dashboard": {
    "id": null,
    "title": "Pyvorin Edge Health",
    "tags": ["edge", "pyvorin"],
    "timezone": "utc",
    "panels": [
      {
        "id": 1,
        "title": "CPU %",
        "type": "stat",
        "targets": [
          {
            "expr": "pyvorin_edge_cpu_percent",
            "legendFormat": "CPU"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 85}
              ]
            }
          }
        },
        "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "RAM %",
        "type": "stat",
        "targets": [
          {
            "expr": "pyvorin_edge_ram_percent",
            "legendFormat": "RAM"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 85}
              ]
            }
          }
        },
        "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0}
      },
      {
        "id": 3,
        "title": "SoC Temperature",
        "type": "stat",
        "targets": [
          {
            "expr": "pyvorin_edge_thermal_celsius",
            "legendFormat": "°C"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "celsius",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 60},
                {"color": "red", "value": 75}
              ]
            }
          }
        },
        "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0}
      },
      {
        "id": 4,
        "title": "Cloud Queue Depth",
        "type": "timeseries",
        "targets": [
          {
            "expr": "pyvorin_edge_queue_depth",
            "legendFormat": "Pending items"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "custom": {"drawStyle": "line"}
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4}
      },
      {
        "id": 5,
        "title": "Disk Usage %",
        "type": "gauge",
        "targets": [
          {
            "expr": "pyvorin_edge_disk_percent",
            "legendFormat": "Disk"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 85}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4}
      }
    ]
  }
}

Alertmanager Rules

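The following alerting rules trigger on resource exhaustion, thermal throttling risk, and cloud sync backlog. Prometheus itself evaluates these rules; Alertmanager only receives the resulting alerts and routes the notifications. Save the rules as /etc/prometheus/alerts/pyvorin_edge.yml, reference that file under rule_files in prometheus.yml, and validate it with promtool check rules before reloading Prometheus.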


groups:
  - name: pyvorin_edge
    rules:
      - alert: EdgeHighCPU
        expr: pyvorin_edge_cpu_percent > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU has been above 85% for more than 5 minutes."

      - alert: EdgeHighRAM
        expr: pyvorin_edge_ram_percent > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High RAM on {{ $labels.instance }}"
          description: "RAM usage is above 90%. OOM kills are likely."

      - alert: EdgeHighThermal
        expr: pyvorin_edge_thermal_celsius > 75
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Thermal throttling risk on {{ $labels.instance }}"
          description: "SoC temperature is above 75 °C. Performance will degrade."

      - alert: EdgeDiskFull
        expr: pyvorin_edge_disk_percent > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk filling on {{ $labels.instance }}"
          description: "Disk usage is above 85%. SQLite WAL may fail to grow."

      - alert: EdgeSyncBacklog
        expr: pyvorin_edge_queue_depth > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cloud sync backlog on {{ $labels.instance }}"
          description: "More than 1000 items are queued. Check connectivity."

SystemMetrics API Usage

If you need to collect metrics inside your own Python script rather than via HTTP, use the SystemMetrics class directly.


import time

from pyv_edge_agent.health_monitor.metrics import SystemMetrics, MetricsSnapshot

metrics = SystemMetrics()

# Prime the CPU sampler: the first cpu_percent() call always returns 0.0,
# so take a throwaway reading and wait for one sampling window.
metrics.cpu_percent()
time.sleep(1.0)

# Individual accessors
print(f"CPU:   {metrics.cpu_percent():.1f}%")
print(f"RAM:   {metrics.ram_percent():.1f}%")
print(f"Disk:  {metrics.disk_percent('/var/lib/pyvorin'):.1f}%")
print(f"Thermal: {metrics.thermal_celsius()}°C")
print(f"Uptime: {metrics.uptime_seconds():.0f}s")

# Full snapshot
snapshot: MetricsSnapshot = metrics.snapshot()
print(snapshot.to_dict())

cpu_percent() reads from /proc/stat and requires a delta between two calls. The first call always returns 0.0; the second call returns the true utilisation over the sampling window.
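
To see why a single reading cannot yield a utilisation figure, it helps to look at how a delta over /proc/stat is typically computed. The sketch below is a generic illustration of that technique and is not the SDK's implementation; `parse_cpu_line` and `cpu_percent_between` are hypothetical helpers, and the two sample lines are made-up values.

```python
def parse_cpu_line(line: str) -> tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate "cpu" line of /proc/stat.

    The first eight counters are user, nice, system, idle, iowait, irq,
    softirq, and steal. Guest time is already included in user time, so the
    trailing guest columns are ignored.
    """
    fields = [int(value) for value in line.split()[1:9]]
    idle = fields[3] + fields[4]  # idle + iowait
    return idle, sum(fields)


def cpu_percent_between(before: str, after: str) -> float:
    """Busy percentage between two samples of the aggregate "cpu" line."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        # No ticks elapsed, which is the situation on a first, lone sample.
        return 0.0
    busy = elapsed - (idle1 - idle0)
    return 100.0 * busy / elapsed


# Two made-up samples: 400 ticks elapsed, 320 of them idle.
first = "cpu  100 0 50 800 50 0 0 0 0 0"
second = "cpu  160 0 70 1120 50 0 0 0 0 0"
usage = cpu_percent_between(first, second)
print(f"{usage:.1f}%")
```

The counters in /proc/stat are cumulative since boot, so only the difference between two samples describes recent load. This is why the first cpu_percent() call has nothing to compare against.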

Summary

You now have full visibility into the Edge Agent's health. The /health endpoint gives you operational state, /metrics gives you system telemetry, the Prometheus bridge converts JSON into scrapable text format, and the Grafana dashboard plus Alertmanager rules turn raw numbers into actionable alerts.