Offline-First Operation

The Myth of Always-On

Tested with: Python 3.12.3, GCC 13.3.0, Pyvorin Edge SDK 1.0.5-edge, Ubuntu 24.04 LTS (x86_64 & ARM64). Run python3 --version and gcc --version to verify your environment.

Cloud-native software is built on the assumption that the network is reliable, fast, and cheap. At the edge, none of these assumptions hold. A temperature sensor on a fishing trawler may be offline for weeks. A pressure monitor in an underground mine may have connectivity only when the lift cage reaches the surface. A smart meter in a rural village may share a single 2G tower with five hundred other devices. Designing for these environments requires an offline-first mindset: the system must function correctly when disconnected, and it must synchronise gracefully when connectivity returns.

This article explains the architectural patterns that make Pyvorin Edge resilient to intermittent connectivity. We cover SQLite Write-Ahead Logging (WAL) as a buffering strategy, the retry and circuit-breaker policies in cloud_sync/retry.py, graceful degradation modes, techniques for detecting connectivity loss, and the mechanisms that ensure queue persistence across reboots.

SQLite WAL Buffering

The foundation of offline-first operation is local storage. The Pyvorin Edge Runtime uses SQLite with WAL (Write-Ahead Logging) mode for all persistent state: sensor readings, events, audit records, and the cloud sync queue. WAL mode was chosen because it offers three properties that are critical at the edge:

Readers do not block writers. While the main thread is appending new readings to the database, a background flush thread can read the queue and attempt uploads without acquiring exclusive locks.
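Crash recovery is automatic. If the device loses power mid-transaction, the incomplete transaction is simply never applied, while committed transactions that have not yet been checkpointed remain in the WAL file. On the next boot, SQLite replays the committed frames from the WAL and restores consistency without manual intervention.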
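Checkpointing is configurable. By default, SQLite auto-checkpoints when the WAL reaches 1000 pages. On devices with flash storage, frequent checkpoints cause wear. The runtime sets PRAGMA synchronous = NORMAL, which in WAL mode syncs to disk at checkpoints rather than on every commit, and lets the WAL grow during outages, checkpointing only when the flush cycle succeeds. The trade-off is that a sudden power loss may roll back the most recent commits, although the database itself stays consistent.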
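
The first property can be observed with nothing more than Python's standard sqlite3 module. The following is a minimal sketch, not the runtime's own code: the table name, the helper names open_wal and demo_concurrent_read, and the temporary database path are assumptions made for illustration.

```python
import os
import sqlite3
import tempfile


def open_wal(path):
    """Open a connection in autocommit mode and switch it to WAL."""
    conn = sqlite3.connect(path, timeout=1.0, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn


def demo_concurrent_read():
    """Return the row counts a reader sees during and after a write."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        writer = open_wal(path)
        reader = open_wal(path)

        writer.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
        writer.execute("INSERT INTO readings VALUES ('temp.hold', 4.2)")

        # Start a write transaction and leave it open.
        writer.execute("BEGIN IMMEDIATE")
        writer.execute("INSERT INTO readings VALUES ('temp.hold', 4.3)")

        # The reader is not blocked; it sees the last committed snapshot.
        during = reader.execute("SELECT COUNT(*) FROM readings").fetchone()[0]

        writer.execute("COMMIT")
        after = reader.execute("SELECT COUNT(*) FROM readings").fetchone()[0]

        writer.close()
        reader.close()
        return during, after
```

While the writer's transaction is open, the reader returns immediately with the previously committed state; once the writer commits, the next read sees the new row.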


from pyv_edge_agent.local_store.sqlite_store import SQLiteStore

# The store automatically enables WAL mode on every connection.
store = SQLiteStore(db_path="/var/lib/pyvorin/edge_store.db", pool_size=4)

# Store a reading while offline
from pyv_edge_agent.types import SensorReading
reading = SensorReading("temp.hold", 1717000000.0, 4.2, "celsius", {"vessel": "trawler-7"})
store.store_reading(reading)

# The data is now safe in SQLite, even if the device reboots before upload.

The SQLiteStore manages a connection pool (default size 4) and applies PRAGMA journal_mode = WAL on every new connection. It also tracks WAL size via get_stats(), so you can monitor disk usage and trigger manual checkpoints if the WAL grows too large during extended outages.
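
To make the WAL-size monitoring and manual checkpointing concrete, here is a small sketch using only the standard sqlite3 module. It does not use SQLiteStore; the helper names wal_size_bytes and demo_checkpoint and the readings table are hypothetical. It relies on two documented SQLite behaviours: the WAL lives in a sibling file with the suffix -wal, and PRAGMA wal_checkpoint(TRUNCATE) truncates that file to zero bytes when it completes.

```python
import os
import sqlite3
import tempfile


def wal_size_bytes(db_path):
    """Size of the WAL file that sits next to the database, or 0 if absent."""
    wal_path = db_path + "-wal"
    return os.path.getsize(wal_path) if os.path.exists(wal_path) else 0


def demo_checkpoint():
    """Write rows with auto-checkpointing off, then checkpoint manually."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "edge_store.db")
        conn = sqlite3.connect(db_path)
        conn.execute("PRAGMA journal_mode=WAL")
        # 0 disables automatic checkpoints, so the WAL grows until we act.
        conn.execute("PRAGMA wal_autocheckpoint=0")
        conn.execute("CREATE TABLE readings (ts REAL, value REAL)")
        conn.executemany(
            "INSERT INTO readings VALUES (?, ?)",
            [(float(i), 4.2) for i in range(1000)],
        )
        conn.commit()

        before = wal_size_bytes(db_path)
        # Copy WAL content into the main database file and reset the WAL.
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
        after = wal_size_bytes(db_path)

        conn.close()
        return before, after
```

A monitoring loop could compare the value from wal_size_bytes against a threshold and issue the checkpoint only after a successful flush.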

Retry with Exponential Backoff

When connectivity returns, the device must not immediately flood the network with every pending item. The ExponentialBackoff and with_retry utilities in cloud_sync/retry.py provide configurable retry policies that respect both the device and the upstream server.


from pyv_edge_agent.cloud_sync.retry import RetryPolicy, with_retry, ExponentialBackoff

policy = RetryPolicy(
    max_attempts=5,
    base_delay=2.0,
    max_delay=300.0,
    exponential_base=2.0,
    jitter=True,
)

backoff = ExponentialBackoff(policy)
for attempt in range(policy.max_attempts):
    delay = backoff.calculate_next_delay(attempt)
    print(f"Attempt {attempt + 1}: wait {delay:.1f}s before retry")

With jitter enabled, the actual delay is a random value between 50% and 100% of the computed exponential delay. This prevents synchronised retry storms across a fleet of devices that all lost connectivity at the same time (for example, during a cell tower maintenance window).
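
The delay calculation described above can be written out in a few lines. This is a sketch of the formula as stated in this article, not the code in cloud_sync/retry.py; the function name backoff_delay and the rng parameter are assumptions.

```python
import random


def backoff_delay(attempt, base_delay=2.0, max_delay=300.0,
                  exponential_base=2.0, jitter=True, rng=random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Exponential growth, capped at max_delay.
    delay = min(max_delay, base_delay * exponential_base ** attempt)
    if jitter:
        # Scale to a random value between 50% and 100% of the computed delay.
        delay *= rng.uniform(0.5, 1.0)
    return delay
```

With the policy shown earlier and jitter disabled, the first five delays are 2, 4, 8, 16, and 32 seconds, and any attempt whose computed delay exceeds 300 seconds is capped at 300.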


from pyv_edge_agent.cloud_sync.retry import with_retry, RetryPolicy

@with_retry(policy=RetryPolicy(max_attempts=3, base_delay=1.0))
def upload_to_satellite(items):
    # This function is attempted up to 3 times in total, with delays of ~1s and ~2s between attempts.
    ...
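
For readers who want to see what a decorator of this kind does internally, the following is a simplified stand-in. It is not the with_retry implementation: the name retry, the retry_on tuple, and the injectable sleep parameter are assumptions, and jitter is omitted to keep the control flow visible.

```python
import functools
import time


def retry(max_attempts=3, base_delay=1.0, exponential_base=2.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Retry the wrapped function on transient errors with backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts - 1:
                        raise  # attempts exhausted; surface the last error
                    sleep(base_delay * exponential_base ** attempt)
        return wrapper
    return decorator
```

Passing sleep as a parameter makes the policy testable without real waiting: a test can pass a list's append method and then inspect the delays that would have been slept.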

Circuit Breaker for Upstream Protection

If the upstream server is down or returning 5xx errors, blind retries waste battery and bandwidth. The CircuitBreaker class tracks failures and opens the circuit after a threshold, causing all subsequent calls to fail fast until a recovery timeout expires.


import random

from pyv_edge_agent.cloud_sync.retry import CircuitBreaker, CircuitState

breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=60.0,
)

def fragile_upload(items):
    # Simulate an unreliable endpoint that fails about 80% of the time.
    if random.random() < 0.8:
        raise ConnectionError("Upstream unreachable")
    return "ok"

for _ in range(20):
    try:
        breaker.call(fragile_upload, [])
        print("Upload succeeded")
    except RuntimeError as exc:
        # Raised by the breaker while the circuit is open (fail fast).
        print(f"Circuit state: {breaker.state.value} - {exc}")
    except ConnectionError as exc:
        # Raised by fragile_upload itself; counts towards failure_threshold.
        print(f"Upload failed: {exc}")

A circuit breaker should wrap the uploader in any deployment where the upstream is shared (e.g., a single API gateway serving thousands of edge devices). When the circuit is open, the queue continues to accumulate items locally; no data is lost. When the circuit transitions to HALF_OPEN, a single probe request is allowed. If it succeeds, the circuit closes and normal flushing resumes.
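
The three states and their transitions can be sketched as a small state machine. This is an illustrative, single-threaded model and not the CircuitBreaker class from cloud_sync/retry.py: the names SimpleBreaker, State, and CircuitOpenError, and the injectable clock parameter, are assumptions.

```python
import enum
import time


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(RuntimeError):
    """Raised instead of calling upstream while the circuit is open."""


class SimpleBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = State.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout expired: let exactly this call through as a probe.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed probe, or too many failures, (re)opens the circuit.
            if (self.state is State.HALF_OPEN
                    or self._failures >= self.failure_threshold):
                self.state = State.OPEN
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self.state = State.CLOSED
        return result
```

A production breaker would also need locking so that only one thread sends the HALF_OPEN probe; that is left out here for brevity.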

Graceful Degradation

Offline-first does not mean "do everything locally forever." It means the system degrades gracefully along a continuum of functionality:

Full connectivity: All features active — real-time dashboards, remote configuration, cloud analytics, OTA updates.
Degraded connectivity: High-latency links or occasional packet loss. The agent increases batch sizes, extends timeouts, and reduces heartbeat frequency to conserve bandwidth.
No connectivity: The agent continues ingestion, privacy filtering, and local rule evaluation. Events are stored in SQLite. The cloud sync queue grows. Critical local actions (e.g., shutting down a motor when vibration exceeds a safety threshold) continue to execute because they do not depend on the cloud.
Extended outage (days): The queue may approach storage limits. The agent begins TTL-based shedding, prioritising critical and anomaly data over routine telemetry and logs. Local summaries (hourly aggregates) are computed and stored, while raw high-frequency readings are purged to reclaim space.


from pyv_edge_agent.local_store.sqlite_store import SQLiteStore
from pyv_edge_agent.cloud_sync.queue import CloudSyncQueue, Priority

store = SQLiteStore("edge_store.db")
queue = CloudSyncQueue("sync_queue.db")

stats = store.get_stats()
wal_mb = stats["wal_size_bytes"] / (1024 * 1024)

if wal_mb > 500:
    # Emergency shedding: purge raw readings older than 6 hours
    store.purge_old(hours=6)
    # Drop low-priority queue items
    # (Implementation would use a custom SQL DELETE on sync_queue)
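
One way the custom DELETE could look is sketched below with the standard sqlite3 module. The real sync_queue schema is not documented in this article, so the table layout, the numeric priority convention (0 is most important), and the function name shed_to_limit are all hypothetical.

```python
import sqlite3

# Hypothetical schema: lower priority number means more important.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sync_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    priority INTEGER NOT NULL,
    created_at REAL NOT NULL,
    payload TEXT NOT NULL
)
"""


def shed_to_limit(conn, max_rows):
    """Delete the least important, oldest rows until max_rows remain.

    Returns the number of rows deleted.
    """
    (count,) = conn.execute("SELECT COUNT(*) FROM sync_queue").fetchone()
    excess = count - max_rows
    if excess <= 0:
        return 0
    with conn:  # run the delete in a single transaction
        cur = conn.execute(
            """
            DELETE FROM sync_queue WHERE id IN (
                SELECT id FROM sync_queue
                ORDER BY priority DESC, created_at ASC
                LIMIT ?
            )
            """,
            (excess,),
        )
    return cur.rowcount
```

Ordering by priority first and age second means logs and routine telemetry are removed before any critical or anomaly item is touched.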

Detecting Connectivity Loss

The runtime does not assume the presence of a network interface; it probes for actual end-to-end connectivity. The simplest probe is a health check against the configured cloud endpoint. HTTPCloudClient in cloud_sync/http_client.py provides a health_check() method for this purpose.


from pyv_edge_agent.cloud_sync.http_client import HTTPCloudClient

client = HTTPCloudClient()
is_online = client.health_check("https://api.pyvorin.com/v1/health", timeout=5)

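# `queue` is the CloudSyncQueue created earlier; `uploader` is the callable
# that sends a batch to the cloud (defined elsewhere in your agent).
if is_online:
    print("Cloud reachable; flushing queue")
    queue.maybe_flush(uploader=uploader)
else:
    print("Cloud unreachable; buffering locally")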

For environments where the health endpoint itself may be unreliable, you can implement a layered connectivity test:

Layer 1, interface check: Does eth0 or wlan0 have a carrier? Read /sys/class/net/eth0/carrier and test for "1". The file exists whenever the interface does, so os.path.exists() alone does not prove that a link is up.
Layer 2, gateway check: Can we reach the default gateway via ICMP? (subprocess.run(["ping", "-c", "1", gateway]))
Layer 3, DNS check: Can we resolve the cloud endpoint hostname? (socket.getaddrinfo("api.pyvorin.com", None))
Layer 4, application check: Does the health endpoint return 2xx?

Only if all four layers pass should the agent consider itself online. If layer 1 or 2 fails, the outage is local (cable unplugged). If layer 3 fails but layer 2 passes, DNS is broken. If layer 4 fails but layer 3 passes, the upstream server is down. Each layer suggests a different remediation strategy.
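
The layered test lends itself to a small driver that runs probes in order and reports the first layer that fails. The sketch below is an assumption about how this could be structured, not part of the SDK: diagnose, interface_has_carrier, and dns_resolves are hypothetical helpers, and the gateway and application probes are left for the caller to supply.

```python
import socket
from pathlib import Path


def interface_has_carrier(iface="eth0"):
    """True if the kernel reports link carrier ("1") for the interface."""
    try:
        text = Path(f"/sys/class/net/{iface}/carrier").read_text()
    except OSError:
        # Interface missing, or administratively down (read fails).
        return False
    return text.strip() == "1"


def dns_resolves(hostname):
    """True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def diagnose(probes):
    """Run (layer_name, probe) pairs in order.

    Returns the name of the first failing layer, or None if all pass.
    """
    for name, probe in probes:
        if not probe():
            return name
    return None
```

A caller would pass a list such as [("interface", ...), ("gateway", ...), ("dns", ...), ("application", ...)] and choose a remediation based on the returned layer name; stopping at the first failure also avoids spending time and power on probes that cannot succeed.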

Queue Persistence Across Reboots

Because CloudSyncQueue uses SQLite, its contents survive reboots, power failures, and application crashes. There is no separate in-memory buffer that can be lost. When the device boots, the queue holds every item that had been durably committed before power was lost: items waiting to be dequeued, retry counts intact, next_retry_at timestamps unchanged.
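
The effect is easy to reproduce with a plain SQLite file. The sketch below is not CloudSyncQueue: the table layout and the helper names open_queue and demo_restart are hypothetical, and closing and reopening the connection only stands in for a process restart rather than a real power cut.

```python
import os
import sqlite3
import tempfile


def open_queue(path):
    """Open (or create) a queue database in WAL mode."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS sync_queue (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            retry_count INTEGER NOT NULL DEFAULT 0,
            next_retry_at REAL NOT NULL DEFAULT 0
        )
        """
    )
    return conn


def demo_restart():
    """Enqueue an item, simulate a restart, and read the item back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sync_queue.db")

        conn = open_queue(path)
        with conn:  # commit on success
            conn.execute(
                "INSERT INTO sync_queue (payload, retry_count, next_retry_at)"
                " VALUES (?, ?, ?)",
                ('{"sensor": "temp.hold"}', 2, 1717000300.0),
            )
        conn.close()  # stand-in for the process exiting

        conn = open_queue(path)  # stand-in for the next boot
        row = conn.execute(
            "SELECT payload, retry_count, next_retry_at FROM sync_queue"
        ).fetchone()
        conn.close()
        return row
```

After the simulated restart the row comes back with the same payload, retry count, and next_retry_at value that were committed before the connection was closed.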

To ensure this property holds in practice, follow these deployment guidelines:

Place the database on persistent storage. Do not store sync_queue.db on a tmpfs mount (e.g., /run, /dev/shm, or /tmp on systems that mount it as tmpfs). Use /var/lib/pyvorin/ on the root filesystem or a dedicated partition.
Enable filesystem barriers. SQLite relies on the underlying filesystem to honour fsync(). Mount the data partition with barrier enabled (the default on ext4) so that power loss cannot corrupt committed transactions.
Use a systemd service with Restart=always. If the agent crashes, systemd should restart it within seconds. The queue will be intact and flushing will resume.
Handle shutdown gracefully. Trap SIGTERM in your agent and call queue.pending_count() to log the number of unsent items before exiting. This aids post-mortem analysis.


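import logging
import signal
import sys

logger = logging.getLogger(__name__)

# `queue` and `store` are the CloudSyncQueue and SQLiteStore created earlier.
def on_sigterm(signum, frame):
    # Record how much work is still pending, then release the database.
    pending = queue.pending_count()
    logger.info("Received SIGTERM. %d items remain in queue.", pending)
    store.close()
    sys.exit(0)

signal.signal(signal.SIGTERM, on_sigterm)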

Architecture Summary

The offline-first architecture of Pyvorin Edge can be summarised as follows:

Ingestion is decoupled from egress. Sensor adapters write to the local store; the cloud sync queue drains asynchronously. Neither blocks the other.
All state is SQLite-backed. WAL mode provides crash recovery and concurrent readers/writers.
Retries are exponential with jitter. This prevents thundering herds and respects shared infrastructure.
Circuit breakers protect upstreams. Fast failure during outages conserves local resources.
Connectivity is layered. The agent probes from the physical interface up to the application endpoint, distinguishing local faults from remote ones.
Graceful degradation preserves safety. Local rule evaluation and critical actuation continue even when the cloud is unreachable.

Summary

Offline-first is not a feature; it is a design philosophy. The Pyvorin Edge Runtime embodies this philosophy through SQLite WAL buffering, configurable exponential backoff, circuit breakers, layered connectivity detection, and queue persistence across reboots. Whether your device is offline for milliseconds or weeks, the pipeline continues to collect, filter, and protect data. When connectivity returns, synchronisation happens automatically, safely, and without operator intervention.