Field Redaction Patterns

Field-Level Control

Tested with: Python 3.12.3, GCC 13.3.0, Pyvorin Edge SDK 1.0.5-edge, Ubuntu 24.04 LTS (x86_64 & ARM64). Run python3 --version and gcc --version to verify your environment.

Whole-sensor filtering — dropping or masking every reading from a given sensor — is often too blunt. A temperature sensor in a patient room may produce perfectly benign numeric values, but its metadata might include the patient_id, room_number, or staff_name of the nurse who last calibrated the device. You want the temperature stream for analytics, but you need the personal identifiers stripped before egress.

The Pyvorin Edge Runtime solves this with the PrivacyPolicyEngine in pyv_edge_agent/privacy_firewall/policy.py. This engine operates at the field level, applying wildcard patterns to both the sensor_name and the keys inside the metadata dictionary. It supports five actions: redact, mask, hash, drop, and local_only.

The PrivacyPolicyEngine API

Unlike the simpler PrivacyPolicy class (which matches only sensor names), the engine also matches patterns against metadata keys. The core data structures are PrivacyRuleset and PrivacyPolicyEngine.


from pyv_edge_agent.privacy_firewall.policy import (
    PrivacyPolicyEngine,
    PrivacyRuleset,
)
from pyv_edge_agent.types import SensorReading

ruleset = PrivacyRuleset(
    redact_fields=["tenant_id", "patient_id", "staff_name"],
    mask_fields=["email", "phone"],
    hash_fields=["device_mac", "serial_number"],
    drop_fields=["camera.*", "microphone.*"],
    local_only=["motion.patient_room.*"],
)

engine = PrivacyPolicyEngine(ruleset=ruleset)

reading = SensorReading(
    sensor_name="temp.patient_room.3b",
    timestamp=1717000000.0,
    value=37.2,
    unit="celsius",
    metadata={
        "tenant_id": "acme-hospital",
        "staff_name": "Nurse Joy",
        "device_mac": "aa:bb:cc:dd:ee:ff",
        "floor": "3",
    },
)

out = engine.apply(reading)
# out.metadata is now:
# {
#     "device_mac": "<64-character SHA-256 hex digest>",
#     "floor": "3",
# }
# tenant_id and staff_name are removed (redacted)
# device_mac is SHA-256 hashed
# floor is untouched

Understanding the Five Field Actions

`redact_fields` — Complete Removal

Any metadata key that matches a pattern in redact_fields is deleted from the dictionary. The key and its value vanish entirely; downstream consumers have no way to know that the key ever existed. This is the right choice for high-sensitivity identifiers such as patient_id, social_security_number, or tenant_id when the downstream analytics pipeline does not need them.
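
The removal step can be pictured with a few lines of standard-library Python. This is a minimal sketch, not the SDK's implementation: the `redact` helper is a hypothetical name, and it assumes the engine compares each metadata key against each pattern with `fnmatch.fnmatch`, as described later in this section.

```python
from fnmatch import fnmatch


def redact(metadata: dict, patterns: list[str]) -> dict:
    """Return a copy of metadata without the keys that match any glob pattern.

    Hypothetical helper for illustration; the input dictionary is not modified.
    """
    return {
        key: value
        for key, value in metadata.items()
        if not any(fnmatch(key, pattern) for pattern in patterns)
    }


meta = {"tenant_id": "acme-hospital", "staff_name": "Nurse Joy", "floor": "3"}
cleaned = redact(meta, ["tenant_id", "patient_id", "staff_name"])
print(cleaned)  # {'floor': '3'}
```

Building a new dictionary rather than deleting keys in place keeps the raw reading available to any local consumer that still needs it.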

`mask_fields` — Partial Retention

Masking replaces the value with a partially obscured representation. The engine's _mask_value static method behaves as follows:

If the value is a string longer than 4 characters, it keeps the first two and last two characters and replaces everything in between with asterisks. For example, "alice.smith@hospital.org" becomes "al********************rg".
If the string is 4 characters or fewer, it is replaced entirely with asterisks.
Non-string values are replaced with the literal string "[MASKED]".

Use masking when human operators may later need to eyeball a value for triage or support, but the full precision is not required for automated processing. Email addresses and phone numbers are classic examples.
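
The three masking rules above can be written out as a short function. This is a sketch under stated assumptions: `mask_value` is a hypothetical stand-in for the engine's private `_mask_value`, and it assumes that short strings are replaced by the same number of asterisks, which the text does not specify.

```python
def mask_value(value: object) -> str:
    """Hypothetical stand-in for _mask_value, following the three rules above."""
    if not isinstance(value, str):
        return "[MASKED]"  # non-string values are replaced by a fixed literal
    if len(value) <= 4:
        return "*" * len(value)  # assumption: the length is preserved
    # Keep the first two and last two characters, mask everything in between.
    return value[:2] + "*" * (len(value) - 4) + value[-2:]


masked_email = mask_value("alice.smith@hospital.org")
print(masked_email)          # first two and last two characters kept
print(mask_value("abcd"))    # ****
print(mask_value(51.5074))   # [MASKED]
```

Because the masked string keeps its original length and its outer characters, short or highly structured values (for example phone numbers with a known prefix) can still leak information; redaction or hashing may be the safer choice for those.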

`hash_fields` — Deterministic Pseudonymisation

The engine computes SHA-256(value) and stores the hex digest as the new value. Because the hash is deterministic, a given MAC address or serial number always maps to the same digest. This lets you join datasets or count unique devices without ever storing the raw identifier in the cloud.
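
The determinism property is easy to check with `hashlib`. The sketch below is illustrative only: `hash_value` is a hypothetical helper, and converting the value with `str()` and encoding it as UTF-8 before hashing is an assumption about how the engine serialises values.

```python
import hashlib


def hash_value(value: object) -> str:
    """SHA-256 hex digest of the value's string form (UTF-8 encoding assumed)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


a = hash_value("aa:bb:cc:dd:ee:ff")
b = hash_value("aa:bb:cc:dd:ee:ff")
c = hash_value("AA:BB:CC:DD:EE:FF")

print(len(a))  # 64 hex characters
print(a == b)  # True: the same input always yields the same digest
print(a == c)  # False: hashing is case-sensitive, so normalise identifiers first
```

Two practical points follow from this. Identifiers should be normalised (case, separators) before hashing, or joins across datasets will silently miss. And because the hash is unsalted, identifiers drawn from a small or guessable space can in principle be recovered by hashing candidate values, so hashing is pseudonymisation rather than anonymisation.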

`drop_fields` — Sensor-Level Blacklist

Patterns in drop_fields are matched against the sensor_name, not metadata keys. If a match is found, the entire reading is discarded. This is identical in behaviour to the PrivacyPolicy.drop action, but it lives inside the ruleset so that you can manage all privacy decisions in one configuration object.

`local_only` — Strip Metadata, Keep Value

Sensors matching local_only have their metadata dictionary emptied, but the value, unit, and timestamp are retained. This is useful for motion detectors or occupancy sensors where the numeric event count is harmless, but metadata tags like room_number or bed_id would reveal too much context.
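
The two sensor-level actions, drop_fields and local_only, can be sketched together. Everything here is an assumption made for illustration: `Reading` is a hypothetical stand-in for SensorReading, `apply_sensor_rules` is not an SDK function, and checking drop patterns before local_only patterns is an assumed precedence.

```python
from dataclasses import dataclass, field, replace
from fnmatch import fnmatch
from typing import Optional


@dataclass(frozen=True)
class Reading:
    """Hypothetical stand-in for SensorReading."""
    sensor_name: str
    timestamp: float
    value: float
    unit: str
    metadata: dict = field(default_factory=dict)


def apply_sensor_rules(
    reading: Reading, drop_patterns: list[str], local_only_patterns: list[str]
) -> Optional[Reading]:
    """Return None for dropped sensors, a metadata-free copy for local_only ones."""
    if any(fnmatch(reading.sensor_name, p) for p in drop_patterns):
        return None  # the whole reading is discarded
    if any(fnmatch(reading.sensor_name, p) for p in local_only_patterns):
        return replace(reading, metadata={})  # value, unit, timestamp survive
    return reading


drop = ["camera.*", "microphone.*"]
local = ["motion.patient_room.*"]

cam = Reading("camera.lobby", 1717000000.0, 1.0, "frame", {"room_number": "12"})
motion = Reading("motion.patient_room.3b", 1717000000.0, 4.0, "count", {"bed_id": "7"})
temp = Reading("temp.patient_room.3b", 1717000000.0, 37.2, "celsius", {"floor": "3"})

print(apply_sensor_rules(cam, drop, local))              # None
print(apply_sensor_rules(motion, drop, local).metadata)  # {}
print(apply_sensor_rules(temp, drop, local).metadata)    # {'floor': '3'}
```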

Sensor Pattern Matching with Wildcards

The engine uses fnmatch.fnmatch for all pattern matching. The syntax is the same as Unix shell globs, which is simpler than full regular expressions and avoids the backslash-escaping pitfalls that make regex error-prone in configuration files.


from pyv_edge_agent.privacy_firewall.policy import PrivacyPolicyEngine

# Example patterns and what they match
patterns = {
    "temp.*":              ["temp.living_room", "temp.warehouse"],
    "temp.b*":             ["temp.basement", "temp.boiler_room"],
    "*.office.?":          ["motion.office.a", "light.office.b"],
    "hr.[0-9][0-9][0-9]":  ["hr.001", "hr.042"],
    "camera.[!ext]*":      ["camera.lobby", "camera.reception"],
    "*":                   ["anything.at.all"],
}

Because matching is done with fnmatch, you cannot use regex anchors such as ^ or $. The pattern is implicitly anchored to the entire string, so temp.* will not match my_temp_sensor because the wildcard does not cover the prefix. If you need substring matching, use the * glob on both sides: *temp*.
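
The matching rules can be verified directly against the standard library. The snippet below uses only `fnmatch.fnmatch`; the sensor names are taken from the table above plus a few extra hypothetical names chosen to show non-matches.

```python
from fnmatch import fnmatch

# (pattern, sensor_name) pairs to evaluate
checks = [
    ("temp.*", "temp.living_room"),
    ("temp.*", "temp.floor2.room_b"),     # * also matches dots
    ("temp.*", "my_temp_sensor"),         # the whole string must match
    ("*temp*", "my_temp_sensor"),
    ("*.office.?", "motion.office.a"),
    ("*.office.?", "motion.office.ab"),   # ? matches exactly one character
    ("hr.[0-9][0-9][0-9]", "hr.042"),
    ("camera.[!ext]*", "camera.lobby"),
    ("camera.[!ext]*", "camera.tripod"),  # [!ext] is a character set, not a word
]

results = {(pattern, name): fnmatch(name, pattern) for pattern, name in checks}
for (pattern, name), matched in results.items():
    print(f"{pattern:22} {name:22} {matched}")
```

Note that `[!ext]` excludes any name whose next character is e, x, or t; it does not mean "anything except the prefix ext". Also, `fnmatch.fnmatch` normalises case according to the host operating system, so patterns are case-sensitive on Linux; `fnmatch.fnmatchcase` gives case-sensitive matching on every platform.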

Practical Example: Redacting tenant_id, Personal Names, and GPS

Consider a multi-tenant logistics company that deploys Pyvorin Edge on delivery vans. Each van carries temperature sensors for the cold chain, a GPS tracker, and a driver tablet that reports the driver's name for shift management. The company wants to send temperature compliance data to a central warehouse, but the driver's name and precise GPS coordinates must never leave the vehicle.


from pyv_edge_agent.privacy_firewall.policy import PrivacyRuleset, PrivacyPolicyEngine
from pyv_edge_agent.types import SensorReading

ruleset = PrivacyRuleset(
    redact_fields=["tenant_id", "driver_name", "driver_id", "shift_id"],
    mask_fields=["gps_lat", "gps_lon"],
    hash_fields=["van_serial", "tablet_mac"],
    drop_fields=["camera.cabin", "microphone.cabin"],
    local_only=["motion.cabin"],
)

engine = PrivacyPolicyEngine(ruleset)

raw = SensorReading(
    sensor_name="temp.cold_chain.zone_a",
    timestamp=1717000000.0,
    value=2.4,
    unit="celsius",
    metadata={
        "tenant_id": "logistics-corp",
        "driver_name": "Sarah Connor",
        "van_serial": "VH-2024-8892",
        "gps_lat": 51.5074,
        "gps_lon": -0.1278,
        "route_id": "R-4421",
    },
)

safe = engine.apply(raw)
print(safe.metadata)
# {
#     "van_serial": "<64-character SHA-256 hex digest>",
#     "gps_lat": "[MASKED]",
#     "gps_lon": "[MASKED]",
#     "route_id": "R-4421",
# }

Note that tenant_id and driver_name are removed entirely (redact_fields), the GPS coordinates are masked (non-string values become "[MASKED]"), and the van serial is hashed. The route_id remains because it matches no pattern.

Declarative Configuration in config.toml

The PrivacyRuleset is designed to be serialised to and from plain Python dicts, which means it maps naturally to TOML arrays. The Edge Runtime configuration loader does not yet automatically instantiate a PrivacyPolicyEngine from the TOML tree, but the pattern is so common that we document the canonical representation here. You can load it manually in your agent bootstrap code.


[privacy]
enabled = true

[privacy.ruleset]
redact_fields = ["tenant_id", "patient_id", "staff_name", "ssn"]
mask_fields   = ["email", "phone", "gps_lat", "gps_lon"]
hash_fields   = ["device_mac", "serial_number", "imei"]
drop_fields   = ["camera.*", "microphone.*"]
local_only    = ["motion.patient_room.*", "occupancy.bed_*"]

In your bootstrap module, load the configuration and construct the engine:


from pyv_edge_agent.config import Config
from pyv_edge_agent.privacy_firewall.policy import PrivacyRuleset, PrivacyPolicyEngine

cfg = Config.from_file("/etc/pyvorin-edge/config.toml")
ruleset_dict = cfg.get("privacy", "ruleset", {})
ruleset = PrivacyRuleset.from_dict(ruleset_dict)
engine = PrivacyPolicyEngine(ruleset)

# Now inject 'engine' into your pipeline.

Programmatic API: Building Rulesets at Runtime

Static TOML configuration is sufficient for most deployments, but there are legitimate reasons to mutate rules at runtime: fleet-wide policy rollouts delivered over MQTT, tenant-specific overrides in multi-tenant gateways, or emergency redaction rules triggered by a security incident.


from pyv_edge_agent.privacy_firewall.policy import PrivacyRuleset, PrivacyPolicyEngine

# Start from an empty ruleset
ruleset = PrivacyRuleset()
engine = PrivacyPolicyEngine(ruleset)

# Dynamically add a redaction rule in response to an incident
engine.ruleset.redact_fields.append("leaked_api_key")

# Apply to a batch of readings
from pyv_edge_agent.types import SensorReading

batch = [
    SensorReading(
        sensor_name="pressure.hydraulic",
        timestamp=1717000001.0,
        value=101325.0,
        unit="pascal",
        metadata={"leaked_api_key": "sk-12345", "zone": "A"},
    ),
    SensorReading(
        sensor_name="temp.ambient",
        timestamp=1717000002.0,
        value=21.0,
        unit="celsius",
        metadata={"zone": "A"},
    ),
]

safe_batch = engine.apply_batch(batch)
# safe_batch[0].metadata == {"zone": "A"}
# safe_batch[1].metadata == {"zone": "A"}

The apply_batch method is a convenience wrapper that iterates over a sequence of readings, applies apply() to each, and filters out None results (dropped readings). It returns a plain Python list that is safe to pass directly into CloudSyncQueue.enqueue() or HTTPCloudUploader.post_batch().
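
The behaviour described above amounts to a map followed by a filter. The following sketch shows the shape of such a wrapper; it is not the SDK's code, and `drop_cameras` is a hypothetical per-reading policy that stands in for `engine.apply`.

```python
from typing import Callable, Iterable, Optional, TypeVar

R = TypeVar("R")


def apply_batch(readings: Iterable[R], apply: Callable[[R], Optional[R]]) -> list[R]:
    """Apply a per-reading function and keep only the readings it did not drop."""
    return [out for out in (apply(r) for r in readings) if out is not None]


def drop_cameras(reading: dict) -> Optional[dict]:
    # Hypothetical policy: discard anything produced by a camera sensor.
    return None if reading["sensor_name"].startswith("camera.") else reading


batch = [
    {"sensor_name": "pressure.hydraulic", "value": 101325.0},
    {"sensor_name": "camera.cabin", "value": 0.0},
    {"sensor_name": "temp.ambient", "value": 21.0},
]

safe_batch = apply_batch(batch, drop_cameras)
print([r["sensor_name"] for r in safe_batch])  # ['pressure.hydraulic', 'temp.ambient']
```

One consequence worth keeping in mind: the output list can be shorter than the input, so callers should not rely on positional correspondence between the two.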

Classifying Sensitivity Tiers

The engine exposes a classify_sensitivity method that inspects a reading and returns one of five sensitivity labels. This is useful for telemetry dashboards that want to show, in real time, how much of your data stream falls into each risk bucket.


from pyv_edge_agent.privacy_firewall.policy import (
    SENSITIVITY_RAW,
    SENSITIVITY_SENSITIVE,
    SENSITIVITY_SAFE_SUMMARY,
    SENSITIVITY_EVENT_ONLY,
    SENSITIVITY_LOCAL_ONLY,
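)
from pyv_edge_agent.types import SensorReading

# engine is a PrivacyPolicyEngine constructed as in the earlier examples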
reading = SensorReading(
    sensor_name="temp.warehouse",
    timestamp=1717000000.0,
    value=18.0,
    unit="celsius",
    metadata={"floor": "2"},
)

label = engine.classify_sensitivity(reading)
# label == SENSITIVITY_SAFE_SUMMARY because metadata is present but not sensitive

The five tiers are:

raw — No metadata, no matching rules. Lowest risk.
safe_summary — Metadata exists but contains no sensitive keys.
sensitive — Metadata contains keys that match redact, mask, or hash patterns.
event_only — The sensor matches a drop_fields pattern; only events (not raw readings) would be generated.
local_only — The sensor matches local_only; metadata is stripped.
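
One way to read the five tier definitions is as a decision chain. The sketch below is an interpretation, not the engine's code: the `classify` function, the plain-string labels, and above all the order of the checks (sensor-level rules first, then metadata inspection) are assumptions, since the text does not say which tier wins when several apply.

```python
from fnmatch import fnmatch


def matches(name: str, patterns: list[str]) -> bool:
    return any(fnmatch(name, p) for p in patterns)


def classify(sensor_name: str, metadata: dict, rules: dict) -> str:
    """Assign one of the five tiers; the precedence used here is assumed."""
    if matches(sensor_name, rules.get("drop_fields", [])):
        return "event_only"
    if matches(sensor_name, rules.get("local_only", [])):
        return "local_only"
    if not metadata:
        return "raw"
    field_patterns = (
        rules.get("redact_fields", [])
        + rules.get("mask_fields", [])
        + rules.get("hash_fields", [])
    )
    if any(matches(key, field_patterns) for key in metadata):
        return "sensitive"
    return "safe_summary"


rules = {
    "redact_fields": ["tenant_id", "patient_id", "staff_name"],
    "mask_fields": ["email", "phone"],
    "hash_fields": ["device_mac", "serial_number"],
    "drop_fields": ["camera.*", "microphone.*"],
    "local_only": ["motion.patient_room.*"],
}

print(classify("temp.warehouse", {"floor": "2"}, rules))            # safe_summary
print(classify("temp.warehouse", {}, rules))                        # raw
print(classify("temp.ward", {"tenant_id": "acme"}, rules))          # sensitive
print(classify("camera.lobby", {}, rules))                          # event_only
print(classify("motion.patient_room.3b", {"bed_id": "7"}, rules))   # local_only
```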

Event Dict Redaction

Not all data that leaves the edge is a SensorReading. The pipeline also generates event dictionaries — for example, when a rule threshold is breached and an alert is fired. The engine's apply_event method applies the same ruleset to flat dictionaries.


event = {
    "sensor_name": "pressure.hydraulic",
    "timestamp": 1717000000.0,
    "severity": "critical",
    "message": "Pressure exceeded 110% of rated limit",
    "tenant_id": "acme-corp",
    "staff_email": "ops@acme.com",
}

safe_event = engine.apply_event(event)
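# With "tenant_id" and "*email" in redact_fields, safe_event no longer
# contains tenant_id or staff_email. A bare "email" pattern would not match
# the key "staff_email", because glob patterns must match the whole key.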
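
Event keys often carry prefixes that reading metadata does not (staff_email rather than email), so it is worth checking which keys a pattern list really covers. The helper below is a hypothetical audit aid built on `fnmatch.fnmatch`; it assumes, as stated above, that apply_event matches patterns against the event's top-level keys.

```python
from fnmatch import fnmatch

event = {
    "sensor_name": "pressure.hydraulic",
    "timestamp": 1717000000.0,
    "severity": "critical",
    "tenant_id": "acme-corp",
    "staff_email": "ops@acme.com",
}


def matching_keys(event: dict, patterns: list[str]) -> list[str]:
    """List the event keys that at least one glob pattern would match."""
    return sorted(k for k in event if any(fnmatch(k, p) for p in patterns))


exact = matching_keys(event, ["tenant_id", "email"])
globbed = matching_keys(event, ["tenant_id", "*email"])

print(exact)    # ['tenant_id']
print(globbed)  # ['staff_email', 'tenant_id']
```

Running a check like this against a sample of real events before rollout is a cheap way to catch patterns that look right but match nothing.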

Performance and Operational Notes

Field-level redaction is more expensive than sensor-level filtering because it iterates over every key in the metadata dictionary for every pattern in the ruleset. In micro-benchmarks on a Raspberry Pi 5:

Empty metadata: ~0.4 µs per reading (just the sensor_name check).
5 metadata keys, 10 field patterns: ~2.8 µs per reading.
20 metadata keys, 50 field patterns: ~11 µs per reading.

For a fleet of one thousand sensors polling every five seconds, the worst-case overhead is still well under one percent of CPU time. If you observe higher latency, consider:

Consolidating overlapping patterns (e.g., "email_work" is already covered by "email_*", so the narrower pattern can be removed).
Pre-filtering at the adapter layer so that obviously benign sensors never reach the privacy engine at all.
Using local_only instead of listing every sensitive metadata key individually; it strips the entire dictionary in a single pass.
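
The "well under one percent" estimate above follows from simple arithmetic on the quoted figures. The snippet works it through for the worst-case measurement; it assumes the load is spread evenly over time and expresses the result as a share of a single core.

```python
sensors = 1_000
poll_interval_s = 5.0
worst_case_us = 11.0  # 20 metadata keys, 50 field patterns (figure quoted above)

readings_per_second = sensors / poll_interval_s            # 200 readings/s
busy_us_per_second = readings_per_second * worst_case_us   # 2200 µs of work per second
cpu_fraction = busy_us_per_second / 1_000_000              # share of one core

print(f"{readings_per_second:.0f} readings/s")
print(f"{cpu_fraction:.2%} of one core")  # 0.22% of one core
```

At this rate the engine would need roughly five times as many sensors, or a five times faster polling interval, before it reached one percent of a core.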

Summary

Field redaction gives you surgical control over what leaves the edge. With PrivacyPolicyEngine, you can remove individual metadata keys (redact), partially obscure them (mask), deterministically pseudonymise them (hash), or drop entire sensors (drop) and strip metadata wholesale (local_only). Configuration is available both through declarative TOML and a fully mutable programmatic API, making it easy to integrate with dynamic fleet-management systems and incident-response workflows.
