Field Redaction Patterns
How to redact specific metadata fields before cloud sync using sensor pattern matching, glob-style wildcards, and both TOML configuration and programmatic APIs.
Published Jun 2, 2026
Field-Level Control
Whole-sensor filtering — dropping or masking every reading from a given sensor — is often too
blunt. A temperature sensor in a patient room may produce perfectly benign numeric values, but
its metadata might include a patient_id, a room_number, or the
staff_name of the nurse who last calibrated the device. You want the temperature
stream for analytics, but you need the personal identifiers stripped before egress.
The Pyvorin Edge Runtime solves this with the PrivacyPolicyEngine in
pyv_edge_agent/privacy_firewall/policy.py. This engine operates at the field level,
applying wildcard patterns to both the sensor_name and the keys inside the
metadata dictionary. It supports five actions: redact, mask,
hash, drop, and local_only.
The PrivacyPolicyEngine API
Unlike the simpler PrivacyPolicy class (which matches only sensor names), the
engine also matches patterns against metadata keys. The core data structures are PrivacyRuleset
and PrivacyPolicyEngine.
from pyv_edge_agent.privacy_firewall.policy import (
    PrivacyPolicyEngine,
    PrivacyRuleset,
)
from pyv_edge_agent.types import SensorReading

ruleset = PrivacyRuleset(
    redact_fields=["tenant_id", "patient_id", "staff_name"],
    mask_fields=["email", "phone"],
    hash_fields=["device_mac", "serial_number"],
    drop_fields=["camera.*", "microphone.*"],
    local_only=["motion.patient_room.*"],
)
engine = PrivacyPolicyEngine(ruleset=ruleset)

reading = SensorReading(
    sensor_name="temp.patient_room.3b",
    timestamp=1717000000.0,
    value=37.2,
    unit="celsius",
    metadata={
        "tenant_id": "acme-hospital",
        "staff_name": "Nurse Joy",
        "device_mac": "aa:bb:cc:dd:ee:ff",
        "floor": "3",
    },
)
out = engine.apply(reading)
# out.metadata is now:
# {
#     "device_mac": "<SHA-256 hex digest of the MAC address>",
#     "floor": "3",
# }
# tenant_id and staff_name are removed (redacted)
# device_mac is replaced by its SHA-256 hex digest
# floor is untouched
Understanding the Five Field Actions
redact_fields — Complete Removal
Any metadata key that matches a pattern in redact_fields is deleted from the
dictionary. The key and its value vanish entirely; downstream consumers have no way to know that
the key ever existed. This is the right choice for high-sensitivity identifiers such as
patient_id, social_security_number, or tenant_id when
the downstream analytics pipeline does not need them.
mask_fields — Partial Retention
Masking replaces the value with a partially obscured representation. The engine's
_mask_value static method behaves as follows:
- If the value is a string of length > 4, it keeps the first two characters and last two
characters, replacing the middle with asterisks. For example,
"alice.smith@hospital.org" becomes "al********************rg".
- If the string is 4 characters or fewer, it is replaced entirely with asterisks.
- Non-string values are replaced with the literal string "[MASKED]".
Use masking when human operators may later need to eyeball a value for triage or support, but the full precision is not required for automated processing. Email addresses and phone numbers are classic examples.
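The masking rules above can be restated as a small standalone function. This is a minimal sketch of the described behaviour, not the engine's _mask_value source: the name mask_value is hypothetical, and the sketch assumes one asterisk per hidden character so that the masked string keeps the original length.

```python
def mask_value(value):
    """Illustrative restatement of the masking rules (hypothetical helper)."""
    if not isinstance(value, str):
        # Non-string values are replaced wholesale.
        return "[MASKED]"
    if len(value) <= 4:
        # Short strings are fully starred out.
        return "*" * len(value)
    # Keep the first two and last two characters, star out the middle.
    # Assumption: one asterisk per hidden character.
    return value[:2] + "*" * (len(value) - 4) + value[-2:]


masked_email = mask_value("alice.smith@hospital.org")
masked_pin = mask_value("1234")
masked_lat = mask_value(51.5074)
print(masked_email, masked_pin, masked_lat)
```

Under that assumption the masked value still reveals the length of the original, which is worth keeping in mind for short or fixed-format identifiers.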
hash_fields — Deterministic Pseudonymisation
The engine computes SHA-256(value) and stores the hex digest as the new value.
Because the hash is deterministic, a given MAC address or serial number always maps to the same
digest. This lets you join datasets or count unique devices without ever storing the raw
identifier in the cloud.
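The determinism property is easy to check with the standard library. The sketch below uses hashlib directly rather than the engine; the helper name pseudonymise is hypothetical, and it assumes the value is converted to a string and encoded as UTF-8 before hashing, which the engine may or may not do.

```python
import hashlib


def pseudonymise(value):
    """Return the SHA-256 hex digest of a value's string form (hypothetical helper)."""
    # Assumption: stringify, then encode as UTF-8 before hashing.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


first = pseudonymise("aa:bb:cc:dd:ee:ff")
second = pseudonymise("aa:bb:cc:dd:ee:ff")
other = pseudonymise("aa:bb:cc:dd:ee:00")

# Same input, same digest; different input, different digest.
print(first == second, first == other, len(first))
```

Determinism cuts both ways: anyone who can enumerate candidate identifiers can recompute the digests and match them, so hashed values are best treated as pseudonymous rather than anonymous.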
drop_fields — Sensor-Level Blacklist
Patterns in drop_fields are matched against the sensor_name, not
metadata keys. If a match is found, the entire reading is discarded. This is identical in
behaviour to the PrivacyPolicy.drop action, but it lives inside the ruleset so that
you can manage all privacy decisions in one configuration object.
local_only — Strip Metadata, Keep Value
Sensors matching local_only have their metadata dictionary emptied, but the
value, unit, and timestamp are retained. This is useful
for motion detectors or occupancy sensors where the numeric event count is harmless, but
metadata tags like room_number or bed_id would reveal too much
context.
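Taken together, the five actions can be modelled as one function over a sensor name and a metadata dictionary. The sketch below is a simplified stand-in, not the engine's implementation: the function name apply_rules, the plain-dict ruleset, and the order in which rules are checked (drop, then local_only, then redact, mask, hash) are assumptions made for illustration, and masking is reduced to a fixed "[MASKED]" string to keep the example short.

```python
import hashlib
from fnmatch import fnmatchcase


def matches(name, patterns):
    """True if the name matches any glob pattern in the list."""
    return any(fnmatchcase(name, p) for p in patterns)


def apply_rules(sensor_name, metadata, rules):
    """Return cleaned metadata, or None if the reading is dropped.

    Hypothetical model; the check order is an assumption.
    """
    if matches(sensor_name, rules.get("drop_fields", [])):
        return None  # whole reading discarded
    if matches(sensor_name, rules.get("local_only", [])):
        return {}  # value, unit, and timestamp kept; metadata emptied
    cleaned = {}
    for key, value in metadata.items():
        if matches(key, rules.get("redact_fields", [])):
            continue  # key and value removed
        if matches(key, rules.get("mask_fields", [])):
            cleaned[key] = "[MASKED]"  # simplified masking
        elif matches(key, rules.get("hash_fields", [])):
            cleaned[key] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        else:
            cleaned[key] = value
    return cleaned


rules = {
    "redact_fields": ["staff_name"],
    "hash_fields": ["device_mac"],
    "drop_fields": ["camera.*"],
    "local_only": ["motion.patient_room.*"],
}
meta = {"staff_name": "Nurse Joy", "device_mac": "aa:bb:cc:dd:ee:ff", "floor": "3"}

kept = apply_rules("temp.patient_room.3b", meta, rules)
dropped = apply_rules("camera.lobby", meta, rules)
stripped = apply_rules("motion.patient_room.3b", meta, rules)
```

The three calls exercise the three sensor-level outcomes: field rules applied, reading dropped, and metadata stripped.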
Sensor Pattern Matching with Wildcards
The engine uses fnmatch.fnmatch for all pattern matching. The syntax is the same
as Unix shell globs, which is simpler than full regular expressions and avoids the
backslash-escaping pitfalls that make regex error-prone in configuration files.
from pyv_edge_agent.privacy_firewall.policy import PrivacyPolicyEngine

# Example patterns and sensor names they match
patterns = {
    "temp.*": ["temp.living_room", "temp.warehouse"],
    "temp.b*": ["temp.basement", "temp.boiler_room"],
    "*.office.?": ["motion.office.a", "light.office.b"],
    "hr.[0-9][0-9][0-9]": ["hr.001", "hr.042"],
    # [!ext] is a character class: it rejects any name whose first character
    # after "camera." is e, x, or t, not only names that start with "ext".
    "camera.[!ext]*": ["camera.lobby", "camera.reception"],
    "*": ["anything.at.all"],
}
Because matching is done with fnmatch, you cannot use regex anchors such as
^ or $. The pattern is implicitly anchored to the entire string, so
temp.* will not match my_temp_sensor because the wildcard
does not cover the prefix. If you need substring matching, use the * glob on both
sides: *temp*.
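The anchoring and character-class behaviour can be verified with the standard library alone. This sketch uses fnmatch.fnmatchcase, which is case-sensitive on every platform; fnmatch.fnmatch, which the engine is described as using, first normalises case according to the operating system's conventions, so results can differ on case-insensitive platforms such as Windows.

```python
from fnmatch import fnmatchcase

checks = [
    ("temp.living_room", "temp.*"),
    ("my_temp_sensor", "temp.*"),      # no match: pattern covers the whole string
    ("my_temp_sensor", "*temp*"),      # substring match via globs on both sides
    ("hr.042", "hr.[0-9][0-9][0-9]"),
    ("hr.42", "hr.[0-9][0-9][0-9]"),   # no match: only two digits
    ("camera.lobby", "camera.[!ext]*"),
    ("camera.exterior", "camera.[!ext]*"),  # no match: starts with "e"
]

results = {(name, pattern): fnmatchcase(name, pattern) for name, pattern in checks}
for (name, pattern), matched in results.items():
    print(f"{name!r} vs {pattern!r}: {matched}")
```

Running candidate patterns through a check like this before rolling them out is a cheap way to catch a glob that is broader or narrower than intended.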
Practical Example: Redacting tenant_id, Personal Names, and GPS
Consider a multi-tenant logistics company that deploys Pyvorin Edge on delivery vans. Each van carries temperature sensors for the cold chain, a GPS tracker, and a driver tablet that reports the driver's name for shift management. The company wants to send temperature compliance data to a central warehouse, but the driver's name and precise GPS coordinates must never leave the vehicle.
from pyv_edge_agent.privacy_firewall.policy import PrivacyRuleset, PrivacyPolicyEngine
from pyv_edge_agent.types import SensorReading

ruleset = PrivacyRuleset(
    # tenant_id must be listed here for it to be removed from the output below
    redact_fields=["tenant_id", "driver_name", "driver_id", "shift_id"],
    mask_fields=["gps_lat", "gps_lon"],
    hash_fields=["van_serial", "tablet_mac"],
    drop_fields=["camera.cabin", "microphone.cabin"],
    local_only=["motion.cabin"],
)
engine = PrivacyPolicyEngine(ruleset)

raw = SensorReading(
    sensor_name="temp.cold_chain.zone_a",
    timestamp=1717000000.0,
    value=2.4,
    unit="celsius",
    metadata={
        "tenant_id": "logistics-corp",
        "driver_name": "Sarah Connor",
        "van_serial": "VH-2024-8892",
        "gps_lat": 51.5074,
        "gps_lon": -0.1278,
        "route_id": "R-4421",
    },
)
safe = engine.apply(raw)
print(safe.metadata)
# {
#     "van_serial": "<SHA-256 hex digest of the serial>",
#     "gps_lat": "[MASKED]",
#     "gps_lon": "[MASKED]",
#     "route_id": "R-4421",
# }
Note that tenant_id is gone entirely (redact_fields),
driver_name is gone, the GPS coordinates are masked (non-string values become
"[MASKED]"), and the van serial is hashed. The route_id remains
because it matches no pattern.
Declarative Configuration in config.toml
The PrivacyRuleset is designed to be serialised to and from plain Python dicts,
which means it maps naturally to TOML arrays. The Edge Runtime configuration loader does not
yet automatically instantiate a PrivacyPolicyEngine from the TOML tree, but the
pattern is so common that we document the canonical representation here. You can load it
manually in your agent bootstrap code.
[privacy]
enabled = true
[privacy.ruleset]
redact_fields = ["tenant_id", "patient_id", "staff_name", "ssn"]
mask_fields = ["email", "phone", "gps_lat", "gps_lon"]
hash_fields = ["device_mac", "serial_number", "imei"]
drop_fields = ["camera.*", "microphone.*"]
local_only = ["motion.patient_room.*", "occupancy.bed_*"]
In your bootstrap module, load the configuration and construct the engine:
from pyv_edge_agent.config import Config
from pyv_edge_agent.privacy_firewall.policy import PrivacyRuleset, PrivacyPolicyEngine
cfg = Config.from_file("/etc/pyvorin-edge/config.toml")
ruleset_dict = cfg.get("privacy", "ruleset", {})
ruleset = PrivacyRuleset.from_dict(ruleset_dict)
engine = PrivacyPolicyEngine(ruleset)
# Now inject 'engine' into your pipeline.
Programmatic API: Building Rulesets at Runtime
Static TOML configuration is sufficient for most deployments, but there are legitimate reasons to mutate rules at runtime: fleet-wide policy rollouts delivered over MQTT, tenant-specific overrides in multi-tenant gateways, or emergency redaction rules triggered by a security incident.
from pyv_edge_agent.privacy_firewall.policy import PrivacyRuleset, PrivacyPolicyEngine
from pyv_edge_agent.types import SensorReading

# Start from an empty ruleset
ruleset = PrivacyRuleset()
engine = PrivacyPolicyEngine(ruleset)

# Dynamically add a redaction rule in response to an incident
engine.ruleset.redact_fields.append("leaked_api_key")

# Apply to a batch of readings
batch = [
    SensorReading(
        sensor_name="pressure.hydraulic",
        timestamp=1717000001.0,
        value=101325.0,
        unit="pascal",
        metadata={"leaked_api_key": "sk-12345", "zone": "A"},
    ),
    SensorReading(
        sensor_name="temp.ambient",
        timestamp=1717000002.0,
        value=21.0,
        unit="celsius",
        metadata={"zone": "A"},
    ),
]
safe_batch = engine.apply_batch(batch)
# safe_batch[0].metadata == {"zone": "A"}
# safe_batch[1].metadata == {"zone": "A"}
The apply_batch method is a convenience wrapper that iterates over a sequence of
readings, applies apply() to each, and filters out None results
(dropped readings). It returns a plain Python list that is safe to pass directly
into CloudSyncQueue.enqueue() or HTTPCloudUploader.post_batch().
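The apply-then-filter behaviour described above amounts to a few lines of Python. The sketch below is a hypothetical stand-in that takes the per-reading function as an argument and uses plain dicts instead of SensorReading objects; it is not the engine's own method.

```python
def apply_batch(apply, readings):
    """Apply a per-reading function and discard None results (hypothetical stand-in)."""
    return [out for out in map(apply, readings) if out is not None]


def demo_apply(reading):
    # Toy policy: discard camera readings, pass everything else through.
    return None if reading["sensor_name"].startswith("camera.") else reading


batch = [
    {"sensor_name": "temp.ambient", "value": 21.0},
    {"sensor_name": "camera.lobby", "value": 0},
    {"sensor_name": "pressure.hydraulic", "value": 101325.0},
]
safe_batch = apply_batch(demo_apply, batch)
print([r["sensor_name"] for r in safe_batch])
```

Because dropped readings are filtered out, the result can be shorter than the input, so positions in the output list do not necessarily line up with positions in the original batch.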
Classifying Sensitivity Tiers
The engine exposes a classify_sensitivity method that inspects a reading and
returns one of five sensitivity labels. This is useful for telemetry dashboards that want to
show, in real time, how much of your data stream falls into each risk bucket.
from pyv_edge_agent.privacy_firewall.policy import (
    SENSITIVITY_RAW,
    SENSITIVITY_SENSITIVE,
    SENSITIVITY_SAFE_SUMMARY,
    SENSITIVITY_EVENT_ONLY,
    SENSITIVITY_LOCAL_ONLY,
)

reading = SensorReading(
    sensor_name="temp.warehouse",
    timestamp=1717000000.0,
    value=18.0,
    unit="celsius",
    metadata={"floor": "2"},
)
label = engine.classify_sensitivity(reading)
# label == SENSITIVITY_SAFE_SUMMARY because metadata is present but not sensitive
The five tiers are:
- raw: No metadata, no matching rules. Lowest risk.
- safe_summary: Metadata exists but contains no sensitive keys.
- sensitive: Metadata contains keys that match redact, mask, or hash patterns.
- event_only: The sensor matches a drop_fields pattern; only events (not raw readings) would be generated.
- local_only: The sensor matches local_only; metadata is stripped.
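The tier definitions can be read as a decision procedure. The sketch below is one possible reading, not the engine's classify_sensitivity: the function name classify, the plain-dict ruleset, the string labels, and in particular the precedence (drop before local_only before field patterns) are assumptions.

```python
from fnmatch import fnmatchcase


def matches(name, patterns):
    """True if the name matches any glob pattern in the list."""
    return any(fnmatchcase(name, p) for p in patterns)


def classify(sensor_name, metadata, rules):
    """Hypothetical tier classifier; the precedence order is an assumption."""
    if matches(sensor_name, rules.get("drop_fields", [])):
        return "event_only"
    if matches(sensor_name, rules.get("local_only", [])):
        return "local_only"
    field_patterns = (
        rules.get("redact_fields", [])
        + rules.get("mask_fields", [])
        + rules.get("hash_fields", [])
    )
    if any(matches(key, field_patterns) for key in metadata):
        return "sensitive"
    return "safe_summary" if metadata else "raw"


rules = {
    "redact_fields": ["tenant_id"],
    "mask_fields": ["email"],
    "drop_fields": ["camera.*"],
    "local_only": ["motion.patient_room.*"],
}
labels = [
    classify("temp.warehouse", {}, rules),
    classify("temp.warehouse", {"floor": "2"}, rules),
    classify("temp.warehouse", {"email": "ops@acme.com"}, rules),
    classify("camera.lobby", {"floor": "2"}, rules),
    classify("motion.patient_room.3b", {"floor": "2"}, rules),
]
print(labels)
```

Counting these labels over a window of readings gives the per-tier breakdown a dashboard would display.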
Event Dict Redaction
Not all data that leaves the edge is a SensorReading. The pipeline also generates
event dictionaries — for example, when a rule threshold is breached and an alert is fired. The
engine's apply_event method applies the same ruleset to flat dictionaries.
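event = {
    "sensor_name": "pressure.hydraulic",
    "timestamp": 1717000000.0,
    "severity": "critical",
    "message": "Pressure exceeded 110% of rated limit",
    "tenant_id": "acme-corp",
    "staff_email": "ops@acme.com",
}
safe_event = engine.apply_event(event)
# Assuming redact_fields lists "tenant_id" and "staff_email" (or a glob such as "*_email"),
# safe_event no longer contains tenant_id or staff_email.
# A bare "email" pattern would not match staff_email, because globs cover the whole key.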
Performance and Operational Notes
Field-level redaction is more expensive than sensor-level filtering because it iterates over
every key in the metadata dictionary for every pattern in the ruleset. In
micro-benchmarks on a Raspberry Pi 5:
- Empty metadata: ~0.4 µs per reading (just the sensor_name check).
- 5 metadata keys, 10 field patterns: ~2.8 µs per reading.
- 20 metadata keys, 50 field patterns: ~11 µs per reading.
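For a fleet of one thousand sensors polling every five seconds (200 readings per second), even the 11 µs case costs about 2.2 ms of CPU time per second, roughly 0.2% of one core, so the worst-case overhead is still well under one percent. If you observe higher latency, consider: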
- Consolidating overlapping patterns (e.g., "email_*" and "email_work" can often be merged into "email*").
- Pre-filtering at the adapter layer so that obviously benign sensors never reach the privacy engine at all.
- Using local_only instead of listing every sensitive metadata key individually; it strips the entire dictionary in a single pass.
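To see how key and pattern counts affect cost on your own hardware, a rough measurement harness helps. The sketch below times a pure-Python nested loop over keys and patterns with timeit; it models the shape of the work only, so its absolute numbers will not match the engine's figures quoted above. The function name scan and the synthetic key and pattern names are made up for the example.

```python
import timeit
from fnmatch import fnmatchcase


def scan(keys, patterns):
    """Count key/pattern matches; models the key-by-pattern loop."""
    return sum(1 for key in keys for pattern in patterns if fnmatchcase(key, pattern))


keys = [f"field_{i}" for i in range(20)]
patterns = [f"secret_{i}*" for i in range(50)]  # none match, so every pair is tested

runs = 200
seconds = timeit.timeit(lambda: scan(keys, patterns), number=runs)
per_reading_us = seconds / runs * 1e6

readings_per_second = 1000 / 5  # one thousand sensors polled every five seconds
cpu_fraction = readings_per_second * per_reading_us / 1e6
print(f"{per_reading_us:.1f} µs per reading, {cpu_fraction:.2%} of one core")
```

Re-running the harness after consolidating patterns shows whether the change actually reduced the number of comparisons per reading.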
Summary
Field redaction gives you surgical control over what leaves the edge. With
PrivacyPolicyEngine, you can remove individual metadata keys (redact),
partially obscure them (mask), deterministically pseudonymise them (hash),
or drop entire sensors (drop) and strip metadata wholesale (local_only).
Configuration is available both through declarative TOML and a fully mutable programmatic API,
making it easy to integrate with dynamic fleet-management systems and incident-response
workflows.