ARM64 NEON Vectorization in Pyvorin Edge
A primer on ARM64 NEON SIMD, how Pyvorin uses it for vector math, automatic vs manual kernel dispatch, and benchmarking on Raspberry Pi 5.
Published Jun 2, 2026
What Is SIMD?
Single Instruction, Multiple Data (SIMD) is a processor feature that allows one instruction to operate on multiple data elements simultaneously. On ARM64 processors like the Raspberry Pi 5's BCM2712, the NEON extension provides 128-bit vector registers that can hold four 32-bit floats, two 64-bit doubles, or sixteen 8-bit integers. A single FADD instruction can add four pairs of floats in the same cycle count as one scalar addition.
For edge workloads — sensor averaging, RMS calculations, threshold checks over time-series buffers — SIMD can yield 2× to 8× speedups depending on data width and loop structure. On battery-powered devices, this also means lower energy per computation because the CPU finishes faster and returns to idle.
ARM64 NEON Primer
NEON has 32 vector registers (V0–V31), each 128 bits wide. Key instruction categories include:
- Arithmetic:
FADD,FSUB,FMUL,FDIV,FMLA(fused multiply-add) - Load/Store:
LD1,ST1for contiguous vector loads and stores - Reduction:
FADDPfor pairwise addition (used insum) - Comparison:
FCMEQ,FCMGTfor vector compares
NEON is deterministic: unlike GPUs, there are no warp schedulers or shared-memory bank conflicts. This makes NEON ideal for real-time edge pipelines where worst-case latency matters more than average throughput.
How Pyvorin Uses NEON
The Pyvorin Edge compiler routes kernel workloads (detected by CompilerBridge._detect_workload_type()) to the CKernelBackend. When the backend recognizes a reduction or element-wise loop over a homogeneous float array, it generates NEON vectorized C code. The generated loop structure looks like this:
// Pseudocode of generated NEON kernel
float reduce_sum(const float* __restrict data, size_t n) {
float32x4_t acc = vdupq_n_f32(0.0f);
size_t i = 0;
// Vectorized main loop — 4 elements per iteration
for (; i + 4 <= n; i += 4) {
float32x4_t vec = vld1q_f32(&data[i]);
acc = vaddq_f32(acc, vec);
}
// Horizontal reduction of the 4-lane accumulator
float32x2_t low = vget_low_f32(acc);
float32x2_t high = vget_high_f32(acc);
float32x2_t sum2 = vadd_f32(low, high);
float result = vget_lane_f32(vpadd_f32(sum2, sum2), 0);
// Scalar tail loop for remaining elements
for (; i < n; ++i) {
result += data[i];
}
return result;
}
The backend ensures alignment and emits scalar tail handling so that arrays of any length produce correct results. It also marks pointers with __restrict to help the C compiler's auto-vectorizer when Pyvorin falls back to generic C generation.
Automatic Vectorization vs Manual Kernels
Pyvorin provides two ways to leverage NEON:
- Automatic vectorization: Write plain Python loops. The
CKernelBackendanalyzes the AST, builds aTypedKernelPlan, and emits vectorized C automatically. This is the recommended path for 95% of users. - Manual kernels: Advanced users can write C intrinsics directly and load them via
ModuleLoaderand anABIContract. This bypasses the compiler entirely but requires deep NEON expertise.
# Automatic vectorization — user writes Python, Pyvorin emits NEON
def rms_signal(samples: list[float]) -> float:
total = 0.0
for s in samples:
total = total + s * s
return (total / len(samples)) ** 0.5
# Manual kernel — user writes C, loads via ABIContract
# (only recommended for custom algorithms not expressible in Python)
Function Multi-Versioning (FMV) Dispatch
Not all edge devices have identical NEON capabilities. The Raspberry Pi 4 supports NEON and VFPv4; the Pi 5 adds SVE2 (Scalable Vector Extension 2) support in some configurations. Pyvorin uses Function Multi-Versioning (FMV) to compile multiple variants of the same kernel and dispatch at load time based on getauxval(AT_HWCAP).
The loader in edge_runtime/pyv_edge_agent/module_host/loader.py checks CPU features before selecting the widest vector path. If SVE2 is absent, it falls back to the 128-bit NEON variant; if NEON is absent, it falls back to scalar C. This means your deployment package contains one .so but multiple code paths inside it.
Benchmarking NEON vs Scalar on Pi 5
To verify that vectorization is working on your hardware, run a micro-benchmark comparing interpreted Python, scalar C, and NEON C for the same algorithm:
import time
import numpy as np
from pyvorin_edge.compiler_bridge import CompilerBridge
bridge = CompilerBridge()
# Generate synthetic sensor data
samples = np.random.randn(100_000).astype(np.float32).tolist()
# Python reference
def sum_python(data: list[float]) -> float:
total = 0.0
for d in data:
total = total + d
return total
# Compile to native (automatically routes to kernel backend)
source = '''
def sum_kernel(data: list[float]) -> float:
total = 0.0
for d in data:
total = total + d
return total
'''
so_path = bridge.compile_hotpath(source, "sum_kernel", "/tmp/sum_kernel.so")
# Load compiled module
from pyv_edge_agent.module_host.loader import ModuleLoader
from pyvorin_edge.compiler_bridge import CompilerBridge
abi = bridge.generate_abi(so_path, "sum_kernel")
loader = ModuleLoader()
loader.load(str(so_path), abi)
# Benchmark
def bench(fn, data, iterations=100):
start = time.perf_counter()
for _ in range(iterations):
fn(data)
return (time.perf_counter() - start) * 1000.0
py_ms = bench(sum_python, samples)
native_ms = bench(lambda d: loader.call("sum_kernel", d), samples)
print(f"Python: {py_ms:.2f} ms")
print(f"Native: {native_ms:.2f} ms")
print(f"Speedup: {py_ms / native_ms:.1f}x")
On a Raspberry Pi 5 at 2.4 GHz, you should see a 4× to 6× speedup for float summation and up to 8× for fused multiply-add kernels. If the speedup is below 2×, check that the kernel backend is being used (log output should say "Compiled module via CKernelBackend") and that your array is a list[float] with at least 10,000 elements — below that, setup overhead dominates.
Detecting Vectorization at Runtime
The compiler bridge does not expose a direct "is vectorized" flag, but you can infer it from the compiled module's symbol table and performance characteristics. A NEON kernel will typically show:
- Symbol names containing
_neonor_vecin the disassembly - Performance scaling with array length (linear with a shallow slope)
- No Python stack frames in
cProfileoutput for the compiled function
# Inspect symbols in the compiled .so
objdump -d /tmp/sum_kernel.so | grep -i "fadd\|vadd\|ld1"
# Look for NEON vector instructions in the disassembly
Limitations of NEON in Pyvorin
- Integer arrays: NEON is most effective on
float32andfloat64. Integer kernels vectorize but may see smaller gains due to narrower lanes. - Branches inside loops: If your loop contains
ifstatements, the backend may emit scalar predication instead of vector compares, reducing speedup. - Non-contiguous data: The kernel backend assumes contiguous arrays (
contiguous=TrueinArrayArg). Strided or ragged data falls back to scalar loops. - String data: NEON is not used for string operations;
c_char_parguments always follow the scalar path.
Best Practices for NEON-Friendly Kernels
- Keep loops simple: one induction variable, no nested loops, no early
break. - Use
float(32-bit) for sensor data when precision allows — it fills four NEON lanes versus two fordouble. - Avoid function calls inside the loop body; inline all arithmetic.
- Prefer reductions (
sum,min,max) over per-element output arrays for the highest memory-bandwidth efficiency. - Batch small arrays into larger ones before calling the compiled kernel; overhead is amortized across more elements.