Benchmarking Correctly
Accurate benchmarking is essential for honest performance claims. Pyvorin separates compile time, first-run latency, and warm-run throughput so you can evaluate real-world impact.
Cold vs Warm Runs
| Metric | What it measures | Why it matters |
|---|---|---|
| `compile_time_ms` | Time to compile the AST to native code | Affects only the first invocation. |
| `first_run_ms` | First execution after compilation | Includes cache warm-up and library loading. |
| `warm_run_avg_ms` | Average of subsequent runs | The steady-state throughput users experience. |
Always report all three. A fast warm run with a 500 ms compile time may not improve total latency for short scripts.
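To illustrate what each metric isolates, here is a minimal harness sketch. The `compile_fn` and `run_fn` callables are hypothetical stand-ins for whatever compile and invoke entry points your toolchain exposes, not Pyvorin's actual API:

```python
import statistics
import time

def measure(compile_fn, run_fn, warm_runs=5):
    """Split timing into compile_time_ms, first_run_ms, and warm_run_avg_ms."""
    t0 = time.perf_counter()
    compiled = compile_fn()                      # compile AST to native code
    compile_ms = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    run_fn(compiled)                             # first run: cache warm-up, library loading
    first_ms = (time.perf_counter() - t0) * 1000

    warm = []
    for _ in range(warm_runs):                   # steady state
        t0 = time.perf_counter()
        run_fn(compiled)
        warm.append((time.perf_counter() - t0) * 1000)
    return compile_ms, first_ms, statistics.fmean(warm)
```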
Wall-Clock vs CPU-Hours
- Wall-clock time is what your users feel. Use it for end-to-end comparisons.
- CPU-hours matter for batch jobs and cloud billing. Pyvorin's native code typically uses fewer CPU-hours than CPython because it avoids interpreter overhead.
- Do not assume that reduced wall-clock time implies fewer CPU-hours: parallel execution can finish sooner while burning more total CPU. Measure both; with Pyvorin they often improve together (the sketch below captures them side by side).
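A minimal way to capture both in one call, assuming a single-process workload (`time.process_time` does not see child processes):

```python
import time

def timed(fn, *args, **kwargs):
    """Measure wall-clock and per-process CPU time for one call."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    result = fn(*args, **kwargs)
    wall = time.perf_counter() - wall0           # what users feel
    cpu = time.process_time() - cpu0             # what feeds CPU-hour billing
    print(f"wall {wall:.3f}s  cpu {cpu:.3f}s  utilisation {cpu / wall:.2f}x")
    return result
```

For multi-process batch jobs, aggregate child-process time with `os.times()` or your scheduler's accounting instead.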
Why Microbenchmarks Mislead
Microbenchmarks (tight loops of a few operations) can show extreme speedups that do not translate to production:
- Amortisation: A 1,000× speedup on a 1 ms operation saves under a millisecond per call. A 10× speedup on a 5-minute ETL job saves 4.5 minutes.
- Memory effects: Microbenchmarks often fit entirely in L1 cache. Real workloads have cache misses, branch mispredictions, and GC pressure.
- Compilation overhead: Microbenchmarks rarely include compile time; production jobs pay it (see the break-even sketch after this list).
- Input sensitivity: Speedups vary with input size, type stability, and data distribution.
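To make the amortisation and compile-overhead points concrete, a back-of-the-envelope break-even sketch (the numbers are the illustrative ones from above, not measured results):

```python
def break_even_runs(compile_ms, warm_ms, baseline_ms):
    """Runs needed before compile cost pays off against the baseline."""
    per_run_saving_ms = baseline_ms - warm_ms
    if per_run_saving_ms <= 0:
        return float("inf")                      # native is never faster per run
    return compile_ms / per_run_saving_ms

# 500 ms compile, 1 ms warm runs, 10 ms CPython baseline:
# pays off only after ~56 runs, so one-shot scripts may see no benefit.
print(break_even_runs(500, 1, 10))               # 55.55...
```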
Recommended Benchmarking Workflow
- Use the built-in `pyvorin-thin benchmark` command, which records CPython and Pyvorin timings together.
- Run at least 5 warm iterations and take the median, not the mean; the median resists outliers from GC pauses and OS scheduling (see the sketch after this list).
- Restart the process between cold and warm measurements to avoid cache contamination.
- Run on isolated hardware or cloud instances with pinned CPU frequencies.
- Validate correctness hashes for every run. A fast wrong answer is worthless.
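A sketch of the median-of-warm-runs step; the function under test and the run count are placeholders:

```python
import statistics
import time

def warm_median_ms(fn, runs=5):
    """Median of warm iterations; resists GC pauses and scheduler jitter."""
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - t0) * 1000)
    return statistics.median(timings)
```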
Using the CLI Benchmark Command
```
pyvorin-thin benchmark script.py --function benchmark --runs 5
```
This produces:
- CPython baseline timing
- Pyvorin compile + run timing (when local native compiler is available)
- Speedup ratio
- Benchmark event written to local JSON for audit (see the inspection sketch below)
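To sanity-check a recorded event, something like the following could work. The filename and field names here are assumptions for illustration only; match them to whatever your `pyvorin-thin` installation actually writes:

```python
import json
from pathlib import Path

# Hypothetical path and schema: adjust to your local audit file.
event = json.loads(Path("pyvorin_benchmark_events.json").read_text())
cpython_ms = event.get("cpython_ms")
pyvorin_ms = event.get("pyvorin_ms")
if cpython_ms and pyvorin_ms:
    print(f"speedup: {cpython_ms / pyvorin_ms:.1f}x")
```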
Workload-Specific Caveats
All benchmark claims must be caveated as workload-specific. A finance kernel may see 100× speedup while a JSON parser sees 5×. Report the workload category, input size, and hardware platform alongside every number.
Anti-Cheat Checks
Pyvorin's benchmark suite includes anti-cheat validation:
- Results are checked against CPython ground truth.
- Output hashes must match; mismatches are flagged as failures regardless of speed (sketched below).
- Fallback runs are annotated so you know whether the speedup came from native code or CPython.
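As a sketch of the hash-comparison idea only (not Pyvorin's internal implementation), with `repr`-based hashing as one possible canonicalisation:

```python
import hashlib

def output_hash(result):
    """Hash a canonical text form of the output."""
    return hashlib.sha256(repr(result).encode()).hexdigest()

def validate(cpython_fn, native_fn, *args):
    """Fail on any mismatch, regardless of how fast the native run was."""
    truth = output_hash(cpython_fn(*args))       # CPython ground truth
    if output_hash(native_fn(*args)) != truth:
        raise AssertionError("anti-cheat: output hash mismatch")
```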