Benchmarking Correctly
Accurate benchmarking is essential for honest performance claims. Pyvorin separates compile time, first-run latency, and warm-run throughput so you can evaluate real-world impact.
Cold vs Warm Runs
| Metric | What it measures | Why it matters |
|---|---|---|
| `compile_time_ms` | Time to compile the AST to native code | Affects only the first invocation. |
| `first_run_ms` | First execution after compilation | Includes cache warm-up and library loading. |
| `warm_run_avg_ms` | Average of subsequent runs | The steady-state throughput users experience. |
Always report all three. A fast warm run with a 500 ms compile time may not improve total latency for short scripts.
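To illustrate what each metric isolates, here is a minimal harness sketch. The `compile_fn` and `run_fn` callables are hypothetical stand-ins for whatever compile and invoke entry points your toolchain exposes, not Pyvorin's actual API:

```python
import statistics
import time

def measure(compile_fn, run_fn, warm_runs=5):
    """Split timing into compile_time_ms, first_run_ms, and warm_run_avg_ms."""
    t0 = time.perf_counter()
    compiled = compile_fn()                      # compile AST to native code
    compile_ms = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    run_fn(compiled)                             # first run: cache warm-up, library loading
    first_ms = (time.perf_counter() - t0) * 1000

    warm = []
    for _ in range(warm_runs):                   # steady state
        t0 = time.perf_counter()
        run_fn(compiled)
        warm.append((time.perf_counter() - t0) * 1000)
    return compile_ms, first_ms, statistics.fmean(warm)
```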
Wall-Clock vs CPU-Hours
- Wall-clock time is what your users feel. Use it for end-to-end comparisons.
- CPU-hours matter for batch jobs and cloud billing. Pyvorin's native code typically uses fewer CPU-hours than CPython because it avoids interpreter overhead.
- Do not assume that reduced wall-clock time implies fewer CPU-hours: parallel execution can finish sooner while burning more total CPU. Measure both; with Pyvorin they often improve together (the sketch below captures them side by side).
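A minimal way to capture both in one call, assuming a single-process workload (`time.process_time` does not see child processes):

```python
import time

def timed(fn, *args, **kwargs):
    """Measure wall-clock and per-process CPU time for one call."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    result = fn(*args, **kwargs)
    wall = time.perf_counter() - wall0           # what users feel
    cpu = time.process_time() - cpu0             # what feeds CPU-hour billing
    print(f"wall {wall:.3f}s  cpu {cpu:.3f}s  utilisation {cpu / wall:.2f}x")
    return result
```

For multi-process batch jobs, aggregate child-process time with `os.times()` or your scheduler's accounting instead.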
Why Microbenchmarks Mislead
Microbenchmarks (tight loops of a few operations) can show extreme speedups that do not translate to production:
- Amortisation: A 1,000× speedup on a 1 ms operation saves under a millisecond per call. A 10× speedup on a 5-minute ETL job saves 4.5 minutes.
- Memory effects: Microbenchmarks often fit entirely in L1 cache. Real workloads have cache misses, branch mispredictions, and GC pressure.
- Compilation overhead: Microbenchmarks rarely include compile time; production jobs pay it (see the break-even sketch after this list).
- Input sensitivity: Speedups vary with input size, type stability, and data distribution.
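To make the amortisation and compile-overhead points concrete, a back-of-the-envelope break-even sketch (the numbers are the illustrative ones from above, not measured results):

```python
def break_even_runs(compile_ms, warm_ms, baseline_ms):
    """Runs needed before compile cost pays off against the baseline."""
    per_run_saving_ms = baseline_ms - warm_ms
    if per_run_saving_ms <= 0:
        return float("inf")                      # native is never faster per run
    return compile_ms / per_run_saving_ms

# 500 ms compile, 1 ms warm runs, 10 ms CPython baseline:
# pays off only after ~56 runs, so one-shot scripts may see no benefit.
print(break_even_runs(500, 1, 10))               # 55.55...
```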
Recommended Benchmarking Workflow
- Use the built-in `pyvorin-thin benchmark` command, which records CPython and Pyvorin timings together.
- Run at least 5 warm iterations and take the median, not the mean; the median resists outliers from GC pauses and OS scheduling (see the sketch after this list).
- Restart the process between cold and warm measurements to avoid cache contamination.
- Run on isolated hardware or cloud instances with pinned CPU frequencies.
- Validate correctness hashes for every run. A fast wrong answer is worthless.
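A sketch of the median-of-warm-runs step; the function under test and the run count are placeholders:

```python
import statistics
import time

def warm_median_ms(fn, runs=5):
    """Median of warm iterations; resists GC pauses and scheduler jitter."""
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - t0) * 1000)
    return statistics.median(timings)
```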
Using the CLI Benchmark Command
```
pyvorin-thin benchmark script.py --function benchmark --runs 5
```
This produces:
- CPython baseline timing
- Pyvorin compile + run timing (when local native compiler is available)
- Speedup ratio
- Benchmark event written to local JSON for audit (see the inspection sketch below)
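To sanity-check a recorded event, something like the following could work. The filename and field names here are assumptions for illustration only; match them to whatever your `pyvorin-thin` installation actually writes:

```python
import json
from pathlib import Path

# Hypothetical path and schema: adjust to your local audit file.
event = json.loads(Path("pyvorin_benchmark_events.json").read_text())
cpython_ms = event.get("cpython_ms")
pyvorin_ms = event.get("pyvorin_ms")
if cpython_ms and pyvorin_ms:
    print(f"speedup: {cpython_ms / pyvorin_ms:.1f}x")
```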
Workload-Specific Caveats
All benchmark claims must be caveated as workload-specific. A finance kernel may see 100× speedup while a JSON parser sees 5×. Report the workload category, input size, and hardware platform alongside every number.
Anti-Cheat Checks
Pyvorin's benchmark suite includes anti-cheat validation:
- Results are checked against CPython ground truth.
- Output hashes must match; mismatches are flagged as failures regardless of speed (sketched below).
- Fallback runs are annotated so you know whether the speedup came from native code or CPython.
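As a sketch of the hash-comparison idea only (not Pyvorin's internal implementation), with `repr`-based hashing as one possible canonicalisation:

```python
import hashlib

def output_hash(result):
    """Hash a canonical text form of the output."""
    return hashlib.sha256(repr(result).encode()).hexdigest()

def validate(cpython_fn, native_fn, *args):
    """Fail on any mismatch, regardless of how fast the native run was."""
    truth = output_hash(cpython_fn(*args))       # CPython ground truth
    if output_hash(native_fn(*args)) != truth:
        raise AssertionError("anti-cheat: output hash mismatch")
```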