Batch Processing and ETL
Accelerate data pipelines, ETL jobs, and batch computation workloads.
Published May 30, 2026
ETL Pipeline Acceleration
Pyvorin excels at CPU-bound data transformation. Typical ETL stages that benefit:
- Data cleaning and validation
- Type conversion and normalisation
- Aggregation and grouping
- Deduplication logic
Example: CSV Processing
def process_csv(rows):
    results = []
    for row in rows:
        if validate_row(row):
            transformed = {
                'id': int(row['id']),
                'amount': float(row['amount']) * 1.08,
                'category': row['category'].upper(),
            }
            results.append(transformed)
    return results
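The deduplication and aggregation stages listed earlier can be written in the same loop-based style. The following is a minimal sketch in plain Python. The helper names `deduplicate` and `total_by_category` and the sample rows are illustrative assumptions, and this guide does not state whether Pyvorin compiles sets or `defaultdict`, so treat it as a shape to adapt, not as a confirmed pattern.

```python
from collections import defaultdict


def deduplicate(rows, key="id"):
    """Keep the first row seen for each key value, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def total_by_category(rows):
    """Sum the 'amount' field for each 'category' value."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["category"]] += row["amount"]
    return dict(totals)


# Illustrative data only; the third row repeats id 1.
rows = [
    {"id": 1, "amount": 10.0, "category": "FOOD"},
    {"id": 2, "amount": 5.0, "category": "FUEL"},
    {"id": 1, "amount": 10.0, "category": "FOOD"},
    {"id": 3, "amount": 2.5, "category": "FOOD"},
]
unique_rows = deduplicate(rows)
totals = total_by_category(unique_rows)
```

Running deduplication before aggregation keeps the repeated row from being counted twice in the category totals.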
Batch Size Tuning
- Too small: compilation overhead dominates.
- Too large: memory pressure from buffering.
- Sweet spot: 1,000–10,000 rows per batch depending on row width.
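One way to find a suitable batch size for a given workload is to time the same transformation at a few candidate sizes. The sketch below uses only the standard library. `iter_batches`, `time_batch_size`, and `double_amounts` are hypothetical names introduced for illustration, and `double_amounts` is a plain-Python stand-in for whatever compiled function the pipeline actually uses.

```python
import time
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield successive lists of at most batch_size rows from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def time_batch_size(transform, rows, batch_size):
    """Return (elapsed_seconds, rows_out) for one pass over rows in batches."""
    start = time.perf_counter()
    rows_out = 0
    for batch in iter_batches(rows, batch_size):
        rows_out += len(transform(batch))
    return time.perf_counter() - start, rows_out


def double_amounts(batch):
    # Hypothetical stand-in for a compiled transformation.
    return [{"id": r["id"], "amount": r["amount"] * 2} for r in batch]


sample = [{"id": i, "amount": float(i)} for i in range(25_000)]
batch_sizes = [len(b) for b in iter_batches(sample, 10_000)]
timings = {
    size: time_batch_size(double_amounts, sample, size)
    for size in (1_000, 5_000, 10_000)
}
```

Timings from a single pass are noisy, so repeating each measurement on representative rows gives a more reliable comparison.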
Streaming with Fallback
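For streaming pipelines, compile the transformation function once and call it for each batch. The transformation runs natively, while I/O such as writing to the sink stays in CPython: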
for batch in stream_batches(source):
    transformed = compiled_transform(batch)  # native
    sink.write(transformed)                  # CPython I/O fallback
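To make the loop concrete, here is a self-contained sketch of the same shape that runs without Pyvorin. Everything in it is an assumption for illustration: `stream_batches` is a simple in-memory batcher, `transform` is a plain-Python stand-in for the compiled function, and `CsvSink` is a hypothetical sink built on the standard `csv` module.

```python
import csv
import io


def stream_batches(source, batch_size=2):
    """Hypothetical batcher: group rows from an iterable into lists."""
    batch = []
    for row in source:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def transform(batch):
    """Plain-Python stand-in for the compiled transformation."""
    return [
        {"id": int(row["id"]), "category": row["category"].upper()}
        for row in batch
    ]


class CsvSink:
    """Hypothetical sink that writes rows with the standard csv module."""

    def __init__(self, handle):
        self.writer = csv.DictWriter(handle, fieldnames=["id", "category"])
        self.writer.writeheader()

    def write(self, rows):
        self.writer.writerows(rows)


source = [
    {"id": "1", "category": "food"},
    {"id": "2", "category": "fuel"},
    {"id": "3", "category": "rent"},
]
buffer = io.StringIO()
sink = CsvSink(buffer)
for batch in stream_batches(source):
    # The transform is pure computation; the sink handles all I/O.
    sink.write(transform(batch))
output_lines = buffer.getvalue().splitlines()
```

Keeping the transformation free of I/O means only the CPU-bound part needs to be compiled, and the sink can be swapped without touching it.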