bghira · Apr 26, 2026
diff --git a/BENCHMARKS.md b/BENCHMARKS.md
@@ -1,112 +1,137 @@
-# Competitive Performance Benchmarks
+# Performance Benchmarks
 
-This library includes comprehensive benchmarks against industry-standard libraries (OpenCV, NumPy) to ensure competitive performance for real-world SFT (Supervised Fine-Tuning) workloads.
+TrainingSample includes benchmarks for common preprocessing operations: crop, resize, luminance, resize-plus-luminance pipelines, and video frame resizing. The benchmarks are meant to catch regressions and provide workload-specific guidance, not to guarantee universal speedups over OpenCV or NumPy.
 
-## Benchmark Categories
+## Running Benchmarks
 
-### 🖼️ High-Resolution Image Processing
+Use the repository virtual environment when available:
 
-**Target Workload**: 5120×5120 → 1024×1024 image processing pipeline
-- **Input**: 5120×5120×3 images (26.2M pixels, ~78MB each)
-- **Pipeline**: Center crop → Resize → Luminance calculation
-- **Batch sizes**: 2-4 images (memory constrained)
+```bash
+.venv/bin/python -m pytest tests/test_performance_benchmarks.py -q -s
+```
 
-### 📊 Performance Targets
+To run the full Python test suite, including the tests that carry the benchmark marker:
 
-| Operation | Input Size | Target Performance | Baseline |
-|-----------|------------|-------------------|----------|
-| **Resize** | 5120×5120 → 1024×1024 | Match OpenCV bilinear | `cv2.resize()` |
-| **Center Crop** | 5120×5120 → 2048×2048 | Match/exceed NumPy | Array slicing |
-| **Luminance** | 1024×1024 | 1.5x+ faster than NumPy | Vectorized math |
-| **Full Pipeline** | 5120×5120 → 1024×1024 | >0.5 images/sec | Combined ops |
+```bash
+.venv/bin/python -m pytest -q
+```
 
-### 🎯 Quality Targets
+For a fresh source build before measuring:
 
-- **Resize Quality**: PSNR >30dB vs OpenCV (excellent similarity)
-- **Crop Accuracy**: Bit-exact match with NumPy center crop
-- **Luminance Precision**: <0.1 difference vs NumPy reference
+```bash
+env -u OPENCV_LINK_LIBS -u OPENCV_LINK_PATHS -u OPENCV_INCLUDE_PATHS \
+    -u LIBCLANG_PATH -u LLVM_CONFIG_PATH \
+    .venv/bin/maturin develop --release
+```
 
-## Running Benchmarks
+The OpenCV Rust binding needs a discoverable OpenCV and Clang installation. On this development host, stale macOS-style OpenCV and LLVM environment variables had to be unset (the `env -u` flags in the command above) before the build could probe the system OpenCV installation.
 
-### Local Development
-```bash
-# Install dependencies
-pip install opencv-python pytest-benchmark psutil
+## Current Local Snapshot
 
-# Build with optimizations
-maturin develop --release --features "python-bindings,simd"
+Last measured command:
 
-# Run competitive benchmarks
-./scripts/run_competitive_benchmarks.sh
+```bash
+.venv/bin/python -m pytest tests/test_performance_benchmarks.py -q -s
 ```
 
-### CI/CD Integration
+Environment:
 
-Benchmarks run automatically in CI for:
-- **Pull requests**: Performance regression detection
-- **Main branch**: Performance tracking over time
-- **Weekly schedule**: Long-term performance monitoring
+- Linux x86_64
+- CPython 3.13
+- NumPy 2.3.4
+- system OpenCV 4.11 via the Rust `opencv` crate
+- release build installed with `maturin develop --release`
 
-## Benchmark Architecture
+Point-in-time scenario timings from the benchmark output:
 
-### Memory Efficiency
-- Monitors RSS memory usage vs OpenCV
-- Tests batch processing memory scaling
-- Validates no memory leaks in pipelines
+| Scenario | Before optimization | After optimization | Comparison after optimization |
+|----------|---------------------|--------------------|-------------------------------|
+| Crop batch, 16 images | 22.9 ms | 0.4 ms | NumPy slicing was still faster because it returns views |
+| Mixed-shape crop, 8 images | 50.2 ms | 3.3 ms | NumPy slicing loop was near-zero because it returns views |
+| Resize, 4 mixed-size images | 4.1 ms | 0.4 ms | OpenCV loop: 2.6 ms |
+| Luminance, 4 mixed-size images | 10.4 ms | 0.6 ms | OpenCV loop: 0.9 ms |
+| Resize + luminance pipeline, 4 images | 5.9 ms | 0.6 ms | OpenCV loop: 2.1 ms |
+| Mixed-shape luminance, 6 images | 78.3 ms | 3.3 ms | NumPy loop: 19.4 ms |
 
-### SIMD Optimization Validation
-- Compares SIMD-enabled vs scalar fallback performance
-- Tests x86-64 AVX2/AVX-512 and ARM64 NEON paths
-- Validates CPU feature detection accuracy
+Pytest-benchmark means from the same focused run:
 
-### Real-World Scenarios
-- **SFT Data Processing**: High-res → training resolution pipeline
-- **Batch Processing**: Multiple images with different operations
-- **Memory Constraints**: Large images with limited RAM
+| Benchmark | Mean |
+|-----------|------|
+| Center crop | 55.2 µs |
+| Resize operations | 353.1 µs |
+| Luminance calculation | 417.2 µs |
+| Crop operations | 583.8 µs |
+| Pipeline | 3.44 ms |
+| Video processing | 2.85 ms |
 
-## Performance Philosophy
+A full `pytest -q` run also passed and produced similar benchmark ordering, with normal run-to-run variance.
 
-### Why These Benchmarks Matter
+## What Changed in the Latest Optimization
 
-1. **Real-World Relevance**: SFT workloads use 5120×5120+ images, not toy 224×224
-2. **Competitive Pressure**: OpenCV and NumPy are highly optimized incumbents
-3. **User Experience**: Poor performance = adoption barriers
-4. **Resource Efficiency**: Training infrastructure costs scale with throughput
+- Owned Rust `ndarray` outputs are handed to NumPy with `from_owned_array_bound`, which avoids an extra copy when results are converted for Python.
+- Contiguous luminance inputs use a channel-sum fast path. Instead of computing a weighted luminance value for every pixel, it sums the R, G, and B channels separately and applies the luminance weights once to the three totals.
+- Non-contiguous arrays still use the general ndarray path for correctness.
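+
+The channel-sum fast path relies on the linearity of the weighted sum. As a sketch, assume the result is a single mean luminance per image with $N$ pixels and constant weights; the symbols $w_R$, $w_G$, $w_B$, $R_i$, $G_i$, $B_i$ are illustrative and are not names taken from the code:
+
+$$
+\bar{Y} = \frac{1}{N}\sum_{i=1}^{N}\left(w_R R_i + w_G G_i + w_B B_i\right)
+        = \frac{1}{N}\left(w_R\sum_{i=1}^{N} R_i + w_G\sum_{i=1}^{N} G_i + w_B\sum_{i=1}^{N} B_i\right)
+$$
+
+The left form needs three multiplications per pixel, while the right form needs three in total after the channel sums. The two forms are equal in exact arithmetic, but floating-point rounding can differ slightly between them, so comparisons against a reference should use a tolerance rather than exact equality.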
 
-### Performance vs Quality Tradeoffs
+## Benchmark Categories
+
+### Image Operations
+
+- `batch_crop_images`
+- `batch_center_crop_images`
+- `batch_random_crop_images`
+- `batch_resize_images`
+- `batch_calculate_luminance`
+
+### Pipeline Operations
+
+- resize followed by luminance
+- crop followed by resize
+- mixed input sizes and output sizes
+
+### Video Operations
 
-- **Resize**: Bilinear interpolation for speed, good quality balance
-- **SIMD**: Aggressive optimization while maintaining numerical accuracy
-- **Memory**: Batch processing for throughput vs memory pressure balance
+- `batch_resize_videos` with frame batches shaped `(T, H, W, 3)`
 
 ## Interpreting Results
 
-### Good Performance Indicators
-- ✅ Resize: 1-2 images/sec for 5120×5120 → 1024×1024
-- ✅ Crop: 10+ images/sec for 5120×5120 → 2048×2048
-- ✅ Luminance: 1.5x+ faster than NumPy with SIMD
-- ✅ Pipeline: >0.5 complete transformations/sec
+Use these benchmarks to answer practical questions:
+
+- Is a change adding extra Rust-to-NumPy copies?
+- Are contiguous arrays staying on the fast path?
+- Is resize dominated by OpenCV work or Python binding overhead?
+- Does a mixed-shape batch still behave reasonably?
+- Is a video processing change accidentally introducing per-frame Python overhead?
+
+Some comparisons need context:
+
+- NumPy crop by slicing often returns a view, so it can be much faster than any function that returns owned cropped arrays.
+- Very small images can be dominated by Python call overhead.
+- Large images can be dominated by memory bandwidth rather than arithmetic.
+- OpenCV performance varies by build options, CPU features, and linked libraries.
+
+## Quality Checks
+
+The tests validate basic output behavior alongside timing:
 
-### Red Flags
-- ❌ Slower than OpenCV resize (indicates poor SIMD utilization)
-- ❌ Slower than NumPy crop (indicates unnecessary overhead)
-- ❌ Memory usage >2x OpenCV (indicates memory leaks/inefficiency)
-- ❌ Quality degradation (PSNR <30dB vs reference)
+- Crop outputs have the expected shape and match the pixel values produced by NumPy slicing; ownership is not compared, since NumPy returns a view and the crop functions return owned arrays.
+- Resize outputs have expected shape and are close to OpenCV output for the configured interpolation.
+- Luminance stays within a small tolerance of NumPy/OpenCV-style references.
+- Non-contiguous arrays are accepted by safe luminance paths and rejected by strict zero-copy crop/resize paths.
 
-## Future Enhancements
+## Regression Signals
 
-### Planned Improvements
-- GPU acceleration benchmarks (Metal/CUDA)
-- More interpolation methods (bicubic, lanczos)
-- Video processing pipeline benchmarks
-- Multi-threaded batch processing optimization
+Investigate if a change causes:
 
-### Performance Tracking
-- Historical performance database
-- Regression detection and alerting
-- Performance comparison across different hardware configurations
-- Automated performance optimization recommendations
+- Public batch crop to return to multi-millisecond timings for small batches.
+- Luminance on contiguous RGB arrays to lose the channel-sum fast path.
+- Resize benchmarks to show large overhead on top of the underlying OpenCV resize work.
+- Video resizing to scale with per-frame Python object churn.
+- Memory usage to grow unexpectedly for repeated batch calls.
 
----
+## Future Benchmark Work
 
-**Goal**: Be the fastest, highest-quality image processing library for ML/SFT workloads while maintaining competitive memory usage and numerical accuracy.
+- Store historical benchmark results by commit and host.
+- Add explicit memory allocation tracking for Python-facing APIs.
+- Separate view-returning crop comparisons from owned-output crop comparisons.
+- Add more video pipeline benchmarks.
+- Document hardware and OpenCV build details in benchmark artifacts.