diff --git a/BENCHMARKS.md b/BENCHMARKS.md index 4bbd7c4..287b28a 100644 --- a/BENCHMARKS.md +++ b/BENCHMARKS.md @@ -1,112 +1,137 @@ -# Competitive Performance Benchmarks +# Performance Benchmarks -This library includes comprehensive benchmarks against industry-standard libraries (OpenCV, NumPy) to ensure competitive performance for real-world SFT (Supervised Fine-Tuning) workloads. +TrainingSample includes benchmarks for common preprocessing operations: crop, resize, luminance, resize-plus-luminance pipelines, and video frame resizing. The benchmarks are meant to catch regressions and provide workload-specific guidance, not to guarantee universal speedups over OpenCV or NumPy. -## Benchmark Categories +## Running Benchmarks -### 🖼️ High-Resolution Image Processing +Use the repository virtual environment when available: -**Target Workload**: 5120×5120 → 1024×1024 image processing pipeline -- **Input**: 5120×5120×3 images (26.2M pixels, ~78MB each) -- **Pipeline**: Center crop → Resize → Luminance calculation -- **Batch sizes**: 2-4 images (memory constrained) +```bash +.venv/bin/python -m pytest tests/test_performance_benchmarks.py -q -s +``` -### 📊 Performance Targets +To run every Python test and benchmark marker in the repo: -| Operation | Input Size | Target Performance | Baseline | -|-----------|------------|-------------------|----------| -| **Resize** | 5120×5120 → 1024×1024 | Match OpenCV bilinear | `cv2.resize()` | -| **Center Crop** | 5120×5120 → 2048×2048 | Match/exceed NumPy | Array slicing | -| **Luminance** | 1024×1024 | 1.5x+ faster than NumPy | Vectorized math | -| **Full Pipeline** | 5120×5120 → 1024×1024 | >0.5 images/sec | Combined ops | +```bash +.venv/bin/python -m pytest -q +``` -### 🎯 Quality Targets +For a fresh source build before measuring: -- **Resize Quality**: PSNR >30dB vs OpenCV (excellent similarity) -- **Crop Accuracy**: Bit-exact match with NumPy center crop -- **Luminance Precision**: <0.1 difference
vs NumPy reference +```bash +env -u OPENCV_LINK_LIBS -u OPENCV_LINK_PATHS -u OPENCV_INCLUDE_PATHS \ + -u LIBCLANG_PATH -u LLVM_CONFIG_PATH \ + .venv/bin/maturin develop --release +``` -## Running Benchmarks +The OpenCV Rust binding needs a discoverable OpenCV and Clang installation. On this development host, stale macOS-style OpenCV and LLVM environment variables had to be unset before the build could probe the system OpenCV installation. -### Local Development -```bash -# Install dependencies -pip install opencv-python pytest-benchmark psutil +## Current Local Snapshot -# Build with optimizations -maturin develop --release --features "python-bindings,simd" +Last measured command: -# Run competitive benchmarks -./scripts/run_competitive_benchmarks.sh +```bash +.venv/bin/python -m pytest tests/test_performance_benchmarks.py -q -s ``` -### CI/CD Integration +Environment: -Benchmarks run automatically in CI for: -- **Pull requests**: Performance regression detection -- **Main branch**: Performance tracking over time -- **Weekly schedule**: Long-term performance monitoring +- Linux x86_64 +- CPython 3.13 +- NumPy 2.3.4 +- system OpenCV 4.11 via the Rust `opencv` crate +- release build installed with `maturin develop --release` -## Benchmark Architecture +Point-in-time scenario timings from the benchmark output: -### Memory Efficiency -- Monitors RSS memory usage vs OpenCV -- Tests batch processing memory scaling -- Validates no memory leaks in pipelines +| Scenario | Before optimization | After optimization | Comparison after optimization | +|----------|---------------------|--------------------|-------------------------------| +| Crop batch, 16 images | 22.9 ms | 0.4 ms | NumPy slicing was still faster because it returns views | +| Mixed-shape crop, 8 images | 50.2 ms | 3.3 ms | NumPy slicing loop was near-zero because it returns views | +| Resize, 4 mixed-size images | 4.1 ms | 0.4 ms | OpenCV loop: 2.6 ms | +| Luminance, 4 mixed-size images | 10.4 ms | 0.6 ms | 
OpenCV loop: 0.9 ms | +| Resize + luminance pipeline, 4 images | 5.9 ms | 0.6 ms | OpenCV loop: 2.1 ms | +| Mixed-shape luminance, 6 images | 78.3 ms | 3.3 ms | NumPy loop: 19.4 ms | -### SIMD Optimization Validation -- Compares SIMD-enabled vs scalar fallback performance -- Tests x86-64 AVX2/AVX-512 and ARM64 NEON paths -- Validates CPU feature detection accuracy +Pytest-benchmark means from the same focused run: -### Real-World Scenarios -- **SFT Data Processing**: High-res → training resolution pipeline -- **Batch Processing**: Multiple images with different operations -- **Memory Constraints**: Large images with limited RAM +| Benchmark | Mean | +|-----------|------| +| Center crop | 55.2 us | +| Resize operations | 353.1 us | +| Luminance calculation | 417.2 us | +| Crop operations | 583.8 us | +| Pipeline | 3.44 ms | +| Video processing | 2.85 ms | -## Performance Philosophy +A full `pytest -q` run also passed and produced similar benchmark ordering, with normal run-to-run variance. -### Why These Benchmarks Matter +## What Changed in the Latest Optimization -1. **Real-World Relevance**: SFT workloads use 5120×5120+ images, not toy 224×224 -2. **Competitive Pressure**: OpenCV and NumPy are highly optimized incumbents -3. **User Experience**: Poor performance = adoption barriers -4. **Resource Efficiency**: Training infrastructure costs scale with throughput +- Owned Rust `ndarray` outputs are transferred into NumPy with `from_owned_array_bound`, avoiding an additional copy in Python-facing result conversion. +- Contiguous luminance inputs use a channel-sum fast path. Instead of computing weighted luminance per pixel, it sums R, G, and B separately and applies the weights once at the end. +- Non-contiguous arrays still use the general ndarray path for correctness.
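The channel-sum idea in the list above can be illustrated in NumPy. This is only a sketch of the arithmetic, not the Rust implementation: the function names are hypothetical, and the BT.709 weights are assumed to be the ones the library applies.

```python
import numpy as np

# Assumed BT.709 weights (R, G, B); the actual Rust constants may differ.
WEIGHTS = np.array([0.2126, 0.7152, 0.0722])


def luminance_per_pixel(image: np.ndarray) -> float:
    """General path: weight every pixel, then average the weighted values."""
    weighted = image[..., :3].astype(np.float64) @ WEIGHTS
    return float(weighted.mean())


def luminance_channel_sum(image: np.ndarray) -> float:
    """Fast-path idea: sum each channel once, apply the weights to three totals."""
    pixel_count = image.shape[0] * image.shape[1]
    # Integer accumulation avoids a per-pixel float multiply.
    channel_sums = image[..., :3].reshape(-1, 3).sum(axis=0, dtype=np.uint64)
    return float(channel_sums @ WEIGHTS) / pixel_count


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
per_pixel = luminance_per_pixel(image)
channel_sum = luminance_channel_sum(image)
```

Both functions compute the same mean because the weighted sum is linear; the second form replaces one multiply-add per pixel with three multiplies per image.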
-### Performance vs Quality Tradeoffs +## Benchmark Categories + +### Image Operations + +- `batch_crop_images` +- `batch_center_crop_images` +- `batch_random_crop_images` +- `batch_resize_images` +- `batch_calculate_luminance` + +### Pipeline Operations + +- resize followed by luminance +- crop followed by resize +- mixed input sizes and output sizes + +### Video Operations -- **Resize**: Bilinear interpolation for speed, good quality balance -- **SIMD**: Aggressive optimization while maintaining numerical accuracy -- **Memory**: Batch processing for throughput vs memory pressure balance +- `batch_resize_videos` with frame batches shaped `(T, H, W, 3)` ## Interpreting Results -### Good Performance Indicators -- ✅ Resize: 1-2 images/sec for 5120×5120 → 1024×1024 -- ✅ Crop: 10+ images/sec for 5120×5120 → 2048×2048 -- ✅ Luminance: 1.5x+ faster than NumPy with SIMD -- ✅ Pipeline: >0.5 complete transformations/sec +Use these benchmarks to answer practical questions: + +- Is a change adding extra Rust-to-NumPy copies? +- Are contiguous arrays staying on the fast path? +- Is resize dominated by OpenCV work or Python binding overhead? +- Does a mixed-shape batch still behave reasonably? +- Is a video processing change accidentally introducing per-frame Python overhead? + +Some comparisons need context: + +- NumPy crop by slicing often returns a view, so it can be much faster than any function that returns owned cropped arrays. +- Very small images can be dominated by Python call overhead. +- Large images can be dominated by memory bandwidth rather than arithmetic. +- OpenCV performance varies by build options, CPU features, and linked libraries.
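The view-versus-owned distinction behind the crop comparison can be checked directly in NumPy. The snippet below is a small illustration with an assumed crop box; it does not call TrainingSample.

```python
import numpy as np

image = np.zeros((480, 640, 3), dtype=np.uint8)
x, y, width, height = 50, 50, 200, 200  # illustrative crop box

# Basic slicing returns a view: no pixel data is copied.
view = image[y:y + height, x:x + width]

# An owned crop, comparable to what a function returning new arrays produces.
owned = view.copy()

view_shares = np.shares_memory(view, image)
owned_shares = np.shares_memory(owned, image)

# Writing through the view changes the source image; the owned copy is unaffected.
view[0, 0, 0] = 7
```

A benchmark that times only the slicing step measures index bookkeeping, while an owned crop must move `height * width * channels` bytes, so the two are not directly comparable.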
+ +## Quality Checks + +The tests validate basic output behavior alongside timing: -### Red Flags -- ❌ Slower than OpenCV resize (indicates poor SIMD utilization) -- ❌ Slower than NumPy crop (indicates unnecessary overhead) -- ❌ Memory usage >2x OpenCV (indicates memory leaks/inefficiency) -- ❌ Quality degradation (PSNR <30dB vs reference) +- Crop outputs have expected shape and match NumPy slicing where ownership differences do not matter. +- Resize outputs have expected shape and are close to OpenCV output for the configured interpolation. +- Luminance stays within a small tolerance of NumPy/OpenCV-style references. +- Non-contiguous arrays are accepted by safe luminance paths and rejected by strict zero-copy crop/resize paths. -## Future Enhancements +## Regression Signals -### Planned Improvements -- GPU acceleration benchmarks (Metal/CUDA) -- More interpolation methods (bicubic, lanczos) -- Video processing pipeline benchmarks -- Multi-threaded batch processing optimization +Investigate if a change causes: -### Performance Tracking -- Historical performance database -- Regression detection and alerting -- Performance comparison across different hardware configurations -- Automated performance optimization recommendations +- Public batch crop to return to multi-millisecond timings for small batches. +- Luminance on contiguous RGB arrays to lose the channel-sum fast path. +- Resize benchmarks to add large overhead beyond OpenCV work. +- Video resizing to scale with per-frame Python object churn. +- Memory usage to grow unexpectedly for repeated batch calls. ---- +## Future Benchmark Work -**Goal**: Be the fastest, highest-quality image processing library for ML/SFT workloads while maintaining competitive memory usage and numerical accuracy. +- Store historical benchmark results by commit and host. +- Add explicit memory allocation tracking for Python-facing APIs. +- Separate view-returning crop comparisons from owned-output crop comparisons.
+- Add more video pipeline benchmarks. +- Document hardware and OpenCV build details in benchmark artifacts. diff --git a/README.md b/README.md index 80de67b..da285c0 100644 --- a/README.md +++ b/README.md @@ -4,102 +4,61 @@ [![PyPI](https://img.shields.io/pypi/v/trainingsample.svg)](https://pypi.org/project/trainingsample/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) -**🏆 Industry-Leading Computer Vision Library - FASTER than cv2** +TrainingSample provides Rust-backed Python bindings for common image and video preprocessing operations used in ML data pipelines. It combines OpenCV-backed resizing with Rust implementations for batching, cropping, luminance calculation, format conversion, and video helpers. -The only Python library that **beats opencv-python (cv2) performance** by leveraging OpenCV's C++ power with zero-copy Rust optimizations and intelligent auto-batching. +The project is designed for workloads where Python-side loops and repeated boundary crossings become visible. It is not a blanket replacement for all of `cv2`, and performance depends on image size, batch shape, CPU, OpenCV build, and memory bandwidth.
## install ```bash -# python (recommended) +# python pip install trainingsample # rust cargo add trainingsample ``` -## 🚀 Why TrainingSample Leads the Industry - -**BREAKTHROUGH: We leverage OpenCV's C++ power to beat opencv-python (cv2) by eliminating Python binding overhead.** - -### ⚡ Performance That Redefines Possible -- **Single images**: **1.12x FASTER** than `cv2.resize()` - the "impossible" achievement -- **Batch processing**: **2.4x faster** than OpenCV individual calls -- **Zero-copy iteration**: True lazy conversion with **17,204 images/sec** throughput -- **Intelligent dispatch**: Seamless auto-batching with zero wrapper overhead - -### 🔥 What Makes Us Different -- **Leverages OpenCV C++**: Direct OpenCV C++ access to beat opencv-python binding overhead -- **Zero wrapper overhead**: Eliminated 76% of artificial performance losses in Python bindings -- **True zero-copy**: Raw OpenCV Mat → numpy array, no intermediate conversions -- **Intelligent API**: Same function handles single images + batch processing seamlessly -- **Buffer pooling**: Memory reuse across operations eliminates allocation bottlenecks -- **Adaptive threading**: Sequential for small batches, parallel for large batches - -**We unleash OpenCV's full C++ power without Python binding limitations.** - -## 🎯 Ultimate Performance APIs +## python usage ```python import numpy as np import trainingsample as tsr -# SINGLE IMAGE - FASTER than cv2.resize()! -img = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) -result = tsr.batch_resize_images_zero_copy(img, (256, 256)) # 1.12x FASTER than OpenCV!
+images = [ + np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) + for _ in range(8) +] -# BATCH PROCESSING - 2.4x faster than OpenCV individual calls -images = [np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) for _ in range(10)] -results = tsr.batch_resize_images_zero_copy(images, [(256, 256)] * 10) +crop_boxes = [(50, 50, 200, 200)] * len(images) +cropped = tsr.batch_crop_images(images, crop_boxes) -# MEMORY-EFFICIENT ITERATION - True zero-copy lazy conversion -for result in tsr.batch_resize_images_iterator(images, [(256, 256)] * 10): - process(result) # Convert only when accessed, supports early termination +target_sizes = [(224, 224)] * len(images) +resized = tsr.batch_resize_images(images, target_sizes) -# ZERO-COPY BATCH OPERATIONS -cropped = tsr.batch_crop_images_zero_copy(images, [(50, 50, 200, 200)] * 10) # 4x faster -luminances = tsr.batch_calculate_luminance_zero_copy(images) # 8x faster -center_cropped = tsr.batch_center_crop_images_zero_copy(images, [(224, 224)] * 10) # 3x faster +luminances = tsr.batch_calculate_luminance(resized) ``` -### 📊 Performance Comparison -```python -import time -import cv2 +OpenCV-compatible helpers are also exported for common operations: -# Single image resize comparison -img = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) - -# OpenCV (industry standard) -start = time.perf_counter() -cv2_result = cv2.resize(img, (256, 256)) -opencv_time = time.perf_counter() - start - -# TrainingSample (industry leader) -start = time.perf_counter() -tsr_result = tsr.batch_resize_images_zero_copy(img, (256, 256)) -tsr_time = time.perf_counter() - start - -print(f"OpenCV: {opencv_time*1000:.3f}ms") -print(f"TSR: {tsr_time*1000:.3f}ms") -print(f"TSR is {opencv_time/tsr_time:.2f}x FASTER!") # Typical: 1.12x faster +```python +decoded = tsr.imdecode(image_bytes, tsr.IMREAD_COLOR) +gray = tsr.cvt_color(decoded, tsr.COLOR_RGB2GRAY) +edges = tsr.canny(decoded, threshold1=50, threshold2=150) +resized =
tsr.resize(decoded, (224, 224), interpolation=tsr.INTER_LINEAR) ``` ## rust usage ```rust +use ndarray::Array3; use trainingsample::{ - batch_crop_image_arrays, batch_resize_image_arrays, - batch_calculate_luminance_arrays + batch_calculate_luminance_arrays, batch_crop_image_arrays, batch_resize_image_arrays, }; -use ndarray::Array3; -// create some test data let images: Vec<Array3<u8>> = (0..10) .map(|_| Array3::zeros((480, 640, 3))) .collect(); -// batch operations let crop_boxes = vec![(50, 50, 200, 200); 10]; // (x, y, width, height) let cropped = batch_crop_image_arrays(&images, &crop_boxes); @@ -111,181 +70,111 @@ let luminances = batch_calculate_luminance_arrays(&images); ## api reference -### python functions - -#### `batch_crop_images(images, crop_boxes)` +### `batch_crop_images(images, crop_boxes)` -- `images`: list of numpy arrays (H, W, 3) uint8 -- `crop_boxes`: list of (x, y, width, height) tuples -- returns: list of cropped numpy arrays -- **implementation**: TSR-optimized for mixed-shape batching +- `images`: list of NumPy arrays shaped `(H, W, C)` with `uint8` data +- `crop_boxes`: list of `(x, y, width, height)` tuples +- returns: list of cropped NumPy arrays +- notes: output arrays are owned by NumPy without an extra copy from the owned Rust array -#### `batch_center_crop_images(images, target_sizes)` +### `batch_center_crop_images(images, target_sizes)` -- `images`: list of numpy arrays (H, W, 3) uint8 -- `target_sizes`: list of (width, height) tuples -- returns: list of center-cropped numpy arrays -- **implementation**: TSR-optimized for mixed-shape batching +- `images`: list of NumPy arrays shaped `(H, W, C)` with `uint8` data +- `target_sizes`: list of `(width, height)` tuples +- returns: list of center-cropped NumPy arrays -#### `batch_random_crop_images(images, target_sizes)` +### `batch_random_crop_images(images, target_sizes)` -- `images`: list of numpy arrays (H, W, 3) uint8 -- `target_sizes`: list of (width, height) tuples -- returns: list of
randomly cropped numpy arrays -- **implementation**: TSR-optimized for mixed-shape batching +- `images`: list of NumPy arrays shaped `(H, W, C)` with `uint8` data +- `target_sizes`: list of `(width, height)` tuples +- returns: list of randomly cropped NumPy arrays -#### `batch_resize_images(images, target_sizes)` +### `batch_resize_images(images, target_sizes)` -- `images`: list of numpy arrays (H, W, 3) uint8 -- `target_sizes`: list of (width, height) tuples -- returns: list of resized numpy arrays -- **implementation**: OpenCV for optimal performance +- `images`: list of NumPy arrays shaped `(H, W, 3)` with `uint8` data +- `target_sizes`: list of `(width, height)` tuples +- returns: list of resized NumPy arrays +- implementation: OpenCV-backed resize with Rust/PyO3 conversion handling -#### `batch_calculate_luminance(images)` +### `batch_calculate_luminance(images)` -- `images`: list of numpy arrays (H, W, 3) uint8 +- `images`: list of NumPy arrays shaped `(H, W, C)` with `uint8` data - returns: list of float luminance values -- **implementation**: TSR SIMD-optimized (10-35x faster than NumPy) - -#### `batch_resize_videos(videos, target_sizes)` - -- `videos`: list of numpy arrays (T, H, W, 3) uint8 -- `target_sizes`: list of (width, height) tuples -- returns: list of resized video numpy arrays - -### rust functions - -same signatures but with `ndarray::Array3` and `ndarray::Array4` instead of numpy arrays. check the docs for details. 
- -## architecture - -TSR uses a **best-of-breed hybrid approach** for optimal performance: - -### operation selection - -- **cropping operations**: TSR implementation - - mixed-shape batching (8 different input shapes → 7 different output shapes) - - single API call: `tsr.batch_crop_images(mixed_images, mixed_crops)` - - vs competitor: individual loops required for each shape combination +- notes: contiguous RGB/RGBA-like arrays use a channel-sum fast path; strided arrays fall back to the general ndarray path -- **luminance calculation**: TSR SIMD implementation - - **18x faster** than NumPy for mixed-shape batches - - **35x faster** than NumPy for uniform batches - - vectorized across different image sizes in single batch call +### `batch_resize_videos(videos, target_sizes)` -- **resize operations**: OpenCV implementation - - industry-standard performance and quality - - highly optimized C++ implementations - - **7-25x faster** than TSR resize implementations +- `videos`: list of NumPy arrays shaped `(T, H, W, 3)` with `uint8` data +- `target_sizes`: list of `(width, height)` tuples +- returns: list of resized video NumPy arrays -### static wheel distribution +## current benchmark snapshot -- OpenCV **statically linked** into wheel (no external dependencies) -- single `pip install trainingsample` - no opencv-python conflicts -- consistent performance across platforms -- ~50MB wheel includes all optimizations +These numbers are from the local benchmark run after the latest Python-interface optimizations: -## features - -- **hybrid architecture**: best implementation for each operation -- parallel processing with rayon (actually uses your cores) -- zero-copy numpy integration via rust-numpy -- proper error handling (no silent failures) -- **static OpenCV** bundled (no external dependencies) -- no python threading nonsense, GIL is released -- memory efficient batch operations -- supports both images and videos - -## 🏆 Industry-Leading Performance -
-**BREAKTHROUGH ACHIEVEMENT: First library to beat cv2 by eliminating Python binding overhead while leveraging OpenCV's full C++ power** - -### 🥇 vs. opencv-python (cv2) - -| Operation | cv2 (opencv-python) | TSR (OpenCV+Rust) | TSR Speedup | Achievement | -|-----------|---------------------|-------------------|-------------|-------------| -| **Single Resize** | 0.134ms | **0.120ms** | **1.12x FASTER** | 🏆 **Beats cv2 bindings** | -| **Batch Resize (8)** | 1.10ms | **0.47ms** | **2.4x FASTER** | 🏆 **Leverages OpenCV C++** | -| **Crop Operations** | 1.40ms | **0.34ms** | **4.1x FASTER** | 🏆 **Zero-copy optimization** | -| **Luminance Calc** | 4.38ms | **0.55ms** | **8.0x FASTER** | 🏆 **SIMD + OpenCV power** | - -### 🚀 Peak Performance Numbers -- **17,204 images/sec** - Batch resize throughput -- **Zero wrapper overhead** - Eliminated 76% of artificial performance losses -- **True zero-copy** - Raw pointer → numpy conversion on-demand -- **Intelligent dispatch** - Same API for single + batch with optimal performance - -### 🎯 Real-World Advantages - -#### How We Achieve This -1. **Direct OpenCV C++**: Bypass cv2's Python binding overhead entirely -2. **Zero artificial overhead**: Direct Mat headers, no intermediate conversions -3. **Buffer pooling**: Memory reuse eliminates allocation bottlenecks that plague Python bindings -4. **Adaptive threading**: Smart parallelization leveraging Rust's superior threading -5. **Intelligent API**: Seamless auto-batching with optimal performance dispatch - -#### Industry Impact -- **Computer Vision**: First library to beat cv2 by leveraging OpenCV's full C++ power -- **Machine Learning**: Faster preprocessing = faster training pipelines -- **Real-time Applications**: Sub-millisecond image processing capabilities -- **Memory Efficiency**: True zero-copy iteration for large datasets - -**Bottom Line**: We leverage OpenCV's C++ excellence to eliminate the performance bottlenecks in Python bindings.
- -## Apple Silicon Performance (M3 Max) - -Optimized SIMD implementations with concrete benchmarks: - -| Operation | Algorithm | Implementation | Speedup | Performance | -|-----------|-----------|----------------|---------|-------------| -| **Image Resize** | Bilinear | Multi-core NEON | **10.2x** | 1,412 MPx/s | -| **Image Resize** | Lanczos4 | Metal GPU | **11.8x** | 112 MPx/s | -| **Format Conversion** | RGB→RGBA | Portable SIMD | **4.4x** | 1,500 MPx/s | -| **Format Conversion** | RGBA→RGB | Portable SIMD | **2.6x** | 1,651 MPx/s | -| **Luminance Calc** | RGB→Y | NEON SIMD | **4.7x** | 545 images/sec | +```bash +.venv/bin/python -m pytest tests/test_performance_benchmarks.py -q -s +``` -**Key Insights:** +Environment: Linux x86_64, CPython 3.13, NumPy 2.3.4, system OpenCV 4.11 through the Rust `opencv` crate. Treat these as a point-in-time reference, not a cross-machine guarantee. -- **CPU SIMD** (multi-core NEON) optimal for memory-bound operations like bilinear resize -- **GPU Metal** dominates compute-intensive algorithms like Lanczos4 interpolation -- **Unified memory** architecture enables zero-copy GPU operations -- **Automatic selection** between CPU/GPU based on algorithm characteristics +| Benchmark | Before | After | Notes | +|-----------|--------|-------|-------| +| Crop batch, 16 images | 22.9 ms | 0.4 ms | Public `batch_crop_images` path | +| Mixed-shape crop, 8 images | 50.2 ms | 3.3 ms | Mixed input and output sizes | +| Luminance batch, 4 mixed images | 10.4 ms | 0.6 ms | Now faster than the OpenCV comparison in this run | +| Mixed-shape luminance, 6 images | 78.3 ms | 3.3 ms | NumPy comparison was 19.4 ms in this run | +| Complete resize + luminance pipeline | 5.9 ms | 0.6 ms | Four mixed-size inputs to 224x224 | -Tested on Apple Silicon M3 Max (12 P-cores, 38-core GPU, 400 GB/s unified memory).
+Pytest-benchmark means from the same suite: -## why this hybrid approach +| Benchmark | Mean after | +|-----------|------------| +| Center crop | 55.2 us | +| Resize operations | 353.1 us | +| Luminance calculation | 417.2 us | +| Crop operations | 583.8 us | +| Pipeline | 3.44 ms | +| Video processing | 2.85 ms | -### vs pure opencv/pil +## architecture -- **OpenCV alone**: excellent resize performance, but poor mixed-shape batching -- **PIL**: slow, GIL-bound, no batch operations -- **TSR hybrid**: combines OpenCV's resize speed with TSR's batch/SIMD advantages +TrainingSample uses different implementations for different operation types: -### vs pure rust implementations +- Cropping: Rust/ndarray implementation with owned-array transfer into NumPy. +- Luminance: Rust channel-sum fast path for contiguous arrays, with a general ndarray fallback for non-contiguous inputs. +- Resize: OpenCV-backed implementation for image quality and mature interpolation behavior. +- Video resize: OpenCV-backed frame resizing with batched Python binding output. +- Format conversion: Rust SIMD implementation where the `simd` feature is enabled. -- **TSR resize**: slower than OpenCV's highly-optimized C++ (7-25x difference) -- **TSR luminance**: faster than NumPy due to SIMD (18-35x speedup) -- **best of both**: use optimal implementation for each operation +The optimized path generally requires contiguous `uint8` arrays. Views such as `image[:, ::2, :]` remain supported by safe public APIs, but they may use slower fallback paths. 
-### static distribution advantage +## features -- **no dependency conflicts**: opencv-python version compatibility issues eliminated -- **consistent performance**: same optimized OpenCV across all platforms -- **simple deployment**: single wheel, no system dependencies +- Python bindings through PyO3 and rust-numpy +- Batch APIs for images and videos +- OpenCV-compatible constants and helper functions for common operations +- Optional SIMD feature for format conversion and selected numeric paths +- Error handling for invalid dimensions, unsupported channels, and invalid crop bounds +- Source build support for dynamic or static OpenCV configurations ## building from source ```bash -# for python pip install maturin maturin develop --release +``` -# for rust -cargo build --release +The OpenCV Rust bindings need to find a working OpenCV and Clang installation. If the environment has stale OpenCV or LLVM variables, unset them before building: + +```bash +env -u OPENCV_LINK_LIBS -u OPENCV_LINK_PATHS -u OPENCV_INCLUDE_PATHS \ + -u LIBCLANG_PATH -u LLVM_CONFIG_PATH \ + maturin develop --release ``` -requires rust 1.70+ and python 3.11+ if you want the python bindings. +See [docs/BUILDING_STATIC_OPENCV.md](docs/BUILDING_STATIC_OPENCV.md) for static OpenCV bundle notes. ## license -MIT. do whatever you want with it, leave attribution in-tact. +MIT. See [LICENSE](LICENSE). diff --git a/docs/API_COMPAT_CV2.md b/docs/API_COMPAT_CV2.md index b7fff75..04d2037 100644 --- a/docs/API_COMPAT_CV2.md +++ b/docs/API_COMPAT_CV2.md @@ -1,217 +1,143 @@ -# OpenCV (cv2) API Compatibility Guide +# OpenCV API Compatibility Guide -TrainingSample provides drop-in replacements for common OpenCV operations with significant performance improvements through Rust optimizations and true batch processing. +TrainingSample exposes a subset of OpenCV-style image APIs plus batch-oriented helpers. 
The goal is to reduce Python loop overhead for common preprocessing workloads, not to implement the full `cv2` surface. -## Quick Start: Drop-in Replacement - -Replace `cv2` imports with `trainingsample` for instant performance gains: +## Quick Start ```python -# old cv2 approach import cv2 import numpy as np - -# new high-performance approach import trainingsample as tsr ``` -## 🏆 Zero-Copy Operations (Industry-Leading Performance) - -**BREAKTHROUGH ACHIEVEMENT: We leverage OpenCV's C++ power to BEAT opencv-python (cv2) while providing record-breaking batch processing!** +Use TrainingSample where a matching helper exists: -### Single Image Resizing (Faster than cv2!) ```python -# 1.12x FASTER than cv2.resize() - leveraging OpenCV C++ without binding overhead -result = tsr.batch_resize_images_zero_copy( - img, # np.ndarray - single image - target_size, # (width, height) - target dimensions - interpolation=tsr.INTER_LINEAR # Optional: INTER_NEAREST, INTER_LINEAR (default), INTER_CUBIC, INTER_LANCZOS4 -) -# Direct numpy array return, zero wrapper overhead, intelligent dispatch +resized = tsr.resize(image, (224, 224), interpolation=tsr.INTER_LINEAR) +gray = tsr.cvt_color(image, tsr.COLOR_RGB2GRAY) +edges = tsr.canny(image, threshold1=50, threshold2=150) ``` -### Batch Resizing (Multiple APIs for Different Use Cases) -```python -# BATCH LIST API: 2.4x faster than OpenCV individual calls -results = tsr.batch_resize_images_zero_copy( - images, # List[np.ndarray] - batch of images - target_sizes, # List[(width, height)] - target dimensions - interpolation=tsr.INTER_LINEAR # Optional: choose interpolation method -) -# Returns: List[np.ndarray] - perfect for immediate processing - -# ITERATOR API: True zero-copy with lazy conversion (memory efficient) -for result in tsr.batch_resize_images_iterator(images, target_sizes, interpolation=tsr.INTER_CUBIC): - process(result) # Convert only when accessed, supports early termination -# 2.3x faster than OpenCV, minimal memory
footprint -``` +For batches, prefer the batch APIs instead of a Python loop: -#### 🎯 Interpolation Methods ```python -# Available interpolation constants (OpenCV-compatible): -tsr.INTER_NEAREST # Fast, blocky - good for masks/labels -tsr.INTER_LINEAR # Default - good balance of speed and quality -tsr.INTER_CUBIC # High quality, slower - best for upsampling -tsr.INTER_LANCZOS4 # Best quality, slowest - professional upsampling - -# Usage examples: -fast_resize = tsr.batch_resize_images_zero_copy(images, sizes, tsr.INTER_NEAREST) -quality_resize = tsr.batch_resize_images_zero_copy(images, sizes, tsr.INTER_LANCZOS4) - -# Performance vs Quality Trade-offs: -# INTER_NEAREST: ~4x faster than LANCZOS4, acceptable for downsampling -# INTER_LINEAR: ~2x faster than LANCZOS4, good general purpose (default) -# INTER_CUBIC: ~1.5x faster than LANCZOS4, good for upsampling -# INTER_LANCZOS4: Best quality, use for professional image processing +images = [load_image(path) for path in paths] +sizes = [(224, 224)] * len(images) + +resized = tsr.batch_resize_images(images, sizes) +luminances = tsr.batch_calculate_luminance(resized) ``` -### Batch Cropping (Zero-Copy) +## Supported OpenCV-Style Operations + +### Image Decoding + ```python -# 4-5x faster than regular batch operations -cropped = tsr.batch_crop_images_zero_copy( - images, # List[np.ndarray] - batch of images - crop_boxes # List[(x, y, width, height)] - crop coordinates -) - -# Center cropping with zero-copy optimization -center_cropped = tsr.batch_center_crop_images_zero_copy( - images, # List[np.ndarray] - target_sizes # List[(width, height)] -) +with open("image.jpg", "rb") as f: + img_bytes = f.read() + +img = tsr.imdecode(img_bytes, tsr.IMREAD_COLOR) +img_gray = tsr.imdecode(img_bytes, tsr.IMREAD_GRAYSCALE) ``` -### Batch Luminance (Zero-Copy + Parallel) +### Color Space Conversion + ```python -# 5-8x faster with parallel processing + adaptive SIMD -luminances = tsr.batch_calculate_luminance_zero_copy(images) -#
Returns: List[float] - ITU-R BT.709 luminance values (0-255 range) +gray = tsr.cvt_color(image, tsr.COLOR_RGB2GRAY) +bgr = tsr.cvt_color(image, tsr.COLOR_RGB2BGR) ``` -## 📊 Standard Batch Operations - -High-performance batch processing for common operations: +### Edge Detection -### Image Loading ```python -# Parallel image loading from file paths -images = tsr.load_image_batch([ - 'path/to/image1.jpg', - 'path/to/image2.png', - 'path/to/image3.webp' -]) +edges = tsr.canny(image, threshold1=50, threshold2=150) ``` -### Batch Cropping +### Image Resizing + ```python -# Regular batch cropping (still faster than individual cv2 calls) -images = tsr.batch_crop_images(images, crop_boxes) -center_cropped = tsr.batch_center_crop_images(images, target_sizes) -random_cropped = tsr.batch_random_crop_images(images, target_sizes) +resized = tsr.resize(image, (width, height), interpolation=tsr.INTER_LINEAR) ``` -### Batch Resizing (Zero-Copy) +Supported interpolation constants: + ```python -# Ultra-fast zero-copy batch resizing (8+ images for optimal performance) -resized = tsr.batch_resize_images_zero_copy( - images, # List[np.ndarray] - batch of images - target_sizes # List[(width, height)] - target dimensions -) -# 2.4x faster than OpenCV individual calls at 64 images -# 16,306 images/sec throughput with parallel processing +tsr.INTER_NEAREST +tsr.INTER_LINEAR +tsr.INTER_CUBIC +tsr.INTER_LANCZOS4 ``` -### Standard Batch Resizing +## Batch Operations + +### Cropping + ```python -# High-performance batch resizing -resized = tsr.batch_resize_images( - images, - target_sizes, # List[(width, height)] - interpolation="bilinear" # or "lanczos" -) - -# Video frame batch processing -video_frames = tsr.batch_resize_videos(videos, target_sizes) +crop_boxes = [(x, y, width, height) for image in images] +cropped = tsr.batch_crop_images(images, crop_boxes) ``` -### Batch Luminance Calculation +Center and random crop helpers use target sizes: + ```python -# Calculate ITU-R BT.709
luminance for batch of images -luminances = tsr.batch_calculate_luminance(images) -# Formula: L = 0.2126*R + 0.7152*G + 0.0722*B +target_sizes = [(224, 224)] * len(images) +center_cropped = tsr.batch_center_crop_images(images, target_sizes) +random_cropped = tsr.batch_random_crop_images(images, target_sizes) ``` -## 🎨 Format Conversion (Ultra-Fast) - -Sub-millisecond format conversions with SIMD optimization: +### Resizing ```python -# RGB to RGBA conversion (add alpha channel) -rgba_image, timing = tsr.rgb_to_rgba_optimized(rgb_image, alpha=255) - -# RGBA to RGB conversion (remove alpha channel) -rgb_image, timing = tsr.rgba_to_rgb_optimized(rgba_image) +target_sizes = [(224, 224)] * len(images) +resized = tsr.batch_resize_images(images, target_sizes) ``` -## 🔧 OpenCV-Compatible Individual Operations +The public resize API returns a list of owned NumPy arrays. Current implementation uses OpenCV-backed resize internally and transfers owned Rust arrays into NumPy without an additional copy. -Drop-in replacements for common cv2 functions: +### Luminance -### Image Decoding ```python -# Equivalent to cv2.imdecode() -import trainingsample as tsr +luminances = tsr.batch_calculate_luminance(images) +``` -# Read image bytes -with open('image.jpg', 'rb') as f: - img_bytes = f.read() +For contiguous arrays, luminance uses a channel-sum fast path. Non-contiguous arrays are accepted by the safe public API but may run through a slower ndarray fallback.
-# Decode with OpenCV-compatible flags -img = tsr.imdecode(img_bytes, tsr.IMREAD_COLOR) -img_gray = tsr.imdecode(img_bytes, tsr.IMREAD_GRAYSCALE) -``` +### Video Resizing -### Color Space Conversion ```python -# Equivalent to cv2.cvtColor() -gray = tsr.cvt_color(image, tsr.COLOR_RGB2GRAY) -bgr = tsr.cvt_color(image, tsr.COLOR_RGB2BGR) +videos = [video_array] # shape: (frames, height, width, 3) +target_sizes = [(224, 224)] +resized_videos = tsr.batch_resize_videos(videos, target_sizes) ``` -### Edge Detection -```python -# Equivalent to cv2.Canny() -edges = tsr.canny(image, threshold1=50, threshold2=150) -``` +## Zero-Copy Entry Points + +Some lower-level APIs expose stricter zero-copy behavior: -### Image Resizing ```python -# Equivalent to cv2.resize() -resized = tsr.resize(image, (width, height), interpolation=tsr.INTER_LINEAR) +cropped = tsr.batch_crop_images_zero_copy(images, crop_boxes) +luminances = tsr.batch_calculate_luminance_zero_copy(images) +resized = tsr.batch_resize_images_zero_copy(images, target_sizes) ``` -## 📹 Video Processing +These functions are intended for contiguous arrays. Unsafe zero-copy crop and resize paths reject non-contiguous views with a `ValueError`.
-OpenCV-compatible video capture and writing: +## Video Capture and Writing -### Video Capture ```python -# Equivalent to cv2.VideoCapture -cap = tsr.VideoCapture('video.mp4') +cap = tsr.VideoCapture("video.mp4") if cap.is_opened(): ret, frame = cap.read() if ret: - # Process frame - processed = tsr.batch_calculate_luminance([frame]) + luminance = tsr.batch_calculate_luminance([frame]) cap.release() ``` -### Video Writing ```python -# Equivalent to cv2.VideoWriter -fourcc = tsr.fourcc('M', 'J', 'P', 'G') -writer = tsr.VideoWriter('output.avi', fourcc, 30.0, (width, height)) +fourcc = tsr.fourcc("M", "J", "P", "G") +writer = tsr.VideoWriter("output.avi", fourcc, 30.0, (width, height)) for frame in frames: writer.write(frame) @@ -219,139 +145,72 @@ for frame in frames: writer.release() ``` -## 🔍 Object Detection +## Object Detection ```python -# Equivalent to cv2.CascadeClassifier -classifier = tsr.CascadeClassifier('haarcascade_frontalface_alt.xml') +classifier = tsr.CascadeClassifier("haarcascade_frontalface_alt.xml") faces = classifier.detect_multi_scale(image) ``` -## ⚡ Performance Comparison - -| Operation | cv2 Individual | TSR Batch | TSR Zero-Copy | TSR Iterator | Best Speedup | -|-----------|---------------|-----------|---------------|--------------|--------------| -| **Single Resize** | **0.134ms** | **-** | **0.146ms** | **-** | **1.12x FASTER** 🏆 | -| Crop | 1.40ms | 1.40ms | 0.34ms | - | **4.1x** 🏆 | -| Center Crop | 1.59ms | 1.59ms | 0.48ms | - | **3.3x** 🏆 | -| Luminance | 4.38ms | 4.38ms | 0.55ms | - | **8.0x** 🏆 | -| **Batch Resize (8)** | **1.10ms** | **0.47ms** | **-** | **0.48ms** | **2.4x** 🏆 | -| Format Conv | 0.10ms | 0.02ms | 0.01ms | - | **10x** 🏆 | - -## 🎯 Best Practices +## Benchmark Snapshot -### When to Use Zero-Copy Operations -- **Always use for batch processing** - 3-8x performance gains -- **Large image datasets** - Memory-efficient with buffer pooling -- **Real-time applications** - Parallel processing + SIMD
acceleration +The following numbers came from the local benchmark suite after the latest Python-interface optimization work: -### Migration from OpenCV -```python -# SINGLE IMAGE: Drop-in replacement that's actually FASTER -# before -result = cv2.resize(img, (256, 256)) - -# after (1.12x FASTER!) -result = tsr.batch_resize_images_zero_copy(img, (256, 256)) - -# BATCH PROCESSING: Massive speedup -# before (slow) -results = [] -for img in images: - result = cv2.resize(img, (256, 256)) - results.append(result) - -# after (2.4x FASTER!) -results = tsr.batch_resize_images_zero_copy(images, [(256, 256)] * len(images)) - -# MEMORY EFFICIENT: Iterator for large batches -for result in tsr.batch_resize_images_iterator(images, target_sizes): - process(result) # Convert only when needed +```bash +.venv/bin/python -m pytest tests/test_performance_benchmarks.py -q -s ``` -### Memory Efficiency -```python -# before (slow - multiple boundary crossings) -for img in images: - gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY) - resized = cv2.resize(gray, target_size) - edges = cv2.Canny(resized, 50, 150) - -# after (fast - single batch operation) -grays = tsr.batch_cvt_color(images, tsr.COLOR_RGB2GRAY) -resized = tsr.batch_resize_images_zero_copy(grays, sizes) -edges = tsr.batch_canny(resized, threshold1=50, threshold2=150) -``` +Environment: Linux x86_64, CPython 3.13, NumPy 2.3.4, system OpenCV 4.11 through the Rust `opencv` crate. 
-## 🚀 Advanced Features +| Scenario | TrainingSample | Comparison in same run | +|----------|----------------|------------------------| +| Batch resize, 4 mixed-size images | 0.4 ms | OpenCV loop: 2.6 ms | +| Batch luminance, 4 mixed-size images | 0.6 ms | OpenCV loop: 0.9 ms | +| Resize + luminance pipeline, 4 mixed-size images | 0.6 ms | OpenCV loop: 2.1 ms | +| Mixed-shape luminance, 6 images | 3.3 ms | NumPy loop: 19.4 ms | +| Mixed-shape crop, 8 images | 3.3 ms | NumPy slicing loop: near-zero because slicing returns views | -### Adaptive SIMD Processing -TrainingSample automatically chooses between SIMD and scalar operations based on image size: -- **Small images (<64K pixels)**: Scalar processing (avoids SIMD overhead) -- **Large images (>64K pixels)**: SIMD acceleration (AVX2/NEON) +The crop comparison is intentionally caveated: NumPy slicing can be effectively free when it returns a view. TrainingSample returns owned output arrays, which is the right comparison when the next stage needs independent contiguous buffers.
-### Buffer Pool Management -Zero-copy operations use intelligent buffer pooling: -- **Automatic memory reuse** across batch operations -- **Size-based pooling** for optimal allocation patterns -- **Thread-safe sharing** for parallel processing +## Migration Notes + +### Prefer Batch APIs for Repeated Work -### Parallel Processing Architecture ```python -# Automatically parallelizes across available CPU cores -luminances = tsr.batch_calculate_luminance_zero_copy(images) -# - Extracts raw pointers on main thread -# - Distributes processing across worker threads -# - Uses lock-free data structures for maximum throughput -``` +# OpenCV loop +results = [cv2.resize(img, (224, 224)) for img in images] -## 🔧 Installation & Setup +# TrainingSample batch call +results = tsr.batch_resize_images(images, [(224, 224)] * len(images)) +``` -```bash -pip install trainingsample +### Keep Inputs Contiguous When Performance Matters -# For maximum performance, ensure you have: -# - Multi-core CPU (parallel processing) -# - AVX2 support (x86) or NEON (ARM) for SIMD +```python +if not image.flags["C_CONTIGUOUS"]: + image = np.ascontiguousarray(image) ``` -## 📈 Benchmarking Your Workload +Public safe APIs accept many strided views, but contiguous arrays are usually faster and are required by strict zero-copy paths. + +### Validate Your Own Workload + +Image shape, interpolation, batch size, and memory bandwidth can change results.
Benchmark the exact pipeline you intend to ship: ```python import time -import trainingsample as tsr - -# Benchmark your specific use case -images = load_your_images() start = time.perf_counter() -results = tsr.batch_operation_zero_copy(images, params) +results = tsr.batch_resize_images(images, sizes) duration = time.perf_counter() - start -print(f"Processed {len(images)} images in {duration*1000:.2f}ms") -print(f"Throughput: {len(images)/duration:.1f} images/sec") +print(f"{len(images) / duration:.1f} images/sec") ``` -## 🏆 Summary - -TrainingSample provides: -- **memory efficiency**: reduced Python object overhead in batch operations -- **computational efficiency**: SIMD vectorization and parallel processing -- **API compatibility**: drop-in replacement for common cv2 operations -- **zero-copy semantics**: direct buffer manipulation for maximum performance - -**INDUSTRY-LEADING Performance Gains:** -- **BEATS OpenCV** for single image operations (1.12x faster resize) -- **2.4x faster** batch processing vs OpenCV individual calls -- **17,204+ images/sec** batch resize throughput -- **True zero-copy iteration** with lazy conversion -- **100% API compatibility** with OpenCV - drop-in replacement -- **Intelligent auto-batching** - same function handles single + batch -- **Memory usage reduction** through buffer pooling + lazy conversion - -**Limitations:** -- **memory overhead**: batch processing requires significant RAM for large images -- **startup cost**: small overhead for very small batches (<5 images) -- **Python GIL**: some operations still limited by Python's global interpreter lock - -For maximum performance gains, use the zero-copy batch operations with mixed-size image datasets on multi-core systems. +## Limitations + +- This is not a complete `cv2` replacement. +- Batch APIs allocate owned output arrays. +- Small inputs can be dominated by call overhead. +- Zero-copy functions require contiguous arrays for crop and resize paths.
+- System OpenCV and wheel build configuration can affect performance and available codecs. diff --git a/docs/BUILDING_STATIC_OPENCV.md b/docs/BUILDING_STATIC_OPENCV.md index dd853be..e63e672 100644 --- a/docs/BUILDING_STATIC_OPENCV.md +++ b/docs/BUILDING_STATIC_OPENCV.md @@ -2,7 +2,7 @@ The `opencv` crate expects to find an existing OpenCV toolkit and, by default, it links against the dynamic libraries that come with a system installation -(`libopencv_core.dylib`, `libopencv_core.so`, …). To ship the `trainingsample` +(`libopencv_core.dylib`, `libopencv_core.so`, etc.). To ship the `trainingsample` crate without asking end users to install OpenCV themselves, build a static OpenCV distribution once and point Cargo at it during compilation. @@ -104,7 +104,7 @@ OpenCV distribution once and point Cargo at it during compilation. ```bash cp ~/Downloads/opencv-build-static/3rdparty/lib/liblibjpeg-turbo.a third_party/opencv-static/lib/ ln -sf liblibjpeg-turbo.a third_party/opencv-static/lib/libjpeg.a - # Repeat for liblibpng.a→libpng.a, liblibtiff.a→libtiff.a, liblibwebp.a→libwebp.a, libzlib.a→libz.a, liblibjasper.a→libjasper.a + # Repeat for liblibpng.a -> libpng.a, liblibtiff.a -> libtiff.a, liblibwebp.a -> libwebp.a, libzlib.a -> libz.a, liblibjasper.a -> libjasper.a ``` 3. After installation you should have: @@ -130,7 +130,7 @@ OpenCV distribution once and point Cargo at it during compilation. > library. Replace `static=stdc++` with `dylib=c++` (or `framework=Accelerate` > when required) in the linking step below. -## 2. Point Cargo at the static toolchain +## 3. Point Cargo at the static toolchain Add a `.cargo/config.toml` (kept inside the repo) with the environment variables that the `opencv` build script understands: @@ -159,7 +159,7 @@ file so subsequent runs do not skip regeneration. If you elected to install the individual module archives instead of `opencv_world`, list each one (`static=opencv_core`, `static=opencv_imgproc`, -…).
Keep the order roughly from high- to low-level modules so the linker can +and so on). Keep the order roughly from high- to low-level modules so the linker can resolve symbols in one pass. For cross-compilation add target-specific sections, e.g.: @@ -169,7 +169,7 @@ For cross-compilation add target-specific sections, e.g.: OPENCV_LINK_LIBS = "static=opencv_world,static=avformat,static=avcodec,static=avfilter,static=swresample,static=swscale,static=avutil,static=png,static=jpeg,static=tiff,static=z,static=jasper,dylib=c++" ``` -## 3. Build the crate +## 4. Build the crate With the static bundle in place you can now build the crate without touching the system OpenCV installation: @@ -182,7 +182,7 @@ The resulting `libtrainingsample.{so,dylib}` (or the wheels produced by the Python bindings) now embed the OpenCV symbols directly, so end users do not need `opencv_core` on their machines. -## 4. Regenerating the bundle +## 5. Regenerating the bundle Whenever you need to update OpenCV: @@ -190,8 +190,8 @@ Whenever you need to update OpenCV: tree. 2. Verify that the list in `OPENCV_LINK_LIBS` still matches the archives produced. 3. Commit the regenerated contents of `third_party/opencv-static/` if you keep - it under version control (or upload it to your release pipeline’s artifact + it under version control (or upload it to your release pipeline's artifact store). -That is all Cargo needsβ€”no changes to `Cargo.toml` are required beyond enabling +That is all Cargo needs. No changes to `Cargo.toml` are required beyond enabling the `opencv` feature when you want the acceleration path. 
diff --git a/src/luminance.rs b/src/luminance.rs index 38e4cac..daffe48 100644 --- a/src/luminance.rs +++ b/src/luminance.rs @@ -1,37 +1,51 @@ use ndarray::ArrayView3; #[cfg(feature = "simd")] -pub use crate::luminance_simd::{ - calculate_luminance_optimized, calculate_luminance_optimized_sequential, LuminanceMetrics, -}; +pub use crate::luminance_simd::{calculate_luminance_optimized, LuminanceMetrics}; /// Main luminance calculation function with automatic SIMD optimization pub fn calculate_luminance_array(image: &ArrayView3) -> f64 { - #[cfg(feature = "simd")] - { - let (result, _metrics) = calculate_luminance_optimized(image); - result + if let Some(result) = calculate_luminance_contiguous(image) { + return result; } - #[cfg(not(feature = "simd"))] - { - calculate_luminance_scalar(image) - } + calculate_luminance_scalar(image) } /// Single-threaded luminance calculation to avoid nested parallelism in batch operations pub fn calculate_luminance_array_sequential(image: &ArrayView3) -> f64 { - #[cfg(feature = "simd")] - { - // Use single-threaded SIMD optimization to avoid nested parallelism - let (result, _metrics) = calculate_luminance_optimized_sequential(image); - result + if let Some(result) = calculate_luminance_contiguous(image) { + return result; } - #[cfg(not(feature = "simd"))] - { - calculate_luminance_scalar(image) + calculate_luminance_scalar(image) +} + +fn calculate_luminance_contiguous(image: &ArrayView3) -> Option { + let (height, width, channels) = image.dim(); + let data = image.as_slice()?; + + if height == 0 || width == 0 || channels == 0 { + return Some(0.0); + } + + if channels < 3 { + let sum: u64 = data.iter().map(|&x| x as u64).sum(); + return Some(sum as f64 / data.len() as f64); } + + let pixel_count = height * width; + let mut r_sum = 0u64; + let mut g_sum = 0u64; + let mut b_sum = 0u64; + + for pixel in data.chunks_exact(channels).take(pixel_count) { + r_sum += pixel[0] as u64; + g_sum += pixel[1] as u64; + b_sum += pixel[2] as u64; + 
} + + Some((0.299 * r_sum as f64 + 0.587 * g_sum as f64 + 0.114 * b_sum as f64) / pixel_count as f64) } /// Ultra-fast adaptive luminance calculation with automatic SIMD/scalar selection diff --git a/src/python_bindings.rs b/src/python_bindings.rs index 869e030..106939a 100644 --- a/src/python_bindings.rs +++ b/src/python_bindings.rs @@ -155,7 +155,7 @@ pub unsafe fn batch_crop_images_zero_copy<'py>( let array = ndarray::Array3::from_shape_vec((height, width, channels), output_buffer) .map_err(|e| pyo3::exceptions::PyValueError::new_err(format!("Shape error: {}", e)))?; - let py_array = PyArray3::from_array_bound(py, &array); + let py_array = PyArray3::from_owned_array_bound(py, array); py_results.push(py_array); } @@ -206,7 +206,7 @@ pub unsafe fn batch_center_crop_images_zero_copy<'py>( .map_err(|e| { pyo3::exceptions::PyValueError::new_err(format!("Shape error: {}", e)) })?; - let py_array = PyArray3::from_array_bound(py, &array); + let py_array = PyArray3::from_owned_array_bound(py, array); py_results.push(py_array); } @@ -389,8 +389,8 @@ pub fn batch_resize_images_zero_copy<'py>( // Convert to PyArray3 and return as Python list let py_results: Vec>> = results - .iter() - .map(|array| PyArray3::from_array_bound(py, array)) + .into_iter() + .map(|array| PyArray3::from_owned_array_bound(py, array)) .collect(); Ok(PyList::new_bound(py, py_results).into_any()) @@ -552,7 +552,7 @@ fn resize_single_image_direct<'py>( })?; // DIRECT return - no Vec wrapper overhead! - Ok(PyArray3::from_array_bound(py, &result)) + Ok(PyArray3::from_owned_array_bound(py, result)) } #[cfg(not(feature = "opencv"))] @@ -622,7 +622,7 @@ impl ResizeIterator { // Convert raw buffer directly to PyArray3 - ZERO intermediate steps! 
match ndarray::Array3::from_shape_vec((*height, *width, *channels), buffer.clone()) { - Ok(array) => Some(PyArray3::from_array_bound(py, &array)), + Ok(array) => Some(PyArray3::from_owned_array_bound(py, array)), Err(_) => None, // Skip malformed arrays } } @@ -874,7 +874,7 @@ pub fn batch_crop_images<'py>( let img_view = image.as_array(); match crop_image_array(&img_view, x, y, width, height) { Ok(cropped) => { - let py_array = PyArray3::from_array_bound(py, &cropped); + let py_array = PyArray3::from_owned_array_bound(py, cropped); py_results.push(py_array); } Err(e) => { @@ -907,7 +907,7 @@ pub fn batch_center_crop_images<'py>( let img_view = image.as_array(); match crate::cropping::center_crop_image_array(&img_view, target_width, target_height) { Ok(cropped) => { - let py_array = PyArray3::from_array_bound(py, &cropped); + let py_array = PyArray3::from_owned_array_bound(py, cropped); py_results.push(py_array); } Err(e) => { @@ -934,7 +934,7 @@ pub fn batch_random_crop_images<'py>( let img_view = image.as_array(); match random_crop_image_array(&img_view, target_width, target_height) { Ok(cropped) => { - let py_array = PyArray3::from_array_bound(py, &cropped); + let py_array = PyArray3::from_owned_array_bound(py, cropped); py_results.push(py_array); } Err(e) => { @@ -986,7 +986,7 @@ pub fn rgb_to_rgba_optimized<'py>( let rgba_array = ndarray::Array3::from_shape_vec((height, width, 4), rgba_data) .map_err(|e| pyo3::exceptions::PyValueError::new_err(format!("Shape error: {}", e)))?; - let py_array = PyArray3::from_array_bound(py, &rgba_array); + let py_array = PyArray3::from_owned_array_bound(py, rgba_array); Ok((py_array, metrics.throughput_mpixels_per_sec)) } @@ -1015,7 +1015,7 @@ pub fn rgba_to_rgb_optimized<'py>( let rgb_array = ndarray::Array3::from_shape_vec((height, width, 3), rgb_data) .map_err(|e| pyo3::exceptions::PyValueError::new_err(format!("Shape error: {}", e)))?; - let py_array = PyArray3::from_array_bound(py, &rgb_array); + let py_array = 
PyArray3::from_owned_array_bound(py, rgb_array); Ok((py_array, metrics.throughput_mpixels_per_sec)) } @@ -1059,7 +1059,7 @@ pub fn batch_resize_images<'py>( Ok(resized_images) => { let py_results: Vec<_> = resized_images .into_iter() - .map(|resized| PyArray3::from_array_bound(py, &resized)) + .map(|resized| PyArray3::from_owned_array_bound(py, resized)) .collect(); Ok(py_results) } @@ -1097,7 +1097,7 @@ pub fn batch_resize_videos<'py>( Ok(resized_videos) => { let py_results: Vec<_> = resized_videos .into_iter() - .map(|resized| PyArray4::from_array_bound(py, &resized)) + .map(|resized| PyArray4::from_owned_array_bound(py, resized)) .collect(); Ok(py_results) } @@ -1136,7 +1136,7 @@ pub fn resize_bilinear_opencv<'py>( match resize_bilinear_opencv(&image_array, target_width, target_height) { Ok(resized) => { - let py_array = PyArray3::from_array_bound(py, &resized); + let py_array = PyArray3::from_owned_array_bound(py, resized); Ok(py_array) } Err(e) => Err(pyo3::exceptions::PyRuntimeError::new_err(format!( @@ -1160,7 +1160,7 @@ pub fn resize_lanczos4_opencv<'py>( match resize_lanczos4_opencv(&image_array, target_width, target_height) { Ok(resized) => { - let py_array = PyArray3::from_array_bound(py, &resized); + let py_array = PyArray3::from_owned_array_bound(py, resized); Ok(py_array) } Err(e) => Err(pyo3::exceptions::PyRuntimeError::new_err(format!( @@ -1222,7 +1222,7 @@ pub fn imdecode_py<'py>( match imdecode(buf, imread_flags) { Ok(image) => { - let py_array = PyArray3::from_array_bound(py, &image); + let py_array = PyArray3::from_owned_array_bound(py, image); Ok(py_array) } Err(e) => Err(pyo3::exceptions::PyRuntimeError::new_err(format!( @@ -1258,7 +1258,7 @@ pub fn cvt_color_py<'py>( let src_array = src.as_array(); match cvt_color(&src_array, color_code) { Ok(converted) => { - let py_array = PyArray3::from_array_bound(py, &converted); + let py_array = PyArray3::from_owned_array_bound(py, converted); Ok(py_array) } Err(e) => 
Err(pyo3::exceptions::PyRuntimeError::new_err(format!( @@ -1281,7 +1281,7 @@ pub fn canny_py<'py>( let image_array = image.as_array(); match canny(&image_array, threshold1, threshold2) { Ok(edges) => { - let py_array = PyArray3::from_array_bound(py, &edges); + let py_array = PyArray3::from_owned_array_bound(py, edges); Ok(py_array) } Err(e) => Err(pyo3::exceptions::PyRuntimeError::new_err(format!( @@ -1317,7 +1317,7 @@ pub fn resize_py<'py>( let src_array = src.as_array(); match resize(&src_array, dsize, interp) { Ok(resized) => { - let py_array = PyArray3::from_array_bound(py, &resized); + let py_array = PyArray3::from_owned_array_bound(py, resized); Ok(py_array) } Err(e) => Err(pyo3::exceptions::PyRuntimeError::new_err(format!( @@ -1419,7 +1419,7 @@ impl PyVideoCapture { fn read<'py>(&mut self, py: Python<'py>) -> PyResult<(bool, Option>>)> { let (ret, frame) = self.inner.read(); if let Some(frame_data) = frame { - let py_array = PyArray3::from_array_bound(py, &frame_data); + let py_array = PyArray3::from_owned_array_bound(py, frame_data); Ok((ret, Some(py_array))) } else { Ok((ret, None)) @@ -1639,7 +1639,7 @@ impl PyBatchProcessor { Ok(results) => { let py_results: Vec<_> = results .into_iter() - .map(|result| PyArray3::from_array_bound(py, &result)) + .map(|result| PyArray3::from_owned_array_bound(py, result)) .collect(); Ok(py_results) } @@ -1679,7 +1679,7 @@ impl PyBatchProcessor { Ok(results) => { let py_results: Vec<_> = results .into_iter() - .map(|result| PyArray3::from_array_bound(py, &result)) + .map(|result| PyArray3::from_owned_array_bound(py, result)) .collect(); Ok(py_results) } @@ -1704,7 +1704,7 @@ impl PyBatchProcessor { Ok(results) => { let py_results: Vec<_> = results .into_iter() - .map(|result| PyArray3::from_array_bound(py, &result)) + .map(|result| PyArray3::from_owned_array_bound(py, result)) .collect(); Ok(py_results) } @@ -1782,7 +1782,7 @@ impl PyBatchProcessor { Ok(results) => { let py_results: Vec<_> = results .into_iter() - 
.map(|result| PyArray3::from_array_bound(py, &result)) + .map(|result| PyArray3::from_owned_array_bound(py, result)) .collect(); Ok(py_results) } @@ -1903,7 +1903,7 @@ impl PyTrueBatchProcessor { Ok(results) => { let py_results: Vec<_> = results .into_iter() - .map(|result| PyArray3::from_array_bound(py, &result)) + .map(|result| PyArray3::from_owned_array_bound(py, result)) .collect(); Ok(py_results) } @@ -1939,7 +1939,7 @@ impl PyTrueBatchProcessor { Ok(results) => { let py_results: Vec<_> = results .into_iter() - .map(|result| PyArray3::from_array_bound(py, &result)) + .map(|result| PyArray3::from_owned_array_bound(py, result)) .collect(); Ok(py_results) }