diff --git a/BENCHMARKS.md b/BENCHMARKS.md
index 4bbd7c4..287b28a 100644
--- a/BENCHMARKS.md
+++ b/BENCHMARKS.md
@@ -1,112 +1,137 @@
-# Competitive Performance Benchmarks
+# Performance Benchmarks
 
-This library includes comprehensive benchmarks against industry-standard libraries (OpenCV, NumPy) to ensure competitive performance for real-world SFT (Supervised Fine-Tuning) workloads.
+TrainingSample includes benchmarks for common preprocessing operations: crop, resize, luminance, resize-plus-luminance pipelines, and video frame resizing. The benchmarks are meant to catch regressions and provide workload-specific guidance, not to guarantee universal speedups over OpenCV or NumPy.
 
-## Benchmark Categories
+## Running Benchmarks
 
-### 🖼️ High-Resolution Image Processing
+Use the repository virtual environment when available:
 
-**Target Workload**: 5120×5120 → 1024×1024 image processing pipeline
-- **Input**: 5120×5120×3 images (26.2M pixels, ~78MB each)
-- **Pipeline**: Center crop → Resize → Luminance calculation
-- **Batch sizes**: 2-4 images (memory constrained)
+```bash
+.venv/bin/python -m pytest tests/test_performance_benchmarks.py -q -s
+```
 
-### 📊 Performance Targets
+To run every Python test in the repo, including the benchmark-marked tests:
 
-| Operation | Input Size | Target Performance | Baseline |
-|-----------|------------|-------------------|----------|
-| **Resize** | 5120×5120 → 1024×1024 | Match OpenCV bilinear | `cv2.resize()` |
-| **Center Crop** | 5120×5120 → 2048×2048 | Match/exceed NumPy | Array slicing |
-| **Luminance** | 1024×1024 | 1.5x+ faster than NumPy | Vectorized math |
-| **Full Pipeline** | 5120×5120 → 1024×1024 | >0.5 images/sec | Combined ops |
+```bash
+.venv/bin/python -m pytest -q
+```
 
-### 🎯 Quality Targets
+For a fresh source build before measuring:
 
-- **Resize Quality**: PSNR >30dB vs OpenCV (excellent similarity)
-- **Crop Accuracy**: Bit-exact match with NumPy center crop
-- **Luminance Precision**: <0.1 difference vs NumPy reference
+```bash
+env -u OPENCV_LINK_LIBS -u OPENCV_LINK_PATHS -u OPENCV_INCLUDE_PATHS \
+    -u LIBCLANG_PATH -u LLVM_CONFIG_PATH \
+    .venv/bin/maturin develop --release
+```
 
-## Running Benchmarks
+The OpenCV Rust binding needs a discoverable OpenCV and Clang installation. On this development host, stale macOS-style OpenCV and LLVM environment variables had to be unset before the build could probe the system OpenCV installation.
 
-### Local Development
-```bash
-# Install dependencies
-pip install opencv-python pytest-benchmark psutil
+## Current Local Snapshot
 
-# Build with optimizations
-maturin develop --release --features "python-bindings,simd"
+Last measured command:
 
-# Run competitive benchmarks
-./scripts/run_competitive_benchmarks.sh
+```bash
+.venv/bin/python -m pytest tests/test_performance_benchmarks.py -q -s
 ```
 
-### CI/CD Integration
+Environment:
 
-Benchmarks run automatically in CI for:
-- **Pull requests**: Performance regression detection
-- **Main branch**: Performance tracking over time
-- **Weekly schedule**: Long-term performance monitoring
+- Linux x86_64
+- CPython 3.13
+- NumPy 2.3.4
+- System OpenCV 4.11 via the Rust `opencv` crate
+- Release build installed with `maturin develop --release`
 
-## Benchmark Architecture
+Point-in-time scenario timings from the benchmark output:
 
-### Memory Efficiency
-- Monitors RSS memory usage vs OpenCV
-- Tests batch processing memory scaling
-- Validates no memory leaks in pipelines
+| Scenario | Before optimization | After optimization | Comparison after optimization |
+|----------|---------------------|--------------------|-------------------------------|
+| Crop batch, 16 images | 22.9 ms | 0.4 ms | NumPy slicing was still faster because it returns views |
+| Mixed-shape crop, 8 images | 50.2 ms | 3.3 ms | NumPy slicing loop was near-zero because it returns views |
+| Resize, 4 mixed-size images | 4.1 ms | 0.4 ms | OpenCV loop: 2.6 ms |
+| Luminance, 4 mixed-size images | 10.4 ms | 0.6 ms | OpenCV loop: 0.9 ms |
+| Resize + luminance pipeline, 4 images | 5.9 ms | 0.6 ms | OpenCV loop: 2.1 ms |
+| Mixed-shape luminance, 6 images | 78.3 ms | 3.3 ms | NumPy loop: 19.4 ms |
 
-### SIMD Optimization Validation
-- Compares SIMD-enabled vs scalar fallback performance
-- Tests x86-64 AVX2/AVX-512 and ARM64 NEON paths
-- Validates CPU feature detection accuracy
+Pytest-benchmark means from the same focused run:
 
-### Real-World Scenarios
-- **SFT Data Processing**: High-res → training resolution pipeline
-- **Batch Processing**: Multiple images with different operations
-- **Memory Constraints**: Large images with limited RAM
+| Benchmark | Mean |
+|-----------|------|
+| Center crop | 55.2 us |
+| Resize operations | 353.1 us |
+| Luminance calculation | 417.2 us |
+| Crop operations | 583.8 us |
+| Pipeline | 3.44 ms |
+| Video processing | 2.85 ms |
 
-## Performance Philosophy
+A full `pytest -q` run also passed and produced similar benchmark ordering, with normal run-to-run variance.
 
-### Why These Benchmarks Matter
+## What Changed in the Latest Optimization
 
-1. **Real-World Relevance**: SFT workloads use 5120×5120+ images, not toy 224×224
-2. **Competitive Pressure**: OpenCV and NumPy are highly optimized incumbents
-3. **User Experience**: Poor performance = adoption barriers
-4. **Resource Efficiency**: Training infrastructure costs scale with throughput
+- Owned Rust `ndarray` outputs are transferred into NumPy with `from_owned_array_bound`, avoiding an additional copy in Python-facing result conversion.
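+- Contiguous luminance inputs use a channel-sum fast path. Instead of computing weighted luminance per pixel, it sums R, G, and B separately and applies the weights once at the end. Because the weighting is linear, this yields the same per-image value up to floating-point rounding.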
+- Non-contiguous arrays still use the general ndarray path for correctness.
 
-### Performance vs Quality Tradeoffs
+## Benchmark Categories
+
+### Image Operations
+
+- `batch_crop_images`
+- `batch_center_crop_images`
+- `batch_random_crop_images`
+- `batch_resize_images`
+- `batch_calculate_luminance`
+
+### Pipeline Operations
+
+- resize followed by luminance
+- crop followed by resize
+- mixed input sizes and output sizes
+
+### Video Operations
 
-- **Resize**: Bilinear interpolation for speed, good quality balance
-- **SIMD**: Aggressive optimization while maintaining numerical accuracy
-- **Memory**: Batch processing for throughput vs memory pressure balance
+- `batch_resize_videos` with frame batches shaped `(T, H, W, 3)`
 
 ## Interpreting Results
 
-### Good Performance Indicators
-- ✅ Resize: 1-2 images/sec for 5120×5120 → 1024×1024
-- ✅ Crop: 10+ images/sec for 5120×5120 → 2048×2048
-- ✅ Luminance: 1.5x+ faster than NumPy with SIMD
-- ✅ Pipeline: >0.5 complete transformations/sec
+Use these benchmarks to answer practical questions:
+
+- Is a change adding extra Rust-to-NumPy copies?
+- Are contiguous arrays staying on the fast path?
+- Is resize dominated by OpenCV work or Python binding overhead?
+- Does a mixed-shape batch still behave reasonably?
+- Is a video processing change accidentally introducing per-frame Python overhead?
+
+Some comparisons need context:
+
+- NumPy crop by basic slicing returns a view rather than a copy, so it can be much faster than any function that returns owned cropped arrays.
+- Very small images can be dominated by Python call overhead.
+- Large images can be dominated by memory bandwidth rather than arithmetic.
+- OpenCV performance varies by build options, CPU features, and linked libraries.
+
+## Quality Checks
+
+The tests validate basic output behavior alongside timing:
 
-### Red Flags
-- ❌ Slower than OpenCV resize (indicates poor SIMD utilization)
-- ❌ Slower than NumPy crop (indicates unnecessary overhead)
-- ❌ Memory usage >2x OpenCV (indicates memory leaks/inefficiency)
-- ❌ Quality degradation (PSNR <30dB vs reference)
+- Crop outputs have expected shape and match NumPy slicing where ownership differences do not matter.
+- Resize outputs have expected shape and are close to OpenCV output for the configured interpolation.
+- Luminance stays within a small tolerance of NumPy/OpenCV-style references.
+- Non-contiguous arrays are accepted by safe luminance paths and rejected by strict zero-copy crop/resize paths.
 
-## Future Enhancements
+## Regression Signals
 
-### Planned Improvements
-- GPU acceleration benchmarks (Metal/CUDA)
-- More interpolation methods (bicubic, lanczos)
-- Video processing pipeline benchmarks
-- Multi-threaded batch processing optimization
+Investigate if a change causes:
 
-### Performance Tracking
-- Historical performance database
-- Regression detection and alerting
-- Performance comparison across different hardware configurations
-- Automated performance optimization recommendations
+- Public batch crop to return to multi-millisecond timings for small batches.
+- Luminance on contiguous RGB arrays to lose the channel-sum fast path.
+- Resize benchmarks to add large overhead beyond OpenCV work.
+- Video resizing to scale with per-frame Python object churn.
+- Memory usage to grow unexpectedly for repeated batch calls.
 
----
+## Future Benchmark Work
 
-**Goal**: Be the fastest, highest-quality image processing library for ML/SFT workloads while maintaining competitive memory usage and numerical accuracy.
+- Store historical benchmark results by commit and host.
+- Add explicit memory allocation tracking for Python-facing APIs.
+- Separate view-returning crop comparisons from owned-output crop comparisons.
+- Add more video pipeline benchmarks.
+- Document hardware and OpenCV build details in benchmark artifacts.
diff --git a/README.md b/README.md
index 80de67b..da285c0 100644
--- a/README.md
+++ b/README.md
@@ -4,102 +4,61 @@
 [![PyPI](https://img.shields.io/pypi/v/trainingsample.svg)](https://pypi.org/project/trainingsample/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
-**🏆 Industry-Leading Computer Vision Library - FASTER than cv2**
+TrainingSample provides Rust-backed Python bindings for common image and video preprocessing operations used in ML data pipelines. It combines OpenCV-backed resizing with Rust implementations for batching, cropping, luminance calculation, format conversion, and video helpers.
 
-The only Python library that **beats opencv-python (cv2) performance** by leveraging OpenCV's C++ power with zero-copy Rust optimizations and intelligent auto-batching.
+The project is designed for workloads where Python-side loops and repeated Python-to-native boundary crossings become a measurable cost. It is not a blanket replacement for all of `cv2`, and performance depends on image size, batch shape, CPU, OpenCV build, and memory bandwidth.
 
 ## install
 
 ```bash
-# python (recommended)
+# python
 pip install trainingsample
 
 # rust
 cargo add trainingsample
 ```
 
-## 🚀 Why TrainingSample Leads the Industry
-
-**BREAKTHROUGH: We leverage OpenCV's C++ power to beat opencv-python (cv2) by eliminating Python binding overhead.**
-
-### ⚡ Performance That Redefines Possible
-- **Single images**: **1.12x FASTER** than `cv2.resize()` - the "impossible" achievement
-- **Batch processing**: **2.4x faster** than OpenCV individual calls
-- **Zero-copy iteration**: True lazy conversion with **17,204 images/sec** throughput
-- **Intelligent dispatch**: Seamless auto-batching with zero wrapper overhead
-
-### 🔥 What Makes Us Different
-- **Leverages OpenCV C++**: Direct OpenCV C++ access to beat opencv-python binding overhead
-- **Zero wrapper overhead**: Eliminated 76% of artificial performance losses in Python bindings
-- **True zero-copy**: Raw OpenCV Mat → numpy array, no intermediate conversions
-- **Intelligent API**: Same function handles single images + batch processing seamlessly
-- **Buffer pooling**: Memory reuse across operations eliminates allocation bottlenecks
-- **Adaptive threading**: Sequential for small batches, parallel for large batches
-
-**We unleash OpenCV's full C++ power without Python binding limitations.**
-
-## 🎯 Ultimate Performance APIs
+## python usage
 
 ```python
 import numpy as np
 import trainingsample as tsr
 
-# SINGLE IMAGE - FASTER than cv2.resize()!
-img = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
-result = tsr.batch_resize_images_zero_copy(img, (256, 256))  # 1.12x FASTER than OpenCV!
+images = [
+    np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)  # high is exclusive
+    for _ in range(8)
+]
 
-# BATCH PROCESSING - 2.4x faster than OpenCV individual calls
-images = [np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) for _ in range(10)]
-results = tsr.batch_resize_images_zero_copy(images, [(256, 256)] * 10)
+crop_boxes = [(50, 50, 200, 200)] * len(images)
+cropped = tsr.batch_crop_images(images, crop_boxes)
 
-# MEMORY-EFFICIENT ITERATION - True zero-copy lazy conversion
-for result in tsr.batch_resize_images_iterator(images, [(256, 256)] * 10):
-    process(result)  # Convert only when accessed, supports early termination
+target_sizes = [(224, 224)] * len(images)
+resized = tsr.batch_resize_images(images, target_sizes)
 
-# ZERO-COPY BATCH OPERATIONS
-cropped = tsr.batch_crop_images_zero_copy(images, [(50, 50, 200, 200)] * 10)      # 4x faster
-luminances = tsr.batch_calculate_luminance_zero_copy(images)                      # 8x faster
-center_cropped = tsr.batch_center_crop_images_zero_copy(images, [(224, 224)] * 10) # 3x faster
+luminances = tsr.batch_calculate_luminance(resized)
 ```
 
-### 📊 Performance Comparison
-```python
-import time
-import cv2
+OpenCV-compatible helpers are also exported for common operations:
 
-# Single image resize comparison
-img = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
-
-# OpenCV (industry standard)
-start = time.perf_counter()
-cv2_result = cv2.resize(img, (256, 256))
-opencv_time = time.perf_counter() - start
-
-# TrainingSample (industry leader)
-start = time.perf_counter()
-tsr_result = tsr.batch_resize_images_zero_copy(img, (256, 256))
-tsr_time = time.perf_counter() - start
-
-print(f"OpenCV: {opencv_time*1000:.3f}ms")
-print(f"TSR:    {tsr_time*1000:.3f}ms")
-print(f"TSR is {opencv_time/tsr_time:.2f}x FASTER!")  # Typical: 1.12x faster
+```python
+decoded = tsr.imdecode(image_bytes, tsr.IMREAD_COLOR)  # image_bytes: encoded file contents (bytes)
+gray = tsr.cvt_color(decoded, tsr.COLOR_RGB2GRAY)
+edges = tsr.canny(decoded, threshold1=50, threshold2=150)
+resized = tsr.resize(decoded, (224, 224), interpolation=tsr.INTER_LINEAR)
 ```
 
 ## rust usage
 
 ```rust
+use ndarray::Array3;
 use trainingsample::{
-    batch_crop_image_arrays, batch_resize_image_arrays,
-    batch_calculate_luminance_arrays
+    batch_calculate_luminance_arrays, batch_crop_image_arrays, batch_resize_image_arrays,
 };
-use ndarray::Array3;
 
-// create some test data
 let images: Vec<Array3<u8>> = (0..10)
     .map(|_| Array3::zeros((480, 640, 3)))
     .collect();
 
-// batch operations
 let crop_boxes = vec![(50, 50, 200, 200); 10]; // (x, y, width, height)
 let cropped = batch_crop_image_arrays(&images, &crop_boxes);
 
@@ -111,181 +70,111 @@ let luminances = batch_calculate_luminance_arrays(&images);
 
 ## api reference
 
-### python functions
-
-#### `batch_crop_images(images, crop_boxes)`
+### `batch_crop_images(images, crop_boxes)`
 
-- `images`: list of numpy arrays (H, W, 3) uint8
-- `crop_boxes`: list of (x, y, width, height) tuples
-- returns: list of cropped numpy arrays
-- **implementation**: TSR-optimized for mixed-shape batching
+- `images`: list of NumPy arrays shaped `(H, W, C)` with `uint8` data
+- `crop_boxes`: list of `(x, y, width, height)` tuples
+- returns: list of cropped NumPy arrays
+- notes: the owned Rust output array is handed to NumPy without an extra copy
 
-#### `batch_center_crop_images(images, target_sizes)`
+### `batch_center_crop_images(images, target_sizes)`
 
-- `images`: list of numpy arrays (H, W, 3) uint8
-- `target_sizes`: list of (width, height) tuples
-- returns: list of center-cropped numpy arrays
-- **implementation**: TSR-optimized for mixed-shape batching
+- `images`: list of NumPy arrays shaped `(H, W, C)` with `uint8` data
+- `target_sizes`: list of `(width, height)` tuples
+- returns: list of center-cropped NumPy arrays
 
-#### `batch_random_crop_images(images, target_sizes)`
+### `batch_random_crop_images(images, target_sizes)`
 
-- `images`: list of numpy arrays (H, W, 3) uint8
-- `target_sizes`: list of (width, height) tuples
-- returns: list of randomly cropped numpy arrays
-- **implementation**: TSR-optimized for mixed-shape batching
+- `images`: list of NumPy arrays shaped `(H, W, C)` with `uint8` data
+- `target_sizes`: list of `(width, height)` tuples
+- returns: list of randomly cropped NumPy arrays
 
-#### `batch_resize_images(images, target_sizes)`
+### `batch_resize_images(images, target_sizes)`
 
-- `images`: list of numpy arrays (H, W, 3) uint8
-- `target_sizes`: list of (width, height) tuples
-- returns: list of resized numpy arrays
-- **implementation**: OpenCV for optimal performance
+- `images`: list of NumPy arrays shaped `(H, W, 3)` with `uint8` data
+- `target_sizes`: list of `(width, height)` tuples
+- returns: list of resized NumPy arrays
+- implementation: OpenCV-backed resize with Rust/PyO3 conversion handling
 
-#### `batch_calculate_luminance(images)`
+### `batch_calculate_luminance(images)`
 
-- `images`: list of numpy arrays (H, W, 3) uint8
+- `images`: list of NumPy arrays shaped `(H, W, C)` with `uint8` data
 - returns: list of float luminance values
-- **implementation**: TSR SIMD-optimized (10-35x faster than NumPy)
-
-#### `batch_resize_videos(videos, target_sizes)`
-
-- `videos`: list of numpy arrays (T, H, W, 3) uint8
-- `target_sizes`: list of (width, height) tuples
-- returns: list of resized video numpy arrays
-
-### rust functions
-
-same signatures but with `ndarray::Array3<u8>` and `ndarray::Array4<u8>` instead of numpy arrays. check the docs for details.
-
-## architecture
-
-TSR uses a **best-of-breed hybrid approach** for optimal performance:
-
-### operation selection
-
-- **cropping operations**: TSR implementation
-  - mixed-shape batching (8 different input shapes → 7 different output shapes)
-  - single API call: `tsr.batch_crop_images(mixed_images, mixed_crops)`
-  - vs competitor: individual loops required for each shape combination
+- notes: contiguous RGB/RGBA-like arrays use a channel-sum fast path; strided arrays fall back to the general ndarray path
 
-- **luminance calculation**: TSR SIMD implementation
-  - **18x faster** than NumPy for mixed-shape batches
-  - **35x faster** than NumPy for uniform batches
-  - vectorized across different image sizes in single batch call
+### `batch_resize_videos(videos, target_sizes)`
 
-- **resize operations**: OpenCV implementation
-  - industry-standard performance and quality
-  - highly optimized C++ implementations
-  - **7-25x faster** than TSR resize implementations
+- `videos`: list of NumPy arrays shaped `(T, H, W, 3)` with `uint8` data
+- `target_sizes`: list of `(width, height)` tuples
+- returns: list of resized video NumPy arrays
 
-### static wheel distribution
+## current benchmark snapshot
 
-- OpenCV **statically linked** into wheel (no external dependencies)
-- single `pip install trainingsample` - no opencv-python conflicts
-- consistent performance across platforms
-- ~50MB wheel includes all optimizations
+These numbers are from the local benchmark run after the latest Python-interface optimizations:
 
-## features
-
-- **hybrid architecture**: best implementation for each operation
-- parallel processing with rayon (actually uses your cores)
-- zero-copy numpy integration via rust-numpy
-- proper error handling (no silent failures)
-- **static OpenCV** bundled (no external dependencies)
-- no python threading nonsense, GIL is released
-- memory efficient batch operations
-- supports both images and videos
-
-## 🏆 Industry-Leading Performance
-
-**BREAKTHROUGH ACHIEVEMENT: First library to beat cv2 by eliminating Python binding overhead while leveraging OpenCV's full C++ power**
-
-### 🥇 vs. opencv-python (cv2)
-
-| Operation | cv2 (opencv-python) | TSR (OpenCV+Rust) | TSR Speedup | Achievement |
-|-----------|---------------------|-------------------|-------------|-------------|
-| **Single Resize** | 0.134ms | **0.120ms** | **1.12x FASTER** | 🏆 **Beats cv2 bindings** |
-| **Batch Resize (8)** | 1.10ms | **0.47ms** | **2.4x FASTER** | 🏆 **Leverages OpenCV C++** |
-| **Crop Operations** | 1.40ms | **0.34ms** | **4.1x FASTER** | 🏆 **Zero-copy optimization** |
-| **Luminance Calc** | 4.38ms | **0.55ms** | **8.0x FASTER** | 🏆 **SIMD + OpenCV power** |
-
-### 🚀 Peak Performance Numbers
-- **17,204 images/sec** - Batch resize throughput
-- **Zero wrapper overhead** - Eliminated 76% of artificial performance losses
-- **True zero-copy** - Raw pointer → numpy conversion on-demand
-- **Intelligent dispatch** - Same API for single + batch with optimal performance
-
-### 🎯 Real-World Advantages
-
-#### How We Achieve This
-1. **Direct OpenCV C++**: Bypass cv2's Python binding overhead entirely
-2. **Zero artificial overhead**: Direct Mat headers, no intermediate conversions
-3. **Buffer pooling**: Memory reuse eliminates allocation bottlenecks that plague Python bindings
-4. **Adaptive threading**: Smart parallelization leveraging Rust's superior threading
-5. **Intelligent API**: Seamless auto-batching with optimal performance dispatch
-
-#### Industry Impact
-- **Computer Vision**: First library to beat cv2 by leveraging OpenCV's full C++ power
-- **Machine Learning**: Faster preprocessing = faster training pipelines
-- **Real-time Applications**: Sub-millisecond image processing capabilities
-- **Memory Efficiency**: True zero-copy iteration for large datasets
-
-**Bottom Line**: We leverage OpenCV's C++ excellence to eliminate the performance bottlenecks in Python bindings.
-
-## Apple Silicon Performance (M3 Max)
-
-Optimized SIMD implementations with concrete benchmarks:
-
-| Operation | Algorithm | Implementation | Speedup | Performance |
-|-----------|-----------|----------------|---------|-------------|
-| **Image Resize** | Bilinear | Multi-core NEON | **10.2x** | 1,412 MPx/s |
-| **Image Resize** | Lanczos4 | Metal GPU | **11.8x** | 112 MPx/s |
-| **Format Conversion** | RGB→RGBA | Portable SIMD | **4.4x** | 1,500 MPx/s |
-| **Format Conversion** | RGBA→RGB | Portable SIMD | **2.6x** | 1,651 MPx/s |
-| **Luminance Calc** | RGB→Y | NEON SIMD | **4.7x** | 545 images/sec |
+```bash
+.venv/bin/python -m pytest tests/test_performance_benchmarks.py -q -s
+```
 
-**Key Insights:**
+Environment: Linux x86_64, CPython 3.13, NumPy 2.3.4, system OpenCV 4.11 through the Rust `opencv` crate. Treat these as a point-in-time reference, not a cross-machine guarantee.
 
-- **CPU SIMD** (multi-core NEON) optimal for memory-bound operations like bilinear resize
-- **GPU Metal** dominates compute-intensive algorithms like Lanczos4 interpolation
-- **Unified memory** architecture enables zero-copy GPU operations
-- **Automatic selection** between CPU/GPU based on algorithm characteristics
+| Benchmark | Before | After | Notes |
+|-----------|--------|-------|-------|
+| Crop batch, 16 images | 22.9 ms | 0.4 ms | Public `batch_crop_images` path |
+| Mixed-shape crop, 8 images | 50.2 ms | 3.3 ms | Mixed input and output sizes |
+| Luminance batch, 4 mixed images | 10.4 ms | 0.6 ms | Now faster than the OpenCV comparison in this run |
+| Mixed-shape luminance, 6 images | 78.3 ms | 3.3 ms | NumPy comparison was 19.4 ms in this run |
+| Complete resize + luminance pipeline | 5.9 ms | 0.6 ms | Four mixed-size inputs to 224x224 |
 
-Tested on Apple Silicon M3 Max (12 P-cores, 38-core GPU, 400 GB/s unified memory).
+Pytest-benchmark means from the same suite:
 
-## why this hybrid approach
+| Benchmark | Mean after |
+|-----------|------------|
+| Center crop | 55.2 us |
+| Resize operations | 353.1 us |
+| Luminance calculation | 417.2 us |
+| Crop operations | 583.8 us |
+| Pipeline | 3.44 ms |
+| Video processing | 2.85 ms |
 
-### vs pure opencv/pil
+## architecture
 
-- **OpenCV alone**: excellent resize performance, but poor mixed-shape batching
-- **PIL**: slow, GIL-bound, no batch operations
-- **TSR hybrid**: combines OpenCV's resize speed with TSR's batch/SIMD advantages
+TrainingSample uses different implementations for different operation types:
 
-### vs pure rust implementations
+- Cropping: Rust/ndarray implementation with owned-array transfer into NumPy.
+- Luminance: Rust channel-sum fast path for contiguous arrays, with a general ndarray fallback for non-contiguous inputs.
+- Resize: OpenCV-backed implementation for image quality and mature interpolation behavior.
+- Video resize: OpenCV-backed frame resizing with batched Python binding output.
+- Format conversion: Rust SIMD implementation where the `simd` feature is enabled.
 
-- **TSR resize**: slower than OpenCV's highly-optimized C++ (7-25x difference)
-- **TSR luminance**: faster than NumPy due to SIMD (18-35x speedup)
-- **best of both**: use optimal implementation for each operation
+The optimized path generally requires contiguous `uint8` arrays. Views such as `image[:, ::2, :]` remain supported by safe public APIs, but they may use slower fallback paths.
 
-### static distribution advantage
+## features
 
-- **no dependency conflicts**: opencv-python version compatibility issues eliminated
-- **consistent performance**: same optimized OpenCV across all platforms
-- **simple deployment**: single wheel, no system dependencies
+- Python bindings through PyO3 and rust-numpy
+- Batch APIs for images and videos
+- OpenCV-compatible constants and helper functions for common operations
+- Optional SIMD feature for format conversion and selected numeric paths
+- Error handling for invalid dimensions, unsupported channels, and invalid crop bounds
+- Source build support for dynamic or static OpenCV configurations
 
 ## building from source
 
 ```bash
-# for python
 pip install maturin
 maturin develop --release
+```
 
-# for rust
-cargo build --release
+The OpenCV Rust bindings need to find a working OpenCV and Clang installation. If the environment has stale OpenCV or LLVM variables, unset them before building:
+
+```bash
+env -u OPENCV_LINK_LIBS -u OPENCV_LINK_PATHS -u OPENCV_INCLUDE_PATHS \
+    -u LIBCLANG_PATH -u LLVM_CONFIG_PATH \
+    maturin develop --release
 ```
 
-requires rust 1.70+ and python 3.11+ if you want the python bindings.
+See [docs/BUILDING_STATIC_OPENCV.md](docs/BUILDING_STATIC_OPENCV.md) for static OpenCV bundle notes.
 
 ## license
 
-MIT. do whatever you want with it, leave attribution in-tact.
+MIT. See [LICENSE](LICENSE).
diff --git a/docs/API_COMPAT_CV2.md b/docs/API_COMPAT_CV2.md
index b7fff75..04d2037 100644
--- a/docs/API_COMPAT_CV2.md
+++ b/docs/API_COMPAT_CV2.md
@@ -1,217 +1,143 @@
-# OpenCV (cv2) API Compatibility Guide
+# OpenCV API Compatibility Guide
 
-TrainingSample provides drop-in replacements for common OpenCV operations with significant performance improvements through Rust optimizations and true batch processing.
+TrainingSample exposes a subset of OpenCV-style image APIs plus batch-oriented helpers. The goal is to reduce Python loop overhead for common preprocessing workloads, not to implement the full `cv2` surface.
 
-## Quick Start: Drop-in Replacement
-
-Replace `cv2` imports with `trainingsample` for instant performance gains:
+## Quick Start
 
 ```python
-# old cv2 approach
 import cv2
 import numpy as np
-
-# new high-performance approach
 import trainingsample as tsr
 ```
 
-## 🏆 Zero-Copy Operations (Industry-Leading Performance)
-
-**BREAKTHROUGH ACHIEVEMENT: We leverage OpenCV's C++ power to BEAT opencv-python (cv2) while providing record-breaking batch processing!**
+Use TrainingSample where a matching helper exists:
 
-### Single Image Resizing (Faster than cv2!)
 ```python
-# 1.12x FASTER than cv2.resize() - leveraging OpenCV C++ without binding overhead
-result = tsr.batch_resize_images_zero_copy(
-    img,         # np.ndarray - single image
-    target_size, # (width, height) - target dimensions
-    interpolation=tsr.INTER_LINEAR  # Optional: INTER_NEAREST, INTER_LINEAR (default), INTER_CUBIC, INTER_LANCZOS4
-)
-# Direct numpy array return, zero wrapper overhead, intelligent dispatch
+resized = tsr.resize(image, (224, 224), interpolation=tsr.INTER_LINEAR)
+gray = tsr.cvt_color(image, tsr.COLOR_RGB2GRAY)
+edges = tsr.canny(image, threshold1=50, threshold2=150)
 ```
 
-### Batch Resizing (Multiple APIs for Different Use Cases)
-```python
-# BATCH LIST API: 2.4x faster than OpenCV individual calls
-results = tsr.batch_resize_images_zero_copy(
-    images,      # List[np.ndarray] - batch of images
-    target_sizes, # List[(width, height)] - target dimensions
-    interpolation=tsr.INTER_LINEAR  # Optional: choose interpolation method
-)
-# Returns: List[np.ndarray] - perfect for immediate processing
-
-# ITERATOR API: True zero-copy with lazy conversion (memory efficient)
-for result in tsr.batch_resize_images_iterator(images, target_sizes, interpolation=tsr.INTER_CUBIC):
-    process(result)  # Convert only when accessed, supports early termination
-# 2.3x faster than OpenCV, minimal memory footprint
-```
+For batches, prefer the batch APIs instead of a Python loop:
 
-#### 🎯 Interpolation Methods
 ```python
-# Available interpolation constants (OpenCV-compatible):
-tsr.INTER_NEAREST   # Fast, blocky - good for masks/labels
-tsr.INTER_LINEAR    # Default - good balance of speed and quality
-tsr.INTER_CUBIC     # High quality, slower - best for upsampling
-tsr.INTER_LANCZOS4  # Best quality, slowest - professional upsampling
-
-# Usage examples:
-fast_resize = tsr.batch_resize_images_zero_copy(images, sizes, tsr.INTER_NEAREST)
-quality_resize = tsr.batch_resize_images_zero_copy(images, sizes, tsr.INTER_LANCZOS4)
-
-# Performance vs Quality Trade-offs:
-# INTER_NEAREST:  ~4x faster than LANCZOS4, acceptable for downsampling
-# INTER_LINEAR:   ~2x faster than LANCZOS4, good general purpose (default)
-# INTER_CUBIC:    ~1.5x faster than LANCZOS4, good for upsampling
-# INTER_LANCZOS4: Best quality, use for professional image processing
+images = [load_image(path) for path in paths]  # load_image and paths stand in for your own loader and file list
+sizes = [(224, 224)] * len(images)
+
+resized = tsr.batch_resize_images(images, sizes)
+luminances = tsr.batch_calculate_luminance(resized)
 ```
 
-### Batch Cropping (Zero-Copy)
+## Supported OpenCV-Style Operations
+
+### Image Decoding
+
 ```python
-# 4-5x faster than regular batch operations
-cropped = tsr.batch_crop_images_zero_copy(
-    images,  # List[np.ndarray] - batch of images
-    crop_boxes  # List[(x, y, width, height)] - crop coordinates
-)
-
-# Center cropping with zero-copy optimization
-center_cropped = tsr.batch_center_crop_images_zero_copy(
-    images,  # List[np.ndarray]
-    target_sizes  # List[(width, height)]
-)
+with open("image.jpg", "rb") as f:
+    img_bytes = f.read()
+
+img = tsr.imdecode(img_bytes, tsr.IMREAD_COLOR)
+img_gray = tsr.imdecode(img_bytes, tsr.IMREAD_GRAYSCALE)
 ```
 
-### Batch Luminance (Zero-Copy + Parallel)
+### Color Space Conversion
+
 ```python
-# 5-8x faster with parallel processing + adaptive SIMD
-luminances = tsr.batch_calculate_luminance_zero_copy(images)
-# Returns: List[float] - ITU-R BT.709 luminance values (0-255 range)
+gray = tsr.cvt_color(image, tsr.COLOR_RGB2GRAY)
+bgr = tsr.cvt_color(image, tsr.COLOR_RGB2BGR)
 ```
 
-## 📊 Standard Batch Operations
-
-High-performance batch processing for common operations:
+### Edge Detection
 
-### Image Loading
 ```python
-# Parallel image loading from file paths
-images = tsr.load_image_batch([
-    'path/to/image1.jpg',
-    'path/to/image2.png',
-    'path/to/image3.webp'
-])
+edges = tsr.canny(image, threshold1=50, threshold2=150)
 ```
 
-### Batch Cropping
+### Image Resizing
+
 ```python
-# Regular batch cropping (still faster than individual cv2 calls)
-images = tsr.batch_crop_images(images, crop_boxes)
-center_cropped = tsr.batch_center_crop_images(images, target_sizes)
-random_cropped = tsr.batch_random_crop_images(images, target_sizes)
+resized = tsr.resize(image, (width, height), interpolation=tsr.INTER_LINEAR)
 ```
 
-### Batch Resizing (Zero-Copy)
+Supported interpolation constants:
+
 ```python
-# Ultra-fast zero-copy batch resizing (8+ images for optimal performance)
-resized = tsr.batch_resize_images_zero_copy(
-    images,  # List[np.ndarray] - batch of images
-    target_sizes  # List[(width, height)] - target dimensions
-)
-# 2.4x faster than OpenCV individual calls at 64 images
-# 16,306 images/sec throughput with parallel processing
+tsr.INTER_NEAREST
+tsr.INTER_LINEAR
+tsr.INTER_CUBIC
+tsr.INTER_LANCZOS4
 ```
 
-### Standard Batch Resizing
+## Batch Operations
+
+### Cropping
+
 ```python
-# High-performance batch resizing
-resized = tsr.batch_resize_images(
-    images,
-    target_sizes,  # List[(width, height)]
-    interpolation="bilinear"  # or "lanczos"
-)
-
-# Video frame batch processing
-video_frames = tsr.batch_resize_videos(videos, target_sizes)
+crop_boxes = [(0, 0, 224, 224)] * len(images)  # one (x, y, width, height) box per image; each box must fit inside its image
+cropped = tsr.batch_crop_images(images, crop_boxes)
 ```
 
-### Batch Luminance Calculation
+Center and random crop helpers use target sizes:
+
 ```python
-# Calculate ITU-R BT.709 luminance for batch of images
-luminances = tsr.batch_calculate_luminance(images)
-# Formula: L = 0.2126*R + 0.7152*G + 0.0722*B
+target_sizes = [(224, 224)] * len(images)
+center_cropped = tsr.batch_center_crop_images(images, target_sizes)
+random_cropped = tsr.batch_random_crop_images(images, target_sizes)
 ```
 
-## 🎨 Format Conversion (Ultra-Fast)
-
-Sub-millisecond format conversions with SIMD optimization:
+### Resizing
 
 ```python
-# RGB to RGBA conversion (add alpha channel)
-rgba_image, timing = tsr.rgb_to_rgba_optimized(rgb_image, alpha=255)
-
-# RGBA to RGB conversion (remove alpha channel)
-rgb_image, timing = tsr.rgba_to_rgb_optimized(rgba_image)
+target_sizes = [(224, 224)] * len(images)
+resized = tsr.batch_resize_images(images, target_sizes)
 ```
 
-## 🔧 OpenCV-Compatible Individual Operations
+The public resize API returns a list of owned NumPy arrays. The current implementation uses an OpenCV-backed resize internally and transfers ownership of the resulting Rust arrays to NumPy without an additional copy.
 
-Drop-in replacements for common cv2 functions:
+### Luminance
 
-### Image Decoding
 ```python
-# Equivalent to cv2.imdecode()
-import trainingsample as tsr
+luminances = tsr.batch_calculate_luminance(images)
+```
 
-# Read image bytes
-with open('image.jpg', 'rb') as f:
-    img_bytes = f.read()
+For contiguous arrays, luminance uses a channel-sum fast path. Non-contiguous arrays are accepted by the safe public API but may run through a slower ndarray fallback.
 
-# Decode with OpenCV-compatible flags
-img = tsr.imdecode(img_bytes, tsr.IMREAD_COLOR)
-img_gray = tsr.imdecode(img_bytes, tsr.IMREAD_GRAYSCALE)
-```
+### Video Resizing
 
-### Color Space Conversion
 ```python
-# Equivalent to cv2.cvtColor()
-gray = tsr.cvt_color(image, tsr.COLOR_RGB2GRAY)
-bgr = tsr.cvt_color(image, tsr.COLOR_RGB2BGR)
+videos = [video_array]  # shape: (frames, height, width, 3)
+target_sizes = [(224, 224)]
+resized_videos = tsr.batch_resize_videos(videos, target_sizes)
 ```
 
-### Edge Detection
-```python
-# Equivalent to cv2.Canny()
-edges = tsr.canny(image, threshold1=50, threshold2=150)
-```
+## Zero-Copy Entry Points
+
+Some lower-level APIs expose stricter zero-copy behavior:
 
-### Image Resizing
 ```python
-# Equivalent to cv2.resize()
-resized = tsr.resize(image, (width, height), interpolation=tsr.INTER_LINEAR)
+cropped = tsr.batch_crop_images_zero_copy(images, crop_boxes)
+luminances = tsr.batch_calculate_luminance_zero_copy(images)
+resized = tsr.batch_resize_images_zero_copy(images, target_sizes)
 ```
 
-## 📹 Video Processing
+These functions are intended for contiguous arrays. Unsafe zero-copy crop and resize paths reject non-contiguous views with a `ValueError`.
 
-OpenCV-compatible video capture and writing:
+## Video Capture and Writing
 
-### Video Capture
 ```python
-# Equivalent to cv2.VideoCapture
-cap = tsr.VideoCapture('video.mp4')
+cap = tsr.VideoCapture("video.mp4")
 
 if cap.is_opened():
     ret, frame = cap.read()
     if ret:
-        # Process frame
-        processed = tsr.batch_calculate_luminance([frame])
+        luminance = tsr.batch_calculate_luminance([frame])
 
 cap.release()
 ```
 
-### Video Writing
 ```python
-# Equivalent to cv2.VideoWriter
-fourcc = tsr.fourcc('M', 'J', 'P', 'G')
-writer = tsr.VideoWriter('output.avi', fourcc, 30.0, (width, height))
+fourcc = tsr.fourcc("M", "J", "P", "G")
+writer = tsr.VideoWriter("output.avi", fourcc, 30.0, (width, height))
 
 for frame in frames:
     writer.write(frame)
@@ -219,139 +145,72 @@ for frame in frames:
 writer.release()
 ```
 
-## 🔍 Object Detection
+## Object Detection
 
 ```python
-# Equivalent to cv2.CascadeClassifier
-classifier = tsr.CascadeClassifier('haarcascade_frontalface_alt.xml')
+classifier = tsr.CascadeClassifier("haarcascade_frontalface_alt.xml")
 faces = classifier.detect_multi_scale(image)
 ```
 
-## ⚡ Performance Comparison
-
-| Operation | cv2 Individual | TSR Batch | TSR Zero-Copy | TSR Iterator | Best Speedup |
-|-----------|---------------|-----------|---------------|--------------|--------------|
-| **Single Resize** | **0.134ms** | **-** | **0.146ms** | **-** | **1.12x FASTER** 🏆 |
-| Crop | 1.40ms | 1.40ms | 0.34ms | - | **4.1x** 🏆 |
-| Center Crop | 1.59ms | 1.59ms | 0.48ms | - | **3.3x** 🏆 |
-| Luminance | 4.38ms | 4.38ms | 0.55ms | - | **8.0x** 🏆 |
-| **Batch Resize (8)** | **1.10ms** | **0.47ms** | **-** | **0.48ms** | **2.4x** 🏆 |
-| Format Conv | 0.10ms | 0.02ms | 0.01ms | - | **10x** 🏆 |
-
-## 🎯 Best Practices
+## Benchmark Snapshot
 
-### When to Use Zero-Copy Operations
-- **Always use for batch processing** - 3-8x performance gains
-- **Large image datasets** - Memory-efficient with buffer pooling
-- **Real-time applications** - Parallel processing + SIMD acceleration
+The numbers below came from a local run of the benchmark suite, taken after the latest Python-interface optimization work, using this command:
 
-### Migration from OpenCV
-```python
-# SINGLE IMAGE: Drop-in replacement that's actually FASTER
-# before
-result = cv2.resize(img, (256, 256))
-
-# after (1.12x FASTER!)
-result = tsr.batch_resize_images_zero_copy(img, (256, 256))
-
-# BATCH PROCESSING: Massive speedup
-# before (slow)
-results = []
-for img in images:
-    result = cv2.resize(img, (256, 256))
-    results.append(result)
-
-# after (2.4x FASTER!)
-results = tsr.batch_resize_images_zero_copy(images, [(256, 256)] * len(images))
-
-# MEMORY EFFICIENT: Iterator for large batches
-for result in tsr.batch_resize_images_iterator(images, target_sizes):
-    process(result)  # Convert only when needed
+```bash
+.venv/bin/python -m pytest tests/test_performance_benchmarks.py -q -s
 ```
 
-### Memory Efficiency
-```python
-# before (slow - multiple boundary crossings)
-for img in images:
-    gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
-    resized = cv2.resize(gray, target_size)
-    edges = cv2.Canny(resized, 50, 150)
-
-# after (fast - single batch operation)
-grays = tsr.batch_cvt_color(images, tsr.COLOR_RGB2GRAY)
-resized = tsr.batch_resize_images_zero_copy(grays, sizes)
-edges = tsr.batch_canny(resized, threshold1=50, threshold2=150)
-```
+Environment: Linux x86_64, CPython 3.13, NumPy 2.3.4, system OpenCV 4.11 through the Rust `opencv` crate.
 
-## 🚀 Advanced Features
+| Scenario | TrainingSample | Comparison in same run |
+|----------|----------------|------------------------|
+| Batch resize, 4 mixed-size images | 0.4 ms | OpenCV loop: 2.6 ms |
+| Batch luminance, 4 mixed-size images | 0.6 ms | OpenCV loop: 0.9 ms |
+| Resize + luminance pipeline, 4 mixed-size images | 0.6 ms | OpenCV loop: 2.1 ms |
+| Mixed-shape luminance, 6 images | 3.3 ms | NumPy loop: 19.4 ms |
+| Mixed-shape crop, 8 images | 3.3 ms | NumPy slicing loop: near-zero because slicing returns views |
 
-### Adaptive SIMD Processing
-TrainingSample automatically chooses between SIMD and scalar operations based on image size:
-- **Small images (<64K pixels)**: Scalar processing (avoids SIMD overhead)
-- **Large images (>64K pixels)**: SIMD acceleration (AVX2/NEON)
+The crop comparison is intentionally caveated: NumPy slicing can be effectively free when it returns a view. TrainingSample returns owned output arrays, so its timing is the relevant one only when the next stage needs independent contiguous buffers.
 
-### Buffer Pool Management
-Zero-copy operations use intelligent buffer pooling:
-- **Automatic memory reuse** across batch operations
-- **Size-based pooling** for optimal allocation patterns
-- **Thread-safe sharing** for parallel processing
+## Migration Notes
+
+### Prefer Batch APIs for Repeated Work
 
-### Parallel Processing Architecture
 ```python
-# Automatically parallelizes across available CPU cores
-luminances = tsr.batch_calculate_luminance_zero_copy(images)
-# - Extracts raw pointers on main thread
-# - Distributes processing across worker threads
-# - Uses lock-free data structures for maximum throughput
-```
+# OpenCV loop
+results = [cv2.resize(img, (224, 224)) for img in images]
 
-## 🔧 Installation & Setup
+# TrainingSample batch call
+results = tsr.batch_resize_images(images, [(224, 224)] * len(images))
+```
 
-```bash
-pip install trainingsample
+### Keep Inputs Contiguous When Performance Matters
 
-# For maximum performance, ensure you have:
-# - Multi-core CPU (parallel processing)
-# - AVX2 support (x86) or NEON (ARM) for SIMD
+```python
+if not image.flags["C_CONTIGUOUS"]:
+    image = np.ascontiguousarray(image)
 ```
 
-## 📈 Benchmarking Your Workload
+Public safe APIs accept many strided views, but contiguous arrays are usually faster and are required by strict zero-copy paths.
+
+### Validate Your Own Workload
+
+Image shape, interpolation, batch size, and memory bandwidth can change results. Benchmark the exact pipeline you intend to ship:
 
 ```python
 import time
-import trainingsample as tsr
-
-# Benchmark your specific use case
-images = load_your_images()
 
 start = time.perf_counter()
-results = tsr.batch_operation_zero_copy(images, params)
+results = tsr.batch_resize_images(images, sizes)
 duration = time.perf_counter() - start
 
-print(f"Processed {len(images)} images in {duration*1000:.2f}ms")
-print(f"Throughput: {len(images)/duration:.1f} images/sec")
+print(f"{len(images) / duration:.1f} images/sec")
 ```
 
-## 🏆 Summary
-
-TrainingSample provides:
-- **memory efficiency**: reduced Python object overhead in batch operations
-- **computational efficiency**: SIMD vectorization and parallel processing
-- **API compatibility**: drop-in replacement for common cv2 operations
-- **zero-copy semantics**: direct buffer manipulation for maximum performance
-
-**INDUSTRY-LEADING Performance Gains:**
-- **BEATS OpenCV** for single image operations (1.12x faster resize)
-- **2.4x faster** batch processing vs OpenCV individual calls
-- **17,204+ images/sec** batch resize throughput
-- **True zero-copy iteration** with lazy conversion
-- **100% API compatibility** with OpenCV - drop-in replacement
-- **Intelligent auto-batching** - same function handles single + batch
-- **Memory usage reduction** through buffer pooling + lazy conversion
-
-**Limitations:**
-- **memory overhead**: batch processing requires significant RAM for large images
-- **startup cost**: small overhead for very small batches (<5 images)
-- **Python GIL**: some operations still limited by Python's global interpreter lock
-
-For maximum performance gains, use the zero-copy batch operations with mixed-size image datasets on multi-core systems.
+## Limitations
+
+- This is not a complete `cv2` replacement.
+- Batch APIs allocate owned output arrays.
+- Small inputs can be dominated by call overhead.
+- Zero-copy functions require contiguous arrays for crop and resize paths.
+- System OpenCV and wheel build configuration can affect performance and available codecs.
diff --git a/docs/BUILDING_STATIC_OPENCV.md b/docs/BUILDING_STATIC_OPENCV.md
index dd853be..e63e672 100644
--- a/docs/BUILDING_STATIC_OPENCV.md
+++ b/docs/BUILDING_STATIC_OPENCV.md
@@ -2,7 +2,7 @@
 
 The `opencv` crate expects to find an existing OpenCV toolkit and, by default, it
 links against the dynamic libraries that come with a system installation
-(`libopencv_core.dylib`, `libopencv_core.so`, …). To ship the `trainingsample`
+(`libopencv_core.dylib`, `libopencv_core.so`, etc.). To ship the `trainingsample`
 crate without asking end users to install OpenCV themselves, build a static
 OpenCV distribution once and point Cargo at it during compilation.
 
@@ -104,7 +104,7 @@ OpenCV distribution once and point Cargo at it during compilation.
    ```bash
    cp ~/Downloads/opencv-build-static/3rdparty/lib/liblibjpeg-turbo.a third_party/opencv-static/lib/
    ln -sf liblibjpeg-turbo.a third_party/opencv-static/lib/libjpeg.a
-   # Repeat for liblibpng.a→libpng.a, liblibtiff.a→libtiff.a, liblibwebp.a→libwebp.a, libzlib.a→libz.a, liblibjasper.a→libjasper.a
+   # Repeat for liblibpng.a -> libpng.a, liblibtiff.a -> libtiff.a, liblibwebp.a -> libwebp.a, libzlib.a -> libz.a, liblibjasper.a -> libjasper.a
    ```
 
 3. After installation you should have:
@@ -130,7 +130,7 @@ OpenCV distribution once and point Cargo at it during compilation.
 > library. Replace `static=stdc++` with `dylib=c++` (or `framework=Accelerate`
 > when required) in the linking step below.
 
-## 2. Point Cargo at the static toolchain
+## 3. Point Cargo at the static toolchain
 
 Add a `.cargo/config.toml` (kept inside the repo) with the environment variables
 that the `opencv` build script understands:
@@ -159,7 +159,7 @@ file so subsequent runs do not skip regeneration.
 
 If you elected to install the individual module archives instead of
 `opencv_world`, list each one (`static=opencv_core`, `static=opencv_imgproc`,
-…). Keep the order roughly from high- to low-level modules so the linker can
+and so on). Keep the order roughly from high- to low-level modules so the linker can
 resolve symbols in one pass.
 
 For cross-compilation add target-specific sections, e.g.:
@@ -169,7 +169,7 @@ For cross-compilation add target-specific sections, e.g.:
 OPENCV_LINK_LIBS = "static=opencv_world,static=avformat,static=avcodec,static=avfilter,static=swresample,static=swscale,static=avutil,static=png,static=jpeg,static=tiff,static=z,static=jasper,dylib=c++"
 ```
 
-## 3. Build the crate
+## 4. Build the crate
 
 With the static bundle in place you can now build the crate without touching the
 system OpenCV installation:
@@ -182,7 +182,7 @@ The resulting `libtrainingsample.{so,dylib}` (or the wheels produced by the
 Python bindings) now embed the OpenCV symbols directly, so end users do not need
 `opencv_core` on their machines.
 
-## 4. Regenerating the bundle
+## 5. Regenerating the bundle
 
 Whenever you need to update OpenCV:
 
@@ -190,8 +190,8 @@ Whenever you need to update OpenCV:
    tree.
 2. Verify that the list in `OPENCV_LINK_LIBS` still matches the archives produced.
 3. Commit the regenerated contents of `third_party/opencv-static/` if you keep
-   it under version control (or upload it to your release pipeline’s artifact
+   it under version control (or upload it to your release pipeline's artifact
    store).
 
-That is all Cargo needs—no changes to `Cargo.toml` are required beyond enabling
+That is all Cargo needs. No changes to `Cargo.toml` are required beyond enabling
 the `opencv` feature when you want the acceleration path.
diff --git a/src/luminance.rs b/src/luminance.rs
index 38e4cac..daffe48 100644
--- a/src/luminance.rs
+++ b/src/luminance.rs
@@ -1,37 +1,51 @@
 use ndarray::ArrayView3;
 
 #[cfg(feature = "simd")]
-pub use crate::luminance_simd::{
-    calculate_luminance_optimized, calculate_luminance_optimized_sequential, LuminanceMetrics,
-};
+pub use crate::luminance_simd::{calculate_luminance_optimized, LuminanceMetrics};
 
 /// Main luminance calculation function with automatic SIMD optimization
 pub fn calculate_luminance_array(image: &ArrayView3<u8>) -> f64 {
-    #[cfg(feature = "simd")]
-    {
-        let (result, _metrics) = calculate_luminance_optimized(image);
-        result
+    if let Some(result) = calculate_luminance_contiguous(image) {
+        return result;
     }
 
-    #[cfg(not(feature = "simd"))]
-    {
-        calculate_luminance_scalar(image)
-    }
+    calculate_luminance_scalar(image)
 }
 
 /// Single-threaded luminance calculation to avoid nested parallelism in batch operations
 pub fn calculate_luminance_array_sequential(image: &ArrayView3<u8>) -> f64 {
-    #[cfg(feature = "simd")]
-    {
-        // Use single-threaded SIMD optimization to avoid nested parallelism
-        let (result, _metrics) = calculate_luminance_optimized_sequential(image);
-        result
+    if let Some(result) = calculate_luminance_contiguous(image) {
+        return result;
     }
 
-    #[cfg(not(feature = "simd"))]
-    {
-        calculate_luminance_scalar(image)
+    calculate_luminance_scalar(image)
+}
+
+fn calculate_luminance_contiguous(image: &ArrayView3<u8>) -> Option<f64> {
+    let (height, width, channels) = image.dim();
+    let data = image.as_slice()?;
+
+    if height == 0 || width == 0 || channels == 0 {
+        return Some(0.0);
+    }
+
+    if channels < 3 {
+        let sum: u64 = data.iter().map(|&x| x as u64).sum();
+        return Some(sum as f64 / data.len() as f64);
     }
+
+    let pixel_count = height * width;
+    let mut r_sum = 0u64;
+    let mut g_sum = 0u64;
+    let mut b_sum = 0u64;
+
+    for pixel in data.chunks_exact(channels).take(pixel_count) {
+        r_sum += pixel[0] as u64;
+        g_sum += pixel[1] as u64;
+        b_sum += pixel[2] as u64;
+    }
+
+    Some((0.299 * r_sum as f64 + 0.587 * g_sum as f64 + 0.114 * b_sum as f64) / pixel_count as f64)
 }
 
 /// Ultra-fast adaptive luminance calculation with automatic SIMD/scalar selection
diff --git a/src/python_bindings.rs b/src/python_bindings.rs
index 869e030..106939a 100644
--- a/src/python_bindings.rs
+++ b/src/python_bindings.rs
@@ -155,7 +155,7 @@ pub unsafe fn batch_crop_images_zero_copy<'py>(
 
         let array = ndarray::Array3::from_shape_vec((height, width, channels), output_buffer)
             .map_err(|e| pyo3::exceptions::PyValueError::new_err(format!("Shape error: {}", e)))?;
-        let py_array = PyArray3::from_array_bound(py, &array);
+        let py_array = PyArray3::from_owned_array_bound(py, array);
         py_results.push(py_array);
     }
 
@@ -206,7 +206,7 @@ pub unsafe fn batch_center_crop_images_zero_copy<'py>(
                 .map_err(|e| {
                     pyo3::exceptions::PyValueError::new_err(format!("Shape error: {}", e))
                 })?;
-        let py_array = PyArray3::from_array_bound(py, &array);
+        let py_array = PyArray3::from_owned_array_bound(py, array);
         py_results.push(py_array);
     }
 
@@ -389,8 +389,8 @@ pub fn batch_resize_images_zero_copy<'py>(
 
     // Convert to PyArray3 and return as Python list
     let py_results: Vec<Bound<'py, PyArray3<u8>>> = results
-        .iter()
-        .map(|array| PyArray3::from_array_bound(py, array))
+        .into_iter()
+        .map(|array| PyArray3::from_owned_array_bound(py, array))
         .collect();
 
     Ok(PyList::new_bound(py, py_results).into_any())
@@ -552,7 +552,7 @@ fn resize_single_image_direct<'py>(
     })?;
 
     // DIRECT return - no Vec wrapper overhead!
-    Ok(PyArray3::from_array_bound(py, &result))
+    Ok(PyArray3::from_owned_array_bound(py, result))
 }
 
 #[cfg(not(feature = "opencv"))]
@@ -622,7 +622,7 @@ impl ResizeIterator {
 
         // Convert raw buffer directly to PyArray3 - ZERO intermediate steps!
         match ndarray::Array3::from_shape_vec((*height, *width, *channels), buffer.clone()) {
-            Ok(array) => Some(PyArray3::from_array_bound(py, &array)),
+            Ok(array) => Some(PyArray3::from_owned_array_bound(py, array)),
             Err(_) => None, // Skip malformed arrays
         }
     }
@@ -874,7 +874,7 @@ pub fn batch_crop_images<'py>(
         let img_view = image.as_array();
         match crop_image_array(&img_view, x, y, width, height) {
             Ok(cropped) => {
-                let py_array = PyArray3::from_array_bound(py, &cropped);
+                let py_array = PyArray3::from_owned_array_bound(py, cropped);
                 py_results.push(py_array);
             }
             Err(e) => {
@@ -907,7 +907,7 @@ pub fn batch_center_crop_images<'py>(
         let img_view = image.as_array();
         match crate::cropping::center_crop_image_array(&img_view, target_width, target_height) {
             Ok(cropped) => {
-                let py_array = PyArray3::from_array_bound(py, &cropped);
+                let py_array = PyArray3::from_owned_array_bound(py, cropped);
                 py_results.push(py_array);
             }
             Err(e) => {
@@ -934,7 +934,7 @@ pub fn batch_random_crop_images<'py>(
         let img_view = image.as_array();
         match random_crop_image_array(&img_view, target_width, target_height) {
             Ok(cropped) => {
-                let py_array = PyArray3::from_array_bound(py, &cropped);
+                let py_array = PyArray3::from_owned_array_bound(py, cropped);
                 py_results.push(py_array);
             }
             Err(e) => {
@@ -986,7 +986,7 @@ pub fn rgb_to_rgba_optimized<'py>(
     let rgba_array = ndarray::Array3::from_shape_vec((height, width, 4), rgba_data)
         .map_err(|e| pyo3::exceptions::PyValueError::new_err(format!("Shape error: {}", e)))?;
 
-    let py_array = PyArray3::from_array_bound(py, &rgba_array);
+    let py_array = PyArray3::from_owned_array_bound(py, rgba_array);
     Ok((py_array, metrics.throughput_mpixels_per_sec))
 }
 
@@ -1015,7 +1015,7 @@ pub fn rgba_to_rgb_optimized<'py>(
     let rgb_array = ndarray::Array3::from_shape_vec((height, width, 3), rgb_data)
         .map_err(|e| pyo3::exceptions::PyValueError::new_err(format!("Shape error: {}", e)))?;
 
-    let py_array = PyArray3::from_array_bound(py, &rgb_array);
+    let py_array = PyArray3::from_owned_array_bound(py, rgb_array);
     Ok((py_array, metrics.throughput_mpixels_per_sec))
 }
 
@@ -1059,7 +1059,7 @@ pub fn batch_resize_images<'py>(
         Ok(resized_images) => {
             let py_results: Vec<_> = resized_images
                 .into_iter()
-                .map(|resized| PyArray3::from_array_bound(py, &resized))
+                .map(|resized| PyArray3::from_owned_array_bound(py, resized))
                 .collect();
             Ok(py_results)
         }
@@ -1097,7 +1097,7 @@ pub fn batch_resize_videos<'py>(
         Ok(resized_videos) => {
             let py_results: Vec<_> = resized_videos
                 .into_iter()
-                .map(|resized| PyArray4::from_array_bound(py, &resized))
+                .map(|resized| PyArray4::from_owned_array_bound(py, resized))
                 .collect();
             Ok(py_results)
         }
@@ -1136,7 +1136,7 @@ pub fn resize_bilinear_opencv<'py>(
 
     match resize_bilinear_opencv(&image_array, target_width, target_height) {
         Ok(resized) => {
-            let py_array = PyArray3::from_array_bound(py, &resized);
+            let py_array = PyArray3::from_owned_array_bound(py, resized);
             Ok(py_array)
         }
         Err(e) => Err(pyo3::exceptions::PyRuntimeError::new_err(format!(
@@ -1160,7 +1160,7 @@ pub fn resize_lanczos4_opencv<'py>(
 
     match resize_lanczos4_opencv(&image_array, target_width, target_height) {
         Ok(resized) => {
-            let py_array = PyArray3::from_array_bound(py, &resized);
+            let py_array = PyArray3::from_owned_array_bound(py, resized);
             Ok(py_array)
         }
         Err(e) => Err(pyo3::exceptions::PyRuntimeError::new_err(format!(
@@ -1222,7 +1222,7 @@ pub fn imdecode_py<'py>(
 
     match imdecode(buf, imread_flags) {
         Ok(image) => {
-            let py_array = PyArray3::from_array_bound(py, &image);
+            let py_array = PyArray3::from_owned_array_bound(py, image);
             Ok(py_array)
         }
         Err(e) => Err(pyo3::exceptions::PyRuntimeError::new_err(format!(
@@ -1258,7 +1258,7 @@ pub fn cvt_color_py<'py>(
     let src_array = src.as_array();
     match cvt_color(&src_array, color_code) {
         Ok(converted) => {
-            let py_array = PyArray3::from_array_bound(py, &converted);
+            let py_array = PyArray3::from_owned_array_bound(py, converted);
             Ok(py_array)
         }
         Err(e) => Err(pyo3::exceptions::PyRuntimeError::new_err(format!(
@@ -1281,7 +1281,7 @@ pub fn canny_py<'py>(
     let image_array = image.as_array();
     match canny(&image_array, threshold1, threshold2) {
         Ok(edges) => {
-            let py_array = PyArray3::from_array_bound(py, &edges);
+            let py_array = PyArray3::from_owned_array_bound(py, edges);
             Ok(py_array)
         }
         Err(e) => Err(pyo3::exceptions::PyRuntimeError::new_err(format!(
@@ -1317,7 +1317,7 @@ pub fn resize_py<'py>(
     let src_array = src.as_array();
     match resize(&src_array, dsize, interp) {
         Ok(resized) => {
-            let py_array = PyArray3::from_array_bound(py, &resized);
+            let py_array = PyArray3::from_owned_array_bound(py, resized);
             Ok(py_array)
         }
         Err(e) => Err(pyo3::exceptions::PyRuntimeError::new_err(format!(
@@ -1419,7 +1419,7 @@ impl PyVideoCapture {
     fn read<'py>(&mut self, py: Python<'py>) -> PyResult<(bool, Option<Bound<'py, PyArray3<u8>>>)> {
         let (ret, frame) = self.inner.read();
         if let Some(frame_data) = frame {
-            let py_array = PyArray3::from_array_bound(py, &frame_data);
+            let py_array = PyArray3::from_owned_array_bound(py, frame_data);
             Ok((ret, Some(py_array)))
         } else {
             Ok((ret, None))
@@ -1639,7 +1639,7 @@ impl PyBatchProcessor {
             Ok(results) => {
                 let py_results: Vec<_> = results
                     .into_iter()
-                    .map(|result| PyArray3::from_array_bound(py, &result))
+                    .map(|result| PyArray3::from_owned_array_bound(py, result))
                     .collect();
                 Ok(py_results)
             }
@@ -1679,7 +1679,7 @@ impl PyBatchProcessor {
             Ok(results) => {
                 let py_results: Vec<_> = results
                     .into_iter()
-                    .map(|result| PyArray3::from_array_bound(py, &result))
+                    .map(|result| PyArray3::from_owned_array_bound(py, result))
                     .collect();
                 Ok(py_results)
             }
@@ -1704,7 +1704,7 @@ impl PyBatchProcessor {
             Ok(results) => {
                 let py_results: Vec<_> = results
                     .into_iter()
-                    .map(|result| PyArray3::from_array_bound(py, &result))
+                    .map(|result| PyArray3::from_owned_array_bound(py, result))
                     .collect();
                 Ok(py_results)
             }
@@ -1782,7 +1782,7 @@ impl PyBatchProcessor {
             Ok(results) => {
                 let py_results: Vec<_> = results
                     .into_iter()
-                    .map(|result| PyArray3::from_array_bound(py, &result))
+                    .map(|result| PyArray3::from_owned_array_bound(py, result))
                     .collect();
                 Ok(py_results)
             }
@@ -1903,7 +1903,7 @@ impl PyTrueBatchProcessor {
             Ok(results) => {
                 let py_results: Vec<_> = results
                     .into_iter()
-                    .map(|result| PyArray3::from_array_bound(py, &result))
+                    .map(|result| PyArray3::from_owned_array_bound(py, result))
                     .collect();
                 Ok(py_results)
             }
@@ -1939,7 +1939,7 @@ impl PyTrueBatchProcessor {
             Ok(results) => {
                 let py_results: Vec<_> = results
                     .into_iter()
-                    .map(|result| PyArray3::from_array_bound(py, &result))
+                    .map(|result| PyArray3::from_owned_array_bound(py, result))
                     .collect();
                 Ok(py_results)
             }