Description
All core computation loops (IoU, GIoU, DIoU, box areas, NMS) are currently scalar. There is an opportunity to use SIMD intrinsics to process multiple boxes or box pairs per instruction.
Candidate hot paths
Box areas (box_areas_slice)
Currently computes (x2-x1) * (y2-y1) one box at a time. With AVX2, 4 f64 boxes can be processed per iteration (load 4 x1/y1/x2/y2, subtract, multiply).
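As a rough illustration of the load/subtract/multiply pattern described above, here is a minimal sketch using std::arch with a scalar fallback. It assumes the coordinates are already stored as four separate f64 slices (SoA); the function names `areas_scalar`, `areas_avx2`, and `box_areas` are hypothetical and do not reflect the crate's actual `box_areas_slice` signature.

```rust
/// Scalar reference: (x2 - x1) * (y2 - y1) per box, with SoA inputs.
fn areas_scalar(x1: &[f64], y1: &[f64], x2: &[f64], y2: &[f64], out: &mut [f64]) {
    for i in 0..out.len() {
        out[i] = (x2[i] - x1[i]) * (y2[i] - y1[i]);
    }
}

/// Processes 4 boxes per iteration with 256-bit vectors, then finishes the
/// tail with the scalar loop. The f64 instructions used here only need AVX;
/// AVX2 is enabled to match the feature level named in the issue.
/// Safety: the CPU must support AVX2 and all slices must have `out.len()` elements.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn areas_avx2(x1: &[f64], y1: &[f64], x2: &[f64], y2: &[f64], out: &mut [f64]) {
    use std::arch::x86_64::*;
    let n = out.len();
    let mut i = 0;
    unsafe {
        while i + 4 <= n {
            let w = _mm256_sub_pd(
                _mm256_loadu_pd(x2.as_ptr().add(i)),
                _mm256_loadu_pd(x1.as_ptr().add(i)),
            );
            let h = _mm256_sub_pd(
                _mm256_loadu_pd(y2.as_ptr().add(i)),
                _mm256_loadu_pd(y1.as_ptr().add(i)),
            );
            _mm256_storeu_pd(out.as_mut_ptr().add(i), _mm256_mul_pd(w, h));
            i += 4;
        }
    }
    // Remaining 0..=3 boxes.
    areas_scalar(&x1[i..], &y1[i..], &x2[i..], &y2[i..], &mut out[i..]);
}

/// Hypothetical entry point: picks the AVX2 kernel at runtime when available.
pub fn box_areas(x1: &[f64], y1: &[f64], x2: &[f64], y2: &[f64], out: &mut [f64]) {
    let n = out.len();
    assert!(x1.len() == n && y1.len() == n && x2.len() == n && y2.len() == n);
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 was detected at runtime and lengths were checked above.
            unsafe { areas_avx2(x1, y1, x2, y2, out) };
            return;
        }
    }
    areas_scalar(x1, y1, x2, y2, out);
}
```

If the boxes are stored as [x1, y1, x2, y2] rows (AoS), the loads above would need a transpose or gather step first, which may eat into the gain; that is worth measuring before committing to this kernel.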
IoU inner loop (iou_distance_slice)
The N1×N2 loop computes min/max for intersection then division for each pair. The inner loop (iterating over boxes2 for a fixed boxes1 row) can be vectorized:
- Load 4 boxes2 at once
- SIMD min/max for intersection coordinates
- SIMD multiply + subtract for area
- SIMD divide for IoU
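The steps listed above can be sketched as a branch-free scalar loop over one row, which is the shape a SIMD version would mirror lane by lane. This is only a sketch: `iou_row` is a hypothetical helper, boxes2 is assumed to be in SoA form, and boxes are assumed to be non-degenerate (a zero union would produce NaN). If `iou_distance_slice` returns a distance, it would presumably be derived from this value (for example 1 - IoU), but that is an assumption.

```rust
/// Computes IoU between one fixed box `a = [x1, y1, x2, y2]` and every box in
/// an SoA batch, writing results into `out`. There are no branches in the loop
/// body, so each line maps directly to a SIMD min/max/sub/mul/div.
pub fn iou_row(a: [f64; 4], x1: &[f64], y1: &[f64], x2: &[f64], y2: &[f64], out: &mut [f64]) {
    let n = out.len();
    // Re-slice to a common length so bounds checks can be hoisted out of the loop.
    let (x1, y1, x2, y2) = (&x1[..n], &y1[..n], &x2[..n], &y2[..n]);
    let area_a = (a[2] - a[0]) * (a[3] - a[1]);
    for j in 0..n {
        // Intersection width/height, clamped at zero for disjoint boxes.
        let iw = (a[2].min(x2[j]) - a[0].max(x1[j])).max(0.0);
        let ih = (a[3].min(y2[j]) - a[1].max(y1[j])).max(0.0);
        let inter = iw * ih;
        let area_b = (x2[j] - x1[j]) * (y2[j] - y1[j]);
        let union = area_a + area_b - inter;
        out[j] = inter / union;
    }
}
```

Clamping with max(0.0) instead of an early `continue` for disjoint boxes keeps every iteration identical, which is what lets four pairs be evaluated in lockstep.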
NMS suppression check
After sorting by score, the suppression loop checks IoU against all remaining candidates. The IoU comparison can be vectorized similarly.
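A minimal sketch of what a vectorization-friendly suppression loop might look like is shown below. It is not the crate's NMS implementation: `nms_sorted` is a hypothetical name, the boxes are assumed to be in SoA form and already sorted by descending score, and the division is replaced by the comparison inter > thresh * union (equivalent for a positive union), which is a design choice of this sketch rather than something the issue prescribes.

```rust
/// Greedy NMS over boxes already sorted by descending score (SoA layout).
/// Returns the indices of the boxes that are kept.
pub fn nms_sorted(x1: &[f64], y1: &[f64], x2: &[f64], y2: &[f64], thresh: f64) -> Vec<usize> {
    let n = x1.len();
    assert!(y1.len() == n && x2.len() == n && y2.len() == n);
    let mut suppressed = vec![false; n];
    let mut keep = Vec::new();
    for i in 0..n {
        if suppressed[i] {
            continue;
        }
        keep.push(i);
        let (ax1, ay1, ax2, ay2) = (x1[i], y1[i], x2[i], y2[i]);
        let area_a = (ax2 - ax1) * (ay2 - ay1);
        // Branch-free check against all remaining candidates; this inner loop
        // is the part that could be processed several boxes at a time.
        for j in (i + 1)..n {
            let iw = (ax2.min(x2[j]) - ax1.max(x1[j])).max(0.0);
            let ih = (ay2.min(y2[j]) - ay1.max(y1[j])).max(0.0);
            let inter = iw * ih;
            let area_b = (x2[j] - x1[j]) * (y2[j] - y1[j]);
            let union = area_a + area_b - inter;
            suppressed[j] |= inter > thresh * union;
        }
    }
    keep
}
```

The outer loop stays sequential because each kept box depends on earlier suppression decisions; only the inner comparison is a candidate for SIMD.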
Approach options
- Auto-vectorization hints: restructure loops so LLVM auto-vectorizes (SoA layout instead of AoS, #[target_feature] annotations). Lowest effort, portable.
- std::simd (nightly): use Rust's portable SIMD API behind a feature flag. Clean but requires nightly.
- std::arch intrinsics: manual SSE4.1/AVX2 with #[cfg(target_arch)] fallback. Maximum control, stable Rust.
- pulp or wide crate: safe SIMD wrappers that work on stable. Good middle ground.
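To make the first option concrete, the sketch below shows one way to move from an AoS layout to SoA columns and then run a plain loop over them. The [x1, y1, x2, y2] row layout and the names `to_soa` and `box_areas_soa` are assumptions for illustration; whether LLVM actually vectorizes the loop depends on target features and optimization level, so it should be confirmed by inspecting the generated assembly or by benchmarking.

```rust
/// Splits AoS boxes `[x1, y1, x2, y2]` into four coordinate columns (SoA).
pub fn to_soa(boxes: &[[f64; 4]]) -> [Vec<f64>; 4] {
    let mut cols: [Vec<f64>; 4] = std::array::from_fn(|_| Vec::with_capacity(boxes.len()));
    for b in boxes {
        for (col, &v) in cols.iter_mut().zip(b) {
            col.push(v);
        }
    }
    cols
}

/// Plain indexed loop over contiguous columns. Each operand is a unit-stride
/// load, which is the access pattern auto-vectorizers handle best.
pub fn box_areas_soa(cols: &[Vec<f64>; 4], out: &mut [f64]) {
    let n = out.len();
    // Re-slice to a common length so bounds checks can be hoisted out of the loop.
    let (x1, y1, x2, y2) = (&cols[0][..n], &cols[1][..n], &cols[2][..n], &cols[3][..n]);
    for i in 0..n {
        out[i] = (x2[i] - x1[i]) * (y2[i] - y1[i]);
    }
}
```

The conversion itself costs a pass over the data, so it pays off mainly when the SoA columns are reused, as in the N1×N2 IoU loop.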
Suggested plan
- Add a simd feature flag (off by default)
- Start with box_areas_slice as a benchmark-driven proof of concept
- Benchmark against the scalar version with criterion
- If gains are significant (>2x), extend to iou_distance_slice inner loop
- Keep Rayon parallelism orthogonal (SIMD within each thread)
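One possible shape for the feature gating is sketched below, assuming a Cargo feature declared as `simd = []` under `[features]` and left out of `default`. The function names are hypothetical, the AoS row layout is an assumption, and `box_areas_simd` is only a placeholder that forwards to the scalar loop; a real kernel would replace its body.

```rust
/// Scalar baseline, always compiled so it can serve as the benchmark reference.
pub fn box_areas_scalar(boxes: &[[f64; 4]], out: &mut [f64]) {
    for (o, b) in out.iter_mut().zip(boxes) {
        *o = (b[2] - b[0]) * (b[3] - b[1]);
    }
}

/// Placeholder for the SIMD kernel; only compiled when the `simd` feature is on.
/// In this sketch it simply forwards to the scalar version.
#[cfg(feature = "simd")]
pub fn box_areas_simd(boxes: &[[f64; 4]], out: &mut [f64]) {
    box_areas_scalar(boxes, out)
}

/// Hypothetical public entry point: callers see one function, and the feature
/// flag decides which implementation backs it.
pub fn box_areas_dispatch(boxes: &[[f64; 4]], out: &mut [f64]) {
    #[cfg(feature = "simd")]
    {
        box_areas_simd(boxes, out)
    }
    #[cfg(not(feature = "simd"))]
    {
        box_areas_scalar(boxes, out)
    }
}
```

Keeping the scalar function public regardless of the flag lets a criterion benchmark compare both paths in the same build when the feature is enabled.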