feat: SIMD-accelerated box operations #65

@Smirkey

Description

All core computation loops (IoU, GIoU, DIoU, box areas, NMS) are currently scalar. There is an opportunity to use SIMD intrinsics to process multiple boxes or box pairs per instruction.

Candidate hot paths

Box areas (box_areas_slice)

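Currently computes (x2-x1) * (y2-y1) one box at a time. With AVX2, 4 f64 boxes can be processed per iteration (load 4 x1/y1/x2/y2 values, subtract, multiply), since a 256-bit register holds four f64 lanes. Contiguous loads like this need the coordinates stored column-wise (SoA); with a row-per-box (AoS) layout the lanes must first be shuffled or gathered.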
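As a rough illustration of the target loop shape, here is a minimal sketch assuming a hypothetical SoA signature with one slice per coordinate. The name `box_areas_soa` and its parameters are illustrative and are not the crate's existing `box_areas_slice` API. Written this way, the loop body has no branches and no cross-iteration dependencies, which is what the auto-vectorizer (or a hand-written kernel) needs.

```rust
/// Computes (x2 - x1) * (y2 - y1) for every box.
/// Assumption: coordinates arrive as four separate slices (SoA layout),
/// which is NOT how the current scalar code is described in this issue.
pub fn box_areas_soa(x1: &[f64], y1: &[f64], x2: &[f64], y2: &[f64], out: &mut [f64]) {
    let n = out.len();
    assert!(x1.len() == n && y1.len() == n && x2.len() == n && y2.len() == n);

    // Re-slice to the common length so the compiler can drop bounds checks
    // inside the loop and treat it as a plain element-wise kernel.
    let (x1, y1, x2, y2) = (&x1[..n], &y1[..n], &x2[..n], &y2[..n]);
    for i in 0..n {
        out[i] = (x2[i] - x1[i]) * (y2[i] - y1[i]);
    }
}
```

Whether LLVM actually emits packed instructions for this loop depends on the target CPU flags, so the generated assembly should be checked before drawing conclusions.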

IoU inner loop (iou_distance_slice)

The N1×N2 loop computes min/max for the intersection and then a division for each pair. The inner loop (iterating over boxes2 for a fixed boxes1 row) can be vectorized:

  • Load 4 boxes2 at once
  • SIMD min/max for intersection coordinates
  • SIMD multiply + subtract for area
  • SIMD divide for IoU
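The steps above can be sketched on stable Rust without intrinsics by processing boxes2 in groups of four with a branch-free per-pair kernel. Everything named here (`BoxColumns`, `iou_pair`, `iou_row_soa`, the SoA layout, and returning 0.0 when the union is empty) is an assumption made for illustration and not the crate's actual code. If `iou_distance_slice` returns a distance rather than a similarity, the final value would presumably be `1.0 - iou`.

```rust
/// Hypothetical SoA container for the second box set (not the crate's real type).
pub struct BoxColumns {
    pub x1: Vec<f64>,
    pub y1: Vec<f64>,
    pub x2: Vec<f64>,
    pub y2: Vec<f64>,
}

const LANES: usize = 4; // four f64 values fit in one 256-bit register

/// IoU of the fixed box `a` against a single box, using min/max instead of
/// early-exit branches so the calling loop stays vectorizable.
#[inline(always)]
fn iou_pair(a: [f64; 4], area_a: f64, bx1: f64, by1: f64, bx2: f64, by2: f64) -> f64 {
    let w = (a[2].min(bx2) - a[0].max(bx1)).max(0.0);
    let h = (a[3].min(by2) - a[1].max(by1)).max(0.0);
    let inter = w * h;
    let union = area_a + (bx2 - bx1) * (by2 - by1) - inter;
    // Assumption: an empty union yields IoU 0.0 rather than NaN.
    if union > 0.0 { inter / union } else { 0.0 }
}

/// Fills `out[j]` with IoU(a, b[j]) for one row of the N1 x N2 matrix.
pub fn iou_row_soa(a: [f64; 4], b: &BoxColumns, out: &mut [f64]) {
    let n = out.len();
    assert!(b.x1.len() == n && b.y1.len() == n && b.x2.len() == n && b.y2.len() == n);
    let area_a = (a[2] - a[0]) * (a[3] - a[1]);

    // Main loop: fixed-size groups of LANES boxes, one group per SIMD step.
    let full = n - n % LANES;
    for start in (0..full).step_by(LANES) {
        for lane in 0..LANES {
            let j = start + lane;
            out[j] = iou_pair(a, area_a, b.x1[j], b.y1[j], b.x2[j], b.y2[j]);
        }
    }
    // Scalar tail for the last n % LANES boxes.
    for j in full..n {
        out[j] = iou_pair(a, area_a, b.x1[j], b.y1[j], b.x2[j], b.y2[j]);
    }
}
```

The area of the fixed boxes1 row is hoisted out of the inner loop, so each group of four costs only the min/max, multiply, subtract, and divide listed above.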

NMS suppression check

After sorting by score, the suppression loop checks IoU against all remaining candidates. The IoU comparison can be vectorized similarly.
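A minimal sketch of what a vectorization-friendly suppression loop might look like, assuming greedy NMS over boxes gathered into SoA columns in score order. The function name `nms_soa` and its signature are hypothetical. One deliberate variation from a plain IoU check: the comparison `inter > threshold * union` is used in place of `inter / union > threshold`, which gives the same decision for a positive union and avoids the division.

```rust
/// Greedy NMS returning the indices of kept boxes, highest score first.
/// Assumption: boxes are [x1, y1, x2, y2] with x2 >= x1 and y2 >= y1.
pub fn nms_soa(boxes: &[[f64; 4]], scores: &[f64], iou_threshold: f64) -> Vec<usize> {
    assert_eq!(boxes.len(), scores.len());
    let n = boxes.len();

    // Sort indices by descending score.
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));

    // Gather coordinates into SoA columns in sorted order so the inner loop
    // reads contiguous memory.
    let x1: Vec<f64> = order.iter().map(|&i| boxes[i][0]).collect();
    let y1: Vec<f64> = order.iter().map(|&i| boxes[i][1]).collect();
    let x2: Vec<f64> = order.iter().map(|&i| boxes[i][2]).collect();
    let y2: Vec<f64> = order.iter().map(|&i| boxes[i][3]).collect();
    let area: Vec<f64> = (0..n).map(|i| (x2[i] - x1[i]) * (y2[i] - y1[i])).collect();

    let mut suppressed = vec![false; n];
    let mut keep = Vec::new();
    for i in 0..n {
        if suppressed[i] {
            continue;
        }
        keep.push(order[i]);
        let (ax1, ay1, ax2, ay2, a_area) = (x1[i], y1[i], x2[i], y2[i], area[i]);

        // Branch-free check against all lower-scored candidates.
        for j in (i + 1)..n {
            let w = (ax2.min(x2[j]) - ax1.max(x1[j])).max(0.0);
            let h = (ay2.min(y2[j]) - ay1.max(y1[j])).max(0.0);
            let inter = w * h;
            let union = a_area + area[j] - inter;
            suppressed[j] |= inter > iou_threshold * union;
        }
    }
    keep
}
```

The inner loop also evaluates candidates that are already suppressed. That is wasted work in scalar code, but it keeps the loop free of data-dependent branches, which is usually the better trade for SIMD.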

Approach options

  1. Auto-vectorization hints: restructure loops so LLVM auto-vectorizes (SoA layout instead of AoS, #[target_feature] annotations). Lowest effort. The loop restructuring is portable; #[target_feature] is architecture-specific and needs a runtime check or a fallback path.
  2. std::simd (nightly): use Rust's portable SIMD API behind a feature flag. Clean but requires nightly.
  3. std::arch intrinsics: manual SSE4.1/AVX2 with #[cfg(target_arch)] fallback. Maximum control and works on stable Rust, but requires unsafe code and per-architecture implementations.
  4. pulp or wide crate: safe SIMD wrappers that work on stable. Good middle ground.
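To make option 3 concrete, here is a sketch of the box-area kernel using `std::arch` with runtime feature detection and a scalar fallback. The function names and the SoA signature are hypothetical. The intrinsics used (`_mm256_loadu_pd`, `_mm256_sub_pd`, `_mm256_mul_pd`, `_mm256_storeu_pd`) strictly need only AVX; the check is for AVX2 here simply to match the target named in this issue.

```rust
/// Scalar fallback, also used for the tail of the AVX2 path.
fn box_areas_scalar_tail(x1: &[f64], y1: &[f64], x2: &[f64], y2: &[f64], out: &mut [f64]) {
    for i in 0..out.len() {
        out[i] = (x2[i] - x1[i]) * (y2[i] - y1[i]);
    }
}

/// Processes four f64 boxes per iteration with 256-bit registers.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn box_areas_avx2(x1: &[f64], y1: &[f64], x2: &[f64], y2: &[f64], out: &mut [f64]) {
    use std::arch::x86_64::*;
    let n = out.len();
    let mut i = 0;
    while i + 4 <= n {
        // SAFETY: i + 4 <= n and the caller guarantees all slices have length n.
        unsafe {
            let w = _mm256_sub_pd(
                _mm256_loadu_pd(x2.as_ptr().add(i)),
                _mm256_loadu_pd(x1.as_ptr().add(i)),
            );
            let h = _mm256_sub_pd(
                _mm256_loadu_pd(y2.as_ptr().add(i)),
                _mm256_loadu_pd(y1.as_ptr().add(i)),
            );
            _mm256_storeu_pd(out.as_mut_ptr().add(i), _mm256_mul_pd(w, h));
        }
        i += 4;
    }
    // Remaining 0..=3 boxes.
    box_areas_scalar_tail(&x1[i..], &y1[i..], &x2[i..], &y2[i..], &mut out[i..]);
}

/// Public entry point: picks the AVX2 kernel when the CPU supports it.
pub fn box_areas_dispatch(x1: &[f64], y1: &[f64], x2: &[f64], y2: &[f64], out: &mut [f64]) {
    let n = out.len();
    assert!(x1.len() == n && y1.len() == n && x2.len() == n && y2.len() == n);

    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime and lengths were checked.
            unsafe { box_areas_avx2(x1, y1, x2, y2, out) };
            return;
        }
    }
    box_areas_scalar_tail(x1, y1, x2, y2, out);
}
```

On non-x86_64 targets the AVX2 function is compiled out entirely and the dispatcher falls straight through to the scalar path.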

Suggested plan

  • Add a simd feature flag (off by default)
  • Start with box_areas_slice as a benchmark-driven proof of concept
  • Benchmark against the scalar version with criterion
  • If gains are significant (>2x), extend to iou_distance_slice inner loop
  • Keep Rayon parallelism orthogonal (SIMD within each thread)
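One possible way to wire up the first steps of this plan is sketched below, assuming a `simd = []` entry under `[features]` in Cargo.toml. The function names are hypothetical. The scalar baseline is always compiled so that benchmarks and equivalence tests can compare against it, and the public entry point switches implementation on the feature flag.

```rust
/// Scalar baseline; always available as the reference for benchmarks and tests.
pub fn box_areas_baseline(boxes: &[[f64; 4]], out: &mut [f64]) {
    assert_eq!(boxes.len(), out.len());
    for (o, b) in out.iter_mut().zip(boxes) {
        *o = (b[2] - b[0]) * (b[3] - b[1]);
    }
}

/// Compiled only with `--features simd`: fixed groups of four boxes, giving
/// the compiler (or a later hand-written kernel) a 4-wide unit of work.
#[cfg(feature = "simd")]
pub fn box_areas_gated(boxes: &[[f64; 4]], out: &mut [f64]) {
    assert_eq!(boxes.len(), out.len());
    let mut out_chunks = out.chunks_exact_mut(4);
    let mut box_chunks = boxes.chunks_exact(4);
    for (o, b) in (&mut out_chunks).zip(&mut box_chunks) {
        for lane in 0..4 {
            o[lane] = (b[lane][2] - b[lane][0]) * (b[lane][3] - b[lane][1]);
        }
    }
    // Leftover boxes go through the scalar baseline.
    box_areas_baseline(box_chunks.remainder(), out_chunks.into_remainder());
}

/// Default build (feature off): forward to the scalar baseline.
#[cfg(not(feature = "simd"))]
pub fn box_areas_gated(boxes: &[[f64; 4]], out: &mut [f64]) {
    box_areas_baseline(boxes, out);
}
```

A criterion benchmark would then time `box_areas_baseline` and `box_areas_gated` on the same input, and an equality check between the two outputs guards against regressions in the SIMD path.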
