perf: AVX2 gather for PixmapRef::gather #176

Open: bhark wants to merge 1 commit into linebender:main from bhark:perf/avx2-gather

@bhark bhark commented May 23, 2026

This PR is part of a series linked to #174.

What

Replaces the 8 scalar pixel loads in PixmapRef::gather with a single _mm256_i32gather_epi32 when AVX2 is available. As before, this is gated on feature = "simd" and target_feature = "avx2", so the SSE2/SSE4.1/AVX, NEON, and other paths are unchanged.
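For readers who haven't used the intrinsic, here is a minimal standalone sketch of the idea, not the code in this PR. The names `gather8` and `gather_scalar` are hypothetical, the pixel buffer is modelled as a plain `&[u32]`, and the `feature = "simd"` half of the gate is left out because a standalone snippet has no Cargo features. The bounds assertion is there only to keep the sketch safe on its own.

```rust
/// Scalar reference: eight independent indexed loads.
fn gather_scalar(pixels: &[u32], idx: [u32; 8]) -> [u32; 8] {
    idx.map(|i| pixels[i as usize])
}

/// AVX2 path, compiled only when the build enables AVX2
/// (e.g. -Ctarget-feature=+avx2).
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
fn gather8(pixels: &[u32], idx: [u32; 8]) -> [u32; 8] {
    use std::arch::x86_64::{
        __m256i, _mm256_i32gather_epi32, _mm256_loadu_si256, _mm256_storeu_si256,
    };

    // The instruction does no bounds checking and treats offsets as
    // signed 32-bit values, so the sketch checks both up front.
    assert!(pixels.len() <= i32::MAX as usize);
    assert!(idx.iter().all(|&i| (i as usize) < pixels.len()));

    let mut out = [0u32; 8];
    unsafe {
        // Load the eight 32-bit indices into one 256-bit register.
        let offsets = _mm256_loadu_si256(idx.as_ptr() as *const __m256i);
        // SCALE = 4: each index is multiplied by 4 bytes (one u32 pixel).
        let v = _mm256_i32gather_epi32::<4>(pixels.as_ptr() as *const i32, offsets);
        _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, v);
    }
    out
}

/// Every other target keeps the scalar loads.
#[cfg(not(all(target_arch = "x86_64", target_feature = "avx2")))]
fn gather8(pixels: &[u32], idx: [u32; 8]) -> [u32; 8] {
    gather_scalar(pixels, idx)
}
```

Both paths return the same eight pixels for in-range indices; only the number of load instructions differs.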

Why

gather is a hot spot in image sampling: bicubic patterns make 16 sample calls per pixel batch, each loading 8 pixels. One AVX2 gather replaces 8 dependent scalar loads.

Results

| Benchmark | Speedup |
| --- | --- |
| Geomean (56 benches) | 1.04x |
| patterns::hq (bicubic) | 1.44x |
| patterns::lq (bilinear) | 1.33x |
| patterns::plain (nearest) | 1.18x |
| fill::rect, hairline, blend::destination_over (don't hit gather) | 1.00x |

No regressions. A bunch of unrelated benches landed at ~1.1x, probably just noise/variance.

Notes

  • _mm256_i32gather_epi32 does no bounds check; callers clamp indices to [0, w*h), so the invariant holds
  • Style-wise inspired by existing src/wide/i32x8_t.rs / u32x8_t.rs
  • The speedup only shows up when built with -Ctarget-cpu=haswell, -Ctarget-feature=+avx2, or -Ctarget-cpu=native
  • Benched on an i5-13400F (Raptor Lake). Since gather is microcoded, the win on Zen 1/2 may be smaller or absent. It still shouldn't regress.
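
To make the bounds note concrete, here is a rough sketch of the kind of clamping that keeps every gathered index inside the pixel buffer. `clamped_index` is a hypothetical helper, not a function from this crate, and it assumes a row-major buffer of `width * height` pixels whose length fits in an `i32` (the gather offsets are signed 32-bit).

```rust
/// Hypothetical helper: clamp (x, y) to the image and return a
/// row-major index that is always < width * height.
fn clamped_index(x: i32, y: i32, width: u32, height: u32) -> u32 {
    assert!(width > 0 && height > 0);
    // Assumption: the whole buffer is addressable with a signed 32-bit offset.
    assert!((width as u64) * (height as u64) <= i32::MAX as u64);

    // Clamp in i64 so `width - 1` and `height - 1` cannot overflow.
    let cx = (x as i64).clamp(0, width as i64 - 1) as u32;
    let cy = (y as i64).clamp(0, height as i64 - 1) as u32;
    cy * width + cx
}
```

With indices produced this way, the unchecked gather never reads past the last pixel, which is the invariant the first note relies on.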

@RazrFalcon
Collaborator

Looks good, but again, we should wrap it nicely in wide.
