
perf: AVX2 8888 load/store + f32x16 save specializations#175

Open
bhark wants to merge 1 commit into linebender:main from bhark:perf/avx2-8888


bhark commented May 23, 2026

This PR is part of a series linked to #174.

What

AVX2 specializations for these primitives:

  • lowp::load_8888 / lowp::store_8888 (these are the bulk of source_over_rgba)
  • lowp::load_8
  • highp::load_8888 / highp::store_8888
  • f32x16::save_to_u16x16

Gated on feature = "simd" and target_feature = "avx2", so the SSE2/SSE4.1/AVX, NEON, WASM SIMD, etc. paths are unchanged.
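As a rough illustration of that gating, here is a minimal sketch of compile-time selection between an AVX2 path and a generic fallback. The function name `backend` is hypothetical and not taken from the PR; only the `cfg` predicate mirrors the one described above.

```rust
/// Hypothetical helper: reports which path was compiled in.
/// Selected only when the `simd` feature is on AND AVX2 is enabled at compile time.
#[cfg(all(feature = "simd", target_feature = "avx2"))]
fn backend() -> &'static str {
    "avx2"
}

/// Fallback for every other configuration (SSE2, NEON, WASM SIMD, scalar, ...).
#[cfg(not(all(feature = "simd", target_feature = "avx2")))]
fn backend() -> &'static str {
    "generic"
}
```

Because the two `cfg` predicates are exact complements, exactly one definition exists in any build, so callers never need their own `cfg` checks.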

Why

On AVX2, each of these primitives can replace a long scalar shuffle sequence with a handful of intrinsics. As a result, lowp::source_over_rgba drops from ~350 to about 50 instructions.
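To make the idea concrete, here is a simplified sketch, not the PR's actual code, of splitting 8 RGBA pixels into per-channel lanes with one load, three shifts, and three masks. The names `load_8888_scalar`, `load_8888_avx2`, and `load_8888` are hypothetical, the output layout (`[[u32; 8]; 4]`) is an assumption chosen for clarity, and the sketch uses runtime detection so it runs in a default build, whereas the PR gates at compile time.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: split 8 RGBA pixels (32 bytes) into [r, g, b, a] lanes.
fn load_8888_scalar(px: &[u8; 32]) -> [[u32; 8]; 4] {
    let mut out = [[0u32; 8]; 4];
    for i in 0..8 {
        for c in 0..4 {
            out[c][i] = px[i * 4 + c] as u32;
        }
    }
    out
}

/// AVX2 sketch: each 32-bit lane holds one pixel as r | g<<8 | b<<16 | a<<24
/// (little-endian), so channels fall out of shifts and a 0xFF mask.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn load_8888_avx2(px: &[u8; 32]) -> [[u32; 8]; 4] {
    let mut out = [[0u32; 8]; 4];
    unsafe {
        let v = _mm256_loadu_si256(px.as_ptr() as *const __m256i);
        let mask = _mm256_set1_epi32(0xFF);
        let r = _mm256_and_si256(v, mask);
        let g = _mm256_and_si256(_mm256_srli_epi32::<8>(v), mask);
        let b = _mm256_and_si256(_mm256_srli_epi32::<16>(v), mask);
        let a = _mm256_srli_epi32::<24>(v); // the logical shift already clears the high bits
        _mm256_storeu_si256(out[0].as_mut_ptr() as *mut __m256i, r);
        _mm256_storeu_si256(out[1].as_mut_ptr() as *mut __m256i, g);
        _mm256_storeu_si256(out[2].as_mut_ptr() as *mut __m256i, b);
        _mm256_storeu_si256(out[3].as_mut_ptr() as *mut __m256i, a);
    }
    out
}

/// Dispatcher: use AVX2 when the CPU supports it, otherwise the scalar path.
fn load_8888(px: &[u8; 32]) -> [[u32; 8]; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: AVX2 support was just verified at runtime.
            return unsafe { load_8888_avx2(px) };
        }
    }
    load_8888_scalar(px)
}
```

The scalar version touches each byte individually, while the vector version processes all 8 pixels per instruction, which is the kind of reduction in instruction count described above.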

Results

| Benchmark | Speedup |
| --- | --- |
| Geomean (55 benches) | 1.95x |
| blend::destination_atop | 3.83x |
| gradients::two_stops_linear_pad | 3.56x |
| blend::source_over | 2.53x |
| fill::rect | 2.04x |
| gradients::three_stops_linear_even | 1.92x |
| hairline::aa | 1.34x |
| blend::clear, fill::opaque (don't hit specialized paths) | 1.00x |

I found no regressions anywhere.

Notes

Some nuances to this:

  • The AVX2 lowp::store_8888 ORs the channels together instead of truncating each one with as u8. This shouldn't be an issue unless something produces channel values > 255 in lowp, in which case the AVX2 path would surface a visual artifact instead of truncating silently.
  • Style-wise, this follows the existing src/wide/i32x8_t.rs / u32x8_t.rs patterns.
  • Default cargo builds naturally won't see this. You'll have to build with -Ctarget-cpu=haswell, -Ctarget-feature=+avx2, or -Ctarget-cpu=native.
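The first note can be illustrated with a scalar sketch. It assumes the OR-based store packs channels by shifting and OR-ing without masking; the PR's exact instruction sequence may differ, and `pack_truncate` / `pack_or` are hypothetical names used only for this example.

```rust
/// Pack by truncating each channel to its low 8 bits (the `as u8` behaviour).
fn pack_truncate(r: u16, g: u16, b: u16, a: u16) -> u32 {
    u32::from_le_bytes([r as u8, g as u8, b as u8, a as u8])
}

/// Pack by shifting and OR-ing the full 16-bit channels, with no masking.
/// Assumption: this models the OR-based store; bits above 8 leak into
/// the neighbouring channel.
fn pack_or(r: u16, g: u16, b: u16, a: u16) -> u32 {
    (r as u32) | ((g as u32) << 8) | ((b as u32) << 16) | ((a as u32) << 24)
}
```

For in-range channels the two agree, for example `pack_truncate(10, 20, 30, 255) == pack_or(10, 20, 30, 255)`. With an out-of-range red value of 256, truncation yields `0xFF00_0000` (red silently becomes 0), while the OR variant yields `0xFF00_0100`, where the overflow bit lands in the green channel.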

@RazrFalcon
Collaborator

Looks good, but I would avoid direct SIMD intrinsic calls and cfg_if in the highp/lowp code. All of them should be nicely wrapped in the wide module.
