Summary
CPU masked_scatter currently routes all cases through ContiguousIterator, including the common case where mask, src, and out are row-contiguous. That adds per-element iterator/stride bookkeeping to a mask walk that can otherwise use direct pointer indexing.
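For illustration, the direct pointer walk described above can be sketched as follows. This is a minimal sketch, not the library's actual kernel: the function name `masked_scatter_contiguous` and its signature are hypothetical. It assumes `mask`, `src`, and `out` are dense row-contiguous buffers, that `out` already holds the values to keep where the mask is false, and that `src` supplies at least one element per true mask entry.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (assumption, not the library's API).
// Walks the mask with plain indexing: no iterator state, no stride math.
// Returns the number of elements consumed from src.
template <typename T>
std::size_t masked_scatter_contiguous(
    const bool* mask,
    const T* src,
    std::size_t src_size,
    T* out,
    std::size_t n) {
  std::size_t src_idx = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (mask[i]) {
      // Assumed precondition: src has one element per true mask entry.
      assert(src_idx < src_size);
      out[i] = src[src_idx++];
    }
  }
  return src_idx;
}
```

The only per-element state is the running `src_idx`, which is what makes this loop cheaper than a general strided iterator when all three buffers are contiguous.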
Impact
A contiguous CPU fast path speeds up large 1D masked scatter workloads by roughly 1.7x to 2.3x. A local CPU-only benchmark on float32 arrays showed:
- 4M elements, 1% mask density: 5.746 ms -> 2.548 ms (about 2.3x)
- 4M elements, 10% mask density: 8.320 ms -> 4.467 ms (about 1.9x)
- 4M elements, 50% mask density: 23.816 ms -> 14.140 ms (about 1.7x)
Notes
This tracks the scoped contiguous fast path. A more ambitious follow-up could explore chunked prefix counts to parallelize the mask walk further.
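One possible shape for the chunked prefix count idea is sketched here. It is an assumption about how such a follow-up might look, not a proposed implementation: `masked_scatter_chunked` and `chunk_size` are hypothetical names, and the chunk loops run sequentially to keep the example self-contained. The point is that once each chunk knows its starting offset into `src`, the chunks no longer depend on each other and the two chunk loops could be handed to worker threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch (assumption): contiguous buffers, src holds at least
// as many elements as there are true mask entries.
template <typename T>
void masked_scatter_chunked(
    const bool* mask,
    const T* src,
    T* out,
    std::size_t n,
    std::size_t chunk_size) {
  assert(chunk_size > 0);
  const std::size_t num_chunks = (n + chunk_size - 1) / chunk_size;

  // Pass 1: count true mask entries per chunk (chunks are independent).
  std::vector<std::size_t> offsets(num_chunks + 1, 0);
  for (std::size_t c = 0; c < num_chunks; ++c) {
    const std::size_t begin = c * chunk_size;
    const std::size_t end = std::min(n, begin + chunk_size);
    std::size_t count = 0;
    for (std::size_t i = begin; i < end; ++i) {
      count += mask[i] ? 1 : 0;
    }
    offsets[c + 1] = count;
  }

  // Exclusive prefix sum: offsets[c] becomes chunk c's first index into src.
  for (std::size_t c = 0; c < num_chunks; ++c) {
    offsets[c + 1] += offsets[c];
  }

  // Pass 2: each chunk scatters from its own src offset (also independent).
  for (std::size_t c = 0; c < num_chunks; ++c) {
    const std::size_t begin = c * chunk_size;
    const std::size_t end = std::min(n, begin + chunk_size);
    std::size_t src_idx = offsets[c];
    for (std::size_t i = begin; i < end; ++i) {
      if (mask[i]) {
        out[i] = src[src_idx++];
      }
    }
  }
}
```

The trade-off is that the mask is read twice, so this only pays off if the parallel speedup outweighs the extra pass over the mask.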