Optimize CPU masked_scatter for contiguous inputs #3669

@AK-Khan02

Summary

CPU masked_scatter currently routes all cases through ContiguousIterator, including the common case where mask, src, and out are row-contiguous. That adds per-element iterator/stride bookkeeping to a mask walk that can otherwise use direct pointer indexing.
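To make the idea concrete, here is a minimal sketch of what a contiguous fast path could look like for the 1D case. It is not the actual kernel: the function name `masked_scatter_contiguous`, its signature, and the assumption that `out` already holds a copy of the destination values are all hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not the library's kernel.
// Assumes mask, src, and out are dense (unit stride), and that out already
// contains the destination values. Wherever mask[i] is true, the next unread
// src value is written to out[i]. Plain indexing replaces per-element
// iterator/stride bookkeeping.
// Returns the number of src elements consumed.
template <typename T>
std::size_t masked_scatter_contiguous(
    const bool* mask,
    const T* src,
    std::size_t src_size,
    T* out,
    std::size_t n) {
  std::size_t src_idx = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (mask[i]) {
      assert(src_idx < src_size && "src has fewer elements than set mask entries");
      out[i] = src[src_idx++];
    }
  }
  return src_idx;
}
```

The loop carries a single running index into `src`, which is why the walk is cheap but also inherently sequential in this form.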

Impact

A contiguous CPU fast path improves large 1D masked scatter workloads. A local CPU-only benchmark on float32 arrays showed:

  • 4M elements, 1% mask density: 5.746 ms -> 2.548 ms (about 2.3x faster)
  • 4M elements, 10% mask density: 8.320 ms -> 4.467 ms (about 1.9x faster)
  • 4M elements, 50% mask density: 23.816 ms -> 14.140 ms (about 1.7x faster)

Notes

This issue tracks only the contiguous fast path. A more ambitious follow-up could explore chunked prefix counts to parallelize the mask walk further.
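One possible reading of the chunked prefix count idea is sketched below, under the same assumptions as the fast path (dense 1D buffers, `out` pre-filled with the destination values). The name `masked_scatter_chunked` and the `chunk_size` parameter are hypothetical. The loops run serially here to keep the example short; the first and third passes touch disjoint chunks, so they are the parts that could be handed to worker threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a chunked prefix-count masked scatter.
// Assumes dense buffers and that src holds at least as many elements as
// there are set entries in mask.
template <typename T>
void masked_scatter_chunked(
    const bool* mask,
    const T* src,
    T* out,
    std::size_t n,
    std::size_t chunk_size) {
  assert(chunk_size > 0 && "chunk_size must be positive");
  if (n == 0) {
    return;
  }
  const std::size_t num_chunks = (n + chunk_size - 1) / chunk_size;

  // Pass 1: count set mask entries in each chunk (chunks are independent).
  std::vector<std::size_t> offsets(num_chunks, 0);
  for (std::size_t c = 0; c < num_chunks; ++c) {
    const std::size_t begin = c * chunk_size;
    const std::size_t end = std::min(n, begin + chunk_size);
    std::size_t count = 0;
    for (std::size_t i = begin; i < end; ++i) {
      count += mask[i] ? 1 : 0;
    }
    offsets[c] = count;
  }

  // Pass 2: exclusive prefix sum turns per-chunk counts into the src index
  // at which each chunk starts reading.
  std::size_t running = 0;
  for (std::size_t c = 0; c < num_chunks; ++c) {
    const std::size_t count = offsets[c];
    offsets[c] = running;
    running += count;
  }

  // Pass 3: each chunk scatters from its own src offset (chunks are
  // independent again, since their src and out ranges do not overlap).
  for (std::size_t c = 0; c < num_chunks; ++c) {
    const std::size_t begin = c * chunk_size;
    const std::size_t end = std::min(n, begin + chunk_size);
    std::size_t src_idx = offsets[c];
    for (std::size_t i = begin; i < end; ++i) {
      if (mask[i]) {
        out[i] = src[src_idx++];
      }
    }
  }
}
```

The trade-off is that the mask is read twice, so whether this wins over the single-pass walk would depend on thread count and memory bandwidth, and would need benchmarking.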
