Skip to content

feat(ops): reductions with axis & keepdim (#9)#14

Closed
Hayden727 wants to merge 8 commits into
feat/allocator-instrumentationfrom
feat/reductions
Closed

feat(ops): reductions with axis & keepdim (#9)#14
Hayden727 wants to merge 8 commits into
feat/allocator-instrumentationfrom
feat/reductions

Conversation

@Hayden727

Copy link
Copy Markdown
Owner

Summary

Closes #9. Stacked on top of #13 (the allocator-instrumentation prereq); merge #13 first, then GitHub will retarget this PR to master.

Ships every reduction listed in issue #9: sum, mean, prod, max, min, argmax, argmin. Whole-tensor + multi-axis + single-axis-with-indices forms; CPU + CUDA backends; PyTorch-parity tests; CUDA sum benchmark vs cub::DeviceReduce::Sum.

Commit shape (matches issue #9 §6)

  1. feat(ops): ReductionAxes canonicalisation + output-shape helpers — public API surface, OpKey tags, axis canonicalisation, dtype-rule helpers. 13 axis-tests.
  2. feat(ops): CPU sum/mean/prod kernels — generic run_reduction<T,Acc,Op,OutT> template, ReductionPlan (kept-vs-reduced odometer), per-op widening (int*int64, fp32double accumulator). 16 functional tests.
  3. feat(ops): CPU max/min with indices, argmax/argminMaxF::should_replace for first-occurrence-wins + NaN propagation; values-only and values+indices kernels share the iteration plan; argmax / argmin reuse the with-idx kernel with vals_out=nullptr. 12 new functional tests + 9 dtype-rule tests.
  4. feat(ops): CUDA whole-tensor reductions (two-pass tree) — block-level shared-memory reduce → 1024-slot partials buffer → single-block final reduce. fp32 accumulator widened to double to mitigate CPU/CUDA drift. Axis path stubbed (raises DeviceError pointing at commit 5).
  5. feat(ops): CUDA axis reductions (innermost + non-innermost paths) — one thread per output element, walks the reduced subspace via per-thread odometer in registers; same kernel handles innermost and non-innermost reduced axes (transpose-then-reduce optimisation deferred per [Feature] Reductions with axis & keepdim (CPU + CUDA) #9 §7). With-idx kernel parameterised on RecordValue so argmax / argmin reuse the launch path.
  6. chore(test): commit reduction parity fixtures + extend gen_parity.py — 64 new .npy fixtures covering every (op, dtype, shape, dims, keepdim) tuple; tests/parity/reduction_parity_test.cpp runs them through ctorch and asserts within issue [Feature] Reductions with axis & keepdim (CPU + CUDA) #9 §N1 tolerances.
  7. bench: CUDA sum vs cub::DeviceReduce::Sum + docstests/bench/sum_bench.cpp runs 25 trials of dispatch + cub on a 1<<20 fp32 tensor, logs the real ratio, gates at 1.5× (same convention as add_bench); docs/ops.md gains a Reductions section; README checkbox flipped.

Test plan

  • cmake -S . -B build_cpu -DCTORCH_CUDA=OFF && cmake --build build_cpu -j — green.
  • ctest --test-dir build_cpu — 187 passed, 0 failed (one CUDA smoke-test skipped on CPU-only build).
  • New tests: reduction_axis_test (13), reduction_value_test (28), reduction_dtype_test (9), reduction_parity_test (24). All pass on CPU.
  • pre-commit run clean.
  • CI: Linux CUDA build + sum_bench.CudaWithin15PercentOfCub ratio logged.
  • Manual GPU run for [Feature] Reductions with axis & keepdim (CPU + CUDA) #9 §N2 ratio target (1.15× of cub) — out of band.

Notes

Lay the groundwork for sum/mean/prod/max/min/argmax/argmin without
shipping any kernel yet:

  - `<ctorch/ops/reduction.h>` — the public API surface (whole-tensor,
    multi-axis, single-axis-with-indices, and argmax/argmin
    declarations + the `ValuesIndices` struct).
  - `<ctorch/ops/op_keys.h>` — 9 new OpKey tags
    (Sum/Mean/Prod/MaxVal/MinVal/MaxValIdx/MinValIdx/Argmax/Argmin)
    plus their `fn_t` typedefs. `ReductionAxes` is forward-declared
    here and defined in the private header.
  - `src/ops/reduction.h` — `ReductionAxes` (fixed-size POD bounded by
    `kMaxRank`, kernel-friendly), plus `canonicalise`,
    `canonicalise_single`, `reduced_shape*`, and dtype-rule helpers
    (`reduce_sum_prod_dtype`, `require_float_for_mean`,
    `reject_bfloat16`).
  - `src/ops/reduction.cpp` — implementations. Negative axes
    normalise against ndim; duplicates / out-of-range raise
    `ShapeError`; empty `dims` ⇒ collapse every axis (whole-tensor
    form). `keepdim` either drops or replaces reduced dims with 1.
  - `tests/ops/reduction_axis_test.cpp` — 13 cases covering every
    branch (negative, duplicate, out-of-range, empty-dims,
    `keepdim`, single-axis, dtype rules).

Refs #9.
First batch of reduction kernels. Adds the iteration plan, the
op functors, and the CPU implementations.

  - `src/ops/reduction_functors.h` — `SumF` / `ProdF` / `MaxF` / `MinF`
    callable structs; per-`Acc` identity values and `apply()` overloads
    that widen the input element to the accumulator type. Float
    `MaxF`/`MinF` are NaN-propagating (matches PyTorch).
  - `src/ops/reduction_iter.{h,cpp}` — `ReductionPlan` POD that splits
    input shape/stride into a "kept" outer axis set and a "reduced"
    inner axis set. Bounded by `kMaxRank`, kernel-friendly (no heap
    allocation per call).
  - `src/ops/reduction_ops_cpu.cpp` — generic `run_reduction<T,Acc,Op,
    OutT>` template + per-op dispatch ladders for sum/prod (bool/int*
    accumulate to int64 and write back as int64; fp32 accumulates to
    double and writes back as fp32 to mitigate CPU/CUDA drift) and
    mean (float-only; int/bool rejected at the front-door). Static
    registrar wires the three CPU OpKeys.
  - Public free functions for `sum` / `prod` / `mean` (whole-tensor +
    multi-axis forms) live in the same TU.
  - `tests/ops/reduction_value_test.cpp` — 16 functional cases on
    hand-computed inputs: whole-tensor + axis form, keepdim,
    multi-axis, negative axis, dtype promotion (int → int64), empty
    slice (sum→0, prod→1, mean→NaN), bfloat16 rejection, mean's
    integer rejection, and out-of-range axis throws.

Refs #9.
Closes the CPU side of issue 09. Adds the four remaining reduction
families on top of the sum/prod/mean infrastructure from the
previous commit:

  - `MaxF` / `MinF` gain `should_replace<Acc>(cur, v)` — strict
    comparison so ties keep the first occurrence (matches PyTorch's
    argmax tie-breaking), with NaN propagation that lets the first
    NaN's index win for argmax/argmin while sticking once seen.
    `apply()` is rewritten to delegate to `should_replace`, so the
    value-only and value+index kernels share semantics.
  - `run_axis_with_idx_cpu<T, Op>` — single-axis kernel that walks
    the kept-axis subspace via the shared `ReductionPlan` and tracks
    `(best_value, best_idx)` along the reduced axis. `vals_out` is
    nullable so argmax/argmin reuse the same kernel and discard the
    value buffer. The reduction axis must be non-empty (front-door
    guards).
  - Multi-axis / whole-tensor `max_val_cpu` / `min_val_cpu` reuse
    the generic `run_reduction` template with `MaxF` / `MinF` and
    same-dtype accumulator (no widening — `max` preserves the input
    dtype per §F7).
  - Front-doors for `max(x)` / `min(x)` (whole-tensor),
    `max(x, dims, keepdim)` / `min(x, dims, keepdim)` (multi-axis,
    values only), `max(x, dim, keepdim)` / `min(x, dim, keepdim)`
    (single-axis, returns `ValuesIndices`), and
    `argmax(x, dim, keepdim)` / `argmin(x, dim, keepdim)`. All
    reject empty reductions with `ShapeError` ("operation has no
    identity").
  - `tests/ops/reduction_value_test.cpp` extended with 12 cases
    covering whole-tensor / multi-axis / single-axis-with-indices /
    keepdim / argmax tie-breaking / NaN-propagation / empty-slice
    rejection / 0-d-tensor rejection.
  - `tests/ops/reduction_dtype_test.cpp` (new) — black-box dtype
    contract: bool/int* sum→int64, prod→int64, mean rejects integral,
    max/min preserve dtype, argmax/argmin always int64, every op
    rejects bfloat16. 9 cases.

Refs #9.
Wires the CUDA backend for the whole-tensor reduction path. The
axis-reduction path is stubbed with a clear DeviceError pointing at
commit 5.

  - `src/ops/reduction_ops_cuda.cu` — new TU. Two `__global__`
    kernels:
      * `whole_tensor_pass1<T,Acc,Op,OutT>` reads input via
        `load_at_linear` (decodes a linear thread id back to a
        strided input offset using `ReductionPlan.shape_reduced /
        stride_in_reduced`, so non-contiguous tensors work without
        a `contiguous()` materialisation), folds into shared memory,
        does an in-block tree reduction, and writes one partial per
        block.
      * `whole_tensor_pass2` reduces up to 1024 partials with a
        single block.
    A `CudaScratchBuffer` RAII helper grabs the partials buffer from
    the per-device caching allocator (cheap on warm pools).
  - `mean_finalize` / `mean_finalize_nan` 1-thread kernels finish
    `mean` by dividing by `reduced_numel` or writing NaN for
    empty-slice inputs, mirroring the CPU contract.
  - Stub: when `kept_numel != 1` the per-op handler raises
    `DeviceError("axis reductions on CUDA are not yet implemented
    (landing in commit 5)")`. Only `SumOp` / `ProdOp` / `MeanOp` /
    `MaxValOp` / `MinValOp` are registered for CUDA in this commit;
    the with-index and arg variants land in the next commit.
  - `src/CMakeLists.txt` adds the new .cu source under
    `if(CTORCH_CUDA)`.
  - `src/ops/reduction_ops_cpu.cpp` calls
    `ctorch_register_cuda_reduction_ops()` from its registrar
    constructor, using the same linker-anchor trick as
    `binary_ops_cpu.cpp` so static linking pulls the .cu TU in.

Numerical drift mitigation: fp32 sums use a `double` accumulator on
both CPU and CUDA (matches issue #9 §7 risk-table line 1). Tiny
inputs (`numel < kBlockSize * 4`) collapse to one block in pass 1
so the partials buffer never under-utilises (line 2).

Refs #9.
Replaces the commit-4 stub with two new kernel families that handle
every axis-reduction shape on CUDA:

  - `axis_reduce_kernel<T, Acc, Op, OutT>` — one thread per output
    element. Each thread decodes its kept-axis offset by walking
    `ReductionPlan.shape_kept / stride_in_kept`, then folds the
    reduced subspace serially using a per-thread odometer kept in
    registers (kMaxRank entries). Same kernel handles innermost and
    non-innermost reduced axes — issue #9 §7 explicitly marks
    transpose-then-reduce as a follow-up optimisation, so we ship
    correctness now and tune later.
  - `axis_with_idx_kernel<T, Op, RecordValue>` — single-axis kernel
    that tracks `(best_value, best_idx)` along the reduced axis.
    `RecordValue=false` skips the value write, letting argmax /
    argmin reuse the same kernel without a separate launch path.
    First-occurrence-wins tie-breaking via `Op::should_replace`.
  - `mean` finalisation upgraded to a per-output-element kernel
    (`mean_axis_finalize` / `mean_axis_finalize_nan`) so it handles
    both the whole-tensor and axis paths uniformly.
  - `dispatch_reduce_cuda<T, Acc, Op, OutT>` picks the whole-tensor
    or axis kernel based on `ax.kept_numel`. The whole-tensor fast
    path remains in place for the bench.
  - The CUDA registrar now wires `MaxValIdxOp` / `MinValIdxOp` /
    `ArgmaxOp` / `ArgminOp` (in addition to the value-only ops
    landed in commit 4), so every public reduction is dispatched
    on CUDA from this point onward.

Refs #9.
)

Wires the reduction kernels against PyTorch (via NumPy reference) so
CI catches any future drift.

  - `scripts/gen_parity.py` gains four new emitters:
      * `emit_sum_like` — sum / mean / prod (whole-tensor + axis,
        with keepdim).
      * `emit_max_min_values` — multi-axis or whole-tensor max/min
        (values only).
      * `emit_max_min_idx` — single-axis max/min returning both the
        value tensor and the int64 index tensor (PyTorch's
        `(values, indices)` contract).
      * `emit_arg` — argmax / argmin with int64 ref. Includes a
        deliberately tied input that locks in the
        "first occurrence wins" rule (issue 09 risk-table line 5).
    Filenames encode `(op, dtype, shape, dim_tag, kd_tag)` so each
    case maps to a stable prefix.
  - `tests/parity/reduction_parity_test.cpp` (new) — four
    parameterised test suites mirroring the four catalog families;
    fp32 tolerates 1e-5 rel, fp64 1e-12, integer / index reductions
    require exact equality (matches issue 09 §N1).
  - 64 .npy fixtures committed under `tests/parity/fixtures/`. The
    existing `load_npy` already supports every dtype these
    fixtures need (<f4 / <f8 / <i4 / <i8).

Refs #9.
Closes the issue 09 acceptance list:

  - `tests/bench/sum_bench.cpp` — gtest target. Times
    `dispatch::call<SumOp>` and `cub::DeviceReduce::Sum` on a
    1<<20 fp32 CUDA tensor for `kTrials=25` iterations, logs the
    median ratio to stdout in CSV (so CI dashboards see the real
    number), and asserts the dispatch median is at most 1.5x of cub
    — same loose-bound convention as `add_bench` (`add_bench.cpp`
    line 179) so noisy / shared-GPU CI runs don't go red. The §N2
    target is 1.15x; the CSV log captures the actual figure.
  - `tests/bench/reference_cub_sum.{h,cu}` — wraps the
    `cub::DeviceReduce::Sum` call in a plain-C++ entry point. cub
    ships with the CUDA toolkit (CTK 11.0+) so no submodule is
    needed; the .cu TU is dropped from the CPU-only build.
  - `tests/bench/CMakeLists.txt` registers `sum_bench` for both
    CPU-only (no-op skip) and CUDA builds.
  - `docs/ops.md` gains a Reductions section: API surface, dtype
    rules, tie-breaking + NaN propagation contract, and per-dtype
    parity tolerances.
  - `README.md` ticks the Phase-1 reductions checkbox and links
    to [#9].

Refs #9.
Closes the §N3 acceptance criterion that was promised in the planning
doc but not landed in the original 7-commit slice. Uses #13's
`CountingAllocator` + `set_default_allocator` overlay to instrument
the CPU pool for the duration of a `sum` call:

  - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and
    output, dispatches `op::SumOp` directly (skipping the
    front-door's output Storage allocation), and asserts zero
    allocator calls during the kernel.
  - `FrontDoorAllocatesOnlyOutputStorage` — runs the public
    `ctorch::sum(x)` API and asserts exactly one allocation (the
    output Tensor's Storage), which is the upper bound the §N3
    claim implies for the user-visible path.

Both tests pass on CPU. They depend on the prereq PR's allocator
override hook (#13).

Refs #9.
@Hayden727 Hayden727 deleted the branch feat/allocator-instrumentation April 29, 2026 05:28
@Hayden727 Hayden727 closed this Apr 29, 2026
@Hayden727

Copy link
Copy Markdown
Owner Author

Superseded by #16 — rebased onto master after #13 was squash-merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant