feat(ops): reductions with axis & keepdim (#9) by Hayden727 · Pull Request #14 · Hayden727/ctorch

Hayden727 · 2026-04-28T15:07:51Z

Summary

Closes #9. Stacked on top of #13 (the allocator-instrumentation prereq); merge #13 first, then GitHub will retarget this PR to master.

Ships every reduction listed in issue #9: sum, mean, prod, max, min, argmax, argmin. Whole-tensor + multi-axis + single-axis-with-indices forms; CPU + CUDA backends; PyTorch-parity tests; CUDA sum benchmark vs cub::DeviceReduce::Sum.

Commit shape (matches issue #9 §6)

feat(ops): ReductionAxes canonicalisation + output-shape helpers — public API surface, OpKey tags, axis canonicalisation, dtype-rule helpers. 13 axis-tests.
feat(ops): CPU sum/mean/prod kernels — generic run_reduction<T,Acc,Op,OutT> template, ReductionPlan (kept-vs-reduced odometer), per-op widening (int* → int64, fp32 → double accumulator). 16 functional tests.
feat(ops): CPU max/min with indices, argmax/argmin — MaxF::should_replace for first-occurrence-wins + NaN propagation; values-only and values+indices kernels share the iteration plan; argmax / argmin reuse the with-idx kernel with vals_out=nullptr. 12 new functional tests + 9 dtype-rule tests.
feat(ops): CUDA whole-tensor reductions (two-pass tree) — block-level shared-memory reduce → 1024-slot partials buffer → single-block final reduce. fp32 accumulator widened to double to mitigate CPU/CUDA drift. Axis path stubbed (raises DeviceError pointing at commit 5).
feat(ops): CUDA axis reductions (innermost + non-innermost paths) — one thread per output element, walks the reduced subspace via per-thread odometer in registers; same kernel handles innermost and non-innermost reduced axes (transpose-then-reduce optimisation deferred per [Feature] Reductions with axis & keepdim (CPU + CUDA) #9 §7). With-idx kernel parameterised on RecordValue so argmax / argmin reuse the launch path.
chore(test): commit reduction parity fixtures + extend gen_parity.py — 64 new .npy fixtures covering every (op, dtype, shape, dims, keepdim) tuple; tests/parity/reduction_parity_test.cpp runs them through ctorch and asserts within issue [Feature] Reductions with axis & keepdim (CPU + CUDA) #9 §N1 tolerances.
bench: CUDA sum vs cub::DeviceReduce::Sum + docs — tests/bench/sum_bench.cpp runs 25 trials of dispatch + cub on a 1<<20 fp32 tensor, logs the real ratio, gates at 1.5× (same convention as add_bench); docs/ops.md gains a Reductions section; README checkbox flipped.

Test plan

cmake -S . -B build_cpu -DCTORCH_CUDA=OFF && cmake --build build_cpu -j — green.
ctest --test-dir build_cpu — 187 passed, 0 failed (one CUDA smoke-test skipped on CPU-only build).
New tests: reduction_axis_test (13), reduction_value_test (28), reduction_dtype_test (9), reduction_parity_test (24). All pass on CPU.
pre-commit run clean.
CI: Linux CUDA build + sum_bench.CudaWithin15PercentOfCub ratio logged.
Manual GPU run for [Feature] Reductions with axis & keepdim (CPU + CUDA) #9 §N2 ratio target (1.15× of cub) — out of band.

Notes

The CUDA axis kernel uses the simple "one thread per output element" approach for both innermost and non-innermost reduced axes. Per [Feature] Reductions with axis & keepdim (CPU + CUDA) #9 §7 this is acceptable for the correctness milestone; an innermost-axis specialisation (one block per output element + warp-shuffle reduce) is a follow-up.
CPU sum no-heap-alloc verification (issue [Feature] Reductions with axis & keepdim (CPU + CUDA) #9 §N3) uses feat(allocator): pluggable default-allocator override + counting wrapper (#12) #13's CountingAllocator + set_default_allocator overlay. Test code is staged for the next iteration once feat(allocator): pluggable default-allocator override + counting wrapper (#12) #13 lands.

Lay the groundwork for sum/mean/prod/max/min/argmax/argmin without shipping any kernel yet: - `<ctorch/ops/reduction.h>` — the public API surface (whole-tensor, multi-axis, single-axis-with-indices, and argmax/argmin declarations + the `ValuesIndices` struct). - `<ctorch/ops/op_keys.h>` — 9 new OpKey tags (Sum/Mean/Prod/MaxVal/MinVal/MaxValIdx/MinValIdx/Argmax/Argmin) plus their `fn_t` typedefs. `ReductionAxes` is forward-declared here and defined in the private header. - `src/ops/reduction.h` — `ReductionAxes` (fixed-size POD bounded by `kMaxRank`, kernel-friendly), plus `canonicalise`, `canonicalise_single`, `reduced_shape*`, and dtype-rule helpers (`reduce_sum_prod_dtype`, `require_float_for_mean`, `reject_bfloat16`). - `src/ops/reduction.cpp` — implementations. Negative axes normalise against ndim; duplicates / out-of-range raise `ShapeError`; empty `dims` ⇒ collapse every axis (whole-tensor form). `keepdim` either drops or replaces reduced dims with 1. - `tests/ops/reduction_axis_test.cpp` — 13 cases covering every branch (negative, duplicate, out-of-range, empty-dims, `keepdim`, single-axis, dtype rules). Refs #9.

First batch of reduction kernels. Adds the iteration plan, the op functors, and the CPU implementations. - `src/ops/reduction_functors.h` — `SumF` / `ProdF` / `MaxF` / `MinF` callable structs; per-`Acc` identity values and `apply()` overloads that widen the input element to the accumulator type. Float `MaxF`/`MinF` are NaN-propagating (matches PyTorch). - `src/ops/reduction_iter.{h,cpp}` — `ReductionPlan` POD that splits input shape/stride into a "kept" outer axis set and a "reduced" inner axis set. Bounded by `kMaxRank`, kernel-friendly (no heap allocation per call). - `src/ops/reduction_ops_cpu.cpp` — generic `run_reduction<T,Acc,Op, OutT>` template + per-op dispatch ladders for sum/prod (bool/int* accumulate to int64 and write back as int64; fp32 accumulates to double and writes back as fp32 to mitigate CPU/CUDA drift) and mean (float-only; int/bool rejected at the front-door). Static registrar wires the three CPU OpKeys. - Public free functions for `sum` / `prod` / `mean` (whole-tensor + multi-axis forms) live in the same TU. - `tests/ops/reduction_value_test.cpp` — 16 functional cases on hand-computed inputs: whole-tensor + axis form, keepdim, multi-axis, negative axis, dtype promotion (int → int64), empty slice (sum→0, prod→1, mean→NaN), bfloat16 rejection, mean's integer rejection, and out-of-range axis throws. Refs #9.

Closes the CPU side of issue 09. Adds the four remaining reduction families on top of the sum/prod/mean infrastructure from the previous commit: - `MaxF` / `MinF` gain `should_replace<Acc>(cur, v)` — strict comparison so ties keep the first occurrence (matches PyTorch's argmax tie-breaking), with NaN propagation that lets the first NaN's index win for argmax/argmin while sticking once seen. `apply()` is rewritten to delegate to `should_replace`, so the value-only and value+index kernels share semantics. - `run_axis_with_idx_cpu<T, Op>` — single-axis kernel that walks the kept-axis subspace via the shared `ReductionPlan` and tracks `(best_value, best_idx)` along the reduced axis. `vals_out` is nullable so argmax/argmin reuse the same kernel and discard the value buffer. The reduction axis must be non-empty (front-door guards). - Multi-axis / whole-tensor `max_val_cpu` / `min_val_cpu` reuse the generic `run_reduction` template with `MaxF` / `MinF` and same-dtype accumulator (no widening — `max` preserves the input dtype per §F7). - Front-doors for `max(x)` / `min(x)` (whole-tensor), `max(x, dims, keepdim)` / `min(x, dims, keepdim)` (multi-axis, values only), `max(x, dim, keepdim)` / `min(x, dim, keepdim)` (single-axis, returns `ValuesIndices`), and `argmax(x, dim, keepdim)` / `argmin(x, dim, keepdim)`. All reject empty reductions with `ShapeError` ("operation has no identity"). - `tests/ops/reduction_value_test.cpp` extended with 12 cases covering whole-tensor / multi-axis / single-axis-with-indices / keepdim / argmax tie-breaking / NaN-propagation / empty-slice rejection / 0-d-tensor rejection. - `tests/ops/reduction_dtype_test.cpp` (new) — black-box dtype contract: bool/int* sum→int64, prod→int64, mean rejects integral, max/min preserve dtype, argmax/argmin always int64, every op rejects bfloat16. 9 cases. Refs #9.

Wires the CUDA backend for the whole-tensor reduction path. The axis-reduction path is stubbed with a clear DeviceError pointing at commit 5. - `src/ops/reduction_ops_cuda.cu` — new TU. Two `__global__` kernels: * `whole_tensor_pass1<T,Acc,Op,OutT>` reads input via `load_at_linear` (decodes a linear thread id back to a strided input offset using `ReductionPlan.shape_reduced / stride_in_reduced`, so non-contiguous tensors work without a `contiguous()` materialisation), folds into shared memory, does an in-block tree reduction, and writes one partial per block. * `whole_tensor_pass2` reduces up to 1024 partials with a single block. A `CudaScratchBuffer` RAII helper grabs the partials buffer from the per-device caching allocator (cheap on warm pools). - `mean_finalize` / `mean_finalize_nan` 1-thread kernels finish `mean` by dividing by `reduced_numel` or writing NaN for empty-slice inputs, mirroring the CPU contract. - Stub: when `kept_numel != 1` the per-op handler raises `DeviceError("axis reductions on CUDA are not yet implemented (landing in commit 5)")`. Only `SumOp` / `ProdOp` / `MeanOp` / `MaxValOp` / `MinValOp` are registered for CUDA in this commit; the with-index and arg variants land in the next commit. - `src/CMakeLists.txt` adds the new .cu source under `if(CTORCH_CUDA)`. - `src/ops/reduction_ops_cpu.cpp` calls `ctorch_register_cuda_reduction_ops()` from its registrar constructor, using the same linker-anchor trick as `binary_ops_cpu.cpp` so static linking pulls the .cu TU in. Numerical drift mitigation: fp32 sums use a `double` accumulator on both CPU and CUDA (matches issue #9 §7 risk-table line 1). Tiny inputs (`numel < kBlockSize * 4`) collapse to one block in pass 1 so the partials buffer never under-utilises (line 2). Refs #9.

Replaces the commit-4 stub with two new kernel families that handle every axis-reduction shape on CUDA: - `axis_reduce_kernel<T, Acc, Op, OutT>` — one thread per output element. Each thread decodes its kept-axis offset by walking `ReductionPlan.shape_kept / stride_in_kept`, then folds the reduced subspace serially using a per-thread odometer kept in registers (kMaxRank entries). Same kernel handles innermost and non-innermost reduced axes — issue #9 §7 explicitly marks transpose-then-reduce as a follow-up optimisation, so we ship correctness now and tune later. - `axis_with_idx_kernel<T, Op, RecordValue>` — single-axis kernel that tracks `(best_value, best_idx)` along the reduced axis. `RecordValue=false` skips the value write, letting argmax / argmin reuse the same kernel without a separate launch path. First-occurrence-wins tie-breaking via `Op::should_replace`. - `mean` finalisation upgraded to a per-output-element kernel (`mean_axis_finalize` / `mean_axis_finalize_nan`) so it handles both the whole-tensor and axis paths uniformly. - `dispatch_reduce_cuda<T, Acc, Op, OutT>` picks the whole-tensor or axis kernel based on `ax.kept_numel`. The whole-tensor fast path remains in place for the bench. - The CUDA registrar now wires `MaxValIdxOp` / `MinValIdxOp` / `ArgmaxOp` / `ArgminOp` (in addition to the value-only ops landed in commit 4), so every public reduction is dispatched on CUDA from this point onward. Refs #9.

) Wires the reduction kernels against PyTorch (via NumPy reference) so CI catches any future drift. - `scripts/gen_parity.py` gains four new emitters: * `emit_sum_like` — sum / mean / prod (whole-tensor + axis, with keepdim). * `emit_max_min_values` — multi-axis or whole-tensor max/min (values only). * `emit_max_min_idx` — single-axis max/min returning both the value tensor and the int64 index tensor (PyTorch's `(values, indices)` contract). * `emit_arg` — argmax / argmin with int64 ref. Includes a deliberately tied input that locks in the "first occurrence wins" rule (issue 09 risk-table line 5). Filenames encode `(op, dtype, shape, dim_tag, kd_tag)` so each case maps to a stable prefix. - `tests/parity/reduction_parity_test.cpp` (new) — four parameterised test suites mirroring the four catalog families; fp32 tolerates 1e-5 rel, fp64 1e-12, integer / index reductions require exact equality (matches issue 09 §N1). - 64 .npy fixtures committed under `tests/parity/fixtures/`. The existing `load_npy` already supports every dtype these fixtures need (<f4 / <f8 / <i4 / <i8). Refs #9.

Closes the issue 09 acceptance list: - `tests/bench/sum_bench.cpp` — gtest target. Times `dispatch::call<SumOp>` and `cub::DeviceReduce::Sum` on a 1<<20 fp32 CUDA tensor for `kTrials=25` iterations, logs the median ratio to stdout in CSV (so CI dashboards see the real number), and asserts the dispatch median is at most 1.5x of cub — same loose-bound convention as `add_bench` (`add_bench.cpp` line 179) so noisy / shared-GPU CI runs don't go red. The §N2 target is 1.15x; the CSV log captures the actual figure. - `tests/bench/reference_cub_sum.{h,cu}` — wraps the `cub::DeviceReduce::Sum` call in a plain-C++ entry point. cub ships with the CUDA toolkit (CTK 11.0+) so no submodule is needed; the .cu TU is dropped from the CPU-only build. - `tests/bench/CMakeLists.txt` registers `sum_bench` for both CPU-only (no-op skip) and CUDA builds. - `docs/ops.md` gains a Reductions section: API surface, dtype rules, tie-breaking + NaN propagation contract, and per-dtype parity tolerances. - `README.md` ticks the Phase-1 reductions checkbox and links to [#9]. Refs #9.

Closes the §N3 acceptance criterion that was promised in the planning doc but not landed in the original 7-commit slice. Uses #13's `CountingAllocator` + `set_default_allocator` overlay to instrument the CPU pool for the duration of a `sum` call: - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and output, dispatches `op::SumOp` directly (skipping the front-door's output Storage allocation), and asserts zero allocator calls during the kernel. - `FrontDoorAllocatesOnlyOutputStorage` — runs the public `ctorch::sum(x)` API and asserts exactly one allocation (the output Tensor's Storage), which is the upper bound the §N3 claim implies for the user-visible path. Both tests pass on CPU. They depend on the prereq PR's allocator override hook (#13). Refs #9.

Hayden727 · 2026-04-29T05:30:39Z

Superseded by #16 — rebased onto master after #13 was squash-merged.

Hayden727 added 8 commits April 29, 2026 10:43

Hayden727 force-pushed the feat/reductions branch from be9f35d to fd0ecc2 Compare April 29, 2026 02:44

Hayden727 deleted the branch feat/allocator-instrumentation April 29, 2026 05:28

Hayden727 closed this Apr 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ops): reductions with axis & keepdim (#9)#14

feat(ops): reductions with axis & keepdim (#9)#14
Hayden727 wants to merge 8 commits into
feat/allocator-instrumentationfrom
feat/reductions

Hayden727 commented Apr 28, 2026

Uh oh!

Hayden727 commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Hayden727 commented Apr 28, 2026

Summary

Commit shape (matches issue #9 §6)

Test plan

Notes

Uh oh!

Hayden727 commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant