feat(ops): reductions with axis & keepdim (#9)#14
Closed
Hayden727 wants to merge 8 commits into
Closed
Conversation
Lay the groundwork for sum/mean/prod/max/min/argmax/argmin without
shipping any kernel yet:
- `<ctorch/ops/reduction.h>` — the public API surface (whole-tensor,
multi-axis, single-axis-with-indices, and argmax/argmin
declarations + the `ValuesIndices` struct).
- `<ctorch/ops/op_keys.h>` — 9 new OpKey tags
(Sum/Mean/Prod/MaxVal/MinVal/MaxValIdx/MinValIdx/Argmax/Argmin)
plus their `fn_t` typedefs. `ReductionAxes` is forward-declared
here and defined in the private header.
- `src/ops/reduction.h` — `ReductionAxes` (fixed-size POD bounded by
`kMaxRank`, kernel-friendly), plus `canonicalise`,
`canonicalise_single`, `reduced_shape*`, and dtype-rule helpers
(`reduce_sum_prod_dtype`, `require_float_for_mean`,
`reject_bfloat16`).
- `src/ops/reduction.cpp` — implementations. Negative axes
normalise against ndim; duplicates / out-of-range raise
`ShapeError`; empty `dims` ⇒ collapse every axis (whole-tensor
form). `keepdim` either drops or replaces reduced dims with 1.
- `tests/ops/reduction_axis_test.cpp` — 13 cases covering every
branch (negative, duplicate, out-of-range, empty-dims,
`keepdim`, single-axis, dtype rules).
Refs #9.
First batch of reduction kernels. Adds the iteration plan, the
op functors, and the CPU implementations.
- `src/ops/reduction_functors.h` — `SumF` / `ProdF` / `MaxF` / `MinF`
callable structs; per-`Acc` identity values and `apply()` overloads
that widen the input element to the accumulator type. Float
`MaxF`/`MinF` are NaN-propagating (matches PyTorch).
- `src/ops/reduction_iter.{h,cpp}` — `ReductionPlan` POD that splits
input shape/stride into a "kept" outer axis set and a "reduced"
inner axis set. Bounded by `kMaxRank`, kernel-friendly (no heap
allocation per call).
- `src/ops/reduction_ops_cpu.cpp` — generic `run_reduction<T,Acc,Op,
OutT>` template + per-op dispatch ladders for sum/prod (bool/int*
accumulate to int64 and write back as int64; fp32 accumulates to
double and writes back as fp32 to mitigate CPU/CUDA drift) and
mean (float-only; int/bool rejected at the front-door). Static
registrar wires the three CPU OpKeys.
- Public free functions for `sum` / `prod` / `mean` (whole-tensor +
multi-axis forms) live in the same TU.
- `tests/ops/reduction_value_test.cpp` — 16 functional cases on
hand-computed inputs: whole-tensor + axis form, keepdim,
multi-axis, negative axis, dtype promotion (int → int64), empty
slice (sum→0, prod→1, mean→NaN), bfloat16 rejection, mean's
integer rejection, and out-of-range axis throws.
Refs #9.
Closes the CPU side of issue 09. Adds the four remaining reduction
families on top of the sum/prod/mean infrastructure from the
previous commit:
- `MaxF` / `MinF` gain `should_replace<Acc>(cur, v)` — strict
comparison so ties keep the first occurrence (matches PyTorch's
argmax tie-breaking), with NaN propagation that lets the first
NaN's index win for argmax/argmin while sticking once seen.
`apply()` is rewritten to delegate to `should_replace`, so the
value-only and value+index kernels share semantics.
- `run_axis_with_idx_cpu<T, Op>` — single-axis kernel that walks
the kept-axis subspace via the shared `ReductionPlan` and tracks
`(best_value, best_idx)` along the reduced axis. `vals_out` is
nullable so argmax/argmin reuse the same kernel and discard the
value buffer. The reduction axis must be non-empty (front-door
guards).
- Multi-axis / whole-tensor `max_val_cpu` / `min_val_cpu` reuse
the generic `run_reduction` template with `MaxF` / `MinF` and
same-dtype accumulator (no widening — `max` preserves the input
dtype per §F7).
- Front-doors for `max(x)` / `min(x)` (whole-tensor),
`max(x, dims, keepdim)` / `min(x, dims, keepdim)` (multi-axis,
values only), `max(x, dim, keepdim)` / `min(x, dim, keepdim)`
(single-axis, returns `ValuesIndices`), and
`argmax(x, dim, keepdim)` / `argmin(x, dim, keepdim)`. All
reject empty reductions with `ShapeError` ("operation has no
identity").
- `tests/ops/reduction_value_test.cpp` extended with 12 cases
covering whole-tensor / multi-axis / single-axis-with-indices /
keepdim / argmax tie-breaking / NaN-propagation / empty-slice
rejection / 0-d-tensor rejection.
- `tests/ops/reduction_dtype_test.cpp` (new) — black-box dtype
contract: bool/int* sum→int64, prod→int64, mean rejects integral,
max/min preserve dtype, argmax/argmin always int64, every op
rejects bfloat16. 9 cases.
Refs #9.
Wires the CUDA backend for the whole-tensor reduction path. The
axis-reduction path is stubbed with a clear DeviceError pointing at
commit 5.
- `src/ops/reduction_ops_cuda.cu` — new TU. Two `__global__`
kernels:
* `whole_tensor_pass1<T,Acc,Op,OutT>` reads input via
`load_at_linear` (decodes a linear thread id back to a
strided input offset using `ReductionPlan.shape_reduced /
stride_in_reduced`, so non-contiguous tensors work without
a `contiguous()` materialisation), folds into shared memory,
does an in-block tree reduction, and writes one partial per
block.
* `whole_tensor_pass2` reduces up to 1024 partials with a
single block.
A `CudaScratchBuffer` RAII helper grabs the partials buffer from
the per-device caching allocator (cheap on warm pools).
- `mean_finalize` / `mean_finalize_nan` 1-thread kernels finish
`mean` by dividing by `reduced_numel` or writing NaN for
empty-slice inputs, mirroring the CPU contract.
- Stub: when `kept_numel != 1` the per-op handler raises
`DeviceError("axis reductions on CUDA are not yet implemented
(landing in commit 5)")`. Only `SumOp` / `ProdOp` / `MeanOp` /
`MaxValOp` / `MinValOp` are registered for CUDA in this commit;
the with-index and arg variants land in the next commit.
- `src/CMakeLists.txt` adds the new .cu source under
`if(CTORCH_CUDA)`.
- `src/ops/reduction_ops_cpu.cpp` calls
`ctorch_register_cuda_reduction_ops()` from its registrar
constructor, using the same linker-anchor trick as
`binary_ops_cpu.cpp` so static linking pulls the .cu TU in.
Numerical drift mitigation: fp32 sums use a `double` accumulator on
both CPU and CUDA (matches issue #9 §7 risk-table line 1). Tiny
inputs (`numel < kBlockSize * 4`) collapse to one block in pass 1
so the partials buffer never under-utilises (line 2).
Refs #9.
Replaces the commit-4 stub with two new kernel families that handle
every axis-reduction shape on CUDA:
- `axis_reduce_kernel<T, Acc, Op, OutT>` — one thread per output
element. Each thread decodes its kept-axis offset by walking
`ReductionPlan.shape_kept / stride_in_kept`, then folds the
reduced subspace serially using a per-thread odometer kept in
registers (kMaxRank entries). Same kernel handles innermost and
non-innermost reduced axes — issue #9 §7 explicitly marks
transpose-then-reduce as a follow-up optimisation, so we ship
correctness now and tune later.
- `axis_with_idx_kernel<T, Op, RecordValue>` — single-axis kernel
that tracks `(best_value, best_idx)` along the reduced axis.
`RecordValue=false` skips the value write, letting argmax /
argmin reuse the same kernel without a separate launch path.
First-occurrence-wins tie-breaking via `Op::should_replace`.
- `mean` finalisation upgraded to a per-output-element kernel
(`mean_axis_finalize` / `mean_axis_finalize_nan`) so it handles
both the whole-tensor and axis paths uniformly.
- `dispatch_reduce_cuda<T, Acc, Op, OutT>` picks the whole-tensor
or axis kernel based on `ax.kept_numel`. The whole-tensor fast
path remains in place for the bench.
- The CUDA registrar now wires `MaxValIdxOp` / `MinValIdxOp` /
`ArgmaxOp` / `ArgminOp` (in addition to the value-only ops
landed in commit 4), so every public reduction is dispatched
on CUDA from this point onward.
Refs #9.
) Wires the reduction kernels against PyTorch (via NumPy reference) so CI catches any future drift. - `scripts/gen_parity.py` gains four new emitters: * `emit_sum_like` — sum / mean / prod (whole-tensor + axis, with keepdim). * `emit_max_min_values` — multi-axis or whole-tensor max/min (values only). * `emit_max_min_idx` — single-axis max/min returning both the value tensor and the int64 index tensor (PyTorch's `(values, indices)` contract). * `emit_arg` — argmax / argmin with int64 ref. Includes a deliberately tied input that locks in the "first occurrence wins" rule (issue 09 risk-table line 5). Filenames encode `(op, dtype, shape, dim_tag, kd_tag)` so each case maps to a stable prefix. - `tests/parity/reduction_parity_test.cpp` (new) — four parameterised test suites mirroring the four catalog families; fp32 tolerates 1e-5 rel, fp64 1e-12, integer / index reductions require exact equality (matches issue 09 §N1). - 64 .npy fixtures committed under `tests/parity/fixtures/`. The existing `load_npy` already supports every dtype these fixtures need (<f4 / <f8 / <i4 / <i8). Refs #9.
Closes the issue 09 acceptance list:
- `tests/bench/sum_bench.cpp` — gtest target. Times
`dispatch::call<SumOp>` and `cub::DeviceReduce::Sum` on a
1<<20 fp32 CUDA tensor for `kTrials=25` iterations, logs the
median ratio to stdout in CSV (so CI dashboards see the real
number), and asserts the dispatch median is at most 1.5x of cub
— same loose-bound convention as `add_bench` (`add_bench.cpp`
line 179) so noisy / shared-GPU CI runs don't go red. The §N2
target is 1.15x; the CSV log captures the actual figure.
- `tests/bench/reference_cub_sum.{h,cu}` — wraps the
`cub::DeviceReduce::Sum` call in a plain-C++ entry point. cub
ships with the CUDA toolkit (CTK 11.0+) so no submodule is
needed; the .cu TU is dropped from the CPU-only build.
- `tests/bench/CMakeLists.txt` registers `sum_bench` for both
CPU-only (no-op skip) and CUDA builds.
- `docs/ops.md` gains a Reductions section: API surface, dtype
rules, tie-breaking + NaN propagation contract, and per-dtype
parity tolerances.
- `README.md` ticks the Phase-1 reductions checkbox and links
to [#9].
Refs #9.
Closes the §N3 acceptance criterion that was promised in the planning doc but not landed in the original 7-commit slice. Uses #13's `CountingAllocator` + `set_default_allocator` overlay to instrument the CPU pool for the duration of a `sum` call: - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and output, dispatches `op::SumOp` directly (skipping the front-door's output Storage allocation), and asserts zero allocator calls during the kernel. - `FrontDoorAllocatesOnlyOutputStorage` — runs the public `ctorch::sum(x)` API and asserts exactly one allocation (the output Tensor's Storage), which is the upper bound the §N3 claim implies for the user-visible path. Both tests pass on CPU. They depend on the prereq PR's allocator override hook (#13). Refs #9.
be9f35d to
fd0ecc2
Compare
Owner
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #9. Stacked on top of #13 (the allocator-instrumentation prereq); merge #13 first, then GitHub will retarget this PR to master.
Ships every reduction listed in issue #9:
sum,mean,prod,max,min,argmax,argmin. Whole-tensor + multi-axis + single-axis-with-indices forms; CPU + CUDA backends; PyTorch-parity tests; CUDAsumbenchmark vscub::DeviceReduce::Sum.Commit shape (matches issue #9 §6)
feat(ops): ReductionAxes canonicalisation + output-shape helpers— public API surface, OpKey tags, axis canonicalisation, dtype-rule helpers. 13 axis-tests.feat(ops): CPU sum/mean/prod kernels— genericrun_reduction<T,Acc,Op,OutT>template,ReductionPlan(kept-vs-reduced odometer), per-op widening (int*→int64,fp32→doubleaccumulator). 16 functional tests.feat(ops): CPU max/min with indices, argmax/argmin—MaxF::should_replacefor first-occurrence-wins + NaN propagation; values-only and values+indices kernels share the iteration plan; argmax / argmin reuse the with-idx kernel withvals_out=nullptr. 12 new functional tests + 9 dtype-rule tests.feat(ops): CUDA whole-tensor reductions (two-pass tree)— block-level shared-memory reduce → 1024-slot partials buffer → single-block final reduce. fp32 accumulator widened todoubleto mitigate CPU/CUDA drift. Axis path stubbed (raisesDeviceErrorpointing at commit 5).feat(ops): CUDA axis reductions (innermost + non-innermost paths)— one thread per output element, walks the reduced subspace via per-thread odometer in registers; same kernel handles innermost and non-innermost reduced axes (transpose-then-reduce optimisation deferred per [Feature] Reductions with axis & keepdim (CPU + CUDA) #9 §7). With-idx kernel parameterised onRecordValueso argmax / argmin reuse the launch path.chore(test): commit reduction parity fixtures + extend gen_parity.py— 64 new.npyfixtures covering every (op, dtype, shape, dims, keepdim) tuple;tests/parity/reduction_parity_test.cppruns them through ctorch and asserts within issue [Feature] Reductions with axis & keepdim (CPU + CUDA) #9 §N1 tolerances.bench: CUDA sum vs cub::DeviceReduce::Sum + docs—tests/bench/sum_bench.cppruns 25 trials of dispatch + cub on a 1<<20 fp32 tensor, logs the real ratio, gates at 1.5× (same convention asadd_bench);docs/ops.mdgains a Reductions section; README checkbox flipped.Test plan
cmake -S . -B build_cpu -DCTORCH_CUDA=OFF && cmake --build build_cpu -j— green.ctest --test-dir build_cpu— 187 passed, 0 failed (one CUDA smoke-test skipped on CPU-only build).reduction_axis_test(13),reduction_value_test(28),reduction_dtype_test(9),reduction_parity_test(24). All pass on CPU.pre-commit runclean.sum_bench.CudaWithin15PercentOfCubratio logged.Notes
sumno-heap-alloc verification (issue [Feature] Reductions with axis & keepdim (CPU + CUDA) #9 §N3) uses feat(allocator): pluggable default-allocator override + counting wrapper (#12) #13'sCountingAllocator+set_default_allocatoroverlay. Test code is staged for the next iteration once feat(allocator): pluggable default-allocator override + counting wrapper (#12) #13 lands.