feat(allocator): pluggable default-allocator override + counting wrapper (#12)#13
Conversation
Add `set_default_allocator(Device, Allocator*)` plus a per-Kind overlay slot consulted by `default_allocator(Device)`. The hot path costs one relaxed atomic load when no override is installed, so existing callers see no measurable change. Tests and instrumentation can now swap in a custom allocator for the duration of a scope — this is the testability hook Issue #9 §N3 needs to verify the no-heap-alloc property of the CPU `sum` kernel. Refs #12.
`CountingAllocator` wraps any base `Allocator*` and exposes atomic `alloc_calls()` / `dealloc_calls()` / `live_bytes()` counters. It mirrors the CPU pool's "0-byte allocate returns nullptr" contract so the wrapper does not invent allocations the base does not do. Combined with #12's overlay slot, tests can now assert exact allocation counts around any block of ctorch code: ctorch::CountingAllocator c(default_allocator(Device::cpu())); auto* prev = set_default_allocator(Device::cpu(), &c); ... set_default_allocator(Device::cpu(), prev); The new `counting_allocator_test` covers tally correctness, the zero-byte-not-counted contract, overlay install/restore, end-to-end counting around a `Tensor` allocation, and concurrent access. Closes #12.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 50682c5224
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| std::atomic<Allocator*>& override_slot(Device::Kind kind) { | ||
| static std::atomic<Allocator*> slots[kNumDeviceKinds]{}; | ||
| return slots[static_cast<std::size_t>(kind)]; |
There was a problem hiding this comment.
Validate kind before indexing override slots
override_slot indexes a fixed-size array using static_cast<std::size_t>(kind) without checking that kind is a valid Device::Kind value. If a caller passes a malformed Device (e.g., via deserialization/FFI with an out-of-range enum), both default_allocator and set_default_allocator now perform out-of-bounds access and trigger undefined behavior instead of the previous invalid_argument path. Please guard kind (or route through a checked switch) before indexing the slot table.
Useful? React with 👍 / 👎.
Closes the §N3 acceptance criterion that was promised in the planning doc but not landed in the original 7-commit slice. Uses #13's `CountingAllocator` + `set_default_allocator` overlay to instrument the CPU pool for the duration of a `sum` call: - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and output, dispatches `op::SumOp` directly (skipping the front-door's output Storage allocation), and asserts zero allocator calls during the kernel. - `FrontDoorAllocatesOnlyOutputStorage` — runs the public `ctorch::sum(x)` API and asserts exactly one allocation (the output Tensor's Storage), which is the upper bound the §N3 claim implies for the user-visible path. Both tests pass on CPU. They depend on the prereq PR's allocator override hook (#13). Refs #9.
…ot (#12) Codex flagged a P1 on PR #13: `override_slot()` indexed a fixed-size slot table via `static_cast<std::size_t>(kind)` without validating that `kind` is one of the declared enumerators. `enum class` is not a closed set in C++ — a malformed value can arrive via memcpy / FFI / out-of-range cast — and the new lookup is called *before* the switch that previously caught unknown kinds, so the regression is real: both `default_allocator` and `set_default_allocator` would read out-of-bounds and trigger UB instead of throwing `invalid_argument`. Fix: guard the cast inside `override_slot()`. If the index is past `kNumDeviceKinds`, throw `std::invalid_argument`, mirroring the existing terminal throw at the bottom of `default_allocator()`. Adds `RejectsOutOfRangeDeviceKind` to `counting_allocator_test`, which constructs a deliberately invalid Kind via a numeric cast and asserts both entry points throw.
Closes the §N3 acceptance criterion that was promised in the planning doc but not landed in the original 7-commit slice. Uses #13's `CountingAllocator` + `set_default_allocator` overlay to instrument the CPU pool for the duration of a `sum` call: - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and output, dispatches `op::SumOp` directly (skipping the front-door's output Storage allocation), and asserts zero allocator calls during the kernel. - `FrontDoorAllocatesOnlyOutputStorage` — runs the public `ctorch::sum(x)` API and asserts exactly one allocation (the output Tensor's Storage), which is the upper bound the §N3 claim implies for the user-visible path. Both tests pass on CPU. They depend on the prereq PR's allocator override hook (#13). Refs #9.
|
@codex review. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 04415914ee
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| #endif // CTORCH_HAS_CUDA | ||
|
|
||
| Allocator* default_allocator(Device device) { | ||
| if (auto* override = override_slot(device.kind).load(std::memory_order_relaxed)) { |
There was a problem hiding this comment.
Load allocator override with acquire semantics
set_default_allocator publishes an allocator pointer with exchange(..., std::memory_order_acq_rel), but default_allocator reads it with a relaxed load. In a multithreaded test/instrumentation setup, a thread can install a freshly constructed override while another thread immediately consumes it, and the relaxed load does not establish a synchronizes-with edge for the override object's non-atomic state (for example CountingAllocator::base_). That means the consumer can observe partially published state and dereference invalid data. Use an acquire load here so publication via set_default_allocator is actually visible to readers.
Useful? React with 👍 / 👎.
The override slot's publishing exchange uses acq_rel, but the consuming load was relaxed. A relaxed load does not synchronize-with the release side, so a reader could see the new pointer while still observing stale or uninitialized bytes for the override object's non-atomic members (e.g. CountingAllocator::base_). Switch the load to acquire so the publication of set_default_allocator is actually visible to readers. Addresses Codex review on PR #13.
|
@codex review. |
|
Codex Review: Didn't find any major issues. Delightful! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Closes the §N3 acceptance criterion that was promised in the planning doc but not landed in the original 7-commit slice. Uses #13's `CountingAllocator` + `set_default_allocator` overlay to instrument the CPU pool for the duration of a `sum` call: - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and output, dispatches `op::SumOp` directly (skipping the front-door's output Storage allocation), and asserts zero allocator calls during the kernel. - `FrontDoorAllocatesOnlyOutputStorage` — runs the public `ctorch::sum(x)` API and asserts exactly one allocation (the output Tensor's Storage), which is the upper bound the §N3 claim implies for the user-visible path. Both tests pass on CPU. They depend on the prereq PR's allocator override hook (#13). Refs #9.
* feat(ops): ReductionAxes canonicalisation + output-shape helpers (#9) Lay the groundwork for sum/mean/prod/max/min/argmax/argmin without shipping any kernel yet: - `<ctorch/ops/reduction.h>` — the public API surface (whole-tensor, multi-axis, single-axis-with-indices, and argmax/argmin declarations + the `ValuesIndices` struct). - `<ctorch/ops/op_keys.h>` — 9 new OpKey tags (Sum/Mean/Prod/MaxVal/MinVal/MaxValIdx/MinValIdx/Argmax/Argmin) plus their `fn_t` typedefs. `ReductionAxes` is forward-declared here and defined in the private header. - `src/ops/reduction.h` — `ReductionAxes` (fixed-size POD bounded by `kMaxRank`, kernel-friendly), plus `canonicalise`, `canonicalise_single`, `reduced_shape*`, and dtype-rule helpers (`reduce_sum_prod_dtype`, `require_float_for_mean`, `reject_bfloat16`). - `src/ops/reduction.cpp` — implementations. Negative axes normalise against ndim; duplicates / out-of-range raise `ShapeError`; empty `dims` ⇒ collapse every axis (whole-tensor form). `keepdim` either drops or replaces reduced dims with 1. - `tests/ops/reduction_axis_test.cpp` — 13 cases covering every branch (negative, duplicate, out-of-range, empty-dims, `keepdim`, single-axis, dtype rules). Refs #9. * feat(ops): CPU sum/mean/prod kernels (#9) First batch of reduction kernels. Adds the iteration plan, the op functors, and the CPU implementations. - `src/ops/reduction_functors.h` — `SumF` / `ProdF` / `MaxF` / `MinF` callable structs; per-`Acc` identity values and `apply()` overloads that widen the input element to the accumulator type. Float `MaxF`/`MinF` are NaN-propagating (matches PyTorch). - `src/ops/reduction_iter.{h,cpp}` — `ReductionPlan` POD that splits input shape/stride into a "kept" outer axis set and a "reduced" inner axis set. Bounded by `kMaxRank`, kernel-friendly (no heap allocation per call). - `src/ops/reduction_ops_cpu.cpp` — generic `run_reduction<T,Acc,Op, OutT>` template + per-op dispatch ladders for sum/prod (bool/int* accumulate to int64 and write back as int64; fp32 accumulates to double and writes back as fp32 to mitigate CPU/CUDA drift) and mean (float-only; int/bool rejected at the front-door). Static registrar wires the three CPU OpKeys. - Public free functions for `sum` / `prod` / `mean` (whole-tensor + multi-axis forms) live in the same TU. - `tests/ops/reduction_value_test.cpp` — 16 functional cases on hand-computed inputs: whole-tensor + axis form, keepdim, multi-axis, negative axis, dtype promotion (int → int64), empty slice (sum→0, prod→1, mean→NaN), bfloat16 rejection, mean's integer rejection, and out-of-range axis throws. Refs #9. * feat(ops): CPU max/min with indices, argmax/argmin (#9) Closes the CPU side of issue 09. Adds the four remaining reduction families on top of the sum/prod/mean infrastructure from the previous commit: - `MaxF` / `MinF` gain `should_replace<Acc>(cur, v)` — strict comparison so ties keep the first occurrence (matches PyTorch's argmax tie-breaking), with NaN propagation that lets the first NaN's index win for argmax/argmin while sticking once seen. `apply()` is rewritten to delegate to `should_replace`, so the value-only and value+index kernels share semantics. - `run_axis_with_idx_cpu<T, Op>` — single-axis kernel that walks the kept-axis subspace via the shared `ReductionPlan` and tracks `(best_value, best_idx)` along the reduced axis. `vals_out` is nullable so argmax/argmin reuse the same kernel and discard the value buffer. The reduction axis must be non-empty (front-door guards). - Multi-axis / whole-tensor `max_val_cpu` / `min_val_cpu` reuse the generic `run_reduction` template with `MaxF` / `MinF` and same-dtype accumulator (no widening — `max` preserves the input dtype per §F7). - Front-doors for `max(x)` / `min(x)` (whole-tensor), `max(x, dims, keepdim)` / `min(x, dims, keepdim)` (multi-axis, values only), `max(x, dim, keepdim)` / `min(x, dim, keepdim)` (single-axis, returns `ValuesIndices`), and `argmax(x, dim, keepdim)` / `argmin(x, dim, keepdim)`. All reject empty reductions with `ShapeError` ("operation has no identity"). - `tests/ops/reduction_value_test.cpp` extended with 12 cases covering whole-tensor / multi-axis / single-axis-with-indices / keepdim / argmax tie-breaking / NaN-propagation / empty-slice rejection / 0-d-tensor rejection. - `tests/ops/reduction_dtype_test.cpp` (new) — black-box dtype contract: bool/int* sum→int64, prod→int64, mean rejects integral, max/min preserve dtype, argmax/argmin always int64, every op rejects bfloat16. 9 cases. Refs #9. * feat(ops): CUDA whole-tensor reductions (two-pass tree) (#9) Wires the CUDA backend for the whole-tensor reduction path. The axis-reduction path is stubbed with a clear DeviceError pointing at commit 5. - `src/ops/reduction_ops_cuda.cu` — new TU. Two `__global__` kernels: * `whole_tensor_pass1<T,Acc,Op,OutT>` reads input via `load_at_linear` (decodes a linear thread id back to a strided input offset using `ReductionPlan.shape_reduced / stride_in_reduced`, so non-contiguous tensors work without a `contiguous()` materialisation), folds into shared memory, does an in-block tree reduction, and writes one partial per block. * `whole_tensor_pass2` reduces up to 1024 partials with a single block. A `CudaScratchBuffer` RAII helper grabs the partials buffer from the per-device caching allocator (cheap on warm pools). - `mean_finalize` / `mean_finalize_nan` 1-thread kernels finish `mean` by dividing by `reduced_numel` or writing NaN for empty-slice inputs, mirroring the CPU contract. - Stub: when `kept_numel != 1` the per-op handler raises `DeviceError("axis reductions on CUDA are not yet implemented (landing in commit 5)")`. Only `SumOp` / `ProdOp` / `MeanOp` / `MaxValOp` / `MinValOp` are registered for CUDA in this commit; the with-index and arg variants land in the next commit. - `src/CMakeLists.txt` adds the new .cu source under `if(CTORCH_CUDA)`. - `src/ops/reduction_ops_cpu.cpp` calls `ctorch_register_cuda_reduction_ops()` from its registrar constructor, using the same linker-anchor trick as `binary_ops_cpu.cpp` so static linking pulls the .cu TU in. Numerical drift mitigation: fp32 sums use a `double` accumulator on both CPU and CUDA (matches issue #9 §7 risk-table line 1). Tiny inputs (`numel < kBlockSize * 4`) collapse to one block in pass 1 so the partials buffer never under-utilises (line 2). Refs #9. * feat(ops): CUDA axis reductions (innermost + non-innermost paths) (#9) Replaces the commit-4 stub with two new kernel families that handle every axis-reduction shape on CUDA: - `axis_reduce_kernel<T, Acc, Op, OutT>` — one thread per output element. Each thread decodes its kept-axis offset by walking `ReductionPlan.shape_kept / stride_in_kept`, then folds the reduced subspace serially using a per-thread odometer kept in registers (kMaxRank entries). Same kernel handles innermost and non-innermost reduced axes — issue #9 §7 explicitly marks transpose-then-reduce as a follow-up optimisation, so we ship correctness now and tune later. - `axis_with_idx_kernel<T, Op, RecordValue>` — single-axis kernel that tracks `(best_value, best_idx)` along the reduced axis. `RecordValue=false` skips the value write, letting argmax / argmin reuse the same kernel without a separate launch path. First-occurrence-wins tie-breaking via `Op::should_replace`. - `mean` finalisation upgraded to a per-output-element kernel (`mean_axis_finalize` / `mean_axis_finalize_nan`) so it handles both the whole-tensor and axis paths uniformly. - `dispatch_reduce_cuda<T, Acc, Op, OutT>` picks the whole-tensor or axis kernel based on `ax.kept_numel`. The whole-tensor fast path remains in place for the bench. - The CUDA registrar now wires `MaxValIdxOp` / `MinValIdxOp` / `ArgmaxOp` / `ArgminOp` (in addition to the value-only ops landed in commit 4), so every public reduction is dispatched on CUDA from this point onward. Refs #9. * chore(test): commit reduction parity fixtures + extend gen_parity.py (#9) Wires the reduction kernels against PyTorch (via NumPy reference) so CI catches any future drift. - `scripts/gen_parity.py` gains four new emitters: * `emit_sum_like` — sum / mean / prod (whole-tensor + axis, with keepdim). * `emit_max_min_values` — multi-axis or whole-tensor max/min (values only). * `emit_max_min_idx` — single-axis max/min returning both the value tensor and the int64 index tensor (PyTorch's `(values, indices)` contract). * `emit_arg` — argmax / argmin with int64 ref. Includes a deliberately tied input that locks in the "first occurrence wins" rule (issue 09 risk-table line 5). Filenames encode `(op, dtype, shape, dim_tag, kd_tag)` so each case maps to a stable prefix. - `tests/parity/reduction_parity_test.cpp` (new) — four parameterised test suites mirroring the four catalog families; fp32 tolerates 1e-5 rel, fp64 1e-12, integer / index reductions require exact equality (matches issue 09 §N1). - 64 .npy fixtures committed under `tests/parity/fixtures/`. The existing `load_npy` already supports every dtype these fixtures need (<f4 / <f8 / <i4 / <i8). Refs #9. * bench: CUDA sum vs cub::DeviceReduce::Sum + docs (#9) Closes the issue 09 acceptance list: - `tests/bench/sum_bench.cpp` — gtest target. Times `dispatch::call<SumOp>` and `cub::DeviceReduce::Sum` on a 1<<20 fp32 CUDA tensor for `kTrials=25` iterations, logs the median ratio to stdout in CSV (so CI dashboards see the real number), and asserts the dispatch median is at most 1.5x of cub — same loose-bound convention as `add_bench` (`add_bench.cpp` line 179) so noisy / shared-GPU CI runs don't go red. The §N2 target is 1.15x; the CSV log captures the actual figure. - `tests/bench/reference_cub_sum.{h,cu}` — wraps the `cub::DeviceReduce::Sum` call in a plain-C++ entry point. cub ships with the CUDA toolkit (CTK 11.0+) so no submodule is needed; the .cu TU is dropped from the CPU-only build. - `tests/bench/CMakeLists.txt` registers `sum_bench` for both CPU-only (no-op skip) and CUDA builds. - `docs/ops.md` gains a Reductions section: API surface, dtype rules, tie-breaking + NaN propagation contract, and per-dtype parity tolerances. - `README.md` ticks the Phase-1 reductions checkbox and links to [#9]. Refs #9. * test(ops): verify CPU sum has zero heap allocations on the hot path (#9) Closes the §N3 acceptance criterion that was promised in the planning doc but not landed in the original 7-commit slice. Uses #13's `CountingAllocator` + `set_default_allocator` overlay to instrument the CPU pool for the duration of a `sum` call: - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and output, dispatches `op::SumOp` directly (skipping the front-door's output Storage allocation), and asserts zero allocator calls during the kernel. - `FrontDoorAllocatesOnlyOutputStorage` — runs the public `ctorch::sum(x)` API and asserts exactly one allocation (the output Tensor's Storage), which is the upper bound the §N3 claim implies for the user-visible path. Both tests pass on CPU. They depend on the prereq PR's allocator override hook (#13). Refs #9.
Summary
Closes #12. Prereq for #9 §N3 (no-heap-alloc verification of CPU
sum).set_default_allocator(Device, Allocator*)returning the previous overlay;default_allocator(Device)consults a per-Kindstd::atomic<Allocator*>slot first and falls back to the existing CPU pool / CUDA caching pool.CountingAllocatorthat wraps any baseAllocator*and exposes atomicalloc_calls()/dealloc_calls()/live_bytes()counters. Mirrors the CPU pool's "0-byte allocate returns nullptr" contract so the wrapper does not invent allocations the base does not do.The shape mirrors Issue #12's two-commit suggestion:
feat(allocator): per-device override slot in default_allocatorfeat(allocator): header-only CountingAllocator wrapper + testsTest plan
ctest --test-dir build_cpu --output-on-failure— 113 passed, 0 failed (one CUDA smoke test skipped on CPU-only build).counting_allocator_test(5 cases) covers: tally correctness, the zero-byte-not-counted contract, overlay install/restore round-trip, end-to-end count around aTensorallocation, and concurrent access from 8 threads.pre-commit runclean (clang-format applied to one one-liner).