feat(allocator): pluggable default-allocator override + counting wrapper (#12) by Hayden727 · Pull Request #13 · Hayden727/ctorch

Hayden727 · 2026-04-28T14:31:56Z

Summary

Closes #12. Prereq for #9 §N3 (no-heap-alloc verification of CPU sum).

Add set_default_allocator(Device, Allocator*) returning the previous overlay; default_allocator(Device) consults a per-Kind std::atomic<Allocator*> slot first and falls back to the existing CPU pool / CUDA caching pool.
Add header-only CountingAllocator that wraps any base Allocator* and exposes atomic alloc_calls() / dealloc_calls() / live_bytes() counters. Mirrors the CPU pool's "0-byte allocate returns nullptr" contract so the wrapper does not invent allocations the base does not do.
Hot-path cost when no override is installed: one relaxed atomic load.

The shape mirrors Issue #12's two-commit suggestion:

feat(allocator): per-device override slot in default_allocator
feat(allocator): header-only CountingAllocator wrapper + tests

Test plan

ctest --test-dir build_cpu --output-on-failure — 113 passed, 0 failed (one CUDA smoke test skipped on CPU-only build).
New counting_allocator_test (5 cases) covers: tally correctness, the zero-byte-not-counted contract, overlay install/restore round-trip, end-to-end count around a Tensor allocation, and concurrent access from 8 threads.
pre-commit run clean (clang-format applied to one one-liner).
CI green on Linux (CPU + CUDA matrix).

Add `set_default_allocator(Device, Allocator*)` plus a per-Kind overlay slot consulted by `default_allocator(Device)`. The hot path costs one relaxed atomic load when no override is installed, so existing callers see no measurable change. Tests and instrumentation can now swap in a custom allocator for the duration of a scope — this is the testability hook Issue #9 §N3 needs to verify the no-heap-alloc property of the CPU `sum` kernel. Refs #12.

`CountingAllocator` wraps any base `Allocator*` and exposes atomic `alloc_calls()` / `dealloc_calls()` / `live_bytes()` counters. It mirrors the CPU pool's "0-byte allocate returns nullptr" contract so the wrapper does not invent allocations the base does not do. Combined with #12's overlay slot, tests can now assert exact allocation counts around any block of ctorch code: ctorch::CountingAllocator c(default_allocator(Device::cpu())); auto* prev = set_default_allocator(Device::cpu(), &c); ... set_default_allocator(Device::cpu(), prev); The new `counting_allocator_test` covers tally correctness, the zero-byte-not-counted contract, overlay install/restore, end-to-end counting around a `Tensor` allocation, and concurrent access. Closes #12.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 50682c5224

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-28T14:34:25Z

+std::atomic<Allocator*>& override_slot(Device::Kind kind) {
+    static std::atomic<Allocator*> slots[kNumDeviceKinds]{};
+    return slots[static_cast<std::size_t>(kind)];


Validate kind before indexing override slots

override_slot indexes a fixed-size array using static_cast<std::size_t>(kind) without checking that kind is a valid Device::Kind value. If a caller passes a malformed Device (e.g., via deserialization/FFI with an out-of-range enum), both default_allocator and set_default_allocator now perform out-of-bounds access and trigger undefined behavior instead of the previous invalid_argument path. Please guard kind (or route through a checked switch) before indexing the slot table.

Useful? React with 👍 / 👎.

Closes the §N3 acceptance criterion that was promised in the planning doc but not landed in the original 7-commit slice. Uses #13's `CountingAllocator` + `set_default_allocator` overlay to instrument the CPU pool for the duration of a `sum` call: - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and output, dispatches `op::SumOp` directly (skipping the front-door's output Storage allocation), and asserts zero allocator calls during the kernel. - `FrontDoorAllocatesOnlyOutputStorage` — runs the public `ctorch::sum(x)` API and asserts exactly one allocation (the output Tensor's Storage), which is the upper bound the §N3 claim implies for the user-visible path. Both tests pass on CPU. They depend on the prereq PR's allocator override hook (#13). Refs #9.

…ot (#12) Codex flagged a P1 on PR #13: `override_slot()` indexed a fixed-size slot table via `static_cast<std::size_t>(kind)` without validating that `kind` is one of the declared enumerators. `enum class` is not a closed set in C++ — a malformed value can arrive via memcpy / FFI / out-of-range cast — and the new lookup is called *before* the switch that previously caught unknown kinds, so the regression is real: both `default_allocator` and `set_default_allocator` would read out-of-bounds and trigger UB instead of throwing `invalid_argument`. Fix: guard the cast inside `override_slot()`. If the index is past `kNumDeviceKinds`, throw `std::invalid_argument`, mirroring the existing terminal throw at the bottom of `default_allocator()`. Adds `RejectsOutOfRangeDeviceKind` to `counting_allocator_test`, which constructs a deliberately invalid Kind via a numeric cast and asserts both entry points throw.

Closes the §N3 acceptance criterion that was promised in the planning doc but not landed in the original 7-commit slice. Uses #13's `CountingAllocator` + `set_default_allocator` overlay to instrument the CPU pool for the duration of a `sum` call: - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and output, dispatches `op::SumOp` directly (skipping the front-door's output Storage allocation), and asserts zero allocator calls during the kernel. - `FrontDoorAllocatesOnlyOutputStorage` — runs the public `ctorch::sum(x)` API and asserts exactly one allocation (the output Tensor's Storage), which is the upper bound the §N3 claim implies for the user-visible path. Both tests pass on CPU. They depend on the prereq PR's allocator override hook (#13). Refs #9.

Hayden727 · 2026-04-29T02:52:33Z

@codex review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 04415914ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-29T02:57:11Z

 #endif // CTORCH_HAS_CUDA

 Allocator* default_allocator(Device device) {
+    if (auto* override = override_slot(device.kind).load(std::memory_order_relaxed)) {


Load allocator override with acquire semantics

set_default_allocator publishes an allocator pointer with exchange(..., std::memory_order_acq_rel), but default_allocator reads it with a relaxed load. In a multithreaded test/instrumentation setup, a thread can install a freshly constructed override while another thread immediately consumes it, and the relaxed load does not establish a synchronizes-with edge for the override object's non-atomic state (for example CountingAllocator::base_). That means the consumer can observe partially published state and dereference invalid data. Use an acquire load here so publication via set_default_allocator is actually visible to readers.

Useful? React with 👍 / 👎.

The override slot's publishing exchange uses acq_rel, but the consuming load was relaxed. A relaxed load does not synchronize-with the release side, so a reader could see the new pointer while still observing stale or uninitialized bytes for the override object's non-atomic members (e.g. CountingAllocator::base_). Switch the load to acquire so the publication of set_default_allocator is actually visible to readers. Addresses Codex review on PR #13.

Hayden727 · 2026-04-29T05:23:18Z

@codex review.

chatgpt-codex-connector · 2026-04-29T05:25:41Z

Codex Review: Didn't find any major issues. Delightful!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Closes the §N3 acceptance criterion that was promised in the planning doc but not landed in the original 7-commit slice. Uses #13's `CountingAllocator` + `set_default_allocator` overlay to instrument the CPU pool for the duration of a `sum` call: - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and output, dispatches `op::SumOp` directly (skipping the front-door's output Storage allocation), and asserts zero allocator calls during the kernel. - `FrontDoorAllocatesOnlyOutputStorage` — runs the public `ctorch::sum(x)` API and asserts exactly one allocation (the output Tensor's Storage), which is the upper bound the §N3 claim implies for the user-visible path. Both tests pass on CPU. They depend on the prereq PR's allocator override hook (#13). Refs #9.

* feat(ops): ReductionAxes canonicalisation + output-shape helpers (#9) Lay the groundwork for sum/mean/prod/max/min/argmax/argmin without shipping any kernel yet: - `<ctorch/ops/reduction.h>` — the public API surface (whole-tensor, multi-axis, single-axis-with-indices, and argmax/argmin declarations + the `ValuesIndices` struct). - `<ctorch/ops/op_keys.h>` — 9 new OpKey tags (Sum/Mean/Prod/MaxVal/MinVal/MaxValIdx/MinValIdx/Argmax/Argmin) plus their `fn_t` typedefs. `ReductionAxes` is forward-declared here and defined in the private header. - `src/ops/reduction.h` — `ReductionAxes` (fixed-size POD bounded by `kMaxRank`, kernel-friendly), plus `canonicalise`, `canonicalise_single`, `reduced_shape*`, and dtype-rule helpers (`reduce_sum_prod_dtype`, `require_float_for_mean`, `reject_bfloat16`). - `src/ops/reduction.cpp` — implementations. Negative axes normalise against ndim; duplicates / out-of-range raise `ShapeError`; empty `dims` ⇒ collapse every axis (whole-tensor form). `keepdim` either drops or replaces reduced dims with 1. - `tests/ops/reduction_axis_test.cpp` — 13 cases covering every branch (negative, duplicate, out-of-range, empty-dims, `keepdim`, single-axis, dtype rules). Refs #9. * feat(ops): CPU sum/mean/prod kernels (#9) First batch of reduction kernels. Adds the iteration plan, the op functors, and the CPU implementations. - `src/ops/reduction_functors.h` — `SumF` / `ProdF` / `MaxF` / `MinF` callable structs; per-`Acc` identity values and `apply()` overloads that widen the input element to the accumulator type. Float `MaxF`/`MinF` are NaN-propagating (matches PyTorch). - `src/ops/reduction_iter.{h,cpp}` — `ReductionPlan` POD that splits input shape/stride into a "kept" outer axis set and a "reduced" inner axis set. Bounded by `kMaxRank`, kernel-friendly (no heap allocation per call). - `src/ops/reduction_ops_cpu.cpp` — generic `run_reduction<T,Acc,Op, OutT>` template + per-op dispatch ladders for sum/prod (bool/int* accumulate to int64 and write back as int64; fp32 accumulates to double and writes back as fp32 to mitigate CPU/CUDA drift) and mean (float-only; int/bool rejected at the front-door). Static registrar wires the three CPU OpKeys. - Public free functions for `sum` / `prod` / `mean` (whole-tensor + multi-axis forms) live in the same TU. - `tests/ops/reduction_value_test.cpp` — 16 functional cases on hand-computed inputs: whole-tensor + axis form, keepdim, multi-axis, negative axis, dtype promotion (int → int64), empty slice (sum→0, prod→1, mean→NaN), bfloat16 rejection, mean's integer rejection, and out-of-range axis throws. Refs #9. * feat(ops): CPU max/min with indices, argmax/argmin (#9) Closes the CPU side of issue 09. Adds the four remaining reduction families on top of the sum/prod/mean infrastructure from the previous commit: - `MaxF` / `MinF` gain `should_replace<Acc>(cur, v)` — strict comparison so ties keep the first occurrence (matches PyTorch's argmax tie-breaking), with NaN propagation that lets the first NaN's index win for argmax/argmin while sticking once seen. `apply()` is rewritten to delegate to `should_replace`, so the value-only and value+index kernels share semantics. - `run_axis_with_idx_cpu<T, Op>` — single-axis kernel that walks the kept-axis subspace via the shared `ReductionPlan` and tracks `(best_value, best_idx)` along the reduced axis. `vals_out` is nullable so argmax/argmin reuse the same kernel and discard the value buffer. The reduction axis must be non-empty (front-door guards). - Multi-axis / whole-tensor `max_val_cpu` / `min_val_cpu` reuse the generic `run_reduction` template with `MaxF` / `MinF` and same-dtype accumulator (no widening — `max` preserves the input dtype per §F7). - Front-doors for `max(x)` / `min(x)` (whole-tensor), `max(x, dims, keepdim)` / `min(x, dims, keepdim)` (multi-axis, values only), `max(x, dim, keepdim)` / `min(x, dim, keepdim)` (single-axis, returns `ValuesIndices`), and `argmax(x, dim, keepdim)` / `argmin(x, dim, keepdim)`. All reject empty reductions with `ShapeError` ("operation has no identity"). - `tests/ops/reduction_value_test.cpp` extended with 12 cases covering whole-tensor / multi-axis / single-axis-with-indices / keepdim / argmax tie-breaking / NaN-propagation / empty-slice rejection / 0-d-tensor rejection. - `tests/ops/reduction_dtype_test.cpp` (new) — black-box dtype contract: bool/int* sum→int64, prod→int64, mean rejects integral, max/min preserve dtype, argmax/argmin always int64, every op rejects bfloat16. 9 cases. Refs #9. * feat(ops): CUDA whole-tensor reductions (two-pass tree) (#9) Wires the CUDA backend for the whole-tensor reduction path. The axis-reduction path is stubbed with a clear DeviceError pointing at commit 5. - `src/ops/reduction_ops_cuda.cu` — new TU. Two `__global__` kernels: * `whole_tensor_pass1<T,Acc,Op,OutT>` reads input via `load_at_linear` (decodes a linear thread id back to a strided input offset using `ReductionPlan.shape_reduced / stride_in_reduced`, so non-contiguous tensors work without a `contiguous()` materialisation), folds into shared memory, does an in-block tree reduction, and writes one partial per block. * `whole_tensor_pass2` reduces up to 1024 partials with a single block. A `CudaScratchBuffer` RAII helper grabs the partials buffer from the per-device caching allocator (cheap on warm pools). - `mean_finalize` / `mean_finalize_nan` 1-thread kernels finish `mean` by dividing by `reduced_numel` or writing NaN for empty-slice inputs, mirroring the CPU contract. - Stub: when `kept_numel != 1` the per-op handler raises `DeviceError("axis reductions on CUDA are not yet implemented (landing in commit 5)")`. Only `SumOp` / `ProdOp` / `MeanOp` / `MaxValOp` / `MinValOp` are registered for CUDA in this commit; the with-index and arg variants land in the next commit. - `src/CMakeLists.txt` adds the new .cu source under `if(CTORCH_CUDA)`. - `src/ops/reduction_ops_cpu.cpp` calls `ctorch_register_cuda_reduction_ops()` from its registrar constructor, using the same linker-anchor trick as `binary_ops_cpu.cpp` so static linking pulls the .cu TU in. Numerical drift mitigation: fp32 sums use a `double` accumulator on both CPU and CUDA (matches issue #9 §7 risk-table line 1). Tiny inputs (`numel < kBlockSize * 4`) collapse to one block in pass 1 so the partials buffer never under-utilises (line 2). Refs #9. * feat(ops): CUDA axis reductions (innermost + non-innermost paths) (#9) Replaces the commit-4 stub with two new kernel families that handle every axis-reduction shape on CUDA: - `axis_reduce_kernel<T, Acc, Op, OutT>` — one thread per output element. Each thread decodes its kept-axis offset by walking `ReductionPlan.shape_kept / stride_in_kept`, then folds the reduced subspace serially using a per-thread odometer kept in registers (kMaxRank entries). Same kernel handles innermost and non-innermost reduced axes — issue #9 §7 explicitly marks transpose-then-reduce as a follow-up optimisation, so we ship correctness now and tune later. - `axis_with_idx_kernel<T, Op, RecordValue>` — single-axis kernel that tracks `(best_value, best_idx)` along the reduced axis. `RecordValue=false` skips the value write, letting argmax / argmin reuse the same kernel without a separate launch path. First-occurrence-wins tie-breaking via `Op::should_replace`. - `mean` finalisation upgraded to a per-output-element kernel (`mean_axis_finalize` / `mean_axis_finalize_nan`) so it handles both the whole-tensor and axis paths uniformly. - `dispatch_reduce_cuda<T, Acc, Op, OutT>` picks the whole-tensor or axis kernel based on `ax.kept_numel`. The whole-tensor fast path remains in place for the bench. - The CUDA registrar now wires `MaxValIdxOp` / `MinValIdxOp` / `ArgmaxOp` / `ArgminOp` (in addition to the value-only ops landed in commit 4), so every public reduction is dispatched on CUDA from this point onward. Refs #9. * chore(test): commit reduction parity fixtures + extend gen_parity.py (#9) Wires the reduction kernels against PyTorch (via NumPy reference) so CI catches any future drift. - `scripts/gen_parity.py` gains four new emitters: * `emit_sum_like` — sum / mean / prod (whole-tensor + axis, with keepdim). * `emit_max_min_values` — multi-axis or whole-tensor max/min (values only). * `emit_max_min_idx` — single-axis max/min returning both the value tensor and the int64 index tensor (PyTorch's `(values, indices)` contract). * `emit_arg` — argmax / argmin with int64 ref. Includes a deliberately tied input that locks in the "first occurrence wins" rule (issue 09 risk-table line 5). Filenames encode `(op, dtype, shape, dim_tag, kd_tag)` so each case maps to a stable prefix. - `tests/parity/reduction_parity_test.cpp` (new) — four parameterised test suites mirroring the four catalog families; fp32 tolerates 1e-5 rel, fp64 1e-12, integer / index reductions require exact equality (matches issue 09 §N1). - 64 .npy fixtures committed under `tests/parity/fixtures/`. The existing `load_npy` already supports every dtype these fixtures need (<f4 / <f8 / <i4 / <i8). Refs #9. * bench: CUDA sum vs cub::DeviceReduce::Sum + docs (#9) Closes the issue 09 acceptance list: - `tests/bench/sum_bench.cpp` — gtest target. Times `dispatch::call<SumOp>` and `cub::DeviceReduce::Sum` on a 1<<20 fp32 CUDA tensor for `kTrials=25` iterations, logs the median ratio to stdout in CSV (so CI dashboards see the real number), and asserts the dispatch median is at most 1.5x of cub — same loose-bound convention as `add_bench` (`add_bench.cpp` line 179) so noisy / shared-GPU CI runs don't go red. The §N2 target is 1.15x; the CSV log captures the actual figure. - `tests/bench/reference_cub_sum.{h,cu}` — wraps the `cub::DeviceReduce::Sum` call in a plain-C++ entry point. cub ships with the CUDA toolkit (CTK 11.0+) so no submodule is needed; the .cu TU is dropped from the CPU-only build. - `tests/bench/CMakeLists.txt` registers `sum_bench` for both CPU-only (no-op skip) and CUDA builds. - `docs/ops.md` gains a Reductions section: API surface, dtype rules, tie-breaking + NaN propagation contract, and per-dtype parity tolerances. - `README.md` ticks the Phase-1 reductions checkbox and links to [#9]. Refs #9. * test(ops): verify CPU sum has zero heap allocations on the hot path (#9) Closes the §N3 acceptance criterion that was promised in the planning doc but not landed in the original 7-commit slice. Uses #13's `CountingAllocator` + `set_default_allocator` overlay to instrument the CPU pool for the duration of a `sum` call: - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and output, dispatches `op::SumOp` directly (skipping the front-door's output Storage allocation), and asserts zero allocator calls during the kernel. - `FrontDoorAllocatesOnlyOutputStorage` — runs the public `ctorch::sum(x)` API and asserts exactly one allocation (the output Tensor's Storage), which is the upper bound the §N3 claim implies for the user-visible path. Both tests pass on CPU. They depend on the prereq PR's allocator override hook (#13). Refs #9.

Hayden727 added 2 commits April 28, 2026 22:31

chatgpt-codex-connector Bot reviewed Apr 28, 2026

View reviewed changes

Hayden727 mentioned this pull request Apr 28, 2026

feat(ops): reductions with axis & keepdim (#9) #14

Closed

6 tasks

chatgpt-codex-connector Bot reviewed Apr 29, 2026

View reviewed changes

Hayden727 mentioned this pull request Apr 29, 2026

chore: ignore build_*/ output directories #15

Merged

2 tasks

Hayden727 merged commit 7b8732a into master Apr 29, 2026
7 checks passed

Hayden727 deleted the feat/allocator-instrumentation branch April 29, 2026 05:28

Hayden727 mentioned this pull request Apr 29, 2026

feat(ops): reductions with axis & keepdim (#9) #16

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(allocator): pluggable default-allocator override + counting wrapper (#12)#13

feat(allocator): pluggable default-allocator override + counting wrapper (#12)#13
Hayden727 merged 4 commits into
masterfrom
feat/allocator-instrumentation

Hayden727 commented Apr 28, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 28, 2026

Uh oh!

Hayden727 commented Apr 29, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 29, 2026

Uh oh!

Hayden727 commented Apr 29, 2026

Uh oh!

chatgpt-codex-connector Bot commented Apr 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Hayden727 commented Apr 28, 2026

Summary

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Hayden727 commented Apr 29, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

Hayden727 commented Apr 29, 2026

Uh oh!

chatgpt-codex-connector Bot commented Apr 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant