Skip to content

feat(allocator): pluggable default-allocator override + counting wrapper (#12)#13

Merged
Hayden727 merged 4 commits into
masterfrom
feat/allocator-instrumentation
Apr 29, 2026
Merged

feat(allocator): pluggable default-allocator override + counting wrapper (#12)#13
Hayden727 merged 4 commits into
masterfrom
feat/allocator-instrumentation

Conversation

@Hayden727

Copy link
Copy Markdown
Owner

Summary

Closes #12. Prereq for #9 §N3 (no-heap-alloc verification of CPU sum).

  • Add set_default_allocator(Device, Allocator*) returning the previous overlay; default_allocator(Device) consults a per-Kind std::atomic<Allocator*> slot first and falls back to the existing CPU pool / CUDA caching pool.
  • Add header-only CountingAllocator that wraps any base Allocator* and exposes atomic alloc_calls() / dealloc_calls() / live_bytes() counters. Mirrors the CPU pool's "0-byte allocate returns nullptr" contract so the wrapper does not invent allocations the base does not do.
  • Hot-path cost when no override is installed: one relaxed atomic load.

The shape mirrors Issue #12's two-commit suggestion:

  1. feat(allocator): per-device override slot in default_allocator
  2. feat(allocator): header-only CountingAllocator wrapper + tests

Test plan

  • ctest --test-dir build_cpu --output-on-failure — 113 passed, 0 failed (one CUDA smoke test skipped on CPU-only build).
  • New counting_allocator_test (5 cases) covers: tally correctness, the zero-byte-not-counted contract, overlay install/restore round-trip, end-to-end count around a Tensor allocation, and concurrent access from 8 threads.
  • pre-commit run clean (clang-format applied to one one-liner).
  • CI green on Linux (CPU + CUDA matrix).

Add `set_default_allocator(Device, Allocator*)` plus a per-Kind
overlay slot consulted by `default_allocator(Device)`. The hot path
costs one relaxed atomic load when no override is installed, so
existing callers see no measurable change. Tests and instrumentation
can now swap in a custom allocator for the duration of a scope —
this is the testability hook Issue #9 §N3 needs to verify the
no-heap-alloc property of the CPU `sum` kernel.

Refs #12.
`CountingAllocator` wraps any base `Allocator*` and exposes atomic
`alloc_calls()` / `dealloc_calls()` / `live_bytes()` counters. It
mirrors the CPU pool's "0-byte allocate returns nullptr" contract
so the wrapper does not invent allocations the base does not do.

Combined with #12's overlay slot, tests can now assert exact
allocation counts around any block of ctorch code:

    ctorch::CountingAllocator c(default_allocator(Device::cpu()));
    auto* prev = set_default_allocator(Device::cpu(), &c);
    ...
    set_default_allocator(Device::cpu(), prev);

The new `counting_allocator_test` covers tally correctness, the
zero-byte-not-counted contract, overlay install/restore, end-to-end
counting around a `Tensor` allocation, and concurrent access.

Closes #12.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 50682c5224

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/allocators/default_allocator.cpp Outdated
Comment on lines +45 to +47
std::atomic<Allocator*>& override_slot(Device::Kind kind) {
static std::atomic<Allocator*> slots[kNumDeviceKinds]{};
return slots[static_cast<std::size_t>(kind)];

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Validate kind before indexing override slots

override_slot indexes a fixed-size array using static_cast<std::size_t>(kind) without checking that kind is a valid Device::Kind value. If a caller passes a malformed Device (e.g., via deserialization/FFI with an out-of-range enum), both default_allocator and set_default_allocator now perform out-of-bounds access and trigger undefined behavior instead of the previous invalid_argument path. Please guard kind (or route through a checked switch) before indexing the slot table.

Useful? React with 👍 / 👎.

Hayden727 added a commit that referenced this pull request Apr 28, 2026
Closes the §N3 acceptance criterion that was promised in the planning
doc but not landed in the original 7-commit slice. Uses #13's
`CountingAllocator` + `set_default_allocator` overlay to instrument
the CPU pool for the duration of a `sum` call:

  - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and
    output, dispatches `op::SumOp` directly (skipping the
    front-door's output Storage allocation), and asserts zero
    allocator calls during the kernel.
  - `FrontDoorAllocatesOnlyOutputStorage` — runs the public
    `ctorch::sum(x)` API and asserts exactly one allocation (the
    output Tensor's Storage), which is the upper bound the §N3
    claim implies for the user-visible path.

Both tests pass on CPU. They depend on the prereq PR's allocator
override hook (#13).

Refs #9.
…ot (#12)

Codex flagged a P1 on PR #13: `override_slot()` indexed a fixed-size
slot table via `static_cast<std::size_t>(kind)` without validating
that `kind` is one of the declared enumerators. `enum class` is not
a closed set in C++ — a malformed value can arrive via memcpy / FFI /
out-of-range cast — and the new lookup is called *before* the switch
that previously caught unknown kinds, so the regression is real:
both `default_allocator` and `set_default_allocator` would read
out-of-bounds and trigger UB instead of throwing
`invalid_argument`.

Fix: guard the cast inside `override_slot()`. If the index is past
`kNumDeviceKinds`, throw `std::invalid_argument`, mirroring the
existing terminal throw at the bottom of `default_allocator()`.

Adds `RejectsOutOfRangeDeviceKind` to `counting_allocator_test`,
which constructs a deliberately invalid Kind via a numeric cast
and asserts both entry points throw.
Hayden727 added a commit that referenced this pull request Apr 29, 2026
Closes the §N3 acceptance criterion that was promised in the planning
doc but not landed in the original 7-commit slice. Uses #13's
`CountingAllocator` + `set_default_allocator` overlay to instrument
the CPU pool for the duration of a `sum` call:

  - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and
    output, dispatches `op::SumOp` directly (skipping the
    front-door's output Storage allocation), and asserts zero
    allocator calls during the kernel.
  - `FrontDoorAllocatesOnlyOutputStorage` — runs the public
    `ctorch::sum(x)` API and asserts exactly one allocation (the
    output Tensor's Storage), which is the upper bound the §N3
    claim implies for the user-visible path.

Both tests pass on CPU. They depend on the prereq PR's allocator
override hook (#13).

Refs #9.
@Hayden727

Copy link
Copy Markdown
Owner Author

@codex review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 04415914ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/allocators/default_allocator.cpp Outdated
#endif // CTORCH_HAS_CUDA

Allocator* default_allocator(Device device) {
if (auto* override = override_slot(device.kind).load(std::memory_order_relaxed)) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Load allocator override with acquire semantics

set_default_allocator publishes an allocator pointer with exchange(..., std::memory_order_acq_rel), but default_allocator reads it with a relaxed load. In a multithreaded test/instrumentation setup, a thread can install a freshly constructed override while another thread immediately consumes it, and the relaxed load does not establish a synchronizes-with edge for the override object's non-atomic state (for example CountingAllocator::base_). That means the consumer can observe partially published state and dereference invalid data. Use an acquire load here so publication via set_default_allocator is actually visible to readers.

Useful? React with 👍 / 👎.

The override slot's publishing exchange uses acq_rel, but the consuming
load was relaxed. A relaxed load does not synchronize-with the release
side, so a reader could see the new pointer while still observing stale
or uninitialized bytes for the override object's non-atomic members
(e.g. CountingAllocator::base_). Switch the load to acquire so the
publication of set_default_allocator is actually visible to readers.

Addresses Codex review on PR #13.
@Hayden727

Copy link
Copy Markdown
Owner Author

@codex review.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Delightful!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@Hayden727 Hayden727 merged commit 7b8732a into master Apr 29, 2026
7 checks passed
@Hayden727 Hayden727 deleted the feat/allocator-instrumentation branch April 29, 2026 05:28
Hayden727 added a commit that referenced this pull request Apr 29, 2026
Closes the §N3 acceptance criterion that was promised in the planning
doc but not landed in the original 7-commit slice. Uses #13's
`CountingAllocator` + `set_default_allocator` overlay to instrument
the CPU pool for the duration of a `sum` call:

  - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and
    output, dispatches `op::SumOp` directly (skipping the
    front-door's output Storage allocation), and asserts zero
    allocator calls during the kernel.
  - `FrontDoorAllocatesOnlyOutputStorage` — runs the public
    `ctorch::sum(x)` API and asserts exactly one allocation (the
    output Tensor's Storage), which is the upper bound the §N3
    claim implies for the user-visible path.

Both tests pass on CPU. They depend on the prereq PR's allocator
override hook (#13).

Refs #9.
Hayden727 added a commit that referenced this pull request Apr 29, 2026
* feat(ops): ReductionAxes canonicalisation + output-shape helpers (#9)

Lay the groundwork for sum/mean/prod/max/min/argmax/argmin without
shipping any kernel yet:

  - `<ctorch/ops/reduction.h>` — the public API surface (whole-tensor,
    multi-axis, single-axis-with-indices, and argmax/argmin
    declarations + the `ValuesIndices` struct).
  - `<ctorch/ops/op_keys.h>` — 9 new OpKey tags
    (Sum/Mean/Prod/MaxVal/MinVal/MaxValIdx/MinValIdx/Argmax/Argmin)
    plus their `fn_t` typedefs. `ReductionAxes` is forward-declared
    here and defined in the private header.
  - `src/ops/reduction.h` — `ReductionAxes` (fixed-size POD bounded by
    `kMaxRank`, kernel-friendly), plus `canonicalise`,
    `canonicalise_single`, `reduced_shape*`, and dtype-rule helpers
    (`reduce_sum_prod_dtype`, `require_float_for_mean`,
    `reject_bfloat16`).
  - `src/ops/reduction.cpp` — implementations. Negative axes
    normalise against ndim; duplicates / out-of-range raise
    `ShapeError`; empty `dims` ⇒ collapse every axis (whole-tensor
    form). `keepdim` either drops or replaces reduced dims with 1.
  - `tests/ops/reduction_axis_test.cpp` — 13 cases covering every
    branch (negative, duplicate, out-of-range, empty-dims,
    `keepdim`, single-axis, dtype rules).

Refs #9.

* feat(ops): CPU sum/mean/prod kernels (#9)

First batch of reduction kernels. Adds the iteration plan, the
op functors, and the CPU implementations.

  - `src/ops/reduction_functors.h` — `SumF` / `ProdF` / `MaxF` / `MinF`
    callable structs; per-`Acc` identity values and `apply()` overloads
    that widen the input element to the accumulator type. Float
    `MaxF`/`MinF` are NaN-propagating (matches PyTorch).
  - `src/ops/reduction_iter.{h,cpp}` — `ReductionPlan` POD that splits
    input shape/stride into a "kept" outer axis set and a "reduced"
    inner axis set. Bounded by `kMaxRank`, kernel-friendly (no heap
    allocation per call).
  - `src/ops/reduction_ops_cpu.cpp` — generic `run_reduction<T,Acc,Op,
    OutT>` template + per-op dispatch ladders for sum/prod (bool/int*
    accumulate to int64 and write back as int64; fp32 accumulates to
    double and writes back as fp32 to mitigate CPU/CUDA drift) and
    mean (float-only; int/bool rejected at the front-door). Static
    registrar wires the three CPU OpKeys.
  - Public free functions for `sum` / `prod` / `mean` (whole-tensor +
    multi-axis forms) live in the same TU.
  - `tests/ops/reduction_value_test.cpp` — 16 functional cases on
    hand-computed inputs: whole-tensor + axis form, keepdim,
    multi-axis, negative axis, dtype promotion (int → int64), empty
    slice (sum→0, prod→1, mean→NaN), bfloat16 rejection, mean's
    integer rejection, and out-of-range axis throws.

Refs #9.

* feat(ops): CPU max/min with indices, argmax/argmin (#9)

Closes the CPU side of issue 09. Adds the four remaining reduction
families on top of the sum/prod/mean infrastructure from the
previous commit:

  - `MaxF` / `MinF` gain `should_replace<Acc>(cur, v)` — strict
    comparison so ties keep the first occurrence (matches PyTorch's
    argmax tie-breaking), with NaN propagation that lets the first
    NaN's index win for argmax/argmin while sticking once seen.
    `apply()` is rewritten to delegate to `should_replace`, so the
    value-only and value+index kernels share semantics.
  - `run_axis_with_idx_cpu<T, Op>` — single-axis kernel that walks
    the kept-axis subspace via the shared `ReductionPlan` and tracks
    `(best_value, best_idx)` along the reduced axis. `vals_out` is
    nullable so argmax/argmin reuse the same kernel and discard the
    value buffer. The reduction axis must be non-empty (front-door
    guards).
  - Multi-axis / whole-tensor `max_val_cpu` / `min_val_cpu` reuse
    the generic `run_reduction` template with `MaxF` / `MinF` and
    same-dtype accumulator (no widening — `max` preserves the input
    dtype per §F7).
  - Front-doors for `max(x)` / `min(x)` (whole-tensor),
    `max(x, dims, keepdim)` / `min(x, dims, keepdim)` (multi-axis,
    values only), `max(x, dim, keepdim)` / `min(x, dim, keepdim)`
    (single-axis, returns `ValuesIndices`), and
    `argmax(x, dim, keepdim)` / `argmin(x, dim, keepdim)`. All
    reject empty reductions with `ShapeError` ("operation has no
    identity").
  - `tests/ops/reduction_value_test.cpp` extended with 12 cases
    covering whole-tensor / multi-axis / single-axis-with-indices /
    keepdim / argmax tie-breaking / NaN-propagation / empty-slice
    rejection / 0-d-tensor rejection.
  - `tests/ops/reduction_dtype_test.cpp` (new) — black-box dtype
    contract: bool/int* sum→int64, prod→int64, mean rejects integral,
    max/min preserve dtype, argmax/argmin always int64, every op
    rejects bfloat16. 9 cases.

Refs #9.

* feat(ops): CUDA whole-tensor reductions (two-pass tree) (#9)

Wires the CUDA backend for the whole-tensor reduction path. The
axis-reduction path is stubbed with a clear DeviceError pointing at
commit 5.

  - `src/ops/reduction_ops_cuda.cu` — new TU. Two `__global__`
    kernels:
      * `whole_tensor_pass1<T,Acc,Op,OutT>` reads input via
        `load_at_linear` (decodes a linear thread id back to a
        strided input offset using `ReductionPlan.shape_reduced /
        stride_in_reduced`, so non-contiguous tensors work without
        a `contiguous()` materialisation), folds into shared memory,
        does an in-block tree reduction, and writes one partial per
        block.
      * `whole_tensor_pass2` reduces up to 1024 partials with a
        single block.
    A `CudaScratchBuffer` RAII helper grabs the partials buffer from
    the per-device caching allocator (cheap on warm pools).
  - `mean_finalize` / `mean_finalize_nan` 1-thread kernels finish
    `mean` by dividing by `reduced_numel` or writing NaN for
    empty-slice inputs, mirroring the CPU contract.
  - Stub: when `kept_numel != 1` the per-op handler raises
    `DeviceError("axis reductions on CUDA are not yet implemented
    (landing in commit 5)")`. Only `SumOp` / `ProdOp` / `MeanOp` /
    `MaxValOp` / `MinValOp` are registered for CUDA in this commit;
    the with-index and arg variants land in the next commit.
  - `src/CMakeLists.txt` adds the new .cu source under
    `if(CTORCH_CUDA)`.
  - `src/ops/reduction_ops_cpu.cpp` calls
    `ctorch_register_cuda_reduction_ops()` from its registrar
    constructor, using the same linker-anchor trick as
    `binary_ops_cpu.cpp` so static linking pulls the .cu TU in.

Numerical drift mitigation: fp32 sums use a `double` accumulator on
both CPU and CUDA (matches issue #9 §7 risk-table line 1). Tiny
inputs (`numel < kBlockSize * 4`) collapse to one block in pass 1
so the partials buffer never under-utilises (line 2).

Refs #9.

* feat(ops): CUDA axis reductions (innermost + non-innermost paths) (#9)

Replaces the commit-4 stub with two new kernel families that handle
every axis-reduction shape on CUDA:

  - `axis_reduce_kernel<T, Acc, Op, OutT>` — one thread per output
    element. Each thread decodes its kept-axis offset by walking
    `ReductionPlan.shape_kept / stride_in_kept`, then folds the
    reduced subspace serially using a per-thread odometer kept in
    registers (kMaxRank entries). Same kernel handles innermost and
    non-innermost reduced axes — issue #9 §7 explicitly marks
    transpose-then-reduce as a follow-up optimisation, so we ship
    correctness now and tune later.
  - `axis_with_idx_kernel<T, Op, RecordValue>` — single-axis kernel
    that tracks `(best_value, best_idx)` along the reduced axis.
    `RecordValue=false` skips the value write, letting argmax /
    argmin reuse the same kernel without a separate launch path.
    First-occurrence-wins tie-breaking via `Op::should_replace`.
  - `mean` finalisation upgraded to a per-output-element kernel
    (`mean_axis_finalize` / `mean_axis_finalize_nan`) so it handles
    both the whole-tensor and axis paths uniformly.
  - `dispatch_reduce_cuda<T, Acc, Op, OutT>` picks the whole-tensor
    or axis kernel based on `ax.kept_numel`. The whole-tensor fast
    path remains in place for the bench.
  - The CUDA registrar now wires `MaxValIdxOp` / `MinValIdxOp` /
    `ArgmaxOp` / `ArgminOp` (in addition to the value-only ops
    landed in commit 4), so every public reduction is dispatched
    on CUDA from this point onward.

Refs #9.

* chore(test): commit reduction parity fixtures + extend gen_parity.py (#9)

Wires the reduction kernels against PyTorch (via NumPy reference) so
CI catches any future drift.

  - `scripts/gen_parity.py` gains four new emitters:
      * `emit_sum_like` — sum / mean / prod (whole-tensor + axis,
        with keepdim).
      * `emit_max_min_values` — multi-axis or whole-tensor max/min
        (values only).
      * `emit_max_min_idx` — single-axis max/min returning both the
        value tensor and the int64 index tensor (PyTorch's
        `(values, indices)` contract).
      * `emit_arg` — argmax / argmin with int64 ref. Includes a
        deliberately tied input that locks in the
        "first occurrence wins" rule (issue 09 risk-table line 5).
    Filenames encode `(op, dtype, shape, dim_tag, kd_tag)` so each
    case maps to a stable prefix.
  - `tests/parity/reduction_parity_test.cpp` (new) — four
    parameterised test suites mirroring the four catalog families;
    fp32 tolerates 1e-5 rel, fp64 1e-12, integer / index reductions
    require exact equality (matches issue 09 §N1).
  - 64 .npy fixtures committed under `tests/parity/fixtures/`. The
    existing `load_npy` already supports every dtype these
    fixtures need (<f4 / <f8 / <i4 / <i8).

Refs #9.

* bench: CUDA sum vs cub::DeviceReduce::Sum + docs (#9)

Closes the issue 09 acceptance list:

  - `tests/bench/sum_bench.cpp` — gtest target. Times
    `dispatch::call<SumOp>` and `cub::DeviceReduce::Sum` on a
    1<<20 fp32 CUDA tensor for `kTrials=25` iterations, logs the
    median ratio to stdout in CSV (so CI dashboards see the real
    number), and asserts the dispatch median is at most 1.5x of cub
    — same loose-bound convention as `add_bench` (`add_bench.cpp`
    line 179) so noisy / shared-GPU CI runs don't go red. The §N2
    target is 1.15x; the CSV log captures the actual figure.
  - `tests/bench/reference_cub_sum.{h,cu}` — wraps the
    `cub::DeviceReduce::Sum` call in a plain-C++ entry point. cub
    ships with the CUDA toolkit (CTK 11.0+) so no submodule is
    needed; the .cu TU is dropped from the CPU-only build.
  - `tests/bench/CMakeLists.txt` registers `sum_bench` for both
    CPU-only (no-op skip) and CUDA builds.
  - `docs/ops.md` gains a Reductions section: API surface, dtype
    rules, tie-breaking + NaN propagation contract, and per-dtype
    parity tolerances.
  - `README.md` ticks the Phase-1 reductions checkbox and links
    to [#9].

Refs #9.

* test(ops): verify CPU sum has zero heap allocations on the hot path (#9)

Closes the §N3 acceptance criterion that was promised in the planning
doc but not landed in the original 7-commit slice. Uses #13's
`CountingAllocator` + `set_default_allocator` overlay to instrument
the CPU pool for the duration of a `sum` call:

  - `KernelDoesNotHeapAllocateOnHotPath` — pre-allocates input and
    output, dispatches `op::SumOp` directly (skipping the
    front-door's output Storage allocation), and asserts zero
    allocator calls during the kernel.
  - `FrontDoorAllocatesOnlyOutputStorage` — runs the public
    `ctorch::sum(x)` API and asserts exactly one allocation (the
    output Tensor's Storage), which is the upper bound the §N3
    claim implies for the user-visible path.

Both tests pass on CPU. They depend on the prereq PR's allocator
override hook (#13).

Refs #9.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Infra] Pluggable default-allocator override + counting allocator (prereq for #9 §N3)

1 participant