diff --git a/docs/CONVENTIONS.md b/docs/CONVENTIONS.md index 2323ec5..f5187a4 100644 --- a/docs/CONVENTIONS.md +++ b/docs/CONVENTIONS.md @@ -1,567 +1,9 @@ # Project Conventions -## Data Shape Convention - -Datasets deliver samples in their natural geometric shape (e.g. `[C, H, W]` -for images, `[C, L]` for time series). Any `reshape`, `flatten`, or `view` -operation is the **first layer of the model**, not a preprocessing step in -the dataset. This: - -- keeps dataset code independent of downstream model topology -- allows one dataset to feed models with different input ranks -- matches the PyTorch / Keras / elastic-ai.creator IR convention, so a future - ir2c can compile each shape transform to a corresponding C layer - -For flatten-to-2D, use `flattenLayerInit()` from `FlattenApi.h`. - -## Sanitizer-driven memory bug detection - -The C unit-test suite is run twice in CI: once normally (`c-build-and-test`), -and once under AddressSanitizer + UndefinedBehaviorSanitizer -(`c-asan-build-and-test`). The sanitizer job is a hard gate — any heap-OOB, -use-after-free, double-free, or UB diagnoses fails the PR. LeakSanitizer is -deliberately **off** (`detect_leaks=0`) in CI; see the opt-in recipe below. - -### Local reproduction - -The `unit_test_asan` preset is the source of truth. Same flags, same runtime -options as CI: - -```bash -cmake --preset unit_test_asan -cmake --build --preset unit_test_asan -ctest --preset unit_test_asan -``` - -Or, in the devenv shell, the composite script: - -```bash -run_asan_tests -``` - -Sanitizer flags (`-fsanitize=address,undefined -fno-sanitize=function --fno-omit-frame-pointer -fno-sanitize-recover=all -g -O1`) propagate to every -target in the link graph via the configure preset — there is no opt-in per -target. - -Runtime options the test preset sets: - -- `ASAN_OPTIONS=detect_leaks=0:abort_on_error=1:halt_on_error=1:strict_string_checks=1:check_initialization_order=1` -- `UBSAN_OPTIONS=print_stacktrace=1:halt_on_error=1` - -`halt_on_error=1` plus `-fno-sanitize-recover=all` means the **first** finding -aborts the test binary — earlier tests must run cleanly to surface later ones. -When triaging multiple unrelated failures, isolate by running individual test -binaries from `build/unit_test_asan/test/unit/...` directly. - -### macOS toolchain requirement (LLVM ≥ 22) - -macOS 26.4 changed the dyld shared-cache layout in a way that hangs -AddressSanitizer startup — `__asan_init` livelocks before `main()` (zero output, -~100% CPU) — for any compiler-rt **≤ 21.1.8**, which is the nixpkgs Darwin -default that `pkgs.clang` would otherwise provide. The upstream fix (LLVM -PR #182943, backported to `release/22.x`) ships in **LLVM ≥ 22**, so the devenv -`run_asan_tests` and `ci` scripts pin the ASan compiler to clang 22 (the -`nixpkgs-llvm22` input → `asanClang` in `devenv.nix`). The normal `gcc` build -and CI (Linux / apt-clang) are unaffected. - -Running ASan outside devenv on macOS? Use clang ≥ 22, or Apple Command Line -Tools ≥ 26.5 (Apple backported the same fix into their clang 21). Apple CLT -≤ 26.3 will hang. - -### Opt-in LeakSanitizer recipe - -LSan is staged separately because it requires a cleanup convention every test -honours; see #82 for the umbrella. To run a single test or directory under LSan -during incremental cleanup work, override `detect_leaks` at the call site: - -```bash -ASAN_OPTIONS="detect_leaks=1:abort_on_error=1:halt_on_error=1" \ - build/unit_test_asan/test/unit//UnitTest -``` - -For broader recon (e.g. surveying which tests currently leak), prefer the -valgrind-based recipe in `docs/superpowers/tools/lsan-recon/` — it produces -reproducible, fully-attributed per-test reports. - -## Allocation Locality - -Only `src/userApi/` may call `malloc`, `calloc`, `realloc`, or `free` directly. All other code (sub-layers under `src/`, tests under `test/`) must route allocations through `reserveMemory` and `freeReservedMemory` in `src/userApi/StorageApi.{c,h}`. - -Why: -- MCU stack overflows are silent killers; routing through StorageApi keeps stack usage predictable and small. -- Reviewers know exactly where to look for memory issues: `src/userApi/`. -- A future handle-based allocator can subsume the entire allocation surface in one API change instead of touching every call site. - -Enforcement: -- A CI job (`alloc-locality` in `.github/workflows/ci.yml`) runs `git grep` against `src/` and `test/` (excluding `src/userApi/`) and fails the build on any match. Comments are excluded from the match. -- Exceptions: none today. If a use-case arises that genuinely needs a direct alloc primitive outside `src/userApi/`, escalate via a PR comment so the rule itself can be revisited. - -## Test memory discipline - -Unit tests in `test/unit/**` follow a tiered idiom for memory cleanup. The -tier boundary is mechanical: tests that contain no `*Init*` calls (i.e., -purely stack-allocated `tensor_t`/`shape_t`/`quantization_t` designated -initializers) stay in the **stack-only tier** and need no cleanup. Any test -that calls `*Init*` (= heap allocation through `reserveMemory`) is in the -**heap tier** and follows three rules. - -### Rule 1 — Build via the post-#106 primitives - -Heap tensors are built by: - -```c -size_t *dims = reserveMemory(N * sizeof(size_t)); -/* ... populate dims[i] ... */ -size_t *order = reserveMemory(N * sizeof(size_t)); -setOrderOfDimsForNewTensor(N, order); -shape_t *s = reserveMemory(sizeof(shape_t)); -setShape(s, dims, N, order); -tensor_t *t = initTensor(s, quantizationInitFloat(), NULL); -tensorFillFromFloatBuffer(t, src, count); /* or initDistribution(t, &d); */ -``` - -The deprecated `tensorInitFloat` / `tensorInitSymInt32` / `tensorInit*` -family must not be used in new tests. Their attributes emit -`-Wdeprecated-declarations` to surface accidental adoption. - -A file-local factory like `makeFloatTensorForDistTest` in -`test/unit/tensor/UnitTestTensorApi.c` is fine when 3+ tests in the same -file repeat the construction. A *cross-file* helper is deferred until 3+ -test files repeat the same construction. - -### Rule 2 — Free in reverse-init order - -`freeTensor` cascades to data + shape (with its dims and order blocks) + -quantization + sparsity + the tensor struct itself. Do not call -`freeShape` or `freeQuantization` on a shape/quantization that was already -consumed by `initTensor` — that is a double-free. The cascade table: - -| Allocation | Cleanup call | Cascades to | -|-------------------------------------------|----------------------|-------------------------------------| -| `initTensor(s, q, sp)` | `freeTensor(t)` | data, shape (+dims, +order), q, sp | -| `parameterInit(p, g)` | `freeParameter(par)` | param tensor + grad (if non-NULL) | -| `linearLayerInitLegacy(...)` | `freeLinearLayerLegacy(l)` | layer config wrapper only | -| `reluLayerInitLegacy(...)` | `freeReluLayerLegacy(l)` | layer config wrapper only | -| `softmaxLayerInit(...)` | `freeSoftmaxLayer(l)`| layer config wrapper only | -| `sgdMCreateOptim(...)` | `freeOptimSgdM(o)` | all registered parameters + states | -| `inference(...)` (returns `tensor_t *`) | `freeTensor(out)` | as above | -| `inferenceWithLoss(...)` | `freeInferenceStats` | stats struct + output tensor | -| `calculateGradsSequential(...)` | `freeTrainingStats` | stats struct | - -Layer free-functions release only the config wrapper, not the parameters -they reference. When an optimizer is in play, `freeOptimSgdM` takes -ownership of the parameter cleanup — do not also call `freeParameter` on -the same pointers. - -### Rule 3 — Assert-last (capture, free, assert) - -ODT's Unity build defines `UNITY_INCLUDE_SETJMP`, so a failing -`TEST_ASSERT_*` longjmps out of the test function and any code after it -does not run. To keep LSan output meaningful — failing tests should still -report zero leaks attributable to the test fixture — every heap-tier test -follows this three-block shape: - -```c -void testFoo(void) { - /* 1. Build heap fixtures (Rule 1). */ - quantization_t *q = quantizationInitFloat(); - /* ... etc ... */ - - /* 2. Exercise the system, capture every assertion value into a - * stack local. Do not assert here. */ - float capturedLoss = inferenceWithLoss(model, ...)->loss; - /* (capture more if needed) */ - - /* 3. Free in reverse-init order (Rule 2). */ - freeTensor(t); - /* ... etc ... */ - - /* 4. Assert on the captured locals. */ - TEST_ASSERT_FLOAT_WITHIN(1e-4f, EXPECTED_LOSS, capturedLoss); -} -``` - -Reference exemplars in the tree: `test/unit/userAPI/UnitTestInferenceApi.c`, -`test/unit/userAPI/UnitTestMultiLayerTraining.c`, -`test/unit/tensor/UnitTestTensorApi.c::testInitDistribution_*`. - -### Verification - -A test file is considered idiom-compliant when, run under valgrind in the -`odt-lsan-recon:2026-04-22` Docker image with -`--leak-check=full --show-leak-kinds=all`, all four LEAK SUMMARY -categories report 0 bytes in 0 blocks (or valgrind emits "All heap blocks -were freed -- no leaks are possible"). The reproducible recipe and -container Dockerfile live in `docs/superpowers/tools/lsan-recon/`. - -## Build-time gold-value generators (CMake + uv + PyTorch) - -Some unit tests compare C-side numerics against PyTorch reference values. The -references are not committed: a Python script in the test directory emits a C -header (`expected_*.h`) at build time, which the test then `#include`s. - -The wiring lives in `test/unit//CMakeLists.txt`: - -```cmake -add_custom_command( - OUTPUT ${GEN_HEADER} - COMMAND uv run ${CMAKE_CURRENT_SOURCE_DIR}/generate_expected_.py - --out ${GEN_HEADER} - DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/generate_expected_.py - VERBATIM -) -add_custom_target(generate_expected_ DEPENDS ${GEN_HEADER}) -add_dependencies(UnitTest generate_expected_) -target_include_directories(UnitTest PRIVATE ${CMAKE_CURRENT_BINARY_DIR}) -``` - -Reference exemplars: -`test/unit/arithmetic/generate_expected_conv1d_kernel.py`, -`test/unit/arithmetic/generate_expected_conv_transpose_1d_kernel.py`. - -### Generator-script conventions - -- Use `repr(v) + "f"` to format C float literals, **not** `f"{v:.9g}"`. - `repr` always preserves a decimal point or exponent, so `10.0f` stays valid. - `:.9g` produces `10` and the trailing `f` then makes it an invalid integer - suffix that gcc rejects. -- Self-check fixtures with `assert torch.allclose(...)` before emitting them, - so generator-side numerical drift fails the build instead of silently - shifting expected values. -- `torch` and `torchvision` are declared as direct dependencies in - `pyproject.toml`. The decoupling is intentional: generator scripts - import `torch` directly, so the dependency belongs at the project - level rather than inherited from `elasticai-creator`. - -### CI implication: every job that runs `cmake --build` MUST install uv - -The custom command above is invoked by ninja during the build phase, not by -configure. Any CI job that produces or runs targets depending on a generated -header must therefore have `uv` on `PATH` at build time. In -`.github/workflows/ci.yml` this is `c-build-and-test` and -`c-asan-build-and-test`; both install uv via `astral-sh/setup-uv@v6` and -`uv sync` before `cmake --preset ...`. - -Locally this is silent: `devenv.nix` puts `uv` on `PATH` for the whole shell, -so `cmake --build` finds it without any explicit setup. CI is stricter and -catches drift here before merge. - -When introducing a new generator under a new test target, audit every CI job -that builds the affected preset and add the uv setup steps if missing. - -## Loss API: microbatch contracts - -Each loss function in `src/loss_functions/` exposes: - -- `forward(modelOutput, label, reduction) → float` -- `backward(modelOutput, label, result) → void` -- `computeMeanScale(totalSamples, modelOutput) → float` - -### Reduction split - -`lossConfig_t.backwardReduction` is the user's training-strategy choice — it -drives whether `scaleOptimizerGradients` runs between `trainingBatchDefault` -and `optimFns.step`. It is a config field. - -`forwardReduction` is a per-call parameter on every aggregator -(`trainingBatchDefault`, `evaluationBatch`, `evaluationEpoch`, `inferenceWithLoss`, -`calculateGradsFn_t`). It controls how the per-microbatch loss value is -reported. `trainingRun` is the only function that hardcodes it -(to `REDUCTION_MEAN`) so train and eval losses are comparable; lower-level -callers pick freely. - -### Microbatch shape - -`modelOutput->shape->dimensions[0]` is the microbatch dimension `B`. For -`B=1` today, output shape is `[F]` (the leading 1 is implicit). For `B>=1` -in the future, output shape is `[B, F]` and `numFeaturesPerSample = numElements / B`. - -**Uniform-B assumption** (DataLoader contract): all microbatches in one -macro batch have equal `B`. The MEAN aggregator divides by total samples -(`Σ batch->size`) rather than by `(numberOfBatches × B)`, so non-uniform B -would skew the mean. ODT's DataLoader currently always produces uniform -batches via `dropLast=true`; non-uniform B is out of contract. - -### Backward macro-scaling - -Backward writes raw per-element gradients (`2(o-l)` for MSE, `(p-y)` for CE). -The macro-batch divisor lives at the optimizer: - -- `lossFunctions[lossConfig.funcType].computeMeanScale(N, modelOutput)` - returns the PyTorch-parity divisor (`1/(N*F)` for MSE, `1/N` for CE). -- `scaleOptimizerGradients(optimizer, factor)` multiplies every parameter's - `grad` field by the factor in place. -- `trainingEpochDefault` calls these between accumulation and `step`, - but only when `backwardReduction == REDUCTION_MEAN`. - -For SUM (or future per-sample weighted variants — see #150), the backward -gradient flows through unscaled. - -### Shape assertion (deferred) - -Runtime assertion of the `dimensions[0] >= 1` contract is deferred to the -microbatch-B>1 umbrella (#152) — specifically #153. Today (B=1 only) the -assertion would be effectively a no-op; the protective value materialises -when B>1 becomes a real feature target. - -## Quantized gradient accumulation — known precision Open Problem - -As of the quantized-gradient prerequisite (`gradInit`, 2026-06-05) a trainable -layer's parameter gradient can be stored in the dtype its `backwardMath` -declares. For SYM_INT32 grads, the per-microbatch accumulation reuses the -existing `addSymInt32TensorsInplace` ("strategy A", dynamic-rescale): it -dequantizes both the running grad and the new microbatch grad to float, adds, -and re-quantizes the running sum to a new absmax-derived scale **on every -microbatch**. - -This is functionally correct end-to-end today, but **not** numerically ideal: - -- Quantization noise compounds with the number of microbatches M. -- The running-sum absmax is pinned by the heaviest microbatch, coarsening the - LSB for the accumulated small-gradient mass. - -Preliminary characterization (internal simulation, M=100, N=64, σ=1e-3 with a -10% ×50 heavy tail — *problem characterization only, not a basis for a chosen -solution*): - -| Strategy | Final rel. error vs float64 | Float-free? | -|---|---|---| -| A — dynamic-rescale (current) | ~1.5e-4, **grows with M** (2.0e-5 @ step1 → 1.7e-4 @ step100) | No | -| B — fixed-scale integer accum | ~9.9e-5 | Yes | -| C — float accum, quantize-at-read | ~2.2e-5 | No | - -We deliberately ship strategy A now and do **not** adopt B/C or any homegrown -numerical scheme. The resolution path is a literature review (stochastic-rounding -accumulators, error-feedback / residual accumulation, higher-precision master -grads, block/group scaling, …) → implement or improve a **published** technique. -Tracked as a separate research task (#218). This note is intentionally public -(not buried in a private spec) so contributors hitting accuracy issues in -quantized training know this is a known, expected limitation rather than a bug. - -### Two accumulation schemes in-tree (both intentional) - -- **Strategy A (dynamic-rescale)** — Linear SYM weight grads and LayerNorm - gamma/beta grads: per-microbatch `addSymInt32TensorsInplace` (dequantize - both operands with their own scales, float-add, requantize the running sum - to a fresh absmax scale). Not float-free. -- **Fixed-scale integer accumulation** — Linear SYM bias grads - (`linearCalcBiasGradsSymInt32`): increments are rescaled into the running - grad's EXISTING scale and added in integer arithmetic; the scale is never - re-derived during accumulation. The coarser resolution (LSB pinned by the - running scale, which inits to 1.0) is inherent to the scheme. - - **Attribution note:** this fixed-scale integer bias-GRADIENT accumulation is - ODT's own construction and is NOT prescribed by Deutel et al. - (arXiv:2407.10734). The paper's quantization is *dynamic*: scales are - re-derived from observed data — weights every SGD update (Eqs. 6-7) — and the - method is framed throughout as "dynamic adaptation of the zero-point and - scale parameters" (Sec. IV-E). The paper has a forward bias (int32 bias on - the int32 MAC accumulator, Fig. 2) but describes no bias-*gradient* - accumulation, and it nowhere states that any scale is held static *during - training* (the only static/PTQ mention is post-training, at deployment) — so - absent evidence to the contrary, assume its scales are dynamic. ODT's - fixed-scale bias-grad scheme, which never re-derives the scale during - accumulation, therefore DEVIATES from the paper's dynamic scaling; the ODT - scheme that corresponds to Deutel is Strategy A (dynamic-rescale, above). - What ODT also follows from Deutel: per-layer error requant (~Eq. 4) and the - float-space SGD step (~Eqs. 5-7). Scheme choice + the init-scale resolution - limit: #218. - -This is a research framework: deliberate scheme differences like this one -MUST be documented here, so experimental design stays separable from -accidental inconsistency. LayerNorm uses strategy A for BOTH gamma and beta -per the 2026-06-05 LayerNorm spec. - -## SYM_INT32 seed-rescale + the #189 guard - -A SYM_INT32 parameter that must enter an integer accumulator at a *different* -scale — the forward bias seed (Matmul today; Conv when #45 lands) and the -LayerNorm affine beta seed — is converted via `rescaleIntoAccumulatorScale` -(`src/arithmetic/Rounding.c`): `seed = round(param_q * param_scale / -accumulator_scale)`. The `float -> int32` cast is data-dependent and is UB on -overflow (#189); the helper guards it NaN-robustly (`!(x <= T)`, reserving one -worst-case int16 product `32768*32767` of headroom) under `-DODT_SEED_GUARD` -(default ON; a future MCU/release build disables it, with UBSan #204 covering -occurrences). All seed-rescale sites route through this one helper. - -This refold is deliberate, not a wart: it holds the real-valued bias **constant** -under ODT's dynamic per-input activation scaling. A fixed integer added raw -(`seed = b_int`, ignoring the bias scale) would apply the bias at -`s_acc / s_bias` of its value (≈0.01-0.05% on real layers — effectively deleting -it) and make it co-scale with input magnitude; the refold recomputes the seed -each forward (`∝ 1/s_acc`) so the bias stays a constant offset. The bias stays -SYM_INT32 (never a float master — the optimizer is single-dtype); a wide -raw-integer bias (qMaxBits=32, scale=1) would need a structurally different -scheme and is out of scope. - -## Conv1d / Conv1dTransposed SYM_INT32 (#45) - -Two integer sliding-window cores live in `src/arithmetic/`, siblings of the -FLOAT kernels with identical loop nest + `SlidingWindow1d` geometry: - -- `conv1dKernelSymInt32` — gather forward; Conv1d forward, and Conv1dTransposed's - `dx` adjoint in PR3. -- `convTranspose1dKernelSymInt32` — scatter forward; Conv1d's `dx` adjoint, and - Conv1dTransposed's forward in PR3. - -Both emit **raw accumulator-range int32 mantissas** at output scale `s_in·s_w` -(NOT range-restored). An explicitly-chained Quantization layer (#192) restores -the operand width downstream — the same contract as Linear/LayerNorm. Per-output- -channel bias is refolded into the product scale via `rescaleIntoAccumulatorScale` -(the #189 guarded helper); never raw-added. - -Conv1d backward dispatches on **three independent qConfigs** (`weightGradQ`, -`biasGradQ`, `propLossQ`), like `linearBackward`: - -- **weightGrad (SYM)** = strategy A: integer gather into a fresh `reserveMemory` - intermediate at scale `s_loss·s_in`, then `addSymInt32TensorsInplace` into the - SYM grad accumulator (fresh absmax scale). -- **biasGrad (SYM)** = an int32 `(batch × outputLength)` accumulator per output - channel, then `rescaleIntoAccumulatorScale(sum, s_loss, s_bg, mode)` at the - bias-grad's fixed scale (the #218 scheme). -- **dx / propLoss (SYM)** = `convTranspose1dKernelSymInt32(lossGrad, weights)`, - scale `s_loss·s_w`, guarded by the #187 fail-fast if `propLoss` is not SYM. - -### Operand bit-width: int12, not int16 (int32-accumulator soundness) - -SYM kernels accumulate **products** of operands in an **int32** accumulator (no -int64 — hard rule). For symmetric `b`-bit operands each product is ≤ 2^(2b−2), -so an int32 accumulator (~2^31) holds only ~2^(33−2b) worst-case product terms -before signed overflow (UB): - -| operand width | max product | int32 term at which overflow first occurs | -|---|---|---| -| int16 (qMaxBits=16) | 2^30 | 2 | -| int12 (qMaxBits=12) | 2^22 | 512 | -| int8 (qMaxBits=8) | 2^14 | 131072 | - -The number of worst-case terms that still **fit** is one less: int16 survives 1, -int12 survives **511**, int8 survives 131071 — i.e. int12 is sound for reductions -of length **N ≤ 511** (`512·2^22 = 2^31 > INT32_MAX`). - -int16×int16→int32 is **unsound for product-accumulation** (forward, dx, -weightGrad) — it overflows after ~2 full-scale terms; it is sound only for -*value* sums (biasGrad). Conv SYM therefore uses **int12 operands** -(`quantizationInitSymInt32WithBits(rm, 12)`): products ≤ 2047² ≈ 4.2e6, ~512-term -int32 headroom — ample for the batch=1 MCU regime ODT targets, matching the -low-bit×low-bit→int32 arithmetic of the Deutel FQT paper (arXiv:2407.10734) / -TFLite. The **grad accumulators stay int16** (wider accumulator, free since SYM -stores int32 regardless of qMaxBits). The **kernels are bit-width-agnostic** — -only the quantization configs change; the int32 accumulator (no int64) is kept. - -**Realized framework-wide int12 contract (PR-A, #227):** - -- The SYM_INT32 **operand** default is int12 via the compile-time knob - `ODT_SYM_OPERAND_QMAXBITS` (=12), set in `initSymInt32QConfig` - (`src/tensor/include/Quantization.h`). Override per-build with - `-DODT_SYM_OPERAND_QMAXBITS=N` (e.g. =8 for layers wider than 511). -- `matmulIntCore` (Linear forward / propLoss / weightGrad) and the LayerNorm - **affine product** now run on int12 operands, enforced by op-entry guards - (`matmulValidateSymOperand` at both Matmul SYM entries; - `layerNormValidateSymTensor` lowered to the knob). LayerNorm's per-group - mantissa-sum is a value-sum and stays sound at any qMaxBits ≤ 16. -- **Grad accumulators stay int16** via `ODT_SYM_GRAD_QMAXBITS` (=16), pinned - in `gradInitSymInt32` (`getQLike` preserves the source width). They are - value-sums; wider is free. -- int12 is sound only for reductions **N ≤ 511**; the runtime N-vs-budget check - is a deferred follow-up. The #189 policy (release runs free, CI UBSan #204) - backstops residual overflow. -- Note: the conv weightGrad product mixes an int12 input with an int16 grad - operand under the #218 grad-accumulator scheme — its budget is governed by - #218/#45, not closed by this operand flip. -- The unit-test gold suite validates the **default** int12/int16 contract - (`ODT_SYM_OPERAND_QMAXBITS=12`, `ODT_SYM_GRAD_QMAXBITS=16`); building with a - knob override (e.g. `-DODT_SYM_OPERAND_QMAXBITS=8`) diverges from those gold - fixtures, which is expected and intentional. - -The training loop (`CalculateGradsSequential.c`) allocates grad/activation -tensors from the **forward** qConfig, not the backward qConfigs — so a full-SYM -chain needs each layer's `propLossQ` to agree with the forward-derived grad dtype -(else the #187 guard fires), exactly as for Linear. The Conv→Quant→…→MSE chain -wiring + FLOAT32-twin convergence check is PR3. - -### Conv1dTransposed SYM_INT32 (PR3) - -Conv1dTransposed is Conv1d's adjoint with roles swapped, so it reuses BOTH PR2 -cores — no new kernels: - -- **forward** = `convTranspose1dKernelSymInt32` (the scatter core; its internal - per-channel bias-seed refold gives ConvT bias for free). Pass `outputPadding`. -- **dx / propLoss** = `conv1dKernelSymInt32` (the gather core, the VALID adjoint), - guarded by the #187 fail-fast if `propLoss` is not SYM_INT32. -- **weightGrad** = strategy A: a scatter-style integer gather (ConvT weight layout - `[Cin, Cout/groups, K]`, index `(ic·outChPerGroup + ocOffset)·K + k`) into a fresh - `reserveMemory` int32 intermediate at scale `s_in·s_loss`, then - `addSymInt32TensorsInplace` into the SYM grad accumulator. -- **biasGrad** = the same fixed-scale refold as Conv1d (`rescaleIntoAccumulatorScale` - over the `batch × outputLength` int32 sum). - -Backward dispatches on three independent qConfigs (`weightGradQ`/`biasGradQ`/ -`propLossQ`), like `conv1dBackward`/`linearBackward`. Operands are int12, grad -accumulators int16, accumulators int32 — no int64. Conv1dTransposed is VALID-only -(Phase 1), so the adjoint never hits a SAME/EXPLICIT padLeft. - -### Validator (PR3) - -`producerForwardQ` (`ModelValidationApi.c`) now returns the conv layer's `forwardQ` -for CONV1D and CONV1D_TRANSPOSED, bringing SYM-producing conv layers under the -int16 inter-layer contract: a SYM conv producer must be followed by a Quantization -layer (or sit in the last position). - -### SYM training chains - -The training loop allocates every grad/activation tensor from the FORWARD output -qConfig (`initGradTensor`), so a uniformly-SYM chain (every `forwardQ` SYM_INT32) -makes every grad tensor SYM_INT32 and every layer's `propLossQ` match — the #187 -guard passes. SYM-trainable conv layers are built via the low-level -`initConv1dTransposedConfigWithWeightsAndBias` with SYM `parameter_t`s (the -high-level factory keeps grads FLOAT32, matching the Linear KAIMING factory). -`Conv1dTransposed → Quant → MSE` trains under -`calculateGradsSequential` + `sgdStepM(SYM_INT32)`. - -## SYM ↔ * conversion bridge (#227) - -`SYM` is the sub-byte bit-packed **storage** dtype; `SYM_INT32` is the int32-slot -**compute** dtype. The MCU lifecycle is store-packed (`SYM`) → unpack to int32 -(`SYM_INT32`) → compute → repack. `conversionMatrix` -(`src/tensor/TensorConversion.c`) fills these cells: PR-B implements the **unpack -row** (`SYM → {SYM_INT32, FLOAT32, INT32, ASYM}`); the pack column (`* → SYM`) is -PR-C. - -**Sign-extend on unpack.** `byteConversion` is a pure bit-copy that ZERO-FILLS on -widen, so a packed signed mantissa (e.g. `−3` at qBits=6 = `0b111101`) would read -back as `61`. Every `SYM →` cell routes through the shared -`unpackSignExtend(src, srcBits, dst, n)` helper, which widens then sign-extends the -two's-complement payload from `srcBits` (`(v ^ signBit) − signBit`). ASYM codes are -non-negative, so the ASYM **pack** path does not sign-extend. - -**`int_repr` vs `dequantize` (deliberate, documented asymmetry).** A conversion -whose destination is `INT32` emits the integer **codes** and drops the scale -(`int_repr`); a conversion whose destination is `FLOAT32` emits the **values** with -the scale applied (`dequantize`). This mirrors PyTorch `int_repr()` vs -`dequantize()` and is consistent across both source dtypes: `SYM → INT32` and -`SYM_INT32 → INT32` are both `int_repr`; `SYM → FLOAT32` and `SYM_INT32 → FLOAT32` -are both `dequantize`. No value-rounding `→INT32` variant exists (YAGNI; -near-useless for `scale ≪ 1`). - -**Rescale on the symmetric↔asymmetric transition.** `SYM → ASYM` always rescales -(dequantize → derive a fresh asym `scale`+`zeroPoint` from min/max → requantize → -pack): a symmetric code grid cannot hold an off-center `+zeroPoint` band at the -carried scale, independent of width. - -**Asymmetric quantization convention (#243).** Every `* → ASYM` cell builds a float -buffer (from its own preamble) and routes through one shared helper, -`quantizeFloatToAsym` (`src/tensor/TensorConversion.c`) — the single source of truth. -Standard affine: `scale = (max − min) / (2^qBits − 1)`, `zeroPoint = round(min/scale)`, -`code = clamp(round(v/scale − zeroPoint), 0, 2^qBits − 1)` (HALF_AWAY). Dequant is -`(code + zeroPoint)·scale` — note the **additive** `zeroPoint` (ODT's sign convention, -the inverse of PyTorch's `q − zeroPoint`). A constant tensor (`min == max`) uses -`scale = (min != 0) ? |min| : 1` to avoid divide-by-zero. The denominator is -`2^qBits − 1`, **not** `2^qBits` — the latter is an off-by-one that leaves the top code -unreachable. New asym-producing converters MUST call this helper and never re-derive the -grid inline: hand-rolled copies are exactly how the four `*→ASYM` converters drifted -before #243. The float→SYM pack sibling is `packFloatBufferAsSym`. +Contributor conventions for OnDeviceTraining. Detailed per-subsystem conventions +live under `docs/conventions/`; this file is the index and the cross-cutting +vision. (Claude sessions receive each subsystem's conventions +path-scoped automatically via `.claude/rules/`.) ## Vision: memory over float accuracy @@ -570,3 +12,20 @@ may be deliberately inaccurate with no float-matching — that is by design, not a defect. FLOAT32-twin comparisons are a **ballpark sanity check**, not a tight acceptance gate; SYM acceptance is "trains and converges to a useful model". This does not license UB — overflow/garbage is still a bug (hence the #189 guard). + +## Subsystem conventions + +- [`conventions/tensor.md`](conventions/tensor.md) — `SYM_INT32` is a compute + format, not storage (#261); the `SYM ↔ *` conversion bridge (#227). +- [`conventions/arithmetic-sym.md`](conventions/arithmetic-sym.md) — #189 + seed-rescale guard; Conv1d/Conv1dTransposed SYM_INT32 (#45); the int12-operand / + int32-accumulator contract (no int64); the quantized grad-accumulation open + problem (#218). +- [`conventions/loss.md`](conventions/loss.md) — loss forward/backward/reduction + microbatch contracts; where the macro-batch divisor lives. +- [`conventions/allocation.md`](conventions/allocation.md) — allocation locality + (alloc primitives only in `src/userApi/`; everything else via StorageApi). +- [`conventions/testing.md`](conventions/testing.md) — sanitizer gating; heap-tier + test memory discipline; build-time gold-value generators. +- [`conventions/data-shape.md`](conventions/data-shape.md) — datasets deliver the + natural geometric shape; reshape/flatten is the first model layer. diff --git a/docs/conventions/allocation.md b/docs/conventions/allocation.md new file mode 100644 index 0000000..ea8ccc3 --- /dev/null +++ b/docs/conventions/allocation.md @@ -0,0 +1,15 @@ +# Allocation locality + +## Allocation Locality + +Only `src/userApi/` may call `malloc`, `calloc`, `realloc`, or `free` directly. All other code (sub-layers under `src/`, tests under `test/`) must route allocations through `reserveMemory` and `freeReservedMemory` in `src/userApi/StorageApi.{c,h}`. + +Why: +- MCU stack overflows are silent killers; routing through StorageApi keeps stack usage predictable and small. +- Reviewers know exactly where to look for memory issues: `src/userApi/`. +- A future handle-based allocator can subsume the entire allocation surface in one API change instead of touching every call site. + +Enforcement: +- A CI job (`alloc-locality` in `.github/workflows/ci.yml`) runs `git grep` against `src/` and `test/` (excluding `src/userApi/`) and fails the build on any match. Comments are excluded from the match. +- Exceptions: none today. If a use-case arises that genuinely needs a direct alloc primitive outside `src/userApi/`, escalate via a PR comment so the rule itself can be revisited. + diff --git a/docs/conventions/arithmetic-sym.md b/docs/conventions/arithmetic-sym.md new file mode 100644 index 0000000..bae1d1f --- /dev/null +++ b/docs/conventions/arithmetic-sym.md @@ -0,0 +1,222 @@ +# Arithmetic & SYM_INT32 kernels + +Conventions for the integer-math path: `src/arithmetic/**` and the SYM kernels of +`src/layer/{Conv1d,Conv1dTransposed,Linear,LayerNorm}*`. Path-scoped for Claude +via `.claude/rules/arithmetic-sym.md`. + +## SYM_INT32 seed-rescale + the #189 guard + +A SYM_INT32 parameter that must enter an integer accumulator at a *different* +scale — the forward bias seed (Matmul today; Conv when #45 lands) and the +LayerNorm affine beta seed — is converted via `rescaleIntoAccumulatorScale` +(`src/arithmetic/Rounding.c`): `seed = round(param_q * param_scale / +accumulator_scale)`. The `float -> int32` cast is data-dependent and is UB on +overflow (#189); the helper guards it NaN-robustly (`!(x <= T)`, reserving one +worst-case int16 product `32768*32767` of headroom) under `-DODT_SEED_GUARD` +(default ON; a future MCU/release build disables it, with UBSan #204 covering +occurrences). All seed-rescale sites route through this one helper. + +This refold is deliberate, not a wart: it holds the real-valued bias **constant** +under ODT's dynamic per-input activation scaling. A fixed integer added raw +(`seed = b_int`, ignoring the bias scale) would apply the bias at +`s_acc / s_bias` of its value (≈0.01-0.05% on real layers — effectively deleting +it) and make it co-scale with input magnitude; the refold recomputes the seed +each forward (`∝ 1/s_acc`) so the bias stays a constant offset. The bias stays +SYM_INT32 (never a float master — the optimizer is single-dtype); a wide +raw-integer bias (qMaxBits=32, scale=1) would need a structurally different +scheme and is out of scope. + +## Conv1d / Conv1dTransposed SYM_INT32 (#45) + +Two integer sliding-window cores live in `src/arithmetic/`, siblings of the +FLOAT kernels with identical loop nest + `SlidingWindow1d` geometry: + +- `conv1dKernelSymInt32` — gather forward; Conv1d forward, and Conv1dTransposed's + `dx` adjoint in PR3. +- `convTranspose1dKernelSymInt32` — scatter forward; Conv1d's `dx` adjoint, and + Conv1dTransposed's forward in PR3. + +Both emit **raw accumulator-range int32 mantissas** at output scale `s_in·s_w` +(NOT range-restored). An explicitly-chained Quantization layer (#192) restores +the operand width downstream — the same contract as Linear/LayerNorm. Per-output- +channel bias is refolded into the product scale via `rescaleIntoAccumulatorScale` +(the #189 guarded helper); never raw-added. + +Conv1d backward dispatches on **three independent qConfigs** (`weightGradQ`, +`biasGradQ`, `propLossQ`), like `linearBackward`: + +- **weightGrad (SYM)** = strategy A: integer gather into a fresh `reserveMemory` + intermediate at scale `s_loss·s_in`, then `addSymInt32TensorsInplace` into the + SYM grad accumulator (fresh absmax scale). +- **biasGrad (SYM)** = an int32 `(batch × outputLength)` accumulator per output + channel, then `rescaleIntoAccumulatorScale(sum, s_loss, s_bg, mode)` at the + bias-grad's fixed scale (the #218 scheme). +- **dx / propLoss (SYM)** = `convTranspose1dKernelSymInt32(lossGrad, weights)`, + scale `s_loss·s_w`, guarded by the #187 fail-fast if `propLoss` is not SYM. + +### Operand bit-width: int12, not int16 (int32-accumulator soundness) + +SYM kernels accumulate **products** of operands in an **int32** accumulator (no +int64 — hard rule). For symmetric `b`-bit operands each product is ≤ 2^(2b−2), +so an int32 accumulator (~2^31) holds only ~2^(33−2b) worst-case product terms +before signed overflow (UB): + +| operand width | max product | int32 term at which overflow first occurs | +|---|---|---| +| int16 (qMaxBits=16) | 2^30 | 2 | +| int12 (qMaxBits=12) | 2^22 | 512 | +| int8 (qMaxBits=8) | 2^14 | 131072 | + +The number of worst-case terms that still **fit** is one less: int16 survives 1, +int12 survives **511**, int8 survives 131071 — i.e. int12 is sound for reductions +of length **N ≤ 511** (`512·2^22 = 2^31 > INT32_MAX`). + +int16×int16→int32 is **unsound for product-accumulation** (forward, dx, +weightGrad) — it overflows after ~2 full-scale terms; it is sound only for +*value* sums (biasGrad). Conv SYM therefore uses **int12 operands** +(`quantizationInitSymInt32WithBits(rm, 12)`): products ≤ 2047² ≈ 4.2e6, ~512-term +int32 headroom — ample for the batch=1 MCU regime ODT targets, matching the +low-bit×low-bit→int32 arithmetic of the Deutel FQT paper (arXiv:2407.10734) / +TFLite. The **grad accumulators stay int16** (wider accumulator, free since SYM +stores int32 regardless of qMaxBits). The **kernels are bit-width-agnostic** — +only the quantization configs change; the int32 accumulator (no int64) is kept. + +**Realized framework-wide int12 contract (PR-A, #227):** + +- The SYM_INT32 **operand** default is int12 via the compile-time knob + `ODT_SYM_OPERAND_QMAXBITS` (=12), set in `initSymInt32QConfig` + (`src/tensor/include/Quantization.h`). Override per-build with + `-DODT_SYM_OPERAND_QMAXBITS=N` (e.g. =8 for layers wider than 511). +- `matmulIntCore` (Linear forward / propLoss / weightGrad) and the LayerNorm + **affine product** now run on int12 operands, enforced by op-entry guards + (`matmulValidateSymOperand` at both Matmul SYM entries; + `layerNormValidateSymTensor` lowered to the knob). LayerNorm's per-group + mantissa-sum is a value-sum and stays sound at any qMaxBits ≤ 16. +- **Grad accumulators stay int16** via `ODT_SYM_GRAD_QMAXBITS` (=16), pinned + in `gradInitSymInt32` (`getQLike` preserves the source width). biasGrad is a + value-sum; weightGrad is a sum of products (int32 accumulate → requantize). + Whether grads should be stored SYM_INT32 at all is under redesign — #261. +- int12 is sound only for reductions **N ≤ 511**; the runtime N-vs-budget check + is a deferred follow-up. The #189 policy (release runs free, CI UBSan #204) + backstops residual overflow. +- Note: the conv weightGrad product mixes an int12 input with an int16 grad + operand under the #218 grad-accumulator scheme — its budget is governed by + #218/#45, not closed by this operand flip. +- The unit-test gold suite validates the **default** int12/int16 contract + (`ODT_SYM_OPERAND_QMAXBITS=12`, `ODT_SYM_GRAD_QMAXBITS=16`); building with a + knob override (e.g. `-DODT_SYM_OPERAND_QMAXBITS=8`) diverges from those gold + fixtures, which is expected and intentional. + +The training loop (`CalculateGradsSequential.c`) allocates grad/activation +tensors from the **forward** qConfig, not the backward qConfigs — so a full-SYM +chain needs each layer's `propLossQ` to agree with the forward-derived grad dtype +(else the #187 guard fires), exactly as for Linear. The Conv→Quant→…→MSE chain +wiring + FLOAT32-twin convergence check is PR3. + +### Conv1dTransposed SYM_INT32 (PR3) + +Conv1dTransposed is Conv1d's adjoint with roles swapped, so it reuses BOTH PR2 +cores — no new kernels: + +- **forward** = `convTranspose1dKernelSymInt32` (the scatter core; its internal + per-channel bias-seed refold gives ConvT bias for free). Pass `outputPadding`. +- **dx / propLoss** = `conv1dKernelSymInt32` (the gather core, the VALID adjoint), + guarded by the #187 fail-fast if `propLoss` is not SYM_INT32. +- **weightGrad** = strategy A: a scatter-style integer gather (ConvT weight layout + `[Cin, Cout/groups, K]`, index `(ic·outChPerGroup + ocOffset)·K + k`) into a fresh + `reserveMemory` int32 intermediate at scale `s_in·s_loss`, then + `addSymInt32TensorsInplace` into the SYM grad accumulator. +- **biasGrad** = the same fixed-scale refold as Conv1d (`rescaleIntoAccumulatorScale` + over the `batch × outputLength` int32 sum). + +Backward dispatches on three independent qConfigs (`weightGradQ`/`biasGradQ`/ +`propLossQ`), like `conv1dBackward`/`linearBackward`. Operands are int12, grad +accumulators int16, accumulators int32 — no int64. Conv1dTransposed is VALID-only +(Phase 1), so the adjoint never hits a SAME/EXPLICIT padLeft. + +### Validator (PR3) + +`producerForwardQ` (`ModelValidationApi.c`) now returns the conv layer's `forwardQ` +for CONV1D and CONV1D_TRANSPOSED, bringing SYM-producing conv layers under the +int16 inter-layer contract: a SYM conv producer must be followed by a Quantization +layer (or sit in the last position). + +### SYM training chains + +The training loop allocates every grad/activation tensor from the FORWARD output +qConfig (`initGradTensor`), so a uniformly-SYM chain (every `forwardQ` SYM_INT32) +makes every grad tensor SYM_INT32 and every layer's `propLossQ` match — the #187 +guard passes. SYM-trainable conv layers are built via the low-level +`initConv1dTransposedConfigWithWeightsAndBias` with SYM `parameter_t`s (the +high-level factory keeps grads FLOAT32, matching the Linear KAIMING factory). +`Conv1dTransposed → Quant → MSE` trains under +`calculateGradsSequential` + `sgdStepM(SYM_INT32)`. + +## Quantized gradient accumulation — known precision Open Problem + +As of the quantized-gradient prerequisite (`gradInit`, 2026-06-05) a trainable +layer's parameter gradient can be stored in the dtype its `backwardMath` +declares. For SYM_INT32 grads, the per-microbatch accumulation reuses the +existing `addSymInt32TensorsInplace` ("strategy A", dynamic-rescale): it +dequantizes both the running grad and the new microbatch grad to float, adds, +and re-quantizes the running sum to a new absmax-derived scale **on every +microbatch**. + +This is functionally correct end-to-end today, but **not** numerically ideal: + +- Quantization noise compounds with the number of microbatches M. +- The running-sum absmax is pinned by the heaviest microbatch, coarsening the + LSB for the accumulated small-gradient mass. + +Preliminary characterization (internal simulation, M=100, N=64, σ=1e-3 with a +10% ×50 heavy tail — *problem characterization only, not a basis for a chosen +solution*): + +| Strategy | Final rel. error vs float64 | Float-free? | +|---|---|---| +| A — dynamic-rescale (current) | ~1.5e-4, **grows with M** (2.0e-5 @ step1 → 1.7e-4 @ step100) | No | +| B — fixed-scale integer accum | ~9.9e-5 | Yes | +| C — float accum, quantize-at-read | ~2.2e-5 | No | + +We deliberately ship strategy A now and do **not** adopt B/C or any homegrown +numerical scheme. The resolution path is a literature review (stochastic-rounding +accumulators, error-feedback / residual accumulation, higher-precision master +grads, block/group scaling, …) → implement or improve a **published** technique. +Tracked as a separate research task (#218). This note is intentionally public +(not buried in a private spec) so contributors hitting accuracy issues in +quantized training know this is a known, expected limitation rather than a bug. + +### Two accumulation schemes in-tree (both intentional) + +- **Strategy A (dynamic-rescale)** — Linear SYM weight grads and LayerNorm + gamma/beta grads: per-microbatch `addSymInt32TensorsInplace` (dequantize + both operands with their own scales, float-add, requantize the running sum + to a fresh absmax scale). Not float-free. +- **Fixed-scale integer accumulation** — Linear SYM bias grads + (`linearCalcBiasGradsSymInt32`): increments are rescaled into the running + grad's EXISTING scale and added in integer arithmetic; the scale is never + re-derived during accumulation. The coarser resolution (LSB pinned by the + running scale, which inits to 1.0) is inherent to the scheme. + + **Attribution note:** this fixed-scale integer bias-GRADIENT accumulation is + ODT's own construction and is NOT prescribed by Deutel et al. + (arXiv:2407.10734). The paper's quantization is *dynamic*: scales are + re-derived from observed data — weights every SGD update (Eqs. 6-7) — and the + method is framed throughout as "dynamic adaptation of the zero-point and + scale parameters" (Sec. IV-E). The paper has a forward bias (int32 bias on + the int32 MAC accumulator, Fig. 2) but describes no bias-*gradient* + accumulation, and it nowhere states that any scale is held static *during + training* (the only static/PTQ mention is post-training, at deployment) — so + absent evidence to the contrary, assume its scales are dynamic. ODT's + fixed-scale bias-grad scheme, which never re-derives the scale during + accumulation, therefore DEVIATES from the paper's dynamic scaling; the ODT + scheme that corresponds to Deutel is Strategy A (dynamic-rescale, above). + What ODT also follows from Deutel: per-layer error requant (~Eq. 4) and the + float-space SGD step (~Eqs. 5-7). Scheme choice + the init-scale resolution + limit: #218. + +This is a research framework: deliberate scheme differences like this one +MUST be documented here, so experimental design stays separable from +accidental inconsistency. LayerNorm uses strategy A for BOTH gamma and beta +per the 2026-06-05 LayerNorm spec. + diff --git a/docs/conventions/data-shape.md b/docs/conventions/data-shape.md new file mode 100644 index 0000000..1af3b51 --- /dev/null +++ b/docs/conventions/data-shape.md @@ -0,0 +1,16 @@ +# Data shape convention + +## Data Shape Convention + +Datasets deliver samples in their natural geometric shape (e.g. `[C, H, W]` +for images, `[C, L]` for time series). Any `reshape`, `flatten`, or `view` +operation is the **first layer of the model**, not a preprocessing step in +the dataset. This: + +- keeps dataset code independent of downstream model topology +- allows one dataset to feed models with different input ranks +- matches the PyTorch / Keras / elastic-ai.creator IR convention, so a future + ir2c can compile each shape transform to a corresponding C layer + +For flatten-to-2D, use `flattenLayerInit()` from `FlattenApi.h`. + diff --git a/docs/conventions/loss.md b/docs/conventions/loss.md new file mode 100644 index 0000000..13a7a29 --- /dev/null +++ b/docs/conventions/loss.md @@ -0,0 +1,57 @@ +# Loss & training-loop microbatch contracts + +## Loss API: microbatch contracts + +Each loss function in `src/loss_functions/` exposes: + +- `forward(modelOutput, label, reduction) → float` +- `backward(modelOutput, label, result) → void` +- `computeMeanScale(totalSamples, modelOutput) → float` + +### Reduction split + +`lossConfig_t.backwardReduction` is the user's training-strategy choice — it +drives whether `scaleOptimizerGradients` runs between `trainingBatchDefault` +and `optimFns.step`. It is a config field. + +`forwardReduction` is a per-call parameter on every aggregator +(`trainingBatchDefault`, `evaluationBatch`, `evaluationEpoch`, `inferenceWithLoss`, +`calculateGradsFn_t`). It controls how the per-microbatch loss value is +reported. `trainingRun` is the only function that hardcodes it +(to `REDUCTION_MEAN`) so train and eval losses are comparable; lower-level +callers pick freely. + +### Microbatch shape + +`modelOutput->shape->dimensions[0]` is the microbatch dimension `B`. For +`B=1` today, output shape is `[F]` (the leading 1 is implicit). For `B>=1` +in the future, output shape is `[B, F]` and `numFeaturesPerSample = numElements / B`. + +**Uniform-B assumption** (DataLoader contract): all microbatches in one +macro batch have equal `B`. The MEAN aggregator divides by total samples +(`Σ batch->size`) rather than by `(numberOfBatches × B)`, so non-uniform B +would skew the mean. ODT's DataLoader currently always produces uniform +batches via `dropLast=true`; non-uniform B is out of contract. + +### Backward macro-scaling + +Backward writes raw per-element gradients (`2(o-l)` for MSE, `(p-y)` for CE). +The macro-batch divisor lives at the optimizer: + +- `lossFunctions[lossConfig.funcType].computeMeanScale(N, modelOutput)` + returns the PyTorch-parity divisor (`1/(N*F)` for MSE, `1/N` for CE). +- `scaleOptimizerGradients(optimizer, factor)` multiplies every parameter's + `grad` field by the factor in place. +- `trainingEpochDefault` calls these between accumulation and `step`, + but only when `backwardReduction == REDUCTION_MEAN`. + +For SUM (or future per-sample weighted variants — see #150), the backward +gradient flows through unscaled. + +### Shape assertion (deferred) + +Runtime assertion of the `dimensions[0] >= 1` contract is deferred to the +microbatch-B>1 umbrella (#152) — specifically #153. Today (B=1 only) the +assertion would be effectively a no-op; the protective value materialises +when B>1 becomes a real feature target. + diff --git a/docs/conventions/tensor.md b/docs/conventions/tensor.md new file mode 100644 index 0000000..c224034 --- /dev/null +++ b/docs/conventions/tensor.md @@ -0,0 +1,66 @@ +# Tensor — quantization dtype semantics + +Conventions for `src/tensor/**` — dtypes, quantization configs, and the +conversion matrix. Path-scoped for Claude via `.claude/rules/tensor.md`. + +## SYM_INT32 is a compute format, not storage (#261) + +`SYM_INT32` (int32 mantissa + one per-tensor float scale) is the framework's +**integer-compute** representation — the only integer-math path the kernels use. +It is **not** a storage format: it costs the same 4 bytes/element as `FLOAT32` +but is a single-scale fixed-point approximation, so as storage it is dominated by +both `FLOAT32` (same size, better fidelity — a per-value exponent keeps the small +magnitudes a single scale loses) and `SYM`/`ASYM` (which sub-byte-pack). The +integer math is a **transient**; nothing durable should be persisted `SYM_INT32` +to "save memory" — it saves nothing and adds error. + +This bites hardest for **gradients**. Persistent parameter grads should be stored +`FLOAT32` (fidelity, same size) or `SYM`/`ASYM` (real compression); the integer +step stays transient `SYM_INT32`. The only legitimate `SYM_INT32` grads are the +transient dx/agrad operand-wires during backprop (int12, freed after the pass). +That today's parameter grads are stored `SYM_INT32` (`gradInitSymInt32`, and the +SGD SYM path that dequantizes → steps in float → requantizes for no gain) is a +known conceptual gap under redesign — #261 (subsumes #203). + +## SYM ↔ * conversion bridge (#227) + +`SYM` is the sub-byte bit-packed **storage** dtype; `SYM_INT32` is the int32-slot +**compute** dtype. The MCU lifecycle is store-packed (`SYM`) → unpack to int32 +(`SYM_INT32`) → compute → repack. `conversionMatrix` +(`src/tensor/TensorConversion.c`) fills these cells: PR-B implements the **unpack +row** (`SYM → {SYM_INT32, FLOAT32, INT32, ASYM}`); the pack column (`* → SYM`) is +PR-C. + +**Sign-extend on unpack.** `byteConversion` is a pure bit-copy that ZERO-FILLS on +widen, so a packed signed mantissa (e.g. `−3` at qBits=6 = `0b111101`) would read +back as `61`. Every `SYM →` cell routes through the shared +`unpackSignExtend(src, srcBits, dst, n)` helper, which widens then sign-extends the +two's-complement payload from `srcBits` (`(v ^ signBit) − signBit`). ASYM codes are +non-negative, so the ASYM **pack** path does not sign-extend. + +**`int_repr` vs `dequantize` (deliberate, documented asymmetry).** A conversion +whose destination is `INT32` emits the integer **codes** and drops the scale +(`int_repr`); a conversion whose destination is `FLOAT32` emits the **values** with +the scale applied (`dequantize`). This mirrors PyTorch `int_repr()` vs +`dequantize()` and is consistent across both source dtypes: `SYM → INT32` and +`SYM_INT32 → INT32` are both `int_repr`; `SYM → FLOAT32` and `SYM_INT32 → FLOAT32` +are both `dequantize`. No value-rounding `→INT32` variant exists (YAGNI; +near-useless for `scale ≪ 1`). + +**Rescale on the symmetric↔asymmetric transition.** `SYM → ASYM` always rescales +(dequantize → derive a fresh asym `scale`+`zeroPoint` from min/max → requantize → +pack): a symmetric code grid cannot hold an off-center `+zeroPoint` band at the +carried scale, independent of width. + +**Asymmetric quantization convention (#243).** Every `* → ASYM` cell builds a float +buffer (from its own preamble) and routes through one shared helper, +`quantizeFloatToAsym` (`src/tensor/TensorConversion.c`) — the single source of truth. +Standard affine: `scale = (max − min) / (2^qBits − 1)`, `zeroPoint = round(min/scale)`, +`code = clamp(round(v/scale − zeroPoint), 0, 2^qBits − 1)` (HALF_AWAY). Dequant is +`(code + zeroPoint)·scale` — note the **additive** `zeroPoint` (ODT's sign convention, +the inverse of PyTorch's `q − zeroPoint`). A constant tensor (`min == max`) uses +`scale = (min != 0) ? |min| : 1` to avoid divide-by-zero. The denominator is +`2^qBits − 1`, **not** `2^qBits` — the latter is an off-by-one that leaves the top code +unreachable. New asym-producing converters MUST call this helper and never re-derive the +grid inline: hand-rolled copies are exactly how the four `*→ASYM` converters drifted +before #243. The float→SYM pack sibling is `packFloatBufferAsSym`. diff --git a/docs/conventions/testing.md b/docs/conventions/testing.md new file mode 100644 index 0000000..89faf28 --- /dev/null +++ b/docs/conventions/testing.md @@ -0,0 +1,225 @@ +# Unit-test conventions + +## Sanitizer-driven memory bug detection + +The C unit-test suite is run twice in CI: once normally (`c-build-and-test`), +and once under AddressSanitizer + UndefinedBehaviorSanitizer +(`c-asan-build-and-test`). The sanitizer job is a hard gate — any heap-OOB, +use-after-free, double-free, or UB diagnoses fails the PR. LeakSanitizer is +deliberately **off** (`detect_leaks=0`) in CI; see the opt-in recipe below. + +### Local reproduction + +The `unit_test_asan` preset is the source of truth. Same flags, same runtime +options as CI: + +```bash +cmake --preset unit_test_asan +cmake --build --preset unit_test_asan +ctest --preset unit_test_asan +``` + +Or, in the devenv shell, the composite script: + +```bash +run_asan_tests +``` + +Sanitizer flags (`-fsanitize=address,undefined -fno-sanitize=function +-fno-omit-frame-pointer -fno-sanitize-recover=all -g -O1`) propagate to every +target in the link graph via the configure preset — there is no opt-in per +target. + +Runtime options the test preset sets: + +- `ASAN_OPTIONS=detect_leaks=0:abort_on_error=1:halt_on_error=1:strict_string_checks=1:check_initialization_order=1` +- `UBSAN_OPTIONS=print_stacktrace=1:halt_on_error=1` + +`halt_on_error=1` plus `-fno-sanitize-recover=all` means the **first** finding +aborts the test binary — earlier tests must run cleanly to surface later ones. +When triaging multiple unrelated failures, isolate by running individual test +binaries from `build/unit_test_asan/test/unit/...` directly. + +### macOS toolchain requirement (LLVM ≥ 22) + +macOS 26.4 changed the dyld shared-cache layout in a way that hangs +AddressSanitizer startup — `__asan_init` livelocks before `main()` (zero output, +~100% CPU) — for any compiler-rt **≤ 21.1.8**, which is the nixpkgs Darwin +default that `pkgs.clang` would otherwise provide. The upstream fix (LLVM +PR #182943, backported to `release/22.x`) ships in **LLVM ≥ 22**, so the devenv +`run_asan_tests` and `ci` scripts pin the ASan compiler to clang 22 (the +`nixpkgs-llvm22` input → `asanClang` in `devenv.nix`). The normal `gcc` build +and CI (Linux / apt-clang) are unaffected. + +Running ASan outside devenv on macOS? Use clang ≥ 22, or Apple Command Line +Tools ≥ 26.5 (Apple backported the same fix into their clang 21). Apple CLT +≤ 26.3 will hang. + +### Opt-in LeakSanitizer recipe + +LSan is staged separately because it requires a cleanup convention every test +honours; see #82 for the umbrella. To run a single test or directory under LSan +during incremental cleanup work, override `detect_leaks` at the call site: + +```bash +ASAN_OPTIONS="detect_leaks=1:abort_on_error=1:halt_on_error=1" \ + build/unit_test_asan/test/unit//UnitTest +``` + +For broader recon (e.g. surveying which tests currently leak), prefer the +valgrind-based recipe in `docs/superpowers/tools/lsan-recon/` — it produces +reproducible, fully-attributed per-test reports. + +## Test memory discipline + +Unit tests in `test/unit/**` follow a tiered idiom for memory cleanup. The +tier boundary is mechanical: tests that contain no `*Init*` calls (i.e., +purely stack-allocated `tensor_t`/`shape_t`/`quantization_t` designated +initializers) stay in the **stack-only tier** and need no cleanup. Any test +that calls `*Init*` (= heap allocation through `reserveMemory`) is in the +**heap tier** and follows three rules. + +### Rule 1 — Build via the post-#106 primitives + +Heap tensors are built by: + +```c +size_t *dims = reserveMemory(N * sizeof(size_t)); +/* ... populate dims[i] ... */ +size_t *order = reserveMemory(N * sizeof(size_t)); +setOrderOfDimsForNewTensor(N, order); +shape_t *s = reserveMemory(sizeof(shape_t)); +setShape(s, dims, N, order); +tensor_t *t = initTensor(s, quantizationInitFloat(), NULL); +tensorFillFromFloatBuffer(t, src, count); /* or initDistribution(t, &d); */ +``` + +The deprecated `tensorInitFloat` / `tensorInitSymInt32` / `tensorInit*` +family must not be used in new tests. Their attributes emit +`-Wdeprecated-declarations` to surface accidental adoption. + +A file-local factory like `makeFloatTensorForDistTest` in +`test/unit/tensor/UnitTestTensorApi.c` is fine when 3+ tests in the same +file repeat the construction. A *cross-file* helper is deferred until 3+ +test files repeat the same construction. + +### Rule 2 — Free in reverse-init order + +`freeTensor` cascades to data + shape (with its dims and order blocks) + +quantization + sparsity + the tensor struct itself. Do not call +`freeShape` or `freeQuantization` on a shape/quantization that was already +consumed by `initTensor` — that is a double-free. The cascade table: + +| Allocation | Cleanup call | Cascades to | +|-------------------------------------------|----------------------|-------------------------------------| +| `initTensor(s, q, sp)` | `freeTensor(t)` | data, shape (+dims, +order), q, sp | +| `parameterInit(p, g)` | `freeParameter(par)` | param tensor + grad (if non-NULL) | +| `linearLayerInitLegacy(...)` | `freeLinearLayerLegacy(l)` | layer config wrapper only | +| `reluLayerInitLegacy(...)` | `freeReluLayerLegacy(l)` | layer config wrapper only | +| `softmaxLayerInit(...)` | `freeSoftmaxLayer(l)`| layer config wrapper only | +| `sgdMCreateOptim(...)` | `freeOptimSgdM(o)` | all registered parameters + states | +| `inference(...)` (returns `tensor_t *`) | `freeTensor(out)` | as above | +| `inferenceWithLoss(...)` | `freeInferenceStats` | stats struct + output tensor | +| `calculateGradsSequential(...)` | `freeTrainingStats` | stats struct | + +Layer free-functions release only the config wrapper, not the parameters +they reference. When an optimizer is in play, `freeOptimSgdM` takes +ownership of the parameter cleanup — do not also call `freeParameter` on +the same pointers. + +### Rule 3 — Assert-last (capture, free, assert) + +ODT's Unity build defines `UNITY_INCLUDE_SETJMP`, so a failing +`TEST_ASSERT_*` longjmps out of the test function and any code after it +does not run. To keep LSan output meaningful — failing tests should still +report zero leaks attributable to the test fixture — every heap-tier test +follows this three-block shape: + +```c +void testFoo(void) { + /* 1. Build heap fixtures (Rule 1). */ + quantization_t *q = quantizationInitFloat(); + /* ... etc ... */ + + /* 2. Exercise the system, capture every assertion value into a + * stack local. Do not assert here. */ + float capturedLoss = inferenceWithLoss(model, ...)->loss; + /* (capture more if needed) */ + + /* 3. Free in reverse-init order (Rule 2). */ + freeTensor(t); + /* ... etc ... */ + + /* 4. Assert on the captured locals. */ + TEST_ASSERT_FLOAT_WITHIN(1e-4f, EXPECTED_LOSS, capturedLoss); +} +``` + +Reference exemplars in the tree: `test/unit/userAPI/UnitTestInferenceApi.c`, +`test/unit/userAPI/UnitTestMultiLayerTraining.c`, +`test/unit/tensor/UnitTestTensorApi.c::testInitDistribution_*`. + +### Verification + +A test file is considered idiom-compliant when, run under valgrind in the +`odt-lsan-recon:2026-04-22` Docker image with +`--leak-check=full --show-leak-kinds=all`, all four LEAK SUMMARY +categories report 0 bytes in 0 blocks (or valgrind emits "All heap blocks +were freed -- no leaks are possible"). The reproducible recipe and +container Dockerfile live in `docs/superpowers/tools/lsan-recon/`. + +## Build-time gold-value generators (CMake + uv + PyTorch) + +Some unit tests compare C-side numerics against PyTorch reference values. The +references are not committed: a Python script in the test directory emits a C +header (`expected_*.h`) at build time, which the test then `#include`s. + +The wiring lives in `test/unit//CMakeLists.txt`: + +```cmake +add_custom_command( + OUTPUT ${GEN_HEADER} + COMMAND uv run ${CMAKE_CURRENT_SOURCE_DIR}/generate_expected_.py + --out ${GEN_HEADER} + DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/generate_expected_.py + VERBATIM +) +add_custom_target(generate_expected_ DEPENDS ${GEN_HEADER}) +add_dependencies(UnitTest generate_expected_) +target_include_directories(UnitTest PRIVATE ${CMAKE_CURRENT_BINARY_DIR}) +``` + +Reference exemplars: +`test/unit/arithmetic/generate_expected_conv1d_kernel.py`, +`test/unit/arithmetic/generate_expected_conv_transpose_1d_kernel.py`. + +### Generator-script conventions + +- Use `repr(v) + "f"` to format C float literals, **not** `f"{v:.9g}"`. + `repr` always preserves a decimal point or exponent, so `10.0f` stays valid. + `:.9g` produces `10` and the trailing `f` then makes it an invalid integer + suffix that gcc rejects. +- Self-check fixtures with `assert torch.allclose(...)` before emitting them, + so generator-side numerical drift fails the build instead of silently + shifting expected values. +- `torch` and `torchvision` are declared as direct dependencies in + `pyproject.toml`. The decoupling is intentional: generator scripts + import `torch` directly, so the dependency belongs at the project + level rather than inherited from `elasticai-creator`. + +### CI implication: every job that runs `cmake --build` MUST install uv + +The custom command above is invoked by ninja during the build phase, not by +configure. Any CI job that produces or runs targets depending on a generated +header must therefore have `uv` on `PATH` at build time. In +`.github/workflows/ci.yml` this is `c-build-and-test` and +`c-asan-build-and-test`; both install uv via `astral-sh/setup-uv@v6` and +`uv sync` before `cmake --preset ...`. + +Locally this is silent: `devenv.nix` puts `uv` on `PATH` for the whole shell, +so `cmake --build` finds it without any explicit setup. CI is stricter and +catches drift here before merge. + +When introducing a new generator under a new test target, audit every CI job +that builds the affected preset and add the uv setup steps if missing. + diff --git a/src/tensor/include/Quantization.h b/src/tensor/include/Quantization.h index da9a514..853977a 100644 --- a/src/tensor/include/Quantization.h +++ b/src/tensor/include/Quantization.h @@ -14,8 +14,7 @@ typedef struct symInt32QConfig { /* SYM_INT32 operand bit-width contract (#227). Operands feeding product * accumulators are int12 so int12*int12 products stay within an int32 * accumulator (no int64). Sound for reductions N <= 511 (512*2^22 > INT32_MAX); - * narrow the knob for wider layers. Grad accumulators are value-sums and stay - * wide (int16) per the #45 contract. Override with -DODT_SYM_OPERAND_QMAXBITS=N. */ + * narrow the knob for wider layers. Override with -DODT_SYM_OPERAND_QMAXBITS=N. */ #ifndef ODT_SYM_OPERAND_QMAXBITS #define ODT_SYM_OPERAND_QMAXBITS 12 #endif diff --git a/src/userApi/tensor/include/QuantizationApi.h b/src/userApi/tensor/include/QuantizationApi.h index 6670a6f..33f61c5 100644 --- a/src/userApi/tensor/include/QuantizationApi.h +++ b/src/userApi/tensor/include/QuantizationApi.h @@ -24,16 +24,9 @@ quantization_t *quantizationInitInt32(); */ quantization_t *quantizationInitSymInt32(roundingMode_t roundingMode); -/*! SymInt32 with explicit qMaxBits. The existing quantizationInitSymInt32(rm) - * hardcodes qMaxBits=16; this variant lets callers specify the active bit - * width for fixed-point arithmetic (e.g. 12 bits for tighter dynamic range, - * 32 bits for full int32 range). - * - * \param roundingMode: Rounding mode to be used - * \param qMaxBits: Active bit width for fixed-point arithmetic - * - * \returns Pointer to initialized quantization - */ +/*! SymInt32 with explicit qMaxBits. Plain quantizationInitSymInt32(rm) uses the + * int12 operand default (ODT_SYM_OPERAND_QMAXBITS). Widths >16 need scale=1 + * (raw-int, unvalidated); 32 is not cast-safe in the converters (#202). */ quantization_t *quantizationInitSymInt32WithBits(roundingMode_t roundingMode, uint8_t qMaxBits); /*! Sub-byte symmetric quantization with explicit bit width and rounding. */