es-ude · LeoBuron · Jul 1, 2026 · Jul 1, 2026
diff --git a/docs/CONVENTIONS.md b/docs/CONVENTIONS.md
diff --git a/docs/conventions/allocation.md b/docs/conventions/allocation.md
@@ -0,0 +1,15 @@
+# Allocation locality
+
+## Allocation Locality
+
+Only `src/userApi/` may call `malloc`, `calloc`, `realloc`, or `free` directly. All other code (sub-layers under `src/`, tests under `test/`) must route allocations through `reserveMemory` and `freeReservedMemory` in `src/userApi/StorageApi.{c,h}`.
+
+Why:
+- MCU stack overflows are silent killers; routing through StorageApi keeps stack usage predictable and small.
+- Reviewers know exactly where to look for memory issues: `src/userApi/`.
+- A future handle-based allocator can subsume the entire allocation surface in one API change instead of touching every call site.
+
+Enforcement:
+- A CI job (`alloc-locality` in `.github/workflows/ci.yml`) runs `git grep` against `src/` and `test/` (excluding `src/userApi/`) and fails the build on any match. Comments are excluded from the match.
+- Exceptions: none today. If a use-case arises that genuinely needs a direct alloc primitive outside `src/userApi/`, escalate via a PR comment so the rule itself can be revisited.
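A minimal sketch of a compliant call site. It assumes `reserveMemory` takes a byte count and returns a pointer, and that `freeReservedMemory` takes that pointer back. These signatures are guesses: the two definitions below are local stand-ins, not the code in `src/userApi/StorageApi.c`, and `sumOfSquaresUpTo` is a hypothetical sub-layer helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins for the StorageApi entry points (ASSUMED signatures). In the
 * real tree only src/userApi/ is allowed to contain these malloc/free calls. */
void *reserveMemory(size_t numberOfBytes) {
    return malloc(numberOfBytes);
}

void freeReservedMemory(void *pointer) {
    free(pointer);
}

/* Hypothetical sub-layer helper: it needs scratch space but never names
 * malloc/calloc/realloc/free itself. Returns -1 if the reservation fails. */
int32_t sumOfSquaresUpTo(size_t count) {
    if (count == 0) {
        return 0;
    }
    int32_t *scratch = reserveMemory(count * sizeof *scratch);
    if (scratch == NULL) {
        return -1;
    }
    for (size_t i = 0; i < count; i++) {
        scratch[i] = (int32_t)(i * i);
    }
    int32_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += scratch[i];
    }
    freeReservedMemory(scratch);
    return sum;
}
```

Because the helper only names the two StorageApi functions, a grep for the raw allocation primitives outside `src/userApi/` would not match it.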
+
diff --git a/docs/conventions/arithmetic-sym.md b/docs/conventions/arithmetic-sym.md
@@ -0,0 +1,222 @@
+# Arithmetic & SYM_INT32 kernels
+
+Conventions for the integer-math path: `src/arithmetic/**` and the SYM kernels of
+`src/layer/{Conv1d,Conv1dTransposed,Linear,LayerNorm}*`. Path-scoped for Claude
+via `.claude/rules/arithmetic-sym.md`.
+
+## SYM_INT32 seed-rescale + the #189 guard
+
+A SYM_INT32 parameter that must enter an integer accumulator at a *different*
+scale — the forward bias seed (Matmul today; Conv when #45 lands) and the
+LayerNorm affine beta seed — is converted via `rescaleIntoAccumulatorScale`
+(`src/arithmetic/Rounding.c`): `seed = round(param_q * param_scale /
+accumulator_scale)`. The `float -> int32` cast is data-dependent and is UB on
+overflow (#189); the helper guards it NaN-robustly (`!(x <= T)`, reserving one
+worst-case int16 product `32768*32767` of headroom) under `-DODT_SEED_GUARD`
+(default ON; a future MCU/release build disables it and relies on the UBSan CI
+job, #204, to catch any overflow). All seed-rescale sites route through this
+one helper.
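The guard described above can be sketched as follows. This is an illustration under assumptions, not the code in `src/arithmetic/Rounding.c`: the name `sketchRescaleSeed`, its signature, the `bool` return, and the round-half-away-from-zero rounding are all hypothetical, and the sketch always guards rather than honouring `-DODT_SEED_GUARD`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest seed magnitude accepted: INT32_MAX minus one worst-case int16
 * product (32768 * 32767) of headroom. */
#define SKETCH_SEED_LIMIT (2147483647.0 - 32768.0 * 32767.0)

/* Hypothetical guarded rescale: seed = round(paramQ * paramScale / accScale).
 * Returns false and leaves *seed untouched when the value is NaN, infinite,
 * or outside the budget, so the float -> int32 cast is never reached with an
 * unrepresentable value. */
bool sketchRescaleSeed(int32_t paramQ, float paramScale, float accScale,
                       int32_t *seed) {
    double x = (double)paramQ * (double)paramScale / (double)accScale;
    /* Written as a negated "in range" test: NaN fails every comparison, so
     * it lands in the rejecting branch together with oversized values. */
    if (!(x <= SKETCH_SEED_LIMIT && x >= -SKETCH_SEED_LIMIT)) {
        return false;
    }
    /* Round half away from zero; safe because x is known to fit int32. */
    *seed = (int32_t)(x >= 0.0 ? x + 0.5 : x - 0.5);
    return true;
}
```

A test of the form `x > T` would let NaN through, because NaN compares false against everything; negating the in-range test is what makes the guard NaN-robust.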
+
+This refold is deliberate, not a wart: it holds the real-valued bias **constant**
+under ODT's dynamic per-input activation scaling. A fixed integer added raw
+(`seed = b_int`, ignoring the bias scale) would apply the bias at
+`s_acc / s_bias` of its value (≈0.01-0.05% on real layers — effectively deleting
+it) and make it co-scale with input magnitude; the refold recomputes the seed
+each forward (`∝ 1/s_acc`) so the bias stays a constant offset. The bias stays
+SYM_INT32 (never a float master — the optimizer is single-dtype); a wide
+raw-integer bias (qMaxBits=32, scale=1) would need a structurally different
+scheme and is out of scope.
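A small sketch of why the refold keeps the bias constant. All names and the scale values are made up for illustration, and the #189 guard is omitted for brevity (the real sites go through the guarded helper).

```c
#include <assert.h>
#include <stdint.h>

/* Round half away from zero without libm. Caller guarantees x fits int32. */
int32_t sketchRoundToInt32(double x) {
    return (int32_t)(x >= 0.0 ? x + 0.5 : x - 0.5);
}

/* Unguarded refold, illustration only: seed = round(b_q * s_bias / s_acc). */
int32_t sketchRefoldBiasSeed(int32_t biasQ, float biasScale, float accScale) {
    return sketchRoundToInt32((double)biasQ * biasScale / accScale);
}

/* Real value that an accumulator mantissa represents at scale accScale. */
double sketchAccumulatorToReal(int32_t mantissa, float accScale) {
    return (double)mantissa * accScale;
}
```

With `biasQ = 1200` and `biasScale = 1e-3` (a real bias of 1.2), the refolded seed represents about 1.2 at both `accScale = 1e-5` and `accScale = 4e-5`. Adding the raw integer 1200 instead would represent 0.012 and 0.048 respectively, so the bias would shrink and track the activation scale.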
+
+## Conv1d / Conv1dTransposed SYM_INT32 (#45)
+
+Two integer sliding-window cores live in `src/arithmetic/`, siblings of the
+FLOAT kernels with identical loop nest + `SlidingWindow1d` geometry:
+
+- `conv1dKernelSymInt32` — gather forward; Conv1d forward, and Conv1dTransposed's
+  `dx` adjoint in PR3.
+- `convTranspose1dKernelSymInt32` — scatter forward; Conv1d's `dx` adjoint, and
+  Conv1dTransposed's forward in PR3.
+
+Both emit **raw accumulator-range int32 mantissas** at output scale `s_in·s_w`
+(NOT range-restored). An explicitly-chained Quantization layer (#192) restores
+the operand width downstream, the same contract as Linear/LayerNorm.
+Per-output-channel bias is refolded into the product scale via
+`rescaleIntoAccumulatorScale` (the #189 guarded helper); never raw-added.
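The gather/scatter pairing can be illustrated with a deliberately reduced sketch: one channel, stride 1, VALID padding, no bias, no groups. The function names are hypothetical and the real kernels take the full `SlidingWindow1d` geometry. Outputs are raw int32 accumulator values, to be read at scale `s_in·s_w`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather form: every output element pulls kLen inputs.
 * out must hold inLen - kLen + 1 values. */
void sketchConv1dGather(const int32_t *in, size_t inLen,
                        const int32_t *w, size_t kLen, int32_t *out) {
    for (size_t o = 0; o + kLen <= inLen; o++) {
        int32_t acc = 0;
        for (size_t k = 0; k < kLen; k++) {
            acc += in[o + k] * w[k];
        }
        out[o] = acc;
    }
}

/* Scatter form: every input element pushes into kLen outputs.
 * out must hold inLen + kLen - 1 values. */
void sketchConvTranspose1dScatter(const int32_t *in, size_t inLen,
                                  const int32_t *w, size_t kLen, int32_t *out) {
    for (size_t i = 0; i < inLen + kLen - 1; i++) {
        out[i] = 0;
    }
    for (size_t i = 0; i < inLen; i++) {
        for (size_t k = 0; k < kLen; k++) {
            out[i + k] += in[i] * w[k];
        }
    }
}

/* Integer dot product, used only to check the adjoint identity. */
int32_t sketchDot(const int32_t *a, const int32_t *b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

The two loops are adjoints of each other: for any `x`, `w`, `y` of matching lengths, `dot(gather(x, w), y)` equals `dot(x, scatter(y, w))`. That identity is why each core can serve as the other layer's `dx` path.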
+
+Conv1d backward dispatches on **three independent qConfigs** (`weightGradQ`,
+`biasGradQ`, `propLossQ`), like `linearBackward`:
+
+- **weightGrad (SYM)** = strategy A: integer gather into a fresh `reserveMemory`
+  intermediate at scale `s_loss·s_in`, then `addSymInt32TensorsInplace` into the
+  SYM grad accumulator (fresh absmax scale).
+- **biasGrad (SYM)** = an int32 `(batch × outputLength)` accumulator per output
+  channel, then `rescaleIntoAccumulatorScale(sum, s_loss, s_bg, mode)` at the
+  bias-grad's fixed scale (the #218 scheme).
+- **dx / propLoss (SYM)** = `convTranspose1dKernelSymInt32(lossGrad, weights)`,
+  scale `s_loss·s_w`, guarded by the #187 fail-fast if `propLoss` is not SYM.
+
+### Operand bit-width: int12, not int16 (int32-accumulator soundness)
+
+SYM kernels accumulate **products** of operands in an **int32** accumulator (no
+int64 — hard rule). For symmetric `b`-bit operands each product is ≤ 2^(2b−2),
+so an int32 accumulator (~2^31) holds only ~2^(33−2b) worst-case product terms
+before signed overflow (UB):
+
+| operand width | max product | number of worst-case terms at which int32 overflow first occurs |
+|---|---|---|
+| int16 (qMaxBits=16) | 2^30 | 2 |
+| int12 (qMaxBits=12) | 2^22 | 512 |
+| int8  (qMaxBits=8)  | 2^14 | 131072 |
+
+The number of worst-case terms that still **fit** is one less: int16 survives 1,
+int12 survives **511**, int8 survives 131071 — i.e. int12 is sound for reductions
+of length **N ≤ 511** (`512·2^22 = 2^31 > INT32_MAX`).
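The table can be reproduced with a few lines of integer arithmetic. The function names are hypothetical; the bound used is the conservative `2^(2b−2)` from the text, and everything stays in 32-bit types.

```c
#include <assert.h>
#include <stdint.h>

/* Conservative bound on |a * b| for two b-bit operands: 2^(2b-2). */
uint32_t sketchMaxProduct(unsigned bits) {
    assert(bits >= 2 && bits <= 16);
    return (uint32_t)1 << (2 * bits - 2);
}

/* Number of worst-case products that still fit in an int32 accumulator:
 * floor(INT32_MAX / maxProduct), which equals 2^(33-2b) - 1. */
uint32_t sketchMaxSafeTerms(unsigned bits) {
    return (uint32_t)INT32_MAX / sketchMaxProduct(bits);
}
```

For 16, 12, and 8 bits this yields 1, 511, and 131071 safe terms, one less than the overflow counts in the table.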
+
+int16×int16→int32 is **unsound for product-accumulation** (forward, dx,
+weightGrad) — it overflows after ~2 full-scale terms; it is sound only for
+*value* sums (biasGrad). Conv SYM therefore uses **int12 operands**
+(`quantizationInitSymInt32WithBits(rm, 12)`): products ≤ 2047² ≈ 4.2e6, ~512-term
+int32 headroom — ample for the batch=1 MCU regime ODT targets, matching the
+low-bit×low-bit→int32 arithmetic of the Deutel FQT paper (arXiv:2407.10734) /
+TFLite. The **grad accumulators stay int16** (wider accumulator, free since SYM
+stores int32 regardless of qMaxBits). The **kernels are bit-width-agnostic** —
+only the quantization configs change; the int32 accumulator (no int64) is kept.
+
+**Realized framework-wide int12 contract (PR-A, #227):**
+
+- The SYM_INT32 **operand** default is int12 via the compile-time knob
+  `ODT_SYM_OPERAND_QMAXBITS` (=12), set in `initSymInt32QConfig`
+  (`src/tensor/include/Quantization.h`). Override per-build with
+  `-DODT_SYM_OPERAND_QMAXBITS=N` (e.g. =8 for layers wider than 511).
+- `matmulIntCore` (Linear forward / propLoss / weightGrad) and the LayerNorm
+  **affine product** now run on int12 operands, enforced by op-entry guards
+  (`matmulValidateSymOperand` at both Matmul SYM entries;
+  `layerNormValidateSymTensor` lowered to the knob). LayerNorm's per-group
+  mantissa-sum is a value-sum and stays sound at any qMaxBits ≤ 16.
+- **Grad accumulators stay int16** via `ODT_SYM_GRAD_QMAXBITS` (=16), pinned
+  in `gradInitSymInt32` (`getQLike` preserves the source width). biasGrad is a
+  value-sum; weightGrad is a sum of products (int32 accumulate → requantize).
+  Whether grads should be stored SYM_INT32 at all is under redesign — #261.
+- int12 is sound only for reductions **N ≤ 511**; the runtime N-vs-budget check
+  is a deferred follow-up. The #189 policy (release runs free, CI UBSan #204)
+  backstops residual overflow.
+- Note: the conv weightGrad product mixes an int12 input with an int16 grad
+  operand under the #218 grad-accumulator scheme — its budget is governed by
+  #218/#45, not closed by this operand flip.
+- The unit-test gold suite validates the **default** int12/int16 contract
+  (`ODT_SYM_OPERAND_QMAXBITS=12`, `ODT_SYM_GRAD_QMAXBITS=16`); building with a
+  knob override (e.g. `-DODT_SYM_OPERAND_QMAXBITS=8`) diverges from those gold
+  fixtures, which is expected and intentional.
+
+The training loop (`CalculateGradsSequential.c`) allocates grad/activation
+tensors from the **forward** qConfig, not the backward qConfigs — so a full-SYM
+chain needs each layer's `propLossQ` to agree with the forward-derived grad dtype
+(else the #187 guard fires), exactly as for Linear. The Conv→Quant→…→MSE chain
+wiring + FLOAT32-twin convergence check is PR3.
+
+### Conv1dTransposed SYM_INT32 (PR3)
+
+Conv1dTransposed is Conv1d's adjoint with roles swapped, so it reuses BOTH PR2
+cores — no new kernels:
+
+- **forward** = `convTranspose1dKernelSymInt32` (the scatter core; its internal
+  per-channel bias-seed refold gives ConvT bias for free). Pass `outputPadding`.
+- **dx / propLoss** = `conv1dKernelSymInt32` (the gather core, the VALID adjoint),
+  guarded by the #187 fail-fast if `propLoss` is not SYM_INT32.
+- **weightGrad** = strategy A: a scatter-style integer gather (ConvT weight layout
+  `[Cin, Cout/groups, K]`, index `(ic·outChPerGroup + ocOffset)·K + k`) into a fresh
+  `reserveMemory` int32 intermediate at scale `s_in·s_loss`, then
+  `addSymInt32TensorsInplace` into the SYM grad accumulator.
+- **biasGrad** = the same fixed-scale refold as Conv1d (`rescaleIntoAccumulatorScale`
+  over the `batch × outputLength` int32 sum).
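The weight index quoted in the weightGrad bullet can be written out as a helper. The function name and parameter names are hypothetical; only the formula `(ic·outChPerGroup + ocOffset)·K + k` and the `[Cin, Cout/groups, K]` layout come from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Flat offset into a ConvT weight tensor laid out as [Cin, Cout/groups, K].
 * ic       : input channel
 * ocOffset : output channel index within its group
 * k        : kernel tap */
size_t sketchConvTWeightIndex(size_t ic, size_t ocOffset, size_t k,
                              size_t outChPerGroup, size_t kernelSize) {
    assert(ocOffset < outChPerGroup && k < kernelSize);
    return (ic * outChPerGroup + ocOffset) * kernelSize + k;
}
```

For example, with `outChPerGroup = 4` and `kernelSize = 5`, the element `(ic = 1, ocOffset = 2, k = 3)` sits at offset 33, and the last element of a `Cin = 2` tensor, `(1, 3, 4)`, sits at offset 39.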
+
+Backward dispatches on three independent qConfigs (`weightGradQ`, `biasGradQ`,
+`propLossQ`), like `conv1dBackward`/`linearBackward`. Operands are int12, grad
+accumulators are int16, and kernel accumulators are int32 (no int64).
+Conv1dTransposed is VALID-only (Phase 1), so the adjoint never hits a
+SAME/EXPLICIT padLeft.
+
+### Validator (PR3)
+
+`producerForwardQ` (`ModelValidationApi.c`) now returns the conv layer's `forwardQ`
+for CONV1D and CONV1D_TRANSPOSED, bringing SYM-producing conv layers under the
+int16 inter-layer contract: a SYM conv producer must be followed by a Quantization
+layer (or sit in the last position).
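The rule can be sketched as a single pass over the layer list. Every type and name here is a hypothetical stand-in; the real check lives in `ModelValidationApi.c` and works on the project's own layer and qConfig types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    SKETCH_LINEAR,
    SKETCH_CONV1D,
    SKETCH_CONV1D_TRANSPOSED,
    SKETCH_QUANTIZATION
} sketchLayerType_t;

typedef enum { SKETCH_FLOAT32, SKETCH_SYM_INT32 } sketchDType_t;

typedef struct {
    sketchLayerType_t type;
    sketchDType_t forwardDType; /* dtype of the layer's forward output */
} sketchLayer_t;

/* Returns true when every SYM-producing conv layer is either last in the
 * chain or immediately followed by a Quantization layer. */
bool sketchConvProducersAreRequantized(const sketchLayer_t *layers, size_t count) {
    for (size_t i = 0; i + 1 < count; i++) {
        bool isConv = layers[i].type == SKETCH_CONV1D ||
                      layers[i].type == SKETCH_CONV1D_TRANSPOSED;
        bool producesSym = layers[i].forwardDType == SKETCH_SYM_INT32;
        if (isConv && producesSym && layers[i + 1].type != SKETCH_QUANTIZATION) {
            return false;
        }
    }
    return true;
}
```

The loop stops one short of the end, which is what lets a SYM conv layer in the last position pass.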
+
+### SYM training chains
+
+The training loop allocates every grad/activation tensor from the FORWARD output
+qConfig (`initGradTensor`), so a uniformly-SYM chain (every `forwardQ` SYM_INT32)
+makes every grad tensor SYM_INT32 and every layer's `propLossQ` match — the #187
+guard passes. SYM-trainable conv layers are built via the low-level
+`initConv1dTransposedConfigWithWeightsAndBias` with SYM `parameter_t`s (the
+high-level factory keeps grads FLOAT32, matching the Linear KAIMING factory).
+`Conv1dTransposed → Quant → MSE` trains under
+`calculateGradsSequential` + `sgdStepM(SYM_INT32)`.
+
+## Quantized gradient accumulation: known open precision problem
+
+As of the quantized-gradient prerequisite (`gradInit`, 2026-06-05) a trainable
+layer's parameter gradient can be stored in the dtype its `backwardMath`
+declares. For SYM_INT32 grads, the per-microbatch accumulation reuses the
+existing `addSymInt32TensorsInplace` ("strategy A", dynamic-rescale): it
+dequantizes both the running grad and the new microbatch grad to float, adds,
+and re-quantizes the running sum to a new absmax-derived scale **on every
+microbatch**.
+
+This is functionally correct end-to-end today, but **not** numerically ideal:
+
+- Quantization noise compounds with the number of microbatches M.
+- The running-sum absmax is pinned by the heaviest microbatch, coarsening the
+  LSB for the accumulated small-gradient mass.
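A minimal sketch of one strategy-A accumulation step, to make the requantize-every-microbatch behaviour concrete. It is not `addSymInt32TensorsInplace`: the function name, the flat-array representation, the fixed element cap, and the rounding mode are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ELEMENTS 64

/* Round half away from zero without libm. Caller guarantees x fits int32. */
int32_t sketchRoundToInt32(double x) {
    return (int32_t)(x >= 0.0 ? x + 0.5 : x - 0.5);
}

/* Dequantize both operands with their own scales, add in floating point,
 * then requantize the running sum to a fresh absmax-derived scale. */
void sketchAddDynamicRescale(int32_t *runningQ, float *runningScale,
                             const int32_t *incrementQ, float incrementScale,
                             size_t count, int32_t qMax) {
    assert(count <= SKETCH_MAX_ELEMENTS && qMax > 0);
    double sum[SKETCH_MAX_ELEMENTS];
    double absMax = 0.0;
    for (size_t i = 0; i < count; i++) {
        sum[i] = (double)runningQ[i] * *runningScale +
                 (double)incrementQ[i] * incrementScale;
        double magnitude = sum[i] < 0.0 ? -sum[i] : sum[i];
        if (magnitude > absMax) {
            absMax = magnitude;
        }
    }
    /* An all-zero sum keeps a neutral scale instead of dividing by zero. */
    float newScale = absMax > 0.0 ? (float)(absMax / qMax) : 1.0f;
    for (size_t i = 0; i < count; i++) {
        runningQ[i] = sketchRoundToInt32(sum[i] / newScale);
    }
    *runningScale = newScale;
}
```

Each call rounds every element again, which is where the compounding noise comes from, and the largest element always lands on `qMax`, so one heavy microbatch sets the step size for all the small entries.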
+
+Preliminary characterization (internal simulation, M=100, N=64, σ=1e-3 with a
+10% ×50 heavy tail — *problem characterization only, not a basis for a chosen
+solution*):
+
+| Strategy | Final rel. error vs float64 | Float-free? |
+|---|---|---|
+| A — dynamic-rescale (current) | ~1.5e-4, **grows with M** (2.0e-5 @ step1 → 1.7e-4 @ step100) | No |
+| B — fixed-scale integer accum | ~9.9e-5 | Yes |
+| C — float accum, quantize-at-read | ~2.2e-5 | No |
+
+We deliberately ship strategy A now and do **not** adopt B/C or any homegrown
+numerical scheme. The resolution path is a literature review (stochastic-rounding
+accumulators, error-feedback / residual accumulation, higher-precision master
+grads, block/group scaling, …) → implement or improve a **published** technique.
+Tracked as a separate research task (#218). This note is intentionally public
+(not buried in a private spec) so contributors hitting accuracy issues in
+quantized training know this is a known, expected limitation rather than a bug.
+
+### Two accumulation schemes in-tree (both intentional)
+
+- **Strategy A (dynamic-rescale)** — Linear SYM weight grads and LayerNorm
+  gamma/beta grads: per-microbatch `addSymInt32TensorsInplace` (dequantize
+  both operands with their own scales, float-add, requantize the running sum
+  to a fresh absmax scale). Not float-free.
+- **Fixed-scale integer accumulation** — Linear SYM bias grads
+  (`linearCalcBiasGradsSymInt32`): increments are rescaled into the running
+  grad's EXISTING scale and added in integer arithmetic; the scale is never
+  re-derived during accumulation. The coarser resolution (LSB pinned by the
+  running scale, which inits to 1.0) is inherent to the scheme.
+
+  **Attribution note:** this fixed-scale integer bias-GRADIENT accumulation is
+  ODT's own construction and is NOT prescribed by Deutel et al.
+  (arXiv:2407.10734). The paper's quantization is *dynamic*: scales are
+  re-derived from observed data — weights every SGD update (Eqs. 6-7) — and the
+  method is framed throughout as "dynamic adaptation of the zero-point and
+  scale parameters" (Sec. IV-E). The paper has a forward bias (int32 bias on
+  the int32 MAC accumulator, Fig. 2) but describes no bias-*gradient*
+  accumulation, and it nowhere states that any scale is held static *during
+  training* (the only static/PTQ mention is post-training, at deployment) — so
+  absent evidence to the contrary, assume its scales are dynamic. ODT's
+  fixed-scale bias-grad scheme, which never re-derives the scale during
+  accumulation, therefore DEVIATES from the paper's dynamic scaling; the ODT
+  scheme that corresponds to Deutel is Strategy A (dynamic-rescale, above).
+  What ODT also follows from Deutel: per-layer error requant (~Eq. 4) and the
+  float-space SGD step (~Eqs. 5-7). Scheme choice + the init-scale resolution
+  limit: #218.
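For contrast with strategy A, the fixed-scale scheme from the second bullet can be sketched like this. It is not `linearCalcBiasGradsSymInt32`; the name, the flat arrays, and the rounding mode are assumptions, and the rescale step here uses a floating-point ratio purely for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round half away from zero without libm. Caller guarantees x fits int32. */
int32_t sketchRoundToInt32(double x) {
    return (int32_t)(x >= 0.0 ? x + 0.5 : x - 0.5);
}

/* Rescale each increment into the running grad's EXISTING scale and add it
 * as an integer. runningScale is read, never rewritten. */
void sketchAddFixedScale(int32_t *runningQ, float runningScale,
                         const int32_t *incrementQ, float incrementScale,
                         size_t count) {
    for (size_t i = 0; i < count; i++) {
        double rescaled = (double)incrementQ[i] * incrementScale / runningScale;
        runningQ[i] += sketchRoundToInt32(rescaled);
    }
}
```

With the running scale at its initial 1.0, an increment worth 0.25 rounds to 0 and is lost on every step, while 1.5 becomes 2. This is the coarse LSB the bullet describes.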
+
+This is a research framework: deliberate scheme differences like this one
+MUST be documented here, so experimental design stays separable from
+accidental inconsistency. LayerNorm uses strategy A for BOTH gamma and beta
+per the 2026-06-05 LayerNorm spec.
+
diff --git a/docs/conventions/data-shape.md b/docs/conventions/data-shape.md
@@ -0,0 +1,16 @@
+# Data shape convention
+
+## Data Shape Convention
+
+Datasets deliver samples in their natural geometric shape (e.g. `[C, H, W]`
+for images, `[C, L]` for time series). Any `reshape`, `flatten`, or `view`
+operation is the **first layer of the model**, not a preprocessing step in
+the dataset. This:
+
+- keeps dataset code independent of downstream model topology
+- allows one dataset to feed models with different input ranks
+- matches the PyTorch / Keras / elastic-ai.creator IR convention, so a future
+  ir2c can compile each shape transform to a corresponding C layer
+
+For flatten-to-2D, use `flattenLayerInit()` from `FlattenApi.h`.
+
diff --git a/docs/conventions/loss.md b/docs/conventions/loss.md
@@ -0,0 +1,57 @@
+# Loss & training-loop microbatch contracts
+
+## Loss API: microbatch contracts
+
+Each loss function in `src/loss_functions/` exposes:
+
+- `forward(modelOutput, label, reduction) → float`
+- `backward(modelOutput, label, result) → void`
+- `computeMeanScale(totalSamples, modelOutput) → float`
+
+### Reduction split
+
+`lossConfig_t.backwardReduction` is the user's training-strategy choice — it
+drives whether `scaleOptimizerGradients` runs between `trainingBatchDefault`
+and `optimFns.step`. It is a config field.
+
+`forwardReduction` is a per-call parameter on every aggregator
+(`trainingBatchDefault`, `evaluationBatch`, `evaluationEpoch`, `inferenceWithLoss`,
+`calculateGradsFn_t`). It controls how the per-microbatch loss value is
+reported. `trainingRun` is the only function that hardcodes it
+(to `REDUCTION_MEAN`) so train and eval losses are comparable; lower-level
+callers pick freely.
+
+### Microbatch shape
+
+`modelOutput->shape->dimensions[0]` is the microbatch dimension `B`. For
+`B=1` today, output shape is `[F]` (the leading 1 is implicit). For `B>1`
+in the future, output shape is `[B, F]` and `numFeaturesPerSample = numElements / B`.
+
+**Uniform-B assumption** (DataLoader contract): all microbatches in one
+macro batch have equal `B`. The MEAN aggregator divides by total samples
+(`Σ batch->size`) rather than by `(numberOfBatches × B)`, so non-uniform B
+would skew the mean. ODT's DataLoader currently always produces uniform
+batches via `dropLast=true`; non-uniform B is out of contract.
+
+### Backward macro-scaling
+
+Backward writes raw per-element gradients (`2(o-l)` for MSE, `(p-y)` for CE).
+The macro-batch divisor lives at the optimizer:
+
+- `lossFunctions[lossConfig.funcType].computeMeanScale(N, modelOutput)`
+  returns the PyTorch-parity scale factor (`1/(N*F)` for MSE, `1/N` for CE).
+- `scaleOptimizerGradients(optimizer, factor)` multiplies every parameter's
+  `grad` field by the factor in place.
+- `trainingEpochDefault` calls these between accumulation and `step`,
+  but only when `backwardReduction == REDUCTION_MEAN`.
+
+For SUM (or future per-sample weighted variants — see #150), the backward
+gradient flows through unscaled.
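The macro-scaling step can be sketched as below. All names are hypothetical stand-ins: the real `computeMeanScale` takes the `modelOutput` tensor rather than a feature count, and the real `scaleOptimizerGradients` walks the optimizer's parameters rather than a flat array.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SKETCH_REDUCTION_SUM, SKETCH_REDUCTION_MEAN } sketchReduction_t;

/* MSE mean scale: 1 / (N * F), with N samples and F features per sample. */
float sketchMseMeanScale(size_t totalSamples, size_t featuresPerSample) {
    return 1.0f / (float)(totalSamples * featuresPerSample);
}

/* Cross-entropy mean scale: 1 / N. */
float sketchCeMeanScale(size_t totalSamples) {
    return 1.0f / (float)totalSamples;
}

/* Multiply every accumulated gradient by the factor, in place. */
void sketchScaleGradients(float *grad, size_t count, float factor) {
    for (size_t i = 0; i < count; i++) {
        grad[i] *= factor;
    }
}

/* Runs between accumulation and the optimizer step; only MEAN rescales. */
void sketchFinishMacroBatch(float *grad, size_t count,
                            sketchReduction_t backwardReduction, float meanScale) {
    if (backwardReduction == SKETCH_REDUCTION_MEAN) {
        sketchScaleGradients(grad, count, meanScale);
    }
}
```

Keeping the backward pass unscaled means the same accumulated gradients serve both reductions; the choice is applied once, just before the step.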
+
+### Shape assertion (deferred)
+
+Runtime assertion of the `dimensions[0] >= 1` contract is deferred to the
+microbatch-B>1 umbrella (#152) — specifically #153. Today (B=1 only) the
+assertion would be effectively a no-op; the protective value materialises
+when B>1 becomes a real feature target.
+