research: quantized gradient-accumulation scheme — fixed-scale vs strategy A, and a Deutel attribution correction

## Context

`docs/CONVENTIONS.md` ("Two accumulation schemes in-tree") documents two ways SYM_INT32 gradients accumulate and flags a "known precision Open Problem ... tracked as a separate research task." This is that task.

## Finding 1 — bias-grad fixed-scale accumulation is ODT's own, and deviates from Deutel's dynamic scaling

`linearCalcBiasGradsSymInt32` (`src/layer/Linear.c`) accumulates bias grads as `bg[f] += round(Σ loss_int · lossScale / bgScale)` — integer, at the grad tensor's fixed init scale (1.0), never re-derived.
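For reference, the update rule above can be sketched in isolation. This is a minimal illustration, not the code in `src/layer/Linear.c`: the function name, the row-major `[batch x features]` layout, and the round-half-away-from-zero helper are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round half away from zero; the rounding mode is an assumption of this sketch. */
static int32_t roundToInt32(float x) {
    return (int32_t)(x >= 0.0f ? x + 0.5f : x - 0.5f);
}

/* Hypothetical stand-in for the fixed-scale scheme:
 *   bg[f] += round(sum_b lossInt[b][f] * lossScale / bgScale)
 * bgScale is passed in and never re-derived from the data. */
static void accumulateBiasGradFixedScale(int32_t *bg, float bgScale,
                                         const int32_t *lossInt, float lossScale,
                                         size_t batch, size_t features) {
    assert(bgScale > 0.0f);
    for (size_t f = 0; f < features; f++) {
        int32_t sum = 0; /* reduce over the batch in int32 */
        for (size_t b = 0; b < batch; b++) {
            sum += lossInt[b * features + f];
        }
        /* Re-express the sum at the grad tensor's fixed scale and accumulate. */
        bg[f] += roundToInt32((float)sum * lossScale / bgScale);
    }
}
```

With `bgScale = 1.0`, any per-feature sum whose float value is below 0.5 in magnitude contributes nothing to `bg[f]`, which is the effect Finding 2 quantifies.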

The precise picture (verified against the v2 full text + Fig. 2):

- Deutel's quantization is **dynamic**: scales are re-derived from observed data — weights every SGD update (Eqs. 6-7) — and the method is framed throughout as "dynamic adaptation of the zero-point and scale parameters" (Sec. IV-E).
- Deutel **has** a forward bias (int32 bias on the int32 MAC accumulator, Fig. 2) but describes **no** bias-gradient accumulation (Eqs. 3-4 omit the bias).
- The paper **nowhere** states that any scale is held static *during training* — the only static/PTQ mention is post-training, at deployment. Absent evidence, assume dynamic.

So ODT's fixed-scale integer bias-gradient accumulation **deviates** from the paper's dynamic scaling; it is ODT's own construction, not "following Deutel". The ODT scheme that *does* correspond to Deutel is **Strategy A** (dynamic-rescale: Linear weight grads, LayerNorm gamma/beta). (ODT also follows Deutel in per-layer error requant ≈ Eq. 4 and the float-space SGD step ≈ Eqs. 5-7.)

## Finding 2 — the scheme's resolution is pinned by the grad init scale

Fresh SYM grad tensors init at scale 1.0 (`initSymInt32QConfig`), and `sgdZeroGrad` resets the scale to 1.0 each step. Bias-grad updates therefore move in whole LSBs of size `1.0·lr`: a bias with `|Σ dy| < 0.5` rounds to a zero gradient and freezes for that step (worst-case per-step deviation `lr·0.5`). This surfaced in the full-SYM chain test in #192: Linear biases deviate up to 3.9e-2 from the FLOAT32 twin, vs ≤4.2e-5 for all matmul/strategy-A paths. It is benign today (no full-SYM example trains Linear biases yet) but becomes relevant from the full-SYM examples (#207) and the FQT stages onward.
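The freeze can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: `quantizeAtScale`, `fixedScaleBiasStep`, and `floatBiasStep` are hypothetical names, and the rounding mode (half away from zero) is assumed rather than taken from ODT.

```c
#include <assert.h>
#include <stdint.h>

/* Quantize a float value to an integer count of LSBs at the given scale
 * (round half away from zero; an assumption of this sketch). */
static int32_t quantizeAtScale(float value, float scale) {
    assert(scale > 0.0f);
    float q = value / scale;
    return (int32_t)(q >= 0.0f ? q + 0.5f : q - 0.5f);
}

/* Size of the bias update when the gradient sum is first rounded to
 * whole LSBs of gradScale (1.0 for a fresh SYM grad tensor). */
static float fixedScaleBiasStep(float sumDy, float gradScale, float lr) {
    return lr * (float)quantizeAtScale(sumDy, gradScale) * gradScale;
}

/* Size of the same update in the FLOAT32 twin. */
static float floatBiasStep(float sumDy, float lr) {
    return lr * sumDy;
}
```

For example, with `lr = 0.1` and `Σ dy = 0.4`, `fixedScaleBiasStep(0.4f, 1.0f, 0.1f)` is exactly zero while `floatBiasStep(0.4f, 0.1f)` is about 0.04, so the two biases drift apart by just under `lr·0.5` in that step.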

## Candidate schemes (to evaluate, not decided)

- **(a) Strategy A** — reduce-over-batch in int32, then `addSymInt32TensorsInplace`: finer absmax-based resolution, matches weight-grad / LayerNorm gamma-beta handling AND the paper's dynamic scaling; not float-free; error grows with microbatch count M (the CONVENTIONS open-problem table).
- **(b) Keep fixed-scale** as a deliberate, honestly-labelled float-free experiment.
- **(c) Seed `bgScale`** from `lossScale` instead of 1.0.
- Literature: Deutel leaves the option-(b) gradient-buffer dtype unspecified; their released code (if public) would settle what they actually do — a prior search found no public release.
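To make the difference between the candidates concrete, here is a rough sketch of an absmax-based dynamic rescale in the spirit of (a), next to a fixed-scale add that covers (b) and, when the accumulator scale is seeded from `lossScale`, (c). It does not reproduce `addSymInt32TensorsInplace`; the function names, the `SKETCH_QMAX` range, and the rounding mode are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_QMAX 32767 /* assumed symmetric integer range, not taken from ODT */

static int32_t roundToInt32(float x) {
    return (int32_t)(x >= 0.0f ? x + 0.5f : x - 0.5f);
}

/* (b)/(c)-style: add an update into an accumulator whose scale never changes.
 * Passing accScale == updateScale models seeding bgScale from lossScale. */
static void addAtFixedScale(int32_t *acc, float accScale,
                            const int32_t *update, float updateScale, size_t n) {
    assert(accScale > 0.0f);
    for (size_t i = 0; i < n; i++) {
        acc[i] += roundToInt32((float)update[i] * updateScale / accScale);
    }
}

/* (a)-style: sum in float space, re-derive the scale from the absmax of the
 * result, then requantize. Returns the new scale. Not float-free. */
static float addWithDynamicRescale(int32_t *acc, float accScale,
                                   const int32_t *update, float updateScale,
                                   size_t n) {
    float absMax = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float v = (float)acc[i] * accScale + (float)update[i] * updateScale;
        float a = v < 0.0f ? -v : v;
        if (a > absMax) absMax = a;
    }
    if (absMax == 0.0f) { /* all-zero result: keep the old scale */
        for (size_t i = 0; i < n; i++) acc[i] = 0;
        return accScale;
    }
    float newScale = absMax / (float)SKETCH_QMAX;
    for (size_t i = 0; i < n; i++) {
        float v = (float)acc[i] * accScale + (float)update[i] * updateScale;
        acc[i] = roundToInt32(v / newScale);
    }
    return newScale;
}
```

In this toy setting an update of `{5, -2}` at scale 0.05 vanishes entirely under a fixed accumulator scale of 1.0, survives exactly when the accumulator scale is seeded to 0.05, and is spread across the full integer range by the dynamic rescale. The sketch says nothing about how error grows with microbatch count M, which still needs to be evaluated separately.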

## Constraints

- No int64 in SYM paths (the int16 contract bounds the int32 sum; the int32 migration landed in #219).
- Research-backed-numerics rule: choose a scheme with literature backing or document the deviation; do not ship a homegrown scheme on simulation alone.

Relates to #137 (FQT epic, same paper), #210 (SYM umbrella), and the CONVENTIONS "known precision Open Problem".
