Context
docs/CONVENTIONS.md ("Two accumulation schemes in-tree") documents two ways SYM_INT32 gradients accumulate and flags a "known precision Open Problem ... tracked as a separate research task." This is that task.
Finding 1 — bias-grad fixed-scale accumulation is ODT's own, and deviates from Deutel's dynamic scaling
linearCalcBiasGradsSymInt32 (src/layer/Linear.c) accumulates bias grads as bg[f] += round(Σ loss_int · lossScale / bgScale) — integer, at the grad tensor's fixed init scale (1.0), never re-derived.
The precise picture (verified against the v2 full text + Fig. 2):
- Deutel's quantization is dynamic: scales are re-derived from observed data — weights every SGD update (Eqs. 6-7) — and the method is framed throughout as "dynamic adaptation of the zero-point and scale parameters" (Sec. IV-E).
- Deutel has a forward bias (int32 bias on the int32 MAC accumulator, Fig. 2) but describes no bias-gradient accumulation (Eq. 3-4 omits the bias).
- The paper nowhere states that any scale is held static during training; the only static/PTQ mention is post-training, at deployment. Absent evidence to the contrary, assume every scale is dynamic.
So ODT's fixed-scale integer bias-gradient accumulation deviates from the paper's dynamic scaling; it is ODT's own construction, not "following Deutel". The ODT scheme that does correspond to Deutel is Strategy A (dynamic-rescale: Linear weight grads, LayerNorm gamma/beta). (ODT also follows Deutel in per-layer error requant ≈ Eq. 4 and the float-space SGD step ≈ Eqs. 5-7.)
Finding 2 — the scheme's resolution is pinned by the grad init scale
Fresh SYM grad tensors init at scale 1.0 (initSymInt32QConfig); sgdZeroGrad resets it each step. So bias-grad updates move in whole LSBs of size 1.0·lr: a bias with |Σ dy| < 0.5 freezes for that step (worst-case per-step deviation lr·0.5). Surfaced by the full-SYM chain test in #192: Linear biases deviate up to 3.9e-2 from the FLOAT32 twin, vs ≤4.2e-5 for all matmul/strategy-A paths. Benign today (no full-SYM example trains Linear biases yet); relevant from the full-SYM examples (#207) and the FQT stages onward.
Candidate schemes (to evaluate, not decided)
- (a) Strategy A: reduce-over-batch in int32, then addSymInt32TensorsInplace. Finer absmax-based resolution, matches weight-grad / LayerNorm gamma-beta handling AND the paper's dynamic scaling; not float-free; error grows with microbatch count M (the CONVENTIONS open-problem table).
- (b) Keep fixed-scale as a deliberate, honestly-labelled float-free experiment.
- (c) Seed bgScale from lossScale instead of 1.0.
- Literature: Deutel leaves the option-(b) gradient-buffer dtype unspecified; their released code (if public) would settle what they actually do — a prior search found no public release.
Constraints
Relates to #137 (FQT epic, same paper), #210 (SYM umbrella), and the CONVENTIONS "known precision Open Problem".