research: quantized gradient-accumulation scheme — fixed-scale vs strategy A, and a Deutel attribution correction #218

Description

@LeoBuron

Context

docs/CONVENTIONS.md ("Two accumulation schemes in-tree") documents two ways SYM_INT32 gradients accumulate and flags a "known precision Open Problem ... tracked as a separate research task." This is that task.

Finding 1 — bias-grad fixed-scale accumulation is ODT's own, and deviates from Deutel's dynamic scaling

linearCalcBiasGradsSymInt32 (src/layer/Linear.c) accumulates bias grads as bg[f] += round(Σ loss_int · lossScale / bgScale). The accumulation is integer-only and runs at the grad tensor's fixed init scale (bgScale = 1.0), which is never re-derived.
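The accumulation rule above can be sketched in isolation. This is a minimal illustration, not the ODT function: the name accumulateBiasGradFixedScale, the row-major [batch][features] layout, the int64 reduction, and the round-half-away-from-zero helper are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round half away from zero without needing libm (assumed rounding mode). */
static int32_t roundToInt32(double x) {
    return (int32_t)(x >= 0.0 ? x + 0.5 : x - 0.5);
}

/* Hypothetical helper: add one microbatch of integer loss values to the
 * int32 bias-grad accumulators bg[], which live at the fixed scale bgScale. */
static void accumulateBiasGradFixedScale(int32_t *bg, float bgScale,
                                         const int32_t *lossInt, float lossScale,
                                         size_t batch, size_t features) {
    assert(bgScale > 0.0f);
    for (size_t f = 0; f < features; ++f) {
        int64_t sum = 0; /* reduce over the batch in integer space */
        for (size_t b = 0; b < batch; ++b) {
            sum += lossInt[b * features + f];
        }
        /* Convert from loss units to grad units, rounding once per call. */
        bg[f] += roundToInt32((double)sum * (double)lossScale / (double)bgScale);
    }
}
```

With lossScale = 0.1 and bgScale = 1.0, a column whose integer sum is 2 (real value 0.2) contributes nothing to bg[f], while a column whose sum is 20 contributes 2. Calling the same helper with bgScale equal to lossScale makes the rescale factor 1, so the integer sum is stored unchanged.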

The precise picture (verified against the v2 full text + Fig. 2):

  • Deutel's quantization is dynamic: scales are re-derived from observed data (for weights, at every SGD update, Eqs. 6-7), and the method is framed throughout as "dynamic adaptation of the zero-point and scale parameters" (Sec. IV-E).
  • Deutel has a forward bias (int32 bias on the int32 MAC accumulator, Fig. 2) but describes no bias-gradient accumulation (Eqs. 3-4 omit the bias).
  • The paper nowhere states that any scale is held static during training — the only static/PTQ mention is post-training, at deployment. Absent evidence, assume dynamic.

So ODT's fixed-scale integer bias-gradient accumulation deviates from the paper's dynamic scaling; it is ODT's own construction, not "following Deutel". The ODT scheme that does correspond to Deutel is Strategy A (dynamic-rescale: Linear weight grads, LayerNorm gamma/beta). (ODT also follows Deutel in per-layer error requant ≈ Eq. 4 and the float-space SGD step ≈ Eqs. 5-7.)

Finding 2 — the scheme's resolution is pinned by the grad init scale

Fresh SYM grad tensors init at scale 1.0 (initSymInt32QConfig), and sgdZeroGrad resets it each step. With bgScale = 1.0 the rounded sum is a whole number, so bias-grad updates move in whole LSBs of size 1.0·lr: a bias with |Σ dy| < 0.5 rounds to zero and freezes for that step (worst-case per-step deviation lr·0.5). This surfaced in the full-SYM chain test in #192: Linear biases deviate up to 3.9e-2 from the FLOAT32 twin, vs ≤4.2e-5 for all matmul/strategy-A paths. It is benign today (no full-SYM example trains Linear biases yet) but becomes relevant from the full-SYM examples (#207) and the FQT stages onward.
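The freeze can be reproduced with a few lines of arithmetic. The sketch below is an assumption-laden illustration, not ODT code: quantizedBiasStep and floatBiasStep are hypothetical names, the loss is treated as an already dequantized sum Σ dy, and rounding is half away from zero.

```c
#include <assert.h>
#include <stdint.h>

/* Round half away from zero without needing libm (assumed rounding mode). */
static int32_t roundToInt32(double x) {
    return (int32_t)(x >= 0.0 ? x + 0.5 : x - 0.5);
}

/* Hypothetical: magnitude of the bias change in one SGD step when the summed
 * gradient sumDy is first stored as an int32 at gradScale. */
static double quantizedBiasStep(double sumDy, double gradScale, double lr) {
    assert(gradScale > 0.0);
    int32_t q = roundToInt32(sumDy / gradScale); /* whole LSBs only */
    return lr * gradScale * (double)q;
}

/* Reference: the same step taken in float space (the FLOAT32 twin). */
static double floatBiasStep(double sumDy, double lr) {
    return lr * sumDy;
}
```

For lr = 0.1 and gradScale = 1.0, a summed gradient of 0.4 yields a quantized step of 0 against a float step of 0.04, which stays below the lr·0.5 = 0.05 bound; a summed gradient of 0.6 rounds up to one LSB and moves the bias by 0.1.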

Candidate schemes (to evaluate, not decided)

  • (a) Strategy A — reduce-over-batch in int32, then addSymInt32TensorsInplace: finer absmax-based resolution, matches weight-grad / LayerNorm gamma-beta handling AND the paper's dynamic scaling; not float-free; error grows with microbatch count M (the CONVENTIONS open-problem table).
  • (b) Keep fixed-scale as a deliberate, honestly-labelled float-free experiment.
  • (c) Seed bgScale from lossScale instead of 1.0.
  • Literature: Deutel leaves the option-(b) gradient-buffer dtype unspecified; their released code (if public) would settle what they actually do — a prior search found no public release.
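To make candidate (a) concrete, here is a minimal sketch of an absmax-based rescale-and-add. It is one plausible reading of the idea, not the behaviour of addSymInt32TensorsInplace: the name addSymInPlaceDynamic, the SKETCH_QMAX range, and the use of double intermediates are assumptions. The intermediate float math is also why this route is not float-free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_QMAX 2147483647.0 /* assumed symmetric int32 range */

/* Round half away from zero and clamp to the assumed symmetric range. */
static int32_t roundClampInt32(double x) {
    if (x >= SKETCH_QMAX) return INT32_MAX;
    if (x <= -SKETCH_QMAX) return -INT32_MAX;
    return (int32_t)(x >= 0.0 ? x + 0.5 : x - 0.5);
}

/* Hypothetical: acc (at *accScale) += upd (at updScale), then re-derive the
 * accumulator scale from the absmax of the summed real values. */
static void addSymInPlaceDynamic(int32_t *acc, double *accScale,
                                 const int32_t *upd, double updScale,
                                 size_t n) {
    assert(*accScale > 0.0 && updScale > 0.0);
    double absMax = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double v = acc[i] * *accScale + upd[i] * updScale;
        double a = v < 0.0 ? -v : v;
        if (a > absMax) absMax = a;
    }
    if (absMax == 0.0) { /* all-zero sum: keep the old scale */
        for (size_t i = 0; i < n; ++i) acc[i] = 0;
        return;
    }
    double newScale = absMax / SKETCH_QMAX;
    for (size_t i = 0; i < n; ++i) {
        double v = acc[i] * *accScale + upd[i] * updScale;
        acc[i] = roundClampInt32(v / newScale); /* requantize at new scale */
    }
    *accScale = newScale;
}
```

Starting from a zeroed accumulator at scale 1.0 and adding the integers {10, 1} at scale 0.02, the largest entry maps to the top of the range and the small entry dequantizes back to roughly 0.02, where a fixed scale of 1.0 would have rounded both to zero. Every call requantizes the running sum, which is consistent with the error growing in the microbatch count M.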

Constraints

Relates to #137 (FQT epic, same paper), #210 (SYM umbrella), and the CONVENTIONS "known precision Open Problem".
