Per-layer training-step trace facility (#257) by LeoBuron · Pull Request #260 · es-ude/OnDeviceTraining

LeoBuron · 2026-06-29T16:10:29Z

Summary

Implements the per-layer activation/gradient trace facility from #257 — a mechanism to record each layer's forward activation and backward gradient on both the C and PyTorch sides and diff them layer-by-layer, to localize where the C framework and PyTorch diverge during training.

Framework (src/userApi, production byte-identical):

traceSink_t callback + tracedGrads — fires per-layer forward activations, the loss-grad, and act-grads (∂L/∂ layer output, matching PyTorch forward-hook activation.grad).
traceModelWeights / traceModelGrads — per-trainable-layer parameter + gradient dumps (Linear/Conv1d/ConvT1d/LayerNorm).
Implemented by extracting static calculateGradsImpl(...,sink,ctx); calculateGradsSequential becomes a NULL-sink wrapper, so the production training path is unchanged (every sink call is NULL-guarded; verified by a closed-form CE+softmax characterization test).
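
The NULL-sink wrapper pattern described above can be sketched as follows. This is a minimal Python analogue of the C structure, not the framework's code: the toy scalar layers, the sink signature `(name, phase, values, ctx)`, and all function names are assumptions made for illustration. It shows one shared implementation body, three guarded sink call sites (fwd, lossgrad, agrad), and two thin entry points.

```python
from typing import Callable, List, Optional

# Hypothetical sink signature: (probe_name, phase, values, ctx) -> None.
Sink = Callable[[str, str, List[float], object], None]


def calculate_grads_impl(x, weights, target, sink: Optional[Sink] = None, ctx=None):
    """Toy chain of scalar scale layers with loss 0.5 * (out - target)**2."""
    acts = [x]
    for i, w in enumerate(weights):
        acts.append(acts[-1] * w)
        if sink is not None:  # guarded: nothing happens when tracing is off
            sink(f"layer{i}", "fwd", [acts[-1]], ctx)

    grad = acts[-1] - target  # dL/d(final output)
    if sink is not None:
        sink("loss", "lossgrad", [grad], ctx)

    weight_grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        if sink is not None:
            sink(f"layer{i}", "agrad", [grad], ctx)  # dL/d(layer output)
        weight_grads[i] = grad * acts[i]  # dL/d(weight i)
        grad = grad * weights[i]          # dL/d(layer input)
    return weight_grads


def calculate_grads_sequential(x, weights, target):
    """Production entry point: same body, sink disabled."""
    return calculate_grads_impl(x, weights, target, sink=None)


def traced_grads(x, weights, target, sink, ctx=None):
    """Tracing entry point: same body, sink enabled."""
    return calculate_grads_impl(x, weights, target, sink=sink, ctx=ctx)


events = []
plain = calculate_grads_sequential(2.0, [3.0, 0.5], 1.0)
traced = traced_grads(2.0, [3.0, 0.5], 1.0, lambda n, p, v, c: events.append((n, p)))
```

Because both entry points run the same body, the traced and untraced gradients in this sketch are equal by construction; the sink only observes values and never changes them.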

Host tooling (examples/, all file I/O lives here — never in src/):

_shared/npyDumpSink — writes each fired FLOAT32 tensor to .npy.
kws_raw/trace_c.c — controlled-step harness (loads the exported weights, runs one step on a fully configurable batch, dumps every probe).
kws_raw/trace_pytorch.py — PyTorch mirror via hooks, with the ×B reconciliation (C's per-sample backward is unscaled; PyTorch's mean backward carries 1/B).
_shared/trace_compare.py — depth-ordered localizer with an absolute-error gate.
_shared/trace_sweep.py — multi-batch aggregator.
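
The ×B reconciliation mentioned for trace_pytorch.py can be illustrated with a small closed-form example. This is a sketch under assumptions: it uses a squared-error loss and plain Python lists rather than the real model or PyTorch, and the function names are hypothetical. It shows that gradients of a mean-reduced loss carry a 1/B factor, so they must be multiplied by B before they are diffed against unscaled per-sample gradients.

```python
def mean_loss_output_grads(outputs, targets):
    """dL/d(out_i) for L = (1/B) * sum_i 0.5 * (out_i - t_i)**2 (mean reduction)."""
    batch = len(outputs)
    return [(o - t) / batch for o, t in zip(outputs, targets)]


def per_sample_output_grads(outputs, targets):
    """dL_i/d(out_i) for L_i = 0.5 * (out_i - t_i)**2, with no 1/B factor."""
    return [o - t for o, t in zip(outputs, targets)]


outputs = [1.5, -0.5, 2.0, 0.25]
targets = [1.0, 0.0, 0.0, 0.25]
B = len(outputs)

mean_side = mean_loss_output_grads(outputs, targets)       # carries 1/B
unscaled_side = per_sample_output_grads(outputs, targets)  # unscaled

# Reconcile by multiplying the mean-reduced gradients by B before diffing.
reconciled = [g * B for g in mean_side]
max_abs_diff = max(abs(r - u) for r, u in zip(reconciled, unscaled_side))
```

Without the ×B step, every gradient would differ by exactly a factor of B, which a layer-by-layer diff would report as divergence at every probe.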

First diagnostic result (kws_raw)

Controlled single step from identical loaded weights, 10-batch sweep: forward activations, inter-layer act-grads, loss, and the optimizer step are clean on every batch; the divergence is in the early-layer parameter gradients, dominated by the first LayerNorm (ln1), and is batch-dependent. Per-step weight impact is tiny (≤ 5e-5). Root-causing it (LayerNorm-backward formula vs. amplified float noise, and whether it compounds over training) is a follow-up that the facility now enables.

Verification

Production path byte-identical (NULL-sink wrapper + closed-form characterization test).
63/63 unit tests pass; alloc-locality and clang-format-21 checks are clean; examples build.
Whole-branch reviewed (no Critical/Important findings); production-byte-identical, I/O-boundary, and act-grad/×B/effB constraints all verified.

Closes #257

🤖 Generated with Claude Code

…ce sink

Adds TraceApi.h with traceSink_t typedef; extracts the body of calculateGradsSequential into a static calculateGradsImpl(..., sink, sinkCtx) with three guarded sink calls (fwd/lossgrad/agrad). calculateGradsSequential becomes a NULL-sink wrapper, so the production path is byte-identical. Characterisation test (UnitTestCalculateGradsSequential) pins closed-form CE+softmax gradients and guards the refactor.
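
The closed-form CE+softmax gradient that such a characterisation test can pin is the standard identity d(CE)/d(logits) = softmax(logits) − onehot(label). The following is a minimal sketch of that idea, not the contents of UnitTestCalculateGradsSequential: it checks the identity against central finite differences, and the logits, label, and helper names are illustrative assumptions.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])


def closed_form_grad(logits, label):
    """d(CE)/d(logits) = softmax(logits) - onehot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]


def finite_difference_grad(logits, label, eps=1e-6):
    """Central differences, used only as an independent reference."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((cross_entropy(up, label) - cross_entropy(down, label)) / (2 * eps))
    return grads


logits, label = [0.2, -1.3, 2.1], 2
analytic = closed_form_grad(logits, label)
numeric = finite_difference_grad(logits, label)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
```

Pinning the analytic values means a refactor that changes the gradient computation fails the test even when the old and new code agree with each other.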

Subagent edits skipped the clang-format PostToolUse hook; this brings the Phase-1 unit test into clang-format-21 compliance (braces on the guard return, call-arg reflow). Production .c/.h were already clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…yTorch hooks)

The "agrad" probe was firing gradCurr (∂L/∂layer_input) after backward, so e.g. conv1.agrad had shape [1,1,1000] instead of the expected [1,16,1000], breaking comparison with PyTorch's activation.grad. Move the sink call to the top of the backward loop and fire gradNext (the upstream wire gradient = ∂L/∂layer_output) before initiating gradCurr or calling backward. With sink==NULL the production path is byte-identical to before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
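
The ordering fix can be sketched with a toy backward loop. This is an illustration under assumptions, not the framework's code: the layers are bias-free linear maps stored as nested lists, and `matvec_t`, `backward_with_agrad`, and the probe names are hypothetical. The point is that the sink fires on the gradient with respect to the layer's output before the gradient with respect to its input is computed, so the probe has the output's shape.

```python
def matvec_t(weight, grad_out):
    """Return W^T @ grad_out for a weight stored as [out_dim][in_dim]."""
    in_dim = len(weight[0])
    return [
        sum(weight[o][i] * grad_out[o] for o in range(len(weight)))
        for i in range(in_dim)
    ]


def backward_with_agrad(weights, loss_grad, sink):
    """weights[k] has shape [out_dim_k][in_dim_k]; loss_grad is dL/d(last output)."""
    grad_next = loss_grad
    for k in reversed(range(len(weights))):
        # Fire first: grad_next is dL/d(output of layer k).
        sink(f"layer{k}.agrad", grad_next)
        # Only then compute dL/d(input of layer k), which feeds layer k-1.
        grad_curr = matvec_t(weights[k], grad_next)
        grad_next = grad_curr
    return grad_next


# layer0 maps 1 -> 3 features, layer1 maps 3 -> 2 features.
weights = [
    [[1.0], [2.0], [3.0]],
    [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
]
shapes = {}
input_grad = backward_with_agrad(
    weights, [1.0, 2.0], lambda name, g: shapes.update({name: len(g)})
)
```

Here `layer0.agrad` has three entries (the layer's output width) rather than one (its input width), which mirrors the [1,16,1000] versus [1,1,1000] mismatch described above.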

…eep aggregator

Add --abs-floor (default 1e-4) to trace_compare.py: drift now requires BOTH abs error > floor AND relative jump > JUMP_FACTOR, preventing near-zero act-grads (abs ~3e-7) from firing a spurious flag while still surfacing param-grad signal.

Extract compare_pairs(c_dir, pt_dir) -> list[dict] as a reusable function returning {probe, phase, tier, max_abs, max_rel} for every matched pair.

Add examples/_shared/trace_sweep.py: runs N non-overlapping batches, calls compare_pairs per batch, aggregates mean/max per (probe, phase), and prints a full tier-sorted table plus a focused param-grad summary sorted descending by mean_abs, the key signal for which layer diverges robustly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
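
The two-condition drift gate can be sketched as below. This is not the code in trace_compare.py. Two things are assumptions: the "relative jump" is taken to be a probe's max_rel divided by the previous probe's max_rel in depth order, and the JUMP_FACTOR value of 10.0 is a placeholder, since the source gives neither. The probe names in the sample records are also illustrative.

```python
ABS_FLOOR = 1e-4    # default named in the commit message
JUMP_FACTOR = 10.0  # hypothetical value; the real constant is not given here


def first_drift(records, abs_floor=ABS_FLOOR, jump_factor=JUMP_FACTOR):
    """Return the first record, in depth order, that passes both gates, else None.

    Each record is a dict with "probe", "max_abs" and "max_rel". The relative
    jump is assumed to be max_rel divided by the previous probe's max_rel, so
    the first record has no predecessor and is never flagged in this sketch.
    """
    prev_rel = None
    for rec in records:
        if prev_rel is not None:
            jump = rec["max_rel"] / max(prev_rel, 1e-30)
            if rec["max_abs"] > abs_floor and jump > jump_factor:
                return rec
        prev_rel = rec["max_rel"]
    return None


records = [
    {"probe": "fc.agrad", "max_abs": 3e-7, "max_rel": 1e-6},
    # Large relative jump, but the absolute error is near zero.
    {"probe": "conv1.agrad", "max_abs": 3e-7, "max_rel": 5e-3},
    # Large relative jump and an absolute error above the floor.
    {"probe": "ln1.wgrad", "max_abs": 2e-3, "max_rel": 4e-1},
]

gated = first_drift(records)
ungated = first_drift(records, abs_floor=0.0)
```

With the floor in place the near-zero act-grad is skipped and the parameter gradient is reported; with the floor disabled the near-zero act-grad is flagged first, which is the spurious case the commit describes.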

… cleanup, messages

- Assert ndim==2 for all 4 events in testTracedGradsFiresInOrder (spec §7 discharged)
- Add structural comment: tracedGrads/calculateGradsSequential share calculateGradsImpl
- compare_dir returns (rc, first_drift) tuple; self_test asserts first-drift probe+phase
- trace_pytorch: assert list(acts)==FWD_PROBES before return (manifest drift guard)
- npy_dump_sink.h: fix comment "asserts" -> "hard-errors (exit 1)"
- trace_c.c batch-clamp fprintf: show requested g_batch alongside clamped effB
- trace_sweep: rmtree dump dirs before each batch (stale .sNN guard); relabel mean_abs -> mean(maxabs) in aggregate + focused-summary tables

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
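
The mean(maxabs) label reflects that each batch contributes one max-abs error per probe, and the sweep then averages those maxima. A minimal sketch of that aggregation follows, assuming each batch yields a list of `{probe, phase, max_abs}` dicts; the `aggregate` function, the "pgrad" phase name, and the sample numbers are hypothetical and are not taken from trace_sweep.py.

```python
from collections import defaultdict


def aggregate(per_batch_records):
    """per_batch_records: one list of {probe, phase, max_abs} dicts per batch."""
    buckets = defaultdict(list)
    for records in per_batch_records:
        for rec in records:
            buckets[(rec["probe"], rec["phase"])].append(rec["max_abs"])
    rows = [
        {
            "probe": probe,
            "phase": phase,
            "mean(maxabs)": sum(vals) / len(vals),
            "max(maxabs)": max(vals),
        }
        for (probe, phase), vals in buckets.items()
    ]
    # Largest mean first, so the most consistently diverging probe leads the table.
    return sorted(rows, key=lambda r: r["mean(maxabs)"], reverse=True)


batches = [
    [
        {"probe": "conv1", "phase": "pgrad", "max_abs": 1e-4},
        {"probe": "ln1", "phase": "pgrad", "max_abs": 2e-3},
    ],
    [
        {"probe": "conv1", "phase": "pgrad", "max_abs": 3e-4},
        {"probe": "ln1", "phase": "pgrad", "max_abs": 4e-3},
    ],
]
table = aggregate(batches)
```

Sorting by the mean of per-batch maxima favours probes that are off on every batch over probes with a single outlier batch, which is what makes the ordering useful for a batch-dependent divergence.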

LeoBuron and others added 12 commits June 29, 2026 14:37

feat(training_loop): tracedGrads fires activation + act-grad sink per…

fc73ca2

… layer

feat(training_loop): traceModelWeights/Grads dump per-layer params + …

87b5cf4

…grads

feat(examples/_shared): npyDumpSink trace sink writes FLOAT32 tensors…

d7f7655

… to .npy

feat(examples/kws_raw): trace_c controlled-step harness writes per-la…

177788d

…yer dumps Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(examples/kws_raw): trace_pytorch mirrors the controlled step via…

a83ce86

… hooks

feat(examples/_shared): trace_compare localizes first C-vs-PyTorch drift

a991211

style(examples): clang-format trace harness + dump sink

c85ca03

Subagent edits to npy_dump_sink.c and trace_c.c skipped the clang-format PostToolUse hook; bring them into clang-format-21 compliance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

LeoBuron mentioned this pull request Jun 29, 2026

Observer layer: per-layer activation/grad tracing to diagnose the C-vs-PyTorch training divergence #257

Closed

LeoBuron merged commit 65bf907 into develop Jun 30, 2026
8 checks passed

LeoBuron deleted the observer-trace-facility branch June 30, 2026 07:18

LeoBuron merged 12 commits into develop from observer-trace-facility