
Per-layer training-step trace facility (#257) #260

Merged: LeoBuron merged 12 commits into develop from observer-trace-facility on Jun 30, 2026
@LeoBuron

Member

Summary

Implements the per-layer activation/gradient trace facility from #257 — a mechanism to record each layer's forward activation and backward gradient on both the C and PyTorch sides and diff them layer-by-layer, to localize where the C framework and PyTorch diverge during training.

Framework (src/userApi, production byte-identical):

  • traceSink_t callback + tracedGrads — fires per-layer forward activations, the loss-grad, and act-grads (∂L/∂ layer output, matching PyTorch forward-hook activation.grad).
  • traceModelWeights / traceModelGrads — per-trainable-layer parameter + gradient dumps (Linear/Conv1d/ConvT1d/LayerNorm).
  • Implemented by extracting static calculateGradsImpl(...,sink,ctx); calculateGradsSequential becomes a NULL-sink wrapper, so the production training path is unchanged (every sink call is NULL-guarded; verified by a closed-form CE+softmax characterization test).
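The NULL-sink wrapper pattern described above can be sketched in a few lines. This is a Python transliteration on a toy scalar chain, not the C code from src/userApi: the function names mirror the PR's identifiers, but the sink signature, the probe names, and the toy model are assumptions made for illustration.

```python
from typing import Callable, List, Optional

# Hypothetical sink signature: (probe name, phase, values, user context).
TraceSink = Callable[[str, str, List[float], object], None]


def calculate_grads_impl(weights, x, target,
                         sink: Optional[TraceSink] = None, ctx=None):
    """Toy chain a_{i+1} = w_i * a_i with loss = 0.5 * (a_last - target)^2."""
    # Forward pass: fire each layer's activation only if a sink is attached.
    acts = [x]
    for i, w in enumerate(weights):
        acts.append(w * acts[-1])
        if sink is not None:
            sink(f"layer{i}", "fwd", [acts[-1]], ctx)

    # Loss gradient with respect to the network output.
    grad_next = acts[-1] - target
    if sink is not None:
        sink("loss", "lossgrad", [grad_next], ctx)

    # Backward pass: fire dL/d(layer output) at the top of the loop,
    # before the layer's own gradients are computed.
    param_grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        if sink is not None:
            sink(f"layer{i}", "agrad", [grad_next], ctx)
        param_grads[i] = grad_next * acts[i]  # dL/dw_i
        grad_next = grad_next * weights[i]    # dL/d(layer input)
    return param_grads


def calculate_grads_sequential(weights, x, target):
    """Production entry point: same body, no sink attached."""
    return calculate_grads_impl(weights, x, target, sink=None, ctx=None)


def traced_grads(weights, x, target, sink: TraceSink, ctx=None):
    """Tracing entry point: same body, sink attached."""
    return calculate_grads_impl(weights, x, target, sink=sink, ctx=ctx)
```

Because both entry points share one body and every sink call is guarded, attaching a sink can only add observations; it cannot change the computed gradients.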

Host tooling (examples/, all file I/O lives here — never in src/):

  • _shared/npyDumpSink — writes each fired FLOAT32 tensor to .npy.
  • kws_raw/trace_c.c — controlled-step harness (loads the exported weights, runs one step on a fully configurable batch, dumps every probe).
  • kws_raw/trace_pytorch.py — PyTorch mirror via hooks, with the ×B reconciliation (C's per-sample backward is unscaled; PyTorch's mean backward carries 1/B).
  • _shared/trace_compare.py — depth-ordered localizer with an absolute-error gate.
  • _shared/trace_sweep.py — multi-batch aggregator.
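The ×B reconciliation mentioned for trace_pytorch.py follows from how a mean-reduced loss distributes its gradient. The sketch below uses pure Python and a softmax cross-entropy loss to show the factor; it is not the code from trace_pytorch.py, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one sample's logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_sample_grad(logits, label):
    """Unscaled per-sample gradient dL_i/dz = softmax(z) - onehot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]


def mean_loss_grad(batch_logits, labels):
    """Gradient of the batch-mean loss: every sample's row carries 1/B."""
    b = len(batch_logits)
    return [[g / b for g in per_sample_grad(z, y)]
            for z, y in zip(batch_logits, labels)]


def reconcile(mean_grads):
    """Multiply mean-reduced gradients by B to match the unscaled ones."""
    b = len(mean_grads)
    return [[g * b for g in row] for row in mean_grads]
```

With a batch of B samples, `reconcile(mean_loss_grad(...))` should agree with the per-sample gradients up to float rounding, which is the comparison the two trace dumps need to be on equal footing.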

First diagnostic result (kws_raw)

Controlled single step from identical loaded weights, 10-batch sweep: forward activations, inter-layer act-grads, loss, and the optimizer step are clean on every batch. The divergence is in the early-layer parameter gradients, is dominated by the first LayerNorm (ln1), and is batch-dependent. Per-step weight impact is tiny (≤ 5e-5). Root-cause analysis (a LayerNorm-backward formula difference versus amplified float noise, and whether the divergence compounds over training) is a follow-up that the facility now enables.

Verification

  • Production path byte-identical (NULL-sink wrapper + closed-form characterization test).
  • 63/63 unit tests; alloc-locality + clang-format-21 clean; examples build.
  • Whole-branch reviewed (no Critical/Important findings); production-byte-identical, I/O-boundary, and act-grad/×B/effB constraints all verified.
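For readers unfamiliar with the closed-form check named above: the gradient of cross-entropy over softmax logits is softmax(z) minus the one-hot label, so a test can pin exact expected values. A minimal sketch of that idea, checked here against central finite differences, follows; it is not the project's unit test, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """CE loss for one sample: logsumexp(z) - z[label]."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return lse - logits[label]


def closed_form_grad(logits, label):
    """dL/dz = softmax(z) - onehot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]


def numeric_grad(logits, label, eps=1e-6):
    """Central finite differences, used only to cross-check the closed form."""
    grads = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        grads.append((cross_entropy(up, label) - cross_entropy(down, label))
                     / (2.0 * eps))
    return grads
```

Pinning the closed form rather than a previously recorded output means the test fails on a real gradient change but not on a harmless refactor.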

Closes #257

🤖 Generated with Claude Code

LeoBuron and others added 12 commits June 29, 2026 14:37
…ce sink

Adds TraceApi.h with traceSink_t typedef; extracts the body of
calculateGradsSequential into a static calculateGradsImpl(..., sink, sinkCtx)
with three guarded sink calls (fwd/lossgrad/agrad). calculateGradsSequential
becomes a NULL-sink wrapper — production path byte-identical. Characterisation
test (UnitTestCalculateGradsSequential) pins closed-form CE+softmax gradients
and guards the refactor.

Subagent edits skipped the clang-format PostToolUse hook; this brings the
Phase-1 unit test into clang-format-21 compliance (braces on the guard
return, call-arg reflow). Production .c/.h were already clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…yer dumps

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…yTorch hooks)

The "agrad" probe was firing gradCurr (∂L/∂layer_input) after backward,
so e.g. conv1.agrad had shape [1,1,1000] instead of the expected [1,16,1000],
breaking comparison with PyTorch's activation.grad.

Move the sink call to the top of the backward loop and fire gradNext
(the upstream wire gradient = ∂L/∂layer_output) before initiating
gradCurr or calling backward. With sink==NULL the production path is
byte-identical to before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eep aggregator

Add --abs-floor (default 1e-4) to trace_compare.py: drift now requires BOTH
abs error > floor AND relative jump > JUMP_FACTOR, preventing near-zero act-grads
(abs ~3e-7) from firing a spurious flag while still surfacing param-grad signal.
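One possible reading of this two-condition gate is sketched below. The `--abs-floor` default of 1e-4 comes from the commit message; the `JUMP_FACTOR` value, the choice of the previous probe's relative error as the jump baseline, and the probe names and numbers in the usage example are all assumptions, not taken from trace_compare.py.

```python
ABS_FLOOR = 1e-4    # default of --abs-floor per the commit message
JUMP_FACTOR = 10.0  # assumed value; the real constant is not stated here


def first_drift(rows, abs_floor=ABS_FLOOR, jump_factor=JUMP_FACTOR):
    """Return the first depth-ordered row that passes both gates, else None.

    rows: list of dicts with keys "probe", "max_abs", "max_rel",
    ordered from the shallowest probe to the deepest.
    """
    prev_rel = None
    for row in rows:
        # Gate 1: relative error jumped versus the previous probe (assumed baseline).
        jumped = prev_rel is not None and row["max_rel"] > jump_factor * prev_rel
        # Gate 2: absolute error is above the noise floor.
        if row["max_abs"] > abs_floor and jumped:
            return row
        prev_rel = row["max_rel"]
    return None


# Illustrative numbers only, not measured results.
rows = [
    {"probe": "conv1.fwd", "max_abs": 2e-7, "max_rel": 1e-6},
    {"probe": "ln1.agrad", "max_abs": 3e-7, "max_rel": 5e-2},
    {"probe": "ln1.weight.grad", "max_abs": 4e-3, "max_rel": 8e-1},
]
flagged = first_drift(rows)
```

In this made-up input the near-zero act-grad has a large relative error but stays under the absolute floor, so it is skipped and the parameter gradient is the first row flagged.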

Extract compare_pairs(c_dir, pt_dir) -> list[dict] as a reusable function
returning {probe, phase, tier, max_abs, max_rel} for every matched pair.

Add examples/_shared/trace_sweep.py: runs N non-overlapping batches, calls
compare_pairs per batch, aggregates mean/max per (probe, phase), and prints
a full tier-sorted table plus a focused param-grad summary sorted descending
by mean_abs — the key signal for which layer diverges robustly.
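The aggregation step can be sketched as follows, assuming each batch yields a list of dicts with the keys the commit message gives for `compare_pairs` (`probe`, `phase`, `tier`, `max_abs`, `max_rel`). The `aggregate` function, its output field names, and the `"pgrad"` phase label are hypothetical and are not taken from trace_sweep.py.

```python
from collections import defaultdict


def aggregate(batches):
    """Group per-batch max_abs values by (probe, phase) and summarize them.

    batches: list of per-batch results, each a list of dicts with at least
    "probe", "phase", and "max_abs". Returns rows sorted by descending
    mean of the per-batch max_abs values.
    """
    grouped = defaultdict(list)
    for rows in batches:
        for row in rows:
            grouped[(row["probe"], row["phase"])].append(row["max_abs"])

    table = []
    for (probe, phase), values in grouped.items():
        table.append({
            "probe": probe,
            "phase": phase,
            "mean_max_abs": sum(values) / len(values),  # mean over batches
            "max_max_abs": max(values),                 # worst batch
            "n_batches": len(values),
        })
    table.sort(key=lambda r: r["mean_max_abs"], reverse=True)
    return table
```

Sorting by the mean across batches rather than by the single worst batch favors layers that diverge consistently over layers that spike once.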

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Subagent edits to npy_dump_sink.c and trace_c.c skipped the clang-format
PostToolUse hook; bring them into clang-format-21 compliance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… cleanup, messages

- Assert ndim==2 for all 4 events in testTracedGradsFiresInOrder (spec §7 discharged)
- Add structural comment: tracedGrads/calculateGradsSequential share calculateGradsImpl
- compare_dir returns (rc, first_drift) tuple; self_test asserts first-drift probe+phase
- trace_pytorch: assert list(acts)==FWD_PROBES before return (manifest drift guard)
- npy_dump_sink.h: fix comment "asserts" -> "hard-errors (exit 1)"
- trace_c.c batch-clamp fprintf: show requested g_batch alongside clamped effB
- trace_sweep: rmtree dump dirs before each batch (stale .sNN guard); relabel
  mean_abs -> mean(maxabs) in aggregate + focused-summary tables

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@LeoBuron LeoBuron merged commit 65bf907 into develop Jun 30, 2026
8 checks passed
@LeoBuron LeoBuron deleted the observer-trace-facility branch June 30, 2026 07:18