Per-layer training-step trace facility (#257) #260
Merged
…ce sink Adds TraceApi.h with traceSink_t typedef; extracts the body of calculateGradsSequential into a static calculateGradsImpl(..., sink, sinkCtx) with three guarded sink calls (fwd/lossgrad/agrad). calculateGradsSequential becomes a NULL-sink wrapper — production path byte-identical. Characterisation test (UnitTestCalculateGradsSequential) pins closed-form CE+softmax gradients and guards the refactor.
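The closed-form result that such a characterisation test can pin is that, for softmax followed by cross-entropy, the gradient with respect to the logits is softmax(z) minus the one-hot target. The following is a minimal Python sketch of that identity checked against central finite differences. It is not the repository's C test, and every function name in it is hypothetical.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    # CE loss for a single sample with an integer class label.
    return -math.log(softmax(logits)[target])


def ce_softmax_grad(logits, target):
    # Closed form: dL/dz_i = softmax(z)_i - 1[i == target].
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def numeric_grad(logits, target, eps=1e-6):
    # Central finite differences, used only as an independent check.
    grad = []
    for i in range(len(logits)):
        hi = list(logits)
        lo = list(logits)
        hi[i] += eps
        lo[i] -= eps
        grad.append((cross_entropy(hi, target) - cross_entropy(lo, target)) / (2 * eps))
    return grad


logits = [0.5, -1.25, 2.0]  # arbitrary illustrative values
target = 2
analytic = ce_softmax_grad(logits, target)
numeric = numeric_grad(logits, target)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

Because the expected gradient is known exactly, a test of this shape can detect a refactor that changes the numbers, which is the role the commit assigns to UnitTestCalculateGradsSequential.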
Subagent edits skipped the clang-format PostToolUse hook; this brings the Phase-1 unit test into clang-format-21 compliance (braces on the guard return, call-arg reflow). Production .c/.h were already clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…yer dumps Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…yTorch hooks) The "agrad" probe was firing gradCurr (∂L/∂layer_input) after backward, so e.g. conv1.agrad had shape [1,1,1000] instead of the expected [1,16,1000], breaking comparison with PyTorch's activation.grad. Move the sink call to the top of the backward loop and fire gradNext (the upstream wire gradient = ∂L/∂layer_output) before initiating gradCurr or calling backward. With sink==NULL the production path is byte-identical to before. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
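The ordering fix described above can be illustrated with a toy backward loop. This is a Python sketch of the pattern only, not the C implementation: the ToyLayer class, the record callback, and the weights are all hypothetical. The sink fires on the upstream gradient (∂L/∂layer_output) before the layer's backward runs, and a None sink skips the call entirely.

```python
class ToyLayer:
    """Linear map y = W x, used only to give input and output different sizes."""

    def __init__(self, name, weight):
        self.name = name
        self.weight = weight  # rows = outputs, columns = inputs

    def backward(self, grad_out):
        # grad_in = W^T grad_out, i.e. dL/d(layer input).
        n_in = len(self.weight[0])
        return [
            sum(self.weight[o][i] * grad_out[o] for o in range(len(grad_out)))
            for i in range(n_in)
        ]


def backward_pass(layers, loss_grad, sink=None):
    grad_next = loss_grad  # gradient w.r.t. the last layer's output
    for layer in reversed(layers):
        if sink is not None:
            # Fire BEFORE backward: grad_next is dL/d(layer output) here.
            sink(layer.name, "agrad", grad_next)
        grad_curr = layer.backward(grad_next)  # dL/d(layer input)
        grad_next = grad_curr
    return grad_next


layers = [
    ToyLayer("conv1", [[1.0], [2.0], [3.0]]),                # 1 input, 3 outputs
    ToyLayer("fc", [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]),     # 3 inputs, 2 outputs
]

events = {}


def record(name, phase, tensor):
    events[(name, phase)] = list(tensor)


traced = backward_pass(layers, [1.0, -2.0], sink=record)
untraced = backward_pass(layers, [1.0, -2.0])  # no sink: same arithmetic
```

In this toy, the conv1 probe has three entries (its output size) rather than one (its input size), which mirrors the [1,16,1000] versus [1,1,1000] shape difference the commit describes. The traced and untraced passes return the same gradient.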
…eep aggregator
Add --abs-floor (default 1e-4) to trace_compare.py: drift now requires BOTH
abs error > floor AND relative jump > JUMP_FACTOR, preventing near-zero act-grads
(abs ~3e-7) from firing a spurious flag while still surfacing param-grad signal.
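A two-condition gate of this kind might look like the sketch below. The source does not show how the relative jump is measured or what JUMP_FACTOR is, so the sketch assumes that the jump compares a probe's relative error with that of the previous probe in depth order and that the factor is 10. Both are assumptions, and is_drift is a hypothetical name.

```python
ABS_FLOOR = 1e-4    # default of --abs-floor, per the commit message
JUMP_FACTOR = 10.0  # ASSUMED value; the real constant is not shown in the source


def is_drift(max_abs, max_rel, prev_max_rel,
             abs_floor=ABS_FLOOR, jump_factor=JUMP_FACTOR):
    """Flag drift only when BOTH conditions hold (assumed semantics)."""
    exceeds_floor = max_abs > abs_floor
    # Guard against a zero baseline so the comparison stays meaningful.
    jumped = max_rel > jump_factor * max(prev_max_rel, 1e-12)
    return exceeds_floor and jumped


# Near-zero act-grad: large relative error but tiny absolute error, not flagged.
tiny_actgrad = is_drift(max_abs=3e-7, max_rel=0.5, prev_max_rel=1e-6)

# Illustrative param-grad case: above the floor and a clear relative jump.
param_grad = is_drift(max_abs=2e-3, max_rel=0.05, prev_max_rel=1e-6)
```

The absolute floor keeps tensors whose entries are close to zero from dominating the report, since their relative error is unstable regardless of whether the two implementations agree.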
Extract compare_pairs(c_dir, pt_dir) -> list[dict] as a reusable function
returning {probe, phase, tier, max_abs, max_rel} for every matched pair.
Add examples/_shared/trace_sweep.py: runs N non-overlapping batches, calls
compare_pairs per batch, aggregates mean/max per (probe, phase), and prints
a full tier-sorted table plus a focused param-grad summary sorted descending
by mean_abs — the key signal for which layer diverges robustly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
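The aggregation step can be sketched as follows, assuming each batch yields a list of dicts with the keys named in the commit message (probe, phase, tier, max_abs, max_rel). The aggregate function, the "pgrad" phase label, the probe names, and all numbers are made up for illustration; they are not output from trace_sweep.py.

```python
from collections import defaultdict


def aggregate(per_batch_rows):
    """Group max_abs by (probe, phase) across batches; report mean and max."""
    grouped = defaultdict(list)
    for rows in per_batch_rows:
        for row in rows:
            grouped[(row["probe"], row["phase"])].append(row["max_abs"])

    summary = []
    for (probe, phase), values in grouped.items():
        summary.append({
            "probe": probe,
            "phase": phase,
            "mean_max_abs": sum(values) / len(values),  # mean of per-batch max
            "max_max_abs": max(values),
            "batches": len(values),
        })
    # Largest mean first, so the most consistently divergent probe leads.
    summary.sort(key=lambda r: r["mean_max_abs"], reverse=True)
    return summary


# Two hypothetical batches with invented probes and values.
batch_0 = [
    {"probe": "layerA", "phase": "pgrad", "tier": 0, "max_abs": 0.004, "max_rel": 0.1},
    {"probe": "layerB", "phase": "pgrad", "tier": 1, "max_abs": 0.001, "max_rel": 0.02},
]
batch_1 = [
    {"probe": "layerA", "phase": "pgrad", "tier": 0, "max_abs": 0.002, "max_rel": 0.05},
    {"probe": "layerB", "phase": "pgrad", "tier": 1, "max_abs": 0.003, "max_rel": 0.04},
]
summary = aggregate([batch_0, batch_1])
```

Sorting by the mean of the per-batch maxima, rather than by a single batch's maximum, favours probes that diverge on most batches over probes that spike once.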
Subagent edits to npy_dump_sink.c and trace_c.c skipped the clang-format PostToolUse hook; bring them into clang-format-21 compliance. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… cleanup, messages
- Assert ndim==2 for all 4 events in testTracedGradsFiresInOrder (spec §7 discharged)
- Add structural comment: tracedGrads/calculateGradsSequential share calculateGradsImpl
- compare_dir returns (rc, first_drift) tuple; self_test asserts first-drift probe+phase
- trace_pytorch: assert list(acts)==FWD_PROBES before return (manifest drift guard)
- npy_dump_sink.h: fix comment "asserts" -> "hard-errors (exit 1)"
- trace_c.c batch-clamp fprintf: show requested g_batch alongside clamped effB
- trace_sweep: rmtree dump dirs before each batch (stale .sNN guard); relabel mean_abs -> mean(maxabs) in aggregate + focused-summary tables
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Implements the per-layer activation/gradient trace facility from #257 — a mechanism to record each layer's forward activation and backward gradient on both the C and PyTorch sides and diff them layer-by-layer, to localize where the C framework and PyTorch diverge during training.
Framework (src/userApi, production byte-identical):
- traceSink_t callback + tracedGrads: fires per-layer forward activations, the loss-grad, and act-grads (∂L/∂ layer output, matching PyTorch forward-hook activation.grad).
- traceModelWeights / traceModelGrads: per-trainable-layer parameter + gradient dumps (Linear/Conv1d/ConvT1d/LayerNorm).
- static calculateGradsImpl(..., sink, ctx); calculateGradsSequential becomes a NULL-sink wrapper, so the production training path is unchanged (every sink call is NULL-guarded; verified by a closed-form CE+softmax characterization test).

Host tooling (examples/, all file I/O lives here, never in src/):
- _shared/npyDumpSink: writes each fired FLOAT32 tensor to .npy.
- kws_raw/trace_c.c: controlled-step harness (loads the exported weights, runs one step on a fully configurable batch, dumps every probe).
- kws_raw/trace_pytorch.py: PyTorch mirror via hooks, with the ×B reconciliation (C's per-sample backward is unscaled; PyTorch's mean backward carries 1/B).
- _shared/trace_compare.py: depth-ordered localizer with an absolute-error gate.
- _shared/trace_sweep.py: multi-batch aggregator.

First diagnostic result (kws_raw)
Controlled single step from identical loaded weights, 10-batch sweep: forward activations, inter-layer act-grads, loss, and the optimizer step are clean on every batch; the divergence is in the early-layer parameter gradients, dominated by the first LayerNorm (ln1), and is batch-dependent. Per-step weight impact is tiny (≤ 5e-5). Root-cause (LayerNorm-backward formula vs amplified float noise; whether it compounds over training) is a follow-up the facility now enables.

Verification
NULL-sink wrapper + closed-form characterization test).

Closes #257
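The ×B reconciliation noted for trace_pytorch.py can be checked numerically on a toy loss. The sketch below assumes a per-sample squared-error loss and a batch of four scalars. It uses neither PyTorch nor the repository code, and all names and values are hypothetical. It shows only that gradients of a batch-mean loss are the unscaled per-sample gradients divided by B, so multiplying by B puts the two on the same scale.

```python
def per_sample_grads(acts, targets):
    # Unscaled gradient of L_i = 0.5 * (a_i - t_i)^2 w.r.t. a_i.
    return [a - t for a, t in zip(acts, targets)]


def mean_loss(acts, targets):
    # Batch-mean loss: L = (1/B) * sum_i L_i.
    return sum(0.5 * (a - t) ** 2 for a, t in zip(acts, targets)) / len(acts)


def mean_loss_grads(acts, targets, eps=1e-6):
    # Central differences of the mean loss; each entry carries the 1/B factor.
    grads = []
    for i in range(len(acts)):
        hi = list(acts)
        lo = list(acts)
        hi[i] += eps
        lo[i] -= eps
        grads.append((mean_loss(hi, targets) - mean_loss(lo, targets)) / (2 * eps))
    return grads


acts = [1.0, 2.0, -0.5, 0.25]     # illustrative activations, B = 4
targets = [0.0, 2.5, 0.5, 0.25]
B = len(acts)

unscaled = per_sample_grads(acts, targets)       # "C-style" per-sample gradients
mean_scaled = mean_loss_grads(acts, targets)     # "mean-loss" gradients, 1/B each
reconciled = [B * g for g in mean_scaled]        # apply the xB reconciliation
max_diff = max(abs(u - r) for u, r in zip(unscaled, reconciled))
```

Without the factor of B, every act-grad comparison would show a uniform relative mismatch that has nothing to do with a real divergence between the two implementations.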
🤖 Generated with Claude Code