Develop by LeoBuron · Pull Request #262 · es-ude/OnDeviceTraining

LeoBuron · 2026-07-01T18:39:38Z

No description provided.

example/ (singular) was added to the build unconditionally but no CI job ever compiled it, so it drifted against the current factory layer-init API and no longer builds (cmake --build --preset examples on the default 'all' target fails). Delete it: examples/ (plural) is the canonical example location, and UnitTestMnistSmoke keeps MNIST end-to-end coverage in CI. Also switch the clang-format gate (ci.yml + its devenv.nix mirror) from the now-deleted 'example' to 'examples'. This removes the dangling find path and closes the pre-existing blind spot where the plural examples' C files were never format-checked (verified all currently clean). Closes #235 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… parity The Conv1d/Linear/Conv1dTransposed factories hardcoded KAIMING_UNIFORM gain=sqrt(2) (He) for weights and zeroed biases - flagged in-code as deferred to 'Issue C'. That could not reproduce PyTorch's default init, so train-from-scratch parity demos diverged. Add a weightInit_t {initScheme_t scheme; float gain;} field to each init struct (zero-init -> INIT_DEFAULT, mirroring the existing bias_t idiom). Route all three factories' weight+bias allocation through shared helpers initWeightTensor/initBiasTensor (new src/userApi/LayerCommon.c, now a compiled static lib). Schemes: INIT_DEFAULT -> kaimingUniform(gain=sqrt(1/3), fan_in) = uniform(+/-1/sqrt(fan_in)), exactly PyTorch's kaiming_uniform_(a=sqrt(5)) weight default; bias (all schemes) -> uniform(+/-1/sqrt(fan_in)) per PyTorch; INIT_KAIMING_UNIFORM -> He (gain sqrt(2) default, overridable); INIT_XAVIER_UNIFORM -> Glorot (gain 1 default, overridable). Fan modes match PyTorch _calculate_fan_in_and_fan_out per layout (Conv1d in*k, Linear in, ConvT out*k). TDD: seeded statistical default-bound + override value tests added to UnitTestConv1dApi / UnitTestLinear / UnitTestConv1dTransposedApi; mutation-verified non-vacuous. Existing structural tests unaffected; 62/62 ctest. Implements the deferred Issue C (distribution parametrization for PyTorch-compatible init). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…r example The factory-API migration left a parallel _v2 C trainer per example (har_classifier_v2, ecg_anomaly_ae_v2) alongside the original Legacy-API train_c.c, which was redundant, never built in CI, and pinned to deprecated APIs. Collapse to one directory per example. Move each _v2 train_c.c (factory API + StateDictApi) into its parent dir; delete the Legacy train_c.c and the _v2 dirs; targets keep their canonical names (train_c_har_classifier, train_c_ecg_anomaly_ae). Rewire the CI bit-parity job + examples/CMakeLists.txt to the consolidated targets/paths. Rewrite the two READMEs to the real flow (bit-parity primary). Make ECG compare.py informational: the train-from-scratch comparison is a sanity check, not a gate (independent init + a C-vs-PyTorch training-dynamics difference push the anomaly AUC outside tolerance; bit-parity is the exact gate). Bit-parity verified identical for both examples (HAR int32, ECG float allclose). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…guard) The bit-parity job built only the two named example targets, so any other example executable was never compiled in CI - exactly how example/MnistExperiment (#235) and the legacy v1 trainers rotted unnoticed. Build the default 'all' target instead; a future example that fails to compile now fails CI. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…import finding

torchaudio 2.11 (maintenance mode) routes its dataset decode through torchcodec (needs a system FFmpeg), so iterating SPEECHCOMMANDS raised ImportError at prepare time. Switch to ds.get_metadata (no decode) + a stdlib `wave` reader (int16 PCM / 32768) — the fallback the spec blessed, byte-identical output, no torchcodec/FFmpeg/scipy dependency. Also make the loader path-based: collect paths per label, then decode only the clips a split keeps (all 4 keywords + the sampled "unknown"). Peak RAM for 6-class drops ~5.8 GB -> ~1.4 GB, fitting the 7 GB CI runner.

The stdlib wave reader interprets frames as int16/32768; a non-mono or non-16-bit clip would be silently misdecoded. Assert the format (the corpus is uniformly 16 kHz mono 16-bit, so this never trips — it guards a future corpus swap). Flagged by the PR-A final review as the one silent-decode path.

…st npy

…xport

…wiring

…ence The raw model at the spec lr=0.001/15ep never escapes random init (flat loss, all-class-0 predictions) → a degenerate bit-parity gate. Add a rate-agnostic LayerNorm(64) on the pooled features before fc and tune to lr=0.005 / 20 epochs: the model now reaches test_acc 0.588 with predictions spread across all 6 classes, making the AvgPool[1,16000] §11.3 bit-parity proof meaningful (C reproduces the diverse PyTorch predictions exactly: 2483/2483 int32-identical). LayerNorm is the framework's only bit-parity-covered normalizer (BatchNorm is not), so the gate is preserved. C model: MODEL_SIZE 14->15, layerNormLayerInit at model[12] (normalizedShape=[64], eps 1e-5 = PyTorch default), 5-entry state-dict (+ln gamma/beta), +LayerNormApi/LayerNorm link libs. kws_mfcc (shipped) is unchanged.

…ing fragile end-LN A 10-seed x 3-placement x 3-lr sweep (50 epochs) showed the previously-shipped end-feature LayerNorm(64) is the WORST option: 0.47 +/- 0.25 test_acc, collapsing to a one-class reference on ~40% of seeds (the original seed-42 pick was lucky). Per-conv LayerNorm([C,L]) over each conv's full feature map (pre-ReLU) is the best and most stable: 0.72 +/- 0.01, all 10 seeds converge across all 6 classes. (Plain no-LayerNorm also trains fine at 50 epochs, 0.70 +/- 0.02 — the raw model was never un-trainable, just slow; LayerNorm is kept as the framework's bit-parity-covered normalizer and to exercise it in the gate.) Model: 3x [Conv1d -> LayerNorm([C,L]) -> ReLU -> MaxPool(4)] (shapes [16,1000], [32,250], [64,62]), lr=0.005, 50 epochs. C: MODEL_SIZE 15->17, three layerNormLayerInit(numNormDims=2, eps=1e-5) at model[2]/[6]/[10], 7-entry state-dict {conv1,ln1,conv2,ln2,conv3,ln3,fc}. Gate PASSES bit-identical (2483/2483) with diverse predictions across all 6 classes -- first bit-parity exercise of a multi-dim [C,L] LayerNorm in an example.

…ce sink Adds TraceApi.h with traceSink_t typedef; extracts the body of calculateGradsSequential into a static calculateGradsImpl(..., sink, sinkCtx) with three guarded sink calls (fwd/lossgrad/agrad). calculateGradsSequential becomes a NULL-sink wrapper — production path byte-identical. Characterisation test (UnitTestCalculateGradsSequential) pins closed-form CE+softmax gradients and guards the refactor.

… layer

…grads

Subagent edits skipped the clang-format PostToolUse hook; this brings the Phase-1 unit test into clang-format-21 compliance (braces on the guard return, call-arg reflow). Production .c/.h were already clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… to .npy

…yer dumps Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… hooks

…yTorch hooks) The "agrad" probe was firing gradCurr (∂L/∂layer_input) after backward, so e.g. conv1.agrad had shape [1,1,1000] instead of the expected [1,16,1000], breaking comparison with PyTorch's activation.grad. Move the sink call to the top of the backward loop and fire gradNext (the upstream wire gradient = ∂L/∂layer_output) before initiating gradCurr or calling backward. With sink==NULL the production path is byte-identical to before. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…eep aggregator Add --abs-floor (default 1e-4) to trace_compare.py: drift now requires BOTH abs error > floor AND relative jump > JUMP_FACTOR, preventing near-zero act-grads (abs ~3e-7) from firing a spurious flag while still surfacing param-grad signal. Extract compare_pairs(c_dir, pt_dir) -> list[dict] as a reusable function returning {probe, phase, tier, max_abs, max_rel} for every matched pair. Add examples/_shared/trace_sweep.py: runs N non-overlapping batches, calls compare_pairs per batch, aggregates mean/max per (probe, phase), and prints a full tier-sorted table plus a focused param-grad summary sorted descending by mean_abs — the key signal for which layer diverges robustly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Subagent edits to npy_dump_sink.c and trace_c.c skipped the clang-format PostToolUse hook; bring them into clang-format-21 compliance. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… cleanup, messages - Assert ndim==2 for all 4 events in testTracedGradsFiresInOrder (spec §7 discharged) - Add structural comment: tracedGrads/calculateGradsSequential share calculateGradsImpl - compare_dir returns (rc, first_drift) tuple; self_test asserts first-drift probe+phase - trace_pytorch: assert list(acts)==FWD_PROBES before return (manifest drift guard) - npy_dump_sink.h: fix comment "asserts" -> "hard-errors (exit 1)" - trace_c.c batch-clamp fprintf: show requested g_batch alongside clamped effB - trace_sweep: rmtree dump dirs before each batch (stale .sNN guard); relabel mean_abs -> mean(maxabs) in aggregate + focused-summary tables Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ed docs Honesty fixes: - Quantization.h: drop the stale "grad accumulators are value-sums, int16" comment -- weightGrad is a sum of products; grad storage-type is under #261. - QuantizationApi.h: fix the quantizationInitSymInt32WithBits docstring -- the plain initializer uses the int12 operand default (not "hardcodes 16"), and drop the "32 bits for full int32 range" over-promise (not cast-safe, #202). - CONVENTIONS.md: correct the value-sums claim for grad accumulators. Conventions reorg: - Split docs/CONVENTIONS.md into per-subsystem docs/conventions/*.md (verbatim moves); CONVENTIONS.md becomes a thin index + the memory-over-accuracy vision. - Add the "SYM_INT32 is a compute format, not storage (#261)" convention. No behavior change. Build clean; 63/63 unit tests pass. Relates to #261. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-07-01T18:54:19Z

Warning

Review limit reached

@LeoBuron, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 45 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: fc3273f3-1253-4c8a-8032-f7cab2b32fc0

📥 Commits

Reviewing files that changed from the base of the PR and between 65bf907 and a3bf34c.

📒 Files selected for processing (9)

docs/CONVENTIONS.md
docs/conventions/allocation.md
docs/conventions/arithmetic-sym.md
docs/conventions/data-shape.md
docs/conventions/loss.md
docs/conventions/tensor.md
docs/conventions/testing.md
src/tensor/include/Quantization.h
src/userApi/tensor/include/QuantizationApi.h

📝 Walkthrough

Walkthrough

This PR adds new MNIST MLP/CNN and KWS (MFCC and raw-waveform) parity examples with shared MNIST/SpeechCommands data loaders, trace/dump/compare tooling, and a BIT_PARITY exact-match mode; introduces a shared weight-initialization API adopted by Conv1d/Conv1dTransposed/Linear; adds a gradient-tracing API; updates CI, CMake, and READMEs; and removes legacy v2/MnistExperiment examples.

Changes

Weight Initialization and Gradient Tracing APIs

Layer / File(s)	Summary
LayerCommon init contract and implementation `src/userApi/include/LayerCommon.h`, `src/userApi/LayerCommon.c`, `src/userApi/CMakeLists.txt`	Adds `initScheme_t`/`weightInit_t` and `initWeightTensor`/`initBiasTensor`, building `LayerCommon` as a real library.
Conv1d/Conv1dTransposed/Linear init wiring `src/userApi/layer/Conv1dApi.`, `src/userApi/layer/Conv1dTransposedApi.`, `src/userApi/layer/LinearApi.*`	Routes weight/bias allocation through the shared init API with computed fan-in/out, replacing hardcoded distributions.
Init bound unit tests `test/unit/userAPI/`, `test/unit/layer/UnitTestLinear.c`, `test/unit//CMakeLists.txt`	Adds tests validating default and overridden init bounds against PyTorch-style formulas.
Gradient/weight tracing API `src/userApi/training_loop/calculate_grads/`, `test/unit/training_loop/`	Adds `TraceApi.h`, `tracedGrads`/`traceModelWeights`/`traceModelGrads`, and corresponding unit tests.

Estimated code review effort: 4 (Complex) | ~75 minutes

New Example Suite and Shared Infrastructure

Layer / File(s)	Summary
Shared data loaders and dump/compare tooling `examples/_shared/*`	Adds `mnist_data.py`, `speechcommands_data.py`, `npy_dump_sink.*`, `trace_compare.py`, `trace_sweep.py`, and updates `_shared/CMakeLists.txt`, `pyproject.toml`.
MNIST MLP/CNN examples `examples/mnist_mlp/`, `examples/mnist_cnn/`	Adds data prep, PyTorch reference training, C training/inference with BIT_PARITY, and compare scripts for both variants.
KWS MFCC/raw examples `examples/kws_mfcc/`, `examples/kws_raw/`	Adds data prep, PyTorch reference training, C training/inference with BIT_PARITY, compare scripts, and (for raw) trace_c/trace_pytorch/probe_manifest tooling.
HAR/ECG BIT_PARITY refactor `examples/har_classifier/`, `examples/ecg_anomaly_ae/`	Refactors both examples to use the new init/state-dict APIs, adds BIT_PARITY inference mode, and makes `compare.py` informational-only.
Legacy example removal `example/MnistExperiment.c`, `examples/ecg_anomaly_ae_v2/train_c.c`	Deletes the legacy MNIST training program and the ecg_anomaly_ae_v2 example.
Build/CI wiring `CMakeLists.txt`, `examples/CMakeLists.txt`, `devenv.nix`, `.github/workflows/ci.yml`, `examples/README.md`	Gates examples build behind `BUILD_EXAMPLES`, updates subdirectory lists, and extends CI to prepare/train/cache/diff the new examples.

Estimated code review effort: 5 (Critical) | ~120 minutes

Possibly related PRs

es-ude/OnDeviceTraining#176: Implements AdaptiveAvgPool1d used by the new kws_mfcc/kws_raw model code in this PR.
es-ude/OnDeviceTraining#224: Introduces the same actions/cache@v4 raw-dataset caching pattern in ci.yml later extended here for MNIST/SpeechCommands.

Poem

A rabbit hopped through weights and grads,
Traced each layer, dumped .npy pads,
MNIST, KWS — datasets galore,
Bit-parity gates locked tight at the core,
Old v2 burrows dug up and cleared,
New shared paths hopped in, welcomed, cheered! 🐇✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title is too vague and does not describe the changeset in a meaningful way.	Replace it with a short, specific title that names the main feature or change.
Description check	❓ Inconclusive	No pull request description was provided, so there is no meaningful summary to assess.	Add a brief description of the main changes and intent.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch develop

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 13

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

examples/ecg_anomaly_ae/train_c.c (1)

361-369: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Restore c_train_recons.npy for the train-from-scratch path.

Lines 361-363 now only write c_reconstructions.npy, but examples/ecg_anomaly_ae/compare.py still unconditionally loads outputs/c_train_recons.npy, and the README tells users to run that script after training. As-is, the documented demo crashes before it can emit the informational parity report or plots.

Proposed fix

     int status = 0;
+    if (bitParity == NULL || bitParity[0] == '\0') {
+        int trainRc = writeAllReconstructions(model, MODEL_SIZE, getTrainSample, getTrainSize(),
+                                              "examples/ecg_anomaly_ae/outputs/c_train_recons.npy");
+        if (trainRc != 0) {
+            fprintf(stderr, "ERROR: c_train_recons.npy write failed (rc=%d)\n", trainRc);
+            status = 1;
+        }
+    }
+
     int rc = writeAllReconstructions(model, MODEL_SIZE, getTestSample, getTestSize(),
                                      "examples/ecg_anomaly_ae/outputs/c_reconstructions.npy");
     if (rc != 0) {
         fprintf(stderr, "ERROR: c_reconstructions.npy write failed (rc=%d)\n", rc);
         status = 1;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/ecg_anomaly_ae/train_c.c` around lines 361 - 369, The
train-from-scratch path in writeAllReconstructions currently only emits
c_reconstructions.npy, but compare.py and the README still expect
outputs/c_train_recons.npy. Update the reconstruction-writing block around
writeAllReconstructions/getTestSample so the training path also produces
c_train_recons.npy (or otherwise matches the filename compare.py loads), keeping
the existing c_reconstructions.npy output intact if it is still needed.

🟡 Minor comments (10)

examples/kws_mfcc/train_c.c-120-124 (1)
120-124: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Fail fast on labels outside g_numClasses.

With the runtime class count, a stale/mismatched label like 34 in 6-class mode silently becomes an all-zero target. Validate cls before filling the one-hot tensor.
Proposed fix
         int32_t cls = ((int32_t *)intLabels->array[i]->data)[0];
+        if (cls < 0 || (size_t)cls >= g_numClasses) {
+            fprintf(stderr, "Invalid label %d for %zu classes\n", cls, g_numClasses);
+            exit(1);
+        }
         float *data = (float *)t->data;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/kws_mfcc/train_c.c` around lines 120 - 124, The one-hot label
construction in the training path silently accepts out-of-range class labels and
produces an all-zero target; add a bounds check for cls before the loop that
fills data. In the label handling logic around intLabels->array[i]->data,
validate that cls is within [0, g_numClasses) and fail fast if it is not, rather
than writing the tensor in train_c.c.
examples/kws_mfcc/train_c.c-68-73 (1)
68-73: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Reject partially parsed KWS_CLASSES values. strtol(env, NULL, 10) accepts prefixes like 6foo and still returns 6; require an endptr check so only exact 6 or 35 are accepted.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/kws_mfcc/train_c.c` around lines 68 - 73, The KWS_CLASSES parsing in
the environment-value helper currently accepts partially parsed strings because
`strtol` is called with a null end pointer. Update the parsing logic in the
function that validates `KWS_CLASSES` to use an `endptr` and reject any value
where extra characters remain after the number, so only exact `6` or `35` are
accepted; keep the existing fallback and error message behavior for invalid
input.
src/userApi/training_loop/calculate_grads/include/TraceApi.h-23-25 (1)
23-25: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Fix the agrad contract text.

The implementation emits "agrad" before backward(...) and exposes the gradient entering that layer's backward pass, not a post-backward tensor. Saying "after each layer's backward" points consumers at the wrong quantity.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/userApi/training_loop/calculate_grads/include/TraceApi.h` around lines 23
- 25, Update the `TraceApi.h` contract comment for the trace points emitted by
`calculateGradsSequential` / the `sink` callback so the `"agrad"` description
matches the implementation. The issue is that `"agrad"` is fired before
`backward(...)` and represents the gradient entering each layer’s backward pass,
not a post-backward result; revise the wording in the `calculateGradsSequential`
comment to reflect that pre-backward gradient input semantics.
examples/mnist_mlp/train_c.c-217-223 (1)
217-223: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Honor the logged seed on the C training path.

Line 221 builds the randomly initialized layers before any C-side RNG seeding happens, but Lines 252-255 still record "seed": 42 in c.json. That makes the train-from-scratch run non-reproducible from its own logged config. Seed the framework RNG immediately before buildModel() or drop the field if this path is intentionally unseeded.

Also applies to: 252-255
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/mnist_mlp/train_c.c` around lines 217 - 223, The C training path is
initializing the model in buildModel() before the framework RNG is seeded, while
the generated c.json still hardcodes "seed": 42. Update the training flow around
buildModel() to seed the RNG first so the logged seed matches the actual
initialization, or remove the seed field from the config if this path is meant
to stay unseeded. Make sure the change keeps the logged configuration and the
model initialization order consistent.
examples/mnist_mlp/compare.py-39-43 (1)
39-43: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Reject mismatched prediction and label lengths before building the confusion matrix.

zip(preds, labels) silently truncates to the shorter input here. A partial c_predictions.npy would still produce plots and a parity report over a truncated test set instead of failing fast.
Proposed fix
 def confusion_matrix(preds: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
+    if preds.shape[0] != labels.shape[0]:
+        raise ValueError(
+            f"prediction/label length mismatch: {preds.shape[0]} != {labels.shape[0]}"
+        )
     cm = np.zeros((num_classes, num_classes), dtype=np.int64)
     for p, a in zip(preds, labels):
         cm[int(p), int(a)] += 1
     return cm
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/mnist_mlp/compare.py` around lines 39 - 43, The confusion_matrix
helper currently iterates with zip(preds, labels), which silently truncates
mismatched inputs and can produce a partial report. Update confusion_matrix to
validate that preds and labels have the same length before the loop and raise a
clear error if they differ, so callers like the compare.py plotting/parity flow
fail fast instead of using incomplete data.
examples/kws_raw/README.md-8-11 (1)
8-11: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Mention the LayerNorm lengths in the rate-change note.

Changing K also changes the hard-coded LayerNorm([C, L]) shapes in train_c.c/trace_c.c; updating only the MaxPool inputLengths leaves the C model inconsistent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/kws_raw/README.md` around lines 8 - 11, Update the rate-change note
to mention that changing K also affects the hard-coded LayerNorm([C, L]) shapes
in train_c.c and trace_c.c, not just the MaxPool inputLength values. Make the
README guidance around K and effective rate explicit that both the pooling input
lengths and LayerNorm lengths must be kept in sync with the new sample rate,
using the existing AdaptiveAvgPool1d(1) and train_c.c/trace_c.c references to
locate the affected model setup.
examples/kws_raw/trace_pytorch.py-35-40 (1)
35-40: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Split the semicolon-chained statements.

Ruff is already flagging E702 on Lines 35-40, 60, and 68, so this file will not pass the configured Python lint as written.

Also applies to: 60-60, 68-68
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/kws_raw/trace_pytorch.py` around lines 35 - 40, The
semicolon-chained statements in trace_pytorch.py trigger Ruff E702, so split
them into separate lines to satisfy linting. Update the chained assignments in
the tracing logic around the activations collection, including the blocks that
use acts and the later occurrences in the same function, so each assignment or
function call stands alone. Keep the existing behavior in the traced forward
path and preserve the same variable names like acts, h, model.conv1, model.ln1,
F.relu, and F.max_pool1d while removing all semicolons.
Source: Linters/SAST tools
examples/_shared/mnist_data.py-19-23 (1)
19-23: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Use a real split check here.

With python -O, this assert is removed and any unsupported split falls through to train=False, so "validation" silently loads the test set. Raise ValueError before constructing the dataset instead.
Proposed fix
 def load_mnist(root: str | Path, split: str) -> tuple[np.ndarray, np.ndarray]:
-    assert split in ("train", "test"), split
+    if split not in ("train", "test"):
+        raise ValueError(f"unsupported MNIST split: {split}")
     ds = datasets.MNIST(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/_shared/mnist_data.py` around lines 19 - 23, The split validation in
load_mnist relies on an assert, which disappears under optimized Python and can
let unsupported values fall through to datasets.MNIST as the test split. Replace
the assert with an explicit runtime check in load_mnist that raises ValueError
for any split other than "train" or "test", and only construct datasets.MNIST
after the split has been validated.
examples/_shared/npy_dump_sink.c-23-23 (1)
23-23: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Reject probe indexes above numProbes.

The header contract only reserves layerIdx == numProbes for the loss probe. Mapping every larger index to "loss" silently aliases bad indexes onto the same output filename and can mask an upstream tracing bug.
Proposed fix
-    const char *probe = (layerIdx < ctx->numProbes) ? ctx->probeNames[layerIdx] : "loss";
+    const char *probe = NULL;
+    if (layerIdx < ctx->numProbes) {
+        probe = ctx->probeNames[layerIdx];
+    } else if (layerIdx == ctx->numProbes) {
+        probe = "loss";
+    } else {
+        fprintf(stderr, "npyDumpSink: invalid probe index %zu (numProbes=%zu)\n", layerIdx,
+                ctx->numProbes);
+        exit(1);
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/_shared/npy_dump_sink.c` at line 23, Reject probe indexes greater
than numProbes in the probe selection logic: in npy_dump_sink.c, the ternary
assigning probe currently maps every layerIdx above ctx->numProbes to "loss",
which can hide invalid inputs. Update the probe resolution so only layerIdx ==
ctx->numProbes uses the loss probe and any layerIdx > ctx->numProbes is treated
as an error or otherwise rejected before reaching the filename/output path.
examples/_shared/speechcommands_data.py-133-150 (1)
133-150: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Validate num_classes without assert.

Under python -O, the assertion disappears and unsupported values fall into the 6-class branch, so the loader returns the wrong label space instead of failing fast.
Proposed fix
 def load_speechcommands(root, num_classes: int) -> dict:
-    assert num_classes in (6, 35), num_classes
+    if num_classes not in (6, 35):
+        raise ValueError(f"unsupported num_classes: {num_classes}")
     root = Path(root)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/_shared/speechcommands_data.py` around lines 133 - 150, The
num_classes guard in load_speechcommands currently relies on assert, which can
be stripped under optimized Python and let invalid values proceed into the
6-class path. Replace the assert in load_speechcommands with an explicit runtime
check that rejects unsupported num_classes values and raises a clear exception
before any dataset splitting happens. Keep the existing validation logic around
the 35-class keyword set, but ensure the initial num_classes validation is
always enforced regardless of interpreter flags.

🧹 Nitpick comments (5)

.github/workflows/ci.yml (1)
156-170: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low value

Minor inconsistency: MNIST raw data cached per-example instead of shared.

examples/mnist_mlp/data/raw and examples/mnist_cnn/data/raw cache the same underlying MNIST dataset separately, whereas the new SpeechCommands raw cache (Lines 166-170) is shared once under examples/_shared/data/speech_commands for both kws_mfcc and kws_raw. If examples/_shared/mnist_data.py downloads to a shared location, consider pointing both MNIST examples' raw cache at that single shared directory to avoid duplicate downloads/cache entries. Given MNIST's small size, impact is minimal.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yml around lines 156 - 170, The raw dataset cache setup
is duplicating MNIST downloads across both examples instead of using a shared
location like the SpeechCommands cache. Update the cache paths in the CI
workflow so `mnist_mlp` and `mnist_cnn` point to the shared MNIST raw directory
used by `examples/_shared/mnist_data.py`, and adjust the cache key inputs to
reference the shared prepare/download script or other shared source of truth.
Keep the `actions/cache` block aligned with the shared-data pattern used by
`examples/_shared/data/speech_commands` so both MNIST examples reuse the same
cached raw dataset.
test/unit/userAPI/UnitTestConv1dApi.c (1)
324-336: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the override-path bias is still sampled.

biasMaxAbs == 0 still passes this test today, so a regression that accidentally zero-initializes bias when weightInit is overridden would slip through. Add the same > 0.0f check used in the default-init case.
Suggested patch
     TEST_ASSERT_TRUE_MESSAGE(weightMaxAbs > defaultBound,
                              "He override did not widen weights beyond the PyTorch default bound");
     TEST_ASSERT_TRUE_MESSAGE(weightMaxAbs <= heBound * 1.001f,
                              "He weights exceed the sqrt(6)/sqrt(fan_in) bound");
+    TEST_ASSERT_TRUE_MESSAGE(biasMaxAbs > 0.0f,
+                             "Bias became zero under the override path");
     TEST_ASSERT_TRUE_MESSAGE(biasMaxAbs <= defaultBound * 1.001f,
                              "Bias must stay PyTorch default uniform regardless of weight scheme");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/unit/userAPI/UnitTestConv1dApi.c` around lines 324 - 336, The
override-path bias assertion in UnitTestConv1dApi.c is missing a non-zero
sampling check, so a zero-initialized bias can still pass; update the bias
validation near the existing maxAbsFloat(cfg->bias->param) check to also assert
biasMaxAbs > 0.0f, matching the default-init test behavior while keeping the
current default-bound check intact.
test/unit/layer/UnitTestLinear.c (1)
1450-1455: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the override-path bias is still sampled.

The current assertion set would still pass if the Xavier path accidentally produced all-zero bias. Adding the same nonzero check as the default-init test will pin the intended contract.
Suggested patch
     TEST_ASSERT_TRUE_MESSAGE(weightMaxAbs > defaultBound,
                              "Xavier override did not change weights away from the default bound");
     TEST_ASSERT_TRUE_MESSAGE(weightMaxAbs <= xavierBound * 1.001f,
                              "Xavier weights exceed the sqrt(6/(fan_in+fan_out)) bound");
+    TEST_ASSERT_TRUE_MESSAGE(biasMaxAbs > 0.0f,
+                             "Bias became zero under the override path");
     TEST_ASSERT_TRUE_MESSAGE(biasMaxAbs <= defaultBound * 1.001f,
                              "Bias must stay PyTorch default uniform regardless of weight scheme");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/unit/layer/UnitTestLinear.c` around lines 1450 - 1455, The Xavier
override test in UnitTestLinear should also verify that the bias is actually
sampled and not left all-zero. Update the assertions around the existing weight
and bias checks in the linear unit test to include the same nonzero bias
validation used by the default-initialization test, so the override path is
pinned to a real sampled bias rather than only checking it stays within bounds.
test/unit/userAPI/UnitTestConv1dTransposedApi.c (1)
293-298: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the override-path bias is still sampled.

This only proves the bias stays within the old bound. It does not fail if the override path starts zero-initializing bias, because 0 <= defaultBound still passes.
Suggested patch
     TEST_ASSERT_TRUE_MESSAGE(weightMaxAbs > defaultBound,
                              "He override did not widen weights beyond the PyTorch default bound");
     TEST_ASSERT_TRUE_MESSAGE(weightMaxAbs <= heBound * 1.001f,
                              "He weights exceed the sqrt(6)/sqrt(fan_in) bound");
+    TEST_ASSERT_TRUE_MESSAGE(biasMaxAbs > 0.0f,
+                             "Bias became zero under the override path");
     TEST_ASSERT_TRUE_MESSAGE(biasMaxAbs <= defaultBound * 1.001f,
                              "Bias must stay PyTorch default uniform regardless of weight scheme");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/unit/userAPI/UnitTestConv1dTransposedApi.c` around lines 293 - 298, The
override-path bias check in the transposed conv unit test only verifies the
value stays within the default bound, so it can miss a zero-initialized bias.
Update the assertions in UnitTestConv1dTransposedApi to also verify the bias is
actually sampled/non-zero for the override path, using the existing weight and
bias max-abs checks in the same test case, while keeping the current bound
validation intact.
examples/har_classifier/CMakeLists.txt (1)
51-55: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Make Distributions transitively owned by LayerCommon if that's what Line 55 is satisfying.

examples/har_classifier/train_c.c no longer uses the distribution API directly, so linking Distributions here makes every consumer of LayerCommon know one of its private implementation details. If this was added only because the new init helpers pull it in, move that dependency to the LayerCommon target in src/userApi/CMakeLists.txt instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/har_classifier/CMakeLists.txt` around lines 51 - 55, `Distributions`
should not be linked directly from the `examples/har_classifier` target list if
`train_c.c` no longer uses it; this leaks a private dependency through
`LayerCommon`. Update the `LayerCommon` target in `src/userApi/CMakeLists.txt`
so it owns the `Distributions` dependency transitively, and remove
`Distributions` from the `har_classifier` link list so consumers only depend on
`LayerCommon`.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/_shared/trace_sweep.py`:
- Around line 120-121: The aggregate in trace_sweep.py is currently allowing
mismatched dump filename sets to slip through because compare_pairs() skips
missing counterparts and batch_results still gets appended. In the loop that
builds pairs and appends to batch_results, add a validation step using the
dump_c and dump_pt filename sets before calling trace_compare.compare_pairs() or
recording the result, and fail fast when they differ. Use the existing
trace_compare.compare_pairs and batch_results flow as the place to enforce this
check.

In `@examples/kws_raw/compare.py`:
- Around line 78-84: The compare.py parity report is meant to be informational
only, but the current final return in the main flow still turns any mismatch
into a nonzero exit. Update the result handling near the overall_pass summary so
the script always exits successfully after printing the report, and keep the
existing parity output logic in compare.py and any surrounding main entrypoint
consistent with that non-failing contract.
- Around line 47-50: The confusion_matrix function currently uses zip(preds,
labels), which silently truncates when the arrays differ in length. Add an
explicit length check at the start of confusion_matrix (or in the caller that
prepares preds and labels) and raise a clear error if the lengths do not match
before building the matrix. Keep the fix localized around confusion_matrix so
the mismatch is caught before any counting happens.

In `@examples/kws_raw/trace_c.c`:
- Around line 368-372: The per-step dump directory naming in the trace
generation loop is deterministic, so reruns can reuse stale contents from prior
executions. Update the step directory handling in the loop around ensureDir and
snprintf in trace_c.c so each step dump directory is either freshly cleared
before use or creation fails on an existing directory (including treating EEXIST
as an error), preventing mixed old/new .npy files from being read by the
comparer.

In `@examples/kws_raw/trace_pytorch.py`:
- Around line 11-13: The fixed output path in trace_pytorch.py can leave stale
*.sNNN.npy files from earlier runs, so start each run with a clean
dump_pt/step000 directory or write into a fresh step-specific directory. Update
the step/activation dump logic in the main tracing flow (where the output path
is created and files are written) to remove any existing directory before saving
new samples, using the existing path-building symbols in trace_pytorch.py.

In `@examples/kws_raw/train_c.c`:
- Around line 44-45: `buildModel()` is using random initialization without
applying `SEED`, so training is not actually reproducible even though the log
says seed 42. In `train_c.c`, make sure the C RNG is seeded with `SEED`
immediately before `buildModel()` is called, and keep the existing
`SEED`/`SHUFFLE_SEED` constants aligned with that setup so the model
construction uses the intended deterministic state. Also apply the same seeding
fix in the other referenced location so both paths behave consistently.
- Around line 414-419: Guard the prediction buffer allocation in train_c.c by
validating the size calculation before calling malloc: the numTest value from
getTestSize() can overflow when multiplied by sizeof(int32_t), causing an
undersized predictions buffer. Update the allocation path near getTestSize() and
predictions so it checks for overflow or ensures numTest is within a safe bound
before allocating, and fail cleanly if the requested buffer size is too large.

In `@examples/mnist_cnn/compare.py`:
- Around line 39-43: The prediction comparison flow is silently truncating
mismatched arrays because confusion_matrix zips preds and labels, so a partial
predictions file can pass instead of failing. Add an explicit length check in
the compare.py validation path before calling confusion_matrix, using the loaded
test_y and each prediction array, and raise an error if any prediction length
differs; keep the check near the code that loads predictions and invokes
confusion_matrix so failures surface early.
- Around line 70-76: Update the compare script so the informational parity
report never fails the process; in compare.py, keep the reporting in the main
compare routine (the print loop and overall summary) but change the return value
logic at the end so it always exits successfully instead of returning 1 when
overall_pass is false. Make sure the behavior of the compare command stays
aligned with the module docstring and README description of the step as
informational, not a CI gate.

In `@examples/mnist_mlp/compare.py`:
- Around line 70-76: The parity comparison in compare.py is informational-only,
but the current return path in the main comparison flow still exits with a
failure status when overall_pass is false. Update the logic around the results
loop and final return so the script always completes successfully for tolerance
drift, keeping the printed PASS/FAIL summary as informational output only. Use
the compare.py comparison function and the overall_pass/result handling to
locate the change, and remove any nonzero exit behavior from this reporting
path.

In `@src/userApi/layer/Conv1dApi.c`:
- Around line 108-112: The Conv1d Xavier initialization is computing fan-out
with grouped channels instead of the full output-channel count. Update the
fan-out calculation in the Conv1d weight initialization path so the code in the
Conv1dApi logic uses outChannels * kernelSize rather than dividing by groups,
while keeping fan-in based on the per-group input channels. Preserve the
existing initWeightTensor call and adjust the local fanOut symbol accordingly.

In `@src/userApi/training_loop/calculate_grads/CalculateGradsSequential.c`:
- Around line 140-156: `traceModelParams()` currently calls `sink(ctx, ...)`
unconditionally, so a NULL trace sink will crash even though the no-op path is
expected; add an early guard in `traceModelParams()` (before the loop) to return
immediately when `sink` is NULL, keeping `traceModelWeights()` and
`traceModelGrads()` behavior consistent with the existing `tracedGrads(...,
NULL, NULL)` no-op case.

In `@src/userApi/training_loop/calculate_grads/include/TraceApi.h`:
- Around line 17-21: Clarify the lifetime of the phase argument in traceSink_t:
it is currently undocumented even though tracedGrads() passes literals and
traceModelWeights()/traceModelGrads() may build it from temporary stack storage.
Update the comment in TraceApi.h near traceSink_t to state that phase is
borrowed and valid only for the duration of the callback, or change the callers
that synthesize phase to use stable storage before invoking the sink.

---

Outside diff comments:
In `@examples/ecg_anomaly_ae/train_c.c`:
- Around line 361-369: The train-from-scratch path in writeAllReconstructions
currently only emits c_reconstructions.npy, but compare.py and the README still
expect outputs/c_train_recons.npy. Update the reconstruction-writing block
around writeAllReconstructions/getTestSample so the training path also produces
c_train_recons.npy (or otherwise matches the filename compare.py loads), keeping
the existing c_reconstructions.npy output intact if it is still needed.

---

Minor comments:
In `@examples/_shared/mnist_data.py`:
- Around line 19-23: The split validation in load_mnist relies on an assert,
which disappears under optimized Python and can let unsupported values fall
through to datasets.MNIST as the test split. Replace the assert with an explicit
runtime check in load_mnist that raises ValueError for any split other than
"train" or "test", and only construct datasets.MNIST after the split has been
validated.

In `@examples/_shared/npy_dump_sink.c`:
- Line 23: Reject probe indexes greater than numProbes in the probe selection
logic: in npy_dump_sink.c, the ternary assigning probe currently maps every
layerIdx above ctx->numProbes to "loss", which can hide invalid inputs. Update
the probe resolution so only layerIdx == ctx->numProbes uses the loss probe and
any layerIdx > ctx->numProbes is treated as an error or otherwise rejected
before reaching the filename/output path.

In `@examples/_shared/speechcommands_data.py`:
- Around line 133-150: The num_classes guard in load_speechcommands currently
relies on assert, which can be stripped under optimized Python and let invalid
values proceed into the 6-class path. Replace the assert in load_speechcommands
with an explicit runtime check that rejects unsupported num_classes values and
raises a clear exception before any dataset splitting happens. Keep the existing
validation logic around the 35-class keyword set, but ensure the initial
num_classes validation is always enforced regardless of interpreter flags.

In `@examples/kws_mfcc/train_c.c`:
- Around line 120-124: The one-hot label construction in the training path
silently accepts out-of-range class labels and produces an all-zero target; add
a bounds check for cls before the loop that fills data. In the label handling
logic around intLabels->array[i]->data, validate that cls is within [0,
g_numClasses) and fail fast if it is not, rather than writing the tensor in
train_c.c.
- Around line 68-73: The KWS_CLASSES parsing in the environment-value helper
currently accepts partially parsed strings because `strtol` is called with a
null end pointer. Update the parsing logic in the function that validates
`KWS_CLASSES` to use an `endptr` and reject any value where extra characters
remain after the number, so only exact `6` or `35` are accepted; keep the
existing fallback and error message behavior for invalid input.

In `@examples/kws_raw/README.md`:
- Around line 8-11: Update the rate-change note to mention that changing K also
affects the hard-coded LayerNorm([C, L]) shapes in train_c.c and trace_c.c, not
just the MaxPool inputLength values. Make the README guidance around K and
effective rate explicit that both the pooling input lengths and LayerNorm
lengths must be kept in sync with the new sample rate, using the existing
AdaptiveAvgPool1d(1) and train_c.c/trace_c.c references to locate the affected
model setup.

In `@examples/kws_raw/trace_pytorch.py`:
- Around line 35-40: The semicolon-chained statements in trace_pytorch.py
trigger Ruff E702, so split them into separate lines to satisfy linting. Update
the chained assignments in the tracing logic around the activations collection,
including the blocks that use acts and the later occurrences in the same
function, so each assignment or function call stands alone. Keep the existing
behavior in the traced forward path and preserve the same variable names like
acts, h, model.conv1, model.ln1, F.relu, and F.max_pool1d while removing all
semicolons.

In `@examples/mnist_mlp/compare.py`:
- Around line 39-43: The confusion_matrix helper currently iterates with
zip(preds, labels), which silently truncates mismatched inputs and can produce a
partial report. Update confusion_matrix to validate that preds and labels have
the same length before the loop and raise a clear error if they differ, so
callers like the compare.py plotting/parity flow fail fast instead of using
incomplete data.

In `@examples/mnist_mlp/train_c.c`:
- Around line 217-223: The C training path is initializing the model in
buildModel() before the framework RNG is seeded, while the generated c.json
still hardcodes "seed": 42. Update the training flow around buildModel() to seed
the RNG first so the logged seed matches the actual initialization, or remove
the seed field from the config if this path is meant to stay unseeded. Make sure
the change keeps the logged configuration and the model initialization order
consistent.

In `@src/userApi/training_loop/calculate_grads/include/TraceApi.h`:
- Around line 23-25: Update the `TraceApi.h` contract comment for the trace
points emitted by `calculateGradsSequential` / the `sink` callback so the
`"agrad"` description matches the implementation. The issue is that `"agrad"` is
fired before `backward(...)` and represents the gradient entering each layer’s
backward pass, not a post-backward result; revise the wording in the
`calculateGradsSequential` comment to reflect that pre-backward gradient input
semantics.

---

Nitpick comments:
In @.github/workflows/ci.yml:
- Around line 156-170: The raw dataset cache setup is duplicating MNIST
downloads across both examples instead of using a shared location like the
SpeechCommands cache. Update the cache paths in the CI workflow so `mnist_mlp`
and `mnist_cnn` point to the shared MNIST raw directory used by
`examples/_shared/mnist_data.py`, and adjust the cache key inputs to reference
the shared prepare/download script or other shared source of truth. Keep the
`actions/cache` block aligned with the shared-data pattern used by
`examples/_shared/data/speech_commands` so both MNIST examples reuse the same
cached raw dataset.

In `@examples/har_classifier/CMakeLists.txt`:
- Around line 51-55: `Distributions` should not be linked directly from the
`examples/har_classifier` target list if `train_c.c` no longer uses it; this
leaks a private dependency through `LayerCommon`. Update the `LayerCommon`
target in `src/userApi/CMakeLists.txt` so it owns the `Distributions` dependency
transitively, and remove `Distributions` from the `har_classifier` link list so
consumers only depend on `LayerCommon`.

In `@test/unit/layer/UnitTestLinear.c`:
- Around line 1450-1455: The Xavier override test in UnitTestLinear should also
verify that the bias is actually sampled and not left all-zero. Update the
assertions around the existing weight and bias checks in the linear unit test to
include the same nonzero bias validation used by the default-initialization
test, so the override path is pinned to a real sampled bias rather than only
checking it stays within bounds.

In `@test/unit/userAPI/UnitTestConv1dApi.c`:
- Around line 324-336: The override-path bias assertion in UnitTestConv1dApi.c
is missing a non-zero sampling check, so a zero-initialized bias can still pass;
update the bias validation near the existing maxAbsFloat(cfg->bias->param) check
to also assert biasMaxAbs > 0.0f, matching the default-init test behavior while
keeping the current default-bound check intact.

In `@test/unit/userAPI/UnitTestConv1dTransposedApi.c`:
- Around line 293-298: The override-path bias check in the transposed conv unit
test only verifies the value stays within the default bound, so it can miss a
zero-initialized bias. Update the assertions in UnitTestConv1dTransposedApi to
also verify the bias is actually sampled/non-zero for the override path, using
the existing weight and bias max-abs checks in the same test case, while keeping
the current bound validation intact.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 1251b5b8-2278-4fef-86ed-4deda8727b24

📥 Commits

Reviewing files that changed from the base of the PR and between d346909 and 65bf907.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (68)

.github/workflows/ci.yml
CMakeLists.txt
devenv.nix
example/MnistExperiment.c
examples/CMakeLists.txt
examples/README.md
examples/_shared/CMakeLists.txt
examples/_shared/mnist_data.py
examples/_shared/npy_dump_sink.c
examples/_shared/npy_dump_sink.h
examples/_shared/speechcommands_data.py
examples/_shared/trace_compare.py
examples/_shared/trace_sweep.py
examples/ecg_anomaly_ae/CMakeLists.txt
examples/ecg_anomaly_ae/README.md
examples/ecg_anomaly_ae/compare.py
examples/ecg_anomaly_ae/train_c.c
examples/ecg_anomaly_ae_v2/train_c.c
examples/har_classifier/CMakeLists.txt
examples/har_classifier/README.md
examples/har_classifier/train_c.c
examples/kws_mfcc/CMakeLists.txt
examples/kws_mfcc/README.md
examples/kws_mfcc/compare.py
examples/kws_mfcc/prepare_data.py
examples/kws_mfcc/train_c.c
examples/kws_mfcc/train_pytorch.py
examples/kws_raw/CMakeLists.txt
examples/kws_raw/README.md
examples/kws_raw/compare.py
examples/kws_raw/prepare_data.py
examples/kws_raw/probe_manifest.h
examples/kws_raw/trace_c.c
examples/kws_raw/trace_pytorch.py
examples/kws_raw/train_c.c
examples/kws_raw/train_pytorch.py
examples/mnist_cnn/CMakeLists.txt
examples/mnist_cnn/README.md
examples/mnist_cnn/compare.py
examples/mnist_cnn/prepare_data.py
examples/mnist_cnn/train_c.c
examples/mnist_cnn/train_pytorch.py
examples/mnist_mlp/CMakeLists.txt
examples/mnist_mlp/README.md
examples/mnist_mlp/compare.py
examples/mnist_mlp/prepare_data.py
examples/mnist_mlp/train_c.c
examples/mnist_mlp/train_pytorch.py
pyproject.toml
src/userApi/CMakeLists.txt
src/userApi/LayerCommon.c
src/userApi/include/LayerCommon.h
src/userApi/layer/Conv1dApi.c
src/userApi/layer/Conv1dTransposedApi.c
src/userApi/layer/LinearApi.c
src/userApi/layer/include/Conv1dApi.h
src/userApi/layer/include/Conv1dTransposedApi.h
src/userApi/layer/include/LinearApi.h
src/userApi/training_loop/calculate_grads/CalculateGradsSequential.c
src/userApi/training_loop/calculate_grads/include/TraceApi.h
test/unit/CMakeLists.txt
test/unit/layer/CMakeLists.txt
test/unit/layer/UnitTestLinear.c
test/unit/training_loop/CMakeLists.txt
test/unit/training_loop/UnitTestCalculateGradsSequential.c
test/unit/userAPI/CMakeLists.txt
test/unit/userAPI/UnitTestConv1dApi.c
test/unit/userAPI/UnitTestConv1dTransposedApi.c

💤 Files with no reviewable changes (3)

example/MnistExperiment.c
examples/ecg_anomaly_ae_v2/train_c.c
CMakeLists.txt

coderabbitai · 2026-07-01T18:54:23Z

+        pairs = trace_compare.compare_pairs(c_dump, pt_dump)
+        batch_results.append((c_loss, pt_loss, pairs))


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Fail when the dump filename sets do not match.

trace_compare.compare_pairs() intentionally skips missing counterparts. In this loop, that means a missing probe/phase on either side just disappears from the aggregate tables, so trace_sweep.py can report a clean summary while traces are incomplete. Validate the dump_c and dump_pt filename sets before appending pairs.

Proposed fix

print(f" C: {c_out}") print(f" PT: {pt_out}", flush=True) c_loss = extract_loss(c_out) pt_loss = extract_loss(pt_out) + c_names = {p.name for p in c_dump.glob("*.npy")} + pt_names = {p.name for p in pt_dump.glob("*.npy")} + if not c_names or not pt_names or c_names != pt_names: + missing_in_pt = sorted(c_names - pt_names) + missing_in_c = sorted(pt_names - c_names) + raise RuntimeError( + "trace dump mismatch\n" + f" missing in PyTorch: {missing_in_pt[:10]}\n" + f" missing in C: {missing_in_c[:10]}" + ) pairs = trace_compare.compare_pairs(c_dump, pt_dump) batch_results.append((c_loss, pt_loss, pairs))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

pairs = trace_compare.compare_pairs(c_dump, pt_dump)

batch_results.append((c_loss, pt_loss, pairs))

c_names = {p.name for p in c_dump.glob("*.npy")}

pt_names = {p.name for p in pt_dump.glob("*.npy")}

if not c_names or not pt_names or c_names != pt_names:

missing_in_pt = sorted(c_names - pt_names)

missing_in_c = sorted(pt_names - c_names)

raise RuntimeError(

"trace dump mismatch\n"

f" missing in PyTorch: {missing_in_pt[:10]}\n"

f" missing in C: {missing_in_c[:10]}"

)

pairs = trace_compare.compare_pairs(c_dump, pt_dump)

batch_results.append((c_loss, pt_loss, pairs))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/_shared/trace_sweep.py` around lines 120 - 121, The aggregate in trace_sweep.py is currently allowing mismatched dump filename sets to slip through because compare_pairs() skips missing counterparts and batch_results still gets appended. In the loop that builds pairs and appends to batch_results, add a validation step using the dump_c and dump_pt filename sets before calling trace_compare.compare_pairs() or recording the result, and fail fast when they differ. Use the existing trace_compare.compare_pairs and batch_results flow as the place to enforce this check.

coderabbitai · 2026-07-01T18:54:23Z

+def confusion_matrix(preds: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
+    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
+    for p, a in zip(preds, labels):
+        cm[int(p), int(a)] += 1


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Fail fast on prediction/label length mismatches.

zip(preds, labels) silently truncates. If either predictions file has the wrong length, the confusion matrix still renders while dropping samples and hiding the contract violation.

Proposed fix

def confusion_matrix(preds: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray: + if preds.shape[0] != labels.shape[0]: + raise ValueError( + f"expected {labels.shape[0]} predictions, got {preds.shape[0]}" + ) cm = np.zeros((num_classes, num_classes), dtype=np.int64) for p, a in zip(preds, labels): cm[int(p), int(a)] += 1 return cm

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def confusion_matrix(preds: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:

cm = np.zeros((num_classes, num_classes), dtype=np.int64)

for p, a in zip(preds, labels):

cm[int(p), int(a)] += 1

def confusion_matrix(preds: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:

if preds.shape[0] != labels.shape[0]:

raise ValueError(

f"expected {labels.shape[0]} predictions, got {preds.shape[0]}"

)

cm = np.zeros((num_classes, num_classes), dtype=np.int64)

for p, a in zip(preds, labels):

cm[int(p), int(a)] += 1

🧰 Tools

🪛 Ruff (0.15.20)

[warning] 49-49: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/kws_raw/compare.py` around lines 47 - 50, The confusion_matrix function currently uses zip(preds, labels), which silently truncates when the arrays differ in length. Add an explicit length check at the start of confusion_matrix (or in the caller that prepares preds and labels) and raise a clear error if the lengths do not match before building the matrix. Keep the fix localized around confusion_matrix so the mismatch is caught before any counting happens.

Source: Linters/SAST tools

coderabbitai · 2026-07-01T18:54:23Z

+    print("\nParity report (PyTorch vs C) — INFORMATIONAL:")
+    print(f"{'metric':<14} {'pt':>10} {'c':>10} {'diff':>10} {'tol':>8} {'type':>5} {'pass':>6}")
+    for r in results:
+        print(f"{r.metric:<14} {r.pt_value:>10.5f} {r.c_value:>10.5f} {r.diff:>10.5f} "
+              f"{r.tolerance:>8.4f} {r.tolerance_type:>5} {str(r.passed):>6}")
+    print(f"\nOverall: {'PASS' if overall_pass else 'FAIL'} (informational; not a CI gate)")
+    return 0 if overall_pass else 1


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Keep the informational compare non-failing.

This script says the parity report is informational only, but return 0 if overall_pass else 1 makes every tolerance miss a hard failure. That contradicts both the module docstring and this PR’s stated contract for compare.py.

Proposed fix

- return 0 if overall_pass else 1 + return 0

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

print("\nParity report (PyTorch vs C) — INFORMATIONAL:")

print(f"{'metric':<14} {'pt':>10} {'c':>10} {'diff':>10} {'tol':>8} {'type':>5} {'pass':>6}")

for r in results:

print(f"{r.metric:<14} {r.pt_value:>10.5f} {r.c_value:>10.5f} {r.diff:>10.5f} "

f"{r.tolerance:>8.4f} {r.tolerance_type:>5} {str(r.passed):>6}")

print(f"\nOverall: {'PASS' if overall_pass else 'FAIL'} (informational; not a CI gate)")

return 0 if overall_pass else 1

print("\nParity report (PyTorch vs C) — INFORMATIONAL:")

print(f"{'metric':<14} {'pt':>10} {'c':>10} {'diff':>10} {'tol':>8} {'type':>5} {'pass':>6}")

for r in results:

print(f"{r.metric:<14} {r.pt_value:>10.5f} {r.c_value:>10.5f} {r.diff:>10.5f} "

f"{r.tolerance:>8.4f} {r.tolerance_type:>5} {str(r.passed):>6}")

print(f"\nOverall: {'PASS' if overall_pass else 'FAIL'} (informational; not a CI gate)")

return 0

🧰 Tools

🪛 Ruff (0.15.20)

[warning] 82-82: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/kws_raw/compare.py` around lines 78 - 84, The compare.py parity report is meant to be informational only, but the current final return in the main flow still turns any mismatch into a nonzero exit. Update the result handling near the overall_pass summary so the script always exits successfully after printing the report, and keep the existing parity output logic in compare.py and any surrounding main entrypoint consistent with that non-failing contract.

coderabbitai · 2026-07-01T18:54:23Z

+    ensureDir("examples/kws_raw/dump_c");
+    for (size_t step = 0; step < g_steps; step++) {
+        char dir[256];
+        snprintf(dir, sizeof(dir), "examples/kws_raw/dump_c/step%03zu", step);
+        ensureDir(dir);


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Avoid reusing existing step dump directories.

step%03zu is deterministic, so rerunning with fewer steps or a smaller --act-samples leaves stale .npy files behind and the comparer can read mixed old/new traces. Clear dir first, or treat EEXIST as an error for per-step dumps.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/kws_raw/trace_c.c` around lines 368 - 372, The per-step dump directory naming in the trace generation loop is deterministic, so reruns can reuse stale contents from prior executions. Update the step directory handling in the loop around ensureDir and snprintf in trace_c.c so each step dump directory is either freshly cleared before use or creation fails on an existing directory (including treating EEXIST as an error), preventing mixed old/new .npy files from being read by the comparer.

coderabbitai · 2026-07-01T18:54:23Z

+import argparse, sys
+from pathlib import Path
+import numpy as np, torch, torch.nn.functional as F


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Start from a clean dump_pt/step000 directory.

This reuses a fixed output path. If a rerun uses a smaller batch or fewer activation samples, old *.sNNN.npy files remain and the directory no longer represents one step. Remove the directory first or write to a fresh one.

Proposed fix

-import numpy as np, torch, torch.nn.functional as F +import shutil + +import numpy as np, torch, torch.nn.functional as F @@ - out = HERE / "dump_pt" / "step000"; out.mkdir(parents=True, exist_ok=True) + out = HERE / "dump_pt" / "step000" + shutil.rmtree(out, ignore_errors=True) + out.mkdir(parents=True, exist_ok=True)

Also applies to: 68-68

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/kws_raw/trace_pytorch.py` around lines 11 - 13, The fixed output path in trace_pytorch.py can leave stale *.sNNN.npy files from earlier runs, so start each run with a clean dump_pt/step000 directory or write into a fresh step-specific directory. Update the step/activation dump logic in the main tracing flow (where the output path is created and files are written) to remove any existing directory before saving new samples, using the existing path-building symbols in trace_pytorch.py.

coderabbitai · 2026-07-01T18:54:23Z

+    print("\nParity report (PyTorch vs C) — INFORMATIONAL:")
+    print(f"{'metric':<14} {'pt':>10} {'c':>10} {'diff':>10} {'tol':>8} {'type':>5} {'pass':>6}")
+    for r in results:
+        print(f"{r.metric:<14} {r.pt_value:>10.5f} {r.c_value:>10.5f} {r.diff:>10.5f} "
+              f"{r.tolerance:>8.4f} {r.tolerance_type:>5} {str(r.passed):>6}")
+    print(f"\nOverall: {'PASS' if overall_pass else 'FAIL'} (informational; not a CI gate)")
+    return 0 if overall_pass else 1


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Keep the “informational” compare step non-failing.

The module docstring and examples/mnist_cnn/README.md both describe this command as informational, but Line 76 returns 1 on tolerance misses. That turns the report into a gate for any scripted run.

Suggested fix

print(f"\nOverall: {'PASS' if overall_pass else 'FAIL'} (informational; not a CI gate)") - return 0 if overall_pass else 1 + return 0

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

print("\nParity report (PyTorch vs C) — INFORMATIONAL:")

print(f"{'metric':<14} {'pt':>10} {'c':>10} {'diff':>10} {'tol':>8} {'type':>5} {'pass':>6}")

for r in results:

print(f"{r.metric:<14} {r.pt_value:>10.5f} {r.c_value:>10.5f} {r.diff:>10.5f} "

f"{r.tolerance:>8.4f} {r.tolerance_type:>5} {str(r.passed):>6}")

print(f"\nOverall: {'PASS' if overall_pass else 'FAIL'} (informational; not a CI gate)")

return 0 if overall_pass else 1

print("\nParity report (PyTorch vs C) — INFORMATIONAL:")

print(f"{'metric':<14} {'pt':>10} {'c':>10} {'diff':>10} {'tol':>8} {'type':>5} {'pass':>6}")

for r in results:

print(f"{r.metric:<14} {r.pt_value:>10.5f} {r.c_value:>10.5f} {r.diff:>10.5f} "

f"{r.tolerance:>8.4f} {r.tolerance_type:>5} {str(r.passed):>6}")

print(f"\nOverall: {'PASS' if overall_pass else 'FAIL'} (informational; not a CI gate)")

return 0

🧰 Tools

🪛 Ruff (0.15.20)

[warning] 74-74: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/mnist_cnn/compare.py` around lines 70 - 76, Update the compare script so the informational parity report never fails the process; in compare.py, keep the reporting in the main compare routine (the print loop and overall summary) but change the return value logic at the end so it always exits successfully instead of returning 1 when overall_pass is false. Make sure the behavior of the compare command stays aligned with the module docstring and README description of the step as informational, not a CI gate.

coderabbitai · 2026-07-01T18:54:23Z

+    print("\nParity report (PyTorch vs C) — INFORMATIONAL:")
+    print(f"{'metric':<14} {'pt':>10} {'c':>10} {'diff':>10} {'tol':>8} {'type':>5} {'pass':>6}")
+    for r in results:
+        print(f"{r.metric:<14} {r.pt_value:>10.5f} {r.c_value:>10.5f} {r.diff:>10.5f} "
+              f"{r.tolerance:>8.4f} {r.tolerance_type:>5} {str(r.passed):>6}")
+    print(f"\nOverall: {'PASS' if overall_pass else 'FAIL'} (informational; not a CI gate)")
+    return 0 if overall_pass else 1


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Do not fail the process from an informational-only comparison.

Lines 70-76 currently make compare.py exit 1 on tolerance drift, which turns this script into a gate. That contradicts this module’s own docstring and examples/mnist_mlp/README.md, both of which describe the train-from-scratch comparison as informational-only.

Proposed fix

print("\nParity report (PyTorch vs C) — INFORMATIONAL:") print(f"{'metric':<14} {'pt':>10} {'c':>10} {'diff':>10} {'tol':>8} {'type':>5} {'pass':>6}") for r in results: print(f"{r.metric:<14} {r.pt_value:>10.5f} {r.c_value:>10.5f} {r.diff:>10.5f} " f"{r.tolerance:>8.4f} {r.tolerance_type:>5} {str(r.passed):>6}") print(f"\nOverall: {'PASS' if overall_pass else 'FAIL'} (informational; not a CI gate)") - return 0 if overall_pass else 1 + return 0

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

print("\nParity report (PyTorch vs C) — INFORMATIONAL:")

print(f"{'metric':<14} {'pt':>10} {'c':>10} {'diff':>10} {'tol':>8} {'type':>5} {'pass':>6}")

for r in results:

print(f"{r.metric:<14} {r.pt_value:>10.5f} {r.c_value:>10.5f} {r.diff:>10.5f} "

f"{r.tolerance:>8.4f} {r.tolerance_type:>5} {str(r.passed):>6}")

print(f"\nOverall: {'PASS' if overall_pass else 'FAIL'} (informational; not a CI gate)")

return 0 if overall_pass else 1

print("\nParity report (PyTorch vs C) — INFORMATIONAL:")

print(f"{'metric':<14} {'pt':>10} {'c':>10} {'diff':>10} {'tol':>8} {'type':>5} {'pass':>6}")

for r in results:

print(f"{r.metric:<14} {r.pt_value:>10.5f} {r.c_value:>10.5f} {r.diff:>10.5f} "

f"{r.tolerance:>8.4f} {r.tolerance_type:>5} {str(r.passed):>6}")

print(f"\nOverall: {'PASS' if overall_pass else 'FAIL'} (informational; not a CI gate)")

return 0

🧰 Tools

🪛 Ruff (0.15.20)

[warning] 74-74: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/mnist_mlp/compare.py` around lines 70 - 76, The parity comparison in compare.py is informational-only, but the current return path in the main comparison flow still exits with a failure status when overall_pass is false. Update the logic around the results loop and final return so the script always completes successfully for tolerance drift, keeping the printed PASS/FAIL summary as informational output only. Use the compare.py comparison function and the overall_pass/result handling to locate the change, and remove any nonzero exit behavior from this reporting path.

coderabbitai · 2026-07-01T18:54:24Z

+    /* fan_in = inPerGroup*kernelSize; fan_out = outPerGroup*kernelSize
+     * (PyTorch _calculate_fan_in_and_fan_out for the Conv1d weight layout). */
+    size_t fanIn = inPerGroup * kernelSize;
+    size_t fanOut = (outChannels / groups) * kernelSize;
+    initWeightTensor(paramTensor, weightInit, fanIn, fanOut);


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify the current PyTorch source still derives fan_out from tensor.size(0). curl -fsSL https://raw.githubusercontent.com/pytorch/pytorch/main/torch/nn/init.py | sed -n '/def _calculate_fan_in_and_fan_out/,/return fan_in, fan_out/p'

Repository: es-ude/OnDeviceTraining

Length of output: 908

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the Conv1d initialization code and the tensor initializer it calls. git ls-files 'src/userApi/layer/Conv1dApi.c' 'src/**/initWeightTensor*' 'src/**/weightInit*' 'src/**/Tensor*' | sed -n '1,200p' printf '\n--- Conv1dApi.c ---\n' sed -n '1,180p' src/userApi/layer/Conv1dApi.c printf '\n--- Search initWeightTensor references ---\n' rg -n "initWeightTensor|fanIn|fanOut" src

Repository: es-ude/OnDeviceTraining

Length of output: 12338

🏁 Script executed:

#!/bin/bash set -euo pipefail printf '\n--- LayerCommon.c (initWeightTensor) ---\n' sed -n '1,90p' src/userApi/LayerCommon.c printf '\n--- Conv1dTransposedApi.c (grouped fanOut reference) ---\n' sed -n '55,90p' src/userApi/layer/Conv1dTransposedApi.c

Repository: es-ude/OnDeviceTraining

Length of output: 4423

Use the full output-channel count for Conv1d Xavier fan-out.

For weights shaped [outChannels, inChannels/groups, kernelSize], fan-out should be outChannels * kernelSize, not (outChannels / groups) * kernelSize; otherwise grouped Xavier gets a wider bound than intended.

Proposed fix

- /* fan_in = inPerGroup*kernelSize; fan_out = outPerGroup*kernelSize + /* fan_in = inPerGroup*kernelSize; fan_out = outChannels*kernelSize * (PyTorch _calculate_fan_in_and_fan_out for the Conv1d weight layout). */ size_t fanIn = inPerGroup * kernelSize; - size_t fanOut = (outChannels / groups) * kernelSize; + size_t fanOut = outChannels * kernelSize;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

/* fan_in = inPerGroup*kernelSize; fan_out = outPerGroup*kernelSize

* (PyTorch _calculate_fan_in_and_fan_out for the Conv1d weight layout). */

size_t fanIn = inPerGroup * kernelSize;

size_t fanOut = (outChannels / groups) * kernelSize;

initWeightTensor(paramTensor, weightInit, fanIn, fanOut);

/* fan_in = inPerGroup*kernelSize; fan_out = outChannels*kernelSize

* (PyTorch _calculate_fan_in_and_fan_out for the Conv1d weight layout). */

size_t fanIn = inPerGroup * kernelSize;

size_t fanOut = outChannels * kernelSize;

initWeightTensor(paramTensor, weightInit, fanIn, fanOut);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/userApi/layer/Conv1dApi.c` around lines 108 - 112, The Conv1d Xavier initialization is computing fan-out with grouped channels instead of the full output-channel count. Update the fan-out calculation in the Conv1d weight initialization path so the code in the Conv1dApi logic uses outChannels * kernelSize rather than dividing by groups, while keeping fan-in based on the per-group input channels. Preserve the existing initWeightTensor call and adjust the local fanOut symbol accordingly.

coderabbitai · 2026-07-01T18:54:24Z

+static void traceModelParams(layer_t **model, size_t modelSize, const char *tag, bool wantGrad,
+                             traceSink_t sink, void *ctx) {
+    char phase[64];
+    for (size_t i = 0; i < modelSize; i++) {
+        parameter_t *w = NULL, *b = NULL;
+        if (!layerParameters(model[i], &w, &b)) {
+            continue;
+        }
+        tensor_t *wt = wantGrad ? getGradFromParameter(w) : getParamFromParameter(w);
+        snprintf(phase, sizeof(phase), "%s.weight", tag);
+        sink(ctx, i, model[i]->type, phase, wt);
+        if (b != NULL) {
+            tensor_t *bt = wantGrad ? getGradFromParameter(b) : getParamFromParameter(b);
+            snprintf(phase, sizeof(phase), "%s.bias", tag);
+            sink(ctx, i, model[i]->type, phase, bt);
+        }
+    }


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Guard traceModelParams() against a NULL sink.

traceModelWeights() and traceModelGrads() both route here, and sink(ctx, ...) is unconditional. Passing NULL currently crashes, even though tracedGrads(..., NULL, NULL) is already treated as a valid no-op trace path.

Suggested fix

static void traceModelParams(layer_t **model, size_t modelSize, const char *tag, bool wantGrad, traceSink_t sink, void *ctx) { + if (sink == NULL) { + return; + } + char phase[64]; for (size_t i = 0; i < modelSize; i++) { parameter_t *w = NULL, *b = NULL;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

static void traceModelParams(layer_t **model, size_t modelSize, const char *tag, bool wantGrad,

traceSink_t sink, void *ctx) {

char phase[64];

for (size_t i = 0; i < modelSize; i++) {

parameter_t *w = NULL, *b = NULL;

if (!layerParameters(model[i], &w, &b)) {

continue;

}

tensor_t *wt = wantGrad ? getGradFromParameter(w) : getParamFromParameter(w);

snprintf(phase, sizeof(phase), "%s.weight", tag);

sink(ctx, i, model[i]->type, phase, wt);

if (b != NULL) {

tensor_t *bt = wantGrad ? getGradFromParameter(b) : getParamFromParameter(b);

snprintf(phase, sizeof(phase), "%s.bias", tag);

sink(ctx, i, model[i]->type, phase, bt);

}

}

static void traceModelParams(layer_t **model, size_t modelSize, const char *tag, bool wantGrad,

traceSink_t sink, void *ctx) {

if (sink == NULL) {

return;

}

char phase[64];

for (size_t i = 0; i < modelSize; i++) {

parameter_t *w = NULL, *b = NULL;

if (!layerParameters(model[i], &w, &b)) {

continue;

}

tensor_t *wt = wantGrad ? getGradFromParameter(w) : getParamFromParameter(w);

snprintf(phase, sizeof(phase), "%s.weight", tag);

sink(ctx, i, model[i]->type, phase, wt);

if (b != NULL) {

tensor_t *bt = wantGrad ? getGradFromParameter(b) : getParamFromParameter(b);

snprintf(phase, sizeof(phase), "%s.bias", tag);

sink(ctx, i, model[i]->type, phase, bt);

}

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/userApi/training_loop/calculate_grads/CalculateGradsSequential.c` around lines 140 - 156, `traceModelParams()` currently calls `sink(ctx, ...)` unconditionally, so a NULL trace sink will crash even though the no-op path is expected; add an early guard in `traceModelParams()` (before the loop) to return immediately when `sink` is NULL, keeping `traceModelWeights()` and `traceModelGrads()` behavior consistent with the existing `tracedGrads(..., NULL, NULL)` no-op case.

coderabbitai · 2026-07-01T18:54:24Z

+ *  - phase:      "fwd" | "agrad" | "lossgrad" for tracedGrads (Task 2);
+ *                "<tag>.weight" / "<tag>.bias" for traceModelWeights/Grads (Task 3).
+ *  - tensor:     borrowed; valid only for the duration of the call. */
+typedef void (*traceSink_t)(void *ctx, size_t layerIdx, layerType_t layerType, const char *phase,
+                            tensor_t *tensor);


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Clarify that phase is borrowed too.

tracedGrads() passes string literals, but traceModelWeights() / traceModelGrads() synthesize phase in stack storage before invoking the sink. A sink that retains the pointer past the callback can end up with dangling or corrupted phase names. Please either document phase as valid only for the duration of the call or switch those synthesized names to stable storage.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/userApi/training_loop/calculate_grads/include/TraceApi.h` around lines 17 - 21, Clarify the lifetime of the phase argument in traceSink_t: it is currently undocumented even though tracedGrads() passes literals and traceModelWeights()/traceModelGrads() may build it from temporary stack storage. Update the comment in TraceApi.h near traceSink_t to state that phase is borrowed and valid only for the duration of the callback, or change the callers that synthesize phase to use stable storage before invoking the sink.

LeoBuron and others added 30 commits June 25, 2026 15:16

feat(examples/_shared): add shared MNIST torchvision loader

7f29375

feat(examples/mnist_mlp): prepare_data writes train/val/test npy

9981a3d

feat(examples/mnist_mlp): PyTorch reference MLP + weight export

20f8ce9

feat(examples/mnist_mlp): factory-API C trainer + CMake wiring

df4a21f

feat(examples/mnist_mlp): informational compare.py + plots

fad2b57

docs(examples/mnist_mlp): add README

2e04430

ci(examples): wire mnist_mlp into the bit-parity job

bfe5262

chore(examples/_shared): annotate load_mnist root param, drop unused-…

92e1be8

…import finding

feat(examples/mnist_cnn): prepare_data writes train/val/test npy

cee7d63

feat(examples/mnist_cnn): PyTorch reference 1D-CNN + weight export

bd5902e

feat(examples/mnist_cnn): factory-API 1D-CNN trainer + CMake wiring

86f9f53

feat(examples/mnist_cnn): informational compare.py + plots

862230f

docs(examples/mnist_cnn): add README + update suite table

96b7a0a

ci(examples): wire mnist_cnn into the bit-parity job

39a2bd0

feat(examples/_shared): add SpeechCommands loader + torchaudio dep

0f416f6

feat(examples/kws_mfcc): prepare_data writes MFCC train/val/test npy

44f0f91

feat(examples/kws_mfcc): PyTorch reference MFCC CNN + weight export

edc4f16

feat(examples/kws_mfcc): factory-API MFCC CNN trainer + CMake wiring

b967ecc

feat(examples/kws_mfcc): informational compare.py + plots

e044d08

docs(examples/kws_mfcc): add README

3e62857

ci(examples): wire kws_mfcc into the bit-parity job

4aa661d

feat(examples/kws_raw): prepare_data writes raw waveform train/val/te…

6a559c7

…st npy

feat(examples/kws_raw): PyTorch reference raw-waveform CNN + weight e…

16736b8

…xport

feat(examples/kws_raw): factory-API raw-waveform CNN trainer + CMake …

2518b79

…wiring

LeoBuron and others added 18 commits June 26, 2026 20:47

feat(examples/kws_raw): informational compare.py + plots

fc7043d

docs(examples/kws_raw): add README + update suite table

4e19a2f

ci(examples): wire kws_raw into the bit-parity job

14b5fb7

feat(training_loop): tracedGrads fires activation + act-grad sink per…

fc73ca2

… layer

feat(training_loop): traceModelWeights/Grads dump per-layer params + …

87b5cf4

…grads

feat(examples/_shared): npyDumpSink trace sink writes FLOAT32 tensors…

d7f7655

… to .npy

feat(examples/kws_raw): trace_c controlled-step harness writes per-la…

177788d

…yer dumps Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(examples/kws_raw): trace_pytorch mirrors the controlled step via…

a83ce86

… hooks

feat(examples/_shared): trace_compare localizes first C-vs-PyTorch drift

a991211

style(examples): clang-format trace harness + dump sink

c85ca03

Subagent edits to npy_dump_sink.c and trace_c.c skipped the clang-format PostToolUse hook; bring them into clang-format-21 compliance. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai Bot reviewed Jul 1, 2026

View reviewed changes

LeoBuron merged commit a3bf34c into main Jul 1, 2026
22 checks passed

LeoBuron mentioned this pull request Jul 1, 2026

Observer layer: per-layer activation/grad tracing to diagnose the C-vs-PyTorch training divergence #257

Closed

		pairs = trace_compare.compare_pairs(c_dump, pt_dump)
		batch_results.append((c_loss, pt_loss, pairs))

-        pairs = trace_compare.compare_pairs(c_dump, pt_dump)
-        batch_results.append((c_loss, pt_loss, pairs))
+        c_names = {p.name for p in c_dump.glob("*.npy")}
+        pt_names = {p.name for p in pt_dump.glob("*.npy")}
+        if not c_names or not pt_names or c_names != pt_names:
+            missing_in_pt = sorted(c_names - pt_names)
+            missing_in_c = sorted(pt_names - c_names)
+            raise RuntimeError(
+                "trace dump mismatch\n"
+                f"  missing in PyTorch: {missing_in_pt[:10]}\n"
+                f"  missing in C: {missing_in_c[:10]}"
+            )
+        pairs = trace_compare.compare_pairs(c_dump, pt_dump)
+        batch_results.append((c_loss, pt_loss, pairs))

Uh oh!

Conversation

LeoBuron commented Jul 1, 2026

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

❌ Failed checks (2 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading