feat(examples/kws_raw): per-conv LayerNorm (10-seed-validated) replacing fragile end-LN #259
Merged
A 10-seed x 3-placement x 3-lr sweep (50 epochs) showed the previously-shipped
end-feature LayerNorm(64) is the WORST option: 0.47 +/- 0.25 test_acc, collapsing
to a one-class reference on ~40% of seeds (the original seed-42 pick was lucky).
Per-conv LayerNorm([C,L]) over each conv's full feature map (pre-ReLU) is the best
and most stable: 0.72 +/- 0.01, all 10 seeds converge across all 6 classes. (Plain
no-LayerNorm also trains fine at 50 epochs, 0.70 +/- 0.02 — the raw model was never
un-trainable, just slow; LayerNorm is kept as the framework's bit-parity-covered
normalizer and to exercise it in the gate.)
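
To make the "LayerNorm([C,L]) over each conv's full feature map" idea concrete, here is a minimal pure-Python sketch, not the framework's C implementation. It assumes the usual LayerNorm definition (a single mean and biased variance taken over all C*L values of one sample) and leaves out the learnable scale and shift; the function name `layer_norm_cl` is hypothetical.

```python
import math

def layer_norm_cl(x, eps=1e-5):
    """Normalize one [C][L] feature map with a single mean/variance over all C*L values.

    Assumption: standard LayerNorm statistics (biased variance), no affine parameters.
    """
    flat = [v for row in x for v in row]
    n = len(flat)
    mean = sum(flat) / n
    var = sum((v - mean) ** 2 for v in flat) / n  # biased variance (divide by n)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * inv_std for v in row] for row in x]

# Toy feature map with C=2 channels and L=4 positions (made-up values).
x = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
y = layer_norm_cl(x)

# After normalization the whole map has roughly zero mean and unit variance.
flat_y = [v for row in y for v in row]
mean_y = sum(flat_y) / len(flat_y)
var_y = sum((v - mean_y) ** 2 for v in flat_y) / len(flat_y)
```

Because the statistics are shared across channels, the relative scale between channels is preserved; only the map as a whole is standardized.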
Model: 3x [Conv1d -> LayerNorm([C,L]) -> ReLU -> MaxPool(4)] (shapes [16,1000],
[32,250], [64,62]), lr=0.005, 50 epochs. C: MODEL_SIZE 15->17, three
layerNormLayerInit(numNormDims=2, eps=1e-5) at model[2]/[6]/[10], 7-entry
state-dict {conv1,ln1,conv2,ln2,conv3,ln3,fc}. Gate PASSES bit-identical (2483/2483)
with diverse predictions across all 6 classes -- first bit-parity exercise of a
multi-dim [C,L] LayerNorm in an example.
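
The LayerNorm shapes and the model[2]/[6]/[10] positions can be sanity-checked with a little arithmetic. The sketch below is illustrative only. It assumes a 16000-sample input (not stated in the PR; it is the length that the PR description's AvgPool1d(16) front end would reduce to 1000), non-overlapping pooling with floor division, length-preserving convolutions, and zero-based layer indices with the AvgPool1d layer at index 0. All function names are hypothetical.

```python
RAW_LEN = 16000  # assumption: e.g. 1 s of 16 kHz audio; not stated in the PR

def pool_out_len(length, kernel):
    # Non-overlapping pooling (stride == kernel, no padding) floors the division.
    return length // kernel

def layer_norm_shapes(raw_len=RAW_LEN, channels=(16, 32, 64)):
    length = pool_out_len(raw_len, 16)  # AvgPool1d(16) front end
    shapes = []
    for c in channels:
        # A SAME-padded conv keeps the length; LayerNorm sees [C, L] before pooling.
        shapes.append([c, length])
        length = pool_out_len(length, 4)  # MaxPool(4)
    return shapes, length

def layer_norm_indices(num_blocks=3):
    # Assumed layout: index 0 is AvgPool1d, then blocks of
    # (Conv1d, LayerNorm, ReLU, MaxPool), so LayerNorm is the 2nd layer of each block.
    return [1 + 4 * b + 1 for b in range(num_blocks)]

shapes, final_len = layer_norm_shapes()
print(shapes)                # [[16, 1000], [32, 250], [64, 62]]
print(layer_norm_indices())  # [2, 6, 10]
```

Note that 250 is not divisible by 4, so the third shape is [64,62] (floor of 62.5), which matches the shapes listed above.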
Revises the kws_raw model after a 10-seed config sweep showed the previously-shipped end-feature LayerNorm(64) was the worst, least-stable choice.

Why

A 10-seed × 3-placement × 3-lr sweep (50 epochs each), test_acc by placement:
end-feature LayerNorm(64): 0.47 ± 0.25
per-conv LayerNorm([C,L]): 0.72 ± 0.01
no LayerNorm: 0.70 ± 0.02

The shipped end-feature LayerNorm collapses to a one-class reference on ~40% of seeds (the original seed-42 pick was lucky), which makes for a fragile, degenerate gate. Per-conv LayerNorm (one over each conv's full [C,L] feature map, pre-ReLU) converges reliably and reaches the highest accuracy. (No-LayerNorm also trains fine at 50 epochs; the raw model was never un-trainable, just slow. LayerNorm is kept as the framework's only bit-parity-covered normalizer and to exercise it.)

Change

AvgPool1d(16) → 3× [Conv1d(K3,SAME) → LayerNorm([C,L]) → ReLU → MaxPool(4)] → AdaptiveAvgPool1d(1) → Flatten → Linear. LayerNorm shapes [16,1000], [32,250], [64,62]. lr=0.005, 50 epochs.
MODEL_SIZE 15→17, three layerNormLayerInit(numNormDims=2, eps=1e-5), 7-entry state-dict {conv1,ln1,conv2,ln2,conv3,ln3,fc}.
BIT_PARITY=1: C int32 predictions bit-identical to PyTorch (2483/2483), diverse across all 6 classes, test_acc 0.721. This is the first bit-parity exercise of a multi-dim [C,L] LayerNorm in an example.

Supersedes the end-LN config from #256. The shipped gate was seed-independent (loads fixed weights) so this is a quality/robustness fix, not a correctness bug.
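
A one-class reference weakens the gate because a bit-parity comparison passes trivially when every prediction is the same class. The following sketch is a hypothetical illustration of that point, not the repository's gate script: `parity_report` and the toy prediction lists are made up, and the only facts taken from the PR are the 6 classes and the idea of comparing C predictions against the PyTorch reference.

```python
def parity_report(c_preds, ref_preds, num_classes=6):
    """Compare integer class predictions and check that the reference is not collapsed."""
    if len(c_preds) != len(ref_preds):
        raise ValueError("prediction lists must have the same length")
    matches = sum(1 for a, b in zip(c_preds, ref_preds) if a == b)
    classes_seen = set(ref_preds)
    return {
        "matches": matches,
        "total": len(ref_preds),
        "bit_identical": matches == len(ref_preds),
        # A meaningful gate needs the reference to predict every class at least once.
        "diverse": len(classes_seen) == num_classes,
    }

# Toy data (made up): a healthy reference versus a collapsed one-class reference.
healthy = parity_report([0, 1, 2, 3, 4, 5, 0], [0, 1, 2, 3, 4, 5, 0])
collapsed = parity_report([3] * 7, [3] * 7)
```

Both toy cases report full parity, but only the first one is diverse, which is why the PR reports diversity across all 6 classes alongside the 2483/2483 match.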
🤖 Generated with Claude Code