feat(examples): kws_mfcc SpeechCommands MFCC parity demo (Stage 3 PR-A) #255
Merged
torchaudio 2.11 (maintenance mode) routes its dataset decode through torchcodec (needs a system FFmpeg), so iterating SPEECHCOMMANDS raised ImportError at prepare time. Switch to ds.get_metadata (no decode) + a stdlib `wave` reader (int16 PCM / 32768) — the fallback the spec blessed, byte-identical output, no torchcodec/FFmpeg/scipy dependency. Also make the loader path-based: collect paths per label, then decode only the clips a split keeps (all 4 keywords + the sampled "unknown"). Peak RAM for 6-class drops ~5.8 GB -> ~1.4 GB, fitting the 7 GB CI runner.
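The path-based flow described above (collect paths per label, pick what a split keeps, decode only those) can be sketched as follows. This is a minimal illustration, not the PR's loader: `select_paths`, `load_split`, the `n_unknown` parameter, and the `decode` callback are hypothetical names, and the synthetic silence class is left out.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

KEYWORDS = ("yes", "no", "up", "down")  # the 4 keywords named in the PR


def select_paths(
    items: Iterable[Tuple[str, str]],  # (path, label) pairs; no audio decoded yet
    n_unknown: int,                    # hypothetical knob: size of the "unknown" sample
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Keep every keyword clip plus a random sample of all other labels as 'unknown'."""
    by_label: Dict[str, List[str]] = defaultdict(list)
    for path, label in items:
        by_label[label].append(path)

    kept = [(p, kw) for kw in KEYWORDS for p in by_label.get(kw, [])]

    # Pool every non-keyword clip, sorted so the seeded sample is reproducible.
    others = sorted(
        p for label, paths in by_label.items() if label not in KEYWORDS for p in paths
    )
    rng = random.Random(seed)
    sampled = rng.sample(others, min(n_unknown, len(others)))
    kept += [(p, "unknown") for p in sampled]
    return kept


def load_split(
    kept: List[Tuple[str, str]], decode: Callable[[str], list]
) -> List[Tuple[list, str]]:
    """Decode only the clips that survived selection."""
    return [(decode(path), label) for path, label in kept]
```

Because selection works on paths alone, the decoded audio held in memory is bounded by the kept clips rather than by the full corpus, which is the effect the commit message attributes the RAM drop to.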
The stdlib wave reader interprets frames as int16/32768; a non-mono or non-16-bit clip would be silently misdecoded. Assert the format (the corpus is uniformly 16 kHz mono 16-bit, so this never trips — it guards a future corpus swap). Flagged by the PR-A final review as the one silent-decode path.
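A stdlib-only reader with the format guard might look like the sketch below. It is an illustration under assumptions, not the PR's code: `read_wav_int16` and `expected_rate` are hypothetical names, samples are returned as a plain list of floats instead of an array, and the guard raises `ValueError` where the commit message speaks of an assert.

```python
import struct
import wave
from typing import List


def read_wav_int16(path: str, expected_rate: int = 16000) -> List[float]:
    """Read a mono 16-bit PCM WAV file and scale samples by 1/32768."""
    with wave.open(path, "rb") as wf:
        # Guard the silent-misdecode path: anything other than mono int16
        # would be reinterpreted incorrectly by the unpack below.
        if wf.getnchannels() != 1:
            raise ValueError(f"{path}: expected mono, got {wf.getnchannels()} channels")
        if wf.getsampwidth() != 2:
            raise ValueError(f"{path}: expected 16-bit samples, got {8 * wf.getsampwidth()}-bit")
        if wf.getframerate() != expected_rate:
            raise ValueError(f"{path}: expected {expected_rate} Hz, got {wf.getframerate()} Hz")
        raw = wf.readframes(wf.getnframes())

    # WAV PCM is little-endian; "<h" decodes signed 16-bit regardless of host byte order.
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [s / 32768.0 for s in samples]
```

Dividing by 32768 maps the int16 range onto [-1.0, 1.0), so the most negative sample becomes exactly -1.0 and the most positive becomes 32767/32768.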
Adds `examples/kws_mfcc/`: SpeechCommands keyword-spotting (MFCC features → 1D-CNN) demonstrating exact PyTorch↔C bit-parity. First of the two Stage-3 KWS demos; establishes the shared `examples/_shared/speechcommands_data.py` loader (reused by PR-B `kws_raw`).

Gate: `BIT_PARITY=1`; C int32 predictions are bit-identical to PyTorch (2483/2483, 6-class). First example to prove `AdaptiveAvgPool1d(1)` exact vs PyTorch.

Model: `Conv1d(40→32,K3,SAME) → ReLU → MaxPool(2) → Conv1d(32→64,K3,SAME) → ReLU → MaxPool(2) → AdaptiveAvgPool1d(1) → Flatten → Linear(64→C) → Softmax`.

Class-count knob: `KWS_CLASSES` (default 6). CI runs 6-class only (`yes/no/up/down` + synthetic silence + unknown, balanced); 35-class is local-only (compiled by the build-all rot-guard).

Note (torchaudio 2.11 / torchcodec): 2.11 (maintenance mode) routes dataset decode through `torchcodec` (needs system FFmpeg). The loader uses the spec-blessed fallback: `get_metadata` + stdlib `wave` reader (int16/32768, byte-identical), so no torchcodec/FFmpeg dependency. The loader is path-based to bound peak RAM (~1.4 GB for 6-class, within the runner's 7 GB).

CI: shared raw-download cache + per-example processed-`.npy` cache (gates prepare on cache-hit); int32-exact diff. 6-class only.

🤖 Generated with Claude Code
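The int32-exact gate in the description amounts to requiring that every C prediction equal its PyTorch counterpart, with no tolerance. The PR text does not show the actual diff script, so the following is only a sketch of that kind of check; `count_exact_matches` and `parity_ok` are hypothetical names.

```python
from typing import Sequence, Tuple


def count_exact_matches(ref: Sequence[int], got: Sequence[int]) -> Tuple[int, int]:
    """Return (matches, total) for two equally long sequences of integer predictions."""
    if len(ref) != len(got):
        raise ValueError(f"length mismatch: {len(ref)} reference vs {len(got)} candidate")
    matches = sum(1 for r, g in zip(ref, got) if r == g)
    return matches, len(ref)


def parity_ok(ref: Sequence[int], got: Sequence[int]) -> bool:
    """Exact gate: passes only when every prediction is identical."""
    matches, total = count_exact_matches(ref, got)
    return matches == total
```

A result reported as N/N, such as the 2483/2483 above, corresponds to `count_exact_matches` returning the same number twice.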