fix(asr): reduce English drift on French recordings via token blocklist #630
Open
Matth-93 wants to merge 1 commit into
When transcribing non-English Latin-script audio, Parakeet TDT v3 has no language conditioning token and falls back to its English prior on acoustically ambiguous frames, producing output that reads like a live translation. This adds a post-processing step in the decoder that partially mitigates it for French.

Every time the joint model picks a token from an English-exclusive blocklist (48 space-prefixed SentencePiece tokens impossible in French prose), scan the top-64 logits for the highest-scoring non-blocked Latin-script alternative and substitute it. If nothing suitable appears in top-64, the original prediction stands. The substituted score is recalculated via softmax over top-K and fed back into the LSTM, compounding the effect across subsequent frames.

Benchmarks on six French recordings (drift = % sentences with English tokens or langdetect=EN):

Children weather segment: 0% -> 0%
Archival 1912: 2.9% -> 2.7%
Weightlifting documentary: 7.1% -> 0%
French slang documentary: 18.2% -> 3.7%
Picard dialect doc: 16.7% -> 0%
Spontaneous interview: 31.3% -> 13.5%

The 13.5% remaining on the interview is architectural: sustained English passages where no French token appeared in top-64 at all. The blocklist cannot help there.

Scope: French/English only. The mechanism generalises but building equivalent blocklists for other pairs requires vocabulary analysis and speakers who can evaluate the output. Leaving that to contributors who know those languages.
Hi @Alex-Wengg and @BrandonWeng! I wanted to flag something about the failing CI check.

The Japanese ASR Benchmark (CTC) job is marked as failed, but the benchmark output in the log shows that it passed cleanly. The job failure appears to come from a Node.js 20 deprecation warning in the GitHub Actions runner setup, unrelated to this patch.

The English blocklist is applied to the TDT decoder to prevent it from drifting into English on non-English recordings. It has no effect on Japanese or any other language path.

Happy to rerun the job if that helps move the review forward. Let me know if you need anything else from me!
Parakeet TDT v3 has no language conditioning token. On spontaneous non-English speech, the model falls back to its English training prior and produces output that reads like a real-time translation. This is a known architectural limitation.
This PR adds a post-processing step in the decoder that partially mitigates it for French.
Mechanism
When the joint model picks a token from an English-exclusive blocklist (48 space-prefixed SentencePiece tokens: `' the'`, `' and'`, `' with'`, etc.), `applyEnglishBlocklist` scans the top-64 logits for the highest-scoring non-blocked Latin-script alternative and substitutes it. If nothing suitable is in the top-64, the original prediction stands.

The substituted score is fed back into the transducer's LSTM state, which compounds: each forced French token nudges the following frames away from English.
Runs in O(64) per non-blank token. No additional model call, no latency impact.
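The substitution step can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the code in this PR: the function name `apply_english_blocklist`, the `blank_id` parameter, the three-entry blocklist, and the "every letter is Latin" suitability test are all assumptions made for the example. The score is computed as a softmax over the top-K logits, as the commit summary describes.

```python
import math
import unicodedata

# Illustrative subset only: the PR's blocklist has 48 entries, of which the
# description names ' the', ' and' and ' with'.
ENGLISH_BLOCKLIST = {" the", " and", " with"}


def is_latin_script(piece):
    """Assumed suitability test: at least one letter, and every letter is Latin."""
    letters = [ch for ch in piece if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("LATIN") for ch in letters
    )


def apply_english_blocklist(logits, vocab, blocklist=ENGLISH_BLOCKLIST,
                            blank_id=None, top_k=64):
    """Return (token_id, score, substituted) for one non-blank prediction."""
    # Candidate ids, best first, limited to the top-K logits.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    winner = ranked[0]

    # Softmax restricted to the top-K candidates (shifted by the peak for stability).
    peak = logits[winner]
    weights = {i: math.exp(logits[i] - peak) for i in ranked}
    total = sum(weights.values())

    if vocab[winner] in blocklist:
        for i in ranked[1:]:
            if i != blank_id and vocab[i] not in blocklist and is_latin_script(vocab[i]):
                return i, weights[i] / total, True
    # Winner not blocked, or no suitable alternative in the top-K: keep it.
    return winner, weights[winner] / total, False
```

The sketch stops at choosing the token and its score. Feeding the result back into the decoder state is specific to the transducer implementation and is not modelled here.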
Benchmark
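French recordings, drift = % of sentences flagged by an English function-word scan or by langdetect=EN:

| Recording | Drift before | Drift after |
| --- | --- | --- |
| Children weather segment | 0% | 0% |
| Archival 1912 | 2.9% | 2.7% |
| Weightlifting documentary | 7.1% | 0% |
| French slang documentary | 18.2% | 3.7% |
| Picard dialect doc | 16.7% | 0% |
| Spontaneous interview | 31.3% | 13.5% |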
The remaining 13.5% on the interview is architectural: sustained English passages where no French candidate appeared in top-64 at all. The blocklist cannot help there.
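For readers who want to reproduce a comparable number, the drift metric can be approximated as below. This is a rough sketch and not the script used for the benchmark: the sentence splitter, the word list (only "the", "and" and "with" come from this PR; the rest are placeholders) and the optional `detect_language` callback are assumptions. A detector such as `detect` from the `langdetect` package, which returns codes like `"en"`, could be passed in for the second criterion.

```python
import re

# Placeholder list: "the", "and", "with" appear in the PR text, the others are
# hypothetical additions for the example.
ENGLISH_FUNCTION_WORDS = {"the", "and", "with", "this", "which"}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_flagged(sentence, detect_language=None):
    """Flag a sentence if it contains an English function word or is detected as English."""
    words = set(re.findall(r"\w+", sentence.lower()))
    if words & ENGLISH_FUNCTION_WORDS:
        return True
    return detect_language is not None and detect_language(sentence) == "en"


def drift_percentage(text, detect_language=None):
    """Percentage of sentences flagged as English drift (0.0 for empty input)."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    flagged = sum(is_flagged(s, detect_language) for s in sentences)
    return 100.0 * flagged / len(sentences)
```

Comparing `drift_percentage` on the transcript before and after the patch gives one before/after pair of the kind reported above.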
Test recordings
All INA recordings are publicly available on YouTube:
The sixth recording (57-minute spontaneous interview) is private.
Scope
This patch is French/English specific. The 48-token blocklist was derived empirically from the Parakeet TDT v3 SentencePiece vocabulary for French. The mechanism generalises to other language pairs but building equivalent blocklists requires the same vocabulary analysis and, critically, someone who can actually evaluate the output quality. I only speak French, so I only benchmarked French. I'm not going to claim results for languages I can't verify.
Flagging this as a starting point and leaving the generalisation to contributors who know the other languages and have audio to test against.
Context
I'm not a professional developer. I found this problem while building a personal project on top of FluidAudio and did the diagnostic work: instrumented the decoder to log English token wins, measured that 84% of them had a French alternative in the top-64, ran before/after benchmarks across six recordings.
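The diagnostic described above amounts to counting, over all decoded frames, how often a blocklisted token wins and how often a usable alternative sits in the top-64. A minimal Python sketch of that count follows. It is not the instrumentation actually used: the function name `english_win_stats` and the `is_alternative` predicate are hypothetical, and the predicate stands in for whatever test decides that a candidate counts as a French alternative.

```python
def english_win_stats(frames, vocab, blocklist, is_alternative, top_k=64):
    """Return (wins, rescuable) over an iterable of per-frame logit lists.

    wins      = frames whose top-scoring token is on the blocklist
    rescuable = wins with at least one acceptable alternative in the top-K
    """
    wins = 0
    rescuable = 0
    for logits in frames:
        # Candidate ids for this frame, best first, limited to the top-K.
        ranked = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:top_k]
        if vocab[ranked[0]] not in blocklist:
            continue  # a non-blocked token won, nothing to log
        wins += 1
        if any(vocab[i] not in blocklist and is_alternative(vocab[i]) for i in ranked[1:]):
            rescuable += 1
    return wins, rescuable
```

The ratio `rescuable / wins` is the kind of figure reported as 84% above, under whatever definition of "alternative" the predicate encodes.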