fix(asr): reduce English drift on French recordings via token blocklist #630

Open
Matth-93 wants to merge 1 commit into FluidInference:main from Matth-93:fix/parakeet-french-english-blocklist

Parakeet TDT v3 has no language conditioning token. On spontaneous non-English speech, the model falls back to its English training prior and produces output that reads like a real-time translation. This is a known architectural limitation.

This PR adds a post-processing step in the decoder that partially mitigates it for French.

Mechanism

When the joint model picks a token from an English-exclusive blocklist (48 space-prefixed SentencePiece tokens: ' the', ' and', ' with', etc.), applyEnglishBlocklist scans the top-64 logits for the highest-scoring non-blocked Latin-script alternative and substitutes it. If nothing suitable is in top-64, the original prediction stands.

The substituted score is fed back into the transducer's LSTM state, which compounds: each forced French token nudges the following frames away from English.

The scan inspects at most 64 candidates per non-blank token and makes no additional model call, so the added latency is negligible.
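To make the substitution rule concrete, here is a minimal Python sketch of the step described above. It is not the code in this PR: the function name, the `is_latin` predicate, and the plain-list logits are assumptions made for illustration.

```python
import heapq
from typing import AbstractSet, Callable, Sequence


def apply_english_blocklist(
    predicted: int,
    logits: Sequence[float],
    blocklist: AbstractSet[int],
    is_latin: Callable[[int], bool],  # hypothetical predicate: token id -> Latin-script?
    top_k: int = 64,
) -> int:
    """Return a substitute token id, or the original prediction if none qualifies."""
    # Tokens outside the blocklist pass through untouched.
    if predicted not in blocklist:
        return predicted

    # Token ids of the top_k highest logits, best first.
    ranked = heapq.nlargest(top_k, range(len(logits)), key=logits.__getitem__)

    # Take the highest-scoring candidate that is neither blocked nor non-Latin.
    for candidate in ranked:
        if candidate not in blocklist and is_latin(candidate):
            return candidate

    # Nothing suitable in the top_k: the original prediction stands.
    return predicted
```

The sketch selects the top 64 from the full logit vector for simplicity. A decoder that already has the top-64 candidates at hand would only need the final loop.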

Benchmark

French recordings, drift = % sentences flagged by English function-word scan or langdetect=EN:

| Recording | Before | After |
| --- | --- | --- |
| Children's weather segment | 0% | 0% |
| Archival 1912 | 2.9% | 2.7% |
| Weightlifting documentary | 7.1% | 0% |
| French slang documentary | 18.2% | 3.7% |
| Picard dialect documentary | 16.7% | 0% |
| Spontaneous interview (57 min) | 31.3% | 13.5% |

The remaining 13.5% on the interview is architectural: sustained English passages where no French candidate appeared in top-64 at all. The blocklist cannot help there.
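For readers who want to reproduce a metric of this shape, the following Python sketch computes a drift percentage as defined above: the share of sentences flagged by an English function-word scan or by a language detector. The word list is a deliberately tiny placeholder, not the list used for the benchmark, and `detect_language` is a hypothetical callable (a detector such as langdetect's `detect` could be passed in).

```python
import re
from typing import Callable, Iterable, Optional

# Placeholder list for illustration only; the benchmark's actual list is not shown in this PR.
ENGLISH_FUNCTION_WORDS = {"the", "and", "with", "is", "that", "this"}


def drift_percent(
    sentences: Iterable[str],
    detect_language: Optional[Callable[[str], str]] = None,  # hypothetical detector
) -> float:
    """Percentage of sentences flagged as English by either check."""
    sentences = list(sentences)
    if not sentences:
        return 0.0

    flagged = 0
    for sentence in sentences:
        # Unicode-aware word split, so "thé" is not mistaken for "the".
        words = set(re.findall(r"[^\W\d_]+", sentence.lower()))
        has_english_word = bool(words & ENGLISH_FUNCTION_WORDS)
        detected_english = (
            detect_language is not None and detect_language(sentence) == "en"
        )
        if has_english_word or detected_english:
            flagged += 1

    return 100.0 * flagged / len(sentences)
```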

Test recordings

All INA recordings are publicly available on YouTube:

The sixth recording (57-minute spontaneous interview) is private.

Scope

This patch is French/English specific. The 48-token blocklist was derived empirically from the Parakeet TDT v3 SentencePiece vocabulary for French. The mechanism generalises to other language pairs but building equivalent blocklists requires the same vocabulary analysis and, critically, someone who can actually evaluate the output quality. I only speak French, so I only benchmarked French. I'm not going to claim results for languages I can't verify.

Flagging this as a starting point and leaving the generalisation to contributors who know the other languages and have audio to test against.

Context

I'm not a professional developer. I found this problem while building a personal project on top of FluidAudio and did the diagnostic work: instrumented the decoder to log English token wins, measured that 84% of them had a French alternative in the top-64, ran before/after benchmarks across six recordings.
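The diagnostic measurement can be pictured with a short Python sketch. This is an assumed reconstruction, not the instrumentation actually used: the `frames` log format and the `is_candidate` predicate are hypothetical.

```python
from typing import AbstractSet, Callable, Iterable, Sequence, Tuple


def alternative_coverage(
    frames: Iterable[Tuple[int, Sequence[int]]],  # (winning id, ids ranked best first)
    blocklist: AbstractSet[int],
    is_candidate: Callable[[int], bool],  # hypothetical: is this id an acceptable alternative?
    top_k: int = 64,
) -> Tuple[int, int]:
    """Count blocklisted wins, and how many had an alternative in the top_k."""
    wins = 0
    with_alternative = 0
    for winner, ranked in frames:
        # Only frames where a blocklisted token won are of interest.
        if winner not in blocklist:
            continue
        wins += 1
        if any(t not in blocklist and is_candidate(t) for t in ranked[:top_k]):
            with_alternative += 1
    return wins, with_alternative
```

Dividing the second count by the first gives the share of English token wins that the blocklist could in principle redirect.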

When transcribing non-English Latin-script audio, Parakeet TDT v3 has no
language conditioning token and falls back to its English prior on
acoustically ambiguous frames, producing output that reads like a live
translation. This adds a post-processing step in the decoder that partially
mitigates it for French.

Every time the joint model picks a token from an English-exclusive blocklist
(48 space-prefixed SentencePiece tokens impossible in French prose), scan the
top-64 logits for the highest-scoring non-blocked Latin-script alternative and
substitute it. If nothing suitable appears in top-64, the original prediction
stands. The substituted score is recalculated via softmax over top-K and fed
back into the LSTM, compounding the effect across subsequent frames.

Benchmarks on six French recordings (drift = % sentences with English tokens
or langdetect=EN):

  Children weather segment   0%   -> 0%
  Archival 1912              2.9% -> 2.7%
  Weightlifting documentary  7.1% -> 0%
  French slang documentary  18.2% -> 3.7%
  Picard dialect doc        16.7% -> 0%
  Spontaneous interview     31.3% -> 13.5%

The 13.5% remaining on the interview is architectural: sustained English
passages where no French token appeared in top-64 at all. The blocklist
cannot help there.

Scope: French/English only. The mechanism generalises but building equivalent
blocklists for other pairs requires vocabulary analysis and speakers who can
evaluate the output. Leaving that to contributors who know those languages.

Matth-93 marked this pull request as ready for review May 20, 2026 12:59

Matth-93 commented May 21, 2026

Hi @Alex-Wengg and @BrandonWeng! Wanted to flag something about the failing CI check.

The Japanese ASR Benchmark (CTC) job is marked as failed, but looking at the actual benchmark output in the log, it passed cleanly:

CER: 9.07%
Status: Passed

The job failure appears to stem from a Node.js 20 deprecation warning in the GitHub Actions runner setup and is unrelated to this patch. The English blocklist is applied in the TDT decoder to keep it from drifting into English on non-English recordings. It has no effect on the Japanese path or any other language path.

Happy to rerun the job if that helps move the review forward. Let me know if you need anything else from me!
