fix(asr): reduce English drift on French recordings via token blocklist by Matth-93 · Pull Request #630 · FluidInference/FluidAudio

Matth-93 · 2026-05-20T12:49:10Z

Parakeet TDT v3 has no language conditioning token. On spontaneous non-English speech, the model falls back to its English training prior and produces output that reads like a real-time translation. This is a known architectural limitation.

This PR adds a post-processing step in the decoder that partially mitigates it for French.

Mechanism

When the joint model picks a token from an English-exclusive blocklist (48 space-prefixed SentencePiece tokens: ' the', ' and', ' with', etc.), applyEnglishBlocklist scans the top-64 logits for the highest-scoring non-blocked Latin-script alternative and substitutes it. If nothing suitable is in top-64, the original prediction stands.

The substituted score is fed back into the transducer's LSTM state, which compounds: each forced French token nudges the following frames away from English.

The scan is a constant-size pass over at most 64 candidates per non-blank token. No additional model call, no latency impact.
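For readers who want to see the shape of the substitution step, here is a minimal Python sketch of the logic described above. It is not the Swift code in this PR. The function name `apply_english_blocklist`, the `is_latin` check, and the idea that `logits` covers non-blank tokens only are assumptions made for illustration. Only the top-64 window and the fall-through behaviour come from the description.

```python
import heapq
import unicodedata

TOP_K = 64  # window size taken from the description


def is_latin(piece: str) -> bool:
    """Assumed check: the piece has letters and all of them are Latin-script."""
    letters = [ch for ch in piece if ch.isalpha()]
    return bool(letters) and all("LATIN" in unicodedata.name(ch, "") for ch in letters)


def apply_english_blocklist(logits, vocab, blocklist, top_k=TOP_K):
    """Pick the token id to emit for one non-blank step.

    logits:    per-token scores (assumed to cover non-blank tokens only)
    vocab:     token id -> SentencePiece piece
    blocklist: set of blocked token ids
    """
    best = max(range(len(logits)), key=logits.__getitem__)
    if best not in blocklist:
        return best  # common case: the winner is not blocked
    # Walk the top-k candidates from highest to lowest score.
    ranked = heapq.nlargest(top_k, range(len(logits)), key=logits.__getitem__)
    for token_id in ranked:
        if token_id not in blocklist and is_latin(vocab[token_id]):
            return token_id
    return best  # nothing suitable in the window: keep the original prediction
```

With a toy vocabulary `["▁the", "▁le", "▁and", "▁et", "。"]` and ids 0 and 2 blocked, a step where `▁the` wins is redirected to the best remaining Latin-script piece, and a step where no such piece is inside the window is left untouched.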

Benchmark

Six French recordings. Drift = % of sentences flagged either by an English function-word scan or by langdetect=EN:

Recording	Before	After
Children's weather segment	0%	0%
Archival 1912	2.9%	2.7%
Weightlifting documentary	7.1%	0%
French slang documentary	18.2%	3.7%
Picard dialect documentary	16.7%	0%
Spontaneous interview (57 min)	31.3%	13.5%

The remaining 13.5% on the interview is architectural: sustained English passages where no French candidate appeared in top-64 at all. The blocklist cannot help there.
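The drift figure in the table can be approximated with a few lines of Python. This is a sketch of the function-word half of the metric only. The word list is a hypothetical three-word stand-in built from the tokens named earlier, the actual list used for the benchmark is not given here, and the langdetect=EN check is left out to keep the example dependency-free.

```python
import re

# Hypothetical marker list: the real function-word list is not published here.
ENGLISH_FUNCTION_WORDS = {"the", "and", "with"}


def is_flagged(sentence: str) -> bool:
    """True if the sentence contains at least one English function word."""
    words = re.findall(r"\w+", sentence.lower())
    return any(word in ENGLISH_FUNCTION_WORDS for word in words)


def drift_rate(sentences) -> float:
    """Percentage of sentences flagged as English drift."""
    if not sentences:
        return 0.0
    flagged = sum(is_flagged(s) for s in sentences)
    return 100.0 * flagged / len(sentences)
```

A whole-word match is used rather than a substring match so that French words containing these letters (for example "demande") are not counted.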

Test recordings

All INA recordings are publicly available on YouTube:

The sixth recording (57-minute spontaneous interview) is private.

Scope

This patch is French/English specific. The 48-token blocklist was derived empirically from the Parakeet TDT v3 SentencePiece vocabulary for French. The mechanism generalises to other language pairs but building equivalent blocklists requires the same vocabulary analysis and, critically, someone who can actually evaluate the output quality. I only speak French, so I only benchmarked French. I'm not going to claim results for languages I can't verify.

Flagging this as a starting point and leaving the generalisation to contributors who know the other languages and have audio to test against.

Context

I'm not a professional developer. I found this problem while building a personal project on top of FluidAudio and did the diagnostic work: I instrumented the decoder to log English token wins, measured that 84% of them had a French alternative in the top-64, and ran before/after benchmarks across six recordings.
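The kind of counting behind the 84% figure can be sketched as follows. This is an illustration in Python, not the instrumentation actually added to the decoder. The function name `alternative_rate` and the `is_acceptable` callback (standing in for "looks like a usable French candidate") are assumptions.

```python
import heapq


def alternative_rate(steps, blocklist, is_acceptable, top_k=64):
    """Count blocked-token wins and how many had a usable alternative.

    steps:         iterable of non-empty per-step logit lists
    blocklist:     set of blocked token ids
    is_acceptable: callback deciding whether a token id is a valid substitute
    Returns (wins, rescuable).
    """
    wins = 0
    rescuable = 0
    for logits in steps:
        ranked = heapq.nlargest(top_k, range(len(logits)), key=logits.__getitem__)
        if ranked[0] not in blocklist:
            continue  # the top token is not a blocked English token
        wins += 1
        # Look at the rest of the window for a non-blocked substitute.
        if any(t not in blocklist and is_acceptable(t) for t in ranked[1:]):
            rescuable += 1
    return wins, rescuable
```

Dividing `rescuable` by `wins` gives the share of English wins that the blocklist can redirect. The remainder corresponds to steps where the original prediction has to stand.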

When transcribing non-English Latin-script audio, Parakeet TDT v3 has no language conditioning token and falls back to its English prior on acoustically ambiguous frames, producing output that reads like a live translation. This adds a post-processing step in the decoder that partially mitigates it for French.

Every time the joint model picks a token from an English-exclusive blocklist (48 space-prefixed SentencePiece tokens impossible in French prose), scan the top-64 logits for the highest-scoring non-blocked Latin-script alternative and substitute it. If nothing suitable appears in top-64, the original prediction stands. The substituted score is recalculated via softmax over top-K and fed back into the LSTM, compounding the effect across subsequent frames.

Benchmarks on six French recordings (drift = % sentences with English tokens or langdetect=EN):

Children weather segment: 0% -> 0%
Archival 1912: 2.9% -> 2.7%
Weightlifting documentary: 7.1% -> 0%
French slang documentary: 18.2% -> 3.7%
Picard dialect doc: 16.7% -> 0%
Spontaneous interview: 31.3% -> 13.5%

The 13.5% remaining on the interview is architectural: sustained English passages where no French token appeared in top-64 at all. The blocklist cannot help there.

Scope: French/English only. The mechanism generalises but building equivalent blocklists for other pairs requires vocabulary analysis and speakers who can evaluate the output. Leaving that to contributors who know those languages.
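The "softmax over top-K" rescoring mentioned in the summary above can be illustrated with a short Python sketch. It shows one plausible reading of that sentence, namely normalising the substitute's logit against the other candidates in the window. The function name `top_k_softmax_score` is hypothetical, and how the PR's Swift code actually computes or uses the score is not shown here.

```python
import math


def top_k_softmax_score(logits, token_id, top_k=64):
    """Probability of token_id under a softmax restricted to the top-k logits."""
    ranked = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:top_k]
    if token_id not in ranked:
        raise ValueError("token is outside the top-k window")
    peak = logits[ranked[0]]  # subtract the max for numerical stability
    exps = {t: math.exp(logits[t] - peak) for t in ranked}
    return exps[token_id] / sum(exps.values())
```

Restricting the softmax to the window means the scores of the candidates inside it sum to 1, so the substitute's score stays comparable to the one the original winner would have had.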

Matth-93 · 2026-05-21T19:56:06Z

Hi @Alex-Wengg and @BrandonWeng! Wanted to flag something about the failing CI check.

The Japanese ASR Benchmark (CTC) job is marked as failed, but looking at the actual benchmark output in the log, it passed cleanly:

CER: 9.07%
Status: Passed

The job failure appears to be a Node.js 20 deprecation warning in the GitHub Actions runner setup, unrelated to this patch. The English blocklist is applied to the TDT decoder to prevent it from drifting into English on non-English recordings. It has no effect on Japanese or any other language path.

Happy to rerun the job if that helps move the review forward. Let me know if you need anything else from me!

Matth-93 marked this pull request as ready for review May 20, 2026 12:59

Matth-93 wants to merge 1 commit into FluidInference:main from Matth-93:fix/parakeet-french-english-blocklist
