I am comparing two ASR models: Conformer (74M parameters, 4k vocab) and Zipformer (73M parameters, 4k vocab, default model). Both were trained on the same 100K‑hour dataset with identical noisy data augmentations. I evaluated both models on 15 test datasets, most of which are real‑world recordings.
In clean conditions, Zipformer performs better than Conformer.
However, in noisy conditions, Zipformer underperforms on 6 of the datasets, showing an average 9% relative degradation compared to Conformer. These noisy datasets are internally collected (except Libri‑other), and the average test‑utterance duration is around 15 seconds.
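For anyone reproducing the comparison, here is a minimal sketch of how an average relative degradation like this could be computed. It assumes the metric is WER and that the average is the mean of per-dataset relative changes rather than a pooled figure. The function names are hypothetical, and the numbers in the usage note are placeholders, not the actual results.

```python
def relative_wer_change(baseline_wer: float, candidate_wer: float) -> float:
    """Relative WER change of the candidate vs. the baseline, in percent.

    Positive values mean the candidate is worse than the baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (candidate_wer - baseline_wer) / baseline_wer


def mean_relative_change(pairs):
    """Mean of per-dataset relative changes.

    `pairs` is a list of (baseline_wer, candidate_wer) tuples,
    one tuple per test set (assumption: unweighted mean).
    """
    changes = [relative_wer_change(b, c) for b, c in pairs]
    return sum(changes) / len(changes)
```

For example, `mean_relative_change([(10.0, 11.0), (20.0, 21.0)])` averages a 10% and a 5% relative degradation, with Conformer as the baseline and Zipformer as the candidate.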
One key difference during training is that for Zipformer I used feature‑level noise augmentation (default) instead of waveform‑level augmentation, which was used for Conformer. I am wondering whether the feature‑level augmentation is contributing to the performance drop in noisy scenarios.
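To make the distinction concrete, here is a rough sketch of the two mixing styles. It is not the exact code from either recipe: `mix_waveform` and `mix_log_energies` are hypothetical helpers, and the feature-level version assumes log-energy (fbank-like) features that are mixed by summing energies per bin. Waveform-level mixing adds samples, so the speech/noise cross term (phase interaction) ends up in the features; summing energies at the feature level leaves that term out.

```python
import math


def mix_waveform(speech, noise, snr_db):
    """Waveform-level mixing: add noise samples scaled to the target SNR (dB)."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # Choose gain so that speech_power / (gain**2 * noise_power) == 10**(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def mix_log_energies(speech_log_e, noise_log_e, noise_energy_gain):
    """Feature-level mixing (assumed form): sum energies per bin, then take the log.

    Inputs are natural-log energies for one frame; the cross term between
    speech and noise is not modelled here.
    """
    return [
        math.log(math.exp(a) + noise_energy_gain * math.exp(b))
        for a, b in zip(speech_log_e, noise_log_e)
    ]
```

As a toy case, a speech sample of 1.0 plus a noise sample of 1.0 has energy (1 + 1)^2 = 4 after waveform mixing, while summing the two energies gives 2; with opposite signs the waveform sum cancels to 0 but the energy sum is still 2. Real filterbank bins average over many samples, so the gap is usually smaller, but it is one way the two pipelines can produce different noisy features.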
Both models were trained with a 24‑second max‑utterance limit.
Has anyone else experienced similar accuracy issues with Zipformer on noisy datasets?
Any suggestions for improving noisy‑condition performance would be helpful. I can try them and report back.
Below I have pasted how the clean, noisy, and SpecAugment variants are applied at the feature level to the training data. I tried training with and without SpecAugment after adding the noise; results are slightly better with SpecAugment.
Decoding setup for both models: greedy search, chunk size: 320 ms, left-context: 4
