
Zipformer(Streaming) Performance in Noisy Conditions #2073

Description

@iithgkb

I am comparing two ASR models: Conformer (74M parameters, 4k vocab) and Zipformer (73M parameters, 4k vocab, default model). Both were trained on the same 100K‑hour dataset with identical noisy data augmentations. I evaluated both models on 15 test datasets, most of which are real‑world recordings.
In clean conditions, Zipformer performs better than Conformer.
However, in noisy conditions, Zipformer underperforms on 6 of the datasets, showing an average 9% relative degradation compared to Conformer. These noisy datasets are internally collected (except Libri‑other), and the average test‑utterance duration is around 15 seconds.
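For anyone reproducing the comparison, the relative degradation figure can be computed as in the sketch below. It assumes the average is the mean of per-dataset relative WER changes, which is one possible reading of the number above; the function names are illustrative and the WER values are hypothetical placeholders, not the actual results.

```python
def relative_wer_change(wer_baseline: float, wer_candidate: float) -> float:
    """Relative WER change in percent; positive means the candidate is worse."""
    return 100.0 * (wer_candidate - wer_baseline) / wer_baseline


def mean_relative_change(pairs: list[tuple[float, float]]) -> float:
    """Mean of per-dataset relative changes over (baseline, candidate) WER pairs."""
    changes = [relative_wer_change(base, cand) for base, cand in pairs]
    return sum(changes) / len(changes)


# Hypothetical placeholder WERs (Conformer, Zipformer), NOT measured results.
example_pairs = [(10.0, 11.0), (20.0, 21.0)]
print(f"{mean_relative_change(example_pairs):.1f}% relative degradation")
```

Averaging per-dataset changes weights every test set equally; pooling errors across all utterances would instead weight the larger test sets more heavily, so the two conventions can give different numbers.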
One key difference during training is that for Zipformer I used feature‑level noise augmentation (default) instead of waveform‑level augmentation, which was used for Conformer. I am wondering whether the feature‑level augmentation is contributing to the performance drop in noisy scenarios.
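To make the distinction concrete, here is a minimal NumPy sketch of the two mixing styles. It is not the exact pipeline used for either model: `mix_at_snr` and `mix_log_power_features` are illustrative names, and the feature-level version assumes natural-log power features (for example log-mel energies) and simply adds energies in the linear domain, which drops the phase-dependent cross-term that waveform mixing keeps.

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Waveform-level mixing: add noise to speech at a target SNR in dB.

    Assumes both signals are 1-D float arrays and the noise is not silent.
    """
    # Tile the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose gain so that speech_power / (gain**2 * noise_power) == 10**(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise


def mix_log_power_features(speech_feats: np.ndarray, noise_feats: np.ndarray) -> np.ndarray:
    """Feature-level approximation: log(exp(speech) + exp(noise)) per bin.

    Assumes natural-log power features of identical shape; the cross-term
    between speech and noise is ignored.
    """
    return np.logaddexp(speech_feats, noise_feats)


# Usage with synthetic signals (1 second at 16 kHz).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech = 0.5 * np.sin(2 * np.pi * 220.0 * t)
noise = rng.standard_normal(8000)
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

If the two models were exposed to different effective noise characteristics because of this difference, training Zipformer with waveform-level mixing would be a direct way to test whether the augmentation level explains the gap.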
Both models were trained with a 24‑second max‑utterance limit.
Has anyone else experienced similar accuracy issues with Zipformer on noisy datasets?
Any suggestions for improving noisy‑condition performance would be helpful. I can try them and report back.
Below I have pasted how the clean, noisy, and SpecAugment-processed features look when augmentation is applied at the feature level during training. I tried training with and without SpecAugment after adding the noise; results are slightly better with SpecAugment.
Decoding setup for both models: greedy search, chunk size 320 ms, left-context 4.

[Image: clean, noisy, and SpecAugment-applied training features]
