Hi icefall team,
First of all, thank you for the great work on icefall and especially the Zipformer recipes. The modular design and Lhotse integration make it incredibly practical for adapting to new languages like Persian.
I am currently training a Persian (fa) ASR model using egs/commonvoice/ASR/zipformer as the base recipe, and I would really appreciate your insight on a few architectural and hyperparameter decisions.
Goal
My target is on-device inference on a low-power ARM CPU with limited memory bandwidth.
Priorities:
- Small model footprint
- Low latency
- Robustness to noisy home environments
Data & Preparation
- ~400 hours Persian speech (CommonVoice)
- Speed perturbation (0.9×, 1.1×, plus the original speed) → ~1200 h effective audio
- MUSAN noise/reverb augmentation enabled
- BPE size: 500
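The ~1200 h figure can be checked with a short calculation. This is a minimal sketch, assuming the original-speed copy is kept alongside the 0.9× and 1.1× copies and that playing audio at speed s scales its duration by 1/s; the helper name `effective_hours` is only illustrative.

```python
def effective_hours(base_hours, speed_factors):
    """Total audio duration after speed perturbation.

    Assumption: each factor produces one full copy of the data, and
    a copy played at speed s lasts base_hours / s.
    """
    return sum(base_hours / s for s in speed_factors)


# 400 h of source audio at speeds 1.0 (original), 0.9 and 1.1
total = effective_hours(400.0, [1.0, 0.9, 1.1])
print(f"effective audio: {total:.0f} h")
```

The slower copy is longer than the faster copy is short, so the total lands slightly above three times the base duration.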
Current “Small” Configuration (~18M parameters)
--encoder-dim "192, 256, 384, 256, 192"
--num-encoder-layers "2, 2, 3, 2, 2"
--feedforward-dim "512, 768, 1024, 768, 512"
--nhead "4, 4, 4, 4, 4"
--chunk-size 16
--left-context 64
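For reference, the configuration above can be sanity-checked and its streaming parameters converted to time. This is a rough sketch, assuming that chunk size and left context are counted in frames at a 50 Hz frame rate (20 ms per frame); that frame duration and the helper name `parse_int_list` are assumptions for illustration, not values taken from the recipe.

```python
# Per-stack values copied from the configuration above.
config = {
    "encoder-dim": "192, 256, 384, 256, 192",
    "num-encoder-layers": "2, 2, 3, 2, 2",
    "feedforward-dim": "512, 768, 1024, 768, 512",
    "nhead": "4, 4, 4, 4, 4",
}
CHUNK_SIZE = 16      # frames
LEFT_CONTEXT = 64    # frames
FRAME_MS = 20        # assumption: 50 Hz frame rate for both values


def parse_int_list(value):
    """Turn a comma-separated string such as "2, 2, 3" into [2, 2, 3]."""
    return [int(x) for x in value.split(",")]


stacks = {name: parse_int_list(v) for name, v in config.items()}

# Every per-stack option should describe the same number of stacks.
stack_counts = {len(v) for v in stacks.values()}
consistent = len(stack_counts) == 1
total_layers = sum(stacks["num-encoder-layers"])

chunk_ms = CHUNK_SIZE * FRAME_MS
left_context_ms = LEFT_CONTEXT * FRAME_MS
left_context_chunks = LEFT_CONTEXT // CHUNK_SIZE

print(f"stacks consistent: {consistent}, total layers: {total_layers}")
print(f"chunk: {chunk_ms} ms, left context: {left_context_ms} ms "
      f"({left_context_chunks} chunks)")
```

Under the 20 ms assumption, each chunk covers 320 ms of audio and the left context spans 1.28 s, which is the quantity that drives the cached state size during streaming inference.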
Questions
1) Architecture choice
For a small ARM CPU target, is Zipformer’s hierarchical structure expected to be significantly more cache-friendly during ONNX/NCNN inference compared to a flatter stateless7 (Conv/Emformer-style) model?
2) Downsampling strategy
Would you recommend more aggressive downsampling_factor early in the network (e.g., 2 or 4) to reduce the sequence length seen by attention layers on CPU, or is it better to keep early layers at higher resolution for phonetic fidelity?
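To make the trade-off in this question concrete, here is a back-of-the-envelope sketch of how downsampling changes the sequence length and the cost of computing attention scores. It assumes a 50 Hz frame rate at the encoder input and that attention score computation scales with the square of the sequence length; both function names are hypothetical, and real inference cost also depends on feedforward and convolution modules that scale only linearly.

```python
def frames_after_downsampling(seconds, factor, base_hz=50):
    """Number of frames an encoder stack sees for `seconds` of audio.

    Assumption: the encoder input runs at `base_hz` frames per second.
    """
    return int(seconds * base_hz) // factor


def relative_attention_cost(factor):
    """Attention-score cost relative to no downsampling.

    Queries and keys both shrink by `factor`, so the score matrix
    shrinks by factor squared.
    """
    return 1.0 / factor ** 2


for factor in (1, 2, 4):
    frames = frames_after_downsampling(10, factor)
    cost = relative_attention_cost(factor)
    print(f"factor {factor}: {frames} frames per 10 s, "
          f"attention cost x{cost:.4f}")
```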
3) BPE size
Is BPE 500 too large for this model size?
Would reducing it to ~250–300 noticeably improve Joiner/Decoder speed for CPU inference?
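As a rough reference for this question, the size of the joiner's output projection can be estimated directly from the vocabulary size. This sketch assumes the projection is a single linear layer from a joiner dimension of 512 to the vocabulary; the value 512 and the function name are assumptions for illustration and should be replaced with the values actually used in training.

```python
JOINER_DIM = 512            # assumption: not stated in the post
TOTAL_PARAMS = 18_000_000   # approximate model size from the post


def output_projection_params(vocab_size, joiner_dim=JOINER_DIM):
    """Weights plus biases of a linear layer joiner_dim -> vocab_size."""
    return joiner_dim * vocab_size + vocab_size


for vocab in (500, 300, 250):
    params = output_projection_params(vocab)
    share = 100.0 * params / TOTAL_PARAMS
    print(f"BPE {vocab}: {params} params ({share:.2f}% of ~18M)")
```

Under these assumptions the projection is a small fraction of the total parameter count, so any speed-up from a smaller vocabulary would come mainly from the per-step joiner and softmax work rather than from model size.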
4) Latency vs. parameter count
If I scale the model toward ~30–40M parameters (larger encoder-dim), is there a potential “sweet spot” where the model becomes more confident and reduces decoding/beam overhead enough to offset the larger matrix multiplications?
5) Streaming vs. non-streaming
For this type of hardware, would you recommend:
- Strict streaming (causal) Zipformer, or
- Non-streaming Zipformer with chunked attention as a compromise?
6) Decoder choice
Since this CPU has relatively slow memory access, would a simpler stateless decoder be noticeably faster than the default RNN-T decoder in this scenario?
I would also greatly appreciate any suggestions on how to scale the Zipformer stacks or redistribute layers across them to better fit the small cache and limited memory bandwidth of such CPUs.
Thank you again for this excellent toolkit.
Best regards