Guidance on Zipformer configuration for low-power ARM CPU (Persian ASR, ~400h data)

Hi icefall team,

First of all, thank you for the great work on icefall and especially the Zipformer recipes. The modular design and Lhotse integration make it incredibly practical for adapting to new languages like Persian.

I am currently training a Persian (fa) ASR model using `egs/commonvoice/ASR/zipformer` as the base recipe, and I would really appreciate your insight on a few architectural and hyperparameter decisions.

---

### Goal

My target is **on-device inference** on a **low-power ARM CPU** with limited memory bandwidth.

**Priorities:**

* Small model footprint
* Low latency
* Robustness to noisy home environments

---

### Data & Preparation

* ~400 hours Persian speech (CommonVoice)
* Speed perturbation (factors 0.9 and 1.1, alongside the original audio) → ~1200h effective audio
* MUSAN noise/reverb augmentation enabled
* BPE size: 500
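
As a rough check of the ~1200h figure, here is a minimal sketch of the arithmetic. It assumes the original audio is kept next to one perturbed copy per factor, and that a speed factor `f` scales duration by `1/f`; the helper name `effective_hours` is only for illustration.

```python
def effective_hours(base_hours, speed_factors):
    """Total duration when the original data is kept alongside
    one speed-perturbed copy per factor."""
    # Playing audio at speed factor f changes its duration by 1/f.
    return base_hours + sum(base_hours / f for f in speed_factors)


total = effective_hours(400.0, [0.9, 1.1])
print(f"{total:.1f} h")
```

The slower 0.9 copy is slightly longer than the 1.1 copy is shorter, so the total lands a little above three times the base amount.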

---

### Current “Small” Configuration (~18M parameters)

```
--encoder-dim "192, 256, 384, 256, 192"
--num-encoder-layers "2, 2, 3, 2, 2"
--feedforward-dim "512, 768, 1024, 768, 512"
--nhead "4, 4, 4, 4, 4"
--chunk-size 16
--left-context 64
```
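
For reference, a small sanity check of the values above. It only verifies that every per-stack option lists the same number of stacks, and converts the chunk and left-context sizes to milliseconds. The 20 ms frame shift at the encoder input is my assumption (50 Hz after the convolutional frontend), and the dictionary keys are just labels copied from the options above.

```python
FRAME_SHIFT_MS = 20.0  # assumption: encoder input runs at 50 Hz

config = {
    "encoder-dim": "192, 256, 384, 256, 192",
    "num-encoder-layers": "2, 2, 3, 2, 2",
    "feedforward-dim": "512, 768, 1024, 768, 512",
    "nhead": "4, 4, 4, 4, 4",
}


def parse_per_stack(value):
    """Turn a comma-separated option string into a tuple of ints."""
    return tuple(int(item) for item in value.split(","))


per_stack = {name: parse_per_stack(value) for name, value in config.items()}

# Every per-stack option should describe the same number of stacks.
stack_counts = {len(values) for values in per_stack.values()}
assert len(stack_counts) == 1, f"inconsistent stack counts: {stack_counts}"

num_stacks = stack_counts.pop()
total_layers = sum(per_stack["num-encoder-layers"])
chunk_ms = 16 * FRAME_SHIFT_MS          # --chunk-size 16
left_context_ms = 64 * FRAME_SHIFT_MS   # --left-context 64

print(f"stacks={num_stacks}, layers={total_layers}")
print(f"chunk={chunk_ms:.0f} ms, left context={left_context_ms:.0f} ms")
```

Under that frame-shift assumption, the chunk covers 320 ms of audio and the left context 1.28 s.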

---

### Questions

**1) Architecture choice**

For a small ARM CPU target, is Zipformer’s hierarchical structure expected to be significantly more cache-friendly during ONNX/NCNN inference compared to a flatter `stateless7` (Conv/Emformer-style) model?

---

**2) Downsampling strategy**

Would you recommend more aggressive `downsampling_factor` early in the network (e.g., 2 or 4) to reduce the sequence length seen by attention layers on CPU, or is it better to keep early layers at higher resolution for phonetic fidelity?
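
To make the trade-off concrete, here is a back-of-the-envelope sketch of how the per-stack frame rate and the attention workload shrink with the downsampling factor. The 50 Hz encoder input rate, the assumption that chunk and left context shrink by the same factor inside each stack, and the factor tuple `(1, 2, 4, 2, 1)` are all hypothetical choices of mine, not values taken from the recipe.

```python
ENCODER_INPUT_HZ = 50.0   # assumption: frame rate entering the first stack
CHUNK_FRAMES = 16         # from my config
LEFT_CONTEXT_FRAMES = 64  # from my config


def stack_frame_rate(factor, input_hz=ENCODER_INPUT_HZ):
    """Frames per second processed inside a stack with this downsampling factor."""
    return input_hz / factor


def attention_scores_per_second(factor):
    """Query-key score computations per second of audio, per head.

    Assumes each query attends to (chunk + left context) keys, with both
    counted at the stack's own (downsampled) frame rate.
    """
    keys_per_query = (CHUNK_FRAMES + LEFT_CONTEXT_FRAMES) / factor
    return stack_frame_rate(factor) * keys_per_query


factors = (1, 2, 4, 2, 1)  # hypothetical five-stack setting
rates = [stack_frame_rate(f) for f in factors]
costs = [attention_scores_per_second(f) for f in factors]

for f, rate, cost in zip(factors, rates, costs):
    print(f"factor={f}: {rate:.1f} frames/s, {cost:.0f} scores/s per head")
```

In this simplified model the attention workload falls with the square of the factor, while the feedforward and convolution work would only fall linearly.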

---

**3) BPE size**

Is BPE 500 too large for this model size?
Would reducing it to ~250–300 noticeably improve Joiner/Decoder speed for CPU inference?
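
For scale, a minimal sketch of how the joiner's output projection grows with the vocabulary. It counts only the multiply-accumulates of the final linear layer for one (encoder frame, hypothesis) pair, and the joiner dimension of 512 is an assumption on my side rather than a value I have confirmed for this setup.

```python
JOINER_DIM = 512  # assumption: width of the joiner's hidden representation


def joiner_output_macs(vocab_size, joiner_dim=JOINER_DIM):
    """Multiply-accumulates of the joiner's output linear layer
    for a single (encoder frame, hypothesis) pair."""
    return joiner_dim * vocab_size


macs = {vocab: joiner_output_macs(vocab) for vocab in (500, 300, 250)}

for vocab, count in macs.items():
    saving = 1.0 - count / macs[500]
    print(f"vocab={vocab}: {count} MACs per joiner call ({saving:.0%} fewer than 500)")
```

The saving is linear in the vocabulary size, so what I am unsure about is whether this layer is a meaningful share of total inference time compared with the encoder.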

---

**4) Latency vs. parameter count**

If I scale the model toward ~30–40M parameters (larger `encoder-dim`), is there a potential “sweet spot” where the model becomes more confident and reduces decoding/beam overhead enough to offset the larger matrix multiplications?

---

**5) Streaming vs. non-streaming**

For this type of hardware, would you recommend:

* Strict streaming (causal) Zipformer, or
* Non-streaming Zipformer with chunked attention as a compromise?

---

**6) Decoder choice**

Since this CPU has relatively slow memory access, would a simpler stateless decoder be noticeably faster than the default RNN-T decoder in this scenario?

---

I would also greatly appreciate any suggestions on how to scale the Zipformer stacks, or distribute layers across them, to fit the small cache and limited memory bandwidth of such CPUs.

Thank you again for this excellent toolkit.

Best regards
