Empty hyp when decoding zipformer streaming model

I am getting **empty hyps** when decoding my test set. I trained a zipformer model with `--causal 1`, which makes it streaming-capable.

Exact training command:
```shell
./zipformer/train.py \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp_run_1 \
  --bpe-model /data/speech/sai/zipformer_training/exp_run_1/lang_bpe_500/bpe.model \
  --causal 1 \
  --max-duration 3600
```

Then I decoded a test set with this command:
```shell
./zipformer/decode.py \
  --epoch 30 \
  --avg 15 \
  --exp-dir zipformer/exp_run_1 \
  --bpe-model /data/speech/sai/zipformer_training/exp_run_1/lang_bpe_500/bpe.model \
  --causal 1 \
  --chunk-size 32 \
  --left-context-frames 64 \
  --max-duration 3600 \
  --decoding-method modified_beam_search \
  --beam-size 4
```

But in my recogs file, I see a lot of **empty hyps**, for example:
```
844424930526921_229_f: ref=['मेनका', 'की', 'सुंदरता', 'से', 'मोहित', 'विश्वामित्र', 'ने', 'उससे', 'शारीरिक', 'संबंध', 'बनाए']
844424930526921_229_f: hyp=[]

844424930526970_229_f: ref=['जिसके', 'बाद', 'वह', 'काफी', 'जल', 'गया', 'और', 'चिल्लाने', 'लगा']
844424930526970_229_f: hyp=[]

844424930527095_261_f: ref=['पत्नी', 'सब्जी', 'लेने', 'में', 'इतना', 'मोलभाव', 'कर', 'रही', 'थी', 'कि', 'पति', 'परेशान', 'हो', 'गया']
844424930527095_261_f: hyp=[]

844424930527099_261_f: ref=['श्रीनगर', 'के', 'बाहरी', 'क्षेत्र', 'जीवान', 'में', 'महबूबा', 'को', 'समारोह', 'की', 'अध्यक्षता', 'करनी', 'थी']
844424930527099_261_f: hyp=[]

844424930527100_261_f: ref=['भोपाल', 'में', 'साकेत', 'नगर', 'निवासी', 'गीता', 'दुबे', 'का', 'देहांत', 'हो', 'गया']
844424930527100_261_f: hyp=[]

844424930527103_261_f: ref=['गामा', 'पहलवान', 'की', 'पुण्यतिथि', 'पर', 'भगोड़े', 'पहलवान', 'सुशील', 'कुमार', 'के', 'लिए', 'कुछ', 'ज़रूरी', 'सबक']
844424930527103_261_f: hyp=[]
```
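For anyone who wants to reproduce the counts, here is a minimal sketch of how empty hyps can be tallied. It assumes the recogs file keeps the `<cut_id>: ref=[...]` / `<cut_id>: hyp=[...]` line format shown above; the function name and the sample utterance IDs are hypothetical and only for illustration.

```python
import re

# Matches hyp lines such as "844424930526921_229_f: hyp=[]".
# Assumption: one hyp line per utterance, formatted as in the excerpt above.
HYP_LINE = re.compile(r"^(?P<cut_id>\S+):\s+hyp=(?P<hyp>\[.*\])\s*$")


def count_empty_hyps(lines):
    """Return (num_empty, num_total) over all hyp lines in `lines`."""
    num_empty = 0
    num_total = 0
    for line in lines:
        m = HYP_LINE.match(line.strip())
        if m is None:
            continue  # skip ref lines and blank separators
        num_total += 1
        if m.group("hyp") == "[]":
            num_empty += 1
    return num_empty, num_total


# Hypothetical sample in the same format as the recogs excerpt.
sample = [
    "utt_a: ref=['hello', 'world']",
    "utt_a: hyp=[]",
    "",
    "utt_b: ref=['good', 'morning']",
    "utt_b: hyp=['good', 'morning']",
]
empty, total = count_empty_hyps(sample)
print(f"{empty} empty hyps out of {total}")
```

On a real run, the same function can be fed an open file handle, for example `count_empty_hyps(open(path, encoding="utf-8"))`, where `path` points at the recogs file.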

Another interesting observation is the correlation with `--max-duration`: it affects how many empty hyps I get, which in turn affects my WER:
```text
For max-duration: 3600
I am getting 44 empty hyps out of 159459 samples in my test set
WER: 12.76

For max-duration: 8000
I am getting 29721 empty hyps out of 159459 samples in my test set
WER: 44.2
```
How is the batch size affecting the quality of my results? Are the empty hyps the only cause of the performance drop, or is the batch size causing other issues as well? I am quite stuck on this issue.
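One way to separate the two effects: an empty hyp is scored as one deletion per reference word, so the WER points that come from empty hyps alone can be estimated directly from the recogs file. If that estimate is much smaller than the overall WER increase, something besides empty hyps is degrading. The sketch below assumes the recogs line format shown earlier; the function name and sample IDs are hypothetical.

```python
import ast
import re

# Matches both "<cut_id>: ref=[...]" and "<cut_id>: hyp=[...]" lines.
# Assumption: the word lists are valid Python list literals, as in the excerpt.
LINE = re.compile(r"^(?P<cut_id>\S+):\s+(?P<kind>ref|hyp)=(?P<words>\[.*\])\s*$")


def empty_hyp_wer_share(lines):
    """Return the WER points (in %) caused by deletions from empty hyps."""
    refs, hyps = {}, {}
    for line in lines:
        m = LINE.match(line.strip())
        if m is None:
            continue  # blank separators or unrelated lines
        words = ast.literal_eval(m.group("words"))
        target = refs if m.group("kind") == "ref" else hyps
        target[m.group("cut_id")] = words

    total_ref_words = sum(len(r) for r in refs.values())
    if total_ref_words == 0:
        return 0.0
    # Every reference word of an utterance with an empty hyp is a deletion.
    deleted = sum(
        len(refs[cut_id])
        for cut_id, hyp in hyps.items()
        if not hyp and cut_id in refs
    )
    return 100.0 * deleted / total_ref_words


# Hypothetical sample: 2 of the 5 reference words belong to an empty hyp.
sample = [
    "utt_a: ref=['hello', 'world']",
    "utt_a: hyp=[]",
    "",
    "utt_b: ref=['how', 'are', 'you']",
    "utt_b: hyp=['how', 'are', 'you']",
]
print(f"{empty_hyp_wer_share(sample):.1f}% WER from empty hyps")
```

Running this on the recogs files of both runs and subtracting the result from each reported WER would give a rough WER for the non-empty utterances, which is the number to compare across `--max-duration` settings.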

`head` of the decoding log (from the `--max-duration 8000` run):
```json
{
  "attention_decoder_attention_dim": 512,
  "attention_decoder_dim": 512,
  "attention_decoder_feedforward_dim": 2048,
  "attention_decoder_num_heads": 8,
  "attention_decoder_num_layers": 6,
  "avg": 15,
  "backoff_id": 500,
  "batch_idx_train": 0,
  "beam": 20.0,
  "beam_size": 4,
  "best_train_epoch": -1,
  "best_train_loss": Infinity,
  "best_valid_epoch": -1,
  "best_valid_loss": Infinity,
  "blank_id": 0,
  "bpe_model": "/data/speech/sai/zipformer_training/exp_run_1/lang_bpe_500/bpe.model",
  "bucketing_sampler": true,
  "causal": true,
  "chunk_size": "32",
  "cnn_module_kernel": "31,31,15,15,15,31",
  "concatenate_cuts": false,
  "context_file": "",
  "context_score": 2,
  "context_size": 2,
  "decoder_dim": 512,
  "decoding_method": "modified_beam_search",
  "downsampling_factor": "1,2,4,8,4,2",
  "drop_last": true,
  "duration_factor": 1.0,
  "enable_spec_aug": true,
  "encoder_dim": "192,256,384,512,384,256",
  "encoder_unmasked_dim": "192,192,256,256,256,192",
  "env_info": {
    "IP address": "192.64.119.53",
    "hostname": "yal.ai",
    "icefall-git-branch": "master",
    "icefall-git-date": "Thu Jun 18 07:10:25 2026",
    "icefall-git-sha1": "f7c1384-dirty",
    "icefall-path": "/data/speech/sai/zipformer_training/icefall",
    "k2-build-type": "Release",
    "k2-git-date": "Thu Apr 23 01:15:22 2026",
    "k2-git-sha1": "6ebfb7b412a95a6b1b11a18272325b712ff674a3",
    "k2-path": "/data/speech/sai/envs/zipformer/lib/python3.12/site-packages/k2/__init__.py",
    "k2-version": "1.24.4",
    "k2-with-cuda": true,
    "lhotse-path": "/data/speech/sai/envs/zipformer/lib/python3.12/site-packages/lhotse/__init__.py",
    "lhotse-version": "1.33.0.dev+git.6b45efe.clean",
    "python-version": "3.12",
    "torch-cuda-available": true,
    "torch-cuda-version": "12.9",
    "torch-version": "2.11.0+cu129"
  },
  "epoch": 30,
  "exp_dir": "zipformer/exp_run_1",
  "feature_dim": 80,
  "feedforward_dim": "512,768,1024,1536,1024,768",
  "full_dataset": true,
  "gap": 1.0,
  "has_contexts": false,
  "ignore_id": -1,
  "input_strategy": "PrecomputedFeatures",
  "iter": 0,
  "joiner_dim": 512,
  "label_smoothing": 0.1,
  "lang_dir": "data/lang_bpe_500",
  "left_context_frames": "64",
  "lm_avg": 1,
  "lm_epoch": 7,
  "lm_exp_dir": null,
  "lm_scale": 0.3,
  "lm_type": "rnn",
  "lm_vocab_size": 500,
  "log_interval": 50,
  "max_contexts": 8,
  "max_duration": 8000,
  "max_states": 64,
  "max_sym_per_frame": 1,
  "nbest_scale": 0.5,
  "ngram_lm_scale": 0.01,
  "num_buckets": 30,
  "num_encoder_layers": "2,2,3,4,3,2",
  "num_heads": "4,4,4,8,4,4",
  "num_paths": 200,
  "num_workers": 2,
  "on_the_fly_feats": false,
  "pos_dim": 48,
  "pos_head_dim": "4",
  "query_head_dim": "32",
  "res_dir": "zipformer/exp_run_1/modified_beam_search",
  "reset_interval": 200,
  "return_cuts": true,
  "rnn_lm_embedding_dim": 2048,
  "rnn_lm_hidden_dim": 2048,
  "rnn_lm_num_layers": 3,
  "rnn_lm_tie_weights": true,
  "shuffle": true,
  "skip_scoring": false,
  "spec_aug_time_warp_factor": 80,
  "subsampling_factor": 4,
  "suffix": "epoch-30_avg-15_chunk-32_left-context-64__modified_beam_search__beam-size-4_use-averaged-model",
  "tokens_ngram": 2,
  "transformer_lm_dim_feedforward": 2048,
  "transformer_lm_embedding_dim": 768,
  "transformer_lm_encoder_dim": 768,
  "transformer_lm_exp_dir": null,
  "transformer_lm_nhead": 8,
  "transformer_lm_num_layers": 16,
  "transformer_lm_tie_weights": true,
  "unk_id": 2,
  "use_attention_decoder": false,
  "use_averaged_model": true,
  "use_cr_ctc": false,
  "use_ctc": false,
  "use_shallow_fusion": false,
  "use_transducer": true,
  "valid_interval": 3000,
  "value_head_dim": "12",
  "vocab_size": 500,
  "warm_step": 2000
}
```


Can anyone point out what I might be doing wrong here?



Empty hyp when decoding zipformer streaming model #2091
