I am getting empty hypotheses (hyp=[]) when decoding my test set. I trained a zipformer model with --causal 1 (which makes it streaming-capable).
Exact training command:
./zipformer/train.py \
--world-size 8 \
--num-epochs 30 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp_run_1 \
--bpe-model /data/speech/sai/zipformer_training/exp_run_1/lang_bpe_500/bpe.model \
--causal 1 \
--max-duration 3600
Then I decoded a test set with this command:
./zipformer/decode.py \
--epoch 30 \
--avg 15 \
--exp-dir zipformer/exp_run_1 \
--bpe-model /data/speech/sai/zipformer_training/exp_run_1/lang_bpe_500/bpe.model \
--causal 1 \
--chunk-size 32 \
--left-context-frames 64 \
--max-duration 3600 \
--decoding-method modified_beam_search \
--beam-size 4
But in my recogs file, I see a lot of empty hyps, for example:
844424930526921_229_f: ref=['मेनका', 'की', 'सुंदरता', 'से', 'मोहित', 'विश्वामित्र', 'ने', 'उससे', 'शारीरिक', 'संबंध', 'बनाए']
844424930526921_229_f: hyp=[]
844424930526970_229_f: ref=['जिसके', 'बाद', 'वह', 'काफी', 'जल', 'गया', 'और', 'चिल्लाने', 'लगा']
844424930526970_229_f: hyp=[]
844424930527095_261_f: ref=['पत्नी', 'सब्जी', 'लेने', 'में', 'इतना', 'मोलभाव', 'कर', 'रही', 'थी', 'कि', 'पति', 'परेशान', 'हो', 'गया']
844424930527095_261_f: hyp=[]
844424930527099_261_f: ref=['श्रीनगर', 'के', 'बाहरी', 'क्षेत्र', 'जीवान', 'में', 'महबूबा', 'को', 'समारोह', 'की', 'अध्यक्षता', 'करनी', 'थी']
844424930527099_261_f: hyp=[]
844424930527100_261_f: ref=['भोपाल', 'में', 'साकेत', 'नगर', 'निवासी', 'गीता', 'दुबे', 'का', 'देहांत', 'हो', 'गया']
844424930527100_261_f: hyp=[]
844424930527103_261_f: ref=['गामा', 'पहलवान', 'की', 'पुण्यतिथि', 'पर', 'भगोड़े', 'पहलवान', 'सुशील', 'कुमार', 'के', 'लिए', 'कुछ', 'ज़रूरी', 'सबक']
844424930527103_261_f: hyp=[]
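To quantify this, here is a minimal sketch of how the empty hyps can be counted. It assumes every hypothesis line in the recogs file looks like the excerpt above (`<cut_id>: hyp=[...]`); the function name `count_empty_hyps` is my own helper, not part of icefall.

```python
import re

# Matches hypothesis lines such as "844424930526921_229_f: hyp=[]".
# The line format is assumed from the recogs excerpt above.
HYP_LINE = re.compile(r"^(\S+):\s+hyp=(\[.*\])\s*$")


def count_empty_hyps(lines):
    """Return (total_hyps, empty_hyps, ids_of_empty_hyps) for an iterable of lines."""
    total = 0
    empty_ids = []
    for line in lines:
        match = HYP_LINE.match(line.strip())
        if match is None:
            continue  # skip ref= lines and anything else
        total += 1
        cut_id, hyp = match.groups()
        if hyp.replace(" ", "") == "[]":
            empty_ids.append(cut_id)
    return total, len(empty_ids), empty_ids


sample = [
    "844424930526921_229_f: ref=['मेनका', 'की', 'सुंदरता']",
    "844424930526921_229_f: hyp=[]",
    "hypothetical_cut_id: ref=['जल', 'गया']",
    "hypothetical_cut_id: hyp=['जल', 'गया']",
]
total, empty, empty_ids = count_empty_hyps(sample)
print(f"{empty} empty out of {total} hyps: {empty_ids}")
```

On a real file, the same function can be fed an open file handle, for example `count_empty_hyps(open(path, encoding="utf-8"))`. The returned cut ids make it possible to compare which utterances come back empty at each max-duration setting.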
Another interesting observation is the correlation with max-duration: it affects how many empty hyps I get, which in turn affects my WER:
For max-duration: 3600
I am getting 44 empty hyps out of 159459 samples in my test set
WER: 12.76
For max-duration: 8000
I am getting 29721 empty hyps out of 159459 samples in my test set
WER: 44.2
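As a quick sanity check on these numbers, the sketch below turns the counts into empty-hyp rates. Only the counts come from my runs; the helper `empty_hyp_rate` is a name introduced here for illustration. The share of empty hyps goes from roughly 0.03% at max-duration 3600 to about 18.6% at max-duration 8000.

```python
def empty_hyp_rate(num_empty, num_total):
    """Percentage of utterances whose hypothesis is empty."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return 100.0 * num_empty / num_total


# (empty hyps, total utterances) per --max-duration value, taken from the runs above.
runs = {
    3600: (44, 159459),
    8000: (29721, 159459),
}

for max_duration, (num_empty, num_total) in runs.items():
    rate = empty_hyp_rate(num_empty, num_total)
    print(f"max-duration={max_duration}: {rate:.2f}% empty hyps")
```

An empty hyp counts every reference word of that utterance as a deletion, so the rate alone does not say how much of the WER gap comes from the empty hyps; that would need the per-utterance word counts.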
How is the batch size affecting the quality of my results? Are the empty hyps the only cause of the performance drop, or is the batch size causing other issues as well? I am quite stuck on this issue.
Head of the decoding log (from the max-duration 8000 run):
{
"attention_decoder_attention_dim": 512,
"attention_decoder_dim": 512,
"attention_decoder_feedforward_dim": 2048,
"attention_decoder_num_heads": 8,
"attention_decoder_num_layers": 6,
"avg": 15,
"backoff_id": 500,
"batch_idx_train": 0,
"beam": 20.0,
"beam_size": 4,
"best_train_epoch": -1,
"best_train_loss": Infinity,
"best_valid_epoch": -1,
"best_valid_loss": Infinity,
"blank_id": 0,
"bpe_model": "/data/speech/sai/zipformer_training/exp_run_1/lang_bpe_500/bpe.model",
"bucketing_sampler": true,
"causal": true,
"chunk_size": "32",
"cnn_module_kernel": "31,31,15,15,15,31",
"concatenate_cuts": false,
"context_file": "",
"context_score": 2,
"context_size": 2,
"decoder_dim": 512,
"decoding_method": "modified_beam_search",
"downsampling_factor": "1,2,4,8,4,2",
"drop_last": true,
"duration_factor": 1.0,
"enable_spec_aug": true,
"encoder_dim": "192,256,384,512,384,256",
"encoder_unmasked_dim": "192,192,256,256,256,192",
"env_info": {
"IP address": "192.64.119.53",
"hostname": "yal.ai",
"icefall-git-branch": "master",
"icefall-git-date": "Thu Jun 18 07:10:25 2026",
"icefall-git-sha1": "f7c1384-dirty",
"icefall-path": "/data/speech/sai/zipformer_training/icefall",
"k2-build-type": "Release",
"k2-git-date": "Thu Apr 23 01:15:22 2026",
"k2-git-sha1": "6ebfb7b412a95a6b1b11a18272325b712ff674a3",
"k2-path": "/data/speech/sai/envs/zipformer/lib/python3.12/site-packages/k2/__init__.py",
"k2-version": "1.24.4",
"k2-with-cuda": true,
"lhotse-path": "/data/speech/sai/envs/zipformer/lib/python3.12/site-packages/lhotse/__init__.py",
"lhotse-version": "1.33.0.dev+git.6b45efe.clean",
"python-version": "3.12",
"torch-cuda-available": true,
"torch-cuda-version": "12.9",
"torch-version": "2.11.0+cu129"
},
"epoch": 30,
"exp_dir": "zipformer/exp_run_1",
"feature_dim": 80,
"feedforward_dim": "512,768,1024,1536,1024,768",
"full_dataset": true,
"gap": 1.0,
"has_contexts": false,
"ignore_id": -1,
"input_strategy": "PrecomputedFeatures",
"iter": 0,
"joiner_dim": 512,
"label_smoothing": 0.1,
"lang_dir": "data/lang_bpe_500",
"left_context_frames": "64",
"lm_avg": 1,
"lm_epoch": 7,
"lm_exp_dir": null,
"lm_scale": 0.3,
"lm_type": "rnn",
"lm_vocab_size": 500,
"log_interval": 50,
"max_contexts": 8,
"max_duration": 8000,
"max_states": 64,
"max_sym_per_frame": 1,
"nbest_scale": 0.5,
"ngram_lm_scale": 0.01,
"num_buckets": 30,
"num_encoder_layers": "2,2,3,4,3,2",
"num_heads": "4,4,4,8,4,4",
"num_paths": 200,
"num_workers": 2,
"on_the_fly_feats": false,
"pos_dim": 48,
"pos_head_dim": "4",
"query_head_dim": "32",
"res_dir": "zipformer/exp_run_1/modified_beam_search",
"reset_interval": 200,
"return_cuts": true,
"rnn_lm_embedding_dim": 2048,
"rnn_lm_hidden_dim": 2048,
"rnn_lm_num_layers": 3,
"rnn_lm_tie_weights": true,
"shuffle": true,
"skip_scoring": false,
"spec_aug_time_warp_factor": 80,
"subsampling_factor": 4,
"suffix": "epoch-30_avg-15_chunk-32_left-context-64__modified_beam_search__beam-size-4_use-averaged-model",
"tokens_ngram": 2,
"transformer_lm_dim_feedforward": 2048,
"transformer_lm_embedding_dim": 768,
"transformer_lm_encoder_dim": 768,
"transformer_lm_exp_dir": null,
"transformer_lm_nhead": 8,
"transformer_lm_num_layers": 16,
"transformer_lm_tie_weights": true,
"unk_id": 2,
"use_attention_decoder": false,
"use_averaged_model": true,
"use_cr_ctc": false,
"use_ctc": false,
"use_shallow_fusion": false,
"use_transducer": true,
"valid_interval": 3000,
"value_head_dim": "12",
"vocab_size": 500,
"warm_step": 2000
}
Can anyone point out what I might be doing wrong here?