I am trying to adapt nvidia/stt_ar_fastconformer_hybrid_large_pc_v1.0 (offline, att_context_size: [-1, -1]) to a streaming architecture (att_context_size: [70, 13], att_context_style: chunked_limited) using speech_to_text_hybrid_rnnt_ctc_bpe.py with init_from_pretrained_model (encoder only, decoder/joint reinitialized from scratch).
The RNNT decoder collapses immediately to a blank-only prediction strategy and never recovers, resulting in val_wer stuck at exactly 1.000 across all training runs regardless of hyperparameters.
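A val_wer of exactly 1.000 is what a blank-only decoder would produce: every hypothesis is an empty string, so each reference word counts as one deletion and the error count equals the reference word count. A minimal sketch of that arithmetic follows. It uses a plain word-level edit distance, not NeMo's own WER metric, and the sentences are made-up examples.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(refs, hyps):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


# Illustrative references (not from the actual validation set).
refs = ["the cat sat", "hello world"]
blank_only = ["", ""]  # what a decoder emitting only blanks would output

print(wer(refs, blank_only))
```

Any non-empty output, even a wrong one, would normally move the metric away from exactly 1.000, which is why the value points to empty hypotheses rather than merely poor ones.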
init_from_pretrained_model:
  model0:
    name: nvidia/stt_ar_fastconformer_hybrid_large_pc_v1.0
    include:
      - encoder
    exclude:
      - conv.batch_norm
      - pre_encode.out.weight
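To make explicit which weights this config is expected to carry over, here is a rough sketch of the filtering, assuming the include/exclude entries are matched as substrings of parameter names. The helper `select_keys` and the parameter names are hypothetical illustrations, not NeMo code and not keys read from the actual checkpoint.

```python
def select_keys(state_keys, include=("",), exclude=()):
    """Keep keys that match any include pattern and no exclude pattern.

    Assumption: patterns are plain substrings of the parameter name.
    """
    return [
        k for k in state_keys
        if any(p in k for p in include) and not any(p in k for p in exclude)
    ]


# Hypothetical parameter names, for illustration only.
state_keys = [
    "encoder.pre_encode.conv.0.weight",
    "encoder.pre_encode.out.weight",
    "encoder.layers.0.conv.batch_norm.weight",
    "encoder.layers.0.self_attn.linear_q.weight",
    "decoder.prediction.embed.weight",
    "joint.joint_net.2.weight",
]

loaded = select_keys(
    state_keys,
    include=["encoder"],
    exclude=["conv.batch_norm", "pre_encode.out.weight"],
)
for name in loaded:
    print(name)
```

Under that assumption, only encoder parameters outside the two excluded patterns are restored, and the decoder and joint networks start from random initialization.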
@pzelasko
@chtruong814
@nithinraok