Fine-tuning Offline FastConformer to Streaming Architecture #15819

Description

@Firaskraiem1

I am trying to adapt nvidia/stt_ar_fastconformer_hybrid_large_pc_v1.0 (offline, att_context_size: [-1, -1]) to a streaming architecture (att_context_size: [70, 13], att_context_style: chunked_limited) by fine-tuning with speech_to_text_hybrid_rnnt_ctc_bpe.py. Only the encoder is loaded through init_from_pretrained_model; the decoder and joint are reinitialized from scratch.
The RNNT decoder collapses immediately to predicting only blanks and never recovers, so val_wer stays at exactly 1.000 in every training run, regardless of hyperparameters.
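
To make the symptom concrete: a val_wer of exactly 1.000 is what an empty hypothesis produces, because every reference word is counted as a deletion. The sketch below is a plain word-level edit distance written for illustration, not the metric implementation used by the training script, and the example sentences are hypothetical placeholders.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical placeholder transcripts, not taken from the dataset.
blank_only = word_error_rate("one two three four", "")
perfect = word_error_rate("one two three four", "one two three four")
print(blank_only, perfect)
```

Since a blank-only RNNT output decodes to an empty string for every utterance, the total error count equals the total number of reference words, which would explain why the value is pinned at 1.000 rather than fluctuating.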

init_from_pretrained_model:
  model0:
    name: nvidia/stt_ar_fastconformer_hybrid_large_pc_v1.0
    include:
      - encoder
    exclude:
      - conv.batch_norm
      - pre_encode.out.weight
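
For reference, here is a rough sketch of how much context the encoder loses when moving from [-1, -1] to [70, 13]. It assumes, based on my reading of the chunked_limited style, that the chunk size is the right context plus one (14 frames here) and that each chunk can see itself plus left_context // chunk_size earlier chunks (5 here). The function name and the pure-Python mask are illustrative only and are not NeMo code.

```python
def chunked_limited_mask(num_frames: int, left_context: int, right_context: int):
    """Return mask[q][k] == True when query frame q may attend to key frame k.

    Assumption: both context sizes are non-negative, chunk size is
    right_context + 1, and visibility is decided per chunk, not per frame.
    """
    if left_context < 0 or right_context < 0:
        raise ValueError("this sketch only covers limited (non-negative) contexts")

    chunk_size = right_context + 1
    left_chunks = left_context // chunk_size
    chunk_of = [t // chunk_size for t in range(num_frames)]

    return [
        [0 <= chunk_of[q] - chunk_of[k] <= left_chunks for k in range(num_frames)]
        for q in range(num_frames)
    ]


# att_context_size: [70, 13] over a 112-frame utterance (8 chunks of 14 frames).
mask = chunked_limited_mask(num_frames=112, left_context=70, right_context=13)
visible_from_first_frame = sum(mask[0])   # only its own chunk
visible_from_frame_84 = sum(mask[84])     # own chunk plus 5 chunks to the left
print(visible_from_first_frame, visible_from_frame_84)
```

Under these assumptions, no frame ever sees more than 84 of its neighbours, whereas the offline encoder attends over the whole utterance, so the pretrained encoder weights are being asked to work with much less context than they were trained on.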

@pzelasko
@chtruong814
@nithinraok
