Hi,
I just found some unusual behavior after rebuilding a fresh environment (torch 2.8, CUDA 12.6).
When training a Transducer+CTC model with Eden(... , warmup_start=0.1), the training diverges
after 100-200 steps of the 1st epoch.
The issue disappears:
- when removing the CTC loss
- or when setting warmup_start back to its original value,
  Eden(... , warmup_start=0.5) (even for Transducer+CTC training)
  - update: this only postpones the divergence to epoch 3 (LibriSpeech 100h train task);
    right before diverging, the grad norm is huge:
2026-06-23 17:37:35,479 WARNING [optim.py:588] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+07 2.033e+08 4.885e+08 1.750e+09 8.677e+10, threshold=9.769e+08, percent-clipped=13.0
It seems that ScaledAdam does not like the early CTC gradients when the learning rate is too small.
Should some CTC loss warm-starting be introduced for Transducer+CTC training?
(assuming the model should first learn something from the Transducer loss...)
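To make the question concrete, here is a minimal sketch of the kind of CTC warm-start I have in mind. The names `ctc_scale_at`, `combined_loss`, `warmup_steps` and `final_scale`, and their default values, are made up for illustration and are not existing options of the recipe:

```python
def ctc_scale_at(step: int, warmup_steps: int = 2000, final_scale: float = 0.2) -> float:
    """Hypothetical schedule: ramp the CTC weight linearly from 0 to final_scale.

    warmup_steps and final_scale are illustrative values, not tuned ones.
    """
    if warmup_steps <= 0:
        return final_scale
    return final_scale * min(1.0, step / warmup_steps)


def combined_loss(transducer_loss, ctc_loss, step: int):
    """Transducer loss plus the CTC loss weighted by the warm-start schedule.

    Works with plain floats or with tensors that support + and *.
    """
    return transducer_loss + ctc_scale_at(step) * ctc_loss
```

With such a schedule the CTC term would contribute nothing at step 0 and reach its full weight only after `warmup_steps` batches, so the early updates would be driven by the Transducer loss alone.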
Have you seen something similar?
With kind regards
Karel from BUT