fix(clip_grads): handle empty grads_for_norm in inf-norm and p-norm paths by Mattral · Pull Request #5530 · NVIDIA/Megatron-LM

Mattral · 2026-06-28T11:51:11Z

Summary

get_grad_norm_fp32 crashes when called with an empty gradient list,
which occurs in practice when all parameters on a rank are filtered out
(frozen layers, shared params, TP duplicates). Two out of three norm-type
branches are affected; the L2 branch already has a correct guard.

Changes

megatron/core/optimizer/clip_grads.py

Branch	Bug	Fix
`inf` norm (line 95)	`max()` over empty generator → `ValueError`	Add `default=torch.tensor(0.0)` to `max()`
generic p-norm (line 127)	`total_norm` stays `float 0.0`; `all_reduce` receives a non-Tensor → `TypeError`	Initialise `total_norm = torch.zeros(1, dtype=torch.float, device='cuda')` before loop

No change to the L2 path or any non-empty-list behaviour.

Testing

Added unit tests in tests/unit_tests/optimizer/test_clip_grads.py
covering empty-list calls for all three norm_type paths (inf, 2.0,
custom p). Tests run without GPU using the local fallback
implementations.
Existing CI passes unchanged.

Checklist

Commits signed off (git commit -s)
Rebased on main
No unrelated formatting changes
One logical change per commit

Fixes #5529

copy-pr-bot · 2026-06-28T11:51:15Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-06-28T11:51:23Z

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

Add the oncall reviewer (optional reviewer)
Add required review teams based on your changes

See the contribution guide for more details.

get_grad_norm_fp32 crashes when grads_for_norm is empty: - inf-norm path (line 95): max() over an empty generator raises ValueError. Fix: pass default=torch.tensor(0.0) to max(). - generic p-norm path (lines 127-129): total_norm stays as a Python float after the loop body never executes, then torch.distributed.all_reduce receives a float instead of a Tensor, raising TypeError. Fix: initialise total_norm as a zero CUDA tensor before the loop, mirroring the existing L2 path pattern. The L2 path (norm_type == 2.0) already handles the empty case correctly via an explicit `if grads_for_norm` guard; this commit applies the same intent to the remaining two branches without changing the non-empty behaviour. Signed-off-by: Mattral <mattralminn@gmail.com>

Mattral requested review from a team as code owners June 28, 2026 11:51

svcnvidia-nemo-ci marked this pull request as draft June 28, 2026 11:51

github-actions Bot added the community-request label Jun 28, 2026

Mattral marked this pull request as ready for review June 28, 2026 11:52

svcnvidia-nemo-ci requested a review from a team June 28, 2026 11:52

Mattral force-pushed the main branch from b021a36 to 6c0d31f Compare June 28, 2026 11:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(clip_grads): handle empty grads_for_norm in inf-norm and p-norm paths#5530

fix(clip_grads): handle empty grads_for_norm in inf-norm and p-norm paths#5530
Mattral wants to merge 1 commit into
NVIDIA:mainfrom
Mattral:main

Mattral commented Jun 28, 2026

Uh oh!

copy-pr-bot Bot commented Jun 28, 2026

Uh oh!

github-actions Bot commented Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Mattral commented Jun 28, 2026

Summary

Changes

Testing

Checklist

Uh oh!

copy-pr-bot Bot commented Jun 28, 2026

Uh oh!

github-actions Bot commented Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants