[codex] Fix tensor-parallel label smoothing by ilml · Pull Request #5522 · NVIDIA/Megatron-LM

ilml · 2026-06-27T02:29:30Z

Summary

- compute label smoothing with the global tensor-parallel vocabulary size
- reduce the max-shifted logit sum across the explicit TP group so every rank uses the same mean log-probability
- keep the mean log-probability calculation finite when softmax probabilities underflow
- add a TP=2 forward/backward regression against the documented dense target distribution
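
For reviewers who want to see the reduction in isolation, here is a minimal pure-Python sketch. It is not the code in this diff: each inner list stands in for one rank's vocabulary partition, the Python `max`/`sum` over shards stand in for the TP-group all-reduces, and all function names are hypothetical.

```python
import math

def dense_mean_log_prob(logits):
    """Reference: mean log-probability over the full (unsharded) vocabulary."""
    m = max(logits)
    log_z = math.log(sum(math.exp(x - m) for x in logits))
    return sum(x - m - log_z for x in logits) / len(logits)

def tp_mean_log_prob(shards):
    """Mean log-probability assembled from per-rank partial sums."""
    global_max = max(max(s) for s in shards)  # stands in for all-reduce(MAX)
    # Per-rank partial sums, then a sum over ranks (stands in for all-reduce(SUM)).
    sum_exp = sum(sum(math.exp(x - global_max) for x in s) for s in shards)
    shifted_sum = sum(sum(x - global_max for x in s) for s in shards)
    global_vocab = sum(len(s) for s in shards)
    return shifted_sum / global_vocab - math.log(sum_exp)

def local_only_mean_log_prob(shard, global_max, sum_exp):
    """What a rank gets if it averages only over its own shard."""
    return sum(x - global_max for x in shard) / len(shard) - math.log(sum_exp)

logits = [2.0, -1.0, 0.5, 0.0, 1.5, -0.5, 0.3, -2.0]
shards = [logits[:4], logits[4:]]  # simulated TP=2 split

dense = dense_mean_log_prob(logits)
reduced = tp_mean_log_prob(shards)

g_max = max(logits)
z = sum(math.exp(x - g_max) for x in logits)
local = [local_only_mean_log_prob(s, g_max, z) for s in shards]
```

In this sketch `reduced` agrees with `dense`, while the two entries of `local` differ from each other, which is the rank dependence described under Root cause.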

Root cause

The label-smoothing path treated each rank's vocabulary partition size as the full vocabulary size and averaged log-probabilities only within the local shard. That made losses rank-dependent and subtracted alpha / (partition_vocab_size - 1) from non-target gradients instead of alpha / (global_vocab_size - 1).
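
A small illustration of why the denominator matters, assuming the dense target distribution puts 1 - alpha on the label and alpha / (vocab_size - 1) on every other token. The helper below is a hypothetical simplification written for this note, not the Megatron implementation; it only varies which vocabulary size feeds the non-target term.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def grad_with_vocab_size(logits, target, alpha, vocab_size_used):
    """d(loss)/d(logits) = softmax - target distribution.

    vocab_size_used is the size plugged into alpha / (size - 1); passing the
    partition size instead of the global size mimics the bug described above.
    """
    logp = log_softmax(logits)
    off_target = alpha / (vocab_size_used - 1)
    return [
        math.exp(lp) - ((1.0 - alpha) if i == target else off_target)
        for i, lp in enumerate(logp)
    ]

logits = [2.0, -1.0, 0.5, 0.0, 1.5, -0.5, 0.3, -2.0]
target, alpha, tp_size = 2, 0.1, 2
global_vocab = len(logits)
partition_vocab = global_vocab // tp_size

good = grad_with_vocab_size(logits, target, alpha, global_vocab)
bad = grad_with_vocab_size(logits, target, alpha, partition_vocab)

# With the global size the target distribution sums to 1, so the gradient
# row sums to zero; with the partition size it does not.
row_sum_good = sum(good)
row_sum_bad = sum(bad)
```

With the global size the gradient row sums to zero up to rounding; with the partition size the non-target mass is too large and the row sum is visibly non-zero.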

The path also took log() after exponentiation and normalization. Large but finite logit gaps can underflow those probabilities to zero, producing an infinite smoothed loss.
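
The underflow can be reproduced without any framework. The sketch below is illustrative only: the function names are made up for this note, and the naive variant maps log(0) to -inf to mimic what tensor libraries return (Python's `math.log(0.0)` would raise instead).

```python
import math

def naive_mean_log_prob(logits):
    """exp -> normalize -> log, which loses tiny probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-800) underflows to 0.0
    total = sum(exps)
    probs = [e / total for e in exps]
    # Mimic tensor-library behaviour: log(0) -> -inf.
    logs = [math.log(p) if p > 0.0 else float("-inf") for p in probs]
    return sum(logs) / len(logs)

def stable_mean_log_prob(logits):
    """Stay in log space: log p_i = (x_i - max) - log(sum exp(x_j - max))."""
    m = max(logits)
    log_z = math.log(sum(math.exp(x - m) for x in logits))
    return sum(x - m - log_z for x in logits) / len(logits)

logits = [0.0, -800.0]  # large but finite gap
naive = naive_mean_log_prob(logits)
stable = stable_mean_log_prob(logits)
```

Here `naive` is -inf while `stable` is a finite value, because the log-space form never materializes the underflowed probability.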

Impact

For TP > 1 with label_smoothing > 0, every training step could use incorrect loss values and gradients. The default label_smoothing=0 path does not perform the new sum or collective and is unchanged.

Validation

Passed locally:

- python3 -m py_compile megatron/core/tensor_parallel/cross_entropy.py tests/unit_tests/tensor_parallel/test_cross_entropy.py
- git diff --check
- equation-level reproduction: the old TP=2 calculation produced rank losses differing by 1.0, a maximum loss error of 0.714286, and non-zero global gradient row sums; the corrected equations matched the dense loss and gradient to machine precision
- independent code review of the final diff found no actionable issues

Not run locally:

- uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q tests/unit_tests/tensor_parallel/test_cross_entropy.py::test_vocab_parallel_cross_entropy_label_smoothing
  - this host has no PyTorch installation or working NVIDIA driver
- tools/autoformat.sh
  - the repository's CI container is unavailable and the host does not provide black

Fixes #737

Signed-off-by: Tom Long <tolong@nvidia.com>

copy-pr-bot · 2026-06-27T02:29:34Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Fix tensor-parallel label smoothing

d6fc45b

Signed-off-by: Tom Long <tolong@nvidia.com>

ilml requested a review from yashaswikarnati June 27, 2026 02:59

ilml marked this pull request as ready for review June 27, 2026 04:43

ilml requested review from a team as code owners June 27, 2026 04:43

svcnvidia-nemo-ci added the "Final Review" (PR is in the "final review" stage) and "complexity: low" labels Jun 27, 2026

ilml wants to merge 1 commit into NVIDIA:main from ilml:codex/fix-tp-label-smoothing

2 participants
