[codex] Fix tensor-parallel label smoothing #5522

Open
ilml wants to merge 1 commit into NVIDIA:main from ilml:codex/fix-tp-label-smoothing

ilml commented Jun 27, 2026

Contributor

Summary

  • compute label smoothing with the global tensor-parallel vocabulary size
  • reduce the max-shifted logit sum across the explicit TP group so every rank uses the same mean log-probability
  • keep the mean log-probability calculation finite when softmax probabilities underflow
  • add a TP=2 forward/backward regression against the documented dense target distribution
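
As a rough illustration of the second and third bullets, the sketch below simulates TP=2 in plain Python. `max` and `sum` over a list of shards stand in for the all-reduce collectives on the TP group. The helper names and example logits are hypothetical and are not taken from the diff.

```python
import math

def sharded_mean_log_prob(shards):
    """Mean log-probability over the global vocabulary, computed from per-rank shards.

    `shards` is a list of per-rank logit lists. max() and sum() across the list
    are stand-ins (an assumption of this sketch) for all-reduce MAX / SUM.
    """
    global_vocab_size = sum(len(s) for s in shards)
    global_max = max(max(s) for s in shards)                      # all-reduce MAX
    shifted = [[x - global_max for x in s] for s in shards]
    sum_exp = sum(sum(math.exp(x) for x in s) for s in shifted)   # all-reduce SUM
    shifted_sum = sum(sum(s) for s in shifted)                    # all-reduce SUM
    # mean_i log p_i = mean_i (x_i - max) - log(sum_j exp(x_j - max))
    return shifted_sum / global_vocab_size - math.log(sum_exp)

def dense_mean_log_prob(logits):
    """Reference computed on the unsharded vocabulary."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return sum(x - log_z for x in logits) / len(logits)

shards = [[2.0, -1.0], [0.5, 3.0]]          # two ranks, two vocabulary entries each
dense = [x for s in shards for x in s]
print(sharded_mean_log_prob(shards), dense_mean_log_prob(dense))
```

Because every rank sees the same reduced sums, every rank ends up with the same mean, and no `log()` is taken of a probability that may have underflowed.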

Root cause

The label-smoothing path treated each rank's vocabulary partition size as the full vocabulary size and averaged log-probabilities only within the local shard. That made losses rank-dependent and subtracted alpha / (partition_vocab_size - 1) from non-target gradients instead of alpha / (global_vocab_size - 1).
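
A minimal sketch of that discrepancy, assuming the smoothed loss has the form `(1 - s) * nll - s * mean(log p)` with `s = alpha * V / (V - 1)`. The function names and logits are illustrative only, and the sketch does not reproduce the exact numbers reported under Validation.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def dense_smoothed_loss(logits, target, alpha):
    """Cross-entropy against 1 - alpha on the target and alpha / (V - 1) elsewhere."""
    logp = log_softmax(logits)
    v = len(logits)
    off_target = sum(lp for i, lp in enumerate(logp) if i != target)
    return -(1.0 - alpha) * logp[target] - alpha / (v - 1) * off_target

def smoothed_loss(logp_target, logp_slice, vocab_size, alpha):
    """(1 - s) * nll - s * mean(log p), with s = alpha * V / (V - 1)."""
    s = alpha * vocab_size / (vocab_size - 1)
    mean_logp = sum(logp_slice) / vocab_size
    return (1.0 - s) * (-logp_target) - s * mean_logp

logits = [2.0, -1.0, 0.5, 3.0]     # global vocabulary of 4
target, alpha = 1, 0.1
logp = log_softmax(logits)
shards = [logp[:2], logp[2:]]      # TP=2, two entries per rank

# Old behaviour as described above: partition size and local slice on each rank.
old_rank_losses = [smoothed_loss(logp[target], s, len(s), alpha) for s in shards]
# Corrected behaviour: global size and a mean over the whole vocabulary.
fixed_loss = smoothed_loss(logp[target], logp, len(logits), alpha)
dense_loss = dense_smoothed_loss(logits, target, alpha)

print(old_rank_losses, fixed_loss, dense_loss)
```

With these inputs the two old per-rank losses disagree with each other and with the dense reference, while the global-size version matches the dense loss.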

The path also took log() after exponentiation and normalization. Large but finite logit gaps can underflow those probabilities to zero, producing an infinite smoothed loss.
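
The underflow can be reproduced in a few lines of plain Python. This is a sketch in float64, where a logit gap of 800 is enough to underflow; lower-precision dtypes underflow at much smaller gaps. The function names are hypothetical, and `math.log(0.0)` is mapped to `-inf` by hand to mirror what tensor libraries return.

```python
import math

def naive_mean_log_prob(logits):
    """exp -> normalize -> log, the order described above."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # math.log(0.0) raises ValueError; use -inf to mimic IEEE log(0) in tensor code.
    logs = [math.log(p) if p > 0.0 else float("-inf") for p in probs]
    return sum(logs) / len(logs)

def stable_mean_log_prob(logits):
    """Stay in log space: mean(x - max) - log(sum(exp(x - max)))."""
    m = max(logits)
    shifted = [x - m for x in logits]
    log_z = math.log(sum(math.exp(s) for s in shifted))
    return sum(shifted) / len(shifted) - log_z

logits = [0.0, -800.0, -1.0, -2.0]   # large but finite gap
print(naive_mean_log_prob(logits))   # the underflowed entry drives the mean to -inf
print(stable_mean_log_prob(logits))  # remains finite
```

The two functions agree on well-scaled logits; they only diverge once a probability underflows to zero.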

Impact

For TP > 1 with label_smoothing > 0, every training step could use incorrect loss values and gradients. The default label_smoothing=0 path does not perform the new sum or collective and is unchanged.
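
To make the gradient effect concrete, here is a small sketch that models the gradient as softmax minus the target distribution. It assumes the target entry keeps `1 - alpha` in both cases and only the off-target mass changes with the assumed vocabulary size. `smoothed_grad` and the inputs are illustrative and not part of the diff.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def smoothed_grad(logits, target, alpha, assumed_vocab_size):
    """Gradient w.r.t. the logits: softmax minus the smoothed target distribution.

    The target distribution puts 1 - alpha on the target and
    alpha / (assumed_vocab_size - 1) on every other entry.
    """
    off_target = alpha / (assumed_vocab_size - 1)
    probs = softmax(logits)
    return [p - (1.0 - alpha if i == target else off_target)
            for i, p in enumerate(probs)]

logits = [2.0, -1.0, 0.5, 3.0]     # global vocabulary of 4
target, alpha = 1, 0.1
partition_vocab_size = 2           # TP=2

fixed_grad = smoothed_grad(logits, target, alpha, len(logits))
old_grad = smoothed_grad(logits, target, alpha, partition_vocab_size)

print("row sum with global size:   ", sum(fixed_grad))
print("row sum with partition size:", sum(old_grad))
```

A correct softmax cross-entropy gradient sums to zero across the vocabulary. Under this model the partition-size variant instead sums to `alpha * (1 - (V - 1) / (V_p - 1))`, which is nonzero whenever the partition is smaller than the full vocabulary.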

Validation

Passed locally:

  • python3 -m py_compile megatron/core/tensor_parallel/cross_entropy.py tests/unit_tests/tensor_parallel/test_cross_entropy.py
  • git diff --check
  • equation-level reproduction: the old TP=2 calculation produced rank losses differing by 1.0, a maximum loss error of 0.714286, and non-zero global gradient row sums; the corrected equations matched the dense loss and gradient to machine precision
  • independent code review of the final diff found no actionable issues

Not run locally:

  • uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q tests/unit_tests/tensor_parallel/test_cross_entropy.py::test_vocab_parallel_cross_entropy_label_smoothing
    • this host has no PyTorch installation or working NVIDIA driver
  • tools/autoformat.sh
    • the repository's CI container is unavailable and the host does not provide black

Fixes #737

Signed-off-by: Tom Long <tolong@nvidia.com>

copy-pr-bot Bot commented Jun 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ilml requested a review from yashaswikarnati Jun 27, 2026 02:59
ilml marked this pull request as ready for review Jun 27, 2026 04:43
ilml requested review from a team as code owners Jun 27, 2026 04:43
svcnvidia-nemo-ci added the Final Review (PR is in the "final review" stage) and complexity: low labels Jun 27, 2026

Labels

complexity: low, Final Review (PR is in the "final review" stage)

Development

Successfully merging this pull request may close these issues.

[BUG] cross-entropy loss not computed correctly when label_smoothing is enabled

2 participants