[Dev] Numerical fix for moe single grouped weight with fp8 fp4 primary weight and grad norm spikes by zhongbozhu · Pull Request #5464 · NVIDIA/Megatron-LM

zhongbozhu · 2026-06-24T00:19:05Z

I, the PR author, have personally reviewed every line of this PR.

What does this PR do ?

Fix moe_single_grouped_weight for BF16, MXFP8, and NVFP4 training, with the FP8/FP4 primary weight turned on or off.

Mirror PR to main: #5487

Unit tests with numerical checks passed; E2E validation is pending. test_single_grouped_mxfp8_train_eval_train_matches_train_only is a newly introduced test that exercises reuse_grad_buf_for_mxfp8_param_ag rigorously, including checks for train-eval-train switches.

Unit test coverage matrix:

| Precision | Primary Weight Path | Grad Accum Fusion | Comparison | Notes / Transformer Config |
| --- | --- | --- | --- | --- |
| BF16 | BF16 primary weight | Off | single grouped weight on compared with single grouped weight off | `bf16=True` `fp8=None` `fp4=None` `gradient_accumulation_fusion=False` |
| BF16 | BF16 primary weight | On | single grouped weight on compared with single grouped weight off | `bf16=True` `fp8=None` `fp4=None` `gradient_accumulation_fusion=True` |
| MXFP8 | BF16 primary weight, MXFP8 compute | Off | single grouped weight on compared with single grouped weight off | `bf16=True` `fp8="e4m3"` `fp8_recipe="mxfp8"` `fp8_param_gather=False` `reuse_grad_buf_for_mxfp8_param_ag=False` `gradient_accumulation_fusion=False` |
| MXFP8 | BF16 primary weight, MXFP8 compute | On | single grouped weight on compared with single grouped weight off | `bf16=True` `fp8="e4m3"` `fp8_recipe="mxfp8"` `fp8_param_gather=False` `reuse_grad_buf_for_mxfp8_param_ag=False` `gradient_accumulation_fusion=True` |
| MXFP8 | MXFP8 primary weight | Off | single grouped weight on compared with single grouped weight off | `bf16=True` `fp8="e4m3"` `fp8_recipe="mxfp8"` `fp8_param_gather=True` `reuse_grad_buf_for_mxfp8_param_ag=True` `gradient_accumulation_fusion=False` |
| MXFP8 | MXFP8 primary weight | On | single grouped weight on compared with single grouped weight off | `bf16=True` `fp8="e4m3"` `fp8_recipe="mxfp8"` `fp8_param_gather=True` `reuse_grad_buf_for_mxfp8_param_ag=True` `gradient_accumulation_fusion=True` |
| NVFP4 | BF16 primary weight, NVFP4 compute | Off | single grouped weight on compared with single grouped weight off | `bf16=True` `fp4="e2m1"` `fp4_recipe="nvfp4"` `fp4_param_gather=False` `gradient_accumulation_fusion=False` |
| NVFP4 | BF16 primary weight, NVFP4 compute | On | single grouped weight on compared with single grouped weight off | `bf16=True` `fp4="e2m1"` `fp4_recipe="nvfp4"` `fp4_param_gather=False` `gradient_accumulation_fusion=True` |
| NVFP4 | NVFP4 primary weight | Off | single grouped weight on compared with single grouped weight off | `bf16=True` `fp4="e2m1"` `fp4_recipe="nvfp4"` `fp4_param_gather=True` `gradient_accumulation_fusion=False` |
| NVFP4 | NVFP4 primary weight | On | single grouped weight on compared with single grouped weight off | `bf16=True` `fp4="e2m1"` `fp4_recipe="nvfp4"` `fp4_param_gather=True` `gradient_accumulation_fusion=True` |
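The matrix above is a product of primary-weight paths and the grad-accum-fusion switch. As a rough sketch, it can be enumerated as shown here. The config keys and values are copied from the table; the names `PRECISION_PATHS` and `build_coverage_matrix` are hypothetical and are not taken from the PR's test file.

```python
from itertools import product

# Base config per (precision, primary-weight path); values copied from the table.
PRECISION_PATHS = {
    "BF16": [
        ("BF16 primary weight", {"bf16": True, "fp8": None, "fp4": None}),
    ],
    "MXFP8": [
        ("BF16 primary weight, MXFP8 compute",
         {"bf16": True, "fp8": "e4m3", "fp8_recipe": "mxfp8",
          "fp8_param_gather": False, "reuse_grad_buf_for_mxfp8_param_ag": False}),
        ("MXFP8 primary weight",
         {"bf16": True, "fp8": "e4m3", "fp8_recipe": "mxfp8",
          "fp8_param_gather": True, "reuse_grad_buf_for_mxfp8_param_ag": True}),
    ],
    "NVFP4": [
        ("BF16 primary weight, NVFP4 compute",
         {"bf16": True, "fp4": "e2m1", "fp4_recipe": "nvfp4",
          "fp4_param_gather": False}),
        ("NVFP4 primary weight",
         {"bf16": True, "fp4": "e2m1", "fp4_recipe": "nvfp4",
          "fp4_param_gather": True}),
    ],
}


def build_coverage_matrix():
    """Return one (precision, path, fusion, config) row per table line."""
    rows = []
    for precision, paths in PRECISION_PATHS.items():
        # Path is the outer loop and fusion the inner one, matching the table order.
        for (path_name, base_cfg), fusion in product(paths, (False, True)):
            cfg = dict(base_cfg, gradient_accumulation_fusion=fusion)
            rows.append((precision, path_name, fusion, cfg))
    return rows


matrix = build_coverage_matrix()
```

Each row would then be run twice, once with single grouped weight on and once with it off, and the two results compared.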

Env: 1 x GB200 node with 4 GPUs; the unit test uses only 2 parallel ranks.

Command:

torchrun --nproc_per_node=2 --log-dir /tmp/mcore-single-weight-ut --tee 0:3 --redirects 3 -m pytest -s -q tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py

[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_mxfp8_single_weight_torch_dist_checkpoint_matches_discrete_baseline[save-only-single]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_mxfp8_single_weight_torch_dist_checkpoint_matches_discrete_baseline[save-single-load-discrete]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_mxfp8_single_weight_torch_dist_checkpoint_matches_discrete_baseline[save-discrete-load-single]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_mxfp8_train_eval_train_matches_train_only
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_with_primary_param_gather[False-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_with_primary_param_gather[False-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_with_primary_param_gather[False-nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_with_primary_param_gather[True-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_with_primary_param_gather[True-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_with_primary_param_gather[True-nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_without_primary_param_gather[False-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_without_primary_param_gather[False-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_without_primary_param_gather[False-nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_without_primary_param_gather[True-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_without_primary_param_gather[True-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_without_primary_param_gather[True-nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[False-False-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[False-False-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[False-False-nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[False-True-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[False-True-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[False-True-nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[True-False-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[True-False-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[True-False-nvfp4]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[True-True-bf16]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[True-True-mxfp8]
[default0]:PASSED tests/unit_tests/transformer/moe/test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::test_single_grouped_weight_parity_module_grouped_linear[True-True-nvfp4]
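To tally a log like the one above per test function, a small parser along these lines could be used. It is only a sketch: the regular expression is fitted to the `[default0]:PASSED ...` lines shown here, matching `FAILED` is an assumption about how failures would appear, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches "[defaultN]:STATUS path::Class::test_name[optional-params]".
LINE_RE = re.compile(
    r"^\[default(\d+)\]:(PASSED|FAILED)\s+\S+::(\w+)::(\w+)(?:\[(.+)\])?$"
)


def summarize(lines):
    """Count log lines per (test function, status); non-matching lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            _rank, status, _cls, test_name, _params = match.groups()
            counts[(test_name, status)] += 1
    return counts


PREFIX = (
    "[default0]:PASSED tests/unit_tests/transformer/moe/"
    "test_moe_single_grouped_weight_numerics.py::TestMoESingleGroupedWeightNumerics::"
)
sample = [
    PREFIX + "test_single_grouped_mxfp8_train_eval_train_matches_train_only",
    PREFIX + "test_single_grouped_weight_parity_with_primary_param_gather[False-bf16]",
    PREFIX + "test_single_grouped_weight_parity_with_primary_param_gather[True-nvfp4]",
    "unrelated torchrun noise",
]
summary = summarize(sample)
```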

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code Typing guidelines
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

copy-pr-bot · 2026-06-24T00:19:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

kunlunl · 2026-06-24T14:52:30Z

/claude strict-review

claude · 2026-06-24T14:57:43Z

Code Review Summary

CRITICAL: 0 | IMPORTANT: 2 | SUGGESTION: 3

Overall Assessment

This is a well-structured fix for moe_single_grouped_weight across BF16, MXFP8, and NVFP4 with thorough test coverage. The data-flow through quantized/non-quantized param paths in the DDP buffer and distributed optimizer is correct. The register_grouped_linear_params refactor properly addresses the root cause (TE overwriting DDP-managed parameters with fresh meta tensors). The torch.no_grad() additions are necessary to prevent autograd tracking on buffer management ops with tensor subclasses.

Risk level: Low-Medium. The changes are narrowly scoped to the GroupedTensor integration paths and gated behind moe_single_grouped_weight. The FSDP guard is a good safeguard. The numerical parity tests cover the full precision × param-gather × grad-accum-fusion matrix.

Key Findings

IMPORTANT — Unused _unwrap_parameter_data on DistributedOptimizer (distrib_optimizer.py:1118-1121)
Added as a @staticmethod but never called. Duplicates the function in fp8_utils.py. Should be removed (inline comment posted with suggestion block).

IMPORTANT — is_nvfp4tensor not updated to unwrap Parameters (fp4_utils.py:58-60)
The PR updates is_float8tensor and is_mxfp8tensor to handle torch.nn.Parameter-wrapped TE subclasses via _is_instance_or_param_data, but is_nvfp4tensor still uses plain isinstance. This inconsistency could misclassify a Parameter-wrapped NVFP4Tensor in _param_uses_quantized_storage. The fix is straightforward — this file already imports _is_instance_or_param_data indirectly through the fp8_utils imports added in this PR. (Couldn't post inline since these lines aren't in the diff.)

Suggested fix:

def is_nvfp4tensor(tensor: torch.Tensor) -> bool:
    """Check if a tensor is a Transformer Engine NVFP4Tensor."""
    return HAVE_TE_FP4_TENSOR_CLASS and _is_instance_or_param_data(tensor, FP4_TENSOR_CLASS)
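For illustration only, the difference between a plain isinstance check and an unwrap-then-check helper can be sketched with stand-in classes. Everything below is hypothetical: `FakeNVFP4Tensor` and `FakeParameter` are not Transformer Engine or PyTorch types, and the body of `is_instance_or_param_data` is a guess at the idea suggested by the helper's name, not the code in fp8_utils.py.

```python
class FakeNVFP4Tensor:
    """Hypothetical stand-in for a quantized tensor class."""


class FakeParameter:
    """Hypothetical stand-in for a wrapper that keeps its payload in `.data`."""

    def __init__(self, data):
        self.data = data


def is_instance_or_param_data(tensor, cls):
    """Return True if `tensor` is a `cls`, or wraps a `cls` in its `.data` attribute."""
    if isinstance(tensor, cls):
        return True
    return isinstance(getattr(tensor, "data", None), cls)


raw = FakeNVFP4Tensor()
wrapped = FakeParameter(raw)

plain_check_on_wrapped = isinstance(wrapped, FakeNVFP4Tensor)
helper_check_on_wrapped = is_instance_or_param_data(wrapped, FakeNVFP4Tensor)
helper_check_on_raw = is_instance_or_param_data(raw, FakeNVFP4Tensor)
```

In this toy setup the plain check misses the wrapped object while the helper catches both forms, which is the kind of misclassification the finding above warns about.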

Suggestions (posted inline)

copy_tensor_to_quantized_param: document that the plain copy_ fallback relies on TE's overridden method
register_grouped_linear_params: consider clearing stale "weight" in the per-index branch for symmetry
modify_grouped_nvfp4_rowwise_storage: add comment explaining why member views are refreshed eagerly (vs. lazily in the MXFP8 counterpart)

kunlunl · 2026-06-24T14:57:59Z

                        bucket.layerwise_params_list[local_rank]
                    ).detach()
-                    local_slot_view.copy_(flat_local_params)
+                    with torch.no_grad():


Why is this `with torch.no_grad()` needed?

zhongbozhu · Jun 24, 2026

Removed; I believe they are redundant.

WanZzzzzz · Jun 26, 2026

We need torch.no_grad() when the mutation is intentional and should not affect gradients.
It looks like you removed this everywhere. Not sure if this matters, but when I tried implementing this feature some time ago, I got the error below in the MXFP8 reuse-grad-buffer case while doing some copying:
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation
We need to make sure the unit tests still pass after this change.

zhongbozhu · Jun 27, 2026

Good catch. There is a bug that shows up in E2E testing but was not captured by the UT.

zhongbozhu · Jun 29, 2026

Should be resolved now.
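The error class quoted in this thread can be reproduced outside Megatron with a plain `torch.nn.Parameter`. The sketch below is not the PR's code: `copy_into_leaf` is a hypothetical helper, and it only shows that an in-place `copy_` on a leaf tensor that requires grad raises unless it runs under `torch.no_grad()`.

```python
import torch


def copy_into_leaf(dst, src, use_no_grad):
    """Copy src into dst in place; return the error message, or None on success."""
    try:
        if use_no_grad:
            # Intentional mutation that autograd should not track.
            with torch.no_grad():
                dst.copy_(src)
        else:
            dst.copy_(src)
    except RuntimeError as err:
        return str(err)
    return None


param = torch.nn.Parameter(torch.zeros(4))  # leaf tensor with requires_grad=True
new_values = torch.arange(4, dtype=torch.float32)

error_without_guard = copy_into_leaf(param, new_values, use_no_grad=False)
error_with_guard = copy_into_leaf(param, new_values, use_no_grad=True)
```

Under the guard the copy succeeds and the parameter keeps `requires_grad=True`, so later backward passes are unaffected.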

zhongbozhu · 2026-06-24T20:00:31Z

/ok to test 509c7a6

zhongbozhu · 2026-06-24T20:26:34Z

Note: the GB200 unit test was added in #5477 but is not yet synced to dev.

zhongbozhu · 2026-06-28T06:23:51Z

E2E test with Qwen3.5 VL 35B-A3B SFT - branch dev_fix_single_weight

Before this PR, MoE single weight training would simply diverge; now it converges well.

The performance benefit comes from lower CPU overhead when quantizing to MXFP8 in the distributed optimizer. In addition, CUDA Graphs can be hard to enable for multimodal SFT as of today.

The green plot (before this PR) had grad norm spikes. With reuse_grad_buf_for_mxfp8_param_ag enabled, the training step right after eval does not clear the param_data buffer: the all-gather was already done during eval, so it is skipped, but unfortunately the zero-buffer operation was skipped along with it.
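The failure mode described above can be mimicked with a toy model. This is a sketch under stated assumptions, not Megatron code: one Python list stands in for the buffer shared between the param all-gather and grad accumulation, `needs_gather` is a hypothetical flag, and zeroing unconditionally is just one way to express the fix in this model.

```python
import math


def make_buffer(size):
    return {"data": [0.0] * size, "needs_gather": True}


def gather_params_if_needed(buf, params):
    """Write params into the shared buffer; return True if a gather ran."""
    if not buf["needs_gather"]:
        return False
    buf["data"] = list(params)
    buf["needs_gather"] = False
    return True


def eval_step(buf, params):
    # Eval only needs the gathered params; it leaves them in the buffer.
    gather_params_if_needed(buf, params)


def train_step(buf, params, grads, zero_unconditionally):
    gathered = gather_params_if_needed(buf, params)
    # Buggy variant: zeroing is tied to the gather, so it is skipped after eval.
    if gathered or zero_unconditionally:
        buf["data"] = [0.0] * len(buf["data"])
    # Backward accumulates grads into the same buffer.
    buf["data"] = [b + g for b, g in zip(buf["data"], grads)]
    grad_norm = math.sqrt(sum(x * x for x in buf["data"]))
    buf["needs_gather"] = True  # the optimizer step changed the params
    return grad_norm


def run_train_eval_train(zero_unconditionally):
    params, grads = [100.0, -100.0], [3.0, 4.0]
    buf = make_buffer(len(params))
    first = train_step(buf, params, grads, zero_unconditionally)
    eval_step(buf, params)
    second = train_step(buf, params, grads, zero_unconditionally)
    return first, second


buggy = run_train_eval_train(zero_unconditionally=False)
fixed = run_train_eval_train(zero_unconditionally=True)
```

In the buggy variant the second training step accumulates grads on top of stale param data, so its grad norm is far larger than in the first step; with the buffer always zeroed, both steps report the same norm.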

zhongbozhu · 2026-06-28T06:43:16Z

E2E performance benefit shown in Nsys: time spent looping over the MoE weights in the optimizer master weights and quantizing them to MXFP8, discrete weights vs. single weight.

Discrete

Single

zhongbozhu requested review from a team as code owners June 24, 2026 00:19

zhongbozhu requested review from WanZzzzzz and kunlunl June 24, 2026 00:31

claude Bot reviewed Jun 24, 2026

Comment thread megatron/core/optimizer/distrib_optimizer.py Outdated

claude Bot reviewed Jun 24, 2026

Comment thread megatron/core/fp8_utils.py

claude Bot reviewed Jun 24, 2026

Comment thread megatron/core/transformer/moe/experts.py

claude Bot reviewed Jun 24, 2026

Comment thread megatron/core/fp4_utils.py

kunlunl reviewed Jun 24, 2026

copy-pr-bot Bot temporarily deployed to public June 24, 2026 20:01 Inactive

copy-pr-bot Bot temporarily deployed to public June 24, 2026 20:04 Inactive

zhongbozhu mentioned this pull request Jun 24, 2026

[Main] Numerical fix for moe single grouped weight with fp8 fp4 primary weight and grad norm spikes #5487

Open

6 tasks

copy-pr-bot Bot temporarily deployed to public June 24, 2026 20:13 Inactive

zhongbozhu force-pushed the dev_fix_single_weight branch from 7cae31a to 0456abf Compare June 24, 2026 20:19

WanZzzzzz approved these changes Jun 26, 2026

zhongbozhu added 6 commits June 26, 2026 16:18

fix single weight - first draft

3c3199a

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

update unit test

b183b83

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

fix for gradient_accumulation_fusion

8d1a8ff

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

checks all ranks

b532964

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

increase UT coverage

5b879dc

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

resolve comments

c24ff37

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

zhongbozhu added 3 commits June 26, 2026 16:18

resolve comments and refactor param remapping logic for better clarity

b4da20b

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

add mcore warning about use_transformer_engine_op_fuser and single we…

c791977

…ight Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

linter

44b0b72

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

zhongbozhu force-pushed the dev_fix_single_weight branch from 0456abf to 44b0b72 Compare June 26, 2026 23:18

zhongbozhu mentioned this pull request Jun 27, 2026

[bug] Investigate convergence of performance features with Qwen3.5 VL as proxy model NVIDIA-NeMo/Megatron-Bridge#3801

Open

zhongbozhu added 6 commits June 26, 2026 19:02

another linter

35cc1a7

Signed-off-by: zhongboz <zhongboz@nvidia.com>

fix a no_grad bug in E2E traning, add repro to unit test

529053d

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

fix unit test

35acdc2

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

continue improve UT

d10b461

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

improve UT, fix grad norm spike after eval

20e5856

Signed-off-by: zhongboz <zhongboz@nvidia.com>

run UT in CI

79706bb

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

copy-pr-bot Bot temporarily deployed to public June 28, 2026 05:57 Inactive

zhongbozhu changed the title ~~[Dev] Fix moe single grouped weight feature with fp8 fp4 primary weight support~~ [Dev] Numerical fix for moe single grouped weight with fp8 fp4 primary weight and grad norm spikes Jun 28, 2026

copy-pr-bot Bot temporarily deployed to public June 28, 2026 06:01 Inactive

lint

a73c86b

Signed-off-by: zhongboz <zhongboz@nvidia.com>

include checkpointing to the unit test

faa033b

Signed-off-by: zhongboz <zhongboz@nvidia.com>
