[Dev] Numerical fix for moe single grouped weight with fp8 fp4 primary weight and grad norm spikes#5464
[Dev] Numerical fix for moe single grouped weight with fp8 fp4 primary weight and grad norm spikes#5464zhongbozhu wants to merge 17 commits into
Conversation
|
/claude strict-review |
Code Review SummaryCRITICAL: 0 | IMPORTANT: 2 | SUGGESTION: 3 Overall AssessmentThis is a well-structured fix for Risk level: Low-Medium. The changes are narrowly scoped to the GroupedTensor integration paths and gated behind Key FindingsIMPORTANT — Unused IMPORTANT — Suggested fix: def is_nvfp4tensor(tensor: torch.Tensor) -> bool:
"""Check if a tensor is a Transformer Engine NVFP4Tensor."""
return HAVE_TE_FP4_TENSOR_CLASS and _is_instance_or_param_data(tensor, FP4_TENSOR_CLASS)Suggestions (posted inline)
|
| bucket.layerwise_params_list[local_rank] | ||
| ).detach() | ||
| local_slot_view.copy_(flat_local_params) | ||
| with torch.no_grad(): |
There was a problem hiding this comment.
Why this with torch.no_grad() needed?
There was a problem hiding this comment.
removed, they are redundant I believe
There was a problem hiding this comment.
we need to use torch.no_grad() when the mutation is intentional and should not affect gradients.
Looks like you removed this everywhere. Not sure if this matters, but:
I tried implementing this feature some time ago and I got below error in the past in mxfp8 reuse grad buffer case when doing some copying.
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation
Need to make sure unit test still pass after this change.
There was a problem hiding this comment.
good catch, there is a bug in E2E test not captured in UT
There was a problem hiding this comment.
should be resolved now
|
/ok to test 509c7a6 |
7cae31a to
0456abf
Compare
|
Note: GB200 unit test was added #5477 but not yet synced to dev |
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
…ight Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
0456abf to
44b0b72
Compare
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>



What does this PR do ?
Fix
moe_single_grouped_weightwith bf16, mxfp8, nvfp4 training with fp8/fp4 primary weight turned on or off.Mirror PR to main: #5487
Unit tests with numerical checks passed, pending E2E validation.
test_single_grouped_mxfp8_train_eval_train_matches_train_onlyis a newly introduced test targeting to test thereuse_grad_buff_for_mxfp8_param_agrigorously, like adding checks for train-eval-train switches.Unit test coverage matrix:
bf16=Truefp8=Nonefp4=Nonegradient_accumulation_fusion=Falsebf16=Truefp8=Nonefp4=Nonegradient_accumulation_fusion=Truebf16=Truefp8="e4m3"fp8_recipe="mxfp8"fp8_param_gather=Falsereuse_grad_buf_for_mxfp8_param_ag=Falsegradient_accumulation_fusion=Falsebf16=Truefp8="e4m3"fp8_recipe="mxfp8"fp8_param_gather=Falsereuse_grad_buf_for_mxfp8_param_ag=Falsegradient_accumulation_fusion=Truebf16=Truefp8="e4m3"fp8_recipe="mxfp8"fp8_param_gather=Truereuse_grad_buf_for_mxfp8_param_ag=Truegradient_accumulation_fusion=Falsebf16=Truefp8="e4m3"fp8_recipe="mxfp8"fp8_param_gather=Truereuse_grad_buf_for_mxfp8_param_ag=Truegradient_accumulation_fusion=Truebf16=Truefp4="e2m1"fp4_recipe="nvfp4"fp4_param_gather=Falsegradient_accumulation_fusion=Falsebf16=Truefp4="e2m1"fp4_recipe="nvfp4"fp4_param_gather=Falsegradient_accumulation_fusion=Truebf16=Truefp4="e2m1"fp4_recipe="nvfp4"fp4_param_gather=Truegradient_accumulation_fusion=Falsebf16=Truefp4="e2m1"fp4_recipe="nvfp4"fp4_param_gather=Truegradient_accumulation_fusion=TrueEnv: 1 x gb200 node, 4 GPUs, the unit test only uses 2 parallel ranks.
Command:
Issue tracking
For PRs from open-source community contributors:
Linked issue:
Contribution process
Pre-checks
Code review
Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change
megatron/core, once all expert reviewers have approved, theFinal Reviewlabel is applied automatically and final reviewers are assigned.For PRs outside
megatron/core, this step is skipped.Step 3: Approved
Once all required reviewers have approved, the
Approvedlabel is applied automatically.Merge
Any member of mcore-engineers will be able to merge your PR.