Enable expert-parallel language modules in MegatronMIMO by liding-nv · Pull Request #4485 · NVIDIA-NeMo/Megatron-Bridge

liding-nv · 2026-06-24T15:07:30Z

Summary

This PR enables MegatronMIMO to describe and build non-colocated language MoE modules with expert parallelism while keeping encoder modules dense.

Key changes:

Adds language-side expert_model_parallel_size / expert_tensor_parallel_size handling to MegatronMIMO parallelism config.
Separates dense rank accounting from expert rank views, so EP/ETP re-factorize the same language rank span instead of multiplying the module world size.
Validates that heterogeneous module rank ranges tile the distributed world exactly, and keeps encoder modules dense for this MoE path.
Rebuilds MegatronMIMO HyperCommGrid setup around a dense base view plus an expert view:
- dense: tp, cp, dp, pp
- expert: expt_tp, ep, expt_dp, pp
Wires the resulting expert groups into ProcessGroupCollection and MCore compatibility globals.
Seeds per-module RNG with real TP/PP/EP/ETP ranks so expert-parallel language ranks do not silently share identical expert seeds.
Passes language TP/CP groups into MCore MimoModel construction for sequence-parallel partitioning paths.
Initializes the MCore global memory buffer for the MegatronMIMO custom parallel setup path.
Threads ep through the MegatronMIMO conversion CLI component parser.
Extends unit coverage for config validation, grid construction, provider wiring, RNG seeding, global expert-group bridging, checkpoint/optimizer setup, and conversion CLI parsing.
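
The dense/expert split described above can be illustrated with a small sketch. This is not the code in this PR: `ModuleParallelism` and its field names are hypothetical stand-ins, and the only assumption carried over from the summary is that the dense view (`tp, cp, dp, pp`) and the expert view (`expt_tp, ep, expt_dp, pp`) cover the same set of ranks, so EP/ETP never enlarge the module's world size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleParallelism:
    """Hypothetical stand-in for one module's parallelism settings."""

    tp: int = 1
    cp: int = 1
    dp: int = 1
    pp: int = 1
    ep: int = 1   # expert model parallel size
    etp: int = 1  # expert tensor parallel size

    @property
    def dense_world_size(self) -> int:
        # EP and ETP are deliberately excluded: they re-factorize these
        # ranks rather than adding new ones.
        return self.tp * self.cp * self.dp * self.pp

    def expert_view(self) -> dict:
        """Factor the same ranks as (expt_tp, ep, expt_dp, pp)."""
        ranks_per_stage = self.tp * self.cp * self.dp
        expert_block = self.etp * self.ep
        if ranks_per_stage % expert_block != 0:
            raise ValueError(
                f"etp*ep={expert_block} must divide tp*cp*dp={ranks_per_stage}"
            )
        return {
            "expt_tp": self.etp,
            "ep": self.ep,
            "expt_dp": ranks_per_stage // expert_block,
            "pp": self.pp,
        }


if __name__ == "__main__":
    language = ModuleParallelism(tp=2, cp=1, dp=4, pp=2, ep=4, etp=1)
    print(language.dense_world_size)
    print(language.expert_view())
```

In this sketch the expert data-parallel size is derived rather than configured, which is one way to guarantee that both views multiply out to the same rank count.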

Signed-off-by: Li Ding <liding@nvidia.com>

copy-pr-bot · 2026-06-24T15:07:34Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

liding-nv · 2026-06-24T15:07:40Z

/ok to test d3cfc95

liding-nv · 2026-06-24T16:45:34Z

/ok to test 6f1b8af

claude · 2026-06-24T19:49:47Z

Review

Finding: CLI help text missing ep=N

File: examples/conversion/convert_megatron_mimo.py, line 265

The ep=N key was added to _COMPONENT_KEY_TO_FIELD (line 66) and to the docstring/error message in _parse_component_spec (lines 75, 79), but the --component help text in _add_common_args still reads:

name=tp=N[,pp=N,dp=N,cp=N,etp=N,rank_offset=N]

It should include ep=N:

name=tp=N[,pp=N,dp=N,cp=N,ep=N,etp=N,rank_offset=N]
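
For readers unfamiliar with the `--component` format, a minimal sketch of a parser that accepts the corrected spec is shown here. It is an illustration only: the real `_COMPONENT_KEY_TO_FIELD` and `_parse_component_spec` in `convert_megatron_mimo.py` are not reproduced in this thread, so the table contents, the target field names other than `expert_model_parallel_size` and `expert_tensor_parallel_size`, and the error messages below are all assumptions.

```python
# Hypothetical key-to-field table; the actual mapping in the script may differ.
COMPONENT_KEY_TO_FIELD = {
    "tp": "tensor_model_parallel_size",
    "pp": "pipeline_model_parallel_size",
    "dp": "data_parallel_size",
    "cp": "context_parallel_size",
    "ep": "expert_model_parallel_size",
    "etp": "expert_tensor_parallel_size",
    "rank_offset": "rank_offset",
}


def parse_component_spec(spec: str) -> tuple[str, dict[str, int]]:
    """Parse 'name=tp=N[,pp=N,dp=N,cp=N,ep=N,etp=N,rank_offset=N]'."""
    # Split only on the first '=' so the key=value list stays intact.
    name, sep, body = spec.partition("=")
    if not sep or not name or not body:
        raise ValueError(f"expected name=tp=N[,...], got {spec!r}")

    fields: dict[str, int] = {}
    for item in body.split(","):
        key, sep, value = item.partition("=")
        if not sep or key not in COMPONENT_KEY_TO_FIELD:
            allowed = ", ".join(COMPONENT_KEY_TO_FIELD)
            raise ValueError(f"unknown key {key!r}; allowed keys: {allowed}")
        fields[COMPONENT_KEY_TO_FIELD[key]] = int(value)

    # The help text shows tp as the one mandatory key.
    if "tensor_model_parallel_size" not in fields:
        raise ValueError(f"component {name!r} must set tp=N")
    return name, fields


if __name__ == "__main__":
    print(parse_component_spec("language=tp=2,dp=4,ep=4,etp=1,rank_offset=4"))
```

Building the "allowed keys" message from the table itself is one way to keep help and error text from drifting out of sync with the accepted keys, which is the class of inconsistency this finding points at.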

Everything else looks solid. The rank algebra (dense_model_parallel_size excluding EP/ETP), the expert factorization validation, the tiling validation upgrade (gaps + world_size coverage), the Phase 1 guards at both the config and builder layers, and the provider sync are all consistent and well-tested. The test coverage is thorough across all changed code paths.
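
As a reference point for the tiling check mentioned above (no gaps, no overlaps, full `world_size` coverage), here is a minimal sketch. The function name and the `(rank_offset, size)` input shape are assumptions made for illustration and are not taken from the PR.

```python
def validate_rank_tiling(modules: dict[str, tuple[int, int]], world_size: int) -> None:
    """Check that module rank ranges tile [0, world_size) exactly.

    `modules` maps a module name to (rank_offset, dense_world_size).
    Raises ValueError on an empty range, an overlap, a gap, or a
    mismatch with world_size.
    """
    spans = []
    for name, (offset, size) in modules.items():
        if size <= 0 or offset < 0:
            raise ValueError(f"{name}: invalid range offset={offset}, size={size}")
        spans.append((offset, offset + size, name))

    cursor = 0  # first rank not yet covered by any module
    for start, end, name in sorted(spans):
        if start < cursor:
            raise ValueError(f"{name}: ranks [{start}, {cursor}) overlap a previous module")
        if start > cursor:
            raise ValueError(f"gap: ranks [{cursor}, {start}) belong to no module")
        cursor = end

    if cursor != world_size:
        raise ValueError(f"modules cover {cursor} ranks but world_size is {world_size}")


if __name__ == "__main__":
    # Encoder on ranks 0-3, language module on ranks 4-15.
    validate_rank_tiling({"encoder": (0, 4), "language": (4, 12)}, world_size=16)
```

Sorting by start rank and walking a single cursor catches gaps and overlaps in one pass; the final comparison covers the case where the last module stops short of, or runs past, the distributed world.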

Suggested test cases: No perf tests impacted.

Signed-off-by: Li Ding <liding@nvidia.com>

add MegatronMIMO expert parallel config

d3cfc95

Signed-off-by: Li Ding <liding@nvidia.com>

copy-pr-bot Bot temporarily deployed to public June 24, 2026 15:08 Inactive

copy-pr-bot Bot temporarily deployed to test June 24, 2026 15:08 Inactive

copy-pr-bot Bot temporarily deployed to public June 24, 2026 15:18 Inactive

copy-pr-bot Bot temporarily deployed to public June 24, 2026 15:19 Inactive

copy-pr-bot Bot temporarily deployed to public June 24, 2026 16:46 Inactive

copy-pr-bot Bot temporarily deployed to test June 24, 2026 16:46 Inactive

copy-pr-bot Bot temporarily deployed to public June 24, 2026 16:56 Inactive

copy-pr-bot Bot temporarily deployed to public June 24, 2026 17:16 Inactive

liding-nv marked this pull request as ready for review June 24, 2026 19:45

liding-nv marked this pull request as draft June 24, 2026 20:15

training loop for non-colocated moe model

8a0cf29

Signed-off-by: Li Ding <liding@nvidia.com>

liding-nv changed the title ~~add MegatronMIMO expert parallel config~~ Enable expert-parallel language modules in MegatronMIMO Jun 25, 2026

functional tests

b27324c

Signed-off-by: Li Ding <liding@nvidia.com>

liding-nv force-pushed the liding/mimo-moe-rank-config branch from 6f1b8af to b27324c June 26, 2026 15:45

Merge branch 'main' into liding/mimo-moe-rank-config

acf1df8

liding-nv wants to merge 4 commits into main from liding/mimo-moe-rank-config
