Enable expert-parallel language modules in MegatronMIMO #4485

Draft
liding-nv wants to merge 4 commits into main from liding/mimo-moe-rank-config

@liding-nv liding-nv commented Jun 24, 2026

Contributor

Summary

This PR enables MegatronMIMO to describe and build non-colocated language MoE modules with expert parallelism while keeping encoder modules dense.

Key changes:

  • Adds language-side expert_model_parallel_size / expert_tensor_parallel_size handling to MegatronMIMO parallelism config.
  • Separates dense rank accounting from expert rank views, so EP/ETP re-factor the language module's existing rank span instead of multiplying the module world size.
  • Validates that heterogeneous module rank ranges tile the distributed world exactly, and keeps encoder modules dense for this MoE path.
  • Rebuilds the MegatronMIMO HyperCommGrid setup around a dense base view plus an expert view:
    • dense: tp, cp, dp, pp
    • expert: expt_tp, ep, expt_dp, pp
  • Wires the resulting expert groups into ProcessGroupCollection and MCore compatibility globals.
  • Seeds per-module RNG with real TP/PP/EP/ETP ranks so expert-parallel language ranks do not silently share identical expert seeds.
  • Passes language TP/CP groups into MCore MimoModel construction for sequence-parallel partitioning paths.
  • Initializes the MCore global memory buffer for the MegatronMIMO custom parallel setup path.
  • Threads ep through the MegatronMIMO conversion CLI component parser.
  • Extends unit coverage for config validation, grid construction, provider wiring, RNG seeding, global expert-group bridging, checkpoint/optimizer setup, and conversion CLI parsing.
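
The rank accounting described above can be illustrated with a minimal sketch. It is not the code in this PR: `ModuleParallelism`, `dense_world_size`, and `expert_view` are hypothetical names, and the sketch assumes the expert view re-partitions the per-pipeline-stage rank span (tp * cp * dp) as expt_tp * ep * expt_dp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleParallelism:
    """Hypothetical per-module parallelism settings (illustration only)."""

    tp: int = 1
    cp: int = 1
    dp: int = 1
    pp: int = 1
    ep: int = 1   # expert_model_parallel_size
    etp: int = 1  # expert_tensor_parallel_size

    @property
    def dense_world_size(self) -> int:
        # EP/ETP are excluded on purpose: they re-partition the same ranks
        # rather than adding new ones.
        return self.tp * self.cp * self.dp * self.pp

    def expert_view(self) -> dict:
        """Factor the per-stage rank span as expt_tp * ep * expt_dp."""
        per_stage = self.tp * self.cp * self.dp
        expert_mp = self.etp * self.ep
        if per_stage % expert_mp != 0:
            raise ValueError(
                f"etp * ep = {expert_mp} must divide tp * cp * dp = {per_stage}"
            )
        return {
            "expt_tp": self.etp,
            "ep": self.ep,
            "expt_dp": per_stage // expert_mp,
            "pp": self.pp,
        }


# Assumed example values: a language module on 16 ranks.
language = ModuleParallelism(tp=2, cp=1, dp=4, pp=2, ep=4, etp=1)
print(language.dense_world_size)  # 16
print(language.expert_view())     # expt_tp=1, ep=4, expt_dp=2, pp=2
```

Both views multiply out to the same 16 ranks, which is the property the dense/expert split is meant to preserve.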

Signed-off-by: Li Ding <liding@nvidia.com>

copy-pr-bot Bot commented Jun 24, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@liding-nv

Contributor Author

/ok to test d3cfc95

@liding-nv

Contributor Author

/ok to test 6f1b8af

claude Bot commented Jun 24, 2026

Contributor

Review

Finding: CLI help text missing ep=N

File: examples/conversion/convert_megatron_mimo.py, line 265

The ep=N key was added to _COMPONENT_KEY_TO_FIELD (line 66) and to the docstring/error message in _parse_component_spec (lines 75, 79), but the --component help text in _add_common_args still reads:

name=tp=N[,pp=N,dp=N,cp=N,etp=N,rank_offset=N]

It should include ep=N:

name=tp=N[,pp=N,dp=N,cp=N,ep=N,etp=N,rank_offset=N]
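
For readers unfamiliar with the spec format, here is a rough sketch of how a string of this shape could be parsed. It is not the `_parse_component_spec` implementation from the PR: the function name `parse_component_spec`, the key set, and the component name `language` are assumptions made for illustration.

```python
# Keys accepted after the component name; mirrors the help text above.
ALLOWED_KEYS = {"tp", "pp", "dp", "cp", "ep", "etp", "rank_offset"}


def parse_component_spec(spec: str) -> tuple[str, dict[str, int]]:
    """Parse 'name=tp=N[,pp=N,...]' into (name, {key: int})."""
    # Split only on the first '=' so the remainder keeps its key=value pairs.
    name, sep, rest = spec.partition("=")
    if not sep or not name or not rest:
        raise ValueError(f"expected name=tp=N[,...], got {spec!r}")

    fields: dict[str, int] = {}
    for item in rest.split(","):
        key, sep, value = item.partition("=")
        if not sep or key not in ALLOWED_KEYS:
            raise ValueError(f"unknown or malformed entry {item!r} in {spec!r}")
        if key in fields:
            raise ValueError(f"duplicate key {key!r} in {spec!r}")
        fields[key] = int(value)  # int() raises ValueError on non-integers

    if "tp" not in fields:
        raise ValueError(f"tp=N is required in {spec!r}")
    return name, fields


name, fields = parse_component_spec("language=tp=2,pp=1,ep=4,etp=1,rank_offset=8")
print(name, fields)
```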


Everything else looks solid. The rank algebra (dense_model_parallel_size excluding EP/ETP), the expert factorization validation, the tiling validation upgrade (gaps + world_size coverage), the Phase 1 guards at both the config and builder layers, and the provider sync are all consistent and well-tested. The test coverage is thorough across all changed code paths.
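
As an illustration of what a tiling check with gap and world_size coverage can look like, here is a minimal sketch. It is not the validation code from this PR: `validate_tiling`, the `(rank_offset, size)` representation, and the module names are assumptions.

```python
def validate_tiling(ranges: dict[str, tuple[int, int]], world_size: int) -> None:
    """Check that module rank ranges cover [0, world_size) exactly.

    `ranges` maps a module name to (rank_offset, size), i.e. the half-open
    interval [rank_offset, rank_offset + size).
    """
    expected = 0  # next rank that must be claimed by some module
    for name, (offset, size) in sorted(ranges.items(), key=lambda kv: kv[1][0]):
        if size <= 0:
            raise ValueError(f"{name}: size must be positive, got {size}")
        if offset < expected:
            raise ValueError(f"{name}: overlaps previous module at rank {offset}")
        if offset > expected:
            raise ValueError(f"{name}: gap in ranks [{expected}, {offset})")
        expected = offset + size
    if expected != world_size:
        raise ValueError(f"modules cover {expected} ranks, world_size is {world_size}")


# Assumed layout: a dense encoder on ranks 0-3, an MoE language module on 4-15.
validate_tiling({"encoder": (0, 4), "language": (4, 12)}, world_size=16)
```

Sorting by offset makes each failure mode (overlap, gap, short or long coverage) a single comparison against the next expected rank.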

Suggested test cases: No perf tests impacted.

@liding-nv liding-nv marked this pull request as draft June 24, 2026 20:15
Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv liding-nv changed the title from "add MegatronMIMO expert parallel config" to "Enable expert-parallel language modules in MegatronMIMO" Jun 25, 2026
Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv liding-nv force-pushed the liding/mimo-moe-rank-config branch from 6f1b8af to b27324c June 26, 2026 15:45