[trainer, rollout] feat: log rollout moe load-balance metrics #6853
Luosuu wants to merge 7 commits into
Code Review
This pull request replaces the VeOmni-specific MoE load-balance monitor with a rollout-side mechanism for logging MoE load-balance metrics, controlled by moe_load_balance_metrics_interval and based on routing replay data. It introduces a metrics accumulator and helper functions in metric_utils.py, integrates them into the PPO trainer, updates configuration files, and adds unit tests. The review feedback highlights two improvements. First, _get_nested_attr should be made more robust so that it supports OmegaConf DictConfig objects, by checking hasattr(obj, 'get') instead of isinstance(obj, dict). Second, _compute_rollout_moe_load_counts should apply the boolean mask on the GPU before transferring the tensor to the CPU, which reduces host-device transfer overhead.
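The first suggestion can be illustrated with a minimal sketch. This is not the PR's actual `_get_nested_attr`; the name `get_nested_attr`, its signature, and the dotted-path convention are assumptions made for illustration. The point is only that duck-typing on `.get` accepts plain dicts and dict-like config objects alike, while an `isinstance(obj, dict)` check rejects the latter.

```python
from typing import Any

_MISSING = object()  # sentinel so that stored None values are not mistaken for "absent"


def get_nested_attr(obj: Any, dotted_path: str, default: Any = None) -> Any:
    """Hypothetical helper: resolve "a.b.c" through mappings or plain objects."""
    current = obj
    for key in dotted_path.split("."):
        if current is None:
            return default
        if hasattr(current, "get"):
            # Duck-typed mapping access: works for dict and for dict-like
            # config objects that are not dict subclasses.
            current = current.get(key, _MISSING)
        else:
            # Fall back to attribute access for plain objects.
            current = getattr(current, key, _MISSING)
        if current is _MISSING:
            return default
    return current


class FakeConfig:
    """Stand-in for a dict-like config object that is not a dict subclass."""

    def __init__(self, data: dict):
        self._data = data

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)


cfg = FakeConfig({"rollout": FakeConfig({"interval": 4})})
interval = get_nested_attr(cfg, "rollout.interval", default=0)
```

One trade-off of this design: any object that happens to expose a `get` attribute is treated as a mapping, so the check is broader than `isinstance(obj, dict)` by intent.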
9fe0ee6 to 1c84b58
1c84b58 to e4608af
1b913a3 to 854dd68
Update:
Addressed the latest review comments in
Validated with ruff, py_compile, and the rollout MoE CPU test suite. The 100-step follow-up smoke run completed.
48b976d to 8adfcff
Follow-up: amended the review-feedback commit to keep
Revalidated with ruff, py_compile, the rollout MoE CPU tests, and the commit hooks.
8adfcff to 9e540b3
Follow-up: slimmed
The num-experts lazy inference cache and one-time skip warning state now live in
9e540b3 to 007de1e
Follow-up: reduced the
Revalidated with ruff, py_compile, the rollout MoE CPU tests, commit hooks, and the mandatory review gate.
007de1e to 6a463a1
Follow-up: kept the expert-count helpers split by behavior in
Revalidated with ruff, py_compile, the rollout MoE CPU tests, commit hooks, and the mandatory review gate.
6a463a1 to aeb0831
Follow-up: addressed the robustness findings for missing
aeb0831 to d1094b2
Follow-up: removed the unnecessary private wrapper around
d1094b2 to e28d6b9
Updated the branch to address the
Root cause: the keyed-warning unit test depended on
Local verification:
Summary
- rollout/moe/* metrics
- rollout.moe_load_balance_metrics_interval knob, defaulting to disabled
- override_config.model_config

Testing
- ruff format verl/trainer/main_ppo.py verl/trainer/ppo/v1/trainer_base.py verl/trainer/ppo/metric_utils.py tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py
- ruff check verl/trainer/main_ppo.py verl/trainer/ppo/v1/trainer_base.py verl/trainer/ppo/metric_utils.py tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py
- python -m py_compile verl/trainer/main_ppo.py verl/trainer/ppo/v1/trainer_base.py verl/trainer/ppo/metric_utils.py
- PYTHONPATH=. pytest -q tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py
- scripts/generate_trainer_config.sh
- moe_load_balance_metrics_interval=1; confirmed rollout/moe/* scalar metrics are uploaded through the main experiment tracker
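
For readers unfamiliar with the idea, the following is a rough, dependency-free sketch of how per-expert load counts and an interval-gated metric could be derived from routing data. It is not the code in this PR: the function names, the max-over-mean imbalance ratio, and the convention that a non-positive interval means "disabled" are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def expert_load_counts(
    routed_experts: Sequence[Sequence[int]],
    keep_mask: Sequence[bool],
    num_experts: int,
) -> List[int]:
    """Count how often each expert was selected, over unmasked tokens only."""
    counts = [0] * num_experts
    for expert_ids, keep in zip(routed_experts, keep_mask):
        if not keep:
            continue  # skip masked-out tokens (e.g. padding)
        for expert_id in expert_ids:
            counts[expert_id] += 1
    return counts


def max_over_mean(counts: Sequence[int]) -> float:
    """Hypothetical imbalance score: 1.0 is perfectly balanced, larger is worse."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return max(counts) / (total / len(counts))


def should_log(step: int, interval: int) -> bool:
    """Assumed gating rule: a non-positive interval disables logging."""
    return interval > 0 and step % interval == 0


# Three tokens, top-2 routing, four experts; the last token is masked out.
routed = [[0, 1], [0, 2], [3, 3]]
mask = [True, True, False]
counts = expert_load_counts(routed, mask, num_experts=4)
imbalance = max_over_mean(counts)
```

In this sketch, filtering by the mask before counting mirrors the review's advice to drop masked tokens as early as possible, so that less data has to be moved or processed downstream.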