[trainer, rollout] feat: log rollout moe load-balance metrics by Luosuu · Pull Request #6853 · verl-project/verl

Luosuu · 2026-06-25T23:53:06Z

Summary

add rollout MoE load-balance metrics from routed expert replay data
aggregate routed expert counts across each logging interval before reporting rollout/moe/* metrics
report through the normal v1 trainer metrics path and remove the VeOmni actor-side MoE load-balance monitor
add an opt-in rollout.moe_load_balance_metrics_interval knob, defaulting to disabled
include replay diagnostics so runs can distinguish missing routed expert replay data from missing load-balance calculations
infer the global routed expert count from HF-style model config fields, including nested override_config.model_config
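
As a rough illustration of the interval aggregation described above, the sketch below sums routed-expert counts over several steps and only emits metrics when the interval closes. The class name, the metric keys under rollout/moe/, and the chosen statistics are all hypothetical; the PR text does not list the actual metric names or formulas.

```python
from collections import Counter
from statistics import mean, pstdev


class IntervalExpertLoadAccumulator:
    """Hypothetical accumulator: sums routed-expert counts over `interval` steps."""

    def __init__(self, num_experts: int, interval: int):
        self.num_experts = num_experts
        self.interval = interval  # 0 disables the metrics (opt-in default)
        self.counts = Counter()
        self.steps_seen = 0

    def update(self, routed_expert_ids):
        """Add one step's routed expert ids; return metrics only when the interval closes."""
        if self.interval <= 0:
            return {}
        self.counts.update(routed_expert_ids)
        self.steps_seen += 1
        if self.steps_seen < self.interval:
            return {}
        metrics = self._summarize()
        self.counts.clear()
        self.steps_seen = 0
        return metrics

    def _summarize(self):
        # Experts that never appear still count as zero load.
        loads = [self.counts.get(e, 0) for e in range(self.num_experts)]
        if sum(loads) == 0:
            return {}
        avg = mean(loads)
        return {
            # Metric keys below are illustrative, not the PR's actual names.
            "rollout/moe/max_over_mean_load": max(loads) / avg,
            "rollout/moe/load_cv": pstdev(loads) / avg,
            "rollout/moe/unused_expert_frac": loads.count(0) / self.num_experts,
        }
```

With num_experts=4 and interval=2, the first update returns an empty dict and the second returns the aggregated ratios for both steps combined.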

Testing

ruff format verl/trainer/main_ppo.py verl/trainer/ppo/v1/trainer_base.py verl/trainer/ppo/metric_utils.py tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py
ruff check verl/trainer/main_ppo.py verl/trainer/ppo/v1/trainer_base.py verl/trainer/ppo/metric_utils.py tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py
python -m py_compile verl/trainer/main_ppo.py verl/trainer/ppo/v1/trainer_base.py verl/trainer/ppo/metric_utils.py
PYTHONPATH=. pytest -q tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py
scripts/generate_trainer_config.sh
smoke-tested Qwen3.5-35B-A3B with vLLM rollout and moe_load_balance_metrics_interval=1; confirmed rollout/moe/* scalar metrics are uploaded through the main experiment tracker
completed a 100-step follow-up run with the same rollout MoE metrics path enabled

gemini-code-assist

Code Review

This pull request replaces the VeOmni-specific MoE load-balance monitor with rollout-side MoE load-balance metrics logging, controlled by moe_load_balance_metrics_interval and based on routing replay data. It introduces a metrics accumulator and helper functions in metric_utils.py, integrates them into the PPO trainer, updates configuration files, and adds unit tests. The review feedback makes two suggestions. First, for robustness, _get_nested_attr should support OmegaConf DictConfig objects by checking hasattr(obj, 'get') instead of isinstance(obj, dict). Second, for performance, _compute_rollout_moe_load_counts should apply the boolean mask on the GPU before transferring the tensor to the CPU, which reduces host-device transfer overhead.
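
The first suggestion can be pictured with a small duck-typed lookup. This is a minimal sketch, not the PR's _get_nested_attr: the function name get_nested and the dotted-path interface are assumptions made for illustration. The point is only that checking for a get method accepts plain dicts and dict-like config objects alike, where isinstance(obj, dict) would reject the latter.

```python
from types import SimpleNamespace


def get_nested(obj, dotted_path, default=None):
    """Walk a dotted path through mappings or attribute-style objects (hypothetical helper)."""
    current = obj
    for key in dotted_path.split("."):
        if current is None:
            return default
        if hasattr(current, "get"):
            # Covers dict and dict-like configs that expose .get(key, default).
            current = current.get(key, None)
        else:
            current = getattr(current, key, None)
    return default if current is None else current


# Mixed nesting: an attribute-style object that holds a plain dict.
cfg = SimpleNamespace(override_config={"model_config": {"num_experts": 8}})
found = get_nested(cfg, "override_config.model_config.num_experts")
missing = get_nested(cfg, "override_config.absent.num_experts", default=0)
```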

Luosuu · 2026-06-26T18:39:37Z

Update:

moved the rollout MoE metrics integration to the v1 trainer path and removed the deprecated ray_trainer.py integration
included the DictConfig-compatible nested config lookup and device-side mask selection before CPU transfer
updated the PR description with the current implementation and validation notes
smoke test confirmed rollout/moe/* scalar metrics are reported through the main experiment tracker; a 100-step follow-up run has also been submitted

Luosuu · 2026-06-27T03:30:42Z

Addressed the latest review comments in 48b976d:

removed the legacy trainer guard so v0 is not blocked by this opt-in config
moved the HF override helper into metric_utils.py
switched rollout config access to .get("moe_load_balance_metrics_interval", 0)
added routed_experts to the main kv_batch_get fields; if an older rollout path does not provide it, the code logs once and retries the base fields without issuing a separate routed-experts-only fetch
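
The fallback in the last item might look roughly like the sketch below. The base field names, the keyword signature of kv_batch_get, and the use of KeyError to signal a missing field are all assumptions; only the shape of the logic (one combined request, one logged warning, one retry with base fields) comes from the comment above.

```python
import logging

logger = logging.getLogger("rollout_moe_sketch")

BASE_FIELDS = ["responses", "response_mask"]  # hypothetical base field names


class OptionalRoutedExpertsFetcher:
    """Requests routed_experts with the base fields and falls back to base fields only."""

    def __init__(self, kv_batch_get):
        self._kv_batch_get = kv_batch_get  # injected callable, signature assumed
        self._warned = False

    def fetch(self, keys):
        try:
            return self._kv_batch_get(keys=keys, fields=BASE_FIELDS + ["routed_experts"])
        except KeyError:  # assumption: a missing field surfaces as KeyError
            if not self._warned:
                logger.warning("routed_experts not provided; skipping rollout MoE metrics")
                self._warned = True
            # Retry with the base fields only; no separate routed-experts-only fetch.
            return self._kv_batch_get(keys=keys, fields=BASE_FIELDS)


calls = []


def old_rollout_get(keys, fields):
    """Stand-in for an older rollout path that does not store routed_experts."""
    calls.append(list(fields))
    if "routed_experts" in fields:
        raise KeyError("routed_experts")
    return {name: [0] * len(keys) for name in fields}


data = OptionalRoutedExpertsFetcher(old_rollout_get).fetch(["a", "b"])
```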

Validated with ruff, py_compile, and the rollout MoE CPU test suite. The 100-step follow-up smoke run completed.

Luosuu · 2026-06-27T03:45:57Z

Follow-up: amended the review-feedback commit to keep trainer_base.py shorter. The HF config loading, override application, and rollout MoE expert-count inference now live in metric_utils.py; trainer_base.py only keeps the cached call site and warning state.

Revalidated with ruff, py_compile, the rollout MoE CPU tests, and the commit hooks.

Luosuu · 2026-06-27T03:58:18Z

Follow-up: slimmed trainer_base.py further in 9e540b3a.

The num-experts lazy inference cache and one-time skip warning state now live in RolloutMoELoadBalanceMetricsAccumulator, so trainer_base.py only wires the accumulator into the trainer flow and keeps the TransferQueue fallback logic at the call site. Revalidated with ruff, py_compile, the rollout MoE CPU tests, commit hooks, and the mandatory review gate.

Luosuu · 2026-06-27T04:11:12Z

Follow-up: reduced the trainer_base.py MoE diff further in 007de1e.

trainer_base.py no longer has the local MoE try/except or the interval aggregation block. Optional routed_experts fetch fallback now lives in get_metric_data_with_optional_routed_experts(...), and interval update/flush lives in compute_moe_lb_metrics(...). kv_batch_get is injected from the trainer so metric_utils.py does not gain a module-level TransferQueue dependency.

Revalidated with ruff, py_compile, the rollout MoE CPU tests, commit hooks, and the mandatory review gate.

Luosuu · 2026-06-27T04:16:16Z

Follow-up: kept the expert-count helpers split by behavior in 6a463a1.

infer_moe_num_experts(...) remains the cheap in-memory config probe, while infer_rollout_moe_num_experts(...) remains the rollout wrapper that may load the HF config only after the pure probe fails. The duplicated field-scanning logic is now shared through a private helper, so the public API and behavior boundary stay intact.
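
The split between a cheap probe and a loading fallback can be sketched as follows. The function names, the list of scanned fields, and the injected load_hf_config callable are illustrative assumptions rather than the PR's code; in practice such a loader might wrap transformers.AutoConfig.from_pretrained, but that is not confirmed by the thread.

```python
# Field names used by some HF MoE configs; the PR's actual list is not shown.
_EXPERT_COUNT_FIELDS = ("num_experts", "num_local_experts", "n_routed_experts")


def probe_num_experts(config):
    """Cheap in-memory probe: returns None when no known field is set."""
    for name in _EXPERT_COUNT_FIELDS:
        if isinstance(config, dict):
            value = config.get(name)
        else:
            value = getattr(config, name, None)
        if isinstance(value, int) and value > 0:
            return value
    return None


def probe_with_fallback(config, load_hf_config):
    """Try the pure probe first; call the (possibly slow) loader only if it fails."""
    found = probe_num_experts(config)
    if found is not None:
        return found
    return probe_num_experts(load_hf_config())


loader_calls = []


def fake_loader():
    """Hypothetical stand-in for loading the HF config from disk or the hub."""
    loader_calls.append(1)
    return {"num_local_experts": 8}


fast = probe_with_fallback({"num_experts": 128}, fake_loader)  # loader not called
slow = probe_with_fallback({}, fake_loader)  # loader called once
```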

Revalidated with ruff, py_compile, the rollout MoE CPU tests, commit hooks, and the mandatory review gate.

Luosuu · 2026-06-27T05:13:55Z

Follow-up: addressed the robustness findings for missing routed_experts and warning suppression in aeb0831.

Missing routed_experts no longer retries the optional field on every metrics call. A failed optional fetch now backs off until the next metrics interval, then retries, avoiding both repeated exception/refetch cost and permanent disable.
MoE skip warnings are keyed (missing_routed_experts, num_experts_missing, etc.) so one warning category does not hide another production failure mode.
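
One way to read the two behaviors above is sketched below: a failed optional fetch sets the next interval boundary as the earliest retry step, and warnings are deduplicated per key rather than globally. The class and method names are hypothetical, and the exact boundary rule is an assumption about what "the next metrics interval" means.

```python
import logging

logger = logging.getLogger("rollout_moe_sketch")


class OptionalFetchState:
    """Hypothetical state holder for interval backoff and keyed one-time warnings."""

    def __init__(self, interval: int):
        self.interval = interval  # assumed > 0 when metrics are enabled
        self.retry_at_step = 0
        self.warned_keys = set()

    def should_try(self, global_step: int) -> bool:
        return global_step >= self.retry_at_step

    def record_failure(self, global_step: int) -> None:
        # Back off until the first interval boundary after this step.
        self.retry_at_step = (global_step // self.interval + 1) * self.interval

    def warn_once(self, key: str, message: str) -> bool:
        """Log at most once per key, so one category cannot hide another."""
        if key in self.warned_keys:
            return False
        self.warned_keys.add(key)
        logger.warning("%s: %s", key, message)
        return True


state = OptionalFetchState(interval=4)
state.record_failure(global_step=1)  # next attempt allowed at step 4
```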

trainer_base.py only gained the global_steps argument needed for interval retry; the MoE logic remains in metric_utils.py. Revalidated with ruff, py_compile, the rollout MoE CPU tests, commit hooks, and review gate.

Luosuu · 2026-06-27T05:17:08Z

Follow-up: removed the unnecessary private wrapper around infer_moe_num_experts(...) in d1094b2.

infer_moe_num_experts(...) now directly implements the cheap in-memory probe, while infer_rollout_moe_num_experts(...) remains the HF-loading fallback wrapper. Revalidated with ruff, py_compile, the rollout MoE CPU tests, commit hooks, and review gate.

Luosuu · 2026-06-27T06:29:10Z

Updated the branch to address the cpu_unit_tests failure.

Root cause: the keyed-warning unit test depended on caplog.messages, which was empty in the CI environment even though the warning path was exercised locally. The test now monkeypatches the module logger directly, so it validates the accumulator one-warning-per-key behavior without relying on logging capture internals.
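
The testing approach can be illustrated with a stand-in. The sketch below uses unittest.mock.patch.object to replace the logger's warning method, which is a substitute for the pytest monkeypatch fixture mentioned above; WarnOnce is a hypothetical minimal subject, not the PR's accumulator. The check inspects the recorded calls directly instead of depending on log capture.

```python
import logging
from unittest import mock

logger = logging.getLogger("rollout_moe_sketch")


class WarnOnce:
    """Hypothetical minimal subject under test: one warning per key."""

    def __init__(self):
        self._seen = set()

    def warn(self, key, message):
        if key not in self._seen:
            self._seen.add(key)
            logger.warning("%s: %s", key, message)


def collect_warned_keys():
    """Return the keys passed to logger.warning, in call order."""
    subject = WarnOnce()
    with mock.patch.object(logger, "warning") as fake_warning:
        subject.warn("missing_routed_experts", "no replay data")
        subject.warn("missing_routed_experts", "no replay data")
        subject.warn("num_experts_missing", "cannot infer expert count")
    # Each call was warning("%s: %s", key, message); the key is positional arg 1.
    return [call.args[1] for call in fake_warning.call_args_list]
```

Because the assertion targets the patched method's call list, the result does not depend on handler configuration or log propagation in the CI environment.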

Local verification:

ruff format tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py
ruff check tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py verl/trainer/ppo/metric_utils.py verl/trainer/ppo/v1/trainer_base.py
PYTHONPATH=. pytest -q tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py

Luosuu requested review from FoolPlayer, PeterSH6, eric-haibin-lin, tongyx361, vermouth1992, wucong25 and wuxibin89 as code owners June 25, 2026 23:53

Luosuu mentioned this pull request Jun 25, 2026

[ppo] feat: log rollout moe load-balance metrics #6852

Closed

gemini-code-assist Bot reviewed Jun 25, 2026

Comment thread verl/trainer/ppo/metric_utils.py

Comment thread verl/trainer/ppo/metric_utils.py Outdated

Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 9fe0ee6 to 1c84b58 Compare June 25, 2026 23:56

Luosuu changed the title from "[ppo] feat: log rollout moe load-balance metrics" to "[trainer, rollout] feat: log rollout moe load-balance metrics" Jun 25, 2026

feat: log rollout moe load-balance metrics

e4608af

Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 1c84b58 to e4608af Compare June 26, 2026 01:18

wuxibin89 requested changes Jun 26, 2026

Comment thread verl/trainer/ppo/ray_trainer.py Outdated

Luosuu added 2 commits June 26, 2026 03:09

[trainer] fix rollout moe lb metrics v1 integration

b914ea3

[trainer] validate rollout moe lb metrics trainer mode

854dd68

Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 1b913a3 to 854dd68 Compare June 26, 2026 03:16

fix: infer rollout moe expert count from hf config

04141f8

Luosuu marked this pull request as draft June 26, 2026 05:02

Luosuu added 2 commits June 26, 2026 05:30

fix: report rollout moe replay diagnostics

7f244e3

fix: align rollout moe hf config overrides

33b30af

Luosuu marked this pull request as ready for review June 26, 2026 18:36

wuxibin89 reviewed Jun 27, 2026

Comment thread verl/trainer/main_ppo.py Outdated

Comment thread verl/trainer/ppo/v1/trainer_base.py Outdated

Comment thread verl/trainer/ppo/v1/trainer_base.py Outdated

Comment thread verl/trainer/ppo/v1/trainer_base.py Outdated

Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 48b976d to 8adfcff Compare June 27, 2026 03:45

Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 8adfcff to 9e540b3 Compare June 27, 2026 03:58

Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 9e540b3 to 007de1e Compare June 27, 2026 04:11

Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 007de1e to 6a463a1 Compare June 27, 2026 04:16

Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 6a463a1 to aeb0831 Compare June 27, 2026 05:13

Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from aeb0831 to d1094b2 Compare June 27, 2026 05:16

fix: address rollout moe review feedback

e28d6b9

Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from d1094b2 to e28d6b9 Compare June 27, 2026 06:28


2 participants
