Skip to content

[trainer, rollout] feat: log rollout moe load-balance metrics#6853

Open
Luosuu wants to merge 7 commits into
mainfrom
codex/rollout-moe-lb-metrics-pr
Open

[trainer, rollout] feat: log rollout moe load-balance metrics#6853
Luosuu wants to merge 7 commits into
mainfrom
codex/rollout-moe-lb-metrics-pr

Conversation

@Luosuu

@Luosuu Luosuu commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • add rollout MoE load-balance metrics from routed expert replay data
  • aggregate routed expert counts across each logging interval before reporting rollout/moe/* metrics
  • report through the normal v1 trainer metrics path and remove the VeOmni actor-side MoE load-balance monitor
  • add an opt-in rollout.moe_load_balance_metrics_interval knob, defaulting to disabled
  • include replay diagnostics so runs can distinguish missing routed expert replay data from missing load-balance calculations
  • infer the global routed expert count from HF-style model config fields, including nested override_config.model_config

Testing

  • ruff format verl/trainer/main_ppo.py verl/trainer/ppo/v1/trainer_base.py verl/trainer/ppo/metric_utils.py tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py
  • ruff check verl/trainer/main_ppo.py verl/trainer/ppo/v1/trainer_base.py verl/trainer/ppo/metric_utils.py tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py
  • python -m py_compile verl/trainer/main_ppo.py verl/trainer/ppo/v1/trainer_base.py verl/trainer/ppo/metric_utils.py
  • PYTHONPATH=. pytest -q tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py
  • scripts/generate_trainer_config.sh
  • smoke-tested Qwen3.5-35B-A3B with vLLM rollout and moe_load_balance_metrics_interval=1; confirmed rollout/moe/* scalar metrics are uploaded through the main experiment tracker
  • completed a 100-step follow-up run with the same rollout MoE metrics path enabled

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces the VeOmni-specific MoE load-balance monitor with a rollout-side MoE load-balance metrics logging mechanism (moe_load_balance_metrics_interval) based on routing replay data. It introduces a metrics accumulator and helper functions in metric_utils.py, integrates them into the PPO trainer, updates configuration files, and adds unit tests. The review feedback highlights two important optimization opportunities: first, making _get_nested_attr more robust to support OmegaConf DictConfig objects by using hasattr(obj, 'get') instead of isinstance(obj, dict); second, improving performance in _compute_rollout_moe_load_counts by applying the boolean mask on the GPU before transferring the tensor to the CPU to minimize host-device transfer overhead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread verl/trainer/ppo/metric_utils.py
Comment thread verl/trainer/ppo/metric_utils.py Outdated
@Luosuu Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 9fe0ee6 to 1c84b58 Compare June 25, 2026 23:56
@Luosuu Luosuu changed the title [ppo] feat: log rollout moe load-balance metrics [trainer, rollout] feat: log rollout moe load-balance metrics Jun 25, 2026
@Luosuu Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 1c84b58 to e4608af Compare June 26, 2026 01:18
Comment thread verl/trainer/ppo/ray_trainer.py Outdated
@Luosuu Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 1b913a3 to 854dd68 Compare June 26, 2026 03:16
@Luosuu Luosuu marked this pull request as draft June 26, 2026 05:02
@Luosuu Luosuu marked this pull request as ready for review June 26, 2026 18:36
@Luosuu

Luosuu commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator Author

Update:

  • moved the rollout MoE metrics integration to the v1 trainer path and removed the deprecated ray_trainer.py integration
  • included the DictConfig-compatible nested config lookup and device-side mask selection before CPU transfer
  • updated the PR description with the current implementation and validation notes
  • smoke test confirmed rollout/moe/* scalar metrics are reported through the main experiment tracker; a 100-step follow-up run has also been submitted

Comment thread verl/trainer/main_ppo.py Outdated
Comment thread verl/trainer/ppo/v1/trainer_base.py Outdated
Comment thread verl/trainer/ppo/v1/trainer_base.py Outdated
Comment thread verl/trainer/ppo/v1/trainer_base.py Outdated
@Luosuu

Luosuu commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Addressed the latest review comments in 48b976d:

  • removed the legacy trainer guard so v0 is not blocked by this opt-in config
  • moved the HF override helper into metric_utils.py
  • switched rollout config access to .get("moe_load_balance_metrics_interval", 0)
  • added routed_experts to the main kv_batch_get fields; if an older rollout path does not provide it, the code logs once and retries the base fields without issuing a separate routed-experts-only fetch

Validated with ruff, py_compile, and the rollout MoE CPU test suite. The 100-step follow-up smoke run completed.

@Luosuu Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 48b976d to 8adfcff Compare June 27, 2026 03:45
@Luosuu

Luosuu commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Follow-up: amended the review-feedback commit to keep trainer_base.py shorter. The HF config loading, override application, and rollout MoE expert-count inference now live in metric_utils.py; trainer_base.py only keeps the cached call site and warning state.

Revalidated with ruff, py_compile, the rollout MoE CPU tests, and the commit hooks.

@Luosuu Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 8adfcff to 9e540b3 Compare June 27, 2026 03:58
@Luosuu

Luosuu commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Follow-up: slimmed trainer_base.py further in 9e540b3a.

The num-experts lazy inference cache and one-time skip warning state now live in RolloutMoELoadBalanceMetricsAccumulator, so trainer_base.py only wires the accumulator into the trainer flow and keeps the TransferQueue fallback logic at the call site. Revalidated with ruff, py_compile, the rollout MoE CPU tests, commit hooks, and the mandatory review gate.

@Luosuu Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 9e540b3 to 007de1e Compare June 27, 2026 04:11
@Luosuu

Luosuu commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Follow-up: reduced the trainer_base.py MoE diff further in 007de1e.

trainer_base.py no longer has the local MoE try/except or the interval aggregation block. Optional routed_experts fetch fallback now lives in get_metric_data_with_optional_routed_experts(...), and interval update/flush lives in compute_moe_lb_metrics(...). kv_batch_get is injected from the trainer so metric_utils.py does not gain a module-level TransferQueue dependency.

Revalidated with ruff, py_compile, the rollout MoE CPU tests, commit hooks, and the mandatory review gate.

@Luosuu Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 007de1e to 6a463a1 Compare June 27, 2026 04:16
@Luosuu

Luosuu commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Follow-up: kept the expert-count helpers split by behavior in 6a463a1.

infer_moe_num_experts(...) remains the cheap in-memory config probe, while infer_rollout_moe_num_experts(...) remains the rollout wrapper that may load the HF config only after the pure probe fails. The duplicated field-scanning logic is now shared through a private helper, so the public API and behavior boundary stay intact.

Revalidated with ruff, py_compile, the rollout MoE CPU tests, commit hooks, and the mandatory review gate.

@Luosuu Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from 6a463a1 to aeb0831 Compare June 27, 2026 05:13
@Luosuu

Luosuu commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Follow-up: addressed the robustness findings for missing routed_experts and warning suppression in aeb0831.

  • Missing routed_experts no longer retries the optional field on every metrics call. A failed optional fetch now backs off until the next metrics interval, then retries, avoiding both repeated exception/refetch cost and permanent disable.
  • MoE skip warnings are keyed (missing_routed_experts, num_experts_missing, etc.) so one warning category does not hide another production failure mode.

trainer_base.py only gained the global_steps argument needed for interval retry; the MoE logic remains in metric_utils.py. Revalidated with ruff, py_compile, the rollout MoE CPU tests, commit hooks, and review gate.

@Luosuu Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from aeb0831 to d1094b2 Compare June 27, 2026 05:16
@Luosuu

Luosuu commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Follow-up: removed the unnecessary private wrapper around infer_moe_num_experts(...) in d1094b2.

infer_moe_num_experts(...) now directly implements the cheap in-memory probe, while infer_rollout_moe_num_experts(...) remains the HF-loading fallback wrapper. Revalidated with ruff, py_compile, the rollout MoE CPU tests, commit hooks, and review gate.

@Luosuu Luosuu force-pushed the codex/rollout-moe-lb-metrics-pr branch from d1094b2 to e28d6b9 Compare June 27, 2026 06:28
@Luosuu

Luosuu commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Updated the branch to address the cpu_unit_tests failure.

Root cause: the keyed-warning unit test depended on caplog.messages, which was empty in the CI environment even though the warning path was exercised locally. The test now monkeypatches the module logger directly, so it validates the accumulator one-warning-per-key behavior without relying on logging capture internals.

Local verification:

  • ruff format tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py
  • ruff check tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py verl/trainer/ppo/metric_utils.py verl/trainer/ppo/v1/trainer_base.py
  • PYTHONPATH=. pytest -q tests/trainer/ppo/test_rollout_moe_lb_metrics_on_cpu.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants