Skip to content

[megatron] reduce R3 router replay memory with per-trajectory traces#6860

Open
tntnnlrw wants to merge 2 commits into
verl-project:mainfrom
tntnnlrw:codex/r3-router-replay-per-traj
Open

[megatron] reduce R3 router replay memory with per-trajectory traces#6860
tntnnlrw wants to merge 2 commits into
verl-project:mainfrom
tntnnlrw:codex/r3-router-replay-per-traj

Conversation

@tntnnlrw

Copy link
Copy Markdown

Summary

  • Add an opt-in VERL_R3_SPEEDUP=1 path that keeps R3 rollout routing traces as per-trajectory [actual_len, num_layers, topk] arrays instead of materializing dense [batch, seq_len, num_layers, topk] tensors in the agent loop.
  • Convert those per-trajectory traces to nested tensors during no-padding conversion, so the existing Megatron R3 replay path can consume them without engine changes.
  • Preserve the default R3 behavior when VERL_R3_SPEEDUP is unset or disabled.

Motivation

For long-context R3 rollouts, the dense padded routed-experts tensor can add avoidable memory pressure and copies before the current nested/no-padding Megatron path. This PR keeps the routing traces in their natural per-trajectory form until the no-padding conversion step.

Tests

  • PYTHONPATH=. pytest tests/utils/test_padding_on_cpu.py -q
  • PYTHONPATH=. pytest tests/experimental/agent_loop/test_agent_loop_extra_fields_schema_on_cpu.py -q
  • python -m ruff check verl/experimental/agent_loop/agent_loop.py verl/workers/utils/padding.py tests/utils/test_padding_on_cpu.py tests/experimental/agent_loop/test_agent_loop_extra_fields_schema_on_cpu.py
  • python -m py_compile verl/experimental/agent_loop/agent_loop.py verl/workers/utils/padding.py tests/utils/test_padding_on_cpu.py tests/experimental/agent_loop/test_agent_loop_extra_fields_schema_on_cpu.py
  • git diff --check

Checklist

  • CPU-only tests added
  • Default R3 path remains unchanged
  • No internal scripts or end-to-end debug files included

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a per-trajectory routing trace fast path ('R3 speedup') to optimize memory and performance in the agent loop and padding utilities, including functions to narrow routed expert dtypes and handle unpadded per-trajectory data. The review feedback highlights critical performance bottlenecks caused by synchronous GPU-to-CPU transfers (device-to-host copies). Specifically, it recommends avoiding .item() calls on GPU tensors by checking .is_cuda in narrow_routed_experts, converting seqlens to a CPU list before looping in _align_r3_per_traj_tensors, and removing a redundant narrowing operation on the concatenated GPU tensor flat.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +39 to +46
def narrow_routed_experts(routed_experts):
"""Convert routed expert ids to the smallest integer dtype that preserves their values."""
if routed_experts is None:
return None

if isinstance(routed_experts, torch.Tensor):
if routed_experts.numel() == 0:
return routed_experts.to(torch.uint8)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If routed_experts is a PyTorch tensor on GPU, calling .min().item() and .max().item() will trigger synchronous device-to-host copies, blocking the CPU thread. Add a check to return the tensor immediately if it is already on a CUDA device, as it should have already been narrowed on CPU before being moved to GPU.

Suggested change
def narrow_routed_experts(routed_experts):
"""Convert routed expert ids to the smallest integer dtype that preserves their values."""
if routed_experts is None:
return None
if isinstance(routed_experts, torch.Tensor):
if routed_experts.numel() == 0:
return routed_experts.to(torch.uint8)
def narrow_routed_experts(routed_experts):
"""Convert routed expert ids to the smallest integer dtype that preserves their values."""
if routed_experts is None:
return None
if isinstance(routed_experts, torch.Tensor):
if routed_experts.is_cuda:
return routed_experts
if routed_experts.numel() == 0:
return routed_experts.to(torch.uint8)

Comment thread verl/workers/utils/padding.py Outdated
Comment on lines +123 to +127
def _align_r3_per_traj_tensors(per_traj_list, seqlens, target_device):
aligned = []
for i, item in enumerate(per_traj_list):
tensor = _r3_per_traj_item_to_tensor(item).to(device=target_device)
target_len = int(seqlens[i].item())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Calling seqlens[i].item() inside a loop over the batch size triggers a synchronous device-to-host copy for every single iteration when seqlens is on GPU. This introduces a severe performance bottleneck. Convert seqlens to a CPU list once before the loop using seqlens.tolist() to avoid per-iteration GPU synchronization.

def _align_r3_per_traj_tensors(per_traj_list, seqlens, target_device):
    aligned = []
    seqlens_cpu = seqlens.tolist()
    for i, item in enumerate(per_traj_list):
        tensor = _r3_per_traj_item_to_tensor(item).to(device=target_device)
        target_len = int(seqlens_cpu[i])

Comment thread verl/workers/utils/padding.py Outdated
Comment on lines +189 to +193
if r3_speedup_enabled() and "routed_experts_per_traj" in data.keys():
per_traj_stack = data.pop("routed_experts_per_traj")
per_traj_list = normalize_r3_per_traj_list(per_traj_stack)
tensors = _align_r3_per_traj_tensors(per_traj_list, attention_mask.sum(dim=-1), input_ids_rmpad.device)
flat = narrow_routed_experts(torch.cat(tensors, dim=0))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Since per_traj_list is already normalized and narrowed on CPU via normalize_r3_per_traj_list, the concatenated tensor flat is already narrowed. Calling narrow_routed_experts on the GPU tensor flat is redundant and triggers a GPU synchronization. You can safely concatenate the tensors directly.

Suggested change
if r3_speedup_enabled() and "routed_experts_per_traj" in data.keys():
per_traj_stack = data.pop("routed_experts_per_traj")
per_traj_list = normalize_r3_per_traj_list(per_traj_stack)
tensors = _align_r3_per_traj_tensors(per_traj_list, attention_mask.sum(dim=-1), input_ids_rmpad.device)
flat = narrow_routed_experts(torch.cat(tensors, dim=0))
if r3_speedup_enabled() and "routed_experts_per_traj" in data.keys():
per_traj_stack = data.pop("routed_experts_per_traj")
per_traj_list = normalize_r3_per_traj_list(per_traj_stack)
tensors = _align_r3_per_traj_tensors(per_traj_list, attention_mask.sum(dim=-1), input_ids_rmpad.device)
flat = torch.cat(tensors, dim=0)

@tntnnlrw

Copy link
Copy Markdown
Author

Pushed follow-up commit 9363794e to address the GPU sync concerns from the automated review:

  • skip min/max narrowing when routed_experts is already on CUDA
  • read R3 per-trajectory sequence lengths via one tolist() call before the loop
  • avoid re-narrowing the concatenated per-trajectory tensor in the fast path

Verified locally with:

  • PYTHONPATH=. pytest tests/utils/test_padding_on_cpu.py -q
  • PYTHONPATH=. pytest tests/experimental/agent_loop/test_agent_loop_extra_fields_schema_on_cpu.py -q
  • python -m ruff check verl/experimental/agent_loop/agent_loop.py verl/workers/utils/padding.py tests/utils/test_padding_on_cpu.py tests/experimental/agent_loop/test_agent_loop_extra_fields_schema_on_cpu.py
  • python -m py_compile ... && git diff --check

@tntnnlrw

Copy link
Copy Markdown
Author

Hi @wuxibin89, could you take a look at this PR when you have a chance? I kept the R3 per-trajectory trace path opt-in, added CPU-only tests, and pushed 9363794e to address the automated review feedback about avoiding unnecessary GPU syncs.

@tntnnlrw

Copy link
Copy Markdown
Author

Hi @HollowMan6, sorry to bother. Since you recently reviewed the Megatron router replay fallback fix in #6653, could you take a look at this related opt-in R3 per-trajectory trace path when you have bandwidth? The default path is unchanged and the PR includes CPU-only coverage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant