[megatron] reduce R3 router replay memory with per-trajectory traces by tntnnlrw · Pull Request #6860 · verl-project/verl

tntnnlrw · 2026-06-26T09:27:42Z

Summary

Add an opt-in VERL_R3_SPEEDUP=1 path that keeps R3 rollout routing traces as per-trajectory [actual_len, num_layers, topk] arrays instead of materializing dense [batch, seq_len, num_layers, topk] tensors in the agent loop.
Convert those per-trajectory traces to nested tensors during no-padding conversion, so the existing Megatron R3 replay path can consume them without engine changes.
Preserve the default R3 behavior when VERL_R3_SPEEDUP is unset or disabled.

Motivation

For long-context R3 rollouts, the dense padded routed-experts tensor can add avoidable memory pressure and copies before the current nested/no-padding Megatron path. This PR keeps the routing traces in their natural per-trajectory form until the no-padding conversion step.

Tests

PYTHONPATH=. pytest tests/utils/test_padding_on_cpu.py -q
PYTHONPATH=. pytest tests/experimental/agent_loop/test_agent_loop_extra_fields_schema_on_cpu.py -q
python -m ruff check verl/experimental/agent_loop/agent_loop.py verl/workers/utils/padding.py tests/utils/test_padding_on_cpu.py tests/experimental/agent_loop/test_agent_loop_extra_fields_schema_on_cpu.py
python -m py_compile verl/experimental/agent_loop/agent_loop.py verl/workers/utils/padding.py tests/utils/test_padding_on_cpu.py tests/experimental/agent_loop/test_agent_loop_extra_fields_schema_on_cpu.py
git diff --check

Checklist

CPU-only tests added
Default R3 path remains unchanged
No internal scripts or end-to-end debug files included

gemini-code-assist

Code Review

This pull request introduces a per-trajectory routing trace fast path ('R3 speedup') to optimize memory and performance in the agent loop and padding utilities, including functions to narrow routed expert dtypes and handle unpadded per-trajectory data. The review feedback highlights critical performance bottlenecks caused by synchronous GPU-to-CPU transfers (device-to-host copies). Specifically, it recommends avoiding .item() calls on GPU tensors by checking .is_cuda in narrow_routed_experts, converting seqlens to a CPU list before looping in _align_r3_per_traj_tensors, and removing a redundant narrowing operation on the concatenated GPU tensor flat.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-26T09:29:32Z

+def narrow_routed_experts(routed_experts):
+    """Convert routed expert ids to the smallest integer dtype that preserves their values."""
+    if routed_experts is None:
+        return None
+
+    if isinstance(routed_experts, torch.Tensor):
+        if routed_experts.numel() == 0:
+            return routed_experts.to(torch.uint8)


If routed_experts is a PyTorch tensor on GPU, calling .min().item() and .max().item() will trigger synchronous device-to-host copies, blocking the CPU thread. Add a check to return the tensor immediately if it is already on a CUDA device, as it should have already been narrowed on CPU before being moved to GPU.

Suggested change

def narrow_routed_experts(routed_experts):

"""Convert routed expert ids to the smallest integer dtype that preserves their values."""

if routed_experts is None:

return None

if isinstance(routed_experts, torch.Tensor):

if routed_experts.numel() == 0:

return routed_experts.to(torch.uint8)

def narrow_routed_experts(routed_experts):

"""Convert routed expert ids to the smallest integer dtype that preserves their values."""

if routed_experts is None:

return None

if isinstance(routed_experts, torch.Tensor):

if routed_experts.is_cuda:

return routed_experts

if routed_experts.numel() == 0:

return routed_experts.to(torch.uint8)

gemini-code-assist · 2026-06-26T09:29:32Z

+def _align_r3_per_traj_tensors(per_traj_list, seqlens, target_device):
+    aligned = []
+    for i, item in enumerate(per_traj_list):
+        tensor = _r3_per_traj_item_to_tensor(item).to(device=target_device)
+        target_len = int(seqlens[i].item())


Calling seqlens[i].item() inside a loop over the batch size triggers a synchronous device-to-host copy for every single iteration when seqlens is on GPU. This introduces a severe performance bottleneck. Convert seqlens to a CPU list once before the loop using seqlens.tolist() to avoid per-iteration GPU synchronization.

def _align_r3_per_traj_tensors(per_traj_list, seqlens, target_device): aligned = [] seqlens_cpu = seqlens.tolist() for i, item in enumerate(per_traj_list): tensor = _r3_per_traj_item_to_tensor(item).to(device=target_device) target_len = int(seqlens_cpu[i])

gemini-code-assist · 2026-06-26T09:29:32Z

+    if r3_speedup_enabled() and "routed_experts_per_traj" in data.keys():
+        per_traj_stack = data.pop("routed_experts_per_traj")
+        per_traj_list = normalize_r3_per_traj_list(per_traj_stack)
+        tensors = _align_r3_per_traj_tensors(per_traj_list, attention_mask.sum(dim=-1), input_ids_rmpad.device)
+        flat = narrow_routed_experts(torch.cat(tensors, dim=0))


Since per_traj_list is already normalized and narrowed on CPU via normalize_r3_per_traj_list, the concatenated tensor flat is already narrowed. Calling narrow_routed_experts on the GPU tensor flat is redundant and triggers a GPU synchronization. You can safely concatenate the tensors directly.

Suggested change

if r3_speedup_enabled() and "routed_experts_per_traj" in data.keys():

per_traj_stack = data.pop("routed_experts_per_traj")

per_traj_list = normalize_r3_per_traj_list(per_traj_stack)

tensors = _align_r3_per_traj_tensors(per_traj_list, attention_mask.sum(dim=-1), input_ids_rmpad.device)

flat = narrow_routed_experts(torch.cat(tensors, dim=0))

if r3_speedup_enabled() and "routed_experts_per_traj" in data.keys():

per_traj_stack = data.pop("routed_experts_per_traj")

per_traj_list = normalize_r3_per_traj_list(per_traj_stack)

tensors = _align_r3_per_traj_tensors(per_traj_list, attention_mask.sum(dim=-1), input_ids_rmpad.device)

flat = torch.cat(tensors, dim=0)

tntnnlrw · 2026-06-26T09:49:58Z

Pushed follow-up commit 9363794e to address the GPU sync concerns from the automated review:

skip min/max narrowing when routed_experts is already on CUDA
read R3 per-trajectory sequence lengths via one tolist() call before the loop
avoid re-narrowing the concatenated per-trajectory tensor in the fast path

Verified locally with:

PYTHONPATH=. pytest tests/utils/test_padding_on_cpu.py -q
PYTHONPATH=. pytest tests/experimental/agent_loop/test_agent_loop_extra_fields_schema_on_cpu.py -q
python -m ruff check verl/experimental/agent_loop/agent_loop.py verl/workers/utils/padding.py tests/utils/test_padding_on_cpu.py tests/experimental/agent_loop/test_agent_loop_extra_fields_schema_on_cpu.py
python -m py_compile ... && git diff --check

tntnnlrw · 2026-06-26T09:53:03Z

Hi @wuxibin89, could you take a look at this PR when you have a chance? I kept the R3 per-trajectory trace path opt-in, added CPU-only tests, and pushed 9363794e to address the automated review feedback about avoiding unnecessary GPU syncs.

tntnnlrw · 2026-06-26T09:59:43Z

Hi @HollowMan6, sorry to bother. Since you recently reviewed the Megatron router replay fallback fix in #6653, could you take a look at this related opt-in R3 per-trajectory trace path when you have bandwidth? The default path is unchanged and the PR includes CPU-only coverage.

feat: add R3 per-trajectory routing trace path

ebc14e7

tntnnlrw requested review from ArronHZG and wuxibin89 as code owners June 26, 2026 09:27

gemini-code-assist Bot reviewed Jun 26, 2026

View reviewed changes

fix: avoid R3 per-traj padding GPU syncs

9363794

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[megatron] reduce R3 router replay memory with per-trajectory traces#6860

[megatron] reduce R3 router replay memory with per-trajectory traces#6860
tntnnlrw wants to merge 2 commits into
verl-project:mainfrom
tntnnlrw:codex/r3-router-replay-per-traj

tntnnlrw commented Jun 26, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Uh oh!

tntnnlrw commented Jun 26, 2026

Uh oh!

tntnnlrw commented Jun 26, 2026

Uh oh!

tntnnlrw commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

tntnnlrw commented Jun 26, 2026

Summary

Motivation

Tests

Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

tntnnlrw commented Jun 26, 2026

Uh oh!

tntnnlrw commented Jun 26, 2026

Uh oh!

tntnnlrw commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant