[sglang] fix seed collisions in deterministic GRPO rollouts #6857
tntnnlrw wants to merge 1 commit
Code Review
This pull request introduces stable, deterministic sampling seed generation for SGLang rollouts within the agent loop. It adds helper functions to compute stable seeds based on step, sample index, rollout number, and base seed, and ensures these seeds are correctly distributed across parallel worker chunks to avoid duplicates. Additionally, it normalizes SGLang engine configuration arguments and enforces the PyTorch sampling backend when deterministic sampling is enabled. Unit tests are added to verify these behaviors. I have no further feedback to provide as there are no review comments.
Hi @wuxibin89, could you take a look at this PR when you have a chance? This fixes seed collisions for SGLang deterministic GRPO rollouts, so reruns stay reproducible while different responses in the same rollout group keep distinct deterministic seeds.
Hi @wucong25, sorry to bother. If this is within your review scope, could you take a look when you have bandwidth? This is a small SGLang deterministic sampling fix with CPU-only regression coverage, preventing deterministic GRPO/DAPO rollout groups from collapsing to identical responses due to seed collisions.
What this PR does
This PR fixes SGLang agent rollouts that need request-local deterministic sampling seeds.
- Computes a stable `sampling_seed` from `(global_step, sample_index, rollout_n, base_seed)` when SGLang deterministic inference is enabled
- Computes `trajectory_info` on the full batch before splitting across agent-loop workers, so `rollout_n` stays globally unique across chunks
- Normalizes `engine_kwargs` before mutating them for server launch
Why it is needed
For GRPO/DAPO-style training, deterministic rollout does not mean every response for the same prompt should be identical. It means rerunning the same training step should reproduce the same set of rollout samples, while each response in the rollout group still receives a different, stable sampling seed.
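A minimal sketch of what "a different, stable sampling seed" per response could look like. The helper name `stable_sampling_seed`, the SHA-256 mixing, and the 31-bit output range are illustrative assumptions and are not taken from this PR's implementation; only the four inputs come from the description above.

```python
import hashlib


def stable_sampling_seed(global_step: int, sample_index: int,
                         rollout_n: int, base_seed: int = 0) -> int:
    """Hypothetical helper: map a response's identity to a reproducible seed."""
    # Hash the full identity tuple. hashlib is used instead of the built-in
    # hash() so the result does not depend on per-process hash randomization.
    key = f"{base_seed}:{global_step}:{sample_index}:{rollout_n}".encode()
    digest = hashlib.sha256(key).digest()
    # Fold into a non-negative 31-bit range (an assumption about the sampler).
    return int.from_bytes(digest[:8], "big") % (2**31 - 1)


# One prompt (sample_index=7) with four responses at training step 3.
group = [stable_sampling_seed(3, 7, n, base_seed=42) for n in range(4)]
# Rerunning the same step reproduces exactly the same seeds.
rerun = [stable_sampling_seed(3, 7, n, base_seed=42) for n in range(4)]
# A later step gets a different set of seeds for the same prompt.
next_step = [stable_sampling_seed(4, 7, n, base_seed=42) for n in range(4)]
```

Under these assumptions, `group` and `rerun` are identical, while the four seeds inside `group` differ from each other because `rollout_n` differs.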
When `trajectory_info` is computed independently inside each worker chunk, repeated prompts can get duplicate `rollout_n` values across chunks. If SGLang uses request-local `sampling_seed`, those duplicate `(sample_index, rollout_n, step)` tuples can collide and produce duplicate samples within a rollout group. That breaks the intended deterministic SGLang training behavior: reproducible across reruns, but still diverse across rollout responses.
This PR makes SGLang deterministic inference usable for real GRPO/DAPO deterministic training by assigning stable, globally unique per-response seeds before chunking work across agent-loop workers.
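The collision can be reproduced with a toy example. The functions `assign_rollout_n` and `chunk` below are hypothetical stand-ins for the real `trajectory_info` computation and worker split, assuming `rollout_n` is a running count of how often each `sample_index` has been seen.

```python
from collections import defaultdict


def assign_rollout_n(sample_indices):
    """Hypothetical: pair each repeat of a sample_index with 0, 1, 2, ..."""
    counts = defaultdict(int)
    pairs = []
    for idx in sample_indices:
        pairs.append((idx, counts[idx]))
        counts[idx] += 1
    return pairs


def chunk(items, num_chunks):
    """Split a list into num_chunks contiguous pieces (last may be shorter)."""
    size = (len(items) + num_chunks - 1) // num_chunks
    return [items[i:i + size] for i in range(0, len(items), size)]


# Two prompts with four responses each, spread over four workers.
batch = [0, 0, 0, 0, 1, 1, 1, 1]

# Counting inside each chunk restarts rollout_n at 0 in every worker.
per_chunk = [p for c in chunk(batch, 4) for p in assign_rollout_n(c)]

# Counting on the full batch first keeps every (sample_index, rollout_n) unique.
global_first = [p for c in chunk(assign_rollout_n(batch), 4) for p in c]
```

Here `per_chunk` contains only four distinct `(sample_index, rollout_n)` pairs for eight responses, so any seed derived from those pairs would repeat within a rollout group; `global_first` contains eight distinct pairs.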
Tests
- `PYTHONPATH=. pytest tests/experimental/agent_loop/test_sglang_sampling_seed_on_cpu.py -q`
- `python -m ruff check verl/experimental/agent_loop/agent_loop.py verl/workers/rollout/sglang_rollout/async_sglang_server.py tests/experimental/agent_loop/test_sglang_sampling_seed_on_cpu.py`
- `python -m py_compile verl/experimental/agent_loop/agent_loop.py verl/workers/rollout/sglang_rollout/async_sglang_server.py tests/experimental/agent_loop/test_sglang_sampling_seed_on_cpu.py`
- `git diff --check`