[sglang] fix seed collisions in deterministic GRPO rollouts#6857

Open
tntnnlrw wants to merge 1 commit into
verl-project:mainfrom
tntnnlrw:codex/sglang-sampling-seed

@tntnnlrw tntnnlrw commented Jun 26, 2026

What this PR does

This PR fixes seed collisions in SGLang agent rollouts that rely on request-local deterministic sampling seeds.

  • derives a stable sampling_seed from (global_step, sample_index, rollout_n, base_seed) when SGLang deterministic inference is enabled
  • keeps different responses in the same GRPO/DAPO rollout group on distinct deterministic seeds, so deterministic mode remains reproducible without collapsing rollout diversity
  • computes trajectory_info on the full batch before splitting across agent-loop workers, so rollout_n stays globally unique across chunks
  • copies and normalizes SGLang engine_kwargs before mutating them for server launch
  • adds a CPU-only regression test for seed stability, chunk splitting, and SGLang kwargs normalization
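The seed derivation in the first two bullets can be sketched as follows. This is a minimal illustration and not the code in this PR: the helper name `stable_sampling_seed`, the SHA-256 hashing scheme, and the 31-bit seed range are all assumptions made for the example.

```python
import hashlib


def stable_sampling_seed(global_step: int, sample_index: int,
                         rollout_n: int, base_seed: int) -> int:
    """Hypothetical helper: map the identifying tuple to a stable seed.

    The same inputs give the same seed in every process and on every
    rerun, because the result depends only on the hashed bytes.
    """
    key = f"{base_seed}:{global_step}:{sample_index}:{rollout_n}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Assumption: keep the seed in a non-negative 31-bit range.
    return int.from_bytes(digest[:8], "big") % (2 ** 31)


# One rollout group: same prompt (sample_index=3) at step 10, 8 responses.
group_seeds = [stable_sampling_seed(10, 3, n, base_seed=42) for n in range(8)]
rerun_seeds = [stable_sampling_seed(10, 3, n, base_seed=42) for n in range(8)]
```

A cryptographic hash is used here instead of Python's built-in `hash()` because `hash()` on strings is salted per process by default, which would make seeds differ between workers and between reruns.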

Why it is needed

For GRPO/DAPO-style training, deterministic rollout does not mean every response for the same prompt should be identical. It means rerunning the same training step should reproduce the same set of rollout samples, while each response in the rollout group still receives a different, stable sampling seed.

When trajectory_info is computed independently inside each worker chunk, repeated prompts can get duplicate rollout_n values across chunks. If SGLang uses a request-local sampling_seed, those duplicate (global_step, sample_index, rollout_n) tuples collide and produce duplicate samples within a rollout group. That breaks the intended deterministic SGLang training behavior: reproducible across reruns, but still diverse across rollout responses.
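The collision described above can be reproduced with a small sketch. The helpers `assign_rollout_n` and `chunk` are hypothetical stand-ins and do not reflect the actual layout of trajectory_info in verl; they only show why numbering responses per chunk produces duplicate (sample_index, rollout_n) pairs while numbering them on the full batch does not.

```python
from collections import defaultdict


def assign_rollout_n(sample_indices):
    """Hypothetical: number each occurrence of a prompt 0, 1, 2, ... in order."""
    counts = defaultdict(int)
    rollout_ns = []
    for idx in sample_indices:
        rollout_ns.append(counts[idx])
        counts[idx] += 1
    return rollout_ns


def chunk(items, num_chunks):
    """Hypothetical: split a list into contiguous, roughly equal chunks."""
    size = (len(items) + num_chunks - 1) // num_chunks
    return [items[i:i + size] for i in range(0, len(items), size)]


# Two prompts, four responses each, split across four workers.
sample_indices = [0, 0, 0, 0, 1, 1, 1, 1]

# Numbering on the full batch before splitting: every pair is unique.
global_pairs = list(zip(sample_indices, assign_rollout_n(sample_indices)))

# Numbering inside each chunk: counters restart, so pairs repeat.
per_chunk_pairs = []
for part in chunk(sample_indices, 4):
    per_chunk_pairs.extend(zip(part, assign_rollout_n(part)))
```

Here `global_pairs` contains eight distinct pairs, while `per_chunk_pairs` contains only four distinct pairs, so any seed derived from those pairs would be shared by two responses in the same group.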

This PR makes SGLang deterministic inference usable for real GRPO/DAPO deterministic training by assigning stable, globally unique per-response seeds before chunking work across agent-loop workers.

Tests

  • PYTHONPATH=. pytest tests/experimental/agent_loop/test_sglang_sampling_seed_on_cpu.py -q
  • python -m ruff check verl/experimental/agent_loop/agent_loop.py verl/workers/rollout/sglang_rollout/async_sglang_server.py tests/experimental/agent_loop/test_sglang_sampling_seed_on_cpu.py
  • python -m py_compile verl/experimental/agent_loop/agent_loop.py verl/workers/rollout/sglang_rollout/async_sglang_server.py tests/experimental/agent_loop/test_sglang_sampling_seed_on_cpu.py
  • git diff --check

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces stable, deterministic sampling seed generation for SGLang rollouts within the agent loop. It adds helper functions to compute stable seeds based on step, sample index, rollout number, and base seed, and ensures these seeds are correctly distributed across parallel worker chunks to avoid duplicates. Additionally, it normalizes SGLang engine configuration arguments and enforces the PyTorch sampling backend when deterministic sampling is enabled. Unit tests are added to verify these behaviors. I have no further feedback to provide as there are no review comments.

@tntnnlrw tntnnlrw changed the title [sglang] fix stable sampling seeds for agent rollouts [sglang] fix seed collisions in deterministic GRPO rollouts Jun 26, 2026
@tntnnlrw

Author

Hi @wuxibin89, could you take a look at this PR when you have a chance? This fixes seed collisions for SGLang deterministic GRPO rollouts, so reruns stay reproducible while different responses in the same rollout group keep distinct deterministic seeds.

@tntnnlrw

Author

Hi @wucong25, sorry to bother. If this is within your review scope, could you take a look when you have bandwidth? This is a small SGLang deterministic sampling fix with CPU-only regression coverage, preventing deterministic GRPO/DAPO rollout groups from collapsing to identical responses due to seed collisions.
