fix(sglang): keep multi-turn prompts prefix-stable via token-splicing by Kh4L · Pull Request #1787 · NVIDIA-NeMo/Gym

Kh4L · 2026-06-26T21:36:42Z

Problem

With the SGLang generation backend, multi-turn SWE-bench (agentic) rollouts hit a fatal prefix-stability assert in nemo_gym.py on essentially every tool-using turn (48/48 turns failed in our runs). The proxy must guarantee that each turn's freshly built prompt starts with the previously accumulated tokens as an exact prefix (seen == prompt[:len(seen)]). Two root causes broke this:

Proxy parse drift: re-rendering a prior assistant turn from its parsed text dropped multi-line tool-call JSON and mangled </think>.
Retokenization: even byte-identical text can re-tokenize to a different BPE split than the one the model sampled, so the prior tokens were no longer a prefix.
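
To illustrate the retokenization failure mode, here is a minimal sketch using a toy greedy longest-match tokenizer as a stand-in for BPE. The vocabulary and the `encode`, `decode`, and `is_prefix_stable` helpers are hypothetical and are not part of the proxy; only the invariant `seen == prompt[:len(seen)]` comes from the description above.

```python
from typing import List, Sequence

# Toy vocabulary (hypothetical). Real BPE vocabularies show the same effect around merges.
VOCAB = {"ab": 2, "a": 0, "b": 1, "c": 3}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def decode(ids: Sequence[int]) -> str:
    """Concatenate the text of each token id."""
    return "".join(ID_TO_TEXT[i] for i in ids)


def encode(text: str) -> List[int]:
    """Greedy longest-match tokenizer, standing in for BPE merges."""
    pieces = sorted(VOCAB, key=len, reverse=True)
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                ids.append(VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize at position {pos}")
    return ids


def is_prefix_stable(seen: Sequence[int], prompt: Sequence[int]) -> bool:
    """The invariant from the description: seen == prompt[:len(seen)]."""
    return list(prompt[: len(seen)]) == list(seen)


# The model sampled "a" then "b" as two separate tokens.
sampled = [VOCAB["a"], VOCAB["b"]]
# Decoding and re-encoding yields the merged token "ab" instead.
retokenized = encode(decode(sampled))
next_turn_from_text = retokenized + [VOCAB["c"]]
next_turn_from_ids = sampled + [VOCAB["c"]]
```

The text is byte-identical in both cases, but only the prompt built from the sampled ids keeps the prior tokens as a prefix.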

A separate issue: on the pinned SGLang v0.5.10, /v1/chat/completions does not expose the exact sampled integer token ids (its logprobs.token is a decoded string, and there is no return_tokens_as_token_ids), and /tokenize only accepts a raw prompt string. The proxy therefore cannot get the token ids that token-level RL needs from the chat endpoint.
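
For contrast, token-level RL needs one (token id, logprob) pair per sampled token. The sketch below pulls that pair of lists out of a response shaped like SGLang's native /generate output. It assumes each entry of `meta_info.output_token_logprobs` is a `(logprob, token_id, text_or_None)` triple; that layout should be checked against the pinned SGLang version. The helper name and the sample payload are hypothetical and hand-written, not captured from a real run.

```python
from typing import Any, Dict, List, Tuple


def split_output_token_logprobs(response: Dict[str, Any]) -> Tuple[List[int], List[float]]:
    """Return (token_ids, logprobs), aligned 1:1, from a /generate-style response.

    Assumption: each entry is (logprob, token_id, text_or_None).
    """
    entries = response["meta_info"]["output_token_logprobs"]
    token_ids = [int(entry[1]) for entry in entries]
    logprobs = [float(entry[0]) for entry in entries]
    return token_ids, logprobs


# Hand-written illustrative payload (not real engine output).
example_response = {
    "text": "ok",
    "meta_info": {
        "output_token_logprobs": [
            [-0.25, 11, None],
            [-1.5, 22, None],
        ]
    },
}

ids, lps = split_output_token_logprobs(example_response)
```

Reading ids and logprobs from the same list means they cannot drift out of alignment, which is the property the fix relies on.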

Fix

Adds an opt-in SGLang engine path to the vLLM responses-API proxy (VLLMModelConfig.engine = "sglang", default "vllm" — the vLLM path is unchanged):

Token-splice contiguity fix. Build each turn's prompt as prompt_{K-1} + gen_{K-1}(verbatim) + delta_K, splicing in the prior assistant turn's exact sampled generation_token_ids instead of re-tokenizing them (_build_sglang_prompt_ids, _update_sglang_session_seq, _sglang_followup_fragment_ids; state is keyed by session id). Prefix-stable by construction; a cache miss falls back to a full chat-template tokenize.
Exact token ids via /generate. Generate through SGLang's native /generate (return_logprob=True) and read ids+logprobs from the same meta_info.output_token_logprobs list (1:1 aligned). New NeMoGymAsyncOpenAI.create_generate.
</think> preservation. Decode the sampled ids with skip_special_tokens=False so </think> (id 151668) survives, then re-parse into reasoning + hermes tool_calls (_parse_sglang_generation) so the returned object is shaped exactly like the vLLM /v1/chat/completions response — every downstream Responses-API conversion is identical.
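
The token-splice step can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's code: `build_prompt_ids`, `record_turn`, and the module-level `_session_seq` cache are hypothetical names, and `full_tokenize` stands in for the full chat-template tokenize used on a cache miss.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical per-session cache: session id -> prompt_{K-1} + gen_{K-1}.
_session_seq: Dict[str, List[int]] = {}


def build_prompt_ids(
    session_id: str,
    delta_ids: Sequence[int],
    full_tokenize: Callable[[], List[int]],
) -> List[int]:
    """Build prompt_K as prompt_{K-1} + gen_{K-1} (verbatim) + delta_K.

    On a cache miss, fall back to tokenizing the whole conversation.
    """
    prior = _session_seq.get(session_id)
    if prior is None:
        return list(full_tokenize())
    return prior + list(delta_ids)


def record_turn(session_id: str, prompt_ids: Sequence[int], generation_ids: Sequence[int]) -> None:
    """Store the exact prompt ids plus the exact sampled ids for the next turn."""
    _session_seq[session_id] = list(prompt_ids) + list(generation_ids)


# Turn 1: nothing cached yet, so the full tokenize is used.
turn1 = build_prompt_ids("s1", [], lambda: [1, 2, 3])
record_turn("s1", turn1, [4, 5])  # the model sampled ids 4 and 5

# Turn 2: only the new fragment (for example, a tool result) is appended.
turn2 = build_prompt_ids("s1", [6, 7], lambda: [9, 9, 9])
```

Because the prior ids are copied rather than re-derived from text, the second prompt starts with the first prompt plus its generation regardless of how the tokenizer would split the decoded text.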

Local tokenization uses transformers (added to the proxy deps, pinned to the validated version).
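
Once the sampled ids are decoded locally with special tokens preserved, the text still has to be split into reasoning and tool calls. The sketch below shows one way to do that for hermes-style `<tool_call>...</tool_call>` blocks, including multi-line JSON. It is an illustration only: `parse_generation` is a hypothetical helper, not the PR's `_parse_sglang_generation`, and the sample text is hand-written.

```python
import json
import re
from typing import Any, Dict

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"
# DOTALL lets the JSON body span multiple lines.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_generation(text: str) -> Dict[str, Any]:
    """Split decoded text into reasoning, visible content, and tool calls."""
    reasoning, sep, rest = text.partition(THINK_CLOSE)
    if not sep:
        # No closing think tag: treat everything as regular output.
        reasoning, rest = "", text
    reasoning = reasoning.strip()
    if reasoning.startswith(THINK_OPEN):
        reasoning = reasoning[len(THINK_OPEN):].strip()
    tool_calls = [json.loads(body.strip()) for body in TOOL_CALL_RE.findall(rest)]
    content = TOOL_CALL_RE.sub("", rest).strip()
    return {"reasoning": reasoning, "content": content, "tool_calls": tool_calls}


decoded = (
    "Need to list files.\n</think>\n\n"
    "<tool_call>\n"
    '{"name": "bash",\n "arguments": {"cmd": "ls"}}\n'
    "</tool_call>"
)
parsed = parse_generation(decoded)
```

Parsing only shapes the returned object; the ids spliced into the next prompt stay the sampled ones, so parse quirks cannot break prefix stability.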

Result

In our SWE-bench runs: multi-turn contiguity failures 48 → 0 (8/8 rollouts complete), throughput ≈ the vLLM path, and the engine emits training-grade per-token logprobs (validated to within the model's own bf16/MoE numerical noise vs vLLM).

Status — draft, untested against current `main`

⚠️ This is a port of a fix developed against an older Gym revision onto the refactored main proxy. The logic was validated end-to-end on the older base, but this rebased version has not been run on a GPU. Opening as a draft for review; needs a functional run before merge. The companion NeMo-RL recipe is at NVIDIA-NeMo/RL#2961 (and depends on the enhanced SGLang backend in NVIDIA-NeMo/RL#2447). Research-only logprob-parity instrumentation from the source branch is intentionally excluded.

Signed-off-by: Serge Panev <spanev@nvidia.com>

copy-pr-bot · 2026-06-26T21:36:46Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cwing-nvidia · 2026-06-26T22:46:00Z

@Kh4L can you also take a look at #1557

fix(sglang): keep multi-turn prompts prefix-stable via token-splicing

1f913b7

Signed-off-by: Serge Panev <spanev@nvidia.com>

Kh4L mentioned this pull request Jun 26, 2026

feat(swe): add Qwen3-30B SWE-bench async-GRPO recipe (vLLM + SGLang) NVIDIA-NeMo/RL#2961

Closed

Kh4L wants to merge 1 commit into NVIDIA-NeMo:main from Kh4L:sglang-splice-fix
