fix(sglang): keep multi-turn prompts prefix-stable via token-splicing #1787

Draft
Kh4L wants to merge 1 commit into NVIDIA-NeMo:main from Kh4L:sglang-splice-fix

@Kh4L Kh4L commented Jun 26, 2026

Contributor

Problem

With the SGLang generation backend, multi-turn SWE-bench (agentic) rollouts hit a fatal prefix-stability assertion in nemo_gym.py on nearly every tool-using turn (48/48 turns failed in our runs). The proxy must guarantee that each turn's freshly built prompt has the previously accumulated tokens as an exact prefix (seen == prompt[:len(seen)]). Two root causes broke this:

  1. Proxy parse drift — re-rendering a prior assistant turn from parsed text dropped multi-line tool-call JSON and mangled </think>.
  2. Retokenization — byte-identical text can re-tokenize to a different BPE split than the one the model sampled, so the prior tokens were no longer a prefix.
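
The retokenization failure can be reproduced without a real model. The sketch below is illustrative only: the four-entry vocabulary and the greedy longest-match `toy_encode` are stand-ins for a real BPE tokenizer, and none of these names come from the PR. It shows sampled ids that decode to the same text as the retokenized prompt while still failing the `seen == prompt[:len(seen)]` check.

```python
from typing import List

# Toy vocabulary, assumed purely for illustration.
VOCAB = {"a": 0, "b": 1, "ab": 2, "c": 3}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def toy_encode(text: str) -> List[int]:
    """Greedy longest-match tokenizer, a simplified stand-in for BPE."""
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        for length in range(len(text) - pos, 0, -1):
            piece = text[pos:pos + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += length
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


def toy_decode(ids: List[int]) -> str:
    return "".join(ID_TO_TEXT[i] for i in ids)


def is_prefix_stable(seen: List[int], prompt: List[int]) -> bool:
    """The invariant from the description: seen == prompt[:len(seen)]."""
    return seen == prompt[:len(seen)]


prompt_ids = toy_encode("c")       # turn 1 prompt
sampled_ids = [0, 1]               # the model happened to sample "a" then "b"
seen = prompt_ids + sampled_ids    # tokens accumulated so far

# Re-rendering the history as text and tokenizing it merges "a" + "b" into "ab".
retokenized = toy_encode(toy_decode(seen) + "c")
# Splicing keeps the sampled ids verbatim and tokenizes only the new delta.
spliced = seen + toy_encode("c")

print(toy_decode(retokenized) == toy_decode(spliced))  # same text either way
print(is_prefix_stable(seen, retokenized))             # retokenized prompt
print(is_prefix_stable(seen, spliced))                 # spliced prompt
```

Sampling is free to emit any token sequence, while the tokenizer always picks its canonical split for a given string, which is why decoding and re-encoding is not guaranteed to round-trip at the id level.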

A separate issue: on the pinned SGLang v0.5.10, /v1/chat/completions does not expose the exact sampled integer token ids (its logprobs.token is a decoded string; there is no return_tokens_as_token_ids), and /tokenize only accepts a raw prompt string, so the proxy cannot obtain the token ids that token-level RL needs from the chat endpoint.

Fix

Adds an opt-in SGLang engine path to the vLLM responses-API proxy (VLLMModelConfig.engine = "sglang", default "vllm" — the vLLM path is unchanged):

  • Token-splice contiguity fix. Build each turn's prompt as prompt_{K-1} + gen_{K-1}(verbatim) + delta_K, splicing the prior assistant turn's exact sampled generation_token_ids instead of re-tokenizing them (_build_sglang_prompt_ids, _update_sglang_session_seq, _sglang_followup_fragment_ids, keyed by session id). Prefix-stable by construction; a cache miss falls back to a full chat-template tokenize.
  • Exact token ids via /generate. Generate through SGLang's native /generate (return_logprob=True) and read ids+logprobs from the same meta_info.output_token_logprobs list (1:1 aligned). New NeMoGymAsyncOpenAI.create_generate.
  • </think> preservation. Decode the sampled ids with skip_special_tokens=False so </think> (id 151668) survives, then re-parse into reasoning + hermes tool_calls (_parse_sglang_generation) so the returned object is shaped exactly like the vLLM /v1/chat/completions response — every downstream Responses-API conversion is identical.
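
A minimal sketch of the splice described in the first two bullets, under stated assumptions: the function names, the module-level `_session_seq` cache, and the `full_tokenize` callback are hypothetical and are not the PR's helpers. It also assumes each entry of `meta_info["output_token_logprobs"]` is a `(logprob, token_id, token_text_or_None)` tuple, which should be checked against the pinned SGLang version.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical cache: session id -> prompt_{K-1} + gen_{K-1} as token ids.
_session_seq: Dict[str, List[int]] = {}

# Assumed entry shape: (logprob, token_id, token_text_or_None).
LogprobEntry = Tuple[float, int, Optional[str]]


def split_ids_and_logprobs(
    entries: Sequence[LogprobEntry],
) -> Tuple[List[int], List[float]]:
    """Read ids and logprobs from the same list so they stay 1:1 aligned."""
    ids = [int(entry[1]) for entry in entries]
    logprobs = [float(entry[0]) for entry in entries]
    return ids, logprobs


def build_prompt_ids(
    session_id: str,
    delta_ids: Sequence[int],
    full_tokenize: Callable[[], List[int]],
) -> List[int]:
    """Return prompt_K = prompt_{K-1} + gen_{K-1} + delta_K.

    On a cache miss (first turn or unknown session), fall back to a full
    chat-template tokenization supplied by the caller.
    """
    prior = _session_seq.get(session_id)
    if prior is None:
        return full_tokenize()
    return prior + list(delta_ids)


def update_session_seq(
    session_id: str, prompt_ids: Sequence[int], generation_ids: Sequence[int]
) -> None:
    """Remember the exact tokens the engine saw and sampled this turn."""
    _session_seq[session_id] = list(prompt_ids) + list(generation_ids)


# Turn 1: cache miss, so the full-tokenize fallback is used.
turn1_prompt = build_prompt_ids("s1", [], lambda: [1, 2, 3])
gen_ids, gen_logprobs = split_ids_and_logprobs([(-0.1, 7, None), (-0.2, 8, None)])
update_session_seq("s1", turn1_prompt, gen_ids)

# Turn 2: only the new fragment (for example a tool result) is tokenized.
turn2_prompt = build_prompt_ids("s1", [4, 5], lambda: [99])
print(turn2_prompt)
```

Because the cached sequence is reused verbatim, the turn 2 prompt starts with exactly the tokens from turn 1, regardless of how the decoded text would re-tokenize.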

Local tokenization uses transformers (added to the proxy deps, pinned to the validated version).
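
Once the sampled ids have been decoded locally with special tokens kept, the re-parse into reasoning and tool calls could look roughly like the sketch below. It is not the PR's `_parse_sglang_generation`: the function name, the return shape, and the assumption that tool calls arrive as Hermes-style `<tool_call>{...}</tool_call>` blocks holding a JSON object are illustrative.

```python
import json
import re
from typing import Any, Dict, List, Tuple

THINK_END = "</think>"
# DOTALL lets the JSON payload span multiple lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_generation(decoded: str) -> Tuple[str, str, List[Dict[str, Any]]]:
    """Split decoded text into (reasoning, content, tool_calls).

    Assumes `decoded` was produced with special tokens preserved, so the
    closing think tag is still present when the model emitted it.
    """
    reasoning, sep, rest = decoded.partition(THINK_END)
    if not sep:
        # No closing think tag: treat the whole text as the visible answer.
        reasoning, rest = "", decoded
    # The opening tag may live in the prompt instead; drop it if present.
    reasoning = reasoning.replace("<think>", "", 1).strip()

    # json.loads raises on malformed payloads; a real proxy would likely
    # handle that case explicitly.
    tool_calls = [json.loads(raw) for raw in TOOL_CALL_RE.findall(rest)]
    content = TOOL_CALL_RE.sub("", rest).strip()
    return reasoning, content, tool_calls


decoded_text = (
    "<think>\nNeed to list files.\n</think>\n\n"
    "<tool_call>\n"
    "{\n"
    '  "name": "bash",\n'
    '  "arguments": {"cmd": "ls"}\n'
    "}\n"
    "</tool_call>"
)
reasoning, content, tool_calls = parse_generation(decoded_text)
print(reasoning)
print(tool_calls)
```

The multi-line JSON payload in the example is the case the description says was dropped when a prior assistant turn was re-rendered from parsed text.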

Result

In our SWE-bench runs: multi-turn contiguity failures 48 → 0 (8/8 rollouts complete), throughput ≈ the vLLM path, and the engine emits training-grade per-token logprobs (validated to within the model's own bf16/MoE numerical noise vs vLLM).

Status — draft, untested against current main

⚠️ This is a port of a fix developed against an older Gym revision onto the refactored main proxy. The logic was validated end-to-end on the older base, but this rebased version has not been run on a GPU. Opening as a draft for review; needs a functional run before merge. The companion NeMo-RL recipe is at NVIDIA-NeMo/RL#2961 (and depends on the enhanced SGLang backend in NVIDIA-NeMo/RL#2447). Research-only logprob-parity instrumentation from the source branch is intentionally excluded.

Signed-off-by: Serge Panev <spanev@nvidia.com>

copy-pr-bot Bot commented Jun 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cwing-nvidia

Contributor

@Kh4L can you also take a look at #1557
