fix(sglang): keep multi-turn prompts prefix-stable via token-splicing #1787
Draft
Kh4L wants to merge 1 commit into
Signed-off-by: Serge Panev <spanev@nvidia.com>
Problem
With the SGLang generation backend, multi-turn SWE-bench (agentic) rollouts hit a fatal prefix-stability assert in `nemo_gym.py` on nearly every tool-using turn (48/48 turns failed in our runs). The proxy must guarantee that each turn's freshly built prompt has the prior accumulated tokens as an exact prefix (`seen == prompt[:len(seen)]`). Two root causes broke this: `</think>`.

A second issue: on the pinned SGLang v0.5.10, `/v1/chat/completions` does not expose the exact sampled integer token ids (its `logprobs.token` is a decoded string; there is no `return_tokens_as_token_ids`), and `/tokenize` only accepts a raw prompt string, so the proxy cannot get the token ids that token-level RL needs from the chat endpoint.

Fix
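As a point of reference, the prefix-stability invariant `seen == prompt[:len(seen)]` can be sketched in a few lines. This is an illustrative check only, not the assert in `nemo_gym.py`; the function names and the example token ids (other than 151668 for `</think>`, taken from this description) are hypothetical.

```python
from typing import Optional, Sequence


def is_prefix_stable(seen: Sequence[int], prompt: Sequence[int]) -> bool:
    """Return True when `seen` is an exact token-level prefix of `prompt`."""
    return list(prompt[: len(seen)]) == list(seen)


def first_divergence(seen: Sequence[int], prompt: Sequence[int]) -> Optional[int]:
    """Return the index of the first mismatching token, or None if prefix-stable."""
    for i, tok in enumerate(seen):
        if i >= len(prompt) or prompt[i] != tok:
            return i
    return None


# Hypothetical ids: the accumulated sequence contains </think> (151668),
# but a re-tokenized prompt has lost it, so the sequences diverge there.
seen = [1, 2, 3, 151668, 4]
retokenized_prompt = [1, 2, 3, 4, 9]
spliced_prompt = [1, 2, 3, 151668, 4, 9]
```

With these inputs the re-tokenized prompt fails the check at the position of the missing `</think>` token, while the spliced prompt passes.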
Adds an opt-in SGLang engine path to the vLLM responses-API proxy (`VLLMModelConfig.engine = "sglang"`, default `"vllm"`; the vLLM path is unchanged):

- Token splicing. Each turn's prompt is built as `prompt_{K-1} + gen_{K-1}` (verbatim) `+ delta_K`, splicing the prior assistant turn's exact sampled `generation_token_ids` instead of re-tokenizing them (`_build_sglang_prompt_ids`, `_update_sglang_session_seq`, `_sglang_followup_fragment_ids`, keyed by session id). Prefix-stable by construction; a cache miss falls back to a full chat-template tokenize.
- `/generate`. Generate through SGLang's native `/generate` (`return_logprob=True`) and read ids and logprobs from the same `meta_info.output_token_logprobs` list (1:1 aligned). New `NeMoGymAsyncOpenAI.create_generate`.
- `</think>` preservation. Decode the sampled ids with `skip_special_tokens=False` so `</think>` (id 151668) survives, then re-parse into reasoning + hermes `tool_calls` (`_parse_sglang_generation`) so the returned object is shaped exactly like the vLLM `/v1/chat/completions` response; every downstream Responses-API conversion is identical.

Local tokenization uses `transformers` (added to the proxy deps, pinned to the validated version).

Result
In our SWE-bench runs: multi-turn contiguity failures 48 → 0 (8/8 rollouts complete), throughput ≈ the vLLM path, and the engine emits training-grade per-token logprobs (validated to within the model's own bf16/MoE numerical noise vs vLLM).
Status: draft, untested against current `main`