fix: handle vllm context length errors in nano v3 recipe #1752

Open
snowmanwwg wants to merge 1 commit into main from fix-vllm-context-length-errors
@snowmanwwg (Collaborator)

Summary

  • Detect vLLM context-length failures from chat completions and tokenization responses.
  • Return token/context length metadata in the error body when vLLM reports the limit.
  • Add focused tests for chat completions and tokenization context-length handling.

Testing

  • uv run --extra dev python -m pytest responses_api_models/vllm_model/tests/test_app.py (72 passed, 1 warning)

Cherry-picked from 7ad66da onto current main.

Signed-off-by: Wenwen Gao <wenweng@cw-dfw-cs-001-vscode-01.cm.cluster>
@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@snowmanwwg snowmanwwg requested a review from yfw June 26, 2026 05:53
@snowmanwwg snowmanwwg changed the title fix: handle vllm context length errors fix: handle vllm context length errors in nano v3 recipe Jun 26, 2026
"max_model_len",
"max model len",
"max_tokens",
"maximum context length",

NOTE: "max_tokens" is the broadest substring here. A vLLM 400 whose body mentions max_tokens for a non-context-length reason (e.g., invalid value, type mismatch) would be silently swallowed and surfaced as finish_reason="length" instead of raising. This was already true in the old code, so it's not a regression, but it is worth keeping in mind: if mysterious "length" finishes appear in logs, this matcher could be the culprit.

Not blocking; the substring approach is the pragmatic choice given vLLM's unstructured error surface.
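
To make the false-positive risk concrete, here is a minimal sketch. The tuple holds only the entries visible in the hunk above (the real CONTEXT_LENGTH_ERROR_SUBSTRINGS may contain more), and both error messages are invented for illustration rather than captured from vLLM.

```python
# Only the entries visible in the diff hunk; the real tuple may be longer.
VISIBLE_SUBSTRINGS = (
    "max_model_len",
    "max model len",
    "max_tokens",
    "maximum context length",
)


def matches_context_length(error_text: str) -> bool:
    """Same style of test as the PR: lowercase, then look for any marker."""
    lowered = error_text.lower()
    return any(substring in lowered for substring in VISIBLE_SUBSTRINGS)


# Hypothetical messages written for this example only.
genuine = "This model's maximum context length is 4096 tokens."
unrelated = "Invalid value for max_tokens: expected an integer."

genuine_hit = matches_context_length(genuine)
# The unrelated validation error also matches, because it mentions max_tokens.
unrelated_hit = matches_context_length(unrelated)
```

Both calls match, so under these assumptions the second request would be reported as a "length" finish instead of raising.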

Comment on lines +450 to +459
response_content = getattr(error, "response_content", b"")
if isinstance(response_content, bytes):
    response_content_text = response_content.decode(errors="replace")
elif response_content is None:
    response_content_text = ""
else:
    response_content_text = str(response_content)

error_text = f"{error.message} {response_content_text}".lower()
return any(substring in error_text for substring in CONTEXT_LENGTH_ERROR_SUBSTRINGS)


Good defensive coding. The old code did e.response_content.decode() (no errors="replace", no guard for missing attribute). This handles bytes/None/str and uses errors="replace" per CLAUDE.md conventions. Nice fix.
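
The cases this comment lists can be exercised in isolation. The sketch below pulls the normalization step into a standalone helper; the helper name and the two stand-in error classes are hypothetical and exist only so the example runs without aiohttp or the real model class.

```python
from typing import Any


def response_content_as_text(error: Any) -> str:
    """Normalize an error's response_content to text (mirrors the hunk above)."""
    response_content = getattr(error, "response_content", b"")
    if isinstance(response_content, bytes):
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
        return response_content.decode(errors="replace")
    if response_content is None:
        return ""
    return str(response_content)


class FakeError:
    """Hypothetical stand-in for an error that carries a response body."""

    def __init__(self, response_content: Any) -> None:
        self.response_content = response_content


class BareError:
    """Hypothetical stand-in with no response_content attribute at all."""


from_bytes = response_content_as_text(FakeError(b"\xffmax_model_len"))
from_none = response_content_as_text(FakeError(None))
from_str = response_content_as_text(FakeError("already text"))
from_missing = response_content_as_text(BareError())
```

A plain `.decode()` on the first input would raise UnicodeDecodeError, and attribute access on BareError would raise AttributeError; the guarded version returns a string in all four cases.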

Comment on lines +553 to +560
try:
    tokenize_response = await client.create_tokenize(**tokenize_body_dict)
except ClientResponseError as e:
    if self._is_context_length_error(e):
        res = self._create_empty_chat_completion()
        res.choices[0].finish_reason = "length"
        return res
    raise


NOTE — When the tokenize call fails here, a successful chat completion (line 477) is silently discarded. This is the correct behavior: without prompt_token_ids the training pipeline can't build a loss mask, so "length" correctly tells the caller to skip this example. But it's worth a comment since a future reader will wonder why a successful generation is thrown away.

For pure evaluation (return_token_id_information=False), this path is never reached, so no eval score impact.
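
A rough sketch of the control flow described above, including the kind of inline comment being suggested. Every class and function name here is a hypothetical stand-in (the real code uses ClientResponseError, client.create_tokenize and self._create_empty_chat_completion), so treat it as an illustration of the behavior, not the actual implementation.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Optional


class ContextLengthError(Exception):
    """Hypothetical stand-in for an error the context-length matcher accepts."""


@dataclass
class Choice:
    finish_reason: Optional[str] = None
    text: str = ""


@dataclass
class Completion:
    choices: List[Choice] = field(default_factory=lambda: [Choice()])


async def failing_tokenize(**_body):
    raise ContextLengthError("prompt exceeds max_model_len")


async def working_tokenize(**_body):
    return {"tokens": [1, 2, 3]}


async def complete_with_token_ids(tokenize) -> Completion:
    # Pretend the chat completion call already succeeded.
    completion = Completion(choices=[Choice(finish_reason="stop", text="hello")])
    try:
        await tokenize(prompt="...")
    except ContextLengthError:
        # Deliberately discard the successful generation: without prompt token
        # IDs the caller cannot build a loss mask, so report "length" to make
        # it skip this example.
        empty = Completion()
        empty.choices[0].finish_reason = "length"
        return empty
    return completion


discarded = asyncio.run(complete_with_token_ids(failing_tokenize))
kept = asyncio.run(complete_with_token_ids(working_tokenize))
```

With the failing tokenizer the returned completion is empty and marked "length"; with the working one the original generation is returned unchanged.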
