fix: handle vllm context length errors in nano v3 recipe #1752
snowmanwwg wants to merge 1 commit into
Signed-off-by: Wenwen Gao <wenweng@cw-dfw-cs-001-vscode-01.cm.cluster>
    "max_model_len",
    "max model len",
    "max_tokens",
    "maximum context length",
NOTE — "max_tokens" is the broadest substring here. A vLLM 400 whose body mentions max_tokens for a non-context-length reason (e.g., invalid value, type mismatch) would be silently swallowed and surfaced as finish_reason="length" instead of raising. This was already true in the old code so it's not a regression, but worth keeping in mind — if mysterious "length" finishes appear in logs, this matcher could be the culprit.
Not blocking; the substring approach is the pragmatic choice given vLLM's unstructured error surface.
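To make the false-positive risk concrete, here is a minimal standalone sketch of the substring matcher. The tuple holds only the four entries visible in the diff (the real tuple may contain more), the function name `looks_like_context_length_error` is hypothetical, and both error bodies are invented for illustration rather than copied from vLLM output.

```python
# Only the four substrings visible in the diff; the real tuple may contain more.
CONTEXT_LENGTH_ERROR_SUBSTRINGS = (
    "max_model_len",
    "max model len",
    "max_tokens",
    "maximum context length",
)


def looks_like_context_length_error(error_text: str) -> bool:
    """Hypothetical standalone version of the substring check."""
    lowered = error_text.lower()
    return any(s in lowered for s in CONTEXT_LENGTH_ERROR_SUBSTRINGS)


# Invented error bodies, for illustration only (not real vLLM messages).
genuine = "This model's maximum context length is 4096 tokens."
unrelated = "Invalid value for max_tokens: expected an integer."
other = "Internal server error."

print(looks_like_context_length_error(genuine))    # True
print(looks_like_context_length_error(unrelated))  # True (false positive)
print(looks_like_context_length_error(other))      # False
```

The second case is the one the note warns about: a validation error that merely mentions `max_tokens` is indistinguishable from a genuine context-length overflow under pure substring matching.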
    response_content = getattr(error, "response_content", b"")
    if isinstance(response_content, bytes):
        response_content_text = response_content.decode(errors="replace")
    elif response_content is None:
        response_content_text = ""
    else:
        response_content_text = str(response_content)

    error_text = f"{error.message} {response_content_text}".lower()
    return any(substring in error_text for substring in CONTEXT_LENGTH_ERROR_SUBSTRINGS)
Good defensive coding. The old code did e.response_content.decode() (no errors="replace", no guard for missing attribute). This handles bytes/None/str and uses errors="replace" per CLAUDE.md conventions. Nice fix.
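The normalization can be exercised in isolation. The sketch below copies the branching from the diff into a hypothetical helper, `response_content_as_text`, and uses `SimpleNamespace` objects as stand-ins for the real exception type, so it only demonstrates the bytes/None/str/missing-attribute handling and nothing about the actual client library.

```python
from types import SimpleNamespace


def response_content_as_text(error: object) -> str:
    """Hypothetical helper mirroring the normalization shown in the diff."""
    content = getattr(error, "response_content", b"")
    if isinstance(content, bytes):
        # errors="replace" substitutes U+FFFD for undecodable bytes
        # instead of raising UnicodeDecodeError.
        return content.decode(errors="replace")
    if content is None:
        return ""
    return str(content)


# SimpleNamespace stands in for the real exception; payloads are invented.
cases = {
    "bytes": SimpleNamespace(response_content=b"max_model_len exceeded"),
    "invalid utf-8": SimpleNamespace(response_content=b"\xff oops"),
    "none": SimpleNamespace(response_content=None),
    "str": SimpleNamespace(response_content="already text"),
    "missing attribute": SimpleNamespace(),
}

for name, err in cases.items():
    print(f"{name}: {response_content_as_text(err)!r}")
```

A bare `.decode()` would raise `UnicodeDecodeError` on the `invalid utf-8` case and `AttributeError` on the `none` and `missing attribute` cases, which are the failure modes the comment describes in the old code.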
    try:
        tokenize_response = await client.create_tokenize(**tokenize_body_dict)
    except ClientResponseError as e:
        if self._is_context_length_error(e):
            res = self._create_empty_chat_completion()
            res.choices[0].finish_reason = "length"
            return res
        raise
NOTE — When the tokenize call fails here, a successful chat completion (line 477) is silently discarded. This is the correct behavior: without prompt_token_ids the training pipeline can't build a loss mask, so "length" correctly tells the caller to skip this example. But it's worth a comment since a future reader will wonder why a successful generation is thrown away.
For pure evaluation (return_token_id_information=False), this path is never reached, so no eval score impact.
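The control flow described in this comment can be sketched with toy stand-ins. Everything below is hypothetical: `FakeContextLengthError`, `fake_tokenize`, `complete_with_token_ids`, and the `want_token_ids` flag (playing the role of `return_token_id_information`) are illustrative names, and the word-count "tokenizer" is invented. It only shows why a successful generation is replaced by an empty `"length"` result when tokenization fails, and why the evaluation path never gets there.

```python
import asyncio
from dataclasses import dataclass, field


class FakeContextLengthError(Exception):
    """Stand-in for the ClientResponseError seen in the diff."""


@dataclass
class Choice:
    finish_reason: str = "stop"
    text: str = ""


@dataclass
class Completion:
    choices: list = field(default_factory=lambda: [Choice()])


async def fake_tokenize(prompt: str, limit: int = 8) -> list:
    # Toy tokenizer: one token per whitespace-separated word.
    words = prompt.split()
    if len(words) > limit:
        raise FakeContextLengthError("max_model_len exceeded")
    return list(range(len(words)))


async def complete_with_token_ids(prompt: str, want_token_ids: bool):
    # Pretend the chat completion itself already succeeded.
    completion = Completion([Choice(finish_reason="stop", text="generated text")])
    if not want_token_ids:
        # Evaluation path: the tokenize call is never made.
        return completion, None
    try:
        token_ids = await fake_tokenize(prompt)
    except FakeContextLengthError:
        # Without prompt token IDs no loss mask can be built, so the
        # successful generation is dropped and "length" tells the caller
        # to skip this example.
        return Completion([Choice(finish_reason="length")]), None
    return completion, token_ids


long_prompt = " ".join(["word"] * 20)
training_result, _ = asyncio.run(complete_with_token_ids(long_prompt, True))
eval_result, _ = asyncio.run(complete_with_token_ids(long_prompt, False))
print(training_result.choices[0].finish_reason)  # length
print(eval_result.choices[0].finish_reason)      # stop
```

The same over-long prompt yields `"length"` on the training path and an untouched `"stop"` on the evaluation path, matching the claim that eval scores are unaffected.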
Summary
Testing
uv run --extra dev python -m pytest responses_api_models/vllm_model/tests/test_app.py (72 passed, 1 warning)
Cherry-picked from 7ad66da onto current main.