Fix gpt-5/reasoning-model accuracy: model-aware max_tokens default #260
Merged
Reasoning models (gpt-5, o-series) spend hidden reasoning tokens from the same max_completion_tokens budget as the visible answer. With the flat 512-token default, gpt-5 exhausts the budget on hard rows before emitting any visible text; the empty completion is then silently coerced to the operator default (e.g. sem_filter default=True), tanking accuracy with no error (issue #255).

Verified empirically: hard arithmetic claims cost gpt-5 448-512 hidden reasoning tokens; truncated rows return content='' with finish_reason='length' (or a 400 'Could not finish the message').

- Default max_tokens is now model-aware: 8192 for reasoning models (litellm.supports_reasoning), 512 otherwise; explicit max_tokens wins.
- Warn when a completion is truncated by max_tokens, with a reasoning-model-specific hint.
- Offline unit tests in tests/test_lm_defaults.py, wired into the settings CI suite.

Fixes #255

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Include a copy-pasteable lotus.settings.configure(lm=LM(model=..., max_tokens=...)) snippet (and reasoning_effort hint for reasoning models) in the truncation warning, plus offline tests asserting the message content. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
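A minimal sketch of what such a truncation warning could look like. The function names, the message wording, and the "double the limit" suggestion are illustrative assumptions, not the code merged in this PR; only the `lotus.settings.configure(lm=LM(model=..., max_tokens=...))` shape and the `reasoning_effort` hint come from the commit message above.

```python
import warnings


def build_truncation_warning(model: str, max_tokens: int, is_reasoning: bool) -> str:
    """Build a warning message with a copy-pasteable fix (hypothetical wording)."""
    suggested = max_tokens * 2  # assumption: suggest doubling the current limit
    lines = [
        f"Completion from {model!r} was truncated (finish_reason='length') "
        f"at max_tokens={max_tokens}.",
        "Raise the limit, for example:",
        f"    lotus.settings.configure(lm=LM(model={model!r}, max_tokens={suggested}))",
    ]
    if is_reasoning:
        # Reasoning models bill hidden reasoning against the same budget.
        lines.append(
            'Hidden reasoning tokens count against max_tokens; '
            'reasoning_effort="minimal" trades reasoning depth for cost.'
        )
    return "\n".join(lines)


def warn_if_truncated(finish_reason: str, model: str, max_tokens: int,
                      is_reasoning: bool) -> bool:
    """Emit the warning only when the completion hit the token cap."""
    if finish_reason != "length":
        return False
    warnings.warn(
        build_truncation_warning(model, max_tokens, is_reasoning),
        UserWarning,
        stacklevel=2,
    )
    return True
```

Keeping the message builder separate from the `warnings.warn` call makes the message content easy to assert in offline tests, which is the kind of check the commit message describes.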
Merged
liana313 added a commit that referenced this pull request on Jun 13, 2026
Release **v1.2.2**. Bumps `pyproject.toml` 1.2.1 → 1.2.2 and regenerates `uv.lock` to match (so the locked-constraints CI step stays green).

Notable changes shipping in this release since 1.2.1:

- **#260**: gpt-5 / reasoning-model accuracy fix: model-aware `max_tokens` default + truncation warning (closes #255)
- **#262**: fix flaky `test_pairwise_judge`
- **#261**: consolidate duplicate benchmark directories
- **#219**: Biodex + reranking benchmark suites (resolves #227)

On merge I'll tag `v1.2.2` off `main`, which triggers `publish.yml` to build and publish to PyPI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes #255.
Root cause
Reasoning models (gpt-5, o-series) spend hidden reasoning tokens from the same max_completion_tokens budget as the visible answer. Lotus's flat max_tokens=512 default starves them:

- On hard rows the model spends the whole budget on hidden reasoning before emitting any visible text (finish_reason='length').
- The completion comes back with content='' (or OpenAI 400s with "Could not finish the message because max_tokens or model output limit was reached").
- filter_postprocess finds neither output token in the empty string and silently coerces the row to default=True: bad accuracy, zero errors.

Empirical confirmation (debug branch debug-gpt5-repro, runs in Actions)

- Easy tasks: Answer: True ✓ (8/8; easy tasks unaffected)
- Hard arithmetic claims: finish_reason=length, content='', sometimes OpenAI 400

On real workloads (longer docs, harder predicates) most rows blow the 512 budget, which is why gpt-5 looked uniformly broken.
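The failure chain above can be reproduced with a toy model. This is a sketch under stated assumptions: `simulate_completion` and `parse_filter_output` are hypothetical stand-ins (not Lotus's `filter_postprocess` or the OpenAI API), and tokens are approximated by whitespace-separated words.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    content: str
    finish_reason: str


def simulate_completion(reasoning_tokens: int, answer: str,
                        max_tokens: int = 512) -> Completion:
    """Toy model: hidden reasoning is billed against the same cap as the answer."""
    remaining = max_tokens - reasoning_tokens
    if remaining <= 0:
        # Budget exhausted before any visible text was produced.
        return Completion(content="", finish_reason="length")
    words = answer.split()
    if len(words) > remaining:
        return Completion(content=" ".join(words[:remaining]), finish_reason="length")
    return Completion(content=answer, finish_reason="stop")


def parse_filter_output(text: str, default: bool = True) -> bool:
    """Hypothetical postprocessor: look for either output token, else fall back."""
    lowered = text.lower()
    if "true" in lowered:
        return True
    if "false" in lowered:
        return False
    return default  # silent fallback: no error, no warning


easy = simulate_completion(reasoning_tokens=100, answer="Answer: False")
hard = simulate_completion(reasoning_tokens=512, answer="Answer: False")

print(parse_filter_output(easy.content))  # parsed from the visible answer
print(parse_filter_output(hard.content))  # empty string, so the default is returned
```

In the second call the correct label is False, but the empty completion yields the default True, and nothing in the return value distinguishes that row from a genuine True.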
Fix
- LM default max_tokens is now model-aware: 8192 for reasoning models (detected via litellm.supports_reasoning), 512 otherwise. An explicit max_tokens always wins. max_completion_tokens is a cap, not a purchase, so this only affects worst-case per-row cost on reasoning models.
- _get_top_choice now warns on finish_reason='length' with a reasoning-model-specific hint, so this failure mode is never silent again.
- reasoning_effort="minimal" (forwarded via kwargs) can be passed to trade reasoning depth for cost; verified working in the probes.

Tests
tests/test_lm_defaults.py (7 tests, no API calls) pins the model-aware defaults and is wired into the settings CI suite.

🤖 Generated with Claude Code
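The model-aware default described under Fix can be sketched as a small resolver. This is an illustration, not the merged implementation: `resolve_max_tokens` and `name_based_detector` are hypothetical names, and the detector is injected so the example runs offline. The PR itself detects reasoning models via litellm.supports_reasoning.

```python
from typing import Callable, Optional

REASONING_DEFAULT_MAX_TOKENS = 8192  # default for reasoning models, per the PR
STANDARD_DEFAULT_MAX_TOKENS = 512    # default for all other models, per the PR


def name_based_detector(model: str) -> bool:
    """Hypothetical offline stand-in for a reasoning-model capability lookup."""
    base = model.split("/")[-1]  # drop a provider prefix such as "openai/"
    return base.startswith(("gpt-5", "o1", "o3", "o4"))


def resolve_max_tokens(
    model: str,
    explicit: Optional[int] = None,
    is_reasoning_model: Callable[[str], bool] = name_based_detector,
) -> int:
    """Pick max_tokens: an explicit value always wins, else a model-aware default."""
    if explicit is not None:
        return explicit
    if is_reasoning_model(model):
        return REASONING_DEFAULT_MAX_TOKENS
    return STANDARD_DEFAULT_MAX_TOKENS


print(resolve_max_tokens("gpt-5"))                # 8192
print(resolve_max_tokens("gpt-4o-mini"))          # 512
print(resolve_max_tokens("gpt-5", explicit=256))  # 256
```

Injecting the detector keeps checks like these free of API calls, in the same spirit as the offline tests listed above.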