
Fix gpt-5/reasoning-model accuracy: model-aware max_tokens default #260

Merged
liana313 merged 2 commits into main from fix-gpt5-reasoning-budget on Jun 12, 2026
@liana313

Collaborator

Fixes #255.

Root cause

Reasoning models (gpt-5, o-series) spend hidden reasoning tokens from the same max_completion_tokens budget as the visible answer. Lotus's flat max_tokens=512 default starves them:

  1. On non-trivial rows, gpt-5 burns the entire 512-token budget on internal reasoning (finish_reason='length')
  2. The completion comes back with content='' (or OpenAI 400s with "Could not finish the message because max_tokens or model output limit was reached")
  3. filter_postprocess finds neither output token in the empty string and silently coerces the row to default=True — bad accuracy, zero errors
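The silent fallback in step 3 can be sketched in a few lines. This is a simplified, hypothetical stand-in for `filter_postprocess`, not the Lotus implementation: the function name `parse_filter_output`, the `used_default` flag, and the literal `True`/`False` output tokens are assumptions made for illustration.

```python
def parse_filter_output(text: str, default: bool = True) -> tuple[bool, bool]:
    """Return (value, used_default) for one model completion.

    Hypothetical simplification: searches for the literal tokens
    "true" / "false", case-insensitively.
    """
    lowered = text.lower()
    has_true = "true" in lowered
    has_false = "false" in lowered
    if has_true and not has_false:
        return True, False
    if has_false and not has_true:
        return False, False
    # Neither token (or both) found: fall back to the default.
    # Nothing is raised here, which is why the failure stays invisible.
    return default, True


# A truncated reasoning-model completion arrives as an empty string.
completions = ["Answer: True", "Answer: False", ""]
results = [parse_filter_output(c) for c in completions]
```

The empty completion is indistinguishable from a genuine `True` unless the caller inspects the second element, which is the gap the truncation warning in this PR is meant to close.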

Empirical confirmation (debug branch debug-gpt5-repro, runs in Actions)

| Scenario | Reasoning tokens | Result |
|---|---|---|
| trivial sentiment claim @512 | 64 | Answer: True ✓ (8/8; easy tasks unaffected) |
| hard arithmetic claim @512 | 448–512 | one row empty → defaulted → 5/6 |
| same, @8000 budget | ~512 | 6/6 |
| hard ZS_COT @512 | | truncated/empty rows → 5/6 |
| forced 64-token budget | 64 | finish_reason=length, content='', sometimes OpenAI 400 |
| gpt-4o-mini @512 (control) | 0 | 6/6 |

On real workloads (longer docs, harder predicates) most rows blow the 512 budget, which is why gpt-5 looked uniformly broken.
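As a rough worked example of the shared budget, the figures reported above can be plugged into a simple subtraction. This assumes the accounting is exactly "budget minus hidden reasoning tokens", which is a simplification; the helper name `visible_token_headroom` is hypothetical.

```python
def visible_token_headroom(budget: int, reasoning_tokens: int) -> int:
    """Tokens left for the visible answer once hidden reasoning is paid for."""
    return max(0, budget - reasoning_tokens)


# Hard arithmetic claim at the old flat default of 512 tokens.
headroom_512_low = visible_token_headroom(512, 448)
headroom_512_high = visible_token_headroom(512, 512)

# Same claim with an 8000-token budget and roughly 512 reasoning tokens.
headroom_8000 = visible_token_headroom(8000, 512)
```

Under that assumption, the 512-token budget leaves between 64 and 0 tokens for the answer on hard rows, while the 8000-token budget leaves 7488.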

Fix

  • LM default max_tokens is now model-aware: 8192 for reasoning models (detected via litellm.supports_reasoning), 512 otherwise. An explicit max_tokens always wins. max_completion_tokens is a cap, not a purchase — this only affects worst-case per-row cost on reasoning models.
  • _get_top_choice now warns on finish_reason='length' with a reasoning-model-specific hint, so this failure mode is never silent again.
  • Users can still pass reasoning_effort="minimal" (forwarded via kwargs) to trade reasoning depth for cost — verified working in the probes.
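The default-selection rule in the first bullet might look roughly like the sketch below. It is not the code from this PR: `resolve_max_tokens` and `fake_supports_reasoning` are hypothetical names, and the detector is injected as a callable so the example runs offline. The PR itself names `litellm.supports_reasoning` as the detector.

```python
from typing import Callable, Optional

REASONING_DEFAULT_MAX_TOKENS = 8192  # default for reasoning models, per this PR
STANDARD_DEFAULT_MAX_TOKENS = 512    # the previous flat default, kept otherwise


def resolve_max_tokens(
    model: str,
    explicit_max_tokens: Optional[int],
    is_reasoning_model: Callable[[str], bool],
) -> int:
    """Pick the completion budget; an explicit value always wins."""
    if explicit_max_tokens is not None:
        return explicit_max_tokens
    if is_reasoning_model(model):
        return REASONING_DEFAULT_MAX_TOKENS
    return STANDARD_DEFAULT_MAX_TOKENS


def fake_supports_reasoning(model: str) -> bool:
    """Offline stand-in for the real detector (assumption, prefix match only)."""
    return model.startswith(("gpt-5", "o1", "o3", "o4"))


reasoning_default = resolve_max_tokens("gpt-5", None, fake_supports_reasoning)
standard_default = resolve_max_tokens("gpt-4o-mini", None, fake_supports_reasoning)
explicit_value = resolve_max_tokens("gpt-5", 2000, fake_supports_reasoning)
```

Checking the explicit value first keeps the behavior predictable for users who already tuned `max_tokens`: the model-aware branch only applies when nothing was passed.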

Tests

  • New offline tests/test_lm_defaults.py (7 tests, no API calls) pinning the model-aware defaults; wired into the settings CI suite.
  • End-to-end verification run on the debug branch: gpt-5 with pure library defaults on the previously-failing hard set.

🤖 Generated with Claude Code

liana313 and others added 2 commits June 11, 2026 15:52
Reasoning models (gpt-5, o-series) spend hidden reasoning tokens from the
same max_completion_tokens budget as the visible answer. With the flat
512-token default, gpt-5 exhausts the budget on hard rows before emitting
any visible text; the empty completion is then silently coerced to the
operator default (e.g. sem_filter default=True), tanking accuracy with no
error (issue #255). Verified empirically: hard arithmetic claims cost
gpt-5 448-512 hidden reasoning tokens; truncated rows return content=''
with finish_reason='length' (or a 400 'Could not finish the message').

- Default max_tokens is now model-aware: 8192 for reasoning models
  (litellm.supports_reasoning), 512 otherwise; explicit max_tokens wins.
- Warn when a completion is truncated by max_tokens, with a
  reasoning-model-specific hint.
- Offline unit tests in tests/test_lm_defaults.py, wired into the
  settings CI suite.

Fixes #255

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Include a copy-pasteable lotus.settings.configure(lm=LM(model=..., max_tokens=...))
snippet (and reasoning_effort hint for reasoning models) in the truncation
warning, plus offline tests asserting the message content.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
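A minimal sketch of the truncation warning described in the two commits above, assuming only the standard-library `warnings` module. The function names, the exact message wording, and the suggestion to double the budget are illustrative assumptions rather than the text Lotus emits; `lotus.settings.configure`, `LM`, and `reasoning_effort` are the names given in the commit messages.

```python
import warnings
from typing import Optional


def build_truncation_hint(model: str, max_tokens: int, is_reasoning: bool) -> str:
    """Compose a warning with a copy-pasteable configuration snippet."""
    msg = (
        f"Completion from {model} was truncated (finish_reason='length') "
        f"at max_tokens={max_tokens}. Consider raising the budget, e.g. "
        f"lotus.settings.configure(lm=LM(model={model!r}, "
        f"max_tokens={max_tokens * 2}))."
    )
    if is_reasoning:
        # Extra hint that only applies to reasoning models.
        msg += (
            " Reasoning models spend hidden reasoning tokens from the same"
            ' budget; reasoning_effort="minimal" trades depth for cost.'
        )
    return msg


def warn_if_truncated(
    finish_reason: Optional[str], model: str, max_tokens: int, is_reasoning: bool
) -> bool:
    """Emit a warning when the completion hit the token cap; return True if so."""
    if finish_reason != "length":
        return False
    warnings.warn(build_truncation_hint(model, max_tokens, is_reasoning), stacklevel=2)
    return True
```

Keeping the message builder separate from the function that warns makes the message content easy to assert on in offline tests, in the spirit of the tests mentioned in the second commit.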
liana313 merged commit bdaa5dc into main on Jun 12, 2026
9 checks passed
liana313 mentioned this pull request on Jun 13, 2026
liana313 added a commit that referenced this pull request on Jun 13, 2026
Release **v1.2.2**. Bumps `pyproject.toml` 1.2.1 → 1.2.2 and regenerates
`uv.lock` to match (so the locked-constraints CI step stays green).

Notable changes shipping in this release since 1.2.1:
- **#260** — gpt-5 / reasoning-model accuracy fix: model-aware
`max_tokens` default + truncation warning (closes #255)
- **#262** — fix flaky `test_pairwise_judge`
- **#261** — consolidate duplicate benchmark directories
- **#219** — Biodex + reranking benchmark suites (resolves #227)

On merge I'll tag `v1.2.2` off `main`, which triggers `publish.yml` to
build and publish to PyPI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

gpt-5 not working well