Fix gpt-5/reasoning-model accuracy: model-aware max_tokens default by liana313 · Pull Request #260 · lotus-data/lotus

liana313 · 2026-06-11T22:54:34Z

Fixes #255.

Root cause

Reasoning models (gpt-5, o-series) spend hidden reasoning tokens from the same max_completion_tokens budget as the visible answer. Lotus's flat max_tokens=512 default starves them:

- On non-trivial rows, gpt-5 burns the entire 512-token budget on internal reasoning (finish_reason='length').
- The completion comes back with content='' (or OpenAI returns a 400 with "Could not finish the message because max_tokens or model output limit was reached").
- filter_postprocess finds neither output token in the empty string and silently coerces the row to default=True: bad accuracy, zero errors.
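The silent-default step can be illustrated with a minimal sketch. `parse_filter_output` below is a hypothetical stand-in, not the actual `filter_postprocess` implementation; it only assumes that the parser looks for a true/false token and falls back to a default when neither is present.

```python
def parse_filter_output(content: str, default: bool = True) -> tuple[bool, bool]:
    """Hypothetical parser: return (value, was_defaulted) for one filter completion."""
    text = content.lower()
    if "true" in text:
        return True, False
    if "false" in text:
        return False, False
    # Neither token found: fall back to the default without raising.
    # An empty, truncated completion ends up here.
    return default, True


# A reasoning model that spent its whole budget on hidden reasoning returns ''.
value, defaulted = parse_filter_output("", default=True)
print(value, defaulted)
```

Because the fallback raises nothing, a truncated row is indistinguishable from a genuine positive unless the caller also inspects finish_reason.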

Empirical confirmation (debug branch `debug-gpt5-repro`, runs in Actions)

| Scenario | Reasoning tokens | Result |
|---|---|---|
| trivial sentiment claim @512 | 64 | `Answer: True` ✓ (8/8; easy tasks unaffected) |
| hard arithmetic claim @512 | 448–512 | one row empty → defaulted → 5/6 |
| same, @8000 budget | ~512 | 6/6 |
| hard ZS_COT @512 | - | truncated/empty rows → 5/6 |
| forced 64-token budget | 64 | `finish_reason=length`, `content=''`, sometimes OpenAI 400 |
| gpt-4o-mini @512 (control) | 0 | 6/6 |

On real workloads (longer docs, harder predicates) most rows blow the 512 budget, which is why gpt-5 looked uniformly broken.

Fix

- LM default max_tokens is now model-aware: 8192 for reasoning models (detected via litellm.supports_reasoning), 512 otherwise. An explicit max_tokens always wins. max_completion_tokens is a cap, not a purchase, so this only raises the worst-case per-row cost on reasoning models.
- _get_top_choice now warns on finish_reason='length' with a reasoning-model-specific hint, so this failure mode is never silent again.
- Users can still pass reasoning_effort="minimal" (forwarded via kwargs) to trade reasoning depth for cost; this was verified working in the probes.
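The default-resolution rule described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `resolve_max_tokens` and `looks_like_reasoning_model` are hypothetical names, and the name-prefix check is only a dependency-free stand-in for `litellm.supports_reasoning`, which the PR uses for detection.

```python
from typing import Callable, Optional

REASONING_DEFAULT_MAX_TOKENS = 8192  # default for reasoning models, per this PR
STANDARD_DEFAULT_MAX_TOKENS = 512    # default for all other models, per this PR


def looks_like_reasoning_model(model: str) -> bool:
    """Hypothetical stand-in for litellm.supports_reasoning (name-prefix heuristic)."""
    name = model.split("/")[-1].lower()
    return name.startswith(("gpt-5", "o1", "o3", "o4"))


def resolve_max_tokens(
    model: str,
    max_tokens: Optional[int] = None,
    is_reasoning: Callable[[str], bool] = looks_like_reasoning_model,
) -> int:
    """Pick the token budget: an explicit value wins, otherwise use a model-aware default."""
    if max_tokens is not None:
        return max_tokens
    if is_reasoning(model):
        return REASONING_DEFAULT_MAX_TOKENS
    return STANDARD_DEFAULT_MAX_TOKENS


print(resolve_max_tokens("gpt-5"))
print(resolve_max_tokens("gpt-4o-mini"))
print(resolve_max_tokens("gpt-5", max_tokens=256))
```

Passing the detector as a parameter keeps the sketch testable offline; in practice the real capability lookup would be injected instead of the prefix heuristic.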

Tests

- New offline tests/test_lm_defaults.py (7 tests, no API calls) pinning the model-aware defaults; wired into the settings CI suite.
- End-to-end verification run on the debug branch: gpt-5 with pure library defaults on the previously-failing hard set.

🤖 Generated with Claude Code

Reasoning models (gpt-5, o-series) spend hidden reasoning tokens from the same max_completion_tokens budget as the visible answer. With the flat 512-token default, gpt-5 exhausts the budget on hard rows before emitting any visible text; the empty completion is then silently coerced to the operator default (e.g. sem_filter default=True), tanking accuracy with no error (issue #255).

Verified empirically: hard arithmetic claims cost gpt-5 448-512 hidden reasoning tokens; truncated rows return content='' with finish_reason='length' (or a 400 'Could not finish the message').

- Default max_tokens is now model-aware: 8192 for reasoning models (litellm.supports_reasoning), 512 otherwise; explicit max_tokens wins.
- Warn when a completion is truncated by max_tokens, with a reasoning-model-specific hint.
- Offline unit tests in tests/test_lm_defaults.py, wired into the settings CI suite.

Fixes #255

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Include a copy-pasteable lotus.settings.configure(lm=LM(model=..., max_tokens=...)) snippet (and reasoning_effort hint for reasoning models) in the truncation warning, plus offline tests asserting the message content.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
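A rough sketch of what such a warning could look like is given below. The function names, the message wording, and the suggested doubling of the budget are all assumptions made for illustration; the actual text emitted by the library is not reproduced here.

```python
import warnings
from typing import Optional


def truncation_warning(model: str, max_tokens: int, is_reasoning: bool) -> str:
    """Build a hypothetical truncation message with a copy-pasteable fix."""
    suggested = max_tokens * 2  # illustrative suggestion only
    lines = [
        f"Completion from {model} was truncated (finish_reason='length') "
        f"at max_tokens={max_tokens}.",
        "To raise the budget:",
        f'    lotus.settings.configure(lm=LM(model="{model}", max_tokens={suggested}))',
    ]
    if is_reasoning:
        # Reasoning models draw hidden reasoning tokens from the same budget.
        lines.append(
            'Hidden reasoning tokens count against this budget; '
            'reasoning_effort="minimal" reduces them.'
        )
    return "\n".join(lines)


def check_finish_reason(
    finish_reason: str, model: str, max_tokens: int, is_reasoning: bool
) -> Optional[str]:
    """Emit the warning only for truncated completions; return the message if emitted."""
    if finish_reason != "length":
        return None
    message = truncation_warning(model, max_tokens, is_reasoning)
    warnings.warn(message, stacklevel=2)
    return message
```

Keeping the message builder separate from the `warnings.warn` call makes the content easy to assert in offline tests, which matches the testing approach described in this commit.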

Release **v1.2.2**. Bumps `pyproject.toml` 1.2.1 → 1.2.2 and regenerates `uv.lock` to match (so the locked-constraints CI step stays green).

Notable changes shipping in this release since 1.2.1:

- **#260**: gpt-5 / reasoning-model accuracy fix: model-aware `max_tokens` default + truncation warning (closes #255)
- **#262**: fix flaky `test_pairwise_judge`
- **#261**: consolidate duplicate benchmark directories
- **#219**: Biodex + reranking benchmark suites (resolves #227)

On merge I'll tag `v1.2.2` off `main`, which triggers `publish.yml` to build and publish to PyPI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

liana313 and others added 2 commits June 11, 2026 15:52

liana313 merged commit bdaa5dc into main Jun 12, 2026
9 checks passed

liana313 mentioned this pull request Jun 13, 2026

Bump version to 1.2.2 #263

Merged

liana313 merged 2 commits into main from fix-gpt5-reasoning-budget
