Fix flaky test_pairwise_judge (ambiguous row removed) #262
Merged
Row 1 compared 'Meeting request.' vs a longer subject line for a '1:1 meeting' prompt with no clear winner, so gpt-4o-mini's judge flip-flopped and the exact A/B assertion failed intermittently (3 spurious lm-openai failures observed in one day). Replace it with an unambiguous pair: a casual 'ayo lets meet' (A) vs a polite, professional subject line (B), so the expected verdict is stable like row 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
liana313 added a commit that referenced this pull request on Jun 13, 2026
Release **v1.2.2**. Bumps `pyproject.toml` 1.2.1 → 1.2.2 and regenerates `uv.lock` to match (so the locked-constraints CI step stays green).

Notable changes shipping in this release since 1.2.1:
- **#260**: gpt-5 / reasoning-model accuracy fix: model-aware `max_tokens` default + truncation warning (closes #255)
- **#262**: fix flaky `test_pairwise_judge`
- **#261**: consolidate duplicate benchmark directories
- **#219**: Biodex + reranking benchmark suites (resolves #227)

On merge I'll tag `v1.2.2` off `main`, which triggers `publish.yml` to build and publish to PyPI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
`test_pairwise_judge[gpt-4o-mini]` failed intermittently: 3 spurious `lm-openai` failures in a single day, each passing on rerun. Every failure hit the same exact A/B assertion.

Row 1 of the fixture was genuinely ambiguous:
- "Meeting request."
- "Requesting a 1:1: finding time to connect next week?"

"Meeting request." is a perfectly acceptable subject line, so the judge legitimately preferred it part of the time. That made the verdict non-deterministic and the test flaky.

Fix
Make row 1 have a clear winner, the way row 0 already does (a strong summary vs "Exercise is good."):
- A: "ayo lets meet" (casual, unprofessional)
- B: "Request to schedule a brief 1:1 meeting at your convenience" (polite, professional)

The expected verdict (B better) is now unambiguous, so the exact A/B assertion is stable. No production code changed.
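One way to check that a fixture pair really has a stable verdict is to call the judge several times and measure how often the verdicts agree. The sketch below is illustrative only and is not code from this repository: `verdict_agreement` and `make_stub_judge` are hypothetical helpers, and the stub stands in for the real gpt-4o-mini call by picking "A" with a fixed probability.

```python
import random
from collections import Counter
from typing import Callable

# Hypothetical judge signature: takes responses A and B, returns "A" or "B".
Judge = Callable[[str, str], str]

# Pairs quoted in this PR description.
OLD_ROW_1 = ("Meeting request.",
             "Requesting a 1:1: finding time to connect next week?")
NEW_ROW_1 = ("ayo lets meet",
             "Request to schedule a brief 1:1 meeting at your convenience")


def verdict_agreement(judge: Judge, a: str, b: str, trials: int = 20) -> float:
    """Return the fraction of trials that match the most common verdict."""
    counts = Counter(judge(a, b) for _ in range(trials))
    return counts.most_common(1)[0][1] / trials


def make_stub_judge(p_prefers_a: float, seed: int = 0) -> Judge:
    """Stub judge (an assumption, not the real model) that picks "A" with
    probability p_prefers_a, using a seeded RNG for reproducibility."""
    rng = random.Random(seed)

    def judge(a: str, b: str) -> str:
        return "A" if rng.random() < p_prefers_a else "B"

    return judge


if __name__ == "__main__":
    # A judge that is torn between the two options models the old row 1.
    print(verdict_agreement(make_stub_judge(0.5), *OLD_ROW_1))
    # A judge that always prefers B models the new row 1.
    print(verdict_agreement(make_stub_judge(0.0), *NEW_ROW_1))
```

An agreement of 1.0 means an exact A/B assertion is safe for that pair; anything lower means the assertion can fail on some runs even when nothing is broken. Against a real model, the same helper could wrap the actual judge call, at the cost of extra API requests.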