Fix flaky test_pairwise_judge (ambiguous row removed) by liana313 · Pull Request #262 · lotus-data/lotus

liana313 · 2026-06-13T19:34:19Z

Problem

test_pairwise_judge[gpt-4o-mini] failed intermittently — 3 spurious lm-openai failures in a single day, each passing on rerun. Always the same assertion:

assert list(df["_judge_0"].values) == ["A", "B"]
E  AssertionError: assert ['A', 'A'] == ['A', 'B']

Row 1 of the fixture was genuinely ambiguous:

| instruction | model_a | model_b | expected |
| --- | --- | --- | --- |
| "polite email subject line to schedule a 1:1" | `"Meeting request."` | `"Requesting a 1:1: finding time to connect next week?"` | B |

"Meeting request." is a perfectly acceptable subject line, so the judge legitimately preferred it part of the time → non-deterministic verdict → flaky test.
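A rough sketch of why even a modest flip rate on one row shows up as intermittent CI failures. The flip probability `p_wrong` is a hypothetical illustration value, not a rate measured in this PR, and runs are assumed to be independent.

```python
def prob_at_least_one_failure(p_wrong: float, runs: int) -> float:
    """Chance that an exact-match assertion fails at least once.

    p_wrong: assumed probability that the judge returns the unexpected
             verdict on the ambiguous row in a single run (hypothetical).
    runs:    number of independent CI runs.
    """
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("p_wrong must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    # The test passes only if every run gets the expected verdict.
    return 1.0 - (1.0 - p_wrong) ** runs


if __name__ == "__main__":
    # Illustrative values only; neither number comes from the CI logs.
    for p_wrong in (0.05, 0.2):
        for runs in (1, 10, 30):
            chance = prob_at_least_one_failure(p_wrong, runs)
            print(f"p_wrong={p_wrong:.2f} runs={runs:>2} -> {chance:.3f}")
```

Under these assumptions the failure chance grows quickly with the number of runs, which is consistent with a test that usually passes on rerun yet still fails several times in a busy day.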

Fix

Make row 1 have a clear winner, the way row 0 already does (a strong summary vs "Exercise is good."):

model_a: "ayo lets meet" — casual, unprofessional
model_b: "Request to schedule a brief 1:1 meeting at your convenience" — polite, professional

The expected verdict (B better) is now unambiguous, so the exact A/B assertion is stable. No production code changed.
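For reference, a minimal sketch of the replaced row and the exact-match check, written as plain Python rather than the actual test code. The column names mirror the table in the Problem section, the instruction is assumed to be unchanged, and `collect_verdicts` and `stub_judge` are hypothetical helpers that stand in for the real judge call. Row 0 is omitted because its full text is not quoted in this description.

```python
from typing import Callable, Dict, List

# Row 1 after this PR (assumed layout; the real fixture may be structured differently).
ROW_1: Dict[str, str] = {
    "instruction": "polite email subject line to schedule a 1:1",
    "model_a": "ayo lets meet",
    "model_b": "Request to schedule a brief 1:1 meeting at your convenience",
    "expected": "B",
}

# A judge takes (instruction, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def collect_verdicts(rows: List[Dict[str, str]], judge: Judge) -> List[str]:
    """Run the judge on each row and return the verdicts in row order."""
    verdicts = []
    for row in rows:
        verdict = judge(row["instruction"], row["model_a"], row["model_b"])
        if verdict not in ("A", "B"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
        verdicts.append(verdict)
    return verdicts


def stub_judge(instruction: str, response_a: str, response_b: str) -> str:
    """Hypothetical stand-in for the model call; always prefers B."""
    return "B"


verdicts = collect_verdicts([ROW_1], stub_judge)
# Same shape as the test's assertion: an exact list comparison.
assert verdicts == [ROW_1["expected"]]
```

The stub only shows the shape of the check. The stability of the real test still depends on the model consistently preferring the professional subject line.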

Row 1 compared 'Meeting request.' vs a longer subject line for a '1:1 meeting' prompt with no clear winner, so gpt-4o-mini's judge flip-flopped and the exact A/B assertion failed intermittently (3 spurious lm-openai failures observed in one day).

Replace it with an unambiguous pair: a casual 'ayo lets meet' (A) vs a polite, professional subject line (B), so the expected verdict is stable like row 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Release **v1.2.2**. Bumps `pyproject.toml` 1.2.1 → 1.2.2 and regenerates `uv.lock` to match (so the locked-constraints CI step stays green).

Notable changes shipping in this release since 1.2.1:

- **#260**: gpt-5 / reasoning-model accuracy fix (model-aware `max_tokens` default + truncation warning; closes #255)
- **#262**: fix flaky `test_pairwise_judge`
- **#261**: consolidate duplicate benchmark directories
- **#219**: Biodex + reranking benchmark suites (resolves #227)

On merge I'll tag `v1.2.2` off `main`, which triggers `publish.yml` to build and publish to PyPI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

liana313 merged commit 8eaede3 into main Jun 13, 2026
26 of 27 checks passed

liana313 deleted the fix-flaky-pairwise-judge branch June 13, 2026 19:41

liana313 mentioned this pull request Jun 13, 2026

Bump version to 1.2.2 #263

Merged

liana313 merged 1 commit into main from fix-flaky-pairwise-judge
