
Fix flaky test_pairwise_judge (ambiguous row removed) #262

Merged: liana313 merged 1 commit into main from fix-flaky-pairwise-judge on Jun 13, 2026

Conversation

@liana313

Collaborator

Problem

test_pairwise_judge[gpt-4o-mini] failed intermittently — 3 spurious lm-openai failures in a single day, each passing on rerun. Always the same assertion:

assert list(df["_judge_0"].values) == ["A", "B"]
E  AssertionError: assert ['A', 'A'] == ['A', 'B']
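A list-equality assertion like the one above reports both lists but leaves the reader to spot which row differs. A minimal sketch of a helper that names the offending rows follows; `mismatched_rows` is a hypothetical name for illustration and is not part of this PR or the test suite.

```python
from typing import List, Sequence


def mismatched_rows(actual: Sequence[str], expected: Sequence[str]) -> List[int]:
    """Return the indices where the judge's verdict differs from the expected one."""
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Verdicts taken from the failing run shown in the traceback.
failing = mismatched_rows(["A", "A"], ["A", "B"])
print(failing)  # [1]
```

Applied to the failing run, the helper points at row 1 only, which matches the traceback: row 0 agreed and row 1 did not.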

Row 1 of the fixture was genuinely ambiguous:

| instruction | model_a | model_b | expected |
| --- | --- | --- | --- |
| "polite email subject line to schedule a 1:1" | "Meeting request." | "Requesting a 1:1: finding time to connect next week?" | B |

"Meeting request." is a perfectly acceptable subject line, so the judge legitimately preferred it part of the time → non-deterministic verdict → flaky test.

Fix

Make row 1 have a clear winner, the way row 0 already does (a strong summary vs "Exercise is good."):

  • model_a: "ayo lets meet" — casual, unprofessional
  • model_b: "Request to schedule a brief 1:1 meeting at your convenience" — polite, professional

The expected verdict (B better) is now unambiguous, so the exact A/B assertion is stable. No production code changed.
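One way to check a stability claim like this is to call the judge repeatedly on the same row and measure how often it returns its most common verdict. The sketch below assumes a judge that can be called as a plain function on one row. `verdict_stability` and `make_simulated_judge` are hypothetical names, and the judge here is a seeded random stand-in with made-up probabilities, not gpt-4o-mini or the project's judge API.

```python
import random
from collections import Counter
from typing import Callable, Dict

Row = Dict[str, str]
Judge = Callable[[Row], str]


def verdict_stability(judge: Judge, row: Row, n: int = 20) -> float:
    """Share of n runs that agree with the most common verdict (1.0 means fully stable)."""
    counts = Counter(judge(row) for _ in range(n))
    return counts.most_common(1)[0][1] / n


def make_simulated_judge(p_a: float, seed: int = 0) -> Judge:
    """Hypothetical stand-in for an LLM judge: answers "A" with probability p_a."""
    rng = random.Random(seed)
    return lambda row: "A" if rng.random() < p_a else "B"


# Row 1 as described in the fix; the instruction text is quoted from the PR description.
row_1 = {
    "instruction": "polite email subject line to schedule a 1:1",
    "model_a": "ayo lets meet",
    "model_b": "Request to schedule a brief 1:1 meeting at your convenience",
}

# Illustrative probabilities only: 0.3 mimics an ambiguous pair, 0.0 a clear winner.
ambiguous = verdict_stability(make_simulated_judge(0.3), row_1, n=200)
clear = verdict_stability(make_simulated_judge(0.0), row_1, n=200)
print(f"ambiguous pair: {ambiguous:.2f}, clear pair: {clear:.2f}")
```

A score below 1.0 means the exact A/B assertion can fail on some runs. With a real judge, a check of this kind would cost one model call per sample, so it probably fits a one-off diagnostic better than the regular test suite.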

Row 1 compared 'Meeting request.' vs a longer subject line for a '1:1
meeting' prompt with no clear winner, so gpt-4o-mini's judge flip-flopped
and the exact A/B assertion failed intermittently (3 spurious lm-openai
failures observed in one day). Replace it with an unambiguous pair: a
casual 'ayo lets meet' (A) vs a polite, professional subject line (B), so
the expected verdict is stable like row 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@liana313 liana313 merged commit 8eaede3 into main Jun 13, 2026
26 of 27 checks passed
@liana313 liana313 deleted the fix-flaky-pairwise-judge branch June 13, 2026 19:41
@liana313 liana313 mentioned this pull request Jun 13, 2026
liana313 added a commit that referenced this pull request Jun 13, 2026
Release **v1.2.2**. Bumps `pyproject.toml` 1.2.1 → 1.2.2 and regenerates
`uv.lock` to match (so the locked-constraints CI step stays green).

Notable changes shipping in this release since 1.2.1:
- **#260** — gpt-5 / reasoning-model accuracy fix: model-aware
`max_tokens` default + truncation warning (closes #255)
- **#262** — fix flaky `test_pairwise_judge`
- **#261** — consolidate duplicate benchmark directories
- **#219** — Biodex + reranking benchmark suites (resolves #227)

On merge I'll tag `v1.2.2` off `main`, which triggers `publish.yml` to
build and publish to PyPI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>