Consolidate duplicate benchmark directories (benchmark/ -> benchmarks/)#261
Merged
PR #219 added a singular benchmark/ folder (biodex, reranking) alongside the pre-existing plural benchmarks/ (failure_mode_discovery, llm_as_judge, rag_pubmedqa). Move the two new suites under benchmarks/ so there is a single benchmark directory.

No subdir name collisions; the biodex/reranking scripts use sibling imports unaffected by the parent move, and benchmarks/main.py only registers its existing subcommands.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
liana313 added a commit that referenced this pull request on Jun 13, 2026
Release **v1.2.2**. Bumps `pyproject.toml` 1.2.1 → 1.2.2 and regenerates `uv.lock` to match (so the locked-constraints CI step stays green).

Notable changes shipping in this release since 1.2.1:

- **#260**: gpt-5 / reasoning-model accuracy fix, with a model-aware `max_tokens` default and a truncation warning (closes #255)
- **#262**: fix flaky `test_pairwise_judge`
- **#261**: consolidate duplicate benchmark directories
- **#219**: Biodex + reranking benchmark suites (resolves #227)

On merge I'll tag `v1.2.2` off `main`, which triggers `publish.yml` to build and publish to PyPI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
After #219 ("Experiment scripts for Biodex and Reranking") merged, the repo had two benchmark directories:
- benchmark/ (singular, new in #219): biodex, reranking
- benchmarks/ (plural, pre-existing): failure_mode_discovery, llm_as_judge, rag_pubmedqa

Fix
Move the two new suites into the pre-existing benchmarks/:
- benchmark/biodex → benchmarks/biodex
- benchmark/reranking → benchmarks/reranking

benchmark/ is removed. All 10 files move as git renames (history preserved).

Safety
- […] benchmark/ path.
- The biodex/reranking scripts use sibling imports (from metrics import ...) that resolve within their own folder, unaffected by the parent move.
- benchmarks/main.py only registers its existing subcommands; the moved suites are run standalone per their READMEs.

🤖 Generated with Claude Code
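The commit message states that the move causes no subdirectory name collisions. A check of that kind could look roughly like the sketch below. It is illustrative only and not part of this PR: the helper names (`subdir_names`, `colliding_subdirs`, `demo`) are hypothetical, and the layout is rebuilt in a temporary directory from the folder names listed above rather than read from the real repository.

```python
from pathlib import Path
import tempfile


def subdir_names(root: Path) -> set[str]:
    """Return the names of the immediate subdirectories of root."""
    return {p.name for p in root.iterdir() if p.is_dir()}


def colliding_subdirs(src: Path, dst: Path) -> set[str]:
    """Names that exist as subdirectories under both src and dst."""
    return subdir_names(src) & subdir_names(dst)


def demo() -> set[str]:
    # Rebuild the directory layout from the PR description in a throwaway
    # temp dir (hypothetical stand-in for the real repository checkout).
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        for name in ("biodex", "reranking"):
            (repo / "benchmark" / name).mkdir(parents=True)
        for name in ("failure_mode_discovery", "llm_as_judge", "rag_pubmedqa"):
            (repo / "benchmarks" / name).mkdir(parents=True)
        return colliding_subdirs(repo / "benchmark", repo / "benchmarks")


if __name__ == "__main__":
    # An empty set means every suite can move without overwriting anything.
    print(demo())  # prints: set()
```

Against a real checkout, one would presumably call `colliding_subdirs(Path("benchmark"), Path("benchmarks"))` from the repository root before running the move; a non-empty result would name the suites that need renaming first.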