Refresh harness leaderboard snapshots #25
Open
evanatpizzarobot wants to merge 1 commit into
Aider Polyglot YAML (aider.chat/Aider-AI/aider) is the only source that returned HTTP 200 this cycle; swebench.com, tbench.ai, and the SWELancer repo all returned HTTP 403 so those scores are held.

Score changes (aider harness, aider_polyglot benchmark):
- Claude Opus 4.7: 84.2 -> 72.0 (upstream: claude-opus-4, 32k thinking)
- GPT-5.5: 81.8 -> 88.0 (upstream: gpt-5, high effort)
- DeepSeek V4 Pro: 73.4 -> 74.2 (upstream: DeepSeek-V3.2-Exp, Reasoner)

New entries (aider harness, aider_polyglot only):
- claude-sonnet-4 (32k thinking): 61.3
- gemini-2.5-pro (32k think): 83.1
- o3-pro (high): 84.9

Note: appended TODO for new harnesses (Antigravity, OpenCode, Hermes) spotted in June 2026 coverage.
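The hold rule described above (update only from sources that answered HTTP 200, keep the previous snapshot otherwise) could be sketched roughly as follows. This is an illustration only: `SourceStatus`, `Scores`, and `resolveScores` are hypothetical names, and the real schema of `data/harnesses.json` is not shown in this PR.

```typescript
// Hypothetical shapes, assumed for illustration; not the repo's actual types.
type SourceStatus = { source: string; httpStatus: number };
type Scores = Record<string, number>;

// Decide which scores to publish for one source.
// On HTTP 200 with a parsed payload, fetched values overwrite or extend the
// previous snapshot. On any other status the previous snapshot is held as-is.
function resolveScores(
  status: SourceStatus,
  previous: Scores,
  fetched: Scores | null,
): Scores {
  if (status.httpStatus === 200 && fetched !== null) {
    return { ...previous, ...fetched };
  }
  return previous;
}

// Usage with values taken from this PR's description.
const refreshed = resolveScores(
  { source: "aider.chat", httpStatus: 200 },
  { "GPT-5.5": 81.8 },
  { "GPT-5.5": 88.0, "o3-pro (high)": 84.9 },
);

// A 403 response leaves the earlier snapshot untouched.
const held = resolveScores(
  { source: "swebench.com", httpStatus: 403 },
  { "example-model": 50.0 }, // placeholder value, not a real score
  null,
);
```

Returning the previous object unchanged on failure keeps a blocked source from silently wiping scores that were valid in the last cycle.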
Deploying tensorfeed with

| Latest commit: | 99e4f72 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://19a6eea4.tensorfeed.pages.dev |
| Branch Preview URL: | https://harness-refresh-2026-06-08.tensorfeed.pages.dev |
Summary
Automated weekly refresh of `data/harnesses.json` and `worker/src/harnesses.ts` against upstream leaderboard sources. Both files are kept byte-for-byte in sync on data fields. Both `npx tsc --noEmit` checks pass.

Source access this cycle
Score changes (aider harness, aider_polyglot benchmark)
New entries added (aider harness, aider_polyglot only)
TODOs appended to note field
New harnesses spotted in June 2026 coverage (Antigravity, OpenCode, Hermes) have been flagged for editorial review. Each would also require an entry in `src/lib/harness-directory.ts` before being added to results.

Test plan
- `npx tsc --noEmit` passes at repo root
- `cd worker && npx tsc --noEmit` passes
- `data/harnesses.json` and `worker/src/harnesses.ts` data fields are in sync

Generated by Claude Code
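The sync item in the test plan could be checked mechanically along these lines. This is a minimal sketch under stated assumptions: `canonicalize` and `dataFieldsInSync` are hypothetical helpers, not functions from this repository, and the comparison ignores object key order, which is a looser check than the byte-for-byte guarantee described in the summary.

```typescript
// Serialize a JSON-like value with object keys sorted, so that two objects
// with the same data but different key order produce the same string.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    // Array order is significant and is preserved.
    return "[" + value.map((v) => canonicalize(v)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    const parts = keys.map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k]));
    return "{" + parts.join(",") + "}";
  }
  // Primitives (string, number, boolean, null).
  return JSON.stringify(value);
}

// True when both values carry the same data fields and values.
function dataFieldsInSync(a: unknown, b: unknown): boolean {
  return canonicalize(a) === canonicalize(b);
}
```

In practice the first argument would be the parsed contents of `data/harnesses.json` and the second the data exported from `worker/src/harnesses.ts`; how that export is named is not shown in this PR.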