Refresh harness leaderboard snapshots #25

Open
evanatpizzarobot wants to merge 1 commit into main from harness-refresh-2026-06-08

@evanatpizzarobot

Collaborator

Summary

Automated weekly refresh of data/harnesses.json and worker/src/harnesses.ts against upstream leaderboard sources. Both files are kept byte-for-byte in sync on data fields. Both npx tsc --noEmit checks pass.
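The sync claim above could be checked with something like the following. This is a minimal sketch, not the check this PR actually ran: the `Json` type and both function names are hypothetical, and the comparison ignores object key order, so it is looser than a literal byte-for-byte diff.

```typescript
// Hypothetical JSON-like value type; the real schema of data/harnesses.json
// is not shown in this PR.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted so key order does not affect the result.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  const keys = Object.keys(value).sort();
  const parts = keys.map((k) => JSON.stringify(k) + ":" + canonicalize(value[k]));
  return "{" + parts.join(",") + "}";
}

// True when both sides carry the same data, regardless of key order.
function dataFieldsInSync(jsonSide: Json, tsSide: Json): boolean {
  return canonicalize(jsonSide) === canonicalize(tsSide);
}
```

In practice the two arguments would be the parsed JSON file and the exported object from the worker module; how those are loaded is left out here.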

Source access this cycle

| Source | Status |
| --- | --- |
| Aider Polyglot (raw.githubusercontent.com/Aider-AI/aider) | 200 OK, full YAML fetched |
| SWE-bench Verified (swebench.com) | 403 Forbidden, scores held |
| Terminal-Bench (tbench.ai) | 403 Forbidden, scores held |
| SWELancer (github.com/openai/SWELancer-Benchmark) | Repo archived, no new data |
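The "scores held" behavior can be read as: only a 200 response with parsed data may overwrite the previous snapshot. A rough sketch under that assumption follows; `SourceResult` and `mergeOrHold` are hypothetical names, not identifiers from this repo.

```typescript
// Hypothetical result of one upstream fetch (shape is an assumption).
interface SourceResult {
  status: number;                  // HTTP status code of the fetch
  scores?: Record<string, number>; // parsed scores, present only on success
}

// Keep the previous snapshot unless the source returned 200 with data.
function mergeOrHold(
  previous: Record<string, number>,
  result: SourceResult
): Record<string, number> {
  if (result.status !== 200 || result.scores === undefined) {
    return { ...previous }; // hold: nothing is overwritten
  }
  return { ...previous, ...result.scores }; // refresh: new values win
}
```

Returning a copy in both branches keeps the previous snapshot unmodified, which makes a before/after comparison straightforward.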

Score changes (aider harness, aider_polyglot benchmark)

| Model | Before | After | Upstream model |
| --- | --- | --- | --- |
| Claude Opus 4.7 | 84.2 | 72.0 | claude-opus-4 (32k thinking) |
| GPT-5.5 | 81.8 | 88.0 | gpt-5 (high effort) |
| DeepSeek V4 Pro | 73.4 | 74.2 | DeepSeek-V3.2-Exp (Reasoner) |
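For review, the deltas behind these changes can be computed directly from the Before and After columns. The sketch below is illustrative only; the 5-point review threshold and all names are assumptions, not part of this PR.

```typescript
interface ScoreChange {
  model: string;
  before: number;
  after: number;
}

// Round to one decimal place, matching the precision of the listed scores.
function scoreDelta(change: ScoreChange): number {
  return Math.round((change.after - change.before) * 10) / 10;
}

// Hypothetical threshold (assumption): flag moves of 5 points or more.
const REVIEW_THRESHOLD = 5;

function needsReview(change: ScoreChange): boolean {
  return Math.abs(scoreDelta(change)) >= REVIEW_THRESHOLD;
}

// Values copied from the score changes listed in this PR.
const changes: ScoreChange[] = [
  { model: "Claude Opus 4.7", before: 84.2, after: 72.0 },
  { model: "GPT-5.5", before: 81.8, after: 88.0 },
  { model: "DeepSeek V4 Pro", before: 73.4, after: 74.2 },
];
```

With these inputs the deltas come out to -12.2, +6.2, and +0.8, so under the assumed threshold the first two rows would be flagged for a closer look at the upstream model mapping.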

New entries added (aider harness, aider_polyglot only)

| Model | Score |
| --- | --- |
| claude-sonnet-4 (32k thinking) | 61.3 |
| gemini-2.5-pro (32k think) | 83.1 |
| o3-pro (high) | 84.9 |

TODOs appended to note field

New harnesses spotted in June 2026 coverage (Antigravity, OpenCode, Hermes) have been flagged for editorial review. Each would also require an entry in src/lib/harness-directory.ts before being added to results.
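The directory requirement above amounts to a membership check before any results are written. A minimal sketch, assuming harness ids are plain strings; the lowercase ids in the usage note are hypothetical, since this PR does not show how src/lib/harness-directory.ts is structured.

```typescript
// Return the candidate harness ids that have no directory entry yet.
// Both arguments are plain id lists (assumed shape).
function missingFromDirectory(
  candidateIds: string[],
  directoryIds: string[]
): string[] {
  return candidateIds.filter((id) => directoryIds.indexOf(id) === -1);
}
```

For example, `missingFromDirectory(["antigravity", "opencode", "hermes"], ["aider"])` would return all three candidates, signalling that none of them can be added to results yet.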

Test plan

  • npx tsc --noEmit passes at repo root
  • cd worker && npx tsc --noEmit passes
  • data/harnesses.json and worker/src/harnesses.ts data fields are in sync
  • No existing entries deleted from results
  • No schema or field names changed
  • No harness ids changed
  • No em dashes or double hyphens in any changed content
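Two of the checks above (no entries or ids removed, no em dashes or double hyphens) are easy to express as small pure functions. This is a sketch with hypothetical names, not the script that produced this checklist. The dash check is meant for changed data strings only; commands such as `npx tsc --noEmit` would otherwise trip it.

```typescript
// Ids present before the refresh but absent afterwards (should be empty).
function removedIds(before: string[], after: string[]): string[] {
  return before.filter((id) => after.indexOf(id) === -1);
}

// True if the text contains an em dash (U+2014) or a double hyphen.
function hasForbiddenDashes(text: string): boolean {
  return text.indexOf("\u2014") !== -1 || text.indexOf("--") !== -1;
}
```

Running `removedIds` on the list of harness ids and again on the list of result keys would cover both the "no harness ids changed" and "no existing entries deleted" items, assuming those lists can be extracted from the data files.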

Generated by Claude Code

Aider Polyglot YAML (raw.githubusercontent.com/Aider-AI/aider) is the
only source that returned HTTP 200 this cycle; swebench.com and
tbench.ai returned HTTP 403, so those scores are held, and the
SWELancer repo is archived with no new data.

Score changes (aider harness, aider_polyglot benchmark):
  Claude Opus 4.7:  84.2 -> 72.0  (upstream: claude-opus-4, 32k thinking)
  GPT-5.5:          81.8 -> 88.0  (upstream: gpt-5, high effort)
  DeepSeek V4 Pro:  73.4 -> 74.2  (upstream: DeepSeek-V3.2-Exp, Reasoner)

New entries (aider harness, aider_polyglot only):
  claude-sonnet-4 (32k thinking): 61.3
  gemini-2.5-pro (32k think):     83.1
  o3-pro (high):                  84.9

Note: appended TODO for new harnesses (Antigravity, OpenCode, Hermes)
spotted in June 2026 coverage.
@cloudflare-workers-and-pages

Deploying tensorfeed with Cloudflare Pages

Latest commit: 99e4f72
Status: ✅  Deploy successful!
Preview URL: https://19a6eea4.tensorfeed.pages.dev
Branch Preview URL: https://harness-refresh-2026-06-08.tensorfeed.pages.dev
