Refresh harness leaderboard snapshots #40

Open
evanatpizzarobot wants to merge 1 commit into main from harness-refresh-2026-06-15

@evanatpizzarobot

Collaborator

Summary

  • lastUpdated bumped to 2026-06-15 in both data/harnesses.json and worker/src/harnesses.ts; the data fields in the two files remain byte-for-byte identical.
  • One-line TODO appended to the note field: Roo Code shut down on 2026-05-15 and Kilo Code (kilocode.ai) is the official successor. A human needs to decide whether to add Kilo Code and update src/lib/harness-directory.ts.
  • No benchmark scores were changed this run (see Access Issues below).
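
The in-sync claim above could be checked mechanically. The following is a minimal sketch, assuming both files have already been loaded into plain objects; the `HarnessSnapshot` shape and the function names are hypothetical and not taken from the repository. It compares structure with sorted keys rather than raw bytes, so it is looser than a byte-for-byte check.

```typescript
// Hypothetical snapshot shape; the real schema of data/harnesses.json may differ.
interface HarnessSnapshot {
  lastUpdated: string;
  note: string;
  entries: Record<string, unknown>[];
}

// Serialize with sorted object keys so property order does not affect the comparison.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    const parts = keys.map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`);
    return `{${parts.join(",")}}`;
  }
  return JSON.stringify(value);
}

// True when both snapshots carry the same data, regardless of key order.
function snapshotsInSync(a: HarnessSnapshot, b: HarnessSnapshot): boolean {
  return canonicalize(a) === canonicalize(b);
}
```

A check like this could run next to the `tsc --noEmit` steps so that drift between the JSON file and the TypeScript copy is caught before review.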

Access Issues Encountered

The three canonical upstream sources blocked automated fetching:

  • swebench.com (SWE-bench Verified): HTTP 403
  • tbench.ai (Terminal-Bench): HTTP 403
  • aider.chat/docs/leaderboards/ (Aider Polyglot): HTTP 403
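
As an illustration of how a refresh run might record results like these, here is a small sketch that splits probe results into accessible and blocked sources. It assumes the HTTP status codes have already been collected by whatever client the routine uses; `ProbeResult` and `partitionByAccess` are hypothetical names, not part of this repository.

```typescript
// Hypothetical record of one upstream probe: the source label and the HTTP status seen.
interface ProbeResult {
  source: string;
  status: number;
}

interface AccessReport {
  accessible: string[];
  blocked: string[];
}

// Treat any 2xx status as accessible; everything else (including 403) counts as blocked.
function partitionByAccess(results: ProbeResult[]): AccessReport {
  const report: AccessReport = { accessible: [], blocked: [] };
  for (const { source, status } of results) {
    if (status >= 200 && status < 300) {
      report.accessible.push(source);
    } else {
      report.blocked.push(`${source}: HTTP ${status}`);
    }
  }
  return report;
}
```

When `accessible` is empty for a benchmark, the safest behavior is the one taken in this PR: leave the existing scores untouched and report the blocked sources.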

More than 15 third-party aggregator and mirror sites also returned 403. The one source that was accessible was the Aider Polyglot YAML on GitHub (Aider-AI/aider raw file), but its most recent entries date to October 2025 and do not include the model names tracked in our data (Claude Opus 4.7, GPT-5.5, DeepSeek V4 Pro). Scores for those entries cannot be updated until the Aider team publishes results for those models.
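
The coverage check described above (whether an upstream list includes the tracked model names) can be sketched as follows. This assumes the published model names have already been extracted from the YAML into a string array; the normalization rule and function names are assumptions for illustration only.

```typescript
// Normalize names so that case, hyphen, underscore, and spacing differences
// do not cause false misses.
function normalizeModelName(name: string): string {
  return name.toLowerCase().replace(/[\s_-]+/g, " ").trim();
}

// Return the tracked model names that do not appear in the published list.
function findMissingModels(tracked: string[], published: string[]): string[] {
  const seen = new Set(published.map(normalizeModelName));
  return tracked.filter((name) => !seen.has(normalizeModelName(name)));
}
```

Calling `findMissingModels(["Claude Opus 4.7", "GPT-5.5", "DeepSeek V4 Pro"], publishedNames)` and getting all three names back would correspond to the situation reported here, where none of the tracked models has a published result yet.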

SWE-Lancer: The original benchmark repository (openai/SWELancer-Benchmark) has been archived and merged into openai/preparedness. No new public leaderboard with per-agent scores was found.

Scores Changed

None. Per-entry scores are unchanged from the 2026-04-30 snapshot.

TODOs Added to Note

  • roo-code shut down 2026-05-15; Kilo Code (kilocode.ai) is the official successor and should be evaluated for addition to harness-directory.ts after editorial review.
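
For repeated runs, appending the TODO should be idempotent so the note field does not accumulate duplicates. A minimal sketch, assuming the note is a single string and that a `TODO:` prefix with a space separator is acceptable (both are assumptions, and `appendTodo` is a hypothetical helper):

```typescript
// Append a one-line TODO to the note unless the same TODO is already present.
function appendTodo(note: string, todo: string): string {
  const line = `TODO: ${todo.trim()}`;
  if (note.includes(line)) {
    return note; // already recorded on an earlier run
  }
  const base = note.trimEnd();
  return base.length === 0 ? line : `${base} ${line}`;
}
```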

Test Plan

  • npx tsc --noEmit passes at repo root (after npm install)
  • cd worker && npx tsc --noEmit passes (after npm install)
  • Human: confirm Kilo Code editorial decision and re-run this routine once upstream leaderboard sites become accessible again

https://claude.ai/code/session_01SVZWBi9UTPJ1VrrBUnKda9


Generated by Claude Code

Updated lastUpdated to 2026-06-15. Benchmark scores unchanged: official
upstream sites (swebench.com, tbench.ai, aider.chat) returned HTTP 403
during this run so per-entry scores could not be re-verified. The Aider
Polyglot YAML on GitHub is current only through October 2025 and does
not yet include our tracked model names (Claude Opus 4.7, GPT-5.5,
DeepSeek V4 Pro). Added a one-line TODO in note for Kilo Code, the
official successor to roo-code which shut down 2026-05-15.

https://claude.ai/code/session_01SVZWBi9UTPJ1VrrBUnKda9
@cloudflare-workers-and-pages


Deploying tensorfeed with Cloudflare Pages

Latest commit: 5a67f9a
Status: ✅  Deploy successful!
Preview URL: https://c3cd488e.tensorfeed.pages.dev
Branch Preview URL: https://harness-refresh-2026-06-15.tensorfeed.pages.dev

2 participants