Weekly benchmark review (2026-06-15)
Automated check from scripts/weekly-benchmarks-check.mjs. Triage and either:
- Update data/benchmarks.json if a new flagship model dropped this week, then close this issue, OR
- Comment noop and close if nothing actionable surfaced.
Current state of data/benchmarks.json
Model-release-flavored news, last 7 days
Matched 10 articles (keyword scan; not all will be real releases).
| Date | Source | Title |
| --- | --- | --- |
| 2026-06-15 | Hacker News AI | Show HN: Spotlight shows what your Claude Code/Codex are doing |
| 2026-06-15 | Hacker News AI | Claude Corps |
| 2026-06-15 | Google AI Blog | We’re strengthening our presence in Alabama through new investments and community support. |
| 2026-06-15 | ZDNet AI | Google Maps vs. Waze: I've driven 100+ miles with the two best navigation apps - this one's better |
| 2026-06-15 | arXiv cs.AI | WorkBench Revisited: Workplace Agents Two Years On |
| 2026-06-15 | arXiv cs.AI | YeasierAgent: Agentic Social Sandbox as a Canvas for Intent-Driven Creation of Platform-Agnostic Symbiotic Agent-Native Applications |
| 2026-06-13 | The Verge AI | My yard is dying, so I made an app for that |
| 2026-06-13 | WIRED AI | Anthropic Says It’s Taking Claude Fable 5 Offline to Comply With US Government Order |
| 2026-06-10 | NVIDIA AI Blog | NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI |
| 2026-06-09 | NVIDIA AI Blog | NVIDIA Confidential Computing to Help Expand Apple’s Private Cloud Compute |
Sources: Hacker News AI (2), arXiv cs.AI (2), NVIDIA AI Blog (2), Google AI Blog (1), ZDNet AI (1), The Verge AI (1), WIRED AI (1)
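The per-source summary above can be reproduced from the matched rows with a simple tally. The following is a minimal sketch, not the logic of scripts/weekly-benchmarks-check.mjs itself: the `{ date, source }` record shape and the function names `tallyBySource` and `formatSources` are assumptions made for illustration.

```javascript
// Assumed shape: one { date, source } record per matched article.
// The rows mirror the Date and Source columns of the table above.
const articles = [
  { date: "2026-06-15", source: "Hacker News AI" },
  { date: "2026-06-15", source: "Hacker News AI" },
  { date: "2026-06-15", source: "Google AI Blog" },
  { date: "2026-06-15", source: "ZDNet AI" },
  { date: "2026-06-15", source: "arXiv cs.AI" },
  { date: "2026-06-15", source: "arXiv cs.AI" },
  { date: "2026-06-13", source: "The Verge AI" },
  { date: "2026-06-13", source: "WIRED AI" },
  { date: "2026-06-10", source: "NVIDIA AI Blog" },
  { date: "2026-06-09", source: "NVIDIA AI Blog" },
];

// Count articles per source, then order by count (highest first).
// Array.prototype.sort is stable, so ties keep first-appearance order.
function tallyBySource(rows) {
  const counts = new Map();
  for (const { source } of rows) {
    counts.set(source, (counts.get(source) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Render the tally in the same "Name (n), Name (n)" style as the summary line.
function formatSources(tally) {
  return "Sources: " + tally.map(([name, n]) => `${name} (${n})`).join(", ");
}

console.log(formatSources(tallyBySource(articles)));
```

A tally like this only counts keyword matches; it says nothing about whether any row is an actual model release, which is why manual triage is still needed.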
HF Open LLM Leaderboard top 10
Captured: 2026-06-15
| Rank | Model |
| --- | --- |
| 1 | MaziyarPanahi_calme-3.2-instruct-78b_bfloat16 (avg 52.08 · 78B) |
| 2 | MaziyarPanahi_calme-3.1-instruct-78b_bfloat16 (avg 51.29 · 78B) |
| 3 | dfurman_CalmeRys-78B-Orpo-v0.1_bfloat16 (avg 51.23 · 78B) |
| 4 | MaziyarPanahi_calme-2.4-rys-78b_bfloat16 (avg 50.77 · 78B) |
| 5 | huihui-ai_Qwen2.5-72B-Instruct-abliterated_bfloat16 (avg 48.11 · 73B) |
| 6 | Qwen_Qwen2.5-72B-Instruct_bfloat16 (avg 47.98 · 73B) |
| 7 | MaziyarPanahi_calme-2.1-qwen2.5-72b_bfloat16 (avg 47.86 · 73B) |
| 8 | newsbang_Homer-v1.0-Qwen2.5-72B_bfloat16 (avg 47.46 · 73B) |
| 9 | ehristoforu_qwen2.5-test-32b-it_bfloat16 (avg 47.37 · 33B) |
| 10 | Saxo_Linkbricks-Horizon-AI-Avengers-V1-32B_bfloat16 (avg 47.34 · 33B) |
What "needs update" usually means
- A flagship from Anthropic / OpenAI / Google / Meta / Mistral / DeepSeek / xAI launched this week → add a row to data/benchmarks.json.
- A tracked model's benchmark scores have shifted materially (re-run, methodology change) → update the row.
- A new benchmark itself (e.g. a successor to MMLU-Pro) is becoming canonical → add it.
What it usually does NOT mean
- Research papers about benchmarks (those land on /research, not /benchmarks).
- HN opinion threads about a model.
- Pricing-only changes (those go in data/pricing.json).
Bump lastUpdated in data/benchmarks.json whenever you change anything else in the file.
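One way to make the lastUpdated bump hard to forget is to route every edit through a helper that sets it as a side effect. This is a sketch under assumptions: only the `lastUpdated` field is confirmed by this issue, while the `models` array, the entry shape, and the helper name `editBenchmarks` are hypothetical.

```javascript
// Apply an edit to a copy of the parsed benchmarks data and bump lastUpdated
// in the same step. The date is written as YYYY-MM-DD in UTC.
function editBenchmarks(data, edit, today = new Date()) {
  const copy = JSON.parse(JSON.stringify(data)); // plain JSON data, so a JSON round-trip is a safe deep copy
  edit(copy);
  copy.lastUpdated = today.toISOString().slice(0, 10);
  return copy;
}

// Usage with made-up data; the `models` key and entry are illustrative only.
const before = { lastUpdated: "2026-06-11", models: [] };
const after = editBenchmarks(
  before,
  (d) => { d.models.push({ name: "example-model" }); },
  new Date("2026-06-15T00:00:00Z"),
);

console.log(after.lastUpdated);    // 2026-06-15
console.log(before.models.length); // 0 (the input object is left untouched)
```

In a real workflow the result would still need to be serialized back to data/benchmarks.json; that file I/O is left out here to keep the sketch self-contained.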
Current state of data/benchmarks.json
- lastUpdated: 2026-06-11 (4 days ago)
- Models tracked: 20
- Benchmarks tracked: 5
- Models released within last 60 days: 2
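The "4 days ago" and "released within last 60 days" figures in the state summary are plain date arithmetic. The sketch below shows one way to compute them; it is not taken from the check script, and the `released` field, the sample entries, and the function names are assumptions for illustration.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days between two YYYY-MM-DD dates. Date-only ISO strings are parsed
// as UTC midnight, so the difference is an exact multiple of one day.
function daysBetween(fromIso, toIso) {
  return Math.floor((Date.parse(toIso) - Date.parse(fromIso)) / MS_PER_DAY);
}

// Count models whose (hypothetical) `released` date falls within the window.
function countRecentReleases(models, todayIso, windowDays = 60) {
  return models.filter((m) => {
    const age = daysBetween(m.released, todayIso);
    return age >= 0 && age <= windowDays;
  }).length;
}

// Made-up entries, used only to exercise the window boundaries.
const sampleModels = [
  { name: "example-a", released: "2026-05-01" }, // 45 days before 2026-06-15
  { name: "example-b", released: "2026-04-16" }, // exactly 60 days before
  { name: "example-c", released: "2026-04-15" }, // 61 days before, outside the window
];

console.log(daysBetween("2026-06-11", "2026-06-15"));         // 4
console.log(countRecentReleases(sampleModels, "2026-06-15")); // 2
```

Whether the 60-day boundary is inclusive is a design choice; this sketch treats a release exactly 60 days old as still recent.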