Weekly benchmark review: 2026-06-22 #63

# Weekly benchmark review (2026-06-22)

Automated check from `scripts/weekly-benchmarks-check.mjs`. Triage and either:
- Update `data/benchmarks.json` if a new flagship model dropped this week, then close this issue, OR
- Comment `noop` and close if nothing actionable surfaced.

## Current state of `data/benchmarks.json`
- **lastUpdated:** `2026-06-11` (11 days ago)
- **Models tracked:** 20
- **Benchmarks tracked:** 5
- **Models released within last 60 days:** 2
  - 2026-06 | Anthropic | Claude Fable 5
  - 2026-05 | Anthropic | Claude Opus 4.8
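
For reference, the "days ago" figure and the 60-day release window can be reproduced with a few lines of date arithmetic. This is a minimal sketch, not the logic in `scripts/weekly-benchmarks-check.mjs`: the function names, the `released` field, and the convention of treating a `YYYY-MM` release as the first day of that month are all assumptions.

```javascript
// Sketch only: names and the "first day of the month" convention are
// assumptions, not taken from scripts/weekly-benchmarks-check.mjs.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days between two ISO dates (YYYY-MM-DD).
// Date-only ISO strings are parsed as UTC, so DST does not skew the result.
function daysBetween(fromIso, toIso) {
  return Math.round((Date.parse(toIso) - Date.parse(fromIso)) / MS_PER_DAY);
}

// Keep models whose release month (YYYY-MM, hypothetical `released` field)
// starts no more than `windowDays` before `todayIso`.
function recentModels(models, todayIso, windowDays = 60) {
  return models.filter((m) => {
    const age = daysBetween(`${m.released}-01`, todayIso);
    return age >= 0 && age <= windowDays;
  });
}
```

With the values in this report, `daysBetween("2026-06-11", "2026-06-22")` returns `11`, and under the stated convention a `2026-05` release is 52 days old on 2026-06-22, so it falls inside the window.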

## Model-release-flavored news, last 7 days
Matched **7** articles (keyword scan; not all will be real releases).

| Date | Source | Title |
|---|---|---|
| 2026-06-22 | Hacker News AI | Show HN: Open-source job search plugin for Claude Code |
| 2026-06-22 | WIRED AI | OpenAI Launches Full-Scale Effort to Patch Open-Source Bugs as It Takes on Anthropic’s Mythos |
| 2026-06-22 | Hacker News AI | A public Sentry key is all it takes to hijack Claude Code, Cursor, and Codex |
| 2026-06-22 | NVIDIA AI Blog | From Materials Simulation to Experimental Astronomy, New NVIDIA AI Software Unlocks Scientific Discoveries |
| 2026-06-19 | MIT Technology Review | A startup claims it broke through a bottleneck that’s holding back LLMs |
| 2026-06-18 | WIRED AI | The White House Is Making Up Its Rules for AI in Real Time |
| 2026-06-18 | The Verge AI | Adobe’s redesigned AI studio remembers what your creations look like |

**Sources:** Hacker News AI (2), WIRED AI (2), NVIDIA AI Blog (1), MIT Technology Review (1), The Verge AI (1)
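
A keyword scan of this kind, and the per-source tally on the **Sources** line, might look roughly like the sketch below. The keyword list is hypothetical and will not reproduce the exact matches in the table; the `title` and `source` field names are likewise assumptions rather than the script's actual schema.

```javascript
// Hypothetical keyword list; the real scan's terms are not shown in this issue.
const KEYWORDS = ["claude", "openai", "gemini", "llama", "launch", "release"];

// True when the title contains any keyword (case-insensitive substring match).
// Substring matching is deliberately loose, which is why false positives appear.
function matchesReleaseKeywords(title, keywords = KEYWORDS) {
  const lower = title.toLowerCase();
  return keywords.some((k) => lower.includes(k));
}

// Tally articles per source, most frequent first.
// Ties keep first-seen order because Array.prototype.sort is stable.
function countBySource(articles) {
  const counts = new Map();
  for (const a of articles) {
    counts.set(a.source, (counts.get(a.source) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Render the tally in the "Source (n), Source (n)" style used above.
function formatSources(articles) {
  return countBySource(articles)
    .map(([source, n]) => `${source} (${n})`)
    .join(", ");
}
```

Because the match is a plain substring test on the title, a plugin announcement that merely mentions a model name is flagged alongside genuine releases, which is why manual triage is still needed.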

## HF Open LLM Leaderboard top 10
Captured: `2026-06-22`

| Rank | Model |
|---|---|
| 1 | MaziyarPanahi_calme-3.2-instruct-78b_bfloat16 (avg 52.08 · 78B) |
| 2 | MaziyarPanahi_calme-3.1-instruct-78b_bfloat16 (avg 51.29 · 78B) |
| 3 | dfurman_CalmeRys-78B-Orpo-v0.1_bfloat16 (avg 51.23 · 78B) |
| 4 | MaziyarPanahi_calme-2.4-rys-78b_bfloat16 (avg 50.77 · 78B) |
| 5 | huihui-ai_Qwen2.5-72B-Instruct-abliterated_bfloat16 (avg 48.11 · 73B) |
| 6 | Qwen_Qwen2.5-72B-Instruct_bfloat16 (avg 47.98 · 73B) |
| 7 | MaziyarPanahi_calme-2.1-qwen2.5-72b_bfloat16 (avg 47.86 · 73B) |
| 8 | newsbang_Homer-v1.0-Qwen2.5-72B_bfloat16 (avg 47.46 · 73B) |
| 9 | ehristoforu_qwen2.5-test-32b-it_bfloat16 (avg 47.37 · 33B) |
| 10 | Saxo_Linkbricks-Horizon-AI-Avengers-V1-32B_bfloat16 (avg 47.34 · 33B) |

---

### What "needs update" usually means
1. A flagship from Anthropic / OpenAI / Google / Meta / Mistral / DeepSeek / xAI launched this week → add a row to `data/benchmarks.json`.
2. A tracked model's benchmark scores have shifted materially (after a re-run or a methodology change) → update the row.
3. A new benchmark itself (e.g. a successor to MMLU-Pro) is becoming canonical → add it.

### What it usually does NOT mean
- Research papers about benchmarks (those land on /research, not /benchmarks).
- HN opinion threads about a model.
- Pricing-only changes (those go in `data/pricing.json`).

Bump `lastUpdated` in `data/benchmarks.json` whenever you change anything else in the file.
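
One way to make the bump hard to forget is a small helper applied before writing the file back. The sketch below is hypothetical: it assumes the file parses to an object with a top-level `lastUpdated` string, and the helper name is invented for illustration.

```javascript
// Hypothetical helper, assuming a top-level `lastUpdated` field.
// Returns `after` unchanged when nothing besides `lastUpdated` differs;
// otherwise returns a copy with `lastUpdated` set to `todayIso`.
function withBumpedLastUpdated(before, after, todayIso) {
  // Compare everything except `lastUpdated`. JSON.stringify is sensitive to
  // key order, so a pure reordering of keys also counts as a change here.
  const strip = ({ lastUpdated, ...rest }) => JSON.stringify(rest);
  if (strip(before) === strip(after)) return after;
  return { ...after, lastUpdated: todayIso };
}
```

A caller would parse the committed file as `before`, apply edits to a copy as `after`, and write out the helper's result, so the date changes only when the content does.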

