Open Weight Models

A curated list of open-weight AI models with commercially exploitable licenses, verified benchmarks, and no geographic restrictions. Built to decide which models to support in herbert-rs, a local LLM inference engine in Rust and hand-written assembly.

Selection criteria:

Commercially exploitable license, no geographic restriction (EU ok)
VRAM Q4 ≤ 128 GB (main) — what fits Q4-quantized on a single high-end workstation. An extended tier (128 < Q4 ≤ 256 GB) flags models that require a Mac Studio M3 Ultra or multi-GPU server.
Released after April 2024

This excludes Llama 4 multimodal (EU exclusion), Qwen 3.6 Plus (closed-source), DeepSeek V3/R1 full (671B → ~370 GB Q4, beyond 256 GB), and others. Note: Llama text-only models (3.3 70B, 3.2 1B/3B) are EU-exploitable. See Rejected models for details.

Maintained by Philippe Anel. Last updated: May 2026.

v3 additions (May 2026) — Selection criterion switched from "< 200B params" to "VRAM Q4 ≤ 128 GB" + extended 256 GB tier (better proxy for what's actually runnable on consumer/prosumer hardware). Generalists: Ling-2.6-flash 104B/A7.4B (Ant), Mistral Medium 3.5 128B (🔴 Modified MIT), MiniMax M2.7 230B/A10B. Extended tier: DeepSeek-V4-Flash 284B/A13B (1M ctx native, FP4+FP8). Alternative architectures: ZAYA1-8B (Zyphra), Kimi-Linear 48B/A3B (Moonshot, KDA hybrid), Bonsai-8B (1-bit end-to-end, 1.15 GB). Vision/Multimodal: Nemotron 3 Nano Omni 30B-A3B, DeepSeek-OCR + DeepSeek-OCR-2. Compact/Edge: LFM2.5-VL-450M (vision edge). Theorem provers: SGS algorithm (Stanford, 7B beats 671B pass@4 on D3k). New license row: 🔴 Modified MIT with explicit warning.

v2 additions (April 2026) — Generalists: GLM-4.7-Flash, Hermes 4-70B. Code: NousCoder-14B, OmniCoder-9B (new LCB/Terminal-Bench subsection). Compact/Edge: Pleias-RAG-1B, Pleias-3B. Reasoning/Math: Qwen2.5-Math-72B (historical). Alternative architectures: URM. Decentralized training: Hermes 4.3-36B-Psyche. Theorem provers: Nomos 1 (natural-language track).

LLMs

Generalists

Model	Publisher	Active	Total	Arch	Ctx	License	Key scores
Gemma 4 31B	Google	31B	31B	Dense	256K	Apache 2.0	GPQA 84.3, MMLU-Pro 85.2
Qwen3.5-27B	Alibaba	27B	27B	Dense	128K	Apache 2.0	201 languages
Qwen3.5-9B	Alibaba	9B	9B	Dense	128K	Apache 2.0	GPQA 81.7 (9B!)
Qwen3.5-122B-A10B	Alibaba	10B	122B	MoE	256K	Apache 2.0	201 languages, multimodal
GPT-OSS-120B	OpenAI	5.1B	117B	MoE	128K	Apache 2.0	GPQA 80.9, Codeforces 2622, AIME 96.6%
GPT-OSS-20B	OpenAI	3.6B	21B	MoE	128K	Apache 2.0	AIME 96%, fits 16GB
Mistral Small 4	Mistral	6B	119B	MoE	256K	Apache 2.0	GPQA 71.2, unified instruct/reasoning/coding
GLM-4.5-Air	Zhipu AI	12B	106B	MoE	128K	MIT	MATH-500 98.1%, MMLU-Pro 81.4
GLM-4.7-Flash	Zhipu AI	3B	30B	MoE (MLA)	200K	MIT	SWE-bench 59.2, AIME25 91.6, GPQA 75.2
Ling-2.6-flash	Ant Group	7.4B	104B	MoE (hybrid linear attn 1:7 MLA+Lightning)	262K	MIT	Token-efficient agent (~15M tokens on full AA suite vs 40-100M for long-reasoners)
QwQ-32B	Alibaba	32B	32B	Dense	128K	Apache 2.0	AIME ~80%, reasoning RL
DeepSeek R1-Distill-32B	DeepSeek	32B	32B	Dense	128K	MIT	Beats o1-mini
Step-3.5-Flash	StepFun	11B	196B	MoE	262K	Apache 2.0	SWE-bench 74.4%, 350 tok/s
Llama 3.3 70B	Meta	70B	70B	Dense	128K	Llama Community (EU OK)	MMLU 86.0, HumanEval 88.4, MATH 77.0
Hermes 4-70B	Nous Research	70B	70B	Dense	128K	Llama Community (EU OK)	SOTA RefusalBench, hybrid reasoning, tool calling
InternVL3-78B	Shanghai AI Lab	78B	78B	Dense	--	Apache 2.0	MMMU 72.2, SOTA open-source VLM
Mistral Medium 3.5 128B	Mistral AI	128B	128B	Dense + Pixtral vision	256K	🔴 Modified MIT (revenue cap)	First Mistral merged flagship: Medium 3.1 + Magistral + Devstral 2 unified, configurable `reasoning_effort`
MiniMax M2.7	MiniMax AI	10B	230B	MoE (256 experts, 8 active, 4.3% ratio)	~200K	MIT (verify on HF)	Agentic workflows alt to Claude Opus 4.6 / GPT-5.3-Codex, IQ1_M @ 60.7 GB

🔴 License warning — Modified MIT (revenue/MAU caps). Mistral Medium 3.5 falls under a Mistral Open License variant with a revenue threshold; MiniMax M2.7 and Kimi K2.5 historically shipped with similar caps (100M MAU for Kimi). They are listed for completeness but you must read the actual license before any commercial deployment — these are not interchangeable with Apache 2.0/MIT. The revenue/MAU clauses can flip a free model into a paid one once your product takes off.

Mistral Medium 3.5 (Apr/May 2026) is the first merged flagship from Mistral: a single set of weights unifying what used to be three distinct models — Medium 3.1 (instruct), Magistral (reasoning), Devstral 2 (coding agent). Behavior switches via reasoning_effort per request (none / high). Replaces Medium 3.1 + Magistral in Le Chat and Devstral 2 in Vibe CLI. 88-layer dense (no MoE), Pixtral vision tower trained from scratch.

MiniMax M2.7 (Apr 2026 open-weight release) pushes the active/total ratio to 4.3% (10B/230B), targeting agentic long-running workflows (coding, multi-step troubleshooting, document editing). Positioned as open-weight alternative to Claude Opus 4.6 / GPT-5.3-Codex with IQ1_M weights at 60.7 GB making 230B practically deployable on a single workstation.

Extended tier (128 < Q4 ≤ 256 GB)

Models that exceed the 128 GB Q4 main cap but fit a 256 GB workstation (Mac Studio M3 Ultra max, multi-GPU server).

Model	Publisher	Active	Total	Arch	Ctx	License	Key scores
DeepSeek-V4-Flash	DeepSeek	13B	284B	MoE (hybrid CSA+HCA, mHC, Muon optimizer)	1M native	MIT	First < 200B-active LLM with native 1M ctx, FP4+FP8 mixed, 32T pre-train, 27% FLOPs / 10% KV cache vs V3.2

DeepSeek-V4-Flash (May 2026) is the small sibling of V4-Pro (1.6T/49B). Q4 ≈ 156 GB. Three inference modes integrated in the chat template: non-think, think-high, think-max (recommended at ≥ 384K context). The architecture introduces three new ideas — hybrid attention (CSA + HCA), multi-head computation (mHC), and the Muon optimizer — pushing the efficiency frontier rather than the parameter frontier.

Code

Model	SWE-bench	Codeforces	Active	License
Claude Opus 4.6 (closed)	80.8%	--	--	--
Gemini 3.1 Pro (closed)	80.6%	--	--	--
GPT-5.4 (closed)	~80%	--	--	--
Step-3.5-Flash	74.4%	--	11B	Apache 2.0
Devstral 2	72.2%	--	~12B	MIT modified
Qwen3-Coder-Next 80B-A3B	70.6%	--	3B	Apache 2.0
Qwen2.5-Coder-32B	69.6%	--	32B	Apache 2.0
Devstral Small 2	68.0%	--	24B	Apache 2.0
GLM-4.7-Flash	59.2%	--	3B	MIT
GPT-OSS-120B	62.4%	2622	5.1B	Apache 2.0
Gemma 4 31B	--	2150	31B	Apache 2.0

SWE-bench = real bugs in real GitHub repos (Django, Flask, scikit-learn). 500 human-validated issues. Codeforces = algorithmic competition, ELO-scored like chess. Different skills: fixing a codebase vs solving a puzzle.

Code — LiveCodeBench & Terminal-Bench

Specialized coders measured on benchmarks other than SWE-bench.

Model	LiveCodeBench v6	Terminal-Bench 2.0	Active	License
OmniCoder-9B	--	23.6%	9.4B	Apache 2.0
NousCoder-14B	67.87%	--	14.8B	Apache 2.0
Qwen3.5-9B (baseline)	60.79%	14.6%	9B	Apache 2.0

LiveCodeBench (rotating ≈700 problems from LeetCode/AtCoder/Codeforces, collected after model cutoffs) measures fresh competitive programming, vs SWE-bench (fixing real-world bugs) and Codeforces ELO (pure algorithms). Terminal-Bench 2.0 measures agentic coding skills (read-before-write, LSP responsiveness, minimal diffs).

OmniCoder-9B is a LoRA agentic fine-tune of Qwen3.5-9B on 425K Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro trajectories — +61% relative on Terminal-Bench vs base. NousCoder-14B is a pure-RL fine-tune of Qwen3-14B (+7.08 pts on LCB v6, no SFT). Same 9-14B class, opposite methods.

Reasoning

GPQA Diamond (198 questions)

Graduate-level questions in physics, chemistry, biology. Designed to be unsolvable by Google search. Experts reach 65%, non-experts 34%. The most discriminating reasoning benchmark available.

Model	GPQA	Active
Gemini 3.1 Pro (closed)	94.3	--
GPT-5.4 (closed)	92.8	--
Claude Opus 4.6 (closed)	91.3	--
Gemma 4 31B	84.3	31B
Gemma 4 26B-A4B	82.3	3.8B
Qwen3.5-9B	81.7	9B
GPT-OSS-120B	80.9	5.1B
GLM-4.5-Air	75.0	12B
Nemotron 3 Nano	73.0	3.5B
Mistral Small 4	71.2	6B
Llama 3.3 70B	50.5	70B

Math (AIME, 15 problems/year)

Competition-level math requiring creativity and multi-step reasoning. Each year's edition is different and harder. Only compare within the same version.

Model	AIME	Conditions	Active
GPT-5.4 (closed)	~100%	2025	--
Claude Opus 4.6 (closed)	~98%	2025	--
Nemotron 3 Nano	99.2%	2025, with tools	3.5B
GPT-OSS-120B	96.6%	2024, with tools	5.1B
GPT-OSS-20B	96.0%	2024, with tools	3.6B
Gemma 4 31B	89.2%	2026	31B
Gemma 4 26B-A4B	88.3%	2026	3.8B
Ministral 14B	85.0%	2025	14B
Nemotron Nano 9B v2	97.8%	MATH-500, /think mode	9B
Qwen2.5-Math-72B	40.0%	2024, TIR (Python)	72B

AIME versions (2024/2025/2026) are not comparable. Each year is harder.

Qwen2.5-Math (Sep 2024) is the first open model to make real AIME progress (12/30 on AIME 2024 — 6× GPT-4 Turbo at the time). Since surpassed by generalists (Nemotron, GPT-OSS 96%+). Historical reference, still useful for its TIR mode (Tool-Integrated Reasoning with Python interpreter) which eliminates arithmetic errors. Qwen License, not Apache — commercial OK but check terms.

Compact / Edge

Models that run on smartphones, laptops, or edge devices.

Model	Active	VRAM Q4	Strength	License
SmolLM3-3B	3B	~2 GB	Best 3B, AIME 36.7%, /think mode, 64K ctx	Apache 2.0
SmolLM2-1.7B	1.7B	~1 GB	11T tokens, data-centric	Apache 2.0
SmolLM2-360M	360M	< 1 GB	4T tokens	Apache 2.0
SmolLM2-135M	135M	< 1 GB	Ultra-compact, few MB quantized	Apache 2.0
Gemma 4 E2B	2.3B	~4 GB	Multimodal + audio	Apache 2.0
Gemma 4 E4B	4.5B	~6 GB	Multimodal + audio	Apache 2.0
Phi-4-mini	3.8B	~2 GB	MATH-500 92.5%	MIT
Phi-4-multimodal	5.6B	~3 GB	Text + image + audio	MIT
Ministral 3B	3B	~2 GB	Vision + reasoning, 256K ctx	Apache 2.0
Ministral 8B	8B	~5 GB	AIME 78.7%, vision	Apache 2.0
Ministral 14B	14B	~8 GB	AIME 85%, vision, 256K ctx	Apache 2.0
LFM2.5-1.2B	1.2B	~1 GB	IFBench 47.3 (2x Qwen3-1.7B), thinking, vision, audio	LFM Open v1.0
Llama 3.2 1B/3B	1-3B	< 2 GB	128K ctx, edge/mobile, EU OK (text-only)	Llama Community
InternLM3-8B	8B	~5 GB	Thinking mode, 4T tokens (75% less training)	Apache 2.0
InternVL3-1B→38B	1-38B	1-20 GB	Vision SOTA, full range edge→server	Apache 2.0
Chocolatine-2-4B-DPO	4B	~2.5 GB	French-optimized DPO fine-tune of Qwen3-4B, 262K ctx, no `<think>`	Apache 2.0
Pleias-RAG-1B	1.2B	~1 GB	100% public-domain training data, native citation with literal quotes, EU multilingual	Apache 2.0
Pleias-RAG-350M	350M	< 1 GB	Same as Pleias-RAG-1B, ultra-compact	Apache 2.0
Baguettotron	0.3B	< 1 GB	Latest Pleias base (Dec 2025), French-focused SLM	Apache 2.0
LFM2.5-VL-450M	450M	< 1 GB	Vision edge: SigLIP2 + 512×512 native, object detection, WebGPU	LFM Open v1.0
Bonsai-8B	8B	1.15 GB	1-bit Qwen3-8B fine-tune, CUDA/Metal/CPU/Android/iPhone	Apache 2.0

SmolLM3-3B beats all other 3B models and competes with 4B models (Qwen3-4B, Gemma3-4B). Data quality matters more than model size: SmolLM2-1.7B trained on 11T tokens beats larger models trained on less data.

Chocolatine-2-4B (Jonathan Pacifico) is a DPO fine-tune of Qwen3-4B-Instruct-2507 on French preference datasets (Compar:IA from the French Ministry of Culture + French-ORCA), merged with TIES. Gains on every French benchmark tested (GPQA-FR, French MMLU, French Bench, FR-MT-Bench) without degrading English performance. One of the rare French-focused open-weight models built by an individual contributor rather than a lab.

Pleias (Paris-based lab, partners with NVIDIA and Mozilla Builders) trains exclusively on public-domain or CC-licensed data (Common Corpus, 2T tokens). Raw benchmark scores trail Qwen/Gemma of equivalent size, but the trade-off is unique: zero copyright ambiguity (EU AI Act / GDPR friendly), strong EU multilingual (FR, DE, IT, ES, NL, PL), and the Pleias-RAG variants emit literal-quote citations natively. Positioned for regulated sectors (public, legal, press, education) where data provenance matters more than peak scores.

Long context

Model	Max ctx	RULER 1M	Architecture	Active	License
Nemotron 3 Nano	1M	86.3%	Mamba/MoE	3.5B	Nemotron OML
Nemotron 3 Super	1M	--	Mamba/MoE	12B	Nemotron OML
DeepSeek-V4-Flash (extended tier)	1M native	--	MoE (CSA+HCA hybrid)	13B	MIT
Jamba 1.6 Mini	256K	--	SSM+Transformer/MoE	12B	Jamba OML

RULER (GitHub) tests retrieval in long contexts with multiple needles, multi-hop tracing, and aggregation. Parametric by length (4K to 1M). Many models claim "1M context" without publishing RULER scores at that length. Without measurement, it's marketing.

Alternative architectures

Non-Transformer or hybrid models.

Model	Architecture	Active	Key metric	License
Granite 4.0	90% Mamba-2 / 10% Attention	3-9B	70% memory reduction, 2x speed	Apache 2.0
LFM2/2.5	Convolutions + grouped attention	2.3B	112 tok/s CPU, 2x Qwen3. LFM2.5: vision, audio, thinking	LFM Open v1.0
Jamba 1.6 Mini	Mamba + Transformer + MoE	12B	2.5x Transformer speed	Jamba OML
URM	Recursive Universal Transformer (ConvSwiGLU + TBPTL)	4× params (tiny)	ARC-AGI 1: 53.8%, Sudoku 77.6%	Open-source (research)
ZAYA1-8B	Hybrid Mamba + Compressed Cross Attention (CCA) + MoD + EDA	760M / 8.4B (9% active)	On-device deployable, test-time-compute friendly, 128K ctx	Apache 2.0
Kimi-Linear-48B-A3B	MoE hybrid: 3 KDA (linear) layers per 1 MLA (global)	3B / 48B	1M context, 5.7T tokens, demonstrates linear attention can match full attention	MIT
Bonsai-8B	Qwen3-8B fine-tuned at 1-bit end-to-end (GGUF Q1_0), all projections + LM head 1-bit	8.19B	1.15 GB on disk (14.2× FP16), runs on CPU/Android/iPhone	Apache 2.0

URM (Ubiquant, Dec 2025) loops its 4 layers 12× instead of stacking 48 distinct layers. With 4× parameters it reaches 53.8% on ARC-AGI 1 where a vanilla Transformer with 32× parameters stays under 40%. Key claim of the paper: the FFN, not attention, is the source of reasoning — counterintuitive given the community's focus on attention variants. Research model, not a production LLM, but architecturally interesting for future LLM designs. See arXiv:2512.14693.

ZAYA1-8B (Zyphra, May 2026) is a Zamba-2 successor: 80 layers mixing SSM-Mamba and attention, with Compressed Cross Attention (CCA) plus Mixture-of-Depths (MoD) and Expert Decision Attention (EDA) on top of 16 top-1 experts. The angle is intelligence per active parameter: 760M actifs gives sub-1B inference cost while keeping 8B-class capacity. Positioned for on-device + thinking-mode workflows where compute scales with active params, not totals. Tech report on zyphra.com/zaya1-8b-technical-report.

Kimi-Linear (Moonshot, Oct 2025, arXiv:2510.26692) is Moonshot's open research vehicle for linear attention outside the closed K2 family. The architecture is a 3:1 ratio of KDA (Kimi Delta Attention, linear) to MLA (full attention, global) layers. The point isn't frontier performance — it's the demonstration that linear attention can match full attention across short, long, and RL-style regimes while reducing memory cost. Useful baseline for engine work like herbert-rs.

Bonsai-8B (Prism ML, Mar 2026) is a 1-bit end-to-end fine-tune of Qwen3-8B: every projection + the LM head quantized to 1 bit (GGUF Q1_0), shrinking the deployed model to 1.15 GB. Direct competitor to BitNet, but trained as a fine-tune rather than natively 1.58-bit from scratch. Runs on CUDA, Metal, Android, CPU, and iPhone (via Locally AI). The radical end of the quantization spectrum — accept the quality drop in exchange for ubiquity.

Decentralized training

Models pre-trained outside traditional data centers, using distributed peer-to-peer or blockchain-coordinated networks. The story is the training method, not the model quality.

Model	Method	Size	Tokens	Architecture	License
Covenant-72B	Permissionless P2P, SparseLoCo optimizer, Bittensor blockchain (Subnet 3)	72B dense	1.1T (+14.8B SFT)	LLaMA-3 style, GQA, 80 layers, d=8192, 64 heads, 8 KV heads, RoPE 500K, ctx 2048→8192	Apache 2.0 (checkpoints)
Hermes 4.3-36B-Psyche	Internet-decentralized fine-tuning via Psyche	36B dense	— (post-training on Seed-36B)	ByteDance Seed-36B base, Llama-3 chat template, hybrid `<think>` mode	Apache 2.0

Pre-training benchmarks (0-shot) vs other dense baselines :

Benchmark	Covenant-72B	LLaMA-2-70B (centralized)	LLM360 K2 (65B, centralized)	INTELLECT-1 (10B, P2P)
ARC-Challenge	56.8	57.4	53.8	44.8
ARC-Easy	80.9	79.6	76.0	71.8
PIQA	81.6	82.6	82.5	77.4
OpenBookQA	44.0	49.4	48.0	43.8
HellaSwag	80.6	84.3	82.9	70.3
WinoGrande	75.9	80.4	76.4	63.3
MMLU	67.1	65.6	65.5	32.7

Covenant-72B-Chat (post-SFT) vs other chat models :

Benchmark	Covenant-72B-Chat	LLaMA-2-70B-Chat	K2-Chat (65B)
ARC-Challenge	64.2	65.4	62.0
MMLU	67.4	63.1	67.9
IFEval	64.7	40.7	45.5
MATH	26.3	10.7	19.1
MMLU-Pro	40.9	35.2	45.4
GSM8K	63.9	52.2	79.0

Hermes 4.3-36B-Psyche (Nous Research, Nov 2025) is a different point in the same space: not pre-training from scratch but post-training decentralized over internet. Built on ByteDance's Seed-36B, fine-tuned via Nous's Psyche network, released under Apache 2.0. The Psyche variant matches or beats the centralized 4.3-36B twin on every benchmark (AIME25 69.3 vs 66.8, MMLU-Pro 80.7 vs 79.7) — decentralized post-training did not degrade quality. Complements Covenant: two different decentralization angles (pre-training at 72B / post-training at 36B).

Why Covenant matters: Covenant-72B is the first proof-of-concept that 72B-scale pre-training is possible without data centers, with peers joining and leaving freely. Coordination via the Bittensor blockchain (Subnet 3), communication via SparseLoCo (146× compression vs dense gradients), peers running 8×B200 GPUs over commodity internet (500 Mb/s down, 110 Mb/s up). The model achieves 94.5% compute utilization despite the network constraints, with an average of 16.9 contributing peers per round and 70+ unique peers over the run. On benchmarks, it beats LLaMA-2-70B on ARC-Challenge, ARC-Easy and MMLU (despite 1.8× fewer training tokens), and the chat variant has the best IFEval and MATH scores in its comparison group. It's the first credible alternative to the data-center duopoly for pre-training at 70B scale. Authors: Covenant AI + Mila. See arXiv 2603.08163.

Specialized

Theorem provers (Lean 4)

miniF2F (GitHub): 488 formal Olympiad-level math problems. Proofs are compiler-verified: either correct or rejected. Zero hallucination possible on mathematical correctness.

Model	miniF2F	PutnamBench	Active	License
BFS-Prover-V2-32B	95.0%	--	32B	Apache 2.0
Goedel-Prover-V2-32B	90.4%	#1	32B	Apache 2.0
DeepSeek-Prover-V2-7B	88.9%	--	7B	MIT
DeepSeek-Prover-V2-7B + SGS	--	--	7B	MIT (model) / CC-BY-4.0 (paper)
Leanstral	--	--	32B	Apache 2.0
Kimina-Prover-72B	84.0%	--	72B	MIT
Leanabell-Prover-V2-7B	78.2%	--	7B	Apache 2.0

Lean 4 proofs are verified by the compiler. Either correct or rejected. Zero hallucination on mathematical correctness.

The sweet spot is 32B: BFS-Prover (95%) and Goedel-V2 (90.4%) both beat the 72B Kimina (84%).

SGS — Self-Guided Self-Play (Stanford, arXiv:2604.20209, Apr 2026) is not a model but an RL self-play algorithm applied to DeepSeek-Prover-V2-7B. After 200 rounds and 6.3M generations, the 7B fine-tune surpasses the pass@4 of DeepSeek-Prover-V2-671B on D3k (3 323 Lean 4 problems from Goedel-Pset-V1). Caveat: D3k is the SGS run's own training-target set, not a held-out public benchmark like miniF2F or PutnamBench — the 7B-beats-671B headline is real for in-distribution problems, scope-restricted otherwise. Demonstrates that well-tuned RL can collapse a 100× parameter gap on a target dataset. Authors: Bailey, Wen, Dong, Hashimoto, Ma.

Natural-language provers (not Lean 4)

A parallel track: models that write proofs in natural English, not formal Lean 4. Not compiler-verified, but closer to how mathematicians actually work.

Model	Benchmark	Active	Total	License
Nomos 1	Putnam 2025: 87/120 (72.5%) with Nomos Harness	~3B	~30B	Apache 2.0

Nomos 1 (Nous Research × Hillclimb AI, Dec 2025) is a Qwen3-30B-A3B-Thinking fine-tune specialized for natural-language proof writing, not Lean 4. On Putnam 2025 with the open-sourced Nomos Reasoning Harness, it jumps from 24/120 (base) to 87/120 — a +63 point gain where the inference harness matters as much as the model. Complementary to the Lean provers above, which offer compiler-verification guarantees that natural-language proofs cannot.

GUI agents

ScreenSpot (GitHub): 1,200+ instructions across desktop, mobile, web. Tests if the model can locate the right UI element from a natural language instruction.

Model	ScreenSpot	OSWorld	Active	License
UI-TARS-1.5-7B	94.2%	42.5	7B	Apache 2.0
Qwen2.5-VL-7B	84.7	--	7B	Apache 2.0
ShowUI-2B	--	--	2B	MIT

UI-TARS-7B beats Claude (87.6%) on ScreenSpot. 7B, Apache 2.0, runs on a laptop.

Search agents

Model	Specialty	Active	License
WebThinker-32B	RL web search, beats Gemini Deep Research	32B	Apache 2.0
DeepResearcher-7B	Emergent multi-step planning via RL	7B	Apache 2.0
Search-R1	Framework: teach any LLM to search (+26% on 7B)	any	Apache 2.0

Tool calling

BFCL (GitHub): Berkeley Function Calling Leaderboard. Tests function/tool calling accuracy: correct names, parameters, types. V4 adds web search and memory.

Model	BFCL	Active	License
Hammer2.1-7B	#1	7B	CC-BY-NC 4.0
xLAM-8B	#1 (alternate)	8B	CC-BY-NC 4.0
Hammer-0.5B	On-device	0.5B	CC-BY-NC 4.0

Specialized tool-calling models clearly beat generalists. xLAM-8B beats GPT-4o on BFCL.

Rust

Model	Strandset-Rust	RustEvo2	Active	License
Strand-Rust-Coder-14B	0.50	0.43	14B	Apache 2.0 (base)

Beats GPT-5-Codex and Claude Sonnet 4.5 on Rust benchmarks. Fine-tuned on 191K examples from 2,383 crates.

Vision / Multimodal

Model	MMMU	Active	Key feature	License
InternVL3-78B	72.2	78B	SOTA open-source VLM, custom InternViT	Apache 2.0
InternVL3-1B→38B	--	1-38B	Full range edge→server	Apache 2.0
Gemma 4 31B	Pro 76.9	31B	Text + image + video	Apache 2.0
Gemma 4 E2B/E4B	--	2.3-4.5B	Multimodal + audio, edge	Apache 2.0
Qwen2.5-VL-7B	--	7B	Computer/phone use, DocVQA 95.7	Apache 2.0
Nemotron 3 Nano Omni 30B-A3B	--	3B	any-to-any (text+audio+image+video → text), 256K ctx, Mamba-Transformer hybrid MoE	Nemotron OML
DeepSeek-OCR	--	3.3B	OCR specialist — Contexts Optical Compression: encode long text as compressed image, feed books/papers as pixels not tokens	MIT
DeepSeek-OCR-2	--	3.4B	OCR v2 with Visual Causal Flow (sequential reading order)	Apache 2.0

InternVL3-78B (72.2 MMMU) is on par with GPT-4o on multimodal. The InternViT encoder (300M–6B) is trained jointly with the LLM — not bolted on after the fact.

Nemotron 3 Nano Omni (NVIDIA, Apr 2026) extends the Nano family with native audio/video/image inputs. Targets enterprise document intelligence (contracts, SOW/MSA, finance), customer service (drive-thru order verification, delivery video OCR), GUI/browser/email agents, and dense video captioning. Stack of three components around the Nano LLM (vision encoder + audio encoder + LLM), not a monolithic any-to-any architecture. English-only. See arXiv:2604.24954.

Observations

Patterns observed across 60+ models. Not definitive truths.

Architecture

Dense retreats above 35B, but doesn't die. For generalists above 35B, MoE clearly dominates (GPT-OSS-120B, Mistral Small 4, Qwen3.5-122B-A10B, GLM-4.5-Air, Step-3.5-Flash, Nemotron 3 Super, all MoE). But dense survives where it has a structural advantage: Llama 3.3 70B (generalist), InternVL3-78B (vision), Kimina-Prover-72B (theorem proving), Qwen 2.5-72B (production NLP), Covenant-72B (decentralized training), DeepSeek R1-Distill-70B (distilled reasoning). Dense is becoming a specialization choice.
Parameter count is no longer the determining factor. Qwen3.5-9B (9B) beats GPT-OSS-120B (5.1B active, 117B total) on GPQA Diamond.
The 40-79B segment is the dense survivors' refuge. New models often jump from ~35B straight to ~120B total via MoE. But the 40-79B range is well populated by quality dense models (Llama 3.3 70B, InternVL3-78B, Kimina-Prover-72B, Qwen 2.5-72B, Covenant-72B, R1-Distill-70B, Jamba 1.6 Mini 52B). This is where dense resists, and where you find both solid generalists and specialists.
InternVL3 is the best open-source VLM nobody was talking about. InternVL3-78B (Shanghai AI Lab) reaches 72.2 MMMU under Apache 2.0 — on par with GPT-4o. InternLM3-8B achieves SOTA with 75% fewer training tokens (4T vs 15-18T). Less press than Alibaba, comparable results.
Qwen is the de facto base model for fine-tuning. BFS-Prover, Goedel-Prover, Kimina-Prover, most community distillations: all built on Qwen. The ResNet of LLMs.
Decentralized pre-training is no longer a toy. Covenant-72B (Mar 2026) pre-trained a 72B dense LLaMA-3-style model over a permissionless blockchain network (Bittensor Subnet 3) on 1.1T tokens. It beats LLaMA-2-70B on ARC-Challenge, ARC-Easy and MMLU despite 1.8× fewer training tokens, with 94.5% compute utilization over commodity internet (500/110 Mb/s) and dynamic peer participation. The data-center duopoly for pre-training at 70B scale now has a credible alternative. SparseLoCo + 2-bit quantization gives 146× compression on gradient communication.

Benchmarks

GPQA Diamond is the most discriminating benchmark for reasoning: 198 doctoral-level questions, impossible to solve by retrieval.
SWE-bench vs Codeforces measure different things. GPT-OSS-120B dominates competition (ELO 2622) but gets beaten on real bugs by Step-3.5-Flash (74.4% vs 62.4%).
Many models claim "1M context" without RULER scores at that length. Without measurement, it's marketing.
AIME versions (2024/2025/2026) are not comparable. Each year is harder. Only compare within the same version.

Specialization

Specialized models dominate on narrow tasks. UI-TARS-7B beats Claude on GUI (94.2% vs 87.6%). BFS-Prover-32B beats DeepSeek-671B on theorem proving (95% vs 88.9%).
The sweet spot for theorem proving is 32B. Method (tree search, self-correction) compensates for size.
Domain-specific models (medical, legal, finance) are less mature than code/math specialists. Generalists often outperform them on domain benchmarks. Specialization helps mainly for specific vocabulary, regulatory compliance, and private data fine-tuning.

Licenses

Gemma 4 under Apache 2.0 is a turning point. Google moved from a restrictive custom license to standard open-source for the first time.
Llama 4 excludes the EU for multimodal models. But text-only Llama (3.3 70B, 3.2 1B/3B) is EU-exploitable — the exclusion only applies to multimodal.
"Open-weight" is more nuanced than "open-source". Llama is technically open-weight but with geographic restrictions on multimodal. Always check the fine print.

Benchmarks reference

What each benchmark measures, how many questions it has, and where to find more.

Reasoning & Knowledge

GPQA Diamond (198 questions) — Graduate-level questions in physics, chemistry, biology. Designed to be unsolvable by Google search. Experts reach 65%, non-experts 34%. The most discriminating reasoning benchmark.
MMLU-Pro (12K+ questions) — Hardened version of MMLU: 10 choices instead of 4, requires chain-of-thought reasoning. 14 domains. Drops accuracy 16-33% vs MMLU. Published at NeurIPS 2024.

Math

AIME (15 problems/year) — American Invitational Mathematics Examination. Competition-level math requiring creativity and multi-step reasoning. Each year's edition is harder. Only compare within the same version (2024/2025/2026).
MATH-500 (500 problems) — Diverse math problems (algebra, geometry, combinatorics, number theory). Good general math evaluation but easier to saturate than AIME.

Code

SWE-bench Verified (500 issues) — Real bugs from GitHub repos (Django, Flask, scikit-learn). The model must understand the codebase, find the bug, and produce a working patch. Human-validated by OpenAI. Paper
Codeforces (ELO system) — Algorithmic competition performance, scored like chess ELO. Measures pure algorithmic skill, not real-world coding. Different skill from SWE-bench.
LiveCodeBench (rotating, 700+) — Fresh competitive programming problems collected after model training cutoffs. Eliminates data contamination. Problems from LeetCode, AtCoder, Codeforces. GitHub

Long context

RULER (parametric) — Sophisticated "needle in a haystack" with multiple needles, multi-hop tracing, and aggregation. Tests at different lengths (4K to 1M). By NVIDIA. Many models claiming 1M context fail above 32K. GitHub

Agents & Tools

BFCL (2K+) — Berkeley Function Calling Leaderboard. Tests function/tool calling accuracy: correct names, parameters, types. V4 adds web search and memory. By UC Berkeley. GitHub

Theorem proving

miniF2F (488 problems) — Formal Olympiad-level math problems in Lean 4 (also Isabelle, HOL Light). Covers AMC, AIME, IMO, and university math. Proofs are compiler-verified: either correct or rejected. Zero hallucination possible. GitHub

GUI

ScreenSpot (1.2K+ instructions) — GUI element grounding across desktop, mobile, and web. Tests if the model can locate the right UI element from a natural language instruction. GitHub

Licenses

License	Models	Commercial	EU	Patent grant	OSI
Apache 2.0	Gemma 4, Qwen 3/3.5, GPT-OSS, Ministral, Step-3.5-Flash, NousCoder, OmniCoder, Nomos 1, URM, ZAYA1, Bonsai, DeepSeek-OCR-2, Hermes 4.3-36B, Pleias (all variants), Baguettotron	Yes	Yes	Yes	Yes
MIT	GLM-4.5-Air, GLM-4.7-Flash, DeepSeek R1-Distill, DeepSeek-V4-Flash, DeepSeek-OCR (v1), Ling-2.6-flash, Kimi-Linear, MiniMax M2.7 (verify on HF), Phi-4	Yes	Yes	No (implicit)	Yes
🔴 Modified MIT (revenue/MAU caps)	Mistral Medium 3.5 (revenue cap), historically Kimi K2.5 (100M MAU), MiniMax M2.5	Conditional	Conditional	--	No
Nemotron OML	Nemotron 3 Nano/Super, Nemotron 3 Nano Omni	Yes	Yes	Yes	No
Jamba OML	Jamba 1.6	Yes	Yes	--	No
Llama Community	Llama 3.3 70B, Llama 3.2 1B/3B (text-only), Hermes 4-70B	Yes	Yes (text-only)	--	No
LFM Open v1.0	LFM2, LFM2.5, LFM2.5-VL	Yes (< $10M)	Yes	--	No
Qwen License	Qwen2.5-Math	Yes	Yes	--	No

How to choose

Constraint	Recommendation
Smartphone / edge (< 4 GB)	SmolLM3-3B, SmolLM2-135M/360M/1.7B, Gemma 4 E2B, Phi-4-mini, Ministral 3B, LFM2.5-1.2B, Llama 3.2 1B/3B
Laptop 16 GB	GPT-OSS-20B, Ministral 14B, Gemma 4 26B-A4B
Desktop 24 GB	Gemma 4 31B, DeepSeek R1-Distill-32B, Devstral Small 2, GLM-4.7-Flash Q4 (agent coding on RTX 4090)
Desktop 48+ GB (dense 70B)	Llama 3.3 70B (MMLU 86.0, EU OK), InternVL3-78B (vision)
Server single-GPU (80 GB)	GPT-OSS-120B
Server multi-GPU	Step-3.5-Flash, Nemotron 3 Super, Qwen3.5-122B, Ling-2.6-flash 104B
Workstation 256 GB (extended tier)	DeepSeek-V4-Flash 284B/A13B (native 1M ctx, FP4+FP8)
Long context (> 256K)	Nemotron 3 Nano (1M, RULER 86.3%), DeepSeek-V4-Flash (1M native)
Token-efficient agent loops	Ling-2.6-flash (15M tokens on full AA suite)
On-device + thinking mode	ZAYA1-8B (760M active, 8.4B total)
Multimodal any-to-any	Nemotron 3 Nano Omni 30B-A3B (text+audio+image+video)
OCR / long-doc compression	DeepSeek-OCR-2 (Apache 2.0, Optical Compression)
Linear-attention research baseline	Kimi-Linear 48B/A3B (1M ctx, KDA + MLA hybrid)
Extreme quantization (mobile)	Bonsai-8B (1-bit, 1.15 GB)
Vision on edge (< 1 GB)	LFM2.5-VL-450M
RL self-play research (Lean)	DeepSeek-Prover-V2-7B + SGS (Stanford)
Math	Nemotron Nano 9B v2 (/think mode), GPT-OSS-120B
Code (real bugs)	Step-3.5-Flash, Devstral Small 2
Code (competition)	GPT-OSS-120B (Codeforces 2622)
Multilingual (100+ langs)	Qwen 3.5 (201), Qwen 3 (119)
Theorem proving (Lean 4)	BFS-Prover-V2-32B (95% miniF2F)
Theorem proving (natural language)	Nomos 1 (Putnam 2025: 87/120 with harness)
RAG with literal citations	Pleias-RAG-1B (native citation, 1B)
Copyright-safe training data	Pleias (100% public domain / CC)
Agent coding (laptop)	OmniCoder-9B (Terminal-Bench 23.6%), GLM-4.7-Flash Q4
Competitive coding (LCB)	NousCoder-14B (LCB v6 67.87%)
Uncensored generalist	Hermes 4-70B (SOTA RefusalBench)
GUI automation	UI-TARS-1.5-7B (94.2% ScreenSpot)
Throughput	Step-3.5-Flash (350 tok/s)

Rejected models

Model	License	Reason
Llama 4 (Maverick, Scout)	Llama Community License	EU exclusion (multimodal)
Llama 3.2 Vision 11B/90B	Llama Community License	EU exclusion (multimodal)
Llama-Nemotron-Super-49B	Llama 3.3 License	Inherits EU exclusion (multimodal base)
Qwen 3.6 Plus	Proprietary	Closed-source, API-only
Codestral	Non-commercial	Research only
Falcon 3	Ambiguous	Potential 10% royalty
Kimi K2.5	🔴 Modified MIT (100M MAU)	Listed-with-warning candidate; left out pending updated fiche — see warning callout in Generalists for the principle
DeepSeek V3/R1 full (671B)	MIT	Q4 ≈ 370 GB, beyond 256 GB extended cap
DeepSeek-V4-Pro (1.6T/49B)	MIT	Q4 ≈ 800 GB, datacenter only
Qwen 3 235B / Qwen 3.5 397B	Apache 2.0	235B Q4 ≈ 130 GB and 397B Q4 ≈ 218 GB technically fit extended tier — left out as MoE generalist territory is already covered by Ling-2.6-flash 104B and DeepSeek-V4-Flash with better efficiency

Contributing

Found an error? Missing a model? Open an issue or submit a PR.

Sources: HuggingFace, Papers With Code, official model repos and papers.

License

This list is licensed under CC-BY 4.0.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Open Weight Models

Table of Contents

LLMs

Generalists

Extended tier (128 < Q4 ≤ 256 GB)

Code

Code — LiveCodeBench & Terminal-Bench

Reasoning

GPQA Diamond (198 questions)

Math (AIME, 15 problems/year)

Compact / Edge

Long context

Alternative architectures

Decentralized training

Specialized

Theorem provers (Lean 4)

Natural-language provers (not Lean 4)

GUI agents

Search agents

Tool calling

Rust

Vision / Multimodal

Observations

Architecture

Benchmarks

Specialization

Licenses

Benchmarks reference

Reasoning & Knowledge

Math

Code

Long context

Agents & Tools

Theorem proving

GUI

Licenses

How to choose

Rejected models

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages