A curated list of open-weight AI models with commercially exploitable licenses, verified benchmarks, and no geographic restrictions. Built to decide which models to support in herbert-rs, a local LLM inference engine in Rust and hand-written assembly.
Selection criteria:
- Commercially exploitable license, no geographic restriction (EU ok)
- VRAM Q4 ≤ 128 GB (main) — what fits Q4-quantized on a single high-end workstation. An extended tier (128 < Q4 ≤ 256 GB) flags models that require a Mac Studio M3 Ultra or multi-GPU server.
- Released after April 2024
This excludes Llama 4 multimodal (EU exclusion), Qwen 3.6 Plus (closed-source), DeepSeek V3/R1 full (671B → ~370 GB Q4, beyond 256 GB), and others. Note: Llama text-only models (3.3 70B, 3.2 1B/3B) are EU-exploitable. See Rejected models for details.
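The two memory tiers can be applied mechanically. The sketch below assumes roughly 0.55 bytes per parameter at Q4, a constant back-derived from figures quoted in this list (671B → ~370 GB, 284B → ~156 GB). The constant, `Tier`, `q4_footprint_gb` and `classify` are illustrative names, not part of herbert-rs, and real quantized files vary by scheme and exclude KV cache.

```rust
/// ASSUMPTION: average bytes per parameter for a Q4 quantization,
/// back-derived from the sizes quoted in this list. Not a format spec.
const Q4_BYTES_PER_PARAM: f64 = 0.55;

#[derive(Debug, PartialEq)]
enum Tier {
    Main,     // Q4 <= 128 GB
    Extended, // 128 GB < Q4 <= 256 GB
    Rejected, // Q4 > 256 GB
}

/// Estimate the Q4 weight footprint in GB from total parameters (in billions).
fn q4_footprint_gb(total_params_b: f64) -> f64 {
    total_params_b * Q4_BYTES_PER_PARAM
}

/// Map a total parameter count (in billions) to a selection tier.
fn classify(total_params_b: f64) -> Tier {
    let gb = q4_footprint_gb(total_params_b);
    if gb <= 128.0 {
        Tier::Main
    } else if gb <= 256.0 {
        Tier::Extended
    } else {
        Tier::Rejected
    }
}
```

Under this estimate, `classify(230.0)` lands in the main tier, `classify(284.0)` in the extended tier, and `classify(671.0)` is rejected, which matches how the list treats those sizes.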
Maintained by Philippe Anel. Last updated: May 2026.
v3 additions (May 2026) — Selection criterion switched from "< 200B params" to "VRAM Q4 ≤ 128 GB" + extended 256 GB tier (better proxy for what's actually runnable on consumer/prosumer hardware). Generalists: Ling-2.6-flash 104B/A7.4B (Ant), Mistral Medium 3.5 128B (🔴 Modified MIT), MiniMax M2.7 230B/A10B. Extended tier: DeepSeek-V4-Flash 284B/A13B (1M ctx native, FP4+FP8). Alternative architectures: ZAYA1-8B (Zyphra), Kimi-Linear 48B/A3B (Moonshot, KDA hybrid), Bonsai-8B (1-bit end-to-end, 1.15 GB). Vision/Multimodal: Nemotron 3 Nano Omni 30B-A3B, DeepSeek-OCR + DeepSeek-OCR-2. Compact/Edge: LFM2.5-VL-450M (vision edge). Theorem provers: SGS algorithm (Stanford, 7B beats 671B pass@4 on D3k). New license row: 🔴 Modified MIT with explicit warning.
v2 additions (April 2026) — Generalists: GLM-4.7-Flash, Hermes 4-70B. Code: NousCoder-14B, OmniCoder-9B (new LCB/Terminal-Bench subsection). Compact/Edge: Pleias-RAG-1B, Pleias-3B. Reasoning/Math: Qwen2.5-Math-72B (historical). Alternative architectures: URM. Decentralized training: Hermes 4.3-36B-Psyche. Theorem provers: Nomos 1 (natural-language track).
- LLMs
- Specialized
- Observations
- Benchmarks reference
- Licenses
- How to choose
- Rejected models
- Contributing
| Model | Publisher | Active | Total | Arch | Ctx | License | Key scores |
|---|---|---|---|---|---|---|---|
| Gemma 4 31B | Google | 31B | 31B | Dense | 256K | Apache 2.0 | GPQA 84.3, MMLU-Pro 85.2 |
| Qwen3.5-27B | Alibaba | 27B | 27B | Dense | 128K | Apache 2.0 | 201 languages |
| Qwen3.5-9B | Alibaba | 9B | 9B | Dense | 128K | Apache 2.0 | GPQA 81.7 (9B!) |
| Qwen3.5-122B-A10B | Alibaba | 10B | 122B | MoE | 256K | Apache 2.0 | 201 languages, multimodal |
| GPT-OSS-120B | OpenAI | 5.1B | 117B | MoE | 128K | Apache 2.0 | GPQA 80.9, Codeforces 2622, AIME 96.6% |
| GPT-OSS-20B | OpenAI | 3.6B | 21B | MoE | 128K | Apache 2.0 | AIME 96%, fits 16GB |
| Mistral Small 4 | Mistral | 6B | 119B | MoE | 256K | Apache 2.0 | GPQA 71.2, unified instruct/reasoning/coding |
| GLM-4.5-Air | Zhipu AI | 12B | 106B | MoE | 128K | MIT | MATH-500 98.1%, MMLU-Pro 81.4 |
| GLM-4.7-Flash | Zhipu AI | 3B | 30B | MoE (MLA) | 200K | MIT | SWE-bench 59.2, AIME25 91.6, GPQA 75.2 |
| Ling-2.6-flash | Ant Group | 7.4B | 104B | MoE (hybrid linear attn 1:7 MLA+Lightning) | 262K | MIT | Token-efficient agent (~15M tokens on full AA suite vs 40-100M for long-reasoners) |
| QwQ-32B | Alibaba | 32B | 32B | Dense | 128K | Apache 2.0 | AIME ~80%, reasoning RL |
| DeepSeek R1-Distill-32B | DeepSeek | 32B | 32B | Dense | 128K | MIT | Beats o1-mini |
| Step-3.5-Flash | StepFun | 11B | 196B | MoE | 262K | Apache 2.0 | SWE-bench 74.4%, 350 tok/s |
| Llama 3.3 70B | Meta | 70B | 70B | Dense | 128K | Llama Community (EU OK) | MMLU 86.0, HumanEval 88.4, MATH 77.0 |
| Hermes 4-70B | Nous Research | 70B | 70B | Dense | 128K | Llama Community (EU OK) | SOTA RefusalBench, hybrid reasoning, tool calling |
| InternVL3-78B | Shanghai AI Lab | 78B | 78B | Dense | -- | Apache 2.0 | MMMU 72.2, SOTA open-source VLM |
| Mistral Medium 3.5 128B | Mistral AI | 128B | 128B | Dense + Pixtral vision | 256K | 🔴 Modified MIT (revenue cap) | First Mistral merged flagship: Medium 3.1 + Magistral + Devstral 2 unified, configurable reasoning_effort |
| MiniMax M2.7 | MiniMax AI | 10B | 230B | MoE (256 experts, 8 active, 4.3% ratio) | ~200K | MIT (verify on HF) | Agentic workflows alt to Claude Opus 4.6 / GPT-5.3-Codex, IQ1_M @ 60.7 GB |
🔴 License warning — Modified MIT (revenue/MAU caps). Mistral Medium 3.5 falls under a Mistral Open License variant with a revenue threshold; MiniMax M2.7 and Kimi K2.5 historically shipped with similar caps (100M MAU for Kimi). They are listed for completeness but you must read the actual license before any commercial deployment — these are not interchangeable with Apache 2.0/MIT. The revenue/MAU clauses can flip a free model into a paid one once your product takes off.
Mistral Medium 3.5 (Apr/May 2026) is the first merged flagship from Mistral, a single set of weights unifying what used to be three distinct models: Medium 3.1 (instruct), Magistral (reasoning), and Devstral 2 (coding agent). Behavior switches via `reasoning_effort` per request (none/high). It replaces Medium 3.1 + Magistral in Le Chat and Devstral 2 in Vibe CLI. 88-layer dense (no MoE), Pixtral vision tower trained from scratch.
MiniMax M2.7 (Apr 2026 open-weight release) pushes the active/total ratio to 4.3% (10B/230B), targeting agentic long-running workflows (coding, multi-step troubleshooting, document editing). Positioned as open-weight alternative to Claude Opus 4.6 / GPT-5.3-Codex with IQ1_M weights at 60.7 GB making 230B practically deployable on a single workstation.
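The active/total ratio is simple arithmetic, and it helps to compute it the same way for every MoE in the table. A minimal sketch, using parameter counts copied from the table above; the `Moe` struct and `sparsest_first` helper are hypothetical names introduced for illustration.

```rust
/// A mixture-of-experts entry: parameters in billions, as listed in the table.
struct Moe {
    name: &'static str,
    active_b: f64,
    total_b: f64,
}

impl Moe {
    /// Share of weights used per token, in percent.
    fn active_pct(&self) -> f64 {
        self.active_b / self.total_b * 100.0
    }
}

/// Return model names ordered from sparsest (lowest active share) to densest.
fn sparsest_first(mut models: Vec<Moe>) -> Vec<&'static str> {
    models.sort_by(|a, b| a.active_pct().partial_cmp(&b.active_pct()).unwrap());
    models.into_iter().map(|m| m.name).collect()
}
```

For MiniMax M2.7, `active_pct()` gives about 4.35%, in line with the 4.3% quoted above.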
Models that exceed the 128 GB Q4 main cap but fit a 256 GB workstation (Mac Studio M3 Ultra max, multi-GPU server).
| Model | Publisher | Active | Total | Arch | Ctx | License | Key scores |
|---|---|---|---|---|---|---|---|
| DeepSeek-V4-Flash | DeepSeek | 13B | 284B | MoE (hybrid CSA+HCA, mHC, Muon optimizer) | 1M native | MIT | First < 200B-active LLM with native 1M ctx, FP4+FP8 mixed, 32T pre-train, 27% FLOPs / 10% KV cache vs V3.2 |
DeepSeek-V4-Flash (May 2026) is the small sibling of V4-Pro (1.6T/49B). Q4 ≈ 156 GB. Three inference modes are integrated in the chat template: non-think, think-high, and think-max (recommended at ≥ 384K context). The architecture introduces three new ideas (hybrid CSA + HCA attention, Manifold-Constrained Hyper-Connections (mHC), and the Muon optimizer), pushing the efficiency frontier rather than the parameter frontier.
| Model | SWE-bench | Codeforces | Active | License |
|---|---|---|---|---|
| Claude Opus 4.6 (closed) | 80.8% | -- | -- | -- |
| Gemini 3.1 Pro (closed) | 80.6% | -- | -- | -- |
| GPT-5.4 (closed) | ~80% | -- | -- | -- |
| Step-3.5-Flash | 74.4% | -- | 11B | Apache 2.0 |
| Devstral 2 | 72.2% | -- | ~12B | MIT modified |
| Qwen3-Coder-Next 80B-A3B | 70.6% | -- | 3B | Apache 2.0 |
| Qwen2.5-Coder-32B | 69.6% | -- | 32B | Apache 2.0 |
| Devstral Small 2 | 68.0% | -- | 24B | Apache 2.0 |
| GPT-OSS-120B | 62.4% | 2622 | 5.1B | Apache 2.0 |
| GLM-4.7-Flash | 59.2% | -- | 3B | MIT |
| Gemma 4 31B | -- | 2150 | 31B | Apache 2.0 |
SWE-bench = real bugs in real GitHub repos (Django, Flask, scikit-learn). 500 human-validated issues. Codeforces = algorithmic competition, ELO-scored like chess. Different skills: fixing a codebase vs solving a puzzle.
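To make the chess analogy concrete, here is the classic Elo expected-score formula. This is a sketch of textbook Elo, assumed here for illustration; Codeforces runs its own Elo-like rating system, so treat the output as an intuition aid rather than an official probability. The function name is hypothetical.

```rust
/// Expected score (between 0 and 1) of player A against player B
/// under the classic Elo model with a 400-point scale.
fn elo_expected(rating_a: f64, rating_b: f64) -> f64 {
    1.0 / (1.0 + 10f64.powf((rating_b - rating_a) / 400.0))
}
```

Under this model, a 2622-rated entrant would be expected to score about 0.94 against a 2150-rated one, and two equally rated entrants score 0.5 each.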
Specialized coders measured on benchmarks other than SWE-bench.
| Model | LiveCodeBench v6 | Terminal-Bench 2.0 | Active | License |
|---|---|---|---|---|
| OmniCoder-9B | -- | 23.6% | 9.4B | Apache 2.0 |
| NousCoder-14B | 67.87% | -- | 14.8B | Apache 2.0 |
| Qwen3.5-9B (baseline) | 60.79% | 14.6% | 9B | Apache 2.0 |
LiveCodeBench (rotating ≈700 problems from LeetCode/AtCoder/Codeforces, collected after model cutoffs) measures fresh competitive programming, vs SWE-bench (fixing real-world bugs) and Codeforces ELO (pure algorithms). Terminal-Bench 2.0 measures agentic coding skills (read-before-write, LSP responsiveness, minimal diffs).
OmniCoder-9B is a LoRA agentic fine-tune of Qwen3.5-9B on 425K Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro trajectories — +61% relative on Terminal-Bench vs base. NousCoder-14B is a pure-RL fine-tune of Qwen3-14B (+7.08 pts on LCB v6, no SFT). Same 9-14B class, opposite methods.
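Gains in this section are quoted both as relative percentages and as absolute points, and the two are easy to confuse. A small sketch of both, checked against the Terminal-Bench figures in the table above (14.6% base, 23.6% fine-tuned); the function names are illustrative.

```rust
/// Absolute gain in percentage points: tuned score minus base score.
fn absolute_gain_pts(base: f64, tuned: f64) -> f64 {
    tuned - base
}

/// Relative gain in percent of the base score.
fn relative_gain_pct(base: f64, tuned: f64) -> f64 {
    (tuned - base) / base * 100.0
}
```

With those inputs the absolute gain is 9.0 points and the relative gain is about 61.6%, the figure rounded to +61% above.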
GPQA Diamond (198 questions)
Graduate-level questions in physics, chemistry, biology. Designed to be unsolvable by Google search. Experts reach 65%, non-experts 34%. The most discriminating reasoning benchmark available.
| Model | GPQA | Active |
|---|---|---|
| Gemini 3.1 Pro (closed) | 94.3 | -- |
| GPT-5.4 (closed) | 92.8 | -- |
| Claude Opus 4.6 (closed) | 91.3 | -- |
| Gemma 4 31B | 84.3 | 31B |
| Gemma 4 26B-A4B | 82.3 | 3.8B |
| Qwen3.5-9B | 81.7 | 9B |
| GPT-OSS-120B | 80.9 | 5.1B |
| GLM-4.5-Air | 75.0 | 12B |
| Nemotron 3 Nano | 73.0 | 3.5B |
| Mistral Small 4 | 71.2 | 6B |
| Llama 3.3 70B | 50.5 | 70B |
Math (AIME, 15 problems/year)
Competition-level math requiring creativity and multi-step reasoning. Each year's edition is different and harder. Only compare within the same version.
| Model | AIME | Conditions | Active |
|---|---|---|---|
| GPT-5.4 (closed) | ~100% | 2025 | -- |
| Claude Opus 4.6 (closed) | ~98% | 2025 | -- |
| Nemotron 3 Nano | 99.2% | 2025, with tools | 3.5B |
| GPT-OSS-120B | 96.6% | 2024, with tools | 5.1B |
| GPT-OSS-20B | 96.0% | 2024, with tools | 3.6B |
| Gemma 4 31B | 89.2% | 2026 | 31B |
| Gemma 4 26B-A4B | 88.3% | 2026 | 3.8B |
| Ministral 14B | 85.0% | 2025 | 14B |
| Nemotron Nano 9B v2 | 97.8% | MATH-500, /think mode | 9B |
| Qwen2.5-Math-72B | 40.0% | 2024, TIR (Python) | 72B |
AIME versions (2024/2025/2026) are not comparable. Each year is harder.
Qwen2.5-Math (Sep 2024) is the first open model to make real AIME progress (12/30 on AIME 2024 — 6× GPT-4 Turbo at the time). Since surpassed by generalists (Nemotron, GPT-OSS 96%+). Historical reference, still useful for its TIR mode (Tool-Integrated Reasoning with Python interpreter) which eliminates arithmetic errors. Qwen License, not Apache — commercial OK but check terms.
Models that run on smartphones, laptops, or edge devices.
| Model | Active | VRAM Q4 | Strength | License |
|---|---|---|---|---|
| SmolLM3-3B | 3B | ~2 GB | Best 3B, AIME 36.7%, /think mode, 64K ctx | Apache 2.0 |
| SmolLM2-1.7B | 1.7B | ~1 GB | 11T tokens, data-centric | Apache 2.0 |
| SmolLM2-360M | 360M | < 1 GB | 4T tokens | Apache 2.0 |
| SmolLM2-135M | 135M | < 1 GB | Ultra-compact, few MB quantized | Apache 2.0 |
| Gemma 4 E2B | 2.3B | ~4 GB | Multimodal + audio | Apache 2.0 |
| Gemma 4 E4B | 4.5B | ~6 GB | Multimodal + audio | Apache 2.0 |
| Phi-4-mini | 3.8B | ~2 GB | MATH-500 92.5% | MIT |
| Phi-4-multimodal | 5.6B | ~3 GB | Text + image + audio | MIT |
| Ministral 3B | 3B | ~2 GB | Vision + reasoning, 256K ctx | Apache 2.0 |
| Ministral 8B | 8B | ~5 GB | AIME 78.7%, vision | Apache 2.0 |
| Ministral 14B | 14B | ~8 GB | AIME 85%, vision, 256K ctx | Apache 2.0 |
| LFM2.5-1.2B | 1.2B | ~1 GB | IFBench 47.3 (2x Qwen3-1.7B), thinking, vision, audio | LFM Open v1.0 |
| Llama 3.2 1B/3B | 1-3B | < 2 GB | 128K ctx, edge/mobile, EU OK (text-only) | Llama Community |
| InternLM3-8B | 8B | ~5 GB | Thinking mode, 4T tokens (75% less training) | Apache 2.0 |
| InternVL3-1B→38B | 1-38B | 1-20 GB | Vision SOTA, full range edge→server | Apache 2.0 |
| Chocolatine-2-4B-DPO | 4B | ~2.5 GB | French-optimized DPO fine-tune of Qwen3-4B, 262K ctx, no `<think>` | Apache 2.0 |
| Pleias-RAG-1B | 1.2B | ~1 GB | 100% public-domain training data, native citation with literal quotes, EU multilingual | Apache 2.0 |
| Pleias-RAG-350M | 350M | < 1 GB | Same as Pleias-RAG-1B, ultra-compact | Apache 2.0 |
| Baguettotron | 0.3B | < 1 GB | Latest Pleias base (Dec 2025), French-focused SLM | Apache 2.0 |
| LFM2.5-VL-450M | 450M | < 1 GB | Vision edge: SigLIP2 + 512×512 native, object detection, WebGPU | LFM Open v1.0 |
| Bonsai-8B | 8B | 1.15 GB | 1-bit Qwen3-8B fine-tune, CUDA/Metal/CPU/Android/iPhone | Apache 2.0 |
SmolLM3-3B beats all other 3B models and competes with 4B models (Qwen3-4B, Gemma3-4B). Data quality matters more than model size: SmolLM2-1.7B trained on 11T tokens beats larger models trained on less data.
Chocolatine-2-4B (Jonathan Pacifico) is a DPO fine-tune of Qwen3-4B-Instruct-2507 on French preference datasets (Compar:IA from the French Ministry of Culture + French-ORCA), merged with TIES. Gains on every French benchmark tested (GPQA-FR, French MMLU, French Bench, FR-MT-Bench) without degrading English performance. One of the rare French-focused open-weight models built by an individual contributor rather than a lab.
Pleias (Paris-based lab, partners with NVIDIA and Mozilla Builders) trains exclusively on public-domain or CC-licensed data (Common Corpus, 2T tokens). Raw benchmark scores trail Qwen/Gemma of equivalent size, but the trade-off is unique: zero copyright ambiguity (EU AI Act / GDPR friendly), strong EU multilingual (FR, DE, IT, ES, NL, PL), and the Pleias-RAG variants emit literal-quote citations natively. Positioned for regulated sectors (public, legal, press, education) where data provenance matters more than peak scores.
| Model | Max ctx | RULER 1M | Architecture | Active | License |
|---|---|---|---|---|---|
| Nemotron 3 Nano | 1M | 86.3% | Mamba/MoE | 3.5B | Nemotron OML |
| Nemotron 3 Super | 1M | -- | Mamba/MoE | 12B | Nemotron OML |
| DeepSeek-V4-Flash (extended tier) | 1M native | -- | MoE (CSA+HCA hybrid) | 13B | MIT |
| Jamba 1.6 Mini | 256K | -- | SSM+Transformer/MoE | 12B | Jamba OML |
RULER (GitHub) tests retrieval in long contexts with multiple needles, multi-hop tracing, and aggregation. Parametric by length (4K to 1M). Many models claim "1M context" without publishing RULER scores at that length. Without measurement, it's marketing.
Non-Transformer or hybrid models.
| Model | Architecture | Active | Key metric | License |
|---|---|---|---|---|
| Granite 4.0 | 90% Mamba-2 / 10% Attention | 3-9B | 70% memory reduction, 2x speed | Apache 2.0 |
| LFM2/2.5 | Convolutions + grouped attention | 2.3B | 112 tok/s CPU, 2x Qwen3. LFM2.5: vision, audio, thinking | LFM Open v1.0 |
| Jamba 1.6 Mini | Mamba + Transformer + MoE | 12B | 2.5x Transformer speed | Jamba OML |
| URM | Recursive Universal Transformer (ConvSwiGLU + TBPTL) | 4× params (tiny) | ARC-AGI 1: 53.8%, Sudoku 77.6% | Open-source (research) |
| ZAYA1-8B | Hybrid Mamba + Compressed Cross Attention (CCA) + MoD + EDA | 760M / 8.4B (9% active) | On-device deployable, test-time-compute friendly, 128K ctx | Apache 2.0 |
| Kimi-Linear-48B-A3B | MoE hybrid: 3 KDA (linear) layers per 1 MLA (global) | 3B / 48B | 1M context, 5.7T tokens, demonstrates linear attention can match full attention | MIT |
| Bonsai-8B | Qwen3-8B fine-tuned at 1-bit end-to-end (GGUF Q1_0), all projections + LM head 1-bit | 8.19B | 1.15 GB on disk (14.2× FP16), runs on CPU/Android/iPhone | Apache 2.0 |
URM (Ubiquant, Dec 2025) loops its 4 layers 12× instead of stacking 48 distinct layers. With 4× parameters it reaches 53.8% on ARC-AGI 1 where a vanilla Transformer with 32× parameters stays under 40%. Key claim of the paper: the FFN, not attention, is the source of reasoning — counterintuitive given the community's focus on attention variants. Research model, not a production LLM, but architecturally interesting for future LLM designs. See arXiv:2512.14693.
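The looping idea can be shown with a toy model: the same small set of layers is applied repeatedly, so effective depth grows while the parameter count stays fixed. This is only a sketch of weight sharing in general. The scalar `Layer` below is a stand-in and does not reproduce URM's ConvSwiGLU blocks or its TBPTL training scheme.

```rust
/// A toy "layer": one scalar weight standing in for a full block's parameters.
struct Layer {
    weight: f64,
}

impl Layer {
    fn forward(&self, x: f64) -> f64 {
        (self.weight * x).tanh()
    }
}

/// Weight-shared stack: run the same layers `loops` times in sequence.
/// Returns the output and the number of layer applications (effective depth).
fn forward_looped(layers: &[Layer], loops: usize, mut x: f64) -> (f64, usize) {
    let mut applications = 0;
    for _ in 0..loops {
        for layer in layers {
            x = layer.forward(x);
            applications += 1;
        }
    }
    (x, applications)
}
```

With 4 layers and 12 loops the input passes through 48 layer applications, while only 4 layers' worth of weights are stored.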
ZAYA1-8B (Zyphra, May 2026) is a Zamba-2 successor: 80 layers mixing SSM-Mamba and attention, with Compressed Cross Attention (CCA) plus Mixture-of-Depths (MoD) and Expert Decision Attention (EDA) on top of 16 top-1 experts. The angle is intelligence per active parameter: 760M active parameters give sub-1B inference cost while keeping 8B-class capacity. Positioned for on-device + thinking-mode workflows where compute scales with active params, not totals. Tech report on zyphra.com/zaya1-8b-technical-report.
Kimi-Linear (Moonshot, Oct 2025, arXiv:2510.26692) is Moonshot's open research vehicle for linear attention outside the closed K2 family. The architecture is a 3:1 ratio of KDA (Kimi Delta Attention, linear) to MLA (full attention, global) layers. The point isn't frontier performance — it's the demonstration that linear attention can match full attention across short, long, and RL-style regimes while reducing memory cost. Useful baseline for engine work like herbert-rs.
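For engine work, a hybrid ratio like this boils down to a per-layer schedule. The sketch below builds one from a "linear layers per full-attention layer" count. The exact interleaving order and total layer count of Kimi-Linear are assumptions here, and `LayerKind` and `hybrid_schedule` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum LayerKind {
    Linear, // e.g. a KDA-style linear-attention layer
    Full,   // e.g. an MLA-style global-attention layer
}

/// Build a schedule where every group of `linear_per_full` linear layers
/// is followed by one full-attention layer (ASSUMED ordering).
fn hybrid_schedule(num_layers: usize, linear_per_full: usize) -> Vec<LayerKind> {
    (0..num_layers)
        .map(|i| {
            if (i + 1) % (linear_per_full + 1) == 0 {
                LayerKind::Full
            } else {
                LayerKind::Linear
            }
        })
        .collect()
}
```

`hybrid_schedule(8, 3)` yields three linear layers, one full layer, then the same pattern again. Only the full-attention layers keep a KV cache that grows with context length, which is where the memory saving comes from.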
Bonsai-8B (Prism ML, Mar 2026) is a 1-bit end-to-end fine-tune of Qwen3-8B: every projection + the LM head quantized to 1 bit (GGUF Q1_0), shrinking the deployed model to 1.15 GB. Direct competitor to BitNet, but trained as a fine-tune rather than natively 1.58-bit from scratch. Runs on CUDA, Metal, Android, CPU, and iPhone (via Locally AI). The radical end of the quantization spectrum — accept the quality drop in exchange for ubiquity.
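The 14.2× figure can be reproduced from the numbers in the table. A quick sketch, assuming 2 bytes per parameter for FP16, decimal gigabytes, and the 8.19B parameter count listed above; the function names are illustrative.

```rust
/// FP16 size in GB for a model with `params_b` billion parameters (2 bytes each).
fn fp16_size_gb(params_b: f64) -> f64 {
    params_b * 2.0
}

/// How many times smaller the deployed file is than the FP16 weights.
fn compression_ratio(params_b: f64, deployed_gb: f64) -> f64 {
    fp16_size_gb(params_b) / deployed_gb
}

/// Average bits stored per weight, including any format overhead.
fn effective_bits_per_weight(params_b: f64, deployed_gb: f64) -> f64 {
    deployed_gb * 8.0 / params_b
}
```

For 8.19B parameters and a 1.15 GB file this gives a ratio of about 14.24× and roughly 1.12 bits per weight. The small excess over 1 bit is presumably scales and metadata.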
Models pre-trained outside traditional data centers, using distributed peer-to-peer or blockchain-coordinated networks. The story is the training method, not the model quality.
| Model | Method | Size | Tokens | Architecture | License |
|---|---|---|---|---|---|
| Covenant-72B | Permissionless P2P, SparseLoCo optimizer, Bittensor blockchain (Subnet 3) | 72B dense | 1.1T (+14.8B SFT) | LLaMA-3 style, GQA, 80 layers, d=8192, 64 heads, 8 KV heads, RoPE 500K, ctx 2048→8192 | Apache 2.0 (checkpoints) |
| Hermes 4.3-36B-Psyche | Internet-decentralized fine-tuning via Psyche | 36B dense | — (post-training on Seed-36B) | ByteDance Seed-36B base, Llama-3 chat template, hybrid `<think>` mode | Apache 2.0 |
Pre-training benchmarks (0-shot) vs other dense baselines:
| Benchmark | Covenant-72B | LLaMA-2-70B (centralized) | LLM360 K2 (65B, centralized) | INTELLECT-1 (10B, P2P) |
|---|---|---|---|---|
| ARC-Challenge | 56.8 | 57.4 | 53.8 | 44.8 |
| ARC-Easy | 80.9 | 79.6 | 76.0 | 71.8 |
| PIQA | 81.6 | 82.6 | 82.5 | 77.4 |
| OpenBookQA | 44.0 | 49.4 | 48.0 | 43.8 |
| HellaSwag | 80.6 | 84.3 | 82.9 | 70.3 |
| WinoGrande | 75.9 | 80.4 | 76.4 | 63.3 |
| MMLU | 67.1 | 65.6 | 65.5 | 32.7 |
Covenant-72B-Chat (post-SFT) vs other chat models:
| Benchmark | Covenant-72B-Chat | LLaMA-2-70B-Chat | K2-Chat (65B) |
|---|---|---|---|
| ARC-Challenge | 64.2 | 65.4 | 62.0 |
| MMLU | 67.4 | 63.1 | 67.9 |
| IFEval | 64.7 | 40.7 | 45.5 |
| MATH | 26.3 | 10.7 | 19.1 |
| MMLU-Pro | 40.9 | 35.2 | 45.4 |
| GSM8K | 63.9 | 52.2 | 79.0 |
Hermes 4.3-36B-Psyche (Nous Research, Nov 2025) is a different point in the same space: not pre-training from scratch, but decentralized post-training over the internet. Built on ByteDance's Seed-36B, fine-tuned via Nous's Psyche network, released under Apache 2.0. The Psyche variant matches or beats the centralized 4.3-36B twin on every benchmark (AIME25 69.3 vs 66.8, MMLU-Pro 80.7 vs 79.7), so decentralized post-training did not degrade quality. Complements Covenant: two different decentralization angles (pre-training at 72B / post-training at 36B).
Why Covenant matters: Covenant-72B is the first proof-of-concept that 72B-scale pre-training is possible without data centers, with peers joining and leaving freely. Coordination via the Bittensor blockchain (Subnet 3), communication via SparseLoCo (146× compression vs dense gradients), peers running 8×B200 GPUs over commodity internet (500 Mb/s down, 110 Mb/s up). The model achieves 94.5% compute utilization despite the network constraints, with an average of 16.9 contributing peers per round and 70+ unique peers over the run. On benchmarks, it beats LLaMA-2-70B on ARC-Challenge, ARC-Easy and MMLU (despite 1.8× fewer training tokens), and the chat variant has the best IFEval and MATH scores in its comparison group. It's the first credible alternative to the data-center duopoly for pre-training at 70B scale. Authors: Covenant AI + Mila. See arXiv 2603.08163.
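A back-of-envelope calculation shows why the compression factor matters on a 110 Mb/s uplink. The sketch below ASSUMES a dense payload of one 2-byte value per parameter per round. That baseline is an illustration only and is not taken from the Covenant paper, and the function names are hypothetical.

```rust
/// Seconds needed to upload `payload_bytes` over an uplink of `uplink_mbps` megabits/s.
fn upload_seconds(payload_bytes: f64, uplink_mbps: f64) -> f64 {
    payload_bytes * 8.0 / (uplink_mbps * 1e6)
}

/// Dense payload size in bytes, ASSUMING one 2-byte value per parameter.
fn dense_payload_bytes(params: f64) -> f64 {
    params * 2.0
}

/// Payload size after applying a compression factor (e.g. 146.0).
fn compressed_payload_bytes(params: f64, compression: f64) -> f64 {
    dense_payload_bytes(params) / compression
}
```

Under these assumptions, a 72B-parameter dense payload would take about 2.9 hours to upload at 110 Mb/s, while the 146× compressed payload takes about 72 seconds.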
miniF2F (GitHub): 488 formal Olympiad-level math problems. Proofs are compiler-verified: either correct or rejected. Zero hallucination possible on mathematical correctness.
| Model | miniF2F | PutnamBench | Active | License |
|---|---|---|---|---|
| BFS-Prover-V2-32B | 95.0% | -- | 32B | Apache 2.0 |
| Goedel-Prover-V2-32B | 90.4% | #1 | 32B | Apache 2.0 |
| DeepSeek-Prover-V2-7B | 88.9% | -- | 7B | MIT |
| DeepSeek-Prover-V2-7B + SGS | -- | -- | 7B | MIT (model) / CC-BY-4.0 (paper) |
| Leanstral | -- | -- | 32B | Apache 2.0 |
| Kimina-Prover-72B | 84.0% | -- | 72B | MIT |
| Leanabell-Prover-V2-7B | 78.2% | -- | 7B | Apache 2.0 |
Lean 4 proofs are verified by the compiler. Either correct or rejected. Zero hallucination on mathematical correctness.
The sweet spot is 32B: BFS-Prover (95%) and Goedel-V2 (90.4%) both beat the 72B Kimina (84%).
SGS (Self-Guided Self-Play; Stanford, arXiv:2604.20209, Apr 2026) is not a model but an RL self-play algorithm applied to DeepSeek-Prover-V2-7B. After 200 rounds and 6.3M generations, the 7B fine-tune surpasses the pass@4 of DeepSeek-Prover-V2-671B on D3k (3,323 Lean 4 problems from Goedel-Pset-V1). Caveat: D3k is the SGS run's own training-target set, not a held-out public benchmark like miniF2F or PutnamBench, so the 7B-beats-671B headline holds for in-distribution problems and is scope-restricted otherwise. Demonstrates that well-tuned RL can collapse a 100× parameter gap on a target dataset. Authors: Bailey, Wen, Dong, Hashimoto, Ma.
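For readers unfamiliar with pass@4: pass@k is the probability that at least one of k sampled attempts is correct. The sketch below is the widely used unbiased estimator from n samples with c successes, popularized by the Codex paper (Chen et al., 2021). Whether the SGS evaluation uses exactly this estimator is an assumption.

```rust
/// Unbiased pass@k estimate from `n` sampled attempts of which `c` are correct:
/// 1 - C(n - c, k) / C(n, k), computed as a product to avoid large binomials.
fn pass_at_k(n: u32, c: u32, k: u32) -> f64 {
    assert!(c <= n && k >= 1 && k <= n);
    if n - c < k {
        // Fewer than k failures exist, so any k-subset contains a success.
        return 1.0;
    }
    let mut all_fail = 1.0;
    for i in (n - c + 1)..=n {
        all_fail *= 1.0 - k as f64 / i as f64;
    }
    1.0 - all_fail
}
```

With one success out of ten samples, `pass_at_k(10, 1, 1)` is 0.1. With two successes out of ten, `pass_at_k(10, 2, 4)` is 2/3.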
A parallel track: models that write proofs in natural English, not formal Lean 4. Not compiler-verified, but closer to how mathematicians actually work.
| Model | Benchmark | Active | Total | License |
|---|---|---|---|---|
| Nomos 1 | Putnam 2025: 87/120 (72.5%) with Nomos Harness | ~3B | ~30B | Apache 2.0 |
Nomos 1 (Nous Research × Hillclimb AI, Dec 2025) is a Qwen3-30B-A3B-Thinking fine-tune specialized for natural-language proof writing, not Lean 4. On Putnam 2025 with the open-sourced Nomos Reasoning Harness, it jumps from 24/120 (base) to 87/120 — a +63 point gain where the inference harness matters as much as the model. Complementary to the Lean provers above, which offer compiler-verification guarantees that natural-language proofs cannot.
ScreenSpot (GitHub): 1,200+ instructions across desktop, mobile, web. Tests if the model can locate the right UI element from a natural language instruction.
| Model | ScreenSpot | OSWorld | Active | License |
|---|---|---|---|---|
| UI-TARS-1.5-7B | 94.2% | 42.5 | 7B | Apache 2.0 |
| Qwen2.5-VL-7B | 84.7% | -- | 7B | Apache 2.0 |
| ShowUI-2B | -- | -- | 2B | MIT |
UI-TARS-1.5-7B beats Claude (87.6%) on ScreenSpot. 7B, Apache 2.0, runs on a laptop.
| Model | Specialty | Active | License |
|---|---|---|---|
| WebThinker-32B | RL web search, beats Gemini Deep Research | 32B | Apache 2.0 |
| DeepResearcher-7B | Emergent multi-step planning via RL | 7B | Apache 2.0 |
| Search-R1 | Framework: teach any LLM to search (+26% on 7B) | any | Apache 2.0 |
BFCL (GitHub): Berkeley Function Calling Leaderboard. Tests function/tool calling accuracy: correct names, parameters, types. V4 adds web search and memory.
| Model | BFCL | Active | License |
|---|---|---|---|
| Hammer2.1-7B | #1 | 7B | CC-BY-NC 4.0 |
| xLAM-8B | #1 (alternate) | 8B | CC-BY-NC 4.0 |
| Hammer-0.5B | On-device | 0.5B | CC-BY-NC 4.0 |
Specialized tool-calling models clearly beat generalists. xLAM-8B beats GPT-4o on BFCL.
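What "correct names, parameters, types" means can be sketched as a small validator. This is an illustration of the idea only, not BFCL's actual grader; `ToolSpec`, `ToolCall`, `check_call` and the example tool are all hypothetical, and every parameter is treated as required.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum ArgType { Str, Int, Bool }

#[derive(Debug, Clone, PartialEq)]
enum ArgValue { Str(String), Int(i64), Bool(bool) }

impl ArgValue {
    fn type_of(&self) -> ArgType {
        match self {
            ArgValue::Str(_) => ArgType::Str,
            ArgValue::Int(_) => ArgType::Int,
            ArgValue::Bool(_) => ArgType::Bool,
        }
    }
}

/// Declared tool: a name plus (parameter name, expected type) pairs.
struct ToolSpec { name: &'static str, params: Vec<(&'static str, ArgType)> }

/// A call emitted by the model.
struct ToolCall { name: String, args: HashMap<String, ArgValue> }

/// Check function name, presence and type of each parameter, and no extras.
fn check_call(spec: &ToolSpec, call: &ToolCall) -> Result<(), String> {
    if call.name != spec.name {
        return Err(format!("wrong function: {}", call.name));
    }
    for (pname, ptype) in &spec.params {
        match call.args.get(*pname) {
            None => return Err(format!("missing parameter: {}", pname)),
            Some(v) if v.type_of() != *ptype => {
                return Err(format!("wrong type for parameter: {}", pname))
            }
            Some(_) => {}
        }
    }
    let extra = call
        .args
        .keys()
        .find(|k| !spec.params.iter().any(|(p, _)| *p == k.as_str()));
    match extra {
        Some(k) => Err(format!("unexpected parameter: {}", k)),
        None => Ok(()),
    }
}
```

A call passes only if all three checks hold, which is why a model that picks the right function but passes a boolean where an integer is expected still counts as a failure.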
| Model | Strandset-Rust | RustEvo2 | Active | License |
|---|---|---|---|---|
| Strand-Rust-Coder-14B | 0.50 | 0.43 | 14B | Apache 2.0 (base) |
Beats GPT-5-Codex and Claude Sonnet 4.5 on Rust benchmarks. Fine-tuned on 191K examples from 2,383 crates.
| Model | MMMU | Active | Key feature | License |
|---|---|---|---|---|
| InternVL3-78B | 72.2 | 78B | SOTA open-source VLM, custom InternViT | Apache 2.0 |
| InternVL3-1B→38B | -- | 1-38B | Full range edge→server | Apache 2.0 |
| Gemma 4 31B | 76.9 (MMMU-Pro) | 31B | Text + image + video | Apache 2.0 |
| Gemma 4 E2B/E4B | -- | 2.3-4.5B | Multimodal + audio, edge | Apache 2.0 |
| Qwen2.5-VL-7B | -- | 7B | Computer/phone use, DocVQA 95.7 | Apache 2.0 |
| Nemotron 3 Nano Omni 30B-A3B | -- | 3B | any-to-any (text+audio+image+video → text), 256K ctx, Mamba-Transformer hybrid MoE | Nemotron OML |
| DeepSeek-OCR | -- | 3.3B | OCR specialist — Contexts Optical Compression: encode long text as compressed image, feed books/papers as pixels not tokens | MIT |
| DeepSeek-OCR-2 | -- | 3.4B | OCR v2 with Visual Causal Flow (sequential reading order) | Apache 2.0 |
InternVL3-78B (72.2 MMMU) is on par with GPT-4o on multimodal. The InternViT encoder (300M–6B) is trained jointly with the LLM — not bolted on after the fact.
Nemotron 3 Nano Omni (NVIDIA, Apr 2026) extends the Nano family with native audio/video/image inputs. Targets enterprise document intelligence (contracts, SOW/MSA, finance), customer service (drive-thru order verification, delivery video OCR), GUI/browser/email agents, and dense video captioning. Stack of three components around the Nano LLM (vision encoder + audio encoder + LLM), not a monolithic any-to-any architecture. English-only. See arXiv:2604.24954.
Patterns observed across 60+ models. Not definitive truths.
- Dense retreats above 35B, but doesn't die. For generalists above 35B, MoE clearly dominates (GPT-OSS-120B, Mistral Small 4, Qwen3.5-122B-A10B, GLM-4.5-Air, Step-3.5-Flash, Nemotron 3 Super, all MoE). But dense survives where it has a structural advantage: Llama 3.3 70B (generalist), InternVL3-78B (vision), Kimina-Prover-72B (theorem proving), Qwen 2.5-72B (production NLP), Covenant-72B (decentralized training), DeepSeek R1-Distill-70B (distilled reasoning). Dense is becoming a specialization choice.
- Parameter count is no longer the determining factor. Qwen3.5-9B (9B dense) beats GPT-OSS-120B (5.1B active, 117B total) on GPQA Diamond (81.7 vs 80.9).
- The 40-79B segment is the dense survivors' refuge. New models often jump from ~35B straight to ~120B total via MoE. But the 40-79B range is well populated by quality dense models (Llama 3.3 70B, InternVL3-78B, Kimina-Prover-72B, Qwen 2.5-72B, Covenant-72B, R1-Distill-70B, Jamba 1.6 Mini 52B). This is where dense resists, and where you find both solid generalists and specialists.
- InternVL3 is the best open-source VLM nobody was talking about. InternVL3-78B (Shanghai AI Lab) reaches 72.2 MMMU under Apache 2.0, on par with GPT-4o. InternLM3-8B achieves SOTA with 75% fewer training tokens (4T vs 15-18T). Less press than Alibaba, comparable results.
- Qwen is the de facto base model for fine-tuning. BFS-Prover, Goedel-Prover, Kimina-Prover, most community distillations: all built on Qwen. The ResNet of LLMs.
- Decentralized pre-training is no longer a toy. Covenant-72B (Mar 2026) pre-trained a 72B dense LLaMA-3-style model over a permissionless blockchain network (Bittensor Subnet 3) on 1.1T tokens. It beats LLaMA-2-70B on ARC-Challenge, ARC-Easy and MMLU despite 1.8× fewer training tokens, with 94.5% compute utilization over commodity internet (500/110 Mb/s) and dynamic peer participation. The data-center duopoly for pre-training at 70B scale now has a credible alternative. SparseLoCo + 2-bit quantization gives 146× compression on gradient communication.
- GPQA Diamond is the most discriminating benchmark for reasoning: 198 doctoral-level questions, impossible to solve by retrieval.
- SWE-bench vs Codeforces measure different things. GPT-OSS-120B dominates competition (ELO 2622) but gets beaten on real bugs by Step-3.5-Flash (74.4% vs 62.4%).
- Many models claim "1M context" without RULER scores at that length. Without measurement, it's marketing.
- AIME versions (2024/2025/2026) are not comparable. Each year is harder. Only compare within the same version.
- Specialized models dominate on narrow tasks. UI-TARS-1.5-7B beats Claude on GUI (94.2% vs 87.6%). BFS-Prover-32B beats DeepSeek-671B on theorem proving (95% vs 88.9%).
- The sweet spot for theorem proving is 32B. Method (tree search, self-correction) compensates for size.
- Domain-specific models (medical, legal, finance) are less mature than code/math specialists. Generalists often outperform them on domain benchmarks. Specialization helps mainly for specific vocabulary, regulatory compliance, and private data fine-tuning.
- Gemma 4 under Apache 2.0 is a turning point. Google moved from a restrictive custom license to standard open-source for the first time.
- Llama 4 excludes the EU for multimodal models. But text-only Llama (3.3 70B, 3.2 1B/3B) is EU-exploitable; the exclusion only applies to multimodal.
- "Open-weight" is more nuanced than "open-source". Llama is technically open-weight but with geographic restrictions on multimodal. Always check the fine print.
What each benchmark measures, how many questions it has, and where to find more.
- GPQA Diamond (198 questions): Graduate-level questions in physics, chemistry, biology. Designed to be unsolvable by Google search. Experts reach 65%, non-experts 34%. The most discriminating reasoning benchmark.
- MMLU-Pro (12K+ questions): Hardened version of MMLU: 10 choices instead of 4, requires chain-of-thought reasoning. 14 domains. Drops accuracy 16-33% vs MMLU. Published at NeurIPS 2024.
- AIME (15 problems/year): American Invitational Mathematics Examination. Competition-level math requiring creativity and multi-step reasoning. Each year's edition is harder. Only compare within the same version (2024/2025/2026).
- MATH-500 (500 problems): Diverse math problems (algebra, geometry, combinatorics, number theory). Good general math evaluation but easier to saturate than AIME.
- SWE-bench Verified (500 issues): Real bugs from GitHub repos (Django, Flask, scikit-learn). The model must understand the codebase, find the bug, and produce a working patch. Human-validated by OpenAI. Paper
- Codeforces (ELO system): Algorithmic competition performance, scored like chess ELO. Measures pure algorithmic skill, not real-world coding. Different skill from SWE-bench.
- LiveCodeBench (rotating, 700+): Fresh competitive programming problems collected after model training cutoffs. Eliminates data contamination. Problems from LeetCode, AtCoder, Codeforces. GitHub
- RULER (parametric): Sophisticated "needle in a haystack" with multiple needles, multi-hop tracing, and aggregation. Tests at different lengths (4K to 1M). By NVIDIA. Many models claiming 1M context fail above 32K. GitHub
- BFCL (2K+): Berkeley Function Calling Leaderboard. Tests function/tool calling accuracy: correct names, parameters, types. V4 adds web search and memory. By UC Berkeley. GitHub
- miniF2F (488 problems): Formal Olympiad-level math problems in Lean 4 (also Isabelle, HOL Light). Covers AMC, AIME, IMO, and university math. Proofs are compiler-verified: either correct or rejected. Zero hallucination possible. GitHub
- ScreenSpot (1.2K+ instructions): GUI element grounding across desktop, mobile, and web. Tests if the model can locate the right UI element from a natural language instruction. GitHub
| License | Models | Commercial | EU | Patent grant | OSI |
|---|---|---|---|---|---|
| Apache 2.0 | Gemma 4, Qwen 3/3.5, GPT-OSS, Ministral, Step-3.5-Flash, NousCoder, OmniCoder, Nomos 1, URM, ZAYA1, Bonsai, DeepSeek-OCR-2, Hermes 4.3-36B, Pleias (all variants), Baguettotron | Yes | Yes | Yes | Yes |
| MIT | GLM-4.5-Air, GLM-4.7-Flash, DeepSeek R1-Distill, DeepSeek-V4-Flash, DeepSeek-OCR (v1), Ling-2.6-flash, Kimi-Linear, MiniMax M2.7 (verify on HF), Phi-4 | Yes | Yes | No (implicit) | Yes |
| 🔴 Modified MIT (revenue/MAU caps) | Mistral Medium 3.5 (revenue cap), historically Kimi K2.5 (100M MAU), MiniMax M2.5 | Conditional | Conditional | -- | No |
| Nemotron OML | Nemotron 3 Nano/Super, Nemotron 3 Nano Omni | Yes | Yes | Yes | No |
| Jamba OML | Jamba 1.6 | Yes | Yes | -- | No |
| Llama Community | Llama 3.3 70B, Llama 3.2 1B/3B (text-only), Hermes 4-70B | Yes | Yes (text-only) | -- | No |
| LFM Open v1.0 | LFM2, LFM2.5, LFM2.5-VL | Yes (< $10M) | Yes | -- | No |
| Qwen License | Qwen2.5-Math | Yes | Yes | -- | No |
| Constraint | Recommendation |
|---|---|
| Smartphone / edge (< 4 GB) | SmolLM3-3B, SmolLM2-135M/360M/1.7B, Gemma 4 E2B, Phi-4-mini, Ministral 3B, LFM2.5-1.2B, Llama 3.2 1B/3B |
| Laptop 16 GB | GPT-OSS-20B, Ministral 14B, Gemma 4 26B-A4B |
| Desktop 24 GB | Gemma 4 31B, DeepSeek R1-Distill-32B, Devstral Small 2, GLM-4.7-Flash Q4 (agent coding on RTX 4090) |
| Desktop 48+ GB (dense 70B) | Llama 3.3 70B (MMLU 86.0, EU OK), InternVL3-78B (vision) |
| Server single-GPU (80 GB) | GPT-OSS-120B |
| Server multi-GPU | Step-3.5-Flash, Nemotron 3 Super, Qwen3.5-122B, Ling-2.6-flash 104B |
| Workstation 256 GB (extended tier) | DeepSeek-V4-Flash 284B/A13B (native 1M ctx, FP4+FP8) |
| Long context (> 256K) | Nemotron 3 Nano (1M, RULER 86.3%), DeepSeek-V4-Flash (1M native) |
| Token-efficient agent loops | Ling-2.6-flash (15M tokens on full AA suite) |
| On-device + thinking mode | ZAYA1-8B (760M active, 8.4B total) |
| Multimodal any-to-any | Nemotron 3 Nano Omni 30B-A3B (text+audio+image+video) |
| OCR / long-doc compression | DeepSeek-OCR-2 (Apache 2.0, Optical Compression) |
| Linear-attention research baseline | Kimi-Linear 48B/A3B (1M ctx, KDA + MLA hybrid) |
| Extreme quantization (mobile) | Bonsai-8B (1-bit, 1.15 GB) |
| Vision on edge (< 1 GB) | LFM2.5-VL-450M |
| RL self-play research (Lean) | DeepSeek-Prover-V2-7B + SGS (Stanford) |
| Math | Nemotron Nano 9B v2 (/think mode), GPT-OSS-120B |
| Code (real bugs) | Step-3.5-Flash, Devstral Small 2 |
| Code (competition) | GPT-OSS-120B (Codeforces 2622) |
| Multilingual (100+ langs) | Qwen 3.5 (201), Qwen 3 (119) |
| Theorem proving (Lean 4) | BFS-Prover-V2-32B (95% miniF2F) |
| Theorem proving (natural language) | Nomos 1 (Putnam 2025: 87/120 with harness) |
| RAG with literal citations | Pleias-RAG-1B (native citation, 1B) |
| Copyright-safe training data | Pleias (100% public domain / CC) |
| Agent coding (laptop) | OmniCoder-9B (Terminal-Bench 23.6%), GLM-4.7-Flash Q4 |
| Competitive coding (LCB) | NousCoder-14B (LCB v6 67.87%) |
| Uncensored generalist | Hermes 4-70B (SOTA RefusalBench) |
| GUI automation | UI-TARS-1.5-7B (94.2% ScreenSpot) |
| Throughput | Step-3.5-Flash (350 tok/s) |
| Model | License | Reason |
|---|---|---|
| Llama 4 (Maverick, Scout) | Llama Community License | EU exclusion (multimodal) |
| Llama 3.2 Vision 11B/90B | Llama Community License | EU exclusion (multimodal) |
| Llama-Nemotron-Super-49B | Llama 3.3 License | Inherits EU exclusion (multimodal base) |
| Qwen 3.6 Plus | Proprietary | Closed-source, API-only |
| Codestral | Non-commercial | Research only |
| Falcon 3 | Ambiguous | Potential 10% royalty |
| Kimi K2.5 | 🔴 Modified MIT (100M MAU) | Listed-with-warning candidate; left out pending updated fiche — see warning callout in Generalists for the principle |
| DeepSeek V3/R1 full (671B) | MIT | Q4 ≈ 370 GB, beyond 256 GB extended cap |
| DeepSeek-V4-Pro (1.6T/49B) | MIT | Q4 ≈ 800 GB, datacenter only |
| Qwen 3 235B / Qwen 3.5 397B | Apache 2.0 | 235B Q4 ≈ 130 GB and 397B Q4 ≈ 218 GB technically fit extended tier — left out as MoE generalist territory is already covered by Ling-2.6-flash 104B and DeepSeek-V4-Flash with better efficiency |
Found an error? Missing a model? Open an issue or submit a PR.
Sources: HuggingFace, Papers With Code, official model repos and papers.
This list is licensed under CC-BY 4.0.