A head-to-head benchmark comparing cloud LLMs (Anthropic, DeepSeek) against locally-hosted models (Ollama) on a real coding task: generating a complete Python CLI todo app with SQLite persistence.
Each model receives the same prompt, and the output is scored on 10 static feature checks and 7 functional tests (the code is actually executed). Quality score is on a 0–100 scale.
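The README does not say how the 17 checks are combined into the 0–100 score. As a minimal sketch only, assuming a hypothetical 60/40 split between static checks and functional tests (the weights and the `quality_score` name are illustrative and not taken from validate.py):

```python
def quality_score(
    static_passed: int,
    functional_passed: int,
    static_total: int = 10,
    functional_total: int = 7,
    static_weight: float = 60.0,
) -> int:
    """Combine pass counts into a 0-100 score.

    ASSUMPTION: static checks are worth `static_weight` points in total and
    functional tests are worth the remainder. The real weighting is whatever
    validate.py implements; this function only illustrates the idea.
    """
    if not 0 <= static_passed <= static_total:
        raise ValueError("static_passed out of range")
    if not 0 <= functional_passed <= functional_total:
        raise ValueError("functional_passed out of range")

    functional_weight = 100.0 - static_weight
    score = (
        static_weight * static_passed / static_total
        + functional_weight * functional_passed / functional_total
    )
    return round(score)
```

Under this assumed weighting, a model that passes every static check but no functional test would score 60, and a model that passes everything would score 100.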
| Model | Provider | Type | Parameters |
|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic API | Cloud | — |
| Claude Opus 4.6 | Anthropic API | Cloud | — |
| DeepSeek V4-Flash | DeepSeek API | Cloud | 284B total, 13B active (MoE) |
| DeepSeek V4-Pro | DeepSeek API | Cloud | 1.6T total, 49B active (MoE) |
| Codestral 22B | Ollama | Local | 22B |
| DeepSeek R1 14B | Ollama | Local | 14B |
| Devstral | Ollama | Local | — |
| Qwen 3.5B (MoE) | Ollama | Local | 35B total, 3B active |
| Model | Type | TTFT (s) | Total Time (s) | Tok/s | Quality Score |
|---|---|---|---|---|---|
| Sonnet 4.6 | Cloud | 0.87 | 14.89 | 104.2 | 100 |
| Opus 4.6 | Cloud | 1.23 | 19.06 | 74.3 | 100 |
| Devstral | Local | 2.24 | 10.26 | 90.2 | 100 |
| Codestral 22B | Local | 15.81 | 22.11 | 98.5 | 60 |
| DeepSeek R1 14B | Local | 11.74 | 20.64 | 191.7 | 60 |
| Qwen 3.5B | Local | 28.20 | 30.91 | 1,510.2 | 28 |
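The timing columns can be collected from any streaming response. The sketch below assumes the response is exposed as an iterable of text chunks, treats each chunk as one token (a rough approximation; the actual runner may use token counts reported by the API), and computes Tok/s over the window after the first token arrives. `measure_stream` is a hypothetical helper, not a function from benchmark.py.

```python
import time
from typing import Iterable


def measure_stream(chunks: Iterable[str]) -> dict:
    """Measure TTFT, total time, and throughput for a stream of text chunks.

    ASSUMPTIONS: one chunk == one token, and throughput is measured from the
    first chunk to the end of the stream.
    """
    start = time.perf_counter()
    first_chunk_at = None
    pieces = []

    for chunk in chunks:
        if first_chunk_at is None:
            # Time to first token: delay before any output arrives.
            first_chunk_at = time.perf_counter()
        pieces.append(chunk)

    end = time.perf_counter()

    if first_chunk_at is None:
        # Empty stream: nothing to time beyond the total duration.
        ttft = None
        tok_per_s = None
    else:
        ttft = first_chunk_at - start
        generation_time = end - first_chunk_at
        tok_per_s = len(pieces) / generation_time if generation_time > 0 else None

    return {
        "ttft_s": ttft,
        "total_s": end - start,
        "tokens": len(pieces),
        "tok_per_s": tok_per_s,
        "text": "".join(pieces),
    }
```

Because throughput here excludes the wait before the first chunk, a model with a long TTFT and a short generation window can still show a very high Tok/s figure.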
Key takeaway: Devstral had the shortest total time of any model (10.26 s) and was the only local model to score a perfect 100. Codestral and DeepSeek R1 built interactive input() apps instead of CLI-argument tools, so they failed all functional tests. Qwen hit the token limit mid-code, leaving truncated output that fails with a syntax error.
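Both failure modes described above (interactive input() apps and truncated code that does not parse) can be detected without running the program. The following is a minimal sketch using the standard `ast` module; `classify_cli_style` and its return labels are hypothetical and not taken from validate.py.

```python
import ast

# Modules commonly used to parse command-line arguments.
ARG_MODULES = {"argparse", "click", "typer"}


def classify_cli_style(source: str) -> str:
    """Return 'syntax_error', 'cli_args', 'interactive', or 'unknown'."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Truncated output (for example, cut off at a token limit) lands here.
        return "syntax_error"

    calls_input = False
    uses_args = False

    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            # A bare call to the built-in input() suggests an interactive app.
            if isinstance(node.func, ast.Name) and node.func.id == "input":
                calls_input = True
        elif isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in ARG_MODULES for alias in node.names):
                uses_args = True
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in ARG_MODULES:
                uses_args = True
        elif isinstance(node, ast.Attribute):
            # Direct use of sys.argv also counts as argument handling.
            if (
                node.attr == "argv"
                and isinstance(node.value, ast.Name)
                and node.value.id == "sys"
            ):
                uses_args = True

    if uses_args:
        return "cli_args"
    if calls_input:
        return "interactive"
    return "unknown"
```

This is a heuristic: a program that imports argparse but never uses it would still be labeled `cli_args`, so a check like this complements functional tests rather than replacing them.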
See results/SUMMARY.md for the full analysis.
- Python 3.12+
- Ollama installed and running (for local models)
- API keys for cloud providers
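The prerequisites above can be checked before a run. This sketch assumes the two environment variable names used later in this README and only verifies that an `ollama` executable is on the PATH; it does not confirm that the Ollama server is actually running. `check_prerequisites` is a hypothetical helper, not part of the repository.

```python
import os
import shutil
import sys

# Variable names as exported in the setup steps of this README.
REQUIRED_ENV_VARS = ("ANTHROPIC_API_KEY", "DEEPSEEK_API_KEY")


def check_prerequisites(env=os.environ, which=shutil.which,
                        version=sys.version_info) -> list:
    """Return a list of human-readable problems; an empty list means all good.

    The parameters exist so the check can be exercised with fake inputs.
    """
    problems = []

    if tuple(version[:2]) < (3, 12):
        problems.append("Python 3.12+ is required")

    # Only checks that the binary exists, not that the server is up.
    if which("ollama") is None:
        problems.append("ollama executable not found on PATH")

    for name in REQUIRED_ENV_VARS:
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(f"missing prerequisite: {problem}")
```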
pip install anthropic openai

Pull the local models:
ollama pull codestral:22b
ollama pull deepseek-r1:14b
ollama pull devstral:latest
ollama pull qwen3.5:35b-a3b

export ANTHROPIC_API_KEY="your-key-here"
export DEEPSEEK_API_KEY="your-key-here"

python3 benchmark.py

python3 validate.py

Results are written to results/.
llm-benchmark/
├── benchmark.py # Main benchmark runner
├── validate.py # Validates generated code (static + functional)
├── results/
│ ├── benchmark_results.json # Raw benchmark data
│ ├── SUMMARY.md # Detailed analysis
│ ├── sonnet_4_6.py # Generated code per model
│ ├── opus_4_6.py
│ ├── devstral.py
│ ├── codestral_22b.py
│ ├── deepseek_r1.py
│ └── qwen_3_5b.py
└── README.md
Full write-up coming soon at vibescoder.dev.