LLM Benchmark — Local vs Cloud Model Showdown

A head-to-head benchmark comparing cloud LLMs (Anthropic, DeepSeek) against locally hosted models (Ollama) on a real coding task: generating a complete Python CLI todo app with SQLite persistence.

Each model receives the same prompt, and the output is scored on 10 static feature checks and 7 functional tests (the code is actually executed). Quality score is on a 0–100 scale.
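To make the two kinds of checks concrete, here is a minimal sketch of how a static check, a functional test, and a 0–100 score could fit together. It is not the code in validate.py: the function names, the three example checks, and the equal weighting per check are all assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path


def static_checks(source: str) -> dict[str, bool]:
    """Illustrative feature checks on the source text (hypothetical, not the real 10)."""
    return {
        "uses_sqlite": "sqlite3" in source,
        "uses_argparse": "argparse" in source,
        "has_main_guard": "__name__" in source,
    }


def functional_test(script: Path, args: list[str], expected: str) -> bool:
    """Run the generated script with CLI arguments and look for expected output."""
    try:
        proc = subprocess.run(
            [sys.executable, str(script), *args],
            stdin=subprocess.DEVNULL,  # an interactive input() call fails fast instead of hanging
            capture_output=True,
            text=True,
            timeout=10,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and expected in proc.stdout


def quality_score(static: dict[str, bool], functional: list[bool]) -> float:
    """Scale the pass rate to 0-100, weighting every check equally (assumed weighting)."""
    results = list(static.values()) + list(functional)
    if not results:
        return 0.0
    return round(100 * sum(results) / len(results), 1)
```

Closing stdin in the functional test is a deliberate choice in this sketch: a script that waits on input() then exits with an error rather than stalling the whole run until the timeout.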

Models Tested

Model	Provider	Type	Parameters
Claude Sonnet 4.6	Anthropic API	Cloud	—
Claude Opus 4.6	Anthropic API	Cloud	—
DeepSeek V4-Flash	DeepSeek API	Cloud	284B total, 13B active (MoE)
DeepSeek V4-Pro	DeepSeek API	Cloud	1.6T total, 49B active (MoE)
Codestral 22B	Ollama	Local	22B
DeepSeek R1 14B	Ollama	Local	14B
Devstral	Ollama	Local	—
Qwen 3.5B (MoE)	Ollama	Local	35B total, 3.5B active

Results

Model	Type	TTFT (s)	Total Time (s)	Tok/s	Quality Score
Sonnet 4.6	Cloud	0.87	14.89	104.2	100
Opus 4.6	Cloud	1.23	19.06	74.3	100
Devstral	Local	2.24	10.26	90.2	100
Codestral 22B	Local	15.81	22.11	98.5	60
DeepSeek R1 14B	Local	11.74	20.64	191.7	60
Qwen 3.5B	Local	28.20	30.91	1,510.2	28
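The timing columns can be read with the following sketch in mind. It measures time to first token (TTFT), total time, and throughput over any iterable of streamed chunks. The names `StreamTiming` and `measure_stream` are hypothetical, each chunk is counted as one token, and throughput is assumed to be tokens divided by the time after the first token; benchmark.py may define these differently.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    ttft_s: Optional[float]  # None if the stream produced nothing
    total_s: float
    tokens: int
    tok_per_s: float


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Time a streamed response, treating each chunk as one token (assumption)."""
    start = clock()
    first_at: Optional[float] = None
    tokens = 0
    for _ in chunks:
        if first_at is None:
            first_at = clock()  # the first chunk marks the time to first token
        tokens += 1
    end = clock()
    if first_at is None:
        return StreamTiming(None, end - start, 0, 0.0)
    generation_s = end - first_at
    rate = tokens / generation_s if generation_s > 0 else 0.0
    return StreamTiming(first_at - start, end - start, tokens, rate)
```

Under that assumed definition, the Qwen row would correspond to roughly 1,510.2 × (30.91 − 28.20) ≈ 4,093 generated tokens.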

Key takeaway: Devstral had the shortest total time of any model (10.26 s) and was the only local model to score a perfect 100. Codestral and DeepSeek R1 built interactive input() apps instead of CLI-argument tools, so they failed all functional tests. Qwen hit the token limit mid-code, leaving a syntax error in its output.
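Both failure modes described above can be spotted without running the code. The sketch below parses a generated file with Python's `ast` module and classifies its interface. The function name `interface_style` and the category labels are hypothetical, and the detection is a heuristic rather than the check used in validate.py.

```python
import ast


def interface_style(source: str) -> str:
    """Classify generated code as 'cli-args', 'interactive', 'mixed', 'unknown' or 'syntax-error'."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Truncated output (for example, a hit token limit) usually lands here.
        return "syntax-error"

    calls_input = False
    reads_args = False
    for node in ast.walk(tree):
        # A bare call to the built-in input() suggests an interactive prompt loop.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "input"
        ):
            calls_input = True
        # sys.argv or argparse.ArgumentParser suggests a CLI-argument tool.
        if isinstance(node, ast.Attribute) and node.attr in {"argv", "ArgumentParser"}:
            reads_args = True
        if isinstance(node, ast.Name) and node.id == "ArgumentParser":
            reads_args = True

    if calls_input and reads_args:
        return "mixed"
    if reads_args:
        return "cli-args"
    if calls_input:
        return "interactive"
    return "unknown"
```

For example, `interface_style("task = input('New task: ')")` returns `"interactive"`, while a file cut off in the middle of a function definition returns `"syntax-error"`.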

See results/SUMMARY.md for the full analysis.

Run It Yourself

Prerequisites

Python 3.12+
Ollama installed and running (for local models)
API keys for cloud providers

pip install anthropic openai

Pull the local models:

ollama pull codestral:22b
ollama pull deepseek-r1:14b
ollama pull devstral:latest
ollama pull qwen3.5:35b-a3b

Set Environment Variables

export ANTHROPIC_API_KEY="your-key-here"
export DEEPSEEK_API_KEY="your-key-here"
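A quick way to confirm both keys are visible to Python before starting a run is sketched below. The helper `missing_keys` is hypothetical and not part of benchmark.py; it only checks that the two variables named above are set and non-empty.

```python
import os
from typing import Mapping, Sequence

REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "DEEPSEEK_API_KEY")


def missing_keys(
    env: Mapping[str, str] = os.environ,
    required: Sequence[str] = REQUIRED_KEYS,
) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All API keys are set.")
```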

Run the Benchmark

python3 benchmark.py

Validate the Results

python3 validate.py

Results are written to results/.

Project Structure

llm-benchmark/
├── benchmark.py                  # Main benchmark runner
├── validate.py                   # Validates generated code (static + functional)
├── results/
│   ├── benchmark_results.json    # Raw benchmark data
│   ├── SUMMARY.md                # Detailed analysis
│   ├── sonnet_4_6.py             # Generated code per model
│   ├── opus_4_6.py
│   ├── devstral.py
│   ├── codestral_22b.py
│   ├── deepseek_r1.py
│   └── qwen_3_5b.py
└── README.md

Blog Post

Full write-up coming soon at vibescoder.dev.
