carryologist/llm-benchmark

LLM Benchmark — Local vs Cloud Model Showdown

A head-to-head benchmark comparing cloud LLMs (Anthropic, DeepSeek) against locally-hosted models (Ollama) on a real coding task: generating a complete Python CLI todo app with SQLite persistence.

Each model receives the same prompt. Its output is scored on 10 static feature checks and 7 functional tests, in which the generated code is actually executed. The resulting quality score is reported on a 0–100 scale.
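To make the idea of a functional test concrete, here is a minimal sketch of one way to execute a generated script with CLI arguments and check the outcome. It is not taken from validate.py: the helper names `run_cli` and `functional_check`, the timeout, and the pass criterion (clean exit plus an expected substring in stdout) are all assumptions for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_cli(script: Path, args: list[str], workdir: Path, timeout: float = 10.0):
    """Run a generated script with CLI arguments and capture its output."""
    return subprocess.run(
        [sys.executable, str(script.resolve()), *args],
        cwd=workdir,                # fresh directory, so any SQLite file starts empty
        capture_output=True,
        text=True,
        timeout=timeout,
        stdin=subprocess.DEVNULL,   # an input() prompt fails fast instead of hanging
    )


def functional_check(script: Path, args: list[str], expect_in_stdout: str) -> bool:
    """Hypothetical pass criterion: exit code 0 and the expected text in stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        try:
            result = run_cli(script, args, Path(tmp))
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0 and expect_in_stdout in result.stdout
```

Closing stdin is a deliberate choice here: a script that waits on `input()` raises `EOFError` and exits non-zero, so it is reported as a failure rather than stalling the run.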

Models Tested

| Model | Provider | Type | Parameters |
|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic API | Cloud | |
| Claude Opus 4.6 | Anthropic API | Cloud | |
| DeepSeek V4-Flash | DeepSeek API | Cloud | 284B total, 13B active (MoE) |
| DeepSeek V4-Pro | DeepSeek API | Cloud | 1.6T total, 49B active (MoE) |
| Codestral 22B | Ollama | Local | 22B |
| DeepSeek R1 14B | Ollama | Local | 14B |
| Devstral | Ollama | Local | |
| Qwen 3.5B (MoE) | Ollama | Local | 35B total, 3.5B active |

Results

| Model | Type | TTFT (s) | Total Time (s) | Tok/s | Quality Score |
|---|---|---|---|---|---|
| Sonnet 4.6 | Cloud | 0.87 | 14.89 | 104.2 | 100 |
| Opus 4.6 | Cloud | 1.23 | 19.06 | 74.3 | 100 |
| Devstral | Local | 2.24 | 10.26 | 90.2 | 100 |
| Codestral 22B | Local | 15.81 | 22.11 | 98.5 | 60 |
| DeepSeek R1 14B | Local | 11.74 | 20.64 | 191.7 | 60 |
| Qwen 3.5B | Local | 28.20 | 30.91 | 1,510.2 | 28 |
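For readers unfamiliar with the timing columns, the sketch below shows one plausible way to derive TTFT, total time, and tokens per second from a streamed response. It is an illustration, not the logic in benchmark.py: `measure_stream`, `StreamMetrics`, and the injected `count_tokens` function are hypothetical, and computing throughput over the window after the first token is an assumption about how Tok/s is defined.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StreamMetrics:
    ttft_s: float      # seconds from request start to first non-empty chunk
    total_s: float     # seconds from request start to end of stream
    tokens: int        # tokens counted across all chunks
    tok_per_s: float   # tokens divided by time elapsed after the first chunk


def measure_stream(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamMetrics:
    """Consume a stream of text chunks and time it."""
    start = clock()
    first = None
    tokens = 0
    for chunk in chunks:
        now = clock()
        if first is None and chunk:
            first = now  # first visible output marks TTFT
        tokens += count_tokens(chunk)
    end = clock()

    total = end - start
    ttft = (first - start) if first is not None else total
    decode = (end - first) if first is not None else 0.0
    rate = tokens / decode if decode > 0 else 0.0
    return StreamMetrics(ttft, total, tokens, rate)
```

Passing the clock and the token counter as parameters keeps the function independent of any provider SDK. In a real run, `chunks` would be the text deltas yielded by a streaming API call.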

Key takeaway: Devstral had the shortest total time of any model (10.26 s) and was the only local model to score a perfect 100, while Sonnet 4.6 had the lowest time to first token (0.87 s). Codestral and DeepSeek R1 built interactive input() apps instead of CLI-argument tools, so they failed all functional tests. Qwen hit the token limit mid-code, producing a syntax error.
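The two failure modes described above (interactive `input()` apps and truncated code that no longer parses) can both be detected without running anything. The following is a minimal sketch using Python's standard `ast` module. The function name `static_report` and the heuristics it applies are assumptions for illustration and are not the 10 static checks the benchmark actually uses.

```python
import ast


def static_report(source: str) -> dict:
    """Summarize whether code parses, prompts via input(), or reads CLI arguments."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # Output cut off mid-statement typically lands here.
        return {"parses": False, "error_line": exc.lineno,
                "uses_input": False, "reads_cli_args": False}

    uses_input = False
    reads_cli_args = False
    for node in ast.walk(tree):
        # A bare call to input(...) suggests an interactive app.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "input"):
            uses_input = True
        # Importing argparse or touching *.argv suggests a CLI-argument tool.
        elif isinstance(node, ast.Import):
            if any(alias.name == "argparse" for alias in node.names):
                reads_cli_args = True
        elif isinstance(node, ast.ImportFrom):
            if node.module == "argparse":
                reads_cli_args = True
        elif isinstance(node, ast.Attribute) and node.attr == "argv":
            reads_cli_args = True

    return {"parses": True, "error_line": None,
            "uses_input": uses_input, "reads_cli_args": reads_cli_args}
```

These are heuristics only: a script can read `sys.argv` and still call `input()`, so a report like this would complement executing the code, not replace it.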

See results/SUMMARY.md for the full analysis.

Run It Yourself

Prerequisites

  • Python 3.12+
  • Ollama installed and running (for local models)
  • API keys for cloud providers

Install the Python client libraries:

pip install anthropic openai

Pull the local models:

ollama pull codestral:22b
ollama pull deepseek-r1:14b
ollama pull devstral:latest
ollama pull qwen3.5:35b-a3b

Set Environment Variables

export ANTHROPIC_API_KEY="your-key-here"
export DEEPSEEK_API_KEY="your-key-here"
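Before starting a long run, it can help to confirm that both variables are actually visible to Python. The check below is a small optional sketch, not part of the repository. The helper name `missing_keys` is hypothetical, and it only tests that the variables are set and non-empty, not that the keys are valid.

```python
import os

# The two variable names come from the export commands above.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "DEEPSEEK_API_KEY")


def missing_keys(env=None) -> list[str]:
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All API keys are set.")
```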

Run the Benchmark

python3 benchmark.py

Validate the Results

python3 validate.py

Results are written to results/.

Project Structure

llm-benchmark/
├── benchmark.py                  # Main benchmark runner
├── validate.py                   # Validates generated code (static + functional)
├── results/
│   ├── benchmark_results.json    # Raw benchmark data
│   ├── SUMMARY.md                # Detailed analysis
│   ├── sonnet_4_6.py             # Generated code per model
│   ├── opus_4_6.py
│   ├── devstral.py
│   ├── codestral_22b.py
│   ├── deepseek_r1.py
│   └── qwen_3_5b.py
└── README.md

Blog Post

Full write-up coming soon at vibescoder.dev.

About

LLM model benchmark: comparing cloud and local models on agentic coding tasks
