BenchPress

Opinionated LLM evaluation for real-world use.

Live dashboard: mark-allwyn.github.io/BenchPress

BenchPress runs two independent benchmark suites against any LLM:

Generalist (80 prompts, 8 categories) - trap questions, false premises, constraint-heavy tasks, coding problems with no bug to find. Scored through three layers: deterministic auto-checks, multi-judge LLM scoring (1-5 normalised to 0-100), and DeepEval G-Eval metrics (correctness, coherence, instruction-following, 0-100).

Causal Reasoning v2.4 (100 questions, 20 bundles) - adversarial multiple-choice questions testing whether models truly understand causal inference or just pattern-match. Each bundle has 5 variants: base scenario, trap (the obvious answer is wrong), formal DAG reasoning with short elimination-style options, multi-step numeric, and analyst debate. Four rounds of structural hardening eliminated length bias and keyword tells. Scored deterministically, no LLM judge or DeepEval involvement.

Both benchmarks display 0-100 and are reported side by side. They are never blended into a single number. Results persist as JSON, so when a new model drops, one command compares it against everything tested before.

Features

Two benchmark suites - Generalist (80 prompts, open-ended) and Causal Reasoning (100 multiple-choice, bundled)
Three-layer scoring (Generalist) - heuristic auto-checks, multi-judge LLM scoring, and DeepEval G-Eval metrics combined into a composite score
Per-variant accuracy (Causal) - 20 bundles × 5 variants each, exposing pattern-matching vs structural reasoning per model
Multi-judge consensus - multiple independent LLM judges score each response, with self-judging prevented and agreement/divergence tracking
50 models, 12 companies - Anthropic, OpenAI, Google, Meta, xAI, Mistral, Alibaba, Zhipu, Moonshot, MiniMax, Cohere, Amazon
20 automated checkers - trap detection, sycophancy checks, constraint validation, hallucination flags, multiple-choice scoring, and more
Interactive dashboard - sortable leaderboard with per-category breakdowns, company views, causal reasoning page, and methodology docs
Any OpenAI-compatible API - works with vLLM, Ollama, Together, Groq, HF Inference API, and others
Append-only history - re-runs append new entries, full history preserved per prompt

Quick Start

pip install -r requirements.txt

cp config.example.yaml config.yaml
# Edit config.yaml - add your API keys and configure judge model

export ANTHROPIC_API_KEY=sk-...
export OPENAI_API_KEY=sk-...

# Run general eval against a model
python run.py eval claude-sonnet-4

# Run causal reasoning benchmark
python run.py eval claude-sonnet-4 --benchmark causal

# Compare everything
python run.py compare

# View the dashboard
python run.py dashboard --open

Scoring Pipeline

Each response is scored through three layers:

Auto-checks - deterministic heuristic checks (word count, JSON validity, trap detection, etc.) that flag mechanical failures instantly
LLM judges - multiple independent LLM judges each score responses 1-5 against the prompt's ideal answer and criteria
DeepEval G-Eval - research-backed metrics (correctness, coherence, instruction following) scored 0-1

The composite score merges judge and DeepEval into a single 0-1 metric:

composite = judge_weight * ((judge - 1) / 4) + deepeval_weight * deepeval_avg

Weights default to 50/50, configurable in config.yaml. The dashboard auto-regenerates after each eval, rejudge, and deepeval run.

Commands

Command	Description
`python run.py eval <model>`	Run Generalist benchmark against a model
`python run.py eval <model> --benchmark causal`	Run causal reasoning benchmark
`python run.py eval <model> --benchmark all`	Run both benchmarks
`python run.py eval <model> --ids C01 L02`	Run specific prompts
`python run.py eval <model> --category coding`	Filter by category
`python run.py eval <model> --rerun`	Re-run (appends, keeps history)
`python run.py rejudge`	Re-judge all models with current judge
`python run.py rejudge --benchmark causal`	Re-judge causal benchmark
`python run.py deepeval`	Score all models with DeepEval metrics
`python run.py compare`	Compare all models
`python run.py compare --benchmark causal`	Compare causal results
`python run.py compare --save`	Save markdown report
`python run.py dashboard`	Generate HTML dashboard
`python run.py dashboard --open`	Generate and open in browser
`python run.py models`	List evaluated models
`python run.py prompts`	List Generalist eval prompts
`python run.py prompts --benchmark causal`	List causal prompts

Models Evaluated

50 models across 12 companies. All ran on Generalist; 5 are excluded from Causal (retired APIs, paid-tier-only, broken HF model paths).

Full model list

Model	Company	Launched
claude-opus-4.8	Anthropic	2026-05-28
claude-opus-4.7	Anthropic	2026-04-14
claude-opus-4.6	Anthropic	2026-01-28
claude-sonnet-4.6	Anthropic	2026-01-28
claude-opus-4.5	Anthropic	2025-11-01
claude-sonnet-4.5	Anthropic	2025-09-29
claude-opus-4	Anthropic	2025-05-14
claude-sonnet-4	Anthropic	2025-05-14
claude-sonnet-3.7	Anthropic	2025-02-19
claude-haiku-3	Anthropic	2024-03-07
gpt-5.5	OpenAI	2026-04-23
gpt-5.4	OpenAI	2026-03-05
gpt-5.3	OpenAI	2026-03-03
gpt-5.2	OpenAI	2025-12-01
gpt-5.1	OpenAI	2025-11-01
gpt-oss-120b	OpenAI	2025-07-01
gpt-oss-20b	OpenAI	2025-07-01
o4-mini	OpenAI	2025-04-16
gpt-4.1	OpenAI	2025-04-14
gpt-4.1-mini	OpenAI	2025-04-14
gpt-4.1-nano	OpenAI	2025-04-14
o3-mini	OpenAI	2025-01-31
gpt-4o	OpenAI	2024-05-13
gpt-4o-mini	OpenAI	2024-07-18
gemini-3.1-pro	Google	2026-01-01
gemini-3-pro	Google	2025-09-01
gemini-3-flash	Google	2025-09-01
gemini-2.5-flash	Google	2025-05-20
gemma-3-27b	Google	2025-03-12
grok-4.1-fast	xAI	2025-10-01
grok-4	xAI	2025-07-09
llama-4-scout	Meta	2025-04-05
llama-4-maverick	Meta	2025-04-05
llama3.2	Meta	2024-09-25
llama3.2-vision-11b	Meta	2024-09-25
llama3.1	Meta	2024-07-23
qwen3-235b	Alibaba	2025-07-01
qwen3-coder-30b	Alibaba	2025-07-01
qwen3-32b	Alibaba	2025-04-29
minimax-m2.5	MiniMax	2025-10-01
kimi-k2.5	Moonshot	2025-10-01
glm-5	Zhipu	2025-10-01
glm-4.7-flash	Zhipu	2025-06-01
mistral-large-3	Mistral	2025-03-01
codestral	Mistral	2024-05-29
command-a	Cohere	2025-03-01
nova-2-lite	Amazon	2025-06-01
nova-pro	Amazon	2024-12-03
nova-lite	Amazon	2024-12-03
nova-micro	Amazon	2024-12-03

Causal Reasoning Benchmark

100 multiple-choice questions across 20 bundles, each covering a core causal-inference pitfall (confounding + selection, M-bias, Berkson's bias, time-varying confounding, transportability, etc). Every bundle has 5 variant types:

Variant	What it tests
Base	Narrative scenario combining 2-3 interacting causal issues
Trap	Looks like the base concept applies but the obvious answer is wrong; tests when a principle does NOT apply
Transfer	Formal DAG reasoning with short elimination-style answers (set notation, path counts, yes/no)
Numeric	Multi-step calculation with tables and conditional probabilities
Analyst	Two analysts debate - pick the most accurate assessment

Scoring dimensions:

Raw accuracy - % of 100 questions correct
Variant accuracy - per-variant accuracy across all 20 bundles (reveals which causal skill a model is weakest at)
Bundle consistency - how many of 5 variants correct per bundle (exposes pattern-matching vs genuine understanding)
Invalid rate - responses that don't produce a valid A/B/C/D answer

Because answers are deterministic, this benchmark skips LLM-judge and DeepEval scoring. Set per-benchmark scoring in config.yaml:

eval:
  benchmark_scoring:
    causal:
      skip_judges: true
      skip_deepeval: true

Run against the causal set:

python run.py eval <model-name> --benchmark causal

Causal questions are not published, to prevent models being tuned to this specific benchmark. Hardening design document: docs/plans/2026-04-10-causal-benchmark-v2-harder.md.

Hardening history

The v2.4 transfer variant is the result of four iterations against a cheap baseline model (Claude Haiku 3). Each round exposed a structural tell that let models score high without reasoning:

Version	Change	Haiku transfer	Opus transfer
v2.0	Initial release	90%	90%
v2.1	Content hardening (more distractors)	90%	90%
v2.2	Length normalisation	90%	90%
v2.3	Narrative replaced with paragraph-long DAG questions	85%	100% (saturated)
v2.4	Elimination-style short options (set notation, counts)	40%	55%

The key insight: when all 4 options are similar short length (20-80 chars), length-based heuristics fail and the question forces actual DAG traversal.

Auto-Checks

20 active checkers, plus 8 judge-only categories that rely entirely on LLM scoring:

Check	What it catches
`trap_no_bug`	Model invents a phantom bug in working code
`trap_common_error`	Model confuses memory vs compute complexity
`trap_wrong_claim`	Model agrees with a wrong claim instead of correcting
`sycophancy_check`	Model sycophantically agrees with a wrong position
`json_valid`	Response isn't valid JSON when asked for JSON
`constraint_check`	Wrong item count, included excluded terms
`refusal_check`	Unnecessary refusal on legitimate requests
`ambiguity_check`	Didn't ask for clarification on vague input
`word_count`	Over/under target word count
`word_count_reduction`	Insufficiently compressed summary
`response_length`	Exceeds maximum word count
`banned_words`	Uses explicitly banned words
`self_awareness`	Doesn't acknowledge known limitations
`code_runnable`	No code block found when code was expected
`hallucination_api`	Treats a fake API/library as real
`acknowledges_nonexistence`	Doesn't flag a fake event/thing as nonexistent
`table_format`	Wrong column/row count in table output
`multi_step_verify`	Expected numeric answer not found
`statistical_significance`	Overclaims statistical significance
`multiple_choice`	Extracts answer letter and checks against correct answer

Configuration

Adding Models

Any OpenAI-compatible API works (vLLM, Ollama, Together, Groq, HF Inference API, etc.):

# In config.yaml
llama-3-70b:
  provider: openai_compatible
  model: meta-llama/Llama-3-70b
  company: Meta
  launch_date: "2024-04-18"
  api_key_env: none
  base_url: http://localhost:8000/v1
  params:
    max_tokens: 4096
    temperature: 0

Supported providers: anthropic, openai, google, ollama, bedrock, cohere, openai_compatible.

Adding Prompts

Edit evals/general.json for general prompts or evals/causal.json for causal reasoning. Each general prompt:

{
  "id": "X01",
  "category": "your_category",
  "subcategory": "specific_area",
  "difficulty": "easy|medium|hard",
  "prompt": "The actual prompt",
  "ideal": "What good looks like",
  "criteria": ["what", "you", "judge"],
  "check_type": "reasoning"
}

After adding prompts, run existing models with --rerun or just eval (only new prompts run by default).

Results Structure

Each model gets its own JSON file in results/:

{
  "model_name": "claude-sonnet-4",
  "created": "2026-02-06T...",
  "updated": "2026-02-06T...",
  "runs": {
    "C01": [
      {
        "timestamp": "2026-02-06T...",
        "api_model": "claude-sonnet-4-20250514",
        "content": "...",
        "latency_s": 3.2,
        "input_tokens": 245,
        "output_tokens": 612,
        "auto_checks": { "flags": [], "passed": true },
        "judge_scores": {
          "gpt-4.1": {
            "score": 4,
            "rationale": "Mostly correct but missed edge case...",
            "judged_at": "2026-02-06T10:01:00"
          }
        },
        "judge_score_avg": 4.0,
        "judge_count": 1,
        "deepeval_scores": { "correctness": 0.87, "coherence": 0.94, "instruction_following": 0.91 },
        "deepeval_avg": 0.9067
      }
    ]
  }
}

Re-running with --rerun appends a new entry; the latest run is used for comparisons.

Project Structure

llm-eval/
├── run.py                       # CLI: eval, compare, rejudge, deepeval, dashboard, models, prompts
├── config.example.yaml          # Template - copy to config.yaml
├── requirements.txt
├── evals/
│   ├── general.json             # 80 general eval prompts across 8 categories
│   └── causal.json              # 100 causal reasoning questions in 20 bundles
├── scripts/
│   ├── providers.py             # Anthropic, OpenAI, Google, Ollama, Bedrock, Cohere, OpenAI-compatible
│   ├── checks.py                # 20 automated response checkers
│   ├── judge.py                 # LLM-as-judge scoring (1-5)
│   ├── deepeval_scorer.py       # DeepEval G-Eval integration (0-1)
│   └── dashboard.py             # HTML dashboard generation
├── docs/                        # Generated dashboard pages (GitHub Pages-ready)
│   ├── index.html               # Overview: scatter, timeline, top 10s, link cards
│   ├── generalist.html          # Generalist benchmark deep-dive
│   ├── causal.html              # Causal reasoning benchmark deep-dive
│   ├── companies.html
│   ├── categories.html
│   ├── judges.html              # Judge audit (agreement, divergence, bias)
│   ├── methodology.html
│   ├── data.json                # Shared dataset, fetched on page load
│   ├── causal-data.json
│   ├── sitemap.xml, robots.txt, favicon.svg, og-card.png
└── results/                     # Per-model JSON files (tracked in git)

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BenchPress

Features

Quick Start

Scoring Pipeline

Commands

Models Evaluated

Causal Reasoning Benchmark

Hardening history

Auto-Checks

Configuration

Adding Models

Adding Prompts

Results Structure

Project Structure

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
docs		docs
evals		evals
results		results
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.example.yaml		config.example.yaml
requirements.txt		requirements.txt
run.py		run.py

Folders and files

Latest commit

History

Repository files navigation

BenchPress

Features

Quick Start

Scoring Pipeline

Commands

Models Evaluated

Causal Reasoning Benchmark

Hardening history

Auto-Checks

Configuration

Adding Models

Adding Prompts

Results Structure

Project Structure

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages