A config-driven evaluation pipeline for benchmarking LLMs on European legal reasoning tasks.
Legal Benchmark Runner evaluates large language models against curated European legal datasets covering professional reasoning, law exams, multilingual MCQs, and human-rights case law. It supports multiple LLM providers, three judging strategies, and produces structured, reproducible run artifacts.
- 🔌 Multi-provider — Route candidate models through NVIDIA NIM, Amazon Bedrock, Mistral, Vercel AI Gateway, or any OpenAI-compatible endpoint via LiteLLM
- 🧑⚖️ Three judging modes — Rubric-based (LLM-graded criteria), reference-answer (LLM comparison), and MCQ (deterministic exact-match)
- 📊 Structured outputs — Every run produces JSONL artifacts, per-dataset summaries, and a full config snapshot for reproducibility
- ⚡ Parallel & rate-limited — Configurable worker pools and per-minute rate limits for both generation and judging
- 💾 Disk caching — Avoid redundant API calls across re-runs
- ✅ Validated inputs — Canonical schema validation fails fast on malformed data
- Overview
- Architecture
- Datasets
- Getting Started
- Configuration
- Usage
- Output Artifacts
- Testing
- Project Structure
- Dataset Citations
- License
┌─────────────────────────────────────────────────────────────┐
│ config.yaml │
│ (providers · datasets · candidates · judges) │
└────────────────────────────┬────────────────────────────────┘
│
┌────────▼────────┐
│ Config Validation │
│ (src/config.py) │
└────────┬─────────┘
│
┌──────────────▼──────────────┐
│ Runner Orchestrator │
│ (parallel workers + cache) │
└──────┬──────────────┬───────┘
│ │
┌──────────▼──────┐ ┌───▼──────────────┐
│ Generation │ │ Judging │
│ (LiteLLM / │ │ (Rubric / Ref / │
│ Google GenAI) │ │ MCQ grading) │
└─────────────────┘ └──────────────────┘
│ │
┌──────▼──────────────▼───────┐
│ Output Writer │
│ outputs/<run_id>/ │
│ examples · responses · │
│ judgments · summary │
└─────────────────────────────┘
Validation ownership is intentionally split:
- Config-time validation in
src/config.py(static enums, required fields, provider compatibility). - Runtime validation in runner execution flow (
src/runner/orchestrator.py,src/runner/generation.py) for dynamic content and coverage checks.
The benchmark input is built from five curated upstream sources, each mapped to a specific judging mode:
| Dataset | Source | Task Type | Judging Mode |
|---|---|---|---|
| PRBench | ScaleAI/PRBench | Rubric QA | LLM rubric |
| APEX-v1 | mercor/APEX-v1-extended | Rubric QA | LLM rubric |
| LEXam | LEXam-Benchmark/LEXam | Reference QA / MCQ | LLM reference / Exact-match |
| INCLUDE-Base | CohereLabs/include-base-44 | MCQ | Exact-match |
| LAR-ECHR | AUEB-NLP/lar-echr | MCQ | Exact-match |
All datasets are converted into a canonical legal_eval_v1 JSONL schema (see docs/DATA_SCHEMA.md) and merged into a single evaluation file.
For conversation-style tasks (for example PRBench), canonical rows can include a messages array so generation uses turn-structured chat input directly.
- Python ≥ 3.11
- uv — fast Python package manager
- API keys for at least one LLM provider (see Configuration → Providers)
- If any configured model uses the
bedrock/...prefix, installboto3(included by default viauv sync)
# Clone the repository
git clone https://github.com/<your-org>/legal-benchmark-runner.git
cd legal-benchmark-runner
# Install dependencies
uv sync
# Copy template files
cp .env.example .env
cp config.example.yaml config.yamlEdit .env and add the keys for the providers you plan to use:
# Required for your chosen providers (add only what you need)
NVIDIA_API_KEY=your_nvidia_api_key
MISTRAL_API_KEY=your_mistral_api_key
GEMINI_API_KEY=your_gemini_api_key
AWS_BEARER_TOKEN_BEDROCK=your_aws_bedrock_api_key
AI_GATEWAY_API_KEY=your_vercel_ai_gateway_api_keyMerge curated source datasets into the canonical evaluation file:
uv run python build_for_eval.pyThis produces data/for_eval/merged_legal_eval_v1.jsonl.
Note: MCQ grading now requires canonical correct_choice_ids in each row. If you have older custom eval files,
rebuild them via build_for_eval.py before running.
uv run python run.py --config config.yaml --check-setupIf everything is configured correctly, you'll see: Setup check passed.
If Bedrock models are configured without boto3, setup check fails fast with an install hint.
All runtime behavior is controlled through config.yaml (or any YAML file passed via --config). The config.example.yaml file is a fully annotated template.
Define credential profiles and routing settings. Each candidate or judge model references a provider by name.
providers:
nim:
api_key_env: NVIDIA_API_KEY
base_url: https://integrate.api.nvidia.com/v1
timeout_s: 180
bedrock:
api_key_env: AWS_BEARER_TOKEN_BEDROCK
timeout_s: 180
google_genai:
api_key_env: GEMINI_API_KEY
timeout_s: 120candidates:
- name: bedrock_claude_sonnet_4_5
provider: bedrock
model: bedrock/anthropic.claude-sonnet-4-5-20250929-v1:0
temperature: 0.2
max_tokens: 4096
judges:
- name: judge_gemini_flash_lite
provider: google_genai
model: gemini-flash-lite-latest
temperature: 0.0
max_tokens: 700| Parameter | Description | Default |
|---|---|---|
response_parallel_workers |
Parallel candidate generation workers | 8 |
response_rate_limit_rpm |
Shared RPM throttle for generation (1–50) | 50 |
provider_response_rate_limit_rpm |
Optional per-provider generation RPM overrides (1–50) | {} |
final_response_source |
Candidate answer source: sampled, prefilled, or part_of_conversation |
sampled |
prefilled_responses_path |
JSONL path used when final_response_source=prefilled |
null |
previous_output_path |
Path used when final_response_source=part_of_conversation (.jsonl or .json) |
null |
response_api |
Strict sampled-generation API mode: responses or chat.completions |
chat.completions |
use_scratchpad |
Append dataset scratchpad metadata to generation prompt | false |
web_search |
Inject lightweight web-search hint in request extra body | false |
judge_parallel_workers |
Parallel judge workers per response | 4 |
judge_rate_limit_rpm |
RPM throttle for judge calls (0 = off) | 12 |
include_raw_provider_response |
Include raw provider SDK response in trace rows | false |
Example provider-specific throttle:
run:
response_rate_limit_rpm: 50
provider_response_rate_limit_rpm:
nim: 20Prefilled response file contract (run.final_response_source=prefilled): one JSON object per line with
example_id, candidate_name, and response_text (string). The run fails fast if any selected pair is missing.
part_of_conversation contract (run.final_response_source=part_of_conversation): read previous
responses from .jsonl rows with example_id, candidate_name, response_text (same shape as
responses.jsonl), or from .json with either a list of those objects or a simple
{example_id: response_text} mapping when exactly one candidate is configured.
response_api provider support:
google_genai:chat.completionsonly.- LiteLLM-routed providers (
nim,bedrock,mistral_api,vercel_gateway, etc.):chat.completionsandresponses. - Unsupported combinations fail fast during config/setup validation.
# Full benchmark run
uv run python run.py --config config.yaml
# Smoke test with 5 examples
uv run python run.py --config config.yaml --limit 5
# Validate setup without running
uv run python run.py --config config.yaml --check-setup
# Disable progress output
uv run python run.py --config config.yaml --progress off| Flag | Description |
|---|---|
--config PATH |
Path to YAML config file (default: config.example.yaml) |
--limit N |
Cap total examples across all datasets |
--progress {log,off} |
Progress output mode |
--check-setup |
Validate environment and exit |
When a run completes with failures (generation errors, parse errors, or empty responses), you can repair it without re-running successful items:
# 1. Run targeted backfill for failed items
uv run python scripts/backfill_run.py \
--config config.yaml \
--base-run-id <original_run_id> \
--include-failed-generation \
--include-parse-errors
# 2. Merge backfill results onto the original run
uv run python scripts/merge_backfill.py \
--base-run-id <original_run_id> \
--backfill-run-id <backfill_run_id>The merge script overlays backfill rows onto the base run outputs (keyed by example_id + candidate_name), producing a repaired run with updated summary statistics.
Each run writes files to data/runs/<run_id>/outputs/ by default:
| File | Description |
|---|---|
examples.jsonl |
Normalized examples selected for the run |
responses.jsonl |
Candidate model outputs with request metadata |
judgments.jsonl |
Grading outputs (score, pass/fail, criteria, rationale) |
scored_responses.jsonl |
Merged response + judgment rows |
trace.jsonl |
Per-call trace data for debugging |
summary.json |
Aggregate metrics (overall and per-dataset) |
run_config.json |
Resolved config snapshot for reproducibility |
PRBench scoring note: For PRBench rubric rows,
judgments.jsonlandscored_responses.jsonlinclude additional parity-oriented aggregation fields:prbench_weighted_raw,prbench_points_normalized, andprbench_points_clipped.
The project includes a comprehensive test suite covering schema validation, MCQ grading, rubric aggregation, prompt policies, runner progress, and rate limiting.
# Run the full test suite
uv run pytest
# Run a specific test file
uv run pytest tests/judge/test_mcq_grading.py -v
# Run with output
uv run pytest -slegal-benchmark-runner/
├── run.py # CLI entry point — runs the evaluation pipeline
├── build_for_eval.py # Merges curated datasets into canonical eval file
├── config.example.yaml # Annotated example configuration
├── .env.example # Template for API keys
├── pyproject.toml # Project metadata & dependencies
│
├── src/ # Core library
│ ├── config.py # Config parsing & validation
│ ├── types.py # Shared data types
│ ├── cache.py # Disk-based response cache
│ ├── retry.py # Retry logic with exponential backoff
│ ├── setup_checks.py # Environment validation
│ ├── io/ # Shared JSON/JSONL IO helpers
│ │ └── json_io.py # read_json/read_jsonl/write_json/write_jsonl
│ ├── runtime/ # Shared runtime bootstrap helpers
│ │ └── bootstrap.py # dotenv bootstrap helper
│ ├── data/ # Data loading, schema, attachments
│ │ ├── schema.py # Canonical JSONL schema validator
│ │ ├── loader.py # Dataset loader
│ │ ├── build_for_eval.py # Dataset merge/build logic
│ │ ├── policies.py # Dataset-specific prompting policies
│ │ └── attachments.py # PDF/file attachment extraction
│ ├── providers/ # LLM provider adapters
│ │ ├── base.py # Abstract provider interface
│ │ ├── litellm.py # LiteLLM adapter (OpenAI-compatible)
│ │ └── google_genai.py # Google GenAI native adapter
│ ├── judge/ # Judging & grading
│ │ ├── judge.py # Rubric & reference judging
│ │ ├── mcq.py # Deterministic MCQ grading
│ │ ├── parsing.py # Judge output parsing
│ │ └── policies/ # Policy-specific judge handlers
│ │ ├── base.py # JudgePolicyHandler protocol
│ │ ├── registry.py # Policy handler lookup
│ │ ├── shared.py # Shared judge utilities
│ │ ├── default_policy.py
│ │ ├── prbench_policy.py
│ │ ├── lexam_policy.py
│ │ └── apex_policy.py
│ ├── prompting/ # Prompt construction
│ │ └── templates.py # Per-policy prompt templates
│ └── runner/ # Execution engine
│ ├── orchestrator.py # Two-phase orchestrator
│ ├── context.py # Shared runner execution context
│ ├── services.py # Typed service dependency contracts
│ ├── contracts.py # Typed phase handoff contracts
│ ├── row_types.py # Typed artifact row boundaries
│ ├── row_builders.py # Typed output row constructors
│ ├── generation.py # Candidate generation phase
│ ├── judging.py # Judging phase
│ ├── output.py # Artifact writing & summary
│ ├── reconcile.py # Deterministic row key/overlay helpers
│ ├── response_sources.py # Prefilled/previous-output parsers
│ ├── rate_limiter.py # Per-minute rate limiter
│ └── helpers.py # Utility functions
│
├── data/
│ ├── curated/ # Source datasets (JSONL)
│ └── for_eval/ # Canonical merged eval file
│
├── scripts/ # Operational utilities
│ ├── backfill_run.py # Targeted re-run for failed items
│ └── merge_backfill.py # Merge backfill outputs onto base run
│
├── docs/ # Documentation
│ ├── DATA_SCHEMA.md # Canonical JSONL schema spec
│ └── POLICIES.md # Dataset-specific policy docs
│
└── tests/ # Test suite (pytest)
├── core/ # Cache, retry tests
├── data/ # Schema, loader, attachments tests
├── judge/ # MCQ, rubric, policy tests
├── providers/ # Provider adapter tests
├── runner/ # Orchestrator, generation, judging tests
├── runtime/ # Bootstrap tests
├── scripts/ # Backfill script tests
└── setup/ # Setup validation tests
Click to expand BibTeX entries
@misc{scaleai2025prbench,
title = {PRBench: Large-Scale Expert Rubrics for Evaluating
High-Stakes Professional Reasoning},
author = {{Scale AI}},
year = {2025},
howpublished = {\url{https://huggingface.co/datasets/ScaleAI/PRBench}},
note = {Hugging Face dataset card}
}
@misc{mercor2025apexv1extended,
title = {APEX-v1-extended},
author = {{Mercor}},
year = {2025},
howpublished = {\url{https://huggingface.co/datasets/mercor/APEX-v1-extended}},
note = {Hugging Face dataset card}
}
@article{fan2025lexam,
title = {LEXam: Benchmarking Legal Reasoning on 340 Law Exams},
author = {Fan, Yu and Ni, Jingwei and Merane, Jakob and Tian, Yang
and Hermstr{\"u}wer, Yoan and Huang, Yinya and Akhtar,
Mubashara and Salimbeni, Etienne and Geering, Florian
and Dreyer, Oliver and Brunner, Daniel and Leippold, Markus
and Sachan, Mrinmaya and Stremitzer, Alexander and Engel,
Christoph and Ash, Elliott and Niklaus, Joel},
journal = {arXiv preprint arXiv:2505.12864},
year = {2025}
}
@article{romanou2024include,
title = {INCLUDE: Evaluating Multilingual Language Understanding
with Regional Knowledge},
author = {Romanou, Angelika and Foroutan, Negar and Sotnikova, Anna
and Chen, Zeming and Nelaturu, Sree Harsha and Singh, Shivalika
and Maheshwary, Rishabh and Altomare, Micol and Haggag,
Mohamed A and Amayuelas, Alfonso and others},
journal = {arXiv preprint arXiv:2411.19799},
year = {2024}
}
@inproceedings{chlapanis-etal-2024-lar,
title = {LAR-ECHR: A New Legal Argument Reasoning Task and Dataset
for Cases of the European Court of Human Rights},
author = {Chlapanis, Odysseas S. and Galanis, Dimitrios
and Androutsopoulos, Ion},
booktitle = {Proceedings of the Natural Legal Language Processing
Workshop 2024},
year = {2024},
address = {Miami, FL, USA},
publisher = {Association for Computational Linguistics},
url = {https://aclanthology.org/2024.nllp-1.22/},
doi = {10.18653/v1/2024.nllp-1.22},
pages = {267--279}
}Note:
PRBenchandAPEX-v1-extendedare cited as@miscentries because their Hugging Face dataset cards do not publish a BibTeX block.
| Symptom | Solution |
|---|---|
| Provider env var missing | Set the variable named in providers.<name>.api_key_env in your .env file |
| No examples selected | Verify data.datasets[*].enabled is true and check split_field, split_value, and limit settings |
| Module import errors | Run from the repo root so src is importable (e.g., uv run python run.py) |
The source code in this repository is licensed under the MIT License.
Data licensing disclaimer: The curated evaluation data included in
data/curated/anddata/for_eval/is derived from the upstream datasets listed in Dataset Citations. That data is not covered by this repository's MIT license. Each upstream dataset remains subject to its own license and terms of use; review the corresponding Hugging Face dataset cards before any redistribution or commercial use.