⚖️ Legal Benchmark Runner

A config-driven evaluation pipeline for benchmarking LLMs on European legal reasoning tasks.

Overview

Legal Benchmark Runner evaluates large language models against curated European legal datasets covering professional reasoning, law exams, multilingual MCQs, and human-rights case law. It supports multiple LLM providers, three judging strategies, and produces structured, reproducible run artifacts.

Key Features

🔌 Multi-provider — Route candidate models through NVIDIA NIM, Amazon Bedrock, Mistral, Vercel AI Gateway, or any OpenAI-compatible endpoint via LiteLLM
🧑‍⚖️ Three judging modes — Rubric-based (LLM-graded criteria), reference-answer (LLM comparison), and MCQ (deterministic exact-match)
📊 Structured outputs — Every run produces JSONL artifacts, per-dataset summaries, and a full config snapshot for reproducibility
⚡ Parallel & rate-limited — Configurable worker pools and per-minute rate limits for both generation and judging
💾 Disk caching — Avoid redundant API calls across re-runs
✅ Validated inputs — Canonical schema validation fails fast on malformed data

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        config.yaml                          │
│         (providers · datasets · candidates · judges)        │
└────────────────────────────┬────────────────────────────────┘
                             │
                    ┌────────▼────────┐
                    │ Config Validation  │
                    │   (src/config.py)  │
                    └────────┬─────────┘
                             │
              ┌──────────────▼──────────────┐
              │       Runner Orchestrator    │
              │  (parallel workers + cache)  │
              └──────┬──────────────┬───────┘
                     │              │
          ┌──────────▼──────┐  ┌───▼──────────────┐
          │   Generation    │  │     Judging       │
          │  (LiteLLM /     │  │  (Rubric / Ref /  │
          │   Google GenAI) │  │   MCQ grading)    │
          └─────────────────┘  └──────────────────┘
                     │              │
              ┌──────▼──────────────▼───────┐
              │     Output Writer            │
              │  outputs/<run_id>/           │
              │  examples · responses ·      │
              │  judgments · summary          │
              └─────────────────────────────┘

Validation ownership is intentionally split:

Config-time validation in src/config.py (static enums, required fields, provider compatibility).
Runtime validation in runner execution flow (src/runner/orchestrator.py, src/runner/generation.py) for dynamic content and coverage checks.

Datasets

The benchmark input is built from five curated upstream sources, each mapped to a specific judging mode:

Dataset	Source	Task Type	Judging Mode
PRBench	ScaleAI/PRBench	Rubric QA	LLM rubric
APEX-v1	mercor/APEX-v1-extended	Rubric QA	LLM rubric
LEXam	LEXam-Benchmark/LEXam	Reference QA / MCQ	LLM reference / Exact-match
INCLUDE-Base	CohereLabs/include-base-44	MCQ	Exact-match
LAR-ECHR	AUEB-NLP/lar-echr	MCQ	Exact-match

All datasets are converted into a canonical legal_eval_v1 JSONL schema (see docs/DATA_SCHEMA.md) and merged into a single evaluation file.
For conversation-style tasks (for example PRBench), canonical rows can include a messages array so generation uses turn-structured chat input directly.

Getting Started

Prerequisites

Python ≥ 3.11
uv — fast Python package manager
API keys for at least one LLM provider (see Configuration → Providers)
If any configured model uses the bedrock/... prefix, install boto3 (included by default via uv sync)

Installation

# Clone the repository
git clone https://github.com/<your-org>/legal-benchmark-runner.git
cd legal-benchmark-runner

# Install dependencies
uv sync

# Copy template files
cp .env.example .env
cp config.example.yaml config.yaml

Set up API keys

Edit .env and add the keys for the providers you plan to use:

# Required for your chosen providers (add only what you need)
NVIDIA_API_KEY=your_nvidia_api_key
MISTRAL_API_KEY=your_mistral_api_key
GEMINI_API_KEY=your_gemini_api_key
AWS_BEARER_TOKEN_BEDROCK=your_aws_bedrock_api_key
AI_GATEWAY_API_KEY=your_vercel_ai_gateway_api_key

Build the eval dataset

Merge curated source datasets into the canonical evaluation file:

uv run python build_for_eval.py

This produces data/for_eval/merged_legal_eval_v1.jsonl.

Note: MCQ grading now requires canonical correct_choice_ids in each row. If you have older custom eval files, rebuild them via build_for_eval.py before running.

Verify your setup

uv run python run.py --config config.yaml --check-setup

If everything is configured correctly, you'll see: Setup check passed.
If Bedrock models are configured without boto3, setup check fails fast with an install hint.

Configuration

All runtime behavior is controlled through config.yaml (or any YAML file passed via --config). The config.example.yaml file is a fully annotated template.

Providers

Define credential profiles and routing settings. Each candidate or judge model references a provider by name.

providers:
  nim:
    api_key_env: NVIDIA_API_KEY
    base_url: https://integrate.api.nvidia.com/v1
    timeout_s: 180

  bedrock:
    api_key_env: AWS_BEARER_TOKEN_BEDROCK
    timeout_s: 180

  google_genai:
    api_key_env: GEMINI_API_KEY
    timeout_s: 120

Candidates & Judges

candidates:
  - name: bedrock_claude_sonnet_4_5
    provider: bedrock
    model: bedrock/anthropic.claude-sonnet-4-5-20250929-v1:0
    temperature: 0.2
    max_tokens: 4096

judges:
  - name: judge_gemini_flash_lite
    provider: google_genai
    model: gemini-flash-lite-latest
    temperature: 0.0
    max_tokens: 700

Runtime controls

Parameter	Description	Default
`response_parallel_workers`	Parallel candidate generation workers	`8`
`response_rate_limit_rpm`	Shared RPM throttle for generation (1–50)	`50`
`provider_response_rate_limit_rpm`	Optional per-provider generation RPM overrides (1–50)	`{}`
`final_response_source`	Candidate answer source: `sampled`, `prefilled`, or `part_of_conversation`	`sampled`
`prefilled_responses_path`	JSONL path used when `final_response_source=prefilled`	`null`
`previous_output_path`	Path used when `final_response_source=part_of_conversation` (`.jsonl` or `.json`)	`null`
`response_api`	Strict sampled-generation API mode: `responses` or `chat.completions`	`chat.completions`
`use_scratchpad`	Append dataset scratchpad metadata to generation prompt	`false`
`web_search`	Inject lightweight web-search hint in request extra body	`false`
`judge_parallel_workers`	Parallel judge workers per response	`4`
`judge_rate_limit_rpm`	RPM throttle for judge calls (0 = off)	`12`
`include_raw_provider_response`	Include raw provider SDK response in trace rows	`false`

Example provider-specific throttle:

run:
  response_rate_limit_rpm: 50
  provider_response_rate_limit_rpm:
    nim: 20

Prefilled response file contract (run.final_response_source=prefilled): one JSON object per line with example_id, candidate_name, and response_text (string). The run fails fast if any selected pair is missing.

part_of_conversation contract (run.final_response_source=part_of_conversation): read previous responses from .jsonl rows with example_id, candidate_name, response_text (same shape as responses.jsonl), or from .json with either a list of those objects or a simple {example_id: response_text} mapping when exactly one candidate is configured.

response_api provider support:

google_genai: chat.completions only.
LiteLLM-routed providers (nim, bedrock, mistral_api, vercel_gateway, etc.): chat.completions and responses.
Unsupported combinations fail fast during config/setup validation.

Usage

# Full benchmark run
uv run python run.py --config config.yaml

# Smoke test with 5 examples
uv run python run.py --config config.yaml --limit 5

# Validate setup without running
uv run python run.py --config config.yaml --check-setup

# Disable progress output
uv run python run.py --config config.yaml --progress off

CLI Reference

Flag	Description
`--config PATH`	Path to YAML config file (default: `config.example.yaml`)
`--limit N`	Cap total examples across all datasets
`--progress {log,off}`	Progress output mode
`--check-setup`	Validate environment and exit

Backfill Workflow

When a run completes with failures (generation errors, parse errors, or empty responses), you can repair it without re-running successful items:

# 1. Run targeted backfill for failed items
uv run python scripts/backfill_run.py \
  --config config.yaml \
  --base-run-id <original_run_id> \
  --include-failed-generation \
  --include-parse-errors

# 2. Merge backfill results onto the original run
uv run python scripts/merge_backfill.py \
  --base-run-id <original_run_id> \
  --backfill-run-id <backfill_run_id>

The merge script overlays backfill rows onto the base run outputs (keyed by example_id + candidate_name), producing a repaired run with updated summary statistics.

Output Artifacts

Each run writes files to data/runs/<run_id>/outputs/ by default:

File	Description
`examples.jsonl`	Normalized examples selected for the run
`responses.jsonl`	Candidate model outputs with request metadata
`judgments.jsonl`	Grading outputs (score, pass/fail, criteria, rationale)
`scored_responses.jsonl`	Merged response + judgment rows
`trace.jsonl`	Per-call trace data for debugging
`summary.json`	Aggregate metrics (overall and per-dataset)
`run_config.json`	Resolved config snapshot for reproducibility

PRBench scoring note: For PRBench rubric rows, judgments.jsonl and scored_responses.jsonl include additional parity-oriented aggregation fields: prbench_weighted_raw, prbench_points_normalized, and prbench_points_clipped.

Testing

The project includes a comprehensive test suite covering schema validation, MCQ grading, rubric aggregation, prompt policies, runner progress, and rate limiting.

# Run the full test suite
uv run pytest

# Run a specific test file
uv run pytest tests/judge/test_mcq_grading.py -v

# Run with output
uv run pytest -s

Project Structure

legal-benchmark-runner/
├── run.py                    # CLI entry point — runs the evaluation pipeline
├── build_for_eval.py         # Merges curated datasets into canonical eval file
├── config.example.yaml       # Annotated example configuration
├── .env.example              # Template for API keys
├── pyproject.toml            # Project metadata & dependencies
│
├── src/                      # Core library
│   ├── config.py             # Config parsing & validation
│   ├── types.py              # Shared data types
│   ├── cache.py              # Disk-based response cache
│   ├── retry.py              # Retry logic with exponential backoff
│   ├── setup_checks.py       # Environment validation
│   ├── io/                   # Shared JSON/JSONL IO helpers
│   │   └── json_io.py        # read_json/read_jsonl/write_json/write_jsonl
│   ├── runtime/              # Shared runtime bootstrap helpers
│   │   └── bootstrap.py      # dotenv bootstrap helper
│   ├── data/                 # Data loading, schema, attachments
│   │   ├── schema.py         # Canonical JSONL schema validator
│   │   ├── loader.py         # Dataset loader
│   │   ├── build_for_eval.py # Dataset merge/build logic
│   │   ├── policies.py       # Dataset-specific prompting policies
│   │   └── attachments.py    # PDF/file attachment extraction
│   ├── providers/            # LLM provider adapters
│   │   ├── base.py           # Abstract provider interface
│   │   ├── litellm.py        # LiteLLM adapter (OpenAI-compatible)
│   │   └── google_genai.py   # Google GenAI native adapter
│   ├── judge/                # Judging & grading
│   │   ├── judge.py          # Rubric & reference judging
│   │   ├── mcq.py            # Deterministic MCQ grading
│   │   ├── parsing.py        # Judge output parsing
│   │   └── policies/         # Policy-specific judge handlers
│   │       ├── base.py       # JudgePolicyHandler protocol
│   │       ├── registry.py   # Policy handler lookup
│   │       ├── shared.py     # Shared judge utilities
│   │       ├── default_policy.py
│   │       ├── prbench_policy.py
│   │       ├── lexam_policy.py
│   │       └── apex_policy.py
│   ├── prompting/            # Prompt construction
│   │   └── templates.py      # Per-policy prompt templates
│   └── runner/               # Execution engine
│       ├── orchestrator.py   # Two-phase orchestrator
│       ├── context.py        # Shared runner execution context
│       ├── services.py       # Typed service dependency contracts
│       ├── contracts.py      # Typed phase handoff contracts
│       ├── row_types.py      # Typed artifact row boundaries
│       ├── row_builders.py   # Typed output row constructors
│       ├── generation.py     # Candidate generation phase
│       ├── judging.py        # Judging phase
│       ├── output.py         # Artifact writing & summary
│       ├── reconcile.py      # Deterministic row key/overlay helpers
│       ├── response_sources.py # Prefilled/previous-output parsers
│       ├── rate_limiter.py   # Per-minute rate limiter
│       └── helpers.py        # Utility functions
│
├── data/
│   ├── curated/              # Source datasets (JSONL)
│   └── for_eval/             # Canonical merged eval file
│
├── scripts/                  # Operational utilities
│   ├── backfill_run.py       # Targeted re-run for failed items
│   └── merge_backfill.py     # Merge backfill outputs onto base run
│
├── docs/                     # Documentation
│   ├── DATA_SCHEMA.md        # Canonical JSONL schema spec
│   └── POLICIES.md           # Dataset-specific policy docs
│
└── tests/                    # Test suite (pytest)
    ├── core/                 # Cache, retry tests
    ├── data/                 # Schema, loader, attachments tests
    ├── judge/                # MCQ, rubric, policy tests
    ├── providers/            # Provider adapter tests
    ├── runner/               # Orchestrator, generation, judging tests
    ├── runtime/              # Bootstrap tests
    ├── scripts/              # Backfill script tests
    └── setup/                # Setup validation tests

Dataset Citations

Click to expand BibTeX entries

@misc{scaleai2025prbench,
  title        = {PRBench: Large-Scale Expert Rubrics for Evaluating
                  High-Stakes Professional Reasoning},
  author       = {{Scale AI}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/datasets/ScaleAI/PRBench}},
  note         = {Hugging Face dataset card}
}

@misc{mercor2025apexv1extended,
  title        = {APEX-v1-extended},
  author       = {{Mercor}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/datasets/mercor/APEX-v1-extended}},
  note         = {Hugging Face dataset card}
}

@article{fan2025lexam,
  title   = {LEXam: Benchmarking Legal Reasoning on 340 Law Exams},
  author  = {Fan, Yu and Ni, Jingwei and Merane, Jakob and Tian, Yang
             and Hermstr{\"u}wer, Yoan and Huang, Yinya and Akhtar,
             Mubashara and Salimbeni, Etienne and Geering, Florian
             and Dreyer, Oliver and Brunner, Daniel and Leippold, Markus
             and Sachan, Mrinmaya and Stremitzer, Alexander and Engel,
             Christoph and Ash, Elliott and Niklaus, Joel},
  journal = {arXiv preprint arXiv:2505.12864},
  year    = {2025}
}

@article{romanou2024include,
  title   = {INCLUDE: Evaluating Multilingual Language Understanding
             with Regional Knowledge},
  author  = {Romanou, Angelika and Foroutan, Negar and Sotnikova, Anna
             and Chen, Zeming and Nelaturu, Sree Harsha and Singh, Shivalika
             and Maheshwary, Rishabh and Altomare, Micol and Haggag,
             Mohamed A and Amayuelas, Alfonso and others},
  journal = {arXiv preprint arXiv:2411.19799},
  year    = {2024}
}

@inproceedings{chlapanis-etal-2024-lar,
  title     = {LAR-ECHR: A New Legal Argument Reasoning Task and Dataset
               for Cases of the European Court of Human Rights},
  author    = {Chlapanis, Odysseas S. and Galanis, Dimitrios
               and Androutsopoulos, Ion},
  booktitle = {Proceedings of the Natural Legal Language Processing
               Workshop 2024},
  year      = {2024},
  address   = {Miami, FL, USA},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.nllp-1.22/},
  doi       = {10.18653/v1/2024.nllp-1.22},
  pages     = {267--279}
}

Note: PRBench and APEX-v1-extended are cited as @misc entries because their Hugging Face dataset cards do not publish a BibTeX block.

Troubleshooting

Symptom	Solution
Provider env var missing	Set the variable named in `providers.<name>.api_key_env` in your `.env` file
No examples selected	Verify `data.datasets[*].enabled` is `true` and check `split_field`, `split_value`, and `limit` settings
Module import errors	Run from the repo root so `src` is importable (e.g., `uv run python run.py`)

License

The source code in this repository is licensed under the MIT License.

Data licensing disclaimer: The curated evaluation data included in data/curated/ and data/for_eval/ is derived from the upstream datasets listed in Dataset Citations. That data is not covered by this repository's MIT license. Each upstream dataset remains subject to its own license and terms of use; review the corresponding Hugging Face dataset cards before any redistribution or commercial use.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

⚖️ Legal Benchmark Runner

Overview

Key Features

Table of Contents

Architecture

Datasets

Getting Started

Prerequisites

Installation

Set up API keys

Build the eval dataset

Verify your setup

Configuration

Providers

Candidates & Judges

Runtime controls

Usage

CLI Reference

Backfill Workflow

Output Artifacts

Testing

Project Structure

Dataset Citations

Troubleshooting

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
data		data
docs		docs
scripts		scripts
src		src
tests		tests
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build_for_eval.py		build_for_eval.py
config.example.yaml		config.example.yaml
pyproject.toml		pyproject.toml
run.py		run.py
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

⚖️ Legal Benchmark Runner

Overview

Key Features

Table of Contents

Architecture

Datasets

Getting Started

Prerequisites

Installation

Set up API keys

Build the eval dataset

Verify your setup

Configuration

Providers

Candidates & Judges

Runtime controls

Usage

CLI Reference

Backfill Workflow

Output Artifacts

Testing

Project Structure

Dataset Citations

Troubleshooting

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages