Skip to content

MercorEUplatform/EU-Legal-Benchmark

Repository files navigation

⚖️ Legal Benchmark Runner

A config-driven evaluation pipeline for benchmarking LLMs on European legal reasoning tasks.

Python 3.11+ uv License: MIT


Overview

Legal Benchmark Runner evaluates large language models against curated European legal datasets covering professional reasoning, law exams, multilingual MCQs, and human-rights case law. It supports multiple LLM providers, three judging strategies, and produces structured, reproducible run artifacts.

Key Features

  • 🔌 Multi-provider — Route candidate models through NVIDIA NIM, Amazon Bedrock, Mistral, Vercel AI Gateway, or any OpenAI-compatible endpoint via LiteLLM
  • 🧑‍⚖️ Three judging modes — Rubric-based (LLM-graded criteria), reference-answer (LLM comparison), and MCQ (deterministic exact-match)
  • 📊 Structured outputs — Every run produces JSONL artifacts, per-dataset summaries, and a full config snapshot for reproducibility
  • Parallel & rate-limited — Configurable worker pools and per-minute rate limits for both generation and judging
  • 💾 Disk caching — Avoid redundant API calls across re-runs
  • Validated inputs — Canonical schema validation fails fast on malformed data

Table of Contents


Architecture

┌─────────────────────────────────────────────────────────────┐
│                        config.yaml                          │
│         (providers · datasets · candidates · judges)        │
└────────────────────────────┬────────────────────────────────┘
                             │
                    ┌────────▼────────┐
                    │ Config Validation  │
                    │   (src/config.py)  │
                    └────────┬─────────┘
                             │
              ┌──────────────▼──────────────┐
              │       Runner Orchestrator    │
              │  (parallel workers + cache)  │
              └──────┬──────────────┬───────┘
                     │              │
          ┌──────────▼──────┐  ┌───▼──────────────┐
          │   Generation    │  │     Judging       │
          │  (LiteLLM /     │  │  (Rubric / Ref /  │
          │   Google GenAI) │  │   MCQ grading)    │
          └─────────────────┘  └──────────────────┘
                     │              │
              ┌──────▼──────────────▼───────┐
              │     Output Writer            │
              │  outputs/<run_id>/           │
              │  examples · responses ·      │
              │  judgments · summary          │
              └─────────────────────────────┘

Validation ownership is intentionally split:

  • Config-time validation in src/config.py (static enums, required fields, provider compatibility).
  • Runtime validation in runner execution flow (src/runner/orchestrator.py, src/runner/generation.py) for dynamic content and coverage checks.

Datasets

The benchmark input is built from five curated upstream sources, each mapped to a specific judging mode:

Dataset Source Task Type Judging Mode
PRBench ScaleAI/PRBench Rubric QA LLM rubric
APEX-v1 mercor/APEX-v1-extended Rubric QA LLM rubric
LEXam LEXam-Benchmark/LEXam Reference QA / MCQ LLM reference / Exact-match
INCLUDE-Base CohereLabs/include-base-44 MCQ Exact-match
LAR-ECHR AUEB-NLP/lar-echr MCQ Exact-match

All datasets are converted into a canonical legal_eval_v1 JSONL schema (see docs/DATA_SCHEMA.md) and merged into a single evaluation file.
For conversation-style tasks (for example PRBench), canonical rows can include a messages array so generation uses turn-structured chat input directly.


Getting Started

Prerequisites

  • Python ≥ 3.11
  • uv — fast Python package manager
  • API keys for at least one LLM provider (see Configuration → Providers)
  • If any configured model uses the bedrock/... prefix, install boto3 (included by default via uv sync)

Installation

# Clone the repository
git clone https://github.com/<your-org>/legal-benchmark-runner.git
cd legal-benchmark-runner

# Install dependencies
uv sync

# Copy template files
cp .env.example .env
cp config.example.yaml config.yaml

Set up API keys

Edit .env and add the keys for the providers you plan to use:

# Required for your chosen providers (add only what you need)
NVIDIA_API_KEY=your_nvidia_api_key
MISTRAL_API_KEY=your_mistral_api_key
GEMINI_API_KEY=your_gemini_api_key
AWS_BEARER_TOKEN_BEDROCK=your_aws_bedrock_api_key
AI_GATEWAY_API_KEY=your_vercel_ai_gateway_api_key

Build the eval dataset

Merge curated source datasets into the canonical evaluation file:

uv run python build_for_eval.py

This produces data/for_eval/merged_legal_eval_v1.jsonl.

Note: MCQ grading now requires canonical correct_choice_ids in each row. If you have older custom eval files, rebuild them via build_for_eval.py before running.

Verify your setup

uv run python run.py --config config.yaml --check-setup

If everything is configured correctly, you'll see: Setup check passed.
If Bedrock models are configured without boto3, setup check fails fast with an install hint.


Configuration

All runtime behavior is controlled through config.yaml (or any YAML file passed via --config). The config.example.yaml file is a fully annotated template.

Providers

Define credential profiles and routing settings. Each candidate or judge model references a provider by name.

providers:
  nim:
    api_key_env: NVIDIA_API_KEY
    base_url: https://integrate.api.nvidia.com/v1
    timeout_s: 180

  bedrock:
    api_key_env: AWS_BEARER_TOKEN_BEDROCK
    timeout_s: 180

  google_genai:
    api_key_env: GEMINI_API_KEY
    timeout_s: 120

Candidates & Judges

candidates:
  - name: bedrock_claude_sonnet_4_5
    provider: bedrock
    model: bedrock/anthropic.claude-sonnet-4-5-20250929-v1:0
    temperature: 0.2
    max_tokens: 4096

judges:
  - name: judge_gemini_flash_lite
    provider: google_genai
    model: gemini-flash-lite-latest
    temperature: 0.0
    max_tokens: 700

Runtime controls

Parameter Description Default
response_parallel_workers Parallel candidate generation workers 8
response_rate_limit_rpm Shared RPM throttle for generation (1–50) 50
provider_response_rate_limit_rpm Optional per-provider generation RPM overrides (1–50) {}
final_response_source Candidate answer source: sampled, prefilled, or part_of_conversation sampled
prefilled_responses_path JSONL path used when final_response_source=prefilled null
previous_output_path Path used when final_response_source=part_of_conversation (.jsonl or .json) null
response_api Strict sampled-generation API mode: responses or chat.completions chat.completions
use_scratchpad Append dataset scratchpad metadata to generation prompt false
web_search Inject lightweight web-search hint in request extra body false
judge_parallel_workers Parallel judge workers per response 4
judge_rate_limit_rpm RPM throttle for judge calls (0 = off) 12
include_raw_provider_response Include raw provider SDK response in trace rows false

Example provider-specific throttle:

run:
  response_rate_limit_rpm: 50
  provider_response_rate_limit_rpm:
    nim: 20

Prefilled response file contract (run.final_response_source=prefilled): one JSON object per line with example_id, candidate_name, and response_text (string). The run fails fast if any selected pair is missing.

part_of_conversation contract (run.final_response_source=part_of_conversation): read previous responses from .jsonl rows with example_id, candidate_name, response_text (same shape as responses.jsonl), or from .json with either a list of those objects or a simple {example_id: response_text} mapping when exactly one candidate is configured.

response_api provider support:

  • google_genai: chat.completions only.
  • LiteLLM-routed providers (nim, bedrock, mistral_api, vercel_gateway, etc.): chat.completions and responses.
  • Unsupported combinations fail fast during config/setup validation.

Usage

# Full benchmark run
uv run python run.py --config config.yaml

# Smoke test with 5 examples
uv run python run.py --config config.yaml --limit 5

# Validate setup without running
uv run python run.py --config config.yaml --check-setup

# Disable progress output
uv run python run.py --config config.yaml --progress off

CLI Reference

Flag Description
--config PATH Path to YAML config file (default: config.example.yaml)
--limit N Cap total examples across all datasets
--progress {log,off} Progress output mode
--check-setup Validate environment and exit

Backfill Workflow

When a run completes with failures (generation errors, parse errors, or empty responses), you can repair it without re-running successful items:

# 1. Run targeted backfill for failed items
uv run python scripts/backfill_run.py \
  --config config.yaml \
  --base-run-id <original_run_id> \
  --include-failed-generation \
  --include-parse-errors

# 2. Merge backfill results onto the original run
uv run python scripts/merge_backfill.py \
  --base-run-id <original_run_id> \
  --backfill-run-id <backfill_run_id>

The merge script overlays backfill rows onto the base run outputs (keyed by example_id + candidate_name), producing a repaired run with updated summary statistics.


Output Artifacts

Each run writes files to data/runs/<run_id>/outputs/ by default:

File Description
examples.jsonl Normalized examples selected for the run
responses.jsonl Candidate model outputs with request metadata
judgments.jsonl Grading outputs (score, pass/fail, criteria, rationale)
scored_responses.jsonl Merged response + judgment rows
trace.jsonl Per-call trace data for debugging
summary.json Aggregate metrics (overall and per-dataset)
run_config.json Resolved config snapshot for reproducibility

PRBench scoring note: For PRBench rubric rows, judgments.jsonl and scored_responses.jsonl include additional parity-oriented aggregation fields: prbench_weighted_raw, prbench_points_normalized, and prbench_points_clipped.


Testing

The project includes a comprehensive test suite covering schema validation, MCQ grading, rubric aggregation, prompt policies, runner progress, and rate limiting.

# Run the full test suite
uv run pytest

# Run a specific test file
uv run pytest tests/judge/test_mcq_grading.py -v

# Run with output
uv run pytest -s

Project Structure

legal-benchmark-runner/
├── run.py                    # CLI entry point — runs the evaluation pipeline
├── build_for_eval.py         # Merges curated datasets into canonical eval file
├── config.example.yaml       # Annotated example configuration
├── .env.example              # Template for API keys
├── pyproject.toml            # Project metadata & dependencies
│
├── src/                      # Core library
│   ├── config.py             # Config parsing & validation
│   ├── types.py              # Shared data types
│   ├── cache.py              # Disk-based response cache
│   ├── retry.py              # Retry logic with exponential backoff
│   ├── setup_checks.py       # Environment validation
│   ├── io/                   # Shared JSON/JSONL IO helpers
│   │   └── json_io.py        # read_json/read_jsonl/write_json/write_jsonl
│   ├── runtime/              # Shared runtime bootstrap helpers
│   │   └── bootstrap.py      # dotenv bootstrap helper
│   ├── data/                 # Data loading, schema, attachments
│   │   ├── schema.py         # Canonical JSONL schema validator
│   │   ├── loader.py         # Dataset loader
│   │   ├── build_for_eval.py # Dataset merge/build logic
│   │   ├── policies.py       # Dataset-specific prompting policies
│   │   └── attachments.py    # PDF/file attachment extraction
│   ├── providers/            # LLM provider adapters
│   │   ├── base.py           # Abstract provider interface
│   │   ├── litellm.py        # LiteLLM adapter (OpenAI-compatible)
│   │   └── google_genai.py   # Google GenAI native adapter
│   ├── judge/                # Judging & grading
│   │   ├── judge.py          # Rubric & reference judging
│   │   ├── mcq.py            # Deterministic MCQ grading
│   │   ├── parsing.py        # Judge output parsing
│   │   └── policies/         # Policy-specific judge handlers
│   │       ├── base.py       # JudgePolicyHandler protocol
│   │       ├── registry.py   # Policy handler lookup
│   │       ├── shared.py     # Shared judge utilities
│   │       ├── default_policy.py
│   │       ├── prbench_policy.py
│   │       ├── lexam_policy.py
│   │       └── apex_policy.py
│   ├── prompting/            # Prompt construction
│   │   └── templates.py      # Per-policy prompt templates
│   └── runner/               # Execution engine
│       ├── orchestrator.py   # Two-phase orchestrator
│       ├── context.py        # Shared runner execution context
│       ├── services.py       # Typed service dependency contracts
│       ├── contracts.py      # Typed phase handoff contracts
│       ├── row_types.py      # Typed artifact row boundaries
│       ├── row_builders.py   # Typed output row constructors
│       ├── generation.py     # Candidate generation phase
│       ├── judging.py        # Judging phase
│       ├── output.py         # Artifact writing & summary
│       ├── reconcile.py      # Deterministic row key/overlay helpers
│       ├── response_sources.py # Prefilled/previous-output parsers
│       ├── rate_limiter.py   # Per-minute rate limiter
│       └── helpers.py        # Utility functions
│
├── data/
│   ├── curated/              # Source datasets (JSONL)
│   └── for_eval/             # Canonical merged eval file
│
├── scripts/                  # Operational utilities
│   ├── backfill_run.py       # Targeted re-run for failed items
│   └── merge_backfill.py     # Merge backfill outputs onto base run
│
├── docs/                     # Documentation
│   ├── DATA_SCHEMA.md        # Canonical JSONL schema spec
│   └── POLICIES.md           # Dataset-specific policy docs
│
└── tests/                    # Test suite (pytest)
    ├── core/                 # Cache, retry tests
    ├── data/                 # Schema, loader, attachments tests
    ├── judge/                # MCQ, rubric, policy tests
    ├── providers/            # Provider adapter tests
    ├── runner/               # Orchestrator, generation, judging tests
    ├── runtime/              # Bootstrap tests
    ├── scripts/              # Backfill script tests
    └── setup/                # Setup validation tests

Dataset Citations

Click to expand BibTeX entries
@misc{scaleai2025prbench,
  title        = {PRBench: Large-Scale Expert Rubrics for Evaluating
                  High-Stakes Professional Reasoning},
  author       = {{Scale AI}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/datasets/ScaleAI/PRBench}},
  note         = {Hugging Face dataset card}
}

@misc{mercor2025apexv1extended,
  title        = {APEX-v1-extended},
  author       = {{Mercor}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/datasets/mercor/APEX-v1-extended}},
  note         = {Hugging Face dataset card}
}

@article{fan2025lexam,
  title   = {LEXam: Benchmarking Legal Reasoning on 340 Law Exams},
  author  = {Fan, Yu and Ni, Jingwei and Merane, Jakob and Tian, Yang
             and Hermstr{\"u}wer, Yoan and Huang, Yinya and Akhtar,
             Mubashara and Salimbeni, Etienne and Geering, Florian
             and Dreyer, Oliver and Brunner, Daniel and Leippold, Markus
             and Sachan, Mrinmaya and Stremitzer, Alexander and Engel,
             Christoph and Ash, Elliott and Niklaus, Joel},
  journal = {arXiv preprint arXiv:2505.12864},
  year    = {2025}
}

@article{romanou2024include,
  title   = {INCLUDE: Evaluating Multilingual Language Understanding
             with Regional Knowledge},
  author  = {Romanou, Angelika and Foroutan, Negar and Sotnikova, Anna
             and Chen, Zeming and Nelaturu, Sree Harsha and Singh, Shivalika
             and Maheshwary, Rishabh and Altomare, Micol and Haggag,
             Mohamed A and Amayuelas, Alfonso and others},
  journal = {arXiv preprint arXiv:2411.19799},
  year    = {2024}
}

@inproceedings{chlapanis-etal-2024-lar,
  title     = {LAR-ECHR: A New Legal Argument Reasoning Task and Dataset
               for Cases of the European Court of Human Rights},
  author    = {Chlapanis, Odysseas S. and Galanis, Dimitrios
               and Androutsopoulos, Ion},
  booktitle = {Proceedings of the Natural Legal Language Processing
               Workshop 2024},
  year      = {2024},
  address   = {Miami, FL, USA},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.nllp-1.22/},
  doi       = {10.18653/v1/2024.nllp-1.22},
  pages     = {267--279}
}

Note: PRBench and APEX-v1-extended are cited as @misc entries because their Hugging Face dataset cards do not publish a BibTeX block.


Troubleshooting

Symptom Solution
Provider env var missing Set the variable named in providers.<name>.api_key_env in your .env file
No examples selected Verify data.datasets[*].enabled is true and check split_field, split_value, and limit settings
Module import errors Run from the repo root so src is importable (e.g., uv run python run.py)

License

The source code in this repository is licensed under the MIT License.

Data licensing disclaimer: The curated evaluation data included in data/curated/ and data/for_eval/ is derived from the upstream datasets listed in Dataset Citations. That data is not covered by this repository's MIT license. Each upstream dataset remains subject to its own license and terms of use; review the corresponding Hugging Face dataset cards before any redistribution or commercial use.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages