Benchmark for long-list entity extraction from semi-structured documents under complex layouts and OCR noise, inspired by recurring patterns observed in real-world claims documents.
This benchmark was developed at Kay.ai.
# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
python -m pip install -r benchmarks/requirements.txt
python -m playwright install chromium
# Set API keys (only needed for OCR/evaluation runs)
cp .env.example .env
# Generate the complete benchmark dataset
# This writes JSON, HTML, PDF, and canonical transcript files.
python benchmarks/generate_claims_benchmark.py
Convenience targets are provided via the repository root Makefile:
make help
# Create venv + install deps + install Playwright Chromium
make setup
# Generate synthetic benchmark dataset (PDF/HTML/JSON)
make generate
# Build the paper
make paper
See benchmarks/README.md for benchmark documentation.
- Version: see VERSION.
- Citation metadata: see CITATION.cff.
- 80 benchmark instances across 4 difficulty tiers × 2 formats
- 2,700 base claims across all instances (some instances include additional rows due to large_doc and duplicates)
- 7 implemented problem types approximating common long-list failure modes
- 2 document formats (detailed and table views)
- Ground truth annotations in JSON format
- Canonical transcripts derived from rendered HTML
- OCR transcripts derived from page-image OCR
| Code | Meaning |
|---|---|
| page_breaks | Detailed documents can split one incident across pages; table documents insert row-boundary page breaks with repeated table headers. |
| multi_row | Key fields (especially descriptions) span multiple lines/rows instead of being single-line. |
| duplicates | Duplicate incidents are inserted (exact repeats) to test deduplication and counting. |
| large_doc | Document is much longer than normal (many more incidents/pages). |
| multiple_tables | Adds additional irrelevant tables/sections mixed in with the main claims content. |
| multi_column | Uses a multi-column layout in detailed-format content and distractor sections to stress reading order. |
| merged_cells | Uses merged table cells (e.g. rowspan/colspan) to make table structure harder. |
The strongest page_breaks and multi_column effects are format-dependent: detailed documents receive split-record page breaks and multi-column primary content, while table documents keep the main claims table single-span.
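As an illustration of what the duplicates problem type asks of an extractor, the following minimal sketch counts exact repeats in a list of extracted rows. The field names (`incident_id`, `amount`) and the sample rows are hypothetical and are not taken from the benchmark's JSON schema; the sketch also assumes all field values are hashable strings.

```python
from collections import Counter


def count_exact_duplicates(rows):
    """Return (unique_count, extra_copies) for a list of flat dict rows."""
    # Freeze each row into a hashable key that ignores field order.
    keys = [tuple(sorted(row.items())) for row in rows]
    counts = Counter(keys)
    unique_count = len(counts)
    # Every occurrence beyond the first counts as an extra copy.
    extra_copies = sum(c - 1 for c in counts.values())
    return unique_count, extra_copies


# Hypothetical rows; the third is an exact repeat of the first.
rows = [
    {"incident_id": "A-1", "amount": "100.00"},
    {"incident_id": "A-2", "amount": "250.00"},
    {"incident_id": "A-1", "amount": "100.00"},
]
print(count_exact_duplicates(rows))  # (2, 1)
```

Whether repeated rows should be kept or collapsed in the final output depends on the ground truth annotations, so a check like this is only a diagnostic, not a deduplication policy.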
| Tier | Seed Claims/PDF | Released Rows/Doc | Instances | Formats | Problems |
|---|---|---|---|---|---|
| Easy | 10 | 10-11 | 15×2 = 30 | Detailed + Table | 1-2 |
| Medium | 25 | 25-27 | 12×2 = 24 | Detailed + Table | 3-4 |
| Hard | 50 | 55 | 8×2 = 16 | Detailed + Table | 5-6 |
| Extreme | 100 | 500 | 5×2 = 10 | Detailed + Table | All 7 |
The released dataset includes additional rows from duplicates and large_doc. Extreme filenames retain a legacy _100_ seed-count suffix, but every released extreme document contains 500 incidents.
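The headline counts can be cross-checked against the tier table with a few lines of arithmetic. This is only a sanity-check sketch: the tuples are copied from the table above, and the function name `summarize` is illustrative rather than part of the repository.

```python
# (tier, seed claims per document, instances per format, number of formats)
TIERS = [
    ("Easy", 10, 15, 2),
    ("Medium", 25, 12, 2),
    ("Hard", 50, 8, 2),
    ("Extreme", 100, 5, 2),
]


def summarize(tiers):
    """Return (total instances, total seed claims) across all tiers."""
    total_instances = sum(n * fmts for _, _, n, fmts in tiers)
    total_seed_claims = sum(seed * n * fmts for _, seed, n, fmts in tiers)
    return total_instances, total_seed_claims


instances, seed_claims = summarize(TIERS)
print(instances, seed_claims)  # 80 2700
```

The totals match the 80 instances and 2,700 base claims listed earlier; released row counts are higher wherever duplicates or large_doc apply.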
- Detailed: Incident sections with line items and financial breakdowns
- Table: Compact tabular format with all claims in rows
Using the synchronized OCR-condition snapshot from this repository, we highlight two local extraction regimes:
| Regime | Overall weighted micro F1 | Extreme-tier weighted micro F1 |
|---|---|---|
| Full-context one-shot | 27.4% | 5.9% |
| Auto-chunked (longlistbench) | 84.8% | 81.7% |
Moving from full-context one-shot to the local auto-chunked regime improves overall weighted F1 by 57.4 points and extreme-tier weighted F1 by 75.8 points on the same snapshot. The one-shot regime remains strong on easy documents (97.2%), but drops to 74.6% on medium, 44.4% on hard, and 5.9% on extreme. By contrast, the local auto-chunked regime reaches 97.3% weighted F1 on easy, 96.5% on medium, 87.7% on hard, 71.0% on detailed documents overall, and 95.9% on table documents overall. Chunking therefore mitigates the catastrophic long-context failure mode, but the local chunked baseline still leaves substantial residual errors, especially on long detailed documents. The evaluator now also supports direct clean-vs-OCR comparisons by running the same extractor over canonical and ocr transcript conditions.
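For readers unfamiliar with micro-averaged F1, the sketch below pools true positives, false positives, and false negatives across documents before computing a single score. It is a generic illustration, not the evaluator's implementation: the exact matching rule and the weighting behind the "weighted micro F1" figures are not described in this section, and the counts used here are hypothetical.

```python
def micro_f1(per_doc_counts):
    """Pool (tp, fp, fn) tuples over documents, then compute one F1 score."""
    tp = sum(c[0] for c in per_doc_counts)
    fp = sum(c[1] for c in per_doc_counts)
    fn = sum(c[2] for c in per_doc_counts)
    denom = 2 * tp + fp + fn
    # Return 0.0 when there is nothing to score, avoiding division by zero.
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts: a short document extracted well,
# and a long document where most rows were missed.
counts = [(9, 1, 1), (10, 5, 90)]
print(round(micro_f1(counts), 3))  # 0.281
```

Because counts are pooled, a long document with many missed rows dominates the score, which is one reason a collapse on extreme-tier documents pulls the overall figure down so sharply.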
For development and testing, see benchmarks/synthetic/README.md for the synthetic data generator.
Optional: install a pre-commit hook to quickly sanity-check that the paper compiles:
# From the repository root
cp pre-commit .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit
The hook runs a fast LaTeX compile (make quick) in the paper directory; in strict mode it can prevent the commit if compilation fails.
By default, the hook is best-effort and will skip (or warn) when dependencies are missing. To make paper compilation failures block commits, set:
export STRICT_PAPER_COMPILE=1
Manually invoking the hook:
# Test the hook without committing
.git/hooks/pre-commit
Alternatively, run the same check from your virtualenv:
source .venv/bin/activate
make -C paper quick
Note: You can skip the hook for a specific commit using:
git commit --no-verify
LaTeX is only needed if you want to compile the paper locally.
- LaTeX distribution (TeX Live, MacTeX, or similar)
- pdflatex and biber available in your PATH
- See paper/README.md for paper-specific build instructions