Minimal benchmark harness to compare PDF -> Markdown extraction across providers using a shared contract and common output format.
- Reads PDFs from sample_pdfs/
- Runs one or more providers (Mistral first)
- Saves markdown outputs under output/runs/<run_id>/ with provider/model-prefixed filenames
- Writes run metrics to output/runs/<run_id>/metrics.txt
- Appends benchmark lines to output/metrics.txt across all runs
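The "shared contract" idea can be illustrated with a small sketch. Everything here is an assumption made for illustration: the names `ExtractionResult`, `Provider`, `EchoProvider`, and `run_all` do not come from this project, and the real interface may look different.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Protocol


@dataclass
class ExtractionResult:
    """Hypothetical common output format returned by every provider."""
    provider: str
    model: str
    markdown: str
    pages: Optional[int] = None
    tokens: Optional[int] = None
    credits: Optional[float] = None
    cost_usd: Optional[float] = None


class Provider(Protocol):
    """Hypothetical shared contract: turn one PDF into an ExtractionResult."""
    name: str

    def extract(self, pdf_path: Path) -> ExtractionResult: ...


class EchoProvider:
    """Stand-in provider that calls no API; it only demonstrates the contract."""
    name = "echo"

    def extract(self, pdf_path: Path) -> ExtractionResult:
        return ExtractionResult(
            provider=self.name, model="none", markdown=f"# {pdf_path.name}\n"
        )


def run_all(providers: list[Provider], pdf_path: Path) -> list[ExtractionResult]:
    # Every provider is called the same way, so results share one shape.
    return [p.extract(pdf_path) for p in providers]
```

Fields such as `tokens` and `credits` are optional in this sketch because not every provider reports them.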
From project root:
uv init
uv add mistralai
uv add requests
uv add python-dotenv
uv add --dev pytest
If you want to add more providers later, install their SDKs similarly with uv add ....
Copy .env.example to .env and set values:
- MISTRAL_API_KEY: Required for Mistral OCR.
- MISTRAL_OCR_MODEL: Optional, defaults to mistral-ocr-latest.
- MISTRAL_USD_PER_1000_PAGES: Optional price config for Mistral cost estimation. Default: 2.
- LANDING_AI_API_KEY: Required for Landing AI ADE Parse.
- LANDING_AI_PARSE_URL: Optional endpoint override. Default: https://api.va.landing.ai/v1/ade/parse.
- LANDING_AI_MODEL: Optional ADE Parse model override.
- LANDING_AI_SPLIT: Optional split mode (page).
- LANDING_AI_CREDIT_TO_USD: Optional conversion ratio for estimated cost.
- LOG_LEVEL: Optional logging level (INFO by default).
The app auto-loads .env from project root when you run python -m main.
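As a rough illustration of how these variables and their documented defaults might be read, here is a minimal sketch. The `Settings` class and `load_settings` function are hypothetical names, not part of this project, and the sketch reads from a plain mapping rather than performing the .env loading itself.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for a subset of the documented variables."""
    mistral_api_key: Optional[str]
    mistral_ocr_model: str
    mistral_usd_per_1000_pages: float
    landing_ai_api_key: Optional[str]
    landing_ai_parse_url: str
    landing_ai_credit_to_usd: Optional[float]
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables, applying the defaults listed above."""
    credit_ratio = env.get("LANDING_AI_CREDIT_TO_USD")
    return Settings(
        mistral_api_key=env.get("MISTRAL_API_KEY"),
        mistral_ocr_model=env.get("MISTRAL_OCR_MODEL", "mistral-ocr-latest"),
        mistral_usd_per_1000_pages=float(env.get("MISTRAL_USD_PER_1000_PAGES", "2")),
        landing_ai_api_key=env.get("LANDING_AI_API_KEY"),
        landing_ai_parse_url=env.get(
            "LANDING_AI_PARSE_URL", "https://api.va.landing.ai/v1/ade/parse"
        ),
        # No default is documented for this ratio, so it stays None when unset.
        landing_ai_credit_to_usd=float(credit_ratio) if credit_ratio else None,
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Passing the environment in as a mapping keeps the function easy to test with a plain dict.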
Each run output includes a sidecar .json file with a pdf_sha256 field.
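A value like pdf_sha256 can be computed with the standard library. The sketch below is one plausible way to do it, not necessarily how this harness does; the function name and the chunk size are assumptions.

```python
import hashlib
from pathlib import Path


def pdf_sha256(pdf_path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with pdf_path.open("rb") as handle:
        # Read fixed-size chunks so large PDFs are never fully held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing the digest next to each output makes it possible to check that two runs processed byte-identical input files.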
Place PDF files in sample_pdfs/, then run:
python -m main --providers mistral --input-dir sample_pdfs --output-dir output
Run only one PDF from that folder:
python -m main --providers mistral --input-dir sample_pdfs --input-file invoice.pdf
Run with Landing AI:
python -m main --providers landing_ai --input-dir sample_pdfs --output-dir output
Multiple providers:
python -m main --providers mistral,landing_ai,openai,gemini,marker
Output layout:
output/
  runs/
    <run_id>/
      <provider>_<model>_<pdf_name>.md
      <provider>_<model>_<pdf_name>.json
      metrics.txt
  metrics.txt
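The naming scheme in this layout can be sketched as a small path helper. The function `output_paths` is hypothetical, and treating <pdf_name> as the file name without its .pdf extension is an assumption; the layout above does not say which form is used.

```python
from pathlib import Path


def output_paths(
    output_dir: Path, run_id: str, provider: str, model: str, pdf_name: str
) -> tuple[Path, Path]:
    """Return the (.md, .json) paths for one provider/PDF pair in a run."""
    # Assumption: <pdf_name> is the PDF file name with the extension removed.
    stem = f"{provider}_{model}_{Path(pdf_name).stem}"
    run_dir = output_dir / "runs" / run_id
    return run_dir / f"{stem}.md", run_dir / f"{stem}.json"


md_path, json_path = output_paths(
    Path("output"), "20260211_141500", "mistral", "mistral-ocr-latest", "invoice.pdf"
)
print(md_path.as_posix())
print(json_path.as_posix())
```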
metrics.txt is append-only and line-based, for example:
run=20260211_141500 provider=mistral pdf=invoice.pdf time=2.300s pages=4 tokens=1234 credits=n/a cost=0.008 model=mistral-ocr-latest
run=20260211_141500 provider=landing_ai pdf=invoice.pdf time=1.842s pages=4 tokens=n/a credits=7.5 cost=0.075 model=default
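Because each line is a series of key=value fields, it is straightforward to parse for later analysis. The sketch below assumes that no value contains a space (a PDF name with spaces would break it); both function names are hypothetical. The cost helper shows arithmetic that is consistent with the Mistral example line at the default of 2 USD per 1000 pages, though the harness's actual formula is not shown here.

```python
def parse_metrics_line(line: str) -> dict[str, str]:
    """Split one metrics line into a dict of raw string fields."""
    # Assumption: fields are space-separated and values contain no spaces.
    return dict(field.split("=", 1) for field in line.split())


def estimate_mistral_cost(pages: int, usd_per_1000_pages: float = 2.0) -> float:
    """Estimate cost in USD from a page count and a per-1000-pages price."""
    return pages / 1000 * usd_per_1000_pages


record = parse_metrics_line(
    "run=20260211_141500 provider=mistral pdf=invoice.pdf time=2.300s "
    "pages=4 tokens=1234 credits=n/a cost=0.008 model=mistral-ocr-latest"
)
print(record["provider"], record["pages"], record["cost"])
print(round(estimate_mistral_cost(int(record["pages"])), 3))
```

Values are kept as strings because fields such as tokens and credits may hold n/a rather than a number.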
Run the tests:
pytest
Build image:
docker build -t ocr-comparison .
Run container:
docker run -it --rm --env-file .env -v "$(pwd)/sample_pdfs:/app/sample_pdfs" -v "$(pwd)/output:/app/output" ocr-comparison