OCR Comparison

Minimal benchmark harness to compare PDF -> Markdown extraction across providers using a shared contract and common output format.

What it does

Reads PDFs from sample_pdfs/
Runs one or more providers (Mistral first)
Saves markdown outputs under output/runs/<run_id>/ with provider/model-prefixed filenames
Writes run metrics to output/runs/<run_id>/metrics.txt
Appends benchmark lines to output/metrics.txt across all runs

Install

UV (recommended)

From project root:

uv init
uv add mistralai
uv add requests
uv add python-dotenv
uv add --dev pytest

To add more providers later, install their SDKs the same way with uv add <package>.

Environment variables

Copy .env.example to .env and set values:

MISTRAL_API_KEY: Required for Mistral OCR.
MISTRAL_OCR_MODEL: Optional, defaults to mistral-ocr-latest.
MISTRAL_USD_PER_1000_PAGES: Optional price config for Mistral cost estimation. Default: 2.
LANDING_AI_API_KEY: Required for Landing AI ADE Parse.
LANDING_AI_PARSE_URL: Optional endpoint override. Default: https://api.va.landing.ai/v1/ade/parse.
LANDING_AI_MODEL: Optional ADE Parse model override.
LANDING_AI_SPLIT: Optional split mode (page).
LANDING_AI_CREDIT_TO_USD: Optional conversion ratio for estimated cost.
LOG_LEVEL: Optional logging level (INFO by default).
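
As a rough illustration of how these settings might be consumed, the sketch below reads the Mistral variables with the defaults documented above and estimates cost from a page count. The function names load_mistral_settings and estimate_mistral_cost are hypothetical and are not taken from this repository; the cost formula (pages / 1000 * price) is an assumption based on the variable name.

```python
import os


def load_mistral_settings(env=None):
    """Read Mistral settings, applying the defaults documented above.

    `env` is any mapping of variable names to strings; it defaults to
    os.environ. This helper is hypothetical, not the project's code.
    """
    env = os.environ if env is None else env
    api_key = env.get("MISTRAL_API_KEY")
    if not api_key:
        raise RuntimeError("MISTRAL_API_KEY is required for Mistral OCR")
    return {
        "api_key": api_key,
        "model": env.get("MISTRAL_OCR_MODEL", "mistral-ocr-latest"),
        "usd_per_1000_pages": float(env.get("MISTRAL_USD_PER_1000_PAGES", "2")),
    }


def estimate_mistral_cost(pages, usd_per_1000_pages):
    """Assumed formula: pages / 1000 * price per 1000 pages (USD)."""
    return pages / 1000 * usd_per_1000_pages
```

With the default price of 2, a 4-page PDF comes to 0.008 under this formula, which agrees with the Mistral example line under Output layout.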

The app auto-loads .env from project root when you run python -m main. Each run output includes a sidecar .json file with a pdf_sha256 field.
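
The pdf_sha256 field presumably lets you confirm that two runs processed identical input bytes. A minimal sketch of computing such a digest with the standard library follows; the helper name pdf_sha256 and the chunked-read approach are assumptions, not the project's actual implementation.

```python
import hashlib
from pathlib import Path


def pdf_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks.

    Hypothetical helper: reading in chunks avoids loading a large PDF
    into memory all at once.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing this value across sidecar files is one way to check that different providers were benchmarked against the same document.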

Run locally

Place PDF files in sample_pdfs/, then run:

python -m main --providers mistral --input-dir sample_pdfs --output-dir output

Run only one PDF from that folder:

python -m main --providers mistral --input-dir sample_pdfs --input-file invoice.pdf

Run with Landing AI:

python -m main --providers landing_ai --input-dir sample_pdfs --output-dir output

Multiple providers:

python -m main --providers mistral,landing_ai,openai,gemini,marker

Output layout

output/
  runs/
    <run_id>/
      <provider>_<model>_<pdf_name>.md
      <provider>_<model>_<pdf_name>.json
      metrics.txt
  metrics.txt
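
The naming scheme above could be produced along these lines. This is only a sketch: make_run_id and output_paths are hypothetical names, the run ID format is inferred from the example value 20260211_141500, and it is an assumption that <pdf_name> means the file name without its .pdf extension.

```python
from datetime import datetime
from pathlib import Path


def make_run_id(now=None):
    """Timestamp-style run ID, e.g. 20260211_141500 (format is inferred)."""
    now = now or datetime.now()
    return now.strftime("%Y%m%d_%H%M%S")


def output_paths(output_dir, run_id, provider, model, pdf_path):
    """Return the (.md, .json) paths for one provider result.

    Assumes <pdf_name> is the PDF file name without its extension.
    """
    run_dir = Path(output_dir) / "runs" / run_id
    prefix = f"{provider}_{model}_{Path(pdf_path).stem}"
    return run_dir / f"{prefix}.md", run_dir / f"{prefix}.json"
```

Prefixing each file with the provider and model keeps results for the same PDF side by side within a run directory.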

The top-level output/metrics.txt is append-only and line-based, for example:

run=20260211_141500 provider=mistral pdf=invoice.pdf time=2.300s pages=4 tokens=1234 credits=n/a cost=0.008 model=mistral-ocr-latest
run=20260211_141500 provider=landing_ai pdf=invoice.pdf time=1.842s pages=4 tokens=n/a credits=7.5 cost=0.075 model=default
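
Because each line is a sequence of key=value pairs separated by spaces, the file is easy to post-process. The sketch below parses lines into dictionaries and sums the cost column; parse_metrics_line and total_cost are hypothetical helpers, and the code assumes values never contain spaces (a PDF name with spaces would need different handling).

```python
def parse_metrics_line(line):
    """Split one 'key=value key=value ...' line into a dict of strings."""
    record = {}
    for token in line.split():
        key, _, value = token.partition("=")
        record[key] = value
    return record


def total_cost(lines):
    """Sum the cost field across lines, skipping blanks and 'n/a' values."""
    total = 0.0
    for line in lines:
        if not line.strip():
            continue
        cost = parse_metrics_line(line).get("cost", "n/a")
        if cost != "n/a":
            total += float(cost)
    return total
```

Values are kept as strings because fields such as tokens and credits may hold n/a, and time carries an s suffix; convert them only where you need numbers.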

Run tests

pytest

Docker

Build image:

docker build -t ocr-comparison .

Run container:

docker run -it --rm --env-file .env -v "$(pwd)/sample_pdfs:/app/sample_pdfs" -v "$(pwd)/output:/app/output" ocr-comparison
