mircorudolph/ocr_comparison

OCR Comparison

Minimal benchmark harness to compare PDF -> Markdown extraction across providers using a shared contract and common output format.

What it does

  • Reads PDFs from sample_pdfs/
  • Runs one or more providers (Mistral first)
  • Saves markdown outputs under output/runs/<run_id>/ with provider/model-prefixed filenames
  • Writes run metrics to output/runs/<run_id>/metrics.txt
  • Appends benchmark lines to output/metrics.txt across all runs
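The shared contract mentioned above could look roughly like the sketch below. All names here (ExtractionResult, Provider, EchoProvider, run_all) are hypothetical and chosen for illustration; the repository's actual interface may differ.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Protocol


@dataclass
class ExtractionResult:
    """Common output format returned by every provider (hypothetical fields)."""
    provider: str
    model: str
    markdown: str
    pages: Optional[int] = None
    tokens: Optional[int] = None
    credits: Optional[float] = None


class Provider(Protocol):
    """Shared contract: turn one PDF into markdown plus usage numbers."""
    name: str

    def extract(self, pdf_path: Path) -> ExtractionResult: ...


class EchoProvider:
    """Stand-in provider that makes no network calls, for illustration only."""
    name = "echo"

    def extract(self, pdf_path: Path) -> ExtractionResult:
        return ExtractionResult(
            provider=self.name,
            model="none",
            markdown=f"# {pdf_path.stem}\n",
            pages=1,
        )


def run_all(providers: list[Provider], pdf_path: Path) -> list[ExtractionResult]:
    # Every provider is called the same way, so results are directly comparable.
    return [p.extract(pdf_path) for p in providers]
```

Because each provider returns the same result type, the code that writes markdown files and metrics lines does not need provider-specific branches.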

Install

UV (recommended)

From project root:

uv init
uv add mistralai
uv add requests
uv add python-dotenv
uv add --dev pytest

To add more providers later, install their SDKs the same way with uv add <package>.

Environment variables

Copy .env.example to .env and set values:

  • MISTRAL_API_KEY: Required for Mistral OCR.
  • MISTRAL_OCR_MODEL: Optional, defaults to mistral-ocr-latest.
  • MISTRAL_USD_PER_1000_PAGES: Optional price config for Mistral cost estimation. Default: 2.
  • LANDING_AI_API_KEY: Required for Landing AI ADE Parse.
  • LANDING_AI_PARSE_URL: Optional endpoint override. Default: https://api.va.landing.ai/v1/ade/parse.
  • LANDING_AI_MODEL: Optional ADE Parse model override.
  • LANDING_AI_SPLIT: Optional split mode (page).
  • LANDING_AI_CREDIT_TO_USD: Optional conversion ratio for estimated cost.
  • LOG_LEVEL: Optional logging level (INFO by default).
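As a rough illustration of how the pricing variables might feed the cost estimate, here is a minimal sketch. The function names are hypothetical, and the formulas (pages times the per-1000-page price, credits times the conversion ratio) are assumptions inferred from the variable names, not confirmed from the code. No default is assumed for LANDING_AI_CREDIT_TO_USD, since none is documented.

```python
import os
from typing import Mapping, Optional


def mistral_cost_usd(pages: int, env: Mapping[str, str] = os.environ) -> float:
    """Estimate Mistral cost from page count (assumed formula)."""
    # Documented default: 2 USD per 1000 pages.
    usd_per_1000 = float(env.get("MISTRAL_USD_PER_1000_PAGES", "2"))
    return pages * usd_per_1000 / 1000


def landing_ai_cost_usd(
    credits: float, env: Mapping[str, str] = os.environ
) -> Optional[float]:
    """Estimate Landing AI cost from credits, or None if no ratio is set."""
    ratio = env.get("LANDING_AI_CREDIT_TO_USD")
    if ratio is None:
        # No documented default, so no estimate is produced in this sketch.
        return None
    return credits * float(ratio)


if __name__ == "__main__":
    print(mistral_cost_usd(4))
    print(landing_ai_cost_usd(7.5))
```

With the default price, a 4-page PDF comes out at 4 × 2 / 1000 = 0.008 USD under this assumed formula.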

The app auto-loads .env from the project root when you run python -m main. Each markdown output is accompanied by a sidecar .json file that includes a pdf_sha256 field.
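A pdf_sha256 value of this kind can be computed with the standard library. The sketch below is one way to do it, assuming the field holds the hex SHA-256 digest of the raw PDF bytes; the helper name and chunk size are illustrative.

```python
import hashlib
from pathlib import Path


def pdf_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in fixed-size chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing the digest next to each output makes it possible to check later that two runs were produced from byte-identical input files.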

Run locally

Place PDF files in sample_pdfs/, then run:

python -m main --providers mistral --input-dir sample_pdfs --output-dir output

Run only one PDF from that folder:

python -m main --providers mistral --input-dir sample_pdfs --input-file invoice.pdf

Run with Landing AI:

python -m main --providers landing_ai --input-dir sample_pdfs --output-dir output

Multiple providers:

python -m main --providers mistral,landing_ai,openai,gemini,marker

Output layout

output/
  runs/
    <run_id>/
      <provider>_<model>_<pdf_name>.md
      <provider>_<model>_<pdf_name>.json
      metrics.txt
  metrics.txt
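The layout above can be expressed as a small path helper. This is a sketch only: output_paths is a hypothetical name, and it assumes <pdf_name> means the file name without its .pdf extension, which the README does not state.

```python
from pathlib import Path
from typing import Tuple


def output_paths(
    output_dir: str, run_id: str, provider: str, model: str, pdf_name: str
) -> Tuple[Path, Path]:
    """Build the markdown and sidecar JSON paths for one provider result."""
    # Assumption: the .pdf extension is dropped from the output file name.
    stem = f"{provider}_{model}_{Path(pdf_name).stem}"
    run_dir = Path(output_dir) / "runs" / run_id
    return run_dir / f"{stem}.md", run_dir / f"{stem}.json"


if __name__ == "__main__":
    md_path, json_path = output_paths(
        "output", "20260211_141500", "mistral", "mistral-ocr-latest", "invoice.pdf"
    )
    print(md_path)
    print(json_path)
```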

metrics.txt is append-only and line-based, for example:

run=20260211_141500 provider=mistral pdf=invoice.pdf time=2.300s pages=4 tokens=1234 credits=n/a cost=0.008 model=mistral-ocr-latest
run=20260211_141500 provider=landing_ai pdf=invoice.pdf time=1.842s pages=4 tokens=n/a credits=7.5 cost=0.075 model=default
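Because each line is a sequence of key=value fields separated by spaces, the file is easy to post-process. The sketch below parses lines and totals the cost per provider; the function names are hypothetical, and it assumes no value contains whitespace (a PDF name with spaces would need different handling).

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_metrics_line(line: str) -> Dict[str, str]:
    """Split one metrics line into a dict of string fields."""
    # Split on the first "=" only, so values may themselves contain "=".
    return dict(field.split("=", 1) for field in line.split())


def cost_by_provider(lines: Iterable[str]) -> Dict[str, float]:
    """Sum the cost field per provider, skipping blank lines and n/a values."""
    totals: Dict[str, float] = defaultdict(float)
    for line in lines:
        if not line.strip():
            continue
        row = parse_metrics_line(line)
        if row.get("cost", "n/a") != "n/a":
            totals[row["provider"]] += float(row["cost"])
    return dict(totals)
```

To use it on real data, pass the lines of output/metrics.txt, for example cost_by_provider(open("output/metrics.txt")).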

Run tests

pytest

Docker

Build image:

docker build -t ocr-comparison .

Run container:

docker run -it --rm --env-file .env -v "$(pwd)/sample_pdfs:/app/sample_pdfs" -v "$(pwd)/output:/app/output" ocr-comparison

About

Repo to test different OCR/VQA pipelines and tools
