holiq/pdf-extractor

PDF Extractor

A small FastAPI service that extracts Markdown-ready text from PDF files using PyMuPDF and PyMuPDF4LLM.

Features

  • Accepts PDF uploads via POST /extract
  • Converts PDF content into Markdown text
  • Returns the page count and the number of extracted characters
  • Includes a health check endpoint (GET /health)
  • Supports Docker and local development

Requirements

  • Python 3.12+
  • uv

Environment

Create a .env file at the project root to customize runtime settings.

Supported variables:

  • PORT — service port (default: 5000)
  • MAX_FILE_SIZE_MB — maximum upload size in megabytes (default: 50)
  • CORS_ORIGINS — allowed CORS origins, comma-separated (default: *)
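
As a rough illustration of how the variables above could be read, here is a minimal sketch. The `Settings` class and `load_settings` function are hypothetical names, not taken from this repository. The sketch assumes the values are already present in the process environment (this README does not say how the .env file gets loaded) and that one megabyte means 1024 * 1024 bytes.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three documented variables."""
    port: int
    max_file_size_mb: int
    cors_origins: list[str]

    @property
    def max_file_size_bytes(self) -> int:
        # Assumption: 1 MB is treated as 1024 * 1024 bytes.
        return self.max_file_size_mb * 1024 * 1024


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read PORT, MAX_FILE_SIZE_MB and CORS_ORIGINS with the documented defaults."""
    origins = env.get("CORS_ORIGINS", "*")
    return Settings(
        port=int(env.get("PORT", "5000")),
        max_file_size_mb=int(env.get("MAX_FILE_SIZE_MB", "50")),
        # Split the comma-separated list and drop surrounding whitespace.
        cors_origins=[o.strip() for o in origins.split(",") if o.strip()],
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check, for example `load_settings({})` yields port 5000, a 50 MB limit, and the single origin `*`.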

Local development

uv venv
source .venv/bin/activate
uv sync

Run the API locally:

uv run uvicorn main:app --host 0.0.0.0 --port ${PORT:-5000} --reload

API Endpoints

Health check

GET /health

Response:

{ "status": "ok" }

Extract PDF

POST /extract
Content-Type: multipart/form-data

file: <PDF file>

Successful response:

{
  "text": "...",
  "pages": 10,
  "characters": 12345
}

Errors:

  • 422 if the uploaded file is missing, empty, or not a PDF
  • 413 if the file is larger than MAX_FILE_SIZE_MB
  • 500 for extraction or server errors

Docker

Build and run with Docker Compose:

docker compose up --build

The app will be available at http://localhost:5000 by default.

Notes

  • Only application/pdf uploads are supported.
  • Extraction uses pymupdf4llm.to_markdown() to generate Markdown-friendly output.
  • The service is intentionally small and focused on PDF text extraction.
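
To make the extraction step more concrete, the following sketch shows one way `pymupdf4llm.to_markdown()` could be applied to uploaded bytes. It is not the repository's actual code. `extract_pdf` and `build_response` are hypothetical names, and the sketch assumes that `characters` is the length of the Markdown string and that a recent PyMuPDF release is installed (older releases expose the module as `fitz` rather than `pymupdf`).

```python
def build_response(text: str, pages: int) -> dict:
    """Shape the result like the documented /extract response."""
    # Assumption: "characters" counts the characters of the Markdown text.
    return {"text": text, "pages": pages, "characters": len(text)}


def extract_pdf(data: bytes) -> dict:
    """Convert in-memory PDF bytes to Markdown and report basic counts."""
    # Imported lazily so build_response works without PyMuPDF installed.
    import pymupdf
    import pymupdf4llm

    # Open the PDF from memory instead of from a file path.
    doc = pymupdf.open(stream=data, filetype="pdf")
    try:
        text = pymupdf4llm.to_markdown(doc)
        return build_response(text, doc.page_count)
    finally:
        doc.close()
```

Keeping the response shaping separate from the PyMuPDF calls means the payload format can be checked without a real PDF at hand.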
