A small FastAPI service that extracts Markdown-ready text from PDF files using PyMuPDF and PyMuPDF4LLM.
- Accepts PDF uploads via POST /extract
- Converts PDF content into Markdown text
- Returns page count and extracted character count
- Includes a health check endpoint (GET /health)
- Supports Docker and local development
- Python 3.12+
- uv
Create a .env file at the project root to customize runtime settings.
Supported variables:
- PORT: service port (default: 5000)
- MAX_FILE_SIZE_MB: maximum upload size in megabytes (default: 50)
- CORS_ORIGINS: allowed CORS origins, comma-separated (default: *)
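One way these variables could be read at startup is sketched below. It is illustrative only, not the service's actual code: the `Settings` class and `load_settings` function are hypothetical names, and the sketch assumes the `.env` values have already been loaded into the process environment (for example by the shell or a dotenv loader).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three documented variables."""
    port: int
    max_file_size_bytes: int
    cors_origins: list[str]


def load_settings(env=None) -> Settings:
    """Read PORT, MAX_FILE_SIZE_MB and CORS_ORIGINS, applying the documented defaults."""
    env = os.environ if env is None else env
    port = int(env.get("PORT", "5000"))
    max_mb = int(env.get("MAX_FILE_SIZE_MB", "50"))
    # CORS_ORIGINS is comma-separated; surrounding whitespace is dropped.
    raw_origins = env.get("CORS_ORIGINS", "*")
    origins = [item.strip() for item in raw_origins.split(",") if item.strip()]
    # Assumption: one megabyte is treated as 1024 * 1024 bytes.
    return Settings(port, max_mb * 1024 * 1024, origins)
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check without touching the real environment.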
uv venv
source .venv/bin/activate
uv sync
Run the API locally:
uv run uvicorn main:app --host 0.0.0.0 --port ${PORT:-5000} --reload
GET /health
Response:
{ "status": "ok" }
POST /extract
Content-Type: multipart/form-data
file: <PDF file>
Successful response:
{
  "text": "...",
  "pages": 10,
  "characters": 12345
}
Errors:
- 422 if the uploaded file is missing, empty, or not a PDF
- 413 if the file is larger than MAX_FILE_SIZE_MB
- 500 for extraction or server errors
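The 422 and 413 rules above can be sketched as a small validation helper. This is a sketch under stated assumptions, not the service's implementation: the function name `upload_error_status`, the order of the checks, the use of the declared content type to decide "not a PDF", and the 1024 * 1024 byte megabyte are all assumptions.

```python
def upload_error_status(content_type, data, max_file_size_mb=50):
    """Return the HTTP status for a rejected upload, or None if the upload passes.

    Hypothetical helper: `data` is the raw upload body (None when no file was sent).
    """
    # Missing or empty file.
    if not data:
        return 422
    # Anything other than application/pdf is rejected.
    if content_type != "application/pdf":
        return 422
    # Larger than MAX_FILE_SIZE_MB (assumed to mean 1024 * 1024 bytes per MB).
    if len(data) > max_file_size_mb * 1024 * 1024:
        return 413
    return None
```

Failures during extraction itself would fall outside this helper and correspond to the 500 case.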
Build and run with Docker Compose:
docker compose up --build
The app will be available at http://localhost:5000 by default.
- Only application/pdf uploads are supported.
- Extraction uses pymupdf4llm.to_markdown() to generate Markdown-friendly output.
- The service is intentionally small and focused on PDF text extraction.
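As a rough illustration of the extraction step, the sketch below opens uploaded bytes with PyMuPDF, passes the document to `pymupdf4llm.to_markdown()`, and builds a response with the documented fields. The helper names `summarize` and `extract_markdown` are hypothetical, and computing `characters` as the length of the Markdown text is an assumption about how the service counts them.

```python
def summarize(text: str, pages: int) -> dict:
    """Build the response body; `characters` is assumed to be len(text)."""
    return {"text": text, "pages": pages, "characters": len(text)}


def extract_markdown(pdf_bytes: bytes) -> dict:
    """Convert in-memory PDF bytes to Markdown and summarize the result."""
    # Imported here so summarize() stays usable without the PDF dependencies.
    import pymupdf
    import pymupdf4llm

    # Open the PDF from memory rather than from a file path.
    with pymupdf.open(stream=pdf_bytes, filetype="pdf") as doc:
        text = pymupdf4llm.to_markdown(doc)
        return summarize(text, doc.page_count)
```

Keeping the response construction separate from the PDF handling lets the response shape be checked without a real PDF.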