This repository contains ocr_transcribe.py, a small Python script that uses
pytesseract to OCR images and create a markdown file (e.g. chapter_1.md).
Prerequisites
- Tesseract OCR must be installed on your machine, and the tesseract binary must be on PATH.
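A quick way to confirm this prerequisite from Python is sketched below. It is not part of ocr_transcribe.py; the helper names `find_tesseract` and `tesseract_version` are made up for illustration, and only the standard library is used.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract binary, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(binary):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=True
    )
    # Some older Tesseract releases print the version banner to stderr.
    banner = result.stdout or result.stderr
    return banner.splitlines()[0] if banner else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; install it before running the script")
    else:
        print(f"found {path}: {tesseract_version(path)}")
```

If the check reports that the binary is missing, follow the platform-specific install steps and open a new shell so that PATH is refreshed.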
Install Tesseract (platform-specific)
- macOS:
brew install tesseract
- Linux (Debian/Ubuntu) or Windows (WSL):
sudo apt update
sudo apt install -y tesseract-ocr libtesseract-dev
Install Python dependencies in your venv (on Windows or inside WSL). The commands below install the latest versions of a small set of packages; pin versions if you need reproducible installs.
Using pip:
python -m pip install numpy pillow pytesseract
Using uv (if you prefer to manage packages with uv):
uv add numpy pillow pytesseract
Run the script on your chapter folder:
python ocr_transcribe.py --input-dir static/chapter_1 --output-file chapter_1.md
Or run the script using uv run (uses the active uv environment):
uv run python ocr_transcribe.py --input-dir static/chapter_1 --output-file chapter_1.md
Options:
- --resize: maximum width in pixels to which images are resized before OCR (default 2000).
- --lang: Tesseract language codes (default eng).
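The flags and defaults above could be wired up roughly as follows. This is a sketch rather than the script's actual code: `build_parser` and `scaled_size` are hypothetical names, and the assumption that images narrower than the limit are left untouched (never upscaled) is mine.

```python
import argparse


def build_parser():
    """Command-line interface mirroring the flags documented in this README."""
    parser = argparse.ArgumentParser(
        description="OCR a folder of images into a markdown file."
    )
    parser.add_argument("--input-dir", required=True, help="folder containing the images")
    parser.add_argument("--output-file", required=True, help="markdown file to write")
    parser.add_argument("--resize", type=int, default=2000,
                        help="max width in pixels to resize images before OCR")
    parser.add_argument("--lang", default="eng", help="Tesseract language codes")
    return parser


def scaled_size(width, height, max_width):
    """Shrink (width, height) to fit max_width while keeping the aspect ratio."""
    if width <= max_width:
        return width, height  # assumption: small images are not upscaled
    ratio = max_width / width
    return max_width, max(1, round(height * ratio))
```

For example, a 4000x3000 scan with the default limit would be reduced to 2000x1500, while an 800x600 image would pass through unchanged. Tesseract accepts several languages joined with a plus sign, such as eng+deu, provided the matching language data is installed.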
Notes on privacy and cost:
- The script runs locally with pytesseract, so your images are not uploaded.
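To illustrate what local processing means here, the sketch below shows one way a folder could be transcribed with pytesseract, which wraps the Tesseract binary installed on your machine. The function names, the list of image suffixes, and the one-heading-per-image markdown layout are assumptions for illustration, not a description of ocr_transcribe.py.

```python
from pathlib import Path

# Assumed set of image types; the real script may accept others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def ocr_with_tesseract(path, lang="eng"):
    """OCR one image with the locally installed Tesseract binary."""
    # Imported lazily so the rest of this module loads without the OCR stack.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang)


def transcribe_folder(input_dir, output_file, ocr=ocr_with_tesseract):
    """Write one markdown section per image (hypothetical layout); return the page count."""
    pages = sorted(
        p for p in Path(input_dir).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    sections = [f"## {page.name}\n\n{ocr(page).strip()}\n" for page in pages]
    Path(output_file).write_text("\n".join(sections), encoding="utf-8")
    return len(pages)
```

Passing the OCR function as a parameter keeps the folder logic testable without Tesseract installed, and nothing in this flow opens a network connection.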
Because processing is local, this README does not include a helper to compress or resize images before uploading them to an API; Windows users are guided to run the script under WSL.