Part of the Philofree Project
This tool processes scanned PDFs of ancient Greek texts using Google Document AI, producing clean digital text ready for the Philofree corpus. It is designed to digitize out-of-copyright print editions and make them available for scholarly use.
The tool handles scanned PDFs of ancient Greek literature and converts them to structured digital formats suitable for text processing and analysis.
- PDF processing with automatic handling of large scholarly editions
- Polytonic Greek support — full Unicode coverage for ancient Greek diacritics
- Multiple output formats — Text, Markdown, HTML, JSON (ready for Philofree pipeline integration)
- Document structure identification — automatically identifies line numbers, footnotes, headers, indentation levels, and references
- Memory-efficient processing — handles large critical editions efficiently
- Batch processing — process entire library collections systematically
Scanned PDF (print edition)
↓
Philofree OCR (this tool)
↓
Raw Greek text + metadata
↓
Philofree processing pipeline
↓
Canonical JSON → philofree.com
The OCR output feeds directly into our text processing pipeline, where it receives:
- Sentence segmentation with Philofree IDs
- Reference system mapping (Stephanus, Bekker, etc.)
- Integration with the searchable corpus
- Python 3.12 or higher
- Google Cloud account with Document AI enabled
- Document AI processor configured for OCR
- Clone this repository:

      git clone https://github.com/philofree/ocr.git
      cd ocr

- Create and activate a virtual environment:

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt
      # Optional: Install ijson for optimal performance with large files
      pip install ijson

- Configure credentials:
  - Copy ENV.template to ENV.local
  - Add your Google Cloud credentials
- Create a project in Google Cloud Console
- Enable the Document AI API
- Create a Document AI processor for OCR
- Create a service account and download the JSON key file
- Update your ENV.local file with the appropriate values
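The configuration step above can be illustrated with a minimal sketch. It assumes ENV.local holds plain KEY=VALUE lines, which the README does not confirm, and the helper names `load_env_file` and `apply_env` are hypothetical rather than part of this repository. Check ENV.template for the variable names the tool actually expects.

```python
import os
from pathlib import Path


def load_env_file(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments.

    Assumption: ENV.local uses this simple format.
    """
    values: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values: dict[str, str]) -> None:
    """Export parsed values without overriding variables that are already set."""
    for key, value in values.items():
        os.environ.setdefault(key, value)
```

Calling `apply_env(load_env_file("ENV.local"))` early in a script would make the values visible through `os.environ` while leaving anything already exported by the shell untouched.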
Note on costs: Google Document AI offers a free tier sufficient for small projects. For large-scale digitization, costs vary based on usage volume.
Run the application:

    python -m src.philocr.main

Convert OCR output to Markdown from Python:

    from utils.markdown_converter.markdown_handler import MarkdownHandler

    # Convert OCR JSON output to markdown
    MarkdownHandler.save_file_as_markdown(
        json_file_path="path/to/ocr_output.json",
        output_file_path="path/to/greek_text.md",
    )

Run batch processing:

    # Process all PDFs in a directory
    python test_large_file_batch.py --dir /path/to/scanned_editions --output digitized_texts

    # Process a specific file
    python test_large_file_batch.py --file /path/to/plato_republic_1903.json --output output_dir

Build a package on macOS/Linux:

    ./src/philocr/build_package.sh

Build a package on Windows:

    .\src\philocr\build_package.bat

The markdown converter handles multiple JSON output formats from Document AI:
- Standard format with document_data and pages
- Alternative format with files array and page markers
- Chunked format for very large documents
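As a rough illustration of how a converter might tell the first two formats apart, here is a minimal sketch. Only the key names `document_data`, `pages`, and `files` come from the list above. Whether `pages` sits beside or inside `document_data` is an assumption, so the sketch accepts both, and `detect_ocr_json_format` is a hypothetical helper, not the converter's actual code. The chunked format is not described in enough detail to detect here, so it falls through to "unknown".

```python
from typing import Any


def detect_ocr_json_format(data: Any) -> str:
    """Return "standard", "files", or "unknown" for parsed OCR JSON."""
    if not isinstance(data, dict):
        return "unknown"

    document_data = data.get("document_data")
    if document_data is not None:
        # Assumption: pages may be nested in document_data or be a sibling key.
        nested = isinstance(document_data, dict) and "pages" in document_data
        if nested or "pages" in data:
            return "standard"

    if isinstance(data.get("files"), list):
        return "files"

    return "unknown"
```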
The tool automatically selects the optimal processing method based on file size:
- Direct processing for small files (< 50MB)
- Chunked processing for large files (50-200MB)
- Streaming processing with ijson for very large files (> 200MB)
This ensures that a 2000-page critical edition processes just as reliably as a 50-page fragment collection.
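The size-based selection described above can be sketched as follows. The thresholds are the ones listed. The function names, the use of 1 MB = 1024 * 1024 bytes, and the treatment of the exact boundaries (50 MB and 200 MB both count as "chunked") are assumptions for illustration, not the tool's confirmed behaviour.

```python
import os

MB = 1024 * 1024  # assumption: binary megabytes


def choose_strategy(size_bytes: int) -> str:
    """Map a file size to one of the three processing methods."""
    if size_bytes < 50 * MB:
        return "direct"      # small files: load and process in one pass
    if size_bytes <= 200 * MB:
        return "chunked"     # large files: process in pieces
    return "streaming"       # very large files: incremental parsing (ijson)


def strategy_for_file(path: str) -> str:
    """Look up the file size on disk and pick a strategy."""
    return choose_strategy(os.path.getsize(path))
```

Keeping the decision in a pure function of the byte count makes the thresholds easy to test without creating multi-hundred-megabyte fixtures.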
The academic document parser automatically identifies and categorizes structural elements in scholarly texts:
- Line numbers — identified by position (typically in the left margin)
- Footnotes — detected at the bottom of pages by position and formatting patterns
- Headers — recognized as all caps text (typically section titles)
- Indentation levels — calculated from x-coordinate positions to preserve hierarchical structure
- References — detected through pattern matching (e.g., fragment references, citations)
These elements are properly categorized and formatted in the output to maintain the scholarly structure of the original document.
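The heuristics above can be illustrated with a small sketch. It is not the parser's actual code: the `Block` type, the normalized 0.0 to 1.0 page coordinates, every threshold value, and the fragment-reference pattern are assumptions chosen only to make the rules concrete.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x: float  # left edge, 0.0 (page left) to 1.0 (page right)
    y: float  # top edge, 0.0 (page top) to 1.0 (page bottom)


# Hypothetical thresholds; real editions would need tuning.
LEFT_MARGIN = 0.08
FOOTNOTE_ZONE = 0.85
INDENT_STEP = 0.04
REFERENCE_RE = re.compile(r"\b(?:fr|frag)\.\s*\d+", re.IGNORECASE)


def classify(block: Block) -> str:
    """Assign one structural category to a text block."""
    text = block.text.strip()
    # Line numbers: bare digits sitting in the left margin.
    if text.isdigit() and block.x < LEFT_MARGIN:
        return "line_number"
    # Footnotes: near the page bottom and starting with a note number.
    if block.y > FOOTNOTE_ZONE and re.match(r"^\d+[.)]?\s", text):
        return "footnote"
    # Headers: every alphabetic character is upper case.
    letters = [ch for ch in text if ch.isalpha()]
    if letters and all(ch.isupper() for ch in letters):
        return "header"
    # References: matched by pattern, here fragment citations such as "fr. 12".
    if REFERENCE_RE.search(text):
        return "reference"
    return "body"


def indent_level(block: Block, body_left: float = LEFT_MARGIN) -> int:
    """Convert the x offset from the body's left edge into an indent level."""
    return max(0, round((block.x - body_left) / INDENT_STEP))
```

The order of the checks matters in this sketch: position-based rules run first so that a digit in the margin is never mistaken for body text.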
- Full polytonic Greek character preservation (Unicode NFC normalized)
- Page structure maintained for cross-reference with print editions
- Paragraph and line breaks preserved where meaningful
- Document structure elements (line numbers, footnotes, headers) properly identified and formatted
- Metadata extraction for bibliographic tracking
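To show what NFC normalization means for polytonic Greek, here is a short sketch using Python's standard `unicodedata` module. The helper name `to_nfc` is hypothetical, and where the tool itself applies normalization is not stated in this README.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# Alpha followed by two combining marks: smooth breathing (U+0313) and acute (U+0301).
decomposed = "\u03b1\u0313\u0301"
composed = to_nfc(decomposed)

# After NFC the three code points become the single precomposed character U+1F04.
print(len(decomposed), len(composed))
print(unicodedata.is_normalized("NFC", composed))
```

Normalizing before comparison or search means that visually identical words compare equal regardless of how the OCR engine encoded the diacritics.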
PhilOcr/
├── src/philocr/ # Main application code
│ ├── main.py # Application entry point
│ ├── config/ # Configuration handling
│ ├── processing/ # OCR processing logic
│ ├── ui/ # User interface (PyQt6)
│ └── utils/ # Utilities and helpers
├── tests/ # Test suite
│ ├── markdown_converter/ # Converter tests
│ └── fixtures/ # Test data
├── docs/ # Documentation
└── utils/ # Standalone utilities
└── markdown_converter/ # JSON-to-Markdown converter
Contributions are welcome.
Ways to contribute:
- Improve OCR accuracy for polytonic Greek
- Add support for additional output formats
- Enhance batch processing capabilities
- Document edge cases in Greek text processing
- Test with diverse print edition formats
- Place source code in src/philocr/
- Place tests in tests/
- Use absolute imports: from philocr.main import ...
- Run pre-commit run --all-files before committing
- Add entries to the Development Log for significant changes
Pre-commit hooks enforce:
- Ruff linting
- Black formatting
- isort import sorting
- mypy type checking
CC0 1.0 Universal — Public Domain Dedication
This tool is dedicated to the public domain. You can copy, modify, distribute, and use it for any purpose, including commercial purposes, without asking permission.
See LICENSE for full details.
More about CC0: https://creativecommons.org/publicdomain/zero/1.0/
- Never commit ENV.local or credential files
- Store Google Cloud credentials securely
- The application uses environment variables, not hardcoded credentials
Qt plugin issues (macOS):

    ./src/philocr/reinstall_qt.sh

Large file processing failures: Install ijson for streaming support:

    pip install ijson

Polytonic character issues: Ensure your terminal/editor supports Unicode and the output files are UTF-8 encoded.
- Google Document AI — OCR engine with excellent Greek character recognition
- PyQt6 — Cross-platform user interface
- PyMuPDF — PDF handling and manipulation
- The Philofree community — Testing and feedback
- Philofree — The main corpus and translation project
- Perseus Digital Library — Open Greek and Latin texts
- First1KGreek — Community-sourced Greek texts
- Open Greek and Latin — TEI XML corpus development
This project maintains a detailed Development Log tracking changes, issues, solutions, and lessons learned.
## [YYYY-MM-DD] - Brief Title
**Developer:** [Name]
**Time:** [HH:MM]
### Changes Made
...
### Issues Encountered
...
### Solutions Implemented
...
### Lessons Learned
...
### Next Steps
...

The development log preserves institutional knowledge and helps future contributors understand why certain technical decisions were made.