Philofree OCR: Processing Ancient Greek Texts

Overview

This tool processes scanned PDFs of ancient Greek texts using Google Document AI, producing clean, structured digital text ready for the Philofree corpus. It is designed to digitize out-of-copyright print editions and make them available for scholarly use, text processing, and analysis.

Features

PDF processing with automatic handling of large scholarly editions
Polytonic Greek support — full Unicode coverage for ancient Greek diacritics
Multiple output formats — Text, Markdown, HTML, JSON (ready for Philofree pipeline integration)
Document structure identification — automatically identifies line numbers, footnotes, headers, indentation levels, and references
Memory-efficient processing of large critical editions through chunked and streaming modes
Batch processing — process entire library collections systematically

How It Fits the Philofree Pipeline

Scanned PDF (print edition)
        ↓
   Philofree OCR (this tool)
        ↓
   Raw Greek text + metadata
        ↓
   Philofree processing pipeline
        ↓
   Canonical JSON → philofree.com

The OCR output feeds directly into our text processing pipeline, where it receives:

Sentence segmentation with Philofree IDs
Reference system mapping (Stephanus, Bekker, etc.)
Integration with the searchable corpus
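
As an illustration of the reference systems named above: a Stephanus reference (used for Plato) combines a page number with a section letter from a to e, as in 327a, while a Bekker reference (used for Aristotle) combines a page number, a column letter a or b, and a line number, as in 1094a1. The sketch below is not the Philofree pipeline's mapping code; the function name and both patterns are assumptions chosen for illustration, and printed editions vary in how they typeset these references.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only.
# Bekker: page, column (a or b), line number, e.g. "1094a1".
BEKKER = re.compile(r"(\d{1,4})([ab])(\d{1,2})")
# Stephanus: page and section letter (a-e), e.g. "327a".
STEPHANUS = re.compile(r"(\d{1,3})([a-e])")


def classify_reference(token: str) -> Optional[str]:
    """Return 'bekker', 'stephanus', or None for a single token."""
    if BEKKER.fullmatch(token):
        return "bekker"
    if STEPHANUS.fullmatch(token):
        return "stephanus"
    return None
```

Bekker is tested first because its form is the more specific of the two; a bare page-and-letter token falls through to the Stephanus pattern.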

Setup Instructions

Prerequisites

Python 3.12 or higher
Google Cloud account with Document AI enabled
Document AI processor configured for OCR

Installation

Clone this repository:

git clone https://github.com/philofree/ocr.git
cd ocr

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

# Optional: Install ijson for optimal performance with large files
pip install ijson

Configure credentials:
- Copy ENV.template to ENV.local
- Add your Google Cloud credentials

Google Document AI Setup

Create a project in Google Cloud Console
Enable the Document AI API
Create a Document AI processor for OCR
Create a service account and download the JSON key file
Update your ENV.local file with the appropriate values
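
For orientation, the sketch below reads simple KEY=VALUE lines from ENV.local into the process environment using only the standard library. It is an assumption about the file layout, not a description of how the application loads its settings; check ENV.template for the variable names the application actually expects. The helper names load_env_file and apply_env are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values: dict[str, str]) -> None:
    """Export parsed values without overwriting variables already set."""
    for key, value in values.items():
        os.environ.setdefault(key, value)
```

Using setdefault means a variable already exported in the shell takes precedence over the file, which is a common convention for local overrides.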

Note on costs: Google Document AI offers a free tier sufficient for small projects. For large-scale digitization, costs vary based on usage volume.

Usage

Running the Application

python -m src.philocr.main

Processing a Single File

from utils.markdown_converter.markdown_handler import MarkdownHandler

# Convert OCR JSON output to markdown
MarkdownHandler.save_file_as_markdown(
    json_file_path="path/to/ocr_output.json",
    output_file_path="path/to/greek_text.md"
)

Batch Processing (Recommended for Library Digitization)

# Process all PDFs in a directory
python test_large_file_batch.py --dir /path/to/scanned_editions --output digitized_texts

# Process a specific file
python test_large_file_batch.py --file /path/to/plato_republic_1903.json --output output_dir

Building Standalone Applications

macOS

./src/philocr/build_package.sh

Windows

.\src\philocr\build_package.bat

Technical Details

Supported Input Formats

The markdown converter handles multiple JSON output formats from Document AI:

Standard format with document_data and pages
Alternative format with files array and page markers
Chunked format for very large documents
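
A minimal sketch of how the three layouts might be told apart once the JSON is parsed. The keys document_data, pages, and files come from the list above; whether pages sits beside or inside document_data is not specified, so the sketch accepts both. The key name chunks is a hypothetical stand-in for however the chunked layout is marked, and detect_format is not a function of the converter itself.

```python
import json
from typing import Any


def detect_format(data: Any) -> str:
    """Guess which of the layouts listed above a parsed JSON object uses."""
    if not isinstance(data, dict):
        return "unknown"
    doc = data.get("document_data")
    if doc is not None:
        nested_pages = isinstance(doc, dict) and "pages" in doc
        if nested_pages or "pages" in data:
            return "standard"
    if isinstance(data.get("files"), list):
        return "files"
    if "chunks" in data:  # hypothetical key name for the chunked layout
        return "chunked"
    return "unknown"


sample = json.loads('{"files": [{"name": "page_001"}]}')
print(detect_format(sample))
```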

Processing Strategies

The tool automatically selects the optimal processing method based on file size:

Direct processing for small files (< 50MB)
Chunked processing for large files (50-200MB)
Streaming processing with ijson for very large files (> 200MB)

The goal is for a 2000-page critical edition to process as reliably as a 50-page fragment collection.
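
The size thresholds above can be sketched as a small dispatch function. This is an illustration rather than the tool's own selection logic: the function names are hypothetical, MB is assumed to mean 1024 * 1024 bytes, and the handling of files at exactly 50 MB or 200 MB is an assumption, since the list does not say which side the boundaries fall on.

```python
import importlib.util
import os

MB = 1024 * 1024  # assumption: binary megabytes


def choose_strategy(size_bytes: int) -> str:
    """Map a file size to one of the three strategies listed above."""
    if size_bytes < 50 * MB:
        return "direct"
    if size_bytes <= 200 * MB:
        return "chunked"
    return "streaming"


def ijson_available() -> bool:
    """Report whether the optional ijson package can be imported."""
    return importlib.util.find_spec("ijson") is not None


def strategy_for_file(path: str) -> str:
    """Pick a strategy from the size of the file on disk."""
    return choose_strategy(os.path.getsize(path))
```

A caller could check ijson_available() before committing to the streaming path and report a clear installation hint when the package is missing.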

Document Structure Parsing

The academic document parser automatically identifies and categorizes structural elements in scholarly texts:

Line numbers — identified by position (typically in the left margin)
Footnotes — detected at the bottom of pages by position and formatting patterns
Headers — recognized as all caps text (typically section titles)
Indentation levels — calculated from x-coordinate positions to preserve hierarchical structure
References — detected through pattern matching (e.g., fragment references, citations)

These elements are categorized and formatted in the output to preserve the scholarly structure of the original document.
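
The rules above can be sketched as a small classifier. This is a simplified illustration, not the parser's code: it assumes each text block has already been reduced to its text plus x and y positions expressed as fractions of the page size, and every threshold, pattern, and name (TextBlock, classify, indent_level) is a hypothetical choice.

```python
import re
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    x: float  # left edge, as a fraction of page width (0.0 to 1.0)
    y: float  # top edge, as a fraction of page height (0.0 to 1.0)


# All thresholds and patterns below are illustrative assumptions.
LEFT_MARGIN = 0.10    # blocks starting left of this are in the margin
FOOTNOTE_ZONE = 0.85  # blocks starting below this are near the page bottom
INDENT_STEP = 0.04    # horizontal distance treated as one indent level
FOOTNOTE_MARK = re.compile(r"^(\d+|\*)[\s.)]")
REFERENCE = re.compile(r"\b(fr|frag)\.\s*\d+", re.IGNORECASE)


def classify(block: TextBlock) -> str:
    """Assign one structural category to a block, checking rules in order."""
    text = block.text.strip()
    if text.isdigit() and block.x < LEFT_MARGIN:
        return "line_number"
    if block.y > FOOTNOTE_ZONE and FOOTNOTE_MARK.match(text):
        return "footnote"
    if text.isupper():  # also true for all-caps Greek such as "ΠΟΛΙΤΕΙΑ"
        return "header"
    if REFERENCE.search(text):
        return "reference"
    return "body"


def indent_level(block: TextBlock) -> int:
    """Convert the x position into a non-negative indentation level."""
    return max(0, round((block.x - LEFT_MARGIN) / INDENT_STEP))
```

The order of the checks matters: a digits-only block in the left margin is treated as a line number before any other rule is tried, so marginal numbering near the bottom of a page is not mistaken for a footnote marker.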

Output Quality

Full polytonic Greek character preservation (Unicode NFC normalized)
Page structure maintained for cross-reference with print editions
Paragraph and line breaks preserved where meaningful
Document structure elements (line numbers, footnotes, headers) properly identified and formatted
Metadata extraction for bibliographic tracking
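
To show what NFC normalization means for polytonic Greek, the sketch below composes a base letter and its combining diacritics into a single precomposed code point using the standard library's unicodedata module. The function name normalize_greek is hypothetical; the tool's own normalization step may be organized differently.

```python
import unicodedata


def normalize_greek(text: str) -> str:
    """Return the NFC (canonically composed) form of the text."""
    return unicodedata.normalize("NFC", text)


# Alpha + combining smooth breathing + combining acute: three code points.
decomposed = "\u03b1\u0313\u0301"
composed = normalize_greek(decomposed)

print(len(decomposed), len(composed))
print(unicodedata.name(composed))
```

One consequence worth knowing when searching the output: NFC maps the Greek Extended letters with oxia to their tonos equivalents in the basic Greek block (for example U+1F71 becomes U+03AC), so text normalized this way will not contain the oxia code points.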

Project Structure

PhilOcr/
├── src/philocr/            # Main application code
│   ├── main.py             # Application entry point
│   ├── config/             # Configuration handling
│   ├── processing/         # OCR processing logic
│   ├── ui/                 # User interface (PyQt6)
│   └── utils/              # Utilities and helpers
├── tests/                  # Test suite
│   ├── markdown_converter/ # Converter tests
│   └── fixtures/           # Test data
├── docs/                   # Documentation
└── utils/                  # Standalone utilities
    └── markdown_converter/ # JSON-to-Markdown converter

Contributing

Contributions are welcome.

Ways to contribute:

Improve OCR accuracy for polytonic Greek
Add support for additional output formats
Enhance batch processing capabilities
Document edge cases in Greek text processing
Test with diverse print edition formats

Development Workflow

Place source code in src/philocr/
Place tests in tests/
Use absolute imports: from philocr.main import ...
Before committing, run: pre-commit run --all-files
Add entries to the Development Log for significant changes

Code Quality

Pre-commit hooks enforce:

Ruff linting
Black formatting
isort import sorting
mypy type checking

License

CC0 1.0 Universal — Public Domain Dedication

This tool is dedicated to the public domain. You can copy, modify, distribute, and use it for any purpose, including commercial purposes, without asking permission.

See LICENSE for full details.
More about CC0: https://creativecommons.org/publicdomain/zero/1.0/

Security Notes

Never commit ENV.local or credential files
Store Google Cloud credentials securely
The application uses environment variables, not hardcoded credentials

Troubleshooting

Qt plugin issues (macOS):

./src/philocr/reinstall_qt.sh

Large file processing failures: install ijson to enable streaming support:

pip install ijson

Polytonic character issues: Ensure your terminal/editor supports Unicode and the output files are UTF-8 encoded.

Acknowledgments

Google Document AI — OCR engine with excellent Greek character recognition
PyQt6 — Cross-platform user interface
PyMuPDF — PDF handling and manipulation
The Philofree community — Testing and feedback

Related Projects

Philofree — The main corpus and translation project
Perseus Digital Library — Open Greek and Latin texts
First1KGreek — Community-sourced Greek texts
Open Greek and Latin — TEI XML corpus development

Development Log

This project maintains a detailed Development Log tracking changes, issues, solutions, and lessons learned.

Log Format

## [YYYY-MM-DD] - Brief Title

**Developer:** [Name]
**Time:** [HH:MM]

### Changes Made
...

### Issues Encountered
...

### Solutions Implemented
...

### Lessons Learned
...

### Next Steps
...

The development log preserves institutional knowledge and helps future contributors understand why certain technical decisions were made.
