This project provides tools to decode Hebrew text from specially encoded PDFs and process the extracted text for readability.
# Install dependencies
pip install pymupdf python-bidi

- decode_pdf.py: Main script that extracts and decodes PDF text using character mapping
- process_pdf.py: End-to-end script that runs the full PDF processing pipeline
- text_processor.py: Processes extracted text files with cleaning and RTL correction
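
The character mapping itself is not shown in this README. As a rough sketch of the idea only, assuming the PDF's font stores each Hebrew letter as some other character, decoding comes down to a table lookup. The mapping entries and the name `decode_text` below are hypothetical; the real table in decode_pdf.py depends on the PDF's font encoding.

```python
# Hypothetical mapping: encoded character -> Hebrew letter.
# The real table depends on how the PDF's font encodes Hebrew.
EXAMPLE_MAPPING = {
    "a": "\u05d0",  # alef
    "b": "\u05d1",  # bet
    "g": "\u05d2",  # gimel
}

# Build the translation table once so it can be reused for every page.
_TABLE = str.maketrans(EXAMPLE_MAPPING)


def decode_text(encoded: str) -> str:
    """Replace each mapped character; leave unmapped characters unchanged."""
    return encoded.translate(_TABLE)


if __name__ == "__main__":
    print(decode_text("abg 123"))
```

Characters missing from the table (digits, spaces, punctuation) pass through untouched, which makes gaps in the mapping easy to spot in the output.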
Edit decode_pdf.py to specify your input file and page range:
# At the bottom of decode_pdf.py
if __name__ == "__main__":
    pdf_path = "your_file.pdf"  # Change to your PDF file
    start_page = 0  # First page to process (0-based index)
    end_page = 10  # Last page to process (0-based index)
    output_dir = "extracted_pages"
    extract_and_decode_pdf_pages(pdf_path, start_page, end_page, output_dir)

Then run:
python decode_pdf.py

Edit process_pdf.py to specify your input file:
# At the bottom of process_pdf.py
if __name__ == "__main__":
    pdf_path = "your_file.pdf"  # Change to your PDF file
    output_file = "final_text/full_processed_text.txt"
    process_pdf(pdf_path, output_file)

Then run:
python process_pdf.py

- Extract text from PDF files
- Decode the text using Hebrew character mappings
- Clean and process the text files
- Output as readable text/markdown files
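
The four steps listed above can be sketched roughly as follows. This is not the project's actual code: the helper names (`extract_pages`, `clean_text`, `process_pages`, `write_output`) are hypothetical, the cleaning rules are an assumption, the page range is treated as inclusive to match the comments in decode_pdf.py, and RTL correction is left out. Only the extraction step needs PyMuPDF, so it is imported inside that function.

```python
import re
from pathlib import Path


def extract_pages(pdf_path, start_page, end_page):
    """Return raw text for pages start_page..end_page (inclusive, 0-based)."""
    import fitz  # PyMuPDF; imported here so the other helpers work without it

    with fitz.open(pdf_path) as doc:
        # Clamp the range so a too-large end_page does not raise an error.
        last = min(end_page, doc.page_count - 1)
        return [doc[i].get_text() for i in range(start_page, last + 1)]


def clean_text(text):
    """Collapse runs of spaces/tabs and drop blank lines (assumed rules)."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def process_pages(raw_pages, decode):
    """Decode and clean each page, then join pages with a blank line."""
    return "\n\n".join(clean_text(decode(page)) for page in raw_pages)


def write_output(text, output_file):
    """Write UTF-8 text, creating the output directory if needed."""
    path = Path(output_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
```

Here `decode` is any function that maps an encoded string to Hebrew text, so the character mapping stays separate from extraction and cleaning.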