PDF Hebrew Text Decoder

This project provides tools to decode Hebrew text from specially encoded PDFs and process the extracted text for readability.

Installation

# Install dependencies
pip install pymupdf python-bidi
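
To confirm the install worked, a short check like the one below can help. It is only a sketch: the helper name missing_modules is hypothetical, and it assumes the usual import names for these packages (PyMuPDF is imported as fitz, python-bidi as bidi).

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names: "fitz" for PyMuPDF, "bidi" for python-bidi.
    missing = missing_modules(["fitz", "bidi"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies found.")
```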

Key Components

decode_pdf.py: Main script; extracts text from a PDF and decodes it using a character mapping
process_pdf.py: Runs the full processing pipeline on a PDF and writes the result to a single text file
text_processor.py: Cleans extracted text files and corrects right-to-left (RTL) ordering
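
The mapping that decode_pdf.py applies is not reproduced in this README. As a rough sketch of the general idea, assuming the extracted text contains Latin characters that stand in for Hebrew letters, a translation table can be applied to each string. The entries in CHAR_MAP and the name decode_text are hypothetical, not taken from the project.

```python
# Hypothetical mapping from extracted characters to Hebrew letters.
# The real table depends on the PDF's font encoding (assumption).
CHAR_MAP = {
    "a": "\u05d0",  # alef
    "b": "\u05d1",  # bet
    "g": "\u05d2",  # gimel
}

# str.maketrans turns the dict into a table usable by str.translate.
TRANSLATION_TABLE = str.maketrans(CHAR_MAP)


def decode_text(raw: str) -> str:
    """Replace every mapped character; leave unmapped ones unchanged."""
    return raw.translate(TRANSLATION_TABLE)


if __name__ == "__main__":
    print(decode_text("abg 1"))
```

Characters without an entry (digits, spaces, punctuation) pass through unchanged, which makes gaps in the mapping easy to spot in the output.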

Usage

Extracting and Decoding Pages

Edit decode_pdf.py to specify your input file and page range:

# At the bottom of decode_pdf.py
if __name__ == "__main__":
    pdf_path = "your_file.pdf"  # Change to your PDF file
    start_page = 0              # First page to process (0-based index)
    end_page = 10               # Last page to process (0-based index)
    output_dir = "extracted_pages"

    extract_and_decode_pdf_pages(pdf_path, start_page, end_page, output_dir)

Then run:

python decode_pdf.py
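
For orientation, here is a minimal sketch of how a page-range extraction like this could be written with PyMuPDF. It is not the project's implementation: the names page_range and extract_pages, the inclusive treatment of end_page, and the page_NNNN.txt file naming are all assumptions, and the decoding step is left out.

```python
from pathlib import Path
from typing import List


def page_range(start_page: int, end_page: int, page_count: int) -> range:
    """Return the 0-based, inclusive page range clamped to the document."""
    first = max(start_page, 0)
    last = min(end_page, page_count - 1)
    return range(first, last + 1)


def extract_pages(pdf_path: str, start_page: int, end_page: int,
                  output_dir: str) -> List[Path]:
    """Write the raw text of each page in the range to its own file."""
    import fitz  # PyMuPDF, imported lazily so page_range works without it

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for index in page_range(start_page, end_page, doc.page_count):
            text = doc[index].get_text()
            target = out / f"page_{index:04d}.txt"  # hypothetical naming
            target.write_text(text, encoding="utf-8")
            written.append(target)
    return written
```

Clamping the range means that asking for pages 0 to 10 of a five-page file processes pages 0 to 4 instead of raising an error.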

Full Processing Pipeline

Edit process_pdf.py to specify your input file:

# At the bottom of process_pdf.py
if __name__ == "__main__":
    pdf_path = "your_file.pdf"  # Change to your PDF file
    output_file = "final_text/full_processed_text.txt"
    
    process_pdf(pdf_path, output_file)

Then run:

python process_pdf.py

Workflow

Extract text from PDF files
Decode the text using Hebrew character mappings
Clean and process the text files
Write the result as readable text or Markdown files
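
The cleaning step in this workflow can be pictured with the sketch below. It assumes that cleaning means dropping lines that contain only a page number and collapsing runs of whitespace; the actual rules in text_processor.py may differ. The function names are hypothetical, and using get_display from python-bidi for RTL correction is likewise an assumption about how that library would be applied here.

```python
import re

# A line consisting only of digits is treated as a page number (assumption).
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")


def clean_text(raw: str) -> str:
    """Drop page-number and blank lines, and collapse inner whitespace."""
    lines = []
    for line in raw.splitlines():
        if PAGE_NUMBER.match(line):
            continue
        collapsed = " ".join(line.split())
        if collapsed:
            lines.append(collapsed)
    return "\n".join(lines)


def to_visual_order(line: str) -> str:
    """Reorder a logical-order line for left-to-right display."""
    from bidi.algorithm import get_display  # python-bidi, imported lazily

    return get_display(line)


if __name__ == "__main__":
    print(clean_text("  first   line \n 12 \n\nsecond line"))
```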
