fix(pdfocr): embed Unicode font so non-Latin OCR text layers are correct by yasen-pavlov · Pull Request #12 · gardar/ocrchestra

yasen-pavlov · 2026-06-15T00:30:12Z

Problem

The hOCR → PDF text layer is unreadable for non-Latin scripts. Selecting/copying or searching the OCR layer of a generated PDF returns mojibake — e.g. the Cyrillic БЪЛГАРИЯ comes back as Ð‘ÐªÐ›Ð“Ð•Ð Ð˜Ð¯. The visible page image is fine; only the invisible OCR layer is corrupt, so it's easy to miss until someone copies text or runs a search.

Root cause

pkg/pdfocr/layer.go (drawWord) force-encoded each word to ISO-8859-1 and drew it with fpdf's core Helvetica font:

latin1, err := charmap.ISO8859_1.NewEncoder().String(word.Text)
if err != nil {
    *encodingErrors++
    latin1 = word.Text // fallback to raw text
}
...
pdf.Text(x, y, latin1)

Latin-1 can't represent Cyrillic/Greek/etc., so the encoder errors and the raw UTF-8 bytes are written through a single-byte core font that has no ToUnicode CMap. Each UTF-8 byte is then interpreted as a separate WinAnsi character → the classic double-encoding mojibake.

Fix

Embed DejaVu Sans (pkg/pdfocr/fonts/DejaVuSans.ttf, covers Latin/Cyrillic/Greek/…) and register it via AddUTF8FontFromBytes.
Make it the default OCR-layer font (DefaultFont), still overridable through OCRConfig.Font.
Write the recognized text directly as UTF-8 (drop the lossy Latin-1 step), so fpdf emits a proper ToUnicode CMap and the layer is copy/paste- and search-correct for all scripts.
Remove the now-obsolete Latin-1 encode-error tracking.
Add a round-trip regression test (font_test.go): build a PDF with Cyrillic+Latin words, extract with pdftotext, assert they come back intact and contain no Ð mojibake marker.
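The assertion half of the round-trip test might look roughly like the sketch below. This is an assumption about its shape, not the contents of font_test.go: checkExtractedText is a hypothetical name, and the extracted string is hard-coded here where the real test would take it from pdftotext output.

```go
package main

import (
	"fmt"
	"strings"
)

// checkExtractedText (hypothetical helper) returns an error if an expected
// word is missing from the extracted text or if the "Ð" marker is present.
func checkExtractedText(extracted string, want []string) error {
	for _, w := range want {
		if !strings.Contains(extracted, w) {
			return fmt.Errorf("word %q missing from extracted text", w)
		}
	}
	if strings.Contains(extracted, "Ð") {
		return fmt.Errorf("extracted text contains mojibake marker %q", "Ð")
	}
	return nil
}

func main() {
	// Stand-in for text extracted from a correctly encoded PDF.
	good := "РЕПУБЛИКА БЪЛГАРИЯ Hello"
	fmt.Println(checkExtractedText(good, []string{"БЪЛГАРИЯ", "Hello"}) == nil) // true

	// Stand-in for text extracted from a PDF with the old Latin-1 bug.
	bad := "Ð\u0091Ðª Hello"
	fmt.Println(checkExtractedText(bad, []string{"Hello"}) != nil) // true
}
```

Checking for both the expected words and the absence of the marker catches the case where the Latin words survive but the Cyrillic ones are garbled.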

Latin documents are unaffected — this only changes how the text layer is encoded.

Validation

Verified end-to-end through a downstream consumer (paperless-gpt + Google Document AI) on real Bulgarian documents (a handwritten birth certificate, ID cards): before, pdftotext returned Ð…-style mojibake; after, it returns clean Cyrillic (РЕПУБЛИКА БЪЛГАРИЯ … УДОСТОВЕРЕНИЕ ЗА РАЖДАНЕ …). gofmt / go vet / go test ./... are clean.

Notes / open questions

The embedded TTF adds ~740 KB to the module. DejaVu Sans is permissively licensed (Bitstream Vera / Arev); its license is included at pkg/pdfocr/fonts/LICENSE. Happy to switch to a smaller/subset font, or to make the embedded font opt-in via a FontConfig.FontBytes/FontFile field if you'd rather not vendor a font by default — just say which you'd prefer.
The new test exercises the image path (AssembleWithOCR → createPDFFromImage); the existing-PDF path (ApplyOCR → modifyExistingPDF) shares the same drawWord, so both are covered by the fix.
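For the opt-in alternative raised in the first note, one possible shape is sketched below. Everything here is hypothetical: the PR only names FontConfig.FontBytes and FontFile as candidate fields, and the resolveFont helper and its precedence order (bytes, then file, then fallback) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// FontConfig is a hypothetical shape for the opt-in font option.
type FontConfig struct {
	FontBytes []byte // raw TTF data supplied by the caller
	FontFile  string // path to a TTF file on disk
}

// resolveFont (hypothetical helper) picks the font data to register:
// explicit bytes first, then a file path, then the provided fallback.
func resolveFont(cfg FontConfig, fallback []byte) ([]byte, error) {
	if len(cfg.FontBytes) > 0 {
		return cfg.FontBytes, nil
	}
	if cfg.FontFile != "" {
		data, err := os.ReadFile(cfg.FontFile)
		if err != nil {
			return nil, fmt.Errorf("read font file: %w", err)
		}
		return data, nil
	}
	if len(fallback) == 0 {
		return nil, fmt.Errorf("no font configured and no fallback available")
	}
	return fallback, nil
}

func main() {
	// Placeholder bytes stand in for an embedded TTF.
	embedded := []byte("placeholder for embedded TTF bytes")

	data, err := resolveFont(FontConfig{}, embedded)
	fmt.Println(len(data) == len(embedded), err) // true <nil>
}
```

With an empty fallback, a caller who configures nothing gets an explicit error rather than a silent return to a single-byte core font.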

🤖 Generated with Claude Code

drawWord force-encoded each word to ISO-8859-1 and drew it with fpdf's core Helvetica font. Latin-1 cannot represent non-Latin scripts, so the encoder failed and the raw UTF-8 bytes were written through a single-byte font, producing mojibake in the text layer (e.g. the Cyrillic "БЪЛГАРИЯ" became "Ð‘ÐªÐ›Ð“Ð•Ð Ð˜Ð¯"). The rendered page image was unaffected, but copy/paste and text search of the OCR layer returned garbage for Cyrillic, Greek, and other non-Latin scripts.

Embed DejaVu Sans (Latin/Cyrillic/Greek/…) and register it with AddUTF8FontFromBytes, make it the default OCR-layer font, and write the recognized text directly as UTF-8 so fpdf emits a proper ToUnicode CMap. The font stays overridable via OCRConfig.Font.

Also remove the now-obsolete Latin-1 encode-error tracking and add a round-trip regression test (render -> pdftotext) covering Cyrillic+Latin.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Yasen Pavlov <yasen.pavlov@bitnet.me>

yasen-pavlov · 2026-06-15T00:35:56Z

I put this together quickly with Claude Code to fix an issue I ran into locally, and thought it would make sense to let it open a PR. If you prefer, I can close it and create an issue instead.

yasen-pavlov closed this Jun 15, 2026

yasen-pavlov reopened this Jun 15, 2026

yasen-pavlov wants to merge 1 commit into gardar:main from yasen-pavlov:fix/unicode-ocr-text-layer
