fix(pdfocr): embed Unicode font so non-Latin OCR text layers are correct #12

Open
yasen-pavlov wants to merge 1 commit into gardar:main from yasen-pavlov:fix/unicode-ocr-text-layer

yasen-pavlov commented Jun 15, 2026

Problem

The hOCR → PDF text layer is unreadable for non-Latin scripts. Selecting/copying or searching the OCR layer of a generated PDF returns mojibake: e.g. the Cyrillic БЪЛГАРИЯ comes back as a run of Ð-prefixed characters along the lines of Ð‘ÐªÐ›Ð“…. The visible page image is fine; only the invisible OCR layer is corrupt, so it's easy to miss until someone copies text or runs a search.

Root cause

pkg/pdfocr/layer.go (drawWord) force-encoded each word to ISO-8859-1 and drew it with fpdf's core Helvetica font:

latin1, err := charmap.ISO8859_1.NewEncoder().String(word.Text)
if err != nil {
    *encodingErrors++
    latin1 = word.Text // fallback to raw text
}
...
pdf.Text(x, y, latin1)

Latin-1 can't represent Cyrillic/Greek/etc., so the encoder errors and the raw UTF-8 bytes are written through a single-byte core font that has no ToUnicode CMap. Each UTF-8 byte is then interpreted as a separate WinAnsi character → the classic double-encoding mojibake.

Fix

  • Embed DejaVu Sans (pkg/pdfocr/fonts/DejaVuSans.ttf, covers Latin/Cyrillic/Greek/…) and register it via AddUTF8FontFromBytes.
  • Make it the default OCR-layer font (DefaultFont), still overridable through OCRConfig.Font.
  • Write the recognized text directly as UTF-8 (drop the lossy Latin-1 step), so fpdf emits a proper ToUnicode CMap and the layer is copy/paste- and search-correct for all scripts.
  • Remove the now-obsolete Latin-1 encode-error tracking.
  • Add a round-trip regression test (font_test.go): build a PDF with Cyrillic+Latin words, extract with pdftotext, assert they come back intact and contain no Ð mojibake marker.
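The assertion half of such a round-trip test might look roughly like the sketch below. It is not the contents of font_test.go: extractText and checkTextLayer are hypothetical names, and the pdftotext invocation assumes the poppler-utils binary is on PATH. The Ð check suits a Cyrillic+Latin fixture; it would give false positives on text that legitimately contains that letter (Icelandic, for example).

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// extractText shells out to pdftotext and returns the PDF's text layer.
// The trailing "-" sends the extracted text to stdout.
func extractText(pdfPath string) (string, error) {
	out, err := exec.Command("pdftotext", "-enc", "UTF-8", pdfPath, "-").Output()
	if err != nil {
		return "", fmt.Errorf("pdftotext: %w", err)
	}
	return string(out), nil
}

// checkTextLayer returns an error if any expected word is missing from
// text, or if text contains "Ð", the marker left behind when UTF-8
// Cyrillic is decoded one byte at a time.
func checkTextLayer(text string, want []string) error {
	for _, w := range want {
		if !strings.Contains(text, w) {
			return fmt.Errorf("text layer is missing %q", w)
		}
	}
	if strings.Contains(text, "Ð") {
		return fmt.Errorf("text layer contains mojibake marker %q", "Ð")
	}
	return nil
}

func main() {
	// Clean extraction: both words present, no marker.
	fmt.Println(checkTextLayer("РЕПУБЛИКА БЪЛГАРИЯ Invoice", []string{"БЪЛГАРИЯ", "Invoice"}))
	// Corrupted extraction: the Latin word survives but the marker is present.
	fmt.Println(checkTextLayer("Ð‘ÐªÐ›Ð“ Invoice", []string{"Invoice"}))
}
```

In a real test, the string passed to checkTextLayer would come from extractText on the freshly generated PDF.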

Latin documents are unaffected — this only changes how the text layer is encoded.

Validation

Verified end-to-end through a downstream consumer (paperless-gpt + Google Document AI) on real Bulgarian documents (a handwritten birth certificate, ID cards): before, pdftotext returned Ð…-style mojibake; after, it returns clean Cyrillic (РЕПУБЛИКА БЪЛГАРИЯ … УДОСТОВЕРЕНИЕ ЗА РАЖДАНЕ …). gofmt / go vet / go test ./... are clean.

Notes / open questions

  • The embedded TTF adds ~740 KB to the module. DejaVu Sans is permissively licensed (Bitstream Vera / Arev); its license is included at pkg/pdfocr/fonts/LICENSE. Happy to switch to a smaller/subset font, or to make the embedded font opt-in via an OCRConfig.FontBytes/FontFile field if you'd rather not vendor a font by default; just say which you'd prefer.
  • The new test exercises the image path (AssembleWithOCR → createPDFFromImage); the existing-PDF path (ApplyOCR → modifyExistingPDF) shares the same drawWord, so both are covered by the fix.
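If the opt-in route were chosen, font resolution could work roughly as sketched below. Everything here is hypothetical: fontOptions stands in for whatever config struct would carry the new field, and resolveFont is an illustrative helper, not an existing function in this package.

```go
package main

import (
	"errors"
	"fmt"
)

// fontOptions is a hypothetical stand-in for the library's config; the
// field names mirror the ones floated above and are not an existing API.
type fontOptions struct {
	Font      string // font family name for the OCR layer
	FontBytes []byte // optional caller-supplied TTF data
}

// resolveFont picks the family name and TTF bytes to register.
// Caller-supplied bytes win; otherwise the built-in font is used, and
// an error is returned when the build vendors no font at all.
func resolveFont(opts fontOptions, builtinName string, builtin []byte) (string, []byte, error) {
	if len(opts.FontBytes) > 0 {
		name := opts.Font
		if name == "" {
			name = "custom"
		}
		return name, opts.FontBytes, nil
	}
	if len(builtin) > 0 {
		return builtinName, builtin, nil
	}
	return "", nil, errors.New("no Unicode font available: set FontBytes")
}

func main() {
	// No caller font and no vendored font: the caller gets an explicit error.
	_, _, err := resolveFont(fontOptions{}, "DejaVuSans", nil)
	fmt.Println(err)
}
```

The trade-off is that callers who need non-Latin text must then supply a font themselves, in exchange for a module that does not grow by the size of the TTF.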

🤖 Generated with Claude Code

drawWord force-encoded each word to ISO-8859-1 and drew it with fpdf's
core Helvetica font. Latin-1 cannot represent non-Latin scripts, so the
encoder failed and the raw UTF-8 bytes were written through a single-byte
font, producing mojibake in the text layer (e.g. the Cyrillic "БЪЛГАРИЯ"
became "Ð‘ÐªÐ›Ð“…"). The rendered page image was unaffected, but
copy/paste and text search of the OCR layer returned garbage for
Cyrillic, Greek, and other non-Latin scripts.

Embed DejaVu Sans (Latin/Cyrillic/Greek/…) and register it with
AddUTF8FontFromBytes, make it the default OCR-layer font, and write the
recognized text directly as UTF-8 so fpdf emits a proper ToUnicode CMap.
The font stays overridable via OCRConfig.Font.

Also remove the now-obsolete Latin-1 encode-error tracking and add a
round-trip regression test (render -> pdftotext) covering Cyrillic+Latin.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Yasen Pavlov <yasen.pavlov@bitnet.me>

yasen-pavlov commented Jun 15, 2026

I did this quickly with Claude Code to fix an issue I ran into locally, and thought it would make sense to let it open a PR. If you'd prefer, I can close this and create an issue instead.
