fix(pdfocr): embed Unicode font so non-Latin OCR text layers are correct #12
Open
yasen-pavlov wants to merge 1 commit into
drawWord force-encoded each word to ISO-8859-1 and drew it with fpdf's core Helvetica font. Latin-1 cannot represent non-Latin scripts, so the encoder failed and the raw UTF-8 bytes were written through a single-byte font, producing mojibake in the text layer (e.g. the Cyrillic "БЪЛГАРИЯ" became "БЪЛГЕРИЯ"). The rendered page image was unaffected, but copy/paste and text search of the OCR layer returned garbage for Cyrillic, Greek, and other non-Latin scripts.

Embed DejaVu Sans (Latin/Cyrillic/Greek/…) and register it with AddUTF8FontFromBytes, make it the default OCR-layer font, and write the recognized text directly as UTF-8 so fpdf emits a proper ToUnicode CMap. The font stays overridable via OCRConfig.Font.

Also remove the now-obsolete Latin-1 encode-error tracking and add a round-trip regression test (render -> pdftotext) covering Cyrillic+Latin.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Yasen Pavlov <yasen.pavlov@bitnet.me>
I did this quickly with Claude Code to fix an issue I ran into locally, and thought it would make sense to let it open a PR for it. If you prefer, I can close it and create an issue instead.
Problem
The hOCR → PDF text layer is unreadable for non-Latin scripts. Selecting/copying or searching the OCR layer of a generated PDF returns mojibake, e.g. the Cyrillic БЪЛГАРИЯ comes back as БЪЛГЕРИЯ. The visible page image is fine; only the invisible OCR layer is corrupt, so it's easy to miss until someone copies text or runs a search.
Root cause
pkg/pdfocr/layer.go (drawWord) force-encoded each word to ISO-8859-1 and drew it with fpdf's core Helvetica font. Latin-1 can't represent Cyrillic/Greek/etc., so the encoder errors and the raw UTF-8 bytes are written through a single-byte core font that has no ToUnicode CMap. Each UTF-8 byte is then interpreted as a separate WinAnsi character, which yields the classic double-encoding mojibake.
Fix
- Embed DejaVu Sans (pkg/pdfocr/fonts/DejaVuSans.ttf, covers Latin/Cyrillic/Greek/…) and register it via AddUTF8FontFromBytes.
- Make it the default OCR-layer font (DefaultFont), still overridable through OCRConfig.Font.
- Write the recognized text directly as UTF-8 so fpdf emits a proper ToUnicode CMap and the layer is copy/paste- and search-correct for all scripts.
- Add a round-trip regression test (font_test.go): build a PDF with Cyrillic+Latin words, extract with pdftotext, assert they come back intact and contain no Ð mojibake marker.
Latin documents are unaffected; this only changes how the text layer is encoded.
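To make the root cause concrete, here is a minimal stdlib-only Go sketch of the double-encoding effect. It is not code from this PR: misreadAsLatin1 and hasMojibakeMarker are hypothetical helper names, and ISO-8859-1 is used as a stand-in for WinAnsi, which maps some bytes in the 0x80–0x9F range to different characters. The marker check mirrors the kind of assertion the regression test is described as making on pdftotext output.

```go
package main

import (
	"fmt"
	"strings"
)

// misreadAsLatin1 simulates the bug: every UTF-8 byte of s is treated as one
// ISO-8859-1 character (byte 0xNN becomes code point U+00NN).
// Hypothetical helper, used here only for illustration.
func misreadAsLatin1(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ { // iterate over bytes, not runes
		b.WriteRune(rune(s[i]))
	}
	return b.String()
}

// hasMojibakeMarker reports whether text contains "Ð" (U+00D0). The UTF-8
// encoding of many Cyrillic letters starts with byte 0xD0, which becomes "Ð"
// when the bytes are misread as single-byte characters.
func hasMojibakeMarker(text string) bool {
	return strings.ContainsRune(text, 'Ð')
}

func main() {
	word := "БЪЛГАРИЯ"
	garbled := misreadAsLatin1(word)

	fmt.Printf("original: %d runes, misread: %d runes\n",
		len([]rune(word)), len([]rune(garbled)))
	fmt.Printf("misread text: %q\n", garbled)
	fmt.Println("marker in original:", hasMojibakeMarker(word))
	fmt.Println("marker in misread: ", hasMojibakeMarker(garbled))
}
```

Pure ASCII input passes through unchanged, because its UTF-8 bytes and Latin-1 code points coincide. That fits the description above that only non-Latin scripts were visibly affected.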
Validation
Verified end-to-end through a downstream consumer (paperless-gpt + Google Document AI) on real Bulgarian documents (a handwritten birth certificate, ID cards): before, pdftotext returned Ð…-style mojibake; after, it returns clean Cyrillic (РЕПУБЛИКА БЪЛГАРИЯ … УДОСТОВЕРЕНИЕ ЗА РАЖДАНЕ …). gofmt / go vet / go test ./... are clean.
Notes / open questions
- … pkg/pdfocr/fonts/LICENSE. Happy to switch to a smaller/subset font, or to make the embedded font opt-in via a FontConfig.FontBytes/FontFile field if you'd rather not vendor a font by default; just say which you'd prefer.
- … (AssembleWithOCR → createPDFFromImage); the existing-PDF path (ApplyOCR → modifyExistingPDF) shares the same drawWord, so both are covered by the fix.
🤖 Generated with Claude Code