integration: consolidated PDF.js parsing resilience fixes#809

Closed
vitormattos wants to merge 75 commits into
smalot:masterfrom
vitormattos:integration/pdfjs-fixes

@vitormattos vitormattos commented Apr 25, 2026

PDF.js validation: aggregated fixes

Objective: Validate smalot/pdfparser against the Mozilla PDF.js corpus by comparing parser output with pdfinfo (Poppler reference implementation).

Validation approach: Run the parser on each PDF and extract the page count, then compare it against the page count reported by pdfinfo. Matching counts are classified as ok; a mismatch or a parser error is an issue to fix.
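
The classification rule can be sketched as a small function. This is an illustrative Python sketch, not code from this PR (the library itself is PHP): `classify` and its arguments are hypothetical names, and `None` is an assumed convention for "the tool failed to produce a page count". The status names are the ones used in the results table.

```python
from typing import Optional


def classify(parser_pages: Optional[int], pdfinfo_pages: Optional[int]) -> str:
    """Map one file's page counts to a validation status.

    None means the corresponding tool errored out (assumed convention).
    """
    if parser_pages is None and pdfinfo_pages is None:
        return "both_error"
    if parser_pages is None:
        return "parser_error"
    if pdfinfo_pages is None:
        # The parser read the file, but there is no reference count to compare with.
        return "pdfinfo_error"
    return "ok" if parser_pages == pdfinfo_pages else "mismatch"
```

Under this scheme a file only counts as ok when both tools succeed and agree, which is why files that pdfinfo cannot read never contribute to the success rate.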

Results (PDF.js corpus, 930 files / 929 unique hashes):

| Status | Baseline | This branch | Δ |
|---|---|---|---|
| ✅ ok | 865 | 909 | +44 |
| ❌ parser_error | 41 | 0 | -41 |
| ⚠️ both_error | 13 | 0 | -13 |
| ℹ️ pdfinfo_error | 7 | 20 | +13 |
| 🔀 mismatch | 3 | 0 | -3 |
| Total | 929 | 929 | |

Success rate: 93.1% → 97.8% (+4.7pp)
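
For reference, the success rate is the ok count divided by the 929 unique hashes. A short Python check of that arithmetic (the helper name `success_rate` is illustrative only):

```python
TOTAL = 929  # unique hashes in the PDF.js corpus


def success_rate(ok_count: int, total: int = TOTAL) -> float:
    """Percentage of corpus entries classified as ok."""
    return ok_count / total * 100


baseline = success_rate(865)
branch = success_rate(909)
print(f"{baseline:.1f}% -> {branch:.1f}% ({branch - baseline:+.1f}pp)")
# prints: 93.1% -> 97.8% (+4.7pp)
```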

Deduplication note: the validation state is keyed by SHA-256, so two byte-identical files (empty.pdf and empty#hash.pdf) collapse into one effective corpus entry. The directory contains 930 PDFs, but the aggregate counts operate on 929 unique hashes.
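
A minimal sketch of how keying by SHA-256 collapses byte-identical files, written in Python for illustration. The function name and the in-memory `corpus` dictionary are assumptions, and the byte contents are placeholders, not the real files:

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def group_by_sha256(files: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Group file names by the SHA-256 digest of their contents."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, content in files.items():
        groups[hashlib.sha256(content).hexdigest()].append(name)
    return dict(groups)


# Placeholder contents: two names share identical bytes, one differs.
corpus = {
    "empty.pdf": b"placeholder bytes",
    "empty#hash.pdf": b"placeholder bytes",
    "other.pdf": b"different bytes",
}
groups = group_by_sha256(corpus)
print(len(corpus), "files,", len(groups), "unique hashes")
```

Any state stored per digest therefore has one entry fewer than the directory listing whenever two files are identical.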

Note on pdfinfo_error (+13): this is a classification shift, not a parser regression. The increase equals the 13 files that left both_error: the parser-side failures were fixed, while pdfinfo still reports malformed-input errors for these files (e.g. Poppler syntax or xref failures). The parser now successfully reads more PDFs than pdfinfo can process.

Additional validation:

  • VeraPDF corpus: 2,907 PDFs validated with 100% pass rate on baseline tests

Review options:

The maintainer can choose either path:

  1. Review and merge individual fixes: Each focused fix reviewed and merged separately (granular approach)
  2. Fast-track integration PR: Remove draft status and merge this consolidated PR (aggregated approach)

Both paths achieve the same end state. Choose based on review capacity and preference.

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>
Some PDFs include bytes before the %PDF- header while still using
absolute xref offsets from the beginning of the file.

The parser trimmed data before %PDF-, which shifted offsets and caused
xref lookup failures. This manifested as an Invalid object reference
error in the veraPDF corpus header case.
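
The offset shift can be reproduced with a few bytes. The sketch below is Python for illustration only and does not mirror `RawDataParser::parseData`; the sample bytes and the helper `object_at` are made up for this example:

```python
def object_at(data: bytes, offset: int, length: int = 7) -> bytes:
    """Return the bytes found at an absolute offset, as an xref lookup would."""
    return data[offset:offset + length]


preamble = b"JUNK\n"  # bytes that precede the %PDF- header
body = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
original = preamble + body

# The writer recorded this offset relative to the start of the file.
offset = original.index(b"1 0 obj")

# Trimming everything before %PDF- moves every byte, but not the recorded offset.
trimmed = original[original.index(b"%PDF-"):]

print(object_at(original, offset))  # lookup lands on the object header
print(object_at(trimmed, offset))   # lookup lands len(preamble) bytes too late
```

Keeping the original byte layout means recorded offsets stay valid without any per-offset correction.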

Changes:
- Keep original byte layout in RawDataParser::parseData
- Add stricter trailer key matching for /Size /Root /Encrypt /Info /Prev
- Add defensive handling in xref stream resolution when startxref is near,
  but not exactly at, the xref stream object
- Add regression fixture and integration test

Regression fixture:
- samples/bugs/PullRequestInvalidObjectReference.pdf

Test:
- DocumentIssueFocusTest::testParseFileWithCompressedObjRefInXrefStream

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>
Some PDFs set startxref to the whitespace immediately before the
xref keyword instead of the first letter of xref.

The parser required an exact match and incorrectly switched to xref
stream decoding, which then failed with Invalid object reference.
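
A minimal sketch of the whitespace tolerance, in Python for illustration (the actual change is in the PHP parser and may differ). The function names are hypothetical; the whitespace set is the six white-space characters defined by the PDF specification (NUL, TAB, LF, FF, CR, SPACE):

```python
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def skip_pdf_whitespace(data: bytes, offset: int) -> int:
    """Advance offset past any PDF white-space characters."""
    while offset < len(data) and data[offset] in PDF_WHITESPACE:
        offset += 1
    return offset


def points_to_classic_xref(data: bytes, startxref: int) -> bool:
    """True if startxref, after skipping whitespace, lands on the xref keyword."""
    adjusted = skip_pdf_whitespace(data, startxref)
    return data.startswith(b"xref", adjusted)


# startxref = 9 points at the space before "xref" rather than at its first letter.
sample = b"%PDF-1.4\n \nxref\n0 1\n"
print(points_to_classic_xref(sample, 9))
```

With an exact-match check the same input would be treated as an xref stream; the adjusted offset is also the one to use when decoding the classic table.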

Changes:
- Skip PDF whitespace before checking startxref position
- Use adjusted offset when decoding classic xref
- Apply same whitespace tolerance for Unix line-ending detection
- Tighten trailer key regexes to match /Size /Root /Encrypt /Info /Prev
- Add regression fixture and integration test

Regression fixture:
- samples/bugs/PullRequestXrefWhitespaceStart.pdf

Test:
- DocumentIssueFocusTest::testParseFileWhenStartxrefPointsToLeadingWhitespace

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>
@vitormattos vitormattos reopened this Apr 25, 2026
@vitormattos vitormattos force-pushed the integration/pdfjs-fixes branch from e786ad9 to b1684d5 on April 25, 2026 23:27
@vitormattos vitormattos force-pushed the integration/pdfjs-fixes branch from 09d17df to 5779af8 on April 28, 2026 18:34
@vitormattos vitormattos force-pushed the integration/pdfjs-fixes branch from a624421 to 7290bc9 on April 28, 2026 18:49
@vitormattos vitormattos force-pushed the integration/pdfjs-fixes branch from 81cd214 to cc984c1 on April 29, 2026 02:23
@vitormattos vitormattos force-pushed the integration/pdfjs-fixes branch from be82edc to 30d369f on April 29, 2026 03:48
@vitormattos vitormattos force-pushed the integration/pdfjs-fixes branch from 30d369f to 3a6f1ac on April 29, 2026 04:01
@vitormattos vitormattos force-pushed the integration/pdfjs-fixes branch from 3a6f1ac to 16f2e3f on April 29, 2026 13:51
@vitormattos
Author

Consolidated into #795 to avoid repeated cross-PR conflicts and keep a single review lane. Closing this PR as part of the branch unification strategy.
