integration: consolidated PDF.js parsing resilience fixes by vitormattos · Pull Request #809 · smalot/pdfparser

vitormattos · 2026-04-25T02:37:20Z

PDF.js validation: aggregated fixes

Objective: Validate smalot/pdfparser against the Mozilla PDF.js corpus by comparing parser output with pdfinfo (Poppler reference implementation).

Validation approach: Run the parser on each PDF and extract the page count, then compare it against the page count reported by pdfinfo. Matching counts are classified as ok; a mismatch or a parser error is an issue to fix.

Results (PDF.js corpus, 930 files / 929 unique hashes):

Status	Baseline	This branch	Δ
✅ ok	865	909	+44
❌ parser_error	41	0	-41
⚠️ both_error	13	0	-13
ℹ️ pdfinfo_error	7	20	+13
🔀 mismatch	3	0	-3
Total	929	929	—

Success rate: 93.1% → 97.8% (+4.7pp)
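The success rate can be re-derived from the table, assuming "success" means the ok row divided by the 929 unique hashes. The snippet below is only an arithmetic check in Python; the constant and function names are made up for illustration.

```python
# Counts copied from the results table; "ok" is treated as success (assumption).
TOTAL = 929
OK_BASELINE = 865
OK_BRANCH = 909


def success_rate(ok: int, total: int) -> float:
    """Return the share of ok files as a percentage."""
    return 100.0 * ok / total


baseline = success_rate(OK_BASELINE, TOTAL)
branch = success_rate(OK_BRANCH, TOTAL)
print(f"{baseline:.1f}% -> {branch:.1f}% ({branch - baseline:+.1f}pp)")
# prints: 93.1% -> 97.8% (+4.7pp)
```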

Deduplication note: the validation state is keyed by SHA-256, so two byte-identical files (empty.pdf and empty#hash.pdf) collapse into one effective corpus entry. The directory contains 930 PDFs, but the aggregate counts operate on 929 unique hashes.
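A minimal sketch of hash-keyed deduplication, assuming the validation state is a mapping from SHA-256 digest to file names. The validation tool itself is not shown in this PR, so the function name and the in-memory "files" below are hypothetical placeholders.

```python
import hashlib


def group_by_sha256(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group file names by the SHA-256 digest of their contents."""
    groups: dict[str, list[str]] = {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        groups.setdefault(digest, []).append(name)
    return groups


# Placeholder contents standing in for files on disk (not the real corpus bytes).
corpus = {
    "empty.pdf": b"same bytes",
    "empty#hash.pdf": b"same bytes",
    "other.pdf": b"different bytes",
}
groups = group_by_sha256(corpus)
print(len(corpus), "files,", len(groups), "unique hashes")
# prints: 3 files, 2 unique hashes
```

Two byte-identical files end up under one key, which is why a directory count can exceed the number of entries in the aggregate totals by one.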

Note on pdfinfo_error (+13): this is a classification shift, not a parser regression. These files are cases where parser-side failures were reduced while pdfinfo still reports malformed-input errors (e.g. Poppler syntax or xref failures). The parser now successfully reads more PDFs than pdfinfo can process.
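The status names in the table suggest a classification along the following lines. This is a sketch inferred from the descriptions above, not the actual validation script: the function name is hypothetical, and `None` is used here to mean "the tool failed on this file".

```python
from typing import Optional


def classify(parser_pages: Optional[int], pdfinfo_pages: Optional[int]) -> str:
    """Map two page counts to a status; None means that tool reported an error."""
    if parser_pages is None and pdfinfo_pages is None:
        return "both_error"
    if parser_pages is None:
        return "parser_error"
    if pdfinfo_pages is None:
        return "pdfinfo_error"
    return "ok" if parser_pages == pdfinfo_pages else "mismatch"


# A file that both tools rejected on the baseline...
print(classify(None, None))  # both_error
# ...moves to pdfinfo_error once the parser can read it.
print(classify(3, None))     # pdfinfo_error
```

Under these rules, fixing the parser on a file that pdfinfo still rejects moves it from both_error to pdfinfo_error, which matches the -13 / +13 shift in the table.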

Additional validation:

veraPDF corpus: 2,907 PDFs validated with a 100% pass rate on baseline tests

Review options:

Maintainer can choose either path:

Review and merge individual fixes: Each focused fix reviewed and merged separately (granular approach)
Fast-track integration PR: Remove draft status and merge this consolidated PR (aggregated approach)

Both paths achieve the same end state. Choose based on review capacity and preference.

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

Some PDFs include bytes before the %PDF- header while still using absolute xref offsets from the beginning of the file. The parser trimmed data before %PDF-, which shifted offsets and caused xref lookup failures. This manifested as an Invalid object reference error in the veraPDF corpus header case.

Changes:
- Keep original byte layout in RawDataParser::parseData
- Add stricter trailer key matching for /Size /Root /Encrypt /Info /Prev
- Add defensive handling in xref stream resolution when startxref is near, but not exactly at, the xref stream object
- Add regression fixture and integration test

Regression fixture:
- samples/bugs/PullRequestInvalidObjectReference.pdf

Test:
- DocumentIssueFocusTest::testParseFileWithCompressedObjRefInXrefStream

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>
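The offset shift described in this commit message can be reproduced on a toy buffer. The Python sketch below is illustrative only and is not the code in RawDataParser::parseData; the helper name and the sample bytes are invented for the example.

```python
HEADER = b"%PDF-"


def xref_at(data: bytes, startxref: int) -> bool:
    """True if the classic xref keyword sits at the given absolute offset."""
    return data[startxref:startxref + 4] == b"xref"


# Hypothetical file: junk precedes the header, and the startxref value was
# computed from the very first byte of the file.
junk = b"JUNK\r\n"
body = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
original = junk + body + b"xref\n0 1\n"
startxref = len(junk) + len(body)

# Trimming everything before %PDF- shifts every absolute offset by len(junk).
trimmed = original[original.find(HEADER):]

print(xref_at(original, startxref))  # True
print(xref_at(trimmed, startxref))   # False
```

Keeping the original byte layout means the stored offsets remain valid without any correction term.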

Some PDFs set startxref to the whitespace immediately before the xref keyword instead of the first letter of xref. The parser required an exact match and incorrectly switched to xref stream decoding, which then failed with Invalid object reference.

Changes:
- Skip PDF whitespace before checking startxref position
- Use adjusted offset when decoding classic xref
- Apply same whitespace tolerance for Unix line-ending detection
- Tighten trailer key regexes to match /Size /Root /Encrypt /Info /Prev
- Add regression fixture and integration test

Regression fixture:
- samples/bugs/PullRequestXrefWhitespaceStart.pdf

Test:
- DocumentIssueFocusTest::testParseFileWhenStartxrefPointsToLeadingWhitespace

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>
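The whitespace tolerance can be sketched as follows. This is a Python illustration of the idea, not the PHP change itself; the function names are hypothetical. The whitespace set is the six characters the PDF specification defines as white space (NUL, tab, line feed, form feed, carriage return, space).

```python
# The six PDF white-space characters.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def skip_pdf_whitespace(data: bytes, offset: int) -> int:
    """Advance offset past any PDF white-space bytes and return the new offset."""
    while offset < len(data) and data[offset] in PDF_WHITESPACE:
        offset += 1
    return offset


def points_to_classic_xref(data: bytes, startxref: int) -> bool:
    """Check for the xref keyword at startxref, tolerating leading whitespace."""
    adjusted = skip_pdf_whitespace(data, startxref)
    return data.startswith(b"xref", adjusted)


# Hypothetical tail of a file whose startxref points at the newline before xref.
data = b"endobj\n\nxref\n0 1\n"
startxref = 6

print(data[startxref:startxref + 4] == b"xref")  # False (exact match fails)
print(points_to_classic_xref(data, startxref))   # True
```

The adjusted offset returned by the skip step is also the one that would be handed to the classic xref decoder, so the table is read from the keyword itself rather than from the whitespace before it.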

vitormattos · 2026-04-30T15:25:01Z

Consolidated into #795 to avoid repeated cross-PR conflicts and keep a single review lane. Closing this PR as part of branch unification strategy.

vitormattos added 10 commits April 23, 2026 22:55

fix: deduplicate duplicate kids references in getPages

f195d27

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

test: add duplicate-kids PDF fixture regression

a617540

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

test: add @see link for duplicate-kids regression

ace7d51

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

fix: support php 7.1 in page deduplication

bbbd1d3

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

test: use assertCount for page count assertion

917ad5d

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

fix: guard cyclic page tree traversal

1ae0081

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

fix: recover repeated page refs in cyclic page trees

e24b1c2

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

fix: recover pages when xref entries are partially missing

7bf64b8

fix: recover root object when xref points to invalid offset

4996a8f

vitormattos mentioned this pull request Apr 25, 2026

fix(rawdata): consolidate invalid xref/object-reference recovery #812

Closed

tests: trim PR812 scope in DocumentIssueFocusTest

e2f33e3

vitormattos mentioned this pull request Apr 25, 2026

sync: include missing PR806 follow-up commit in integration vitormattos/pdfparser#24

Closed

vitormattos force-pushed the integration/pdfjs-fixes branch from ff2abc5 to e786ad9 Compare April 25, 2026 19:00

vitormattos added 5 commits April 25, 2026 17:08

fix: recover when startxref points into xref trailer

70361ca

test: move PR796 regression to RawDataParserTest

b8ec7b3

test: add pdf.js compressed xref regression

edbacca

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

test: clarify pull request fixture provenance

cc85357

Signed-off-by: Vitor Mattos <1079143+vitormattos@users.noreply.github.com>

vitormattos closed this Apr 25, 2026

vitormattos reopened this Apr 25, 2026

vitormattos force-pushed the integration/pdfjs-fixes branch from e786ad9 to b1684d5 Compare April 25, 2026 23:27

vitormattos added 8 commits April 25, 2026 20:39

test(rawdata): keep PR796/797 regressions in RawDataParserTest only

cbd0bbf

test(pages): keep cyclic pages regression in PagesTest

0692f90

style(tests): fix import order in PagesTest

e1e08e9

fix(rawdata): recover malformed xref/startxref scenarios from PR809 s…

9629034

…tack

fix(memory): guard flate decoding and add memory limit helper

6815ca8

test(pages): align cyclic pages expectation with dedup behavior

b0471aa

test(pages): fix PR806 standalone cyclic pages expectation

d22ae73

test(pages): make cyclic pages assertion merge-safe

181268f

vitormattos added 2 commits April 28, 2026 15:30

docs(test): drop unnecessary hashes for synthetic fixture

1aa434a

docs(test): trim synthetic fixture docblock to essential note

76afa31

vitormattos force-pushed the integration/pdfjs-fixes branch from 09d17df to 5779af8 Compare April 28, 2026 18:34

fix(pages): normalize Kids in collectPages traversal

a27049d

vitormattos force-pushed the integration/pdfjs-fixes branch from a624421 to 7290bc9 Compare April 28, 2026 18:49

vitormattos added 2 commits April 28, 2026 16:26

fix(rawdata): tolerate malformed prev xref chain and add REDHAT regre…

da300de

…ssion

chore: drop internal diagnose-parser tool from public PR

0720cda

vitormattos force-pushed the integration/pdfjs-fixes branch from 81cd214 to cc984c1 Compare April 29, 2026 02:23

vitormattos added 6 commits April 28, 2026 23:30

fix(rawdata-next): align conflict hotspots with integration-resolved …

a51f137

…versions

refactor(rawdata-next): delegate shared parser/rawdata test ownership…

7b96814

… to recover-invalid branch

refactor(rawdata-next): minimize overlap with pages-tree and memory-g…

ab9a4ef

…uard ownership

fix(tests): align rawdata fixture paths in PR816

0600fd2

fix(rawdata): restore xref and objref recovery logic for PR816

27e1186

docs(tests): keep only external PDF @see links in PR816 rawdata tests

d0017ca

vitormattos force-pushed the integration/pdfjs-fixes branch from be82edc to 30d369f Compare April 29, 2026 03:48

refactor(pr817): isolate non-overlapping fixture scope

e5c71b7

vitormattos force-pushed the integration/pdfjs-fixes branch from 30d369f to 3a6f1ac Compare April 29, 2026 04:01

vitormattos added 10 commits April 29, 2026 01:24

fix(pages): recover malformed page-like kids

77d435b

fix(rawdata): tolerate recoverable headerless inputs

ced72e9

Merge PR smalot#795 into PR809 recreation 20260429

c624613

Merge PR smalot#806 into PR809 recreation 20260429

1eecc75

# Conflicts: # src/Smalot/PdfParser/Document.php

Merge PR smalot#812 into PR809 recreation 20260429

2cc6c1d

Merge PR smalot#813 into PR809 recreation 20260429

b08699b

Merge PR smalot#816 into PR809 recreation 20260429

b416f17

# Conflicts: # src/Smalot/PdfParser/RawData/RawDataParser.php # tests/PHPUnit/Integration/RawData/RawDataParserTest.php

Merge rawdata recovery stack into PR809 recreation 20260429

e86f180

# Conflicts: # src/Smalot/PdfParser/Parser.php # src/Smalot/PdfParser/RawData/RawDataParser.php # tests/PHPUnit/Integration/RawData/RawDataParserTest.php

Merge PR smalot#817 into PR809 recreation 20260429

ad64bdc

fix(pages): remove duplicated declared-count recovery method

16f2e3f

vitormattos force-pushed the integration/pdfjs-fixes branch from 3a6f1ac to 16f2e3f Compare April 29, 2026 13:51

vitormattos closed this Apr 30, 2026

vitormattos wants to merge 75 commits into smalot:master from vitormattos:integration/pdfjs-fixes
