
Improve PDF.js corpus resilience with parser-native page recovery, dimensions API, and regression coverage#795

Open
vitormattos wants to merge 110 commits into smalot:master from vitormattos:fix/getpages-deduplicate-first-pr

Conversation


@vitormattos vitormattos commented Apr 24, 2026

Objective

Strengthen parser reliability on malformed real-world PDFs and validate outcomes directly by parser readability (not by external tool comparison).

What this PR changes

  • Fixes page-tree traversal and recovery paths to handle malformed/cyclic/repeated Kids references more safely.
  • Improves page box resolution and fallback behavior for malformed or missing page boxes.
  • Adds robust handling for readable-encrypted edge cases (files marked as encrypted but still readable).
  • Expands regression coverage with additional PDF.js fixtures, including broken xref/stream/encryption-marked cases.
  • Introduces a native page dimensions API:
    • Page::getDimensions($boxName = 'CropBox')
    • Document::getPagesDimensions($boxName = 'CropBox')
  • Simplifies usage documentation to use the new native dimensions API instead of workaround logic.
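A minimal sketch of how the new API could be used. Only the two method signatures above come from this PR; the fixture path and the assumption that `getDimensions()` returns a `[width, height]` pair are mine and may not match the actual implementation:

```php
<?php
// Hypothetical usage of the dimensions API introduced in this PR.
// Only the method signatures are taken from the PR description; the
// fixture path and the [width, height] return shape are assumptions.

require 'vendor/autoload.php';

use Smalot\PdfParser\Parser;

$parser   = new Parser();
$document = $parser->parseFile('samples/example.pdf'); // hypothetical path

// Whole document at once; defaults to 'CropBox' per the signature above.
$allDimensions = $document->getPagesDimensions();

// Per page, requesting a specific box instead of the default:
foreach ($document->getPages() as $i => $page) {
    [$width, $height] = $page->getDimensions('MediaBox'); // assumed return shape
    printf("Page %d: %.2f x %.2f pt\n", $i + 1, $width, $height);
}
```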

Validation approach

  • Parser-centric validation over the Mozilla PDF.js corpus.
  • Each file is classified only by parser readability:
    • Readable (plain)
    • Readable (encryption-marked)
    • Unreadable by parser
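The three-bucket classification above could be sketched roughly as follows. The buckets themselves are from this PR; the detection heuristic (searching the raw bytes for `/Encrypt`) and the corpus path are my assumptions, not the harness actually used:

```php
<?php
// Sketch of a parser-readability classifier over a PDF corpus.
// Buckets match the PR description; the /Encrypt substring check and
// the glob path are assumptions of mine.

require 'vendor/autoload.php';

use Smalot\PdfParser\Parser;

function classify(string $path): string
{
    try {
        $document = (new Parser())->parseFile($path);
        $document->getPages(); // force page-tree traversal
        $marked = str_contains((string) file_get_contents($path), '/Encrypt');

        return $marked ? 'readable-encryption-marked' : 'readable-plain';
    } catch (\Exception $e) {
        return 'unreadable';
    }
}

$counts = ['readable-plain' => 0, 'readable-encryption-marked' => 0, 'unreadable' => 0];
foreach (glob('pdf.js/test/pdfs/*.pdf') as $file) { // assumed corpus location
    ++$counts[classify($file)];
}
print_r($counts);
```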

Results (PDF.js corpus: 930 files / 929 unique hashes)

| Status | Baseline (master) | This branch | Δ |
| --- | --- | --- | --- |
| ✅ Readable (plain) | 870 | 922 | +52 |
| 🔐 Readable (encryption-marked) | 2 | 6 | +4 |
| ❌ Unreadable by parser | 58 | 2 | -56 |
| **Total** | 930 | 930 | - |
| parser_success_rate | 93.7634% | 99.7849% | +6.0215 pp |

Additional validation

  • Full PHPUnit suite remains green: 273 tests, 1649 assertions.
  • VeraPDF corpus baseline validation remains stable.

Notes

  • This PR does not claim full decryption support for explicit user-password protected PDFs.
  • It does improve handling of encryption-marked files that are readable in practice.

@vitormattos vitormattos force-pushed the fix/getpages-deduplicate-first-pr branch from 4f628d3 to bbbd1d3 on April 24, 2026 at 01:55
Collaborator

k00ni commented Apr 24, 2026

@vitormattos Thank you for all these PRs 🚀

I'm really busy right now and don't have much time, but I'll try to make time for it over the next few weeks. What I've seen so far looks very solid. If @j0k3r or others from the community wanna step in, I'd be happy to assist.

@vitormattos vitormattos deleted the fix/getpages-deduplicate-first-pr branch April 24, 2026 13:46
@vitormattos vitormattos restored the fix/getpages-deduplicate-first-pr branch April 24, 2026 14:19
@vitormattos vitormattos reopened this Apr 24, 2026
vitormattos added a commit to vitormattos/pdfparser that referenced this pull request Apr 27, 2026
Collaborator

k00ni commented Apr 28, 2026

@vitormattos Several PRs were closed very recently and new ones replaced them (to some extent). It seems you are currently doing a deep dive into the library. I only have so much time to look into these. Which one(s) should I prioritize?

@vitormattos
Author

@k00ni

Short answer:

Yes, there are two possible ways to review: go through the smaller focused PRs individually, or review #809, since #809 is an integration PR that combines everything I've been working on.


More context

I used pdfparser in LibreSign to extract page dimensions and generate stamped PDFs with footers. Over time, we started receiving bug reports related to parsing inconsistencies.

As a workaround, I switched to pdfinfo (from Poppler), which works well but requires system-level dependencies. This makes setup harder in some environments, especially where installing external tools is restricted.

After about a year using this workaround, I decided to come back to pdfparser and improve it so it can behave closer to pdfinfo for our use case.

To do that, I built a test engine that compares outputs between pdfparser and pdfinfo using large PDF datasets. Then I started fixing cases where pdfparser behaves differently (mainly page count and dimensions).
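The comparison idea described above could be sketched like this: run Poppler's `pdfinfo` and smalot/pdfparser over the same file and flag page-count disagreements. The helper names and corpus path are mine, not taken from the actual test engine:

```php
<?php
// Rough sketch of a pdfparser-vs-pdfinfo comparison harness.
// Helper names and the corpus path are hypothetical; only the general
// approach (compare outputs over a dataset) is from the PR discussion.

require 'vendor/autoload.php';

use Smalot\PdfParser\Parser;

function pdfinfoPageCount(string $path): ?int
{
    $out = shell_exec('pdfinfo ' . escapeshellarg($path));

    return preg_match('/^Pages:\s+(\d+)/m', (string) $out, $m) ? (int) $m[1] : null;
}

function parserPageCount(string $path): ?int
{
    try {
        return \count((new Parser())->parseFile($path)->getPages());
    } catch (\Exception $e) {
        return null; // unreadable by parser
    }
}

foreach (glob('corpus/*.pdf') as $file) { // hypothetical dataset location
    $expected = pdfinfoPageCount($file);
    $actual   = parserPageCount($file);
    if ($expected !== $actual) {
        printf("MISMATCH %s: pdfinfo=%s parser=%s\n", $file,
            var_export($expected, true), var_export($actual, true));
    }
}
```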

About the PR structure

Initially, I tried a single integration PR, but it quickly became hard to maintain due to conflicts and overlapping changes.

So I reorganized the work into smaller PRs grouped by related fixes, making them easier to review and safer to merge.

More details about the PR reorganization (optional to read)

At first, I had one large integration branch, but as fixes evolved, I started finding issues in very close or overlapping parts of the code.

This created a lot of conflicts and made the integration PR unstable and hard to review.

Because of that, I split and regrouped changes by file and responsibility. Some PRs were closed and replaced by cleaner versions to better isolate the fixes.

That’s why you may see several closed PRs: they were part of this restructuring process.

Next steps (planned)

  1. Continue improving compatibility so pdfparser can read the same PDFs as pdfinfo
  2. Add the test engine to validate against large datasets like pdf.js and veraPDF
  3. Explore improvements to extract specific data without parsing the whole document (performance)

Finally, thank you for maintaining this project.

I’m not able to sponsor right now since LibreSign still doesn’t have enough funding, but I hope to do that in the future. For now, contributing with code is my way to give back.

- narrow catch type to Exception in Document fallback flow
- narrow catch type to Exception in Page text extraction fallback
- narrow catch type to Exception in test dimension helper
- use Page::getDimensions() for CropBox/MediaBox resolution
- remove legacy getDetails-based fallback parsing
- drop obsolete helper and unused import
- call Page::getDimensions directly in page dimension assertions
- remove redundant extractPageDimensions helper
- Make config property always non-null after construction
- Align with RawDataParser pattern for consistency
- Remove defensive null coalescing operators where now unnecessary
- Improve type safety and eliminate defensive checks
@vitormattos
Author

@k00ni I completed the work proposed in this PR: improving the parser's resilience so it can read more PDFs from the PDF.js and veraPDF corpora.

Any further changes would greatly expand this PR's scope and move us away from the original goal. I prefer to reserve future improvements (e.g., performance optimizations or refactoring other components) for separate PRs.

@vitormattos vitormattos marked this pull request as ready for review May 5, 2026 01:06
Collaborator

k00ni commented May 5, 2026

@vitormattos I really appreciate the amount of time you invested in this library. Although this library is used by many projects, it's not being developed any further. Therefore, if you plan to invest further time in it, you should first check out https://github.com/PrinsFrank/pdfparser by @PrinsFrank. It is currently under active development again and might be a better "bet" for future development.

That said, I will review this PR as soon as I find the time (in May), unless you want to switch to https://github.com/PrinsFrank/pdfparser.

Collaborator

j0k3r commented May 5, 2026

Thanks for the work @vitormattos.

But as a "maintainer" (even if @k00ni does 99,9% of the job here) I find very complicated to merge such a huge PR which might going to introduce a lot of changes (in the good direction I'm sure) that we won't be able to maintain without taking a lot of time to invest in the near future.

As @k00ni said, this library should only receive small fixes, as it's not actively maintained. Maybe you should focus your time on the other lib.

@vitormattos
Author

Thanks a lot @k00ni and @j0k3r for the comment, I really appreciate the feedback.

I completely understand the concern about the size of this PR and the maintenance implications. I’m also a maintainer of some open source projects, so I’m very aware of how important it is to keep things reviewable and sustainable. At first glance it does look quite large, and I agree that merging something like this all at once would be difficult to review safely.

Just to give a bit more context on why it ended up this way:

A significant portion of the changes here are actually test fixtures (real-world PDFs from the PDF.js corpus), along with regression tests built on top of them. The goal was to ensure that every behavior change is backed by reproducible cases, especially for malformed PDFs, which are quite common in the real world.

In other words, the size is less about added complexity in the parser itself, and more about increasing confidence and coverage around real-world cases.

That said, I fully agree that the current format is not ideal for review or long-term maintenance.

I’m happy to restructure this in a way that better fits the project’s expectations. For example, we could split this into smaller, focused PRs (e.g. page tree traversal, xref recovery, dimensions API, test coverage), so each part can be reviewed independently and safely.

Regarding the suggestion about https://github.com/PrinsFrank/pdfparser, thanks for pointing that out, I’ll definitely take a look. For now, my main goal is still to improve this project, since there are still users relying on this package.

Happy to adapt to whichever direction you think is most appropriate 👍

Collaborator

k00ni commented May 6, 2026

@vitormattos for what it's worth: I am glad that you reached out to @PrinsFrank in #315. Judging by the way things are developing right now, would it be OK for you to close this PR, so you can focus your time and energy solely on https://github.com/PrinsFrank/pdfparser? Having one strong, reliable project is better for the community than a few half-baked ones.

It's unfortunate that I can't free up enough time right now to start reviewing, but I want to at least participate in the dialogue. As mentioned before, this library is just being kept alive on the latest PHP version, without further development (except for occasional community contributions). It would be a shame if all your effort went into a dead end. We don't have to rush here, but time for open source projects is valuable (especially yours) and I don't want to waste it here.

In case you want to stay with smalot/pdfparser for now, I will switch back to my plan to review your code. I'll get back to you about the option of having several smaller PRs instead of one large one.

