
Improve PDF.js corpus resilience with parser-native page recovery, dimensions API, and regression coverage#795

Open
vitormattos wants to merge 110 commits into smalot:master from vitormattos:fix/getpages-deduplicate-first-pr

Conversation


@vitormattos vitormattos commented Apr 24, 2026

Objective

Strengthen parser reliability on malformed real-world PDFs and validate outcomes directly by parser readability (not by external tool comparison).

What this PR changes

  • Fixes page-tree traversal and recovery paths to handle malformed/cyclic/repeated Kids references more safely.
  • Improves page box resolution and fallback behavior for malformed or missing page boxes.
  • Adds robust handling for readable-encrypted edge cases (files marked as encrypted but still readable).
  • Expands regression coverage with additional PDF.js fixtures, including broken xref/stream/encryption-marked cases.
  • Introduces a native page dimensions API:
    • Page::getDimensions($boxName = 'CropBox')
    • Document::getPagesDimensions($boxName = 'CropBox')
  • Simplifies usage documentation to use the new native dimensions API instead of workaround logic.
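A minimal sketch of how the new API could be used. Only the two method signatures above come from this PR; the fixture path and the assumption that `getDimensions()` returns a `[width, height]` pair are mine and may not match the actual implementation:

```php
<?php
// Hypothetical usage of the dimensions API introduced in this PR.
// Only the method signatures are taken from the PR description; the
// fixture path and the [width, height] return shape are assumptions.

require 'vendor/autoload.php';

use Smalot\PdfParser\Parser;

$parser   = new Parser();
$document = $parser->parseFile('samples/example.pdf'); // hypothetical path

// Whole document at once; defaults to 'CropBox' per the signature above.
$allDimensions = $document->getPagesDimensions();

// Per page, requesting a specific box instead of the default:
foreach ($document->getPages() as $i => $page) {
    [$width, $height] = $page->getDimensions('MediaBox'); // assumed return shape
    printf("Page %d: %.2f x %.2f pt\n", $i + 1, $width, $height);
}
```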

Validation approach

  • Parser-centric validation over the Mozilla PDF.js corpus.
  • Each file is classified only by parser readability:
    • Readable (plain)
    • Readable (encryption-marked)
    • Unreadable by parser
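The three-bucket classification above could be sketched roughly as follows. The buckets themselves are from this PR; the detection heuristic (searching the raw bytes for `/Encrypt`) and the corpus path are my assumptions, not the harness actually used:

```php
<?php
// Sketch of a parser-readability classifier over a PDF corpus.
// Buckets match the PR description; the /Encrypt substring check and
// the glob path are assumptions of mine.

require 'vendor/autoload.php';

use Smalot\PdfParser\Parser;

function classify(string $path): string
{
    try {
        $document = (new Parser())->parseFile($path);
        $document->getPages(); // force page-tree traversal
        $marked = str_contains((string) file_get_contents($path), '/Encrypt');

        return $marked ? 'readable-encryption-marked' : 'readable-plain';
    } catch (\Exception $e) {
        return 'unreadable';
    }
}

$counts = ['readable-plain' => 0, 'readable-encryption-marked' => 0, 'unreadable' => 0];
foreach (glob('pdf.js/test/pdfs/*.pdf') as $file) { // assumed corpus location
    ++$counts[classify($file)];
}
print_r($counts);
```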

Results (PDF.js corpus: 930 files / 929 unique hashes)

| Status | Baseline (master) | This branch | Δ |
| --- | --- | --- | --- |
| ✅ Readable (plain) | 870 | 922 | +52 |
| 🔐 Readable (encryption-marked) | 2 | 6 | +4 |
| ❌ Unreadable by parser | 58 | 2 | -56 |
| **Total** | 930 | 930 | - |
| parser_success_rate | 93.7634% | 99.7849% | +6.0215 pp |

Additional validation

  • Full PHPUnit suite remains green: 273 tests, 1649 assertions.
  • VeraPDF corpus baseline validation remains stable.

Notes

  • This PR does not claim full decryption support for explicit user-password protected PDFs.
  • It does improve handling of encryption-marked files that are readable in practice.

@vitormattos vitormattos force-pushed the fix/getpages-deduplicate-first-pr branch from 4f628d3 to bbbd1d3 on April 24, 2026 at 01:55
Collaborator

k00ni commented Apr 24, 2026

@vitormattos Thank you for all these PRs 🚀

I'm really busy right now and don't have much time, but I'll try to make time for it over the next few weeks. What I've seen so far looks very solid. If @j0k3r or others from the community wanna step in, I'd be happy to assist.

@vitormattos vitormattos deleted the fix/getpages-deduplicate-first-pr branch April 24, 2026 13:46
@vitormattos vitormattos restored the fix/getpages-deduplicate-first-pr branch April 24, 2026 14:19
@vitormattos vitormattos reopened this Apr 24, 2026
vitormattos added a commit to vitormattos/pdfparser that referenced this pull request Apr 27, 2026
Collaborator

k00ni commented Apr 28, 2026

@vitormattos Several PRs were closed very recently and new ones replaced them (to some extent). It seems you are currently doing a deep dive into the library. I only have so much time to look into these. Which one(s) should I prioritize?

@vitormattos
Author

@k00ni

Short answer:

Yes, there are two possible ways to review: go through the smaller focused PRs individually, or review #809, since #809 is an integration PR that combines everything I've been working on.


More context

I used pdfparser in LibreSign to extract page dimensions and generate stamped PDFs with footers. Over time, we started receiving bug reports related to parsing inconsistencies.

As a workaround, I switched to pdfinfo (from Poppler), which works well but requires system-level dependencies. This makes setup harder in some environments, especially where installing external tools is restricted.

After about a year using this workaround, I decided to come back to pdfparser and improve it so it can behave closer to pdfinfo for our use case.

To do that, I built a test engine that compares outputs between pdfparser and pdfinfo using large PDF datasets. Then I started fixing cases where pdfparser behaves differently (mainly page count and dimensions).
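The comparison idea described above could be sketched like this: run Poppler's `pdfinfo` and smalot/pdfparser over the same file and flag page-count disagreements. The helper names and corpus path are mine, not taken from the actual test engine:

```php
<?php
// Rough sketch of a pdfparser-vs-pdfinfo comparison harness.
// Helper names and the corpus path are hypothetical; only the general
// approach (compare outputs over a dataset) is from the PR discussion.

require 'vendor/autoload.php';

use Smalot\PdfParser\Parser;

function pdfinfoPageCount(string $path): ?int
{
    $out = shell_exec('pdfinfo ' . escapeshellarg($path));

    return preg_match('/^Pages:\s+(\d+)/m', (string) $out, $m) ? (int) $m[1] : null;
}

function parserPageCount(string $path): ?int
{
    try {
        return \count((new Parser())->parseFile($path)->getPages());
    } catch (\Exception $e) {
        return null; // unreadable by parser
    }
}

foreach (glob('corpus/*.pdf') as $file) { // hypothetical dataset location
    $expected = pdfinfoPageCount($file);
    $actual   = parserPageCount($file);
    if ($expected !== $actual) {
        printf("MISMATCH %s: pdfinfo=%s parser=%s\n", $file,
            var_export($expected, true), var_export($actual, true));
    }
}
```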

About the PR structure

Initially, I tried a single integration PR, but it quickly became hard to maintain due to conflicts and overlapping changes.

So I reorganized the work into smaller PRs grouped by related fixes, making them easier to review and safer to merge.

More details about the PR reorganization (optional to read)

At first, I had one large integration branch, but as fixes evolved, I started finding issues in very close or overlapping parts of the code.

This created a lot of conflicts and made the integration PR unstable and hard to review.

Because of that, I split and regrouped changes by file and responsibility. Some PRs were closed and replaced by cleaner versions to better isolate the fixes.

That’s why you may see several closed PRs: they were part of this restructuring process.

Next steps (planned)

  1. Continue improving compatibility so pdfparser can read the same PDFs as pdfinfo
  2. Add the test engine to validate against large datasets like pdf.js and veraPDF
  3. Explore improvements to extract specific data without parsing the whole document (performance)

Finally, thank you for maintaining this project.

I’m not able to sponsor right now since LibreSign still doesn’t have enough funding, but I hope to do that in the future. For now, contributing with code is my way to give back.

- narrow catch type to Exception in Document fallback flow
- narrow catch type to Exception in Page text extraction fallback
- narrow catch type to Exception in test dimension helper
- use Page::getDimensions() for CropBox/MediaBox resolution
- remove legacy getDetails-based fallback parsing
- drop obsolete helper and unused import
- call Page::getDimensions directly in page dimension assertions
- remove redundant extractPageDimensions helper
- Make config property always non-null after construction
- Align with RawDataParser pattern for consistency
- Remove defensive null coalescing operators where now unnecessary
- Improve type safety and eliminate defensive checks
@vitormattos
Author

@k00ni I completed the work proposed in this PR: improving the parser's resilience so it can read more PDFs from the PDF.js and veraPDF corpora.

Any further changes would greatly expand this PR's scope and move us away from the original goal. I prefer to reserve future improvements (e.g., performance optimizations or refactoring other components) for separate PRs.

@vitormattos vitormattos marked this pull request as ready for review May 5, 2026 01:06
Collaborator

k00ni commented May 5, 2026

@vitormattos I really appreciate the amount of time you invested in this library. Although this library is used by many projects, it's not being developed any further. Therefore, if you plan to invest further time in it, you should first check out https://github.com/PrinsFrank/pdfparser by @PrinsFrank. It is currently under active development again and might be a better "bet" for future development.

That said, I will review this PR as soon as I find the time (in May), unless you want to switch to https://github.com/PrinsFrank/pdfparser.

Collaborator

j0k3r commented May 5, 2026

Thanks for the work @vitormattos.

But as a "maintainer" (even if @k00ni does 99,9% of the job here) I find very complicated to merge such a huge PR which might going to introduce a lot of changes (in the good direction I'm sure) that we won't be able to maintain without taking a lot of time to invest in the near future.

As @k00ni said, this library should only receive small fixes, as it's not actively maintained. Maybe you should focus your time on the other lib.

@vitormattos
Author

Thanks a lot @k00ni and @j0k3r for the comment, I really appreciate the feedback.

I completely understand the concern about the size of this PR and the maintenance implications. I’m also a maintainer of some open source projects, so I’m very aware of how important it is to keep things reviewable and sustainable. At first glance it does look quite large, and I agree that merging something like this all at once would be difficult to review safely.

Just to give a bit more context on why it ended up this way:

A significant portion of the changes here are actually test fixtures (real-world PDFs from the PDF.js corpus), along with regression tests built on top of them. The goal was to ensure that every behavior change is backed by reproducible cases, especially for malformed PDFs, which are quite common in the real world.

In other words, the size is less about added complexity in the parser itself, and more about increasing confidence and coverage around real-world cases.

That said, I fully agree that the current format is not ideal for review or long-term maintenance.

I’m happy to restructure this in a way that better fits the project’s expectations. For example, we could split this into smaller, focused PRs (e.g. page tree traversal, xref recovery, dimensions API, test coverage), so each part can be reviewed independently and safely.

Regarding the suggestion about https://github.com/PrinsFrank/pdfparser, thanks for pointing that out, I’ll definitely take a look. For now, my main goal is still to improve this project, since there are still users relying on this package.

Happy to adapt to whichever direction you think is most appropriate 👍

Collaborator

k00ni commented May 6, 2026

@vitormattos for what it's worth: I am glad that you reached out to @PrinsFrank in #315. Judging by the way things are developing right now, would it be OK for you to close this PR, so you can focus your time and energy solely on https://github.com/PrinsFrank/pdfparser? Having one strong, reliable project is better for the community than a few half-baked ones.

It's unfortunate that I can't free up enough time right now to start reviewing, but I want to at least participate in the dialogue. As mentioned before, this library is just being kept alive on the latest PHP version, without further development (except for occasional community contributions). It would be a shame if all your effort went into a dead end. We don't have to rush here, but time for open source projects is valuable (especially yours) and I don't want to waste it here.

In case you want to stay with smalot/pdfparser for now, I will switch back to my plan to review your code. I'll get back to you about the option of having several smaller PRs instead of one large one.

