Skip to content

feat: add support for PDF file processing and extraction using pdf-parse#15

Open
akramcodez wants to merge 1 commit into
Nano-Collective:mainfrom
akramcodez:feat/pdf-markdown-support
Open

feat: add support for PDF file processing and extraction using pdf-parse#15
akramcodez wants to merge 1 commit into
Nano-Collective:mainfrom
akramcodez:feat/pdf-markdown-support

Conversation

@akramcodez

@akramcodez akramcodez commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes #8

This PR introduces native PDF → Markdown conversion support, allowing PDF documents to be processed through the existing get-md pipeline.

PDF content is extracted, normalized, and passed through the same markdown generation flow used by other content sources.

Changes

PDF Extraction

  • Added PDF text extraction support using pdf-parse
  • Introduced src/extractors/pdf-extractor.ts
  • Added handling for ESM module resolution

Binary Fetching Support

  • Added fetchUrlBuffer() to support binary downloads
  • Updated URL fetching utilities to handle PDF content without UTF-8 corruption

Core Pipeline Integration

  • Updated convertToMarkdown() to accept PDF input
  • Added PDF detection using file signatures (magic bytes)
  • Extracted PDF text is normalized before entering the existing markdown pipeline

CLI Support

The CLI now supports:

get-md handbook.pdf

as well as PDF URLs.

PDF inputs are automatically detected and routed through the PDF extraction pipeline.

Tests

Added:

  • Unit tests for PDF extraction
  • Integration tests covering PDF ingestion and conversion
  • Backward compatibility verification

Validation

  • All tests passing
  • Existing HTML workflows remain unchanged
  • PDF inputs supported through both API and CLI
  • No breaking API changes introduced

@will-lamerton

Copy link
Copy Markdown
Member

Hey @akramcodez!

The work is solid - extractor, Buffer/URL detection, CLI dispatch, and tests are all in place and align with the issue. But you're shipping without documentation and the PDF branch has an HTML injection vector that would be good to be addressed before this merges.

Required changes

  • Add documentation. Acceptance criterion feat: docs folder #4 from issue [Feature] Add PDF → Markdown Conversion Support #8 is unmet. README.md has zero PDF mentions and docs/ has zero PDF mentions. At minimum:
    • README.md Quick Start: add a get-md handbook.pdf example
    • docs/cli/index.md: document the new input behaviour (.pdf file/URL/stdin into Buffer, auto-routing)
    • docs/api/index.md: document the Buffer input shape on convertToMarkdown

Without this, users only discover PDF support by reading source.

  • Fix HTML injection in src/index.ts:84. You're interpolating raw extracted PDF text into a <p>…</p> template without escaping. A PDF containing <, >, or & produces malformed HTML that cheerio/Readability then re-parses. Two options:
    1. Escape the text before wrapping (&, <, > → entities)
    2. Skip the HTML round-trip and emit Markdown directly from the raw text

Either way, pick one and add a regression test with a PDF whose text contains <script> or similar to lock in the behaviour.

  • Update the JSDoc on convertToMarkdown. The signature changed from string to string | Buffer (src/index.ts:73-75), but the example block (src/index.ts:43-72) only shows string/URL inputs. Add a Buffer example so the new overload is discoverable.

  • Justify or remove the createRequire workaround in src/extractors/pdf-extractor.ts. pdf-parse is dual ESM/CJS and can be imported directly via import { PDFParse } from "pdf-parse". The dynamic require hides the dependency from bundlers and TypeScript. Either switch to a static ESM import, or add a comment explaining why createRequire is necessary.

Risks

  • %PDF magic-byte detection is fragile. Both src/index.ts:81 and src/cli.ts:308-309 route based on the first four bytes. A non-PDF file that happens to start with %PDF (PostScript file, a text snippet) would be silently routed to pdf-parse and throw with the generic "Failed to parse PDF" message. Minor — the error is surfaced — but worth tightening or improving the error.

  • Type-only @types/pdf-parse resolution. @types/pdf-parse@^1.1.5 and pdf-parse@^2.4.5 have major versions out of step. If PDFParse ends up untyped, add a local declaration.

  • Native dependency surface. pdf-parse pulls in @napi-rs/canvas-* as optional native dependencies. The lockfile diff shows them correctly scoped to optionalDependencies, but I'd want confirmation that CI installs and runs the new spec cleanly on all target platforms before merge. The PR body says "all tests passing" but doesn't link a CI run -please add that.

  • No documented rationale for picking pdf-parse. Worth a one-line note (in package.json or the extractor) on why this was chosen over pdfjs-dist or unpdf for future maintainers.

Nice to have (non-blocking)

  • The Buffer branch in convertToMarkdown widens the public API surface; consider whether fetchUrlBuffer belongs exported from src/utils/url-fetcher.ts (file name fits, but worth a thought).
  • Auto-detecting Content-Type: application/pdf in the API (like the CLI does at src/cli.ts:300) would make the API more forgiving, but that's a feature, not a bug fix.
  • PDF structure (headings, lists, tables) is discarded - line-naive split on blank lines then wrap in <p>. Worth a follow-up issue to use result.pages for layout-aware conversion before this gets baked in as the expected behaviour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Add PDF → Markdown Conversion Support

2 participants