feat: add support for PDF file processing and extraction using pdf-parse by akramcodez · Pull Request #15 · Nano-Collective/get-md

akramcodez · 2026-06-23T09:14:43Z

Summary

Closes #8

This PR introduces native PDF → Markdown conversion support, allowing PDF documents to be processed through the existing get-md pipeline.

PDF content is extracted, normalized, and passed through the same markdown generation flow used by other content sources.

Changes

PDF Extraction

Added PDF text extraction support using pdf-parse
Introduced src/extractors/pdf-extractor.ts
Added handling for ESM module resolution

Binary Fetching Support

Added fetchUrlBuffer() to support binary downloads
Updated URL fetching utilities to handle PDF content without UTF-8 corruption

Core Pipeline Integration

Updated convertToMarkdown() to accept PDF input
Added PDF detection using file signatures (magic bytes)
Extracted PDF text is normalized before entering the existing markdown pipeline

CLI Support

The CLI now supports:

get-md handbook.pdf

as well as PDF URLs.

PDF inputs are automatically detected and routed through the PDF extraction pipeline.

Tests

Added:

Unit tests for PDF extraction
Integration tests covering PDF ingestion and conversion
Backward compatibility verification

Validation

All tests passing
Existing HTML workflows remain unchanged
PDF inputs supported through both API and CLI
No breaking API changes introduced

will-lamerton · 2026-06-25T21:10:25Z

Hey @akramcodez!

The work is solid - extractor, Buffer/URL detection, CLI dispatch, and tests are all in place and align with the issue. But you're shipping without documentation and the PDF branch has an HTML injection vector that would be good to be addressed before this merges.

Required changes

Add documentation. Acceptance criterion feat: docs folder #4 from issue [Feature] Add PDF → Markdown Conversion Support #8 is unmet. README.md has zero PDF mentions and docs/ has zero PDF mentions. At minimum:
- README.md Quick Start: add a get-md handbook.pdf example
- docs/cli/index.md: document the new input behaviour (.pdf file/URL/stdin into Buffer, auto-routing)
- docs/api/index.md: document the Buffer input shape on convertToMarkdown

Without this, users only discover PDF support by reading source.

Fix HTML injection in src/index.ts:84. You're interpolating raw extracted PDF text into a <p>…</p> template without escaping. A PDF containing <, >, or & produces malformed HTML that cheerio/Readability then re-parses. Two options:
1. Escape the text before wrapping (&, <, > → entities)
2. Skip the HTML round-trip and emit Markdown directly from the raw text

Either way, pick one and add a regression test with a PDF whose text contains <script> or similar to lock in the behaviour.

Update the JSDoc on convertToMarkdown. The signature changed from string to string | Buffer (src/index.ts:73-75), but the example block (src/index.ts:43-72) only shows string/URL inputs. Add a Buffer example so the new overload is discoverable.
Justify or remove the createRequire workaround in src/extractors/pdf-extractor.ts. pdf-parse is dual ESM/CJS and can be imported directly via import { PDFParse } from "pdf-parse". The dynamic require hides the dependency from bundlers and TypeScript. Either switch to a static ESM import, or add a comment explaining why createRequire is necessary.

Risks

%PDF magic-byte detection is fragile. Both src/index.ts:81 and src/cli.ts:308-309 route based on the first four bytes. A non-PDF file that happens to start with %PDF (PostScript file, a text snippet) would be silently routed to pdf-parse and throw with the generic "Failed to parse PDF" message. Minor — the error is surfaced — but worth tightening or improving the error.
Type-only @types/pdf-parse resolution. @types/pdf-parse@^1.1.5 and pdf-parse@^2.4.5 have major versions out of step. If PDFParse ends up untyped, add a local declaration.
Native dependency surface. pdf-parse pulls in @napi-rs/canvas-* as optional native dependencies. The lockfile diff shows them correctly scoped to optionalDependencies, but I'd want confirmation that CI installs and runs the new spec cleanly on all target platforms before merge. The PR body says "all tests passing" but doesn't link a CI run -please add that.
No documented rationale for picking pdf-parse. Worth a one-line note (in package.json or the extractor) on why this was chosen over pdfjs-dist or unpdf for future maintainers.

Nice to have (non-blocking)

The Buffer branch in convertToMarkdown widens the public API surface; consider whether fetchUrlBuffer belongs exported from src/utils/url-fetcher.ts (file name fits, but worth a thought).
Auto-detecting Content-Type: application/pdf in the API (like the CLI does at src/cli.ts:300) would make the API more forgiving, but that's a feature, not a bug fix.
PDF structure (headings, lists, tables) is discarded - line-naive split on blank lines then wrap in <p>. Worth a follow-up issue to use result.pages for layout-aware conversion before this gets baked in as the expected behaviour.

feat: add support for PDF file processing and extraction using pdf-parse

ea1fe56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

feat: add support for PDF file processing and extraction using pdf-parse#15

feat: add support for PDF file processing and extraction using pdf-parse#15
akramcodez wants to merge 1 commit into
Nano-Collective:mainfrom
akramcodez:feat/pdf-markdown-support

akramcodez commented Jun 23, 2026 •

edited

Loading

Uh oh!

will-lamerton commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Uh oh!

Conversation

akramcodez commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

PDF Extraction

Binary Fetching Support

Core Pipeline Integration

CLI Support

Tests

Validation

Uh oh!

will-lamerton commented Jun 25, 2026

Required changes

Risks

Nice to have (non-blocking)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

akramcodez commented Jun 23, 2026 •

edited

Loading