feat: add support for PDF file processing and extraction using pdf-parse #15
akramcodez wants to merge 1 commit into
Hey @akramcodez! The work is solid: the extractor, Buffer/URL detection, CLI dispatch, and tests are all in place and align with the issue. But you're shipping without documentation, and the PDF branch has an HTML injection vector that should be addressed before this merges.
Required changes
Without this, users only discover PDF support by reading source.
Either way, pick one and add a regression test with a PDF whose text contains
Risks
Nice to have (non-blocking)
Summary
Closes #8
This PR introduces native PDF → Markdown conversion support, allowing PDF documents to be processed through the existing get-md pipeline.
PDF content is extracted, normalized, and passed through the same markdown generation flow used by other content sources.
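To make the "normalized" step concrete, here is a minimal sketch of what text cleanup between extraction and markdown generation might look like. The function name `normalizePdfText` and the specific rules are assumptions for illustration; the PR's actual normalization is not shown here and may differ.

```typescript
// Hypothetical normalizer (not taken from the PR): cleans raw text as it
// typically comes out of a PDF text extractor before markdown generation.
function normalizePdfText(raw: string): string {
  return raw
    .replace(/\r\n?/g, "\n")          // unify line endings
    .replace(/(\w)-\n(\w)/g, "$1$2")  // rejoin words hyphenated across a line break
    .replace(/[ \t]+/g, " ")          // collapse runs of spaces and tabs
    .replace(/ ?\n ?/g, "\n")         // drop spaces around line breaks
    .replace(/\n{3,}/g, "\n\n")       // keep at most one blank line between paragraphs
    .trim();
}

const cleaned = normalizePdfText("Hello   world\r\n\r\n\r\nimple-\nmentation  \n");
console.log(JSON.stringify(cleaned));
```

Note that the hyphen rule is a heuristic: it will also merge a genuine compound such as "well-known" if the line happens to break at its hyphen.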
Changes
PDF Extraction
pdf-parse
src/extractors/pdf-extractor.ts
Binary Fetching Support
fetchUrlBuffer() to support binary downloads
Core Pipeline Integration
convertToMarkdown() to accept PDF input
CLI Support
The CLI now supports:
as well as PDF URLs.
PDF inputs are automatically detected and routed through the PDF extraction pipeline.
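As a rough illustration of automatic detection, the sketch below checks raw bytes for the `%PDF-` signature that PDF files start with, and checks URL strings for a `.pdf` path suffix. The helper names (`isPdfBytes`, `isPdfUrl`, `detectInputKind`) are hypothetical and are not the PR's actual functions; the real detection logic may use different rules.

```typescript
// Hypothetical helpers (names are illustrative, not taken from the PR).

// ASCII bytes for "%PDF-", the signature at the start of a PDF file.
const PDF_MAGIC: number[] = [0x25, 0x50, 0x44, 0x46, 0x2d];

// True when the byte sequence begins with the PDF signature.
function isPdfBytes(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  return PDF_MAGIC.every((b, i) => bytes[i] === b);
}

// True for http(s) URLs whose path ends in ".pdf", ignoring any
// query string or fragment. This is a suffix heuristic only.
function isPdfUrl(input: string): boolean {
  return /^https?:\/\/[^?#]*\.pdf(?:[?#].*)?$/i.test(input);
}

// Decide which pipeline an input should be routed to.
function detectInputKind(input: string | Uint8Array): "pdf" | "other" {
  if (typeof input === "string") {
    return isPdfUrl(input) ? "pdf" : "other";
  }
  return isPdfBytes(input) ? "pdf" : "other";
}

console.log(detectInputKind("https://example.com/docs/report.pdf?download=1"));
console.log(detectInputKind(new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31])));
console.log(detectInputKind("<html><body>hello</body></html>"));
```

A suffix check misses PDFs served from URLs without a `.pdf` extension, so a real implementation would likely also need to inspect the response's `Content-Type` header or the downloaded bytes.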
Tests
Added:
Validation