Text extraction helpers for Node.js. Supports PDF, Word, and PowerPoint files from ArrayBuffer or Node Buffer
inputs.
pnpm add @niicojs/text-extractor

import { readFile } from 'node:fs/promises';
import { extractFromPdf, extractFromPowerPoint, extractFromWord, extractText } from '@niicojs/text-extractor';
const pdfBuffer = await readFile('document.pdf');
const pdfText = await extractFromPdf(pdfBuffer);
const docxBuffer = await readFile('document.docx');
const wordText = await extractFromWord(docxBuffer);
const pptxBuffer = await readFile('presentation.pptx');
const powerPointText = await extractFromPowerPoint(pptxBuffer);
const text = await extractText(pdfBuffer, 'pdf');

extractText dispatches to the matching extractor based on its ext argument. For txt, it decodes the input with TextDecoder.
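The dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the library's source: `dispatchByExt`, the stub extractors, and the `docx` and `pptx` keys are hypothetical names chosen for the example. Only the `pdf` and `txt` values and the use of `TextDecoder` come from the text above.

```typescript
// Hypothetical sketch of extension-based dispatch; not the library's code.
type Extractor = (data: Uint8Array) => Promise<string>;

// Stand-in extractors that only tag their input, so the routing is observable.
const stub = (label: string): Extractor => async (data) => `${label}:${data.byteLength}`;

// A Map avoids accidental hits on inherited object keys such as "constructor".
const extractors = new Map<string, Extractor>([
  ['pdf', stub('pdf')],
  ['docx', stub('word')], // assumed extension key
  ['pptx', stub('powerpoint')], // assumed extension key
  // Plain text needs no parser: decode the bytes (UTF-8 by default).
  ['txt', async (data) => new TextDecoder().decode(data)],
]);

async function dispatchByExt(data: ArrayBuffer | Uint8Array, ext: string): Promise<string> {
  // Treating ".PDF" and "pdf" alike is an assumption of this sketch.
  const key = ext.replace(/^\./, '').toLowerCase();
  const extractor = extractors.get(key);
  if (!extractor) {
    throw new Error(`Unsupported extension: ${ext}`);
  }
  const bytes = data instanceof Uint8Array ? data : new Uint8Array(data);
  return extractor(bytes);
}
```

How the real extractText reacts to an unknown extension is not stated here; throwing is simply the choice made in this sketch.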
extractFromPdf extracts merged text from a PDF using unpdf.
extractFromWord extracts text from a Word document using @niicojs/word.
extractFromPowerPoint extracts text from PowerPoint slides in slide order. Text runs within the same slide are separated by newlines, and slides are separated by blank lines.
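To make the PowerPoint output layout concrete, here is a small sketch of the joining rule. It assumes each slide has already been reduced to an ordered list of text runs; `SlideRuns` and `joinSlideText` are hypothetical names for illustration and are not part of the package.

```typescript
// Assumed shape: one array of text runs per slide, already in slide order.
type SlideRuns = string[];

function joinSlideText(slides: SlideRuns[]): string {
  return slides
    .map((runs) => runs.join('\n')) // runs on the same slide: one per line
    .join('\n\n'); // slides: separated by a blank line
}

const deck: SlideRuns[] = [
  ['Title', 'Subtitle'],
  ['First point', 'Second point'],
];

console.log(joinSlideText(deck));
```

Whether the real extractor skips slides that contain no text is not specified above, so this sketch leaves empty slides as they are.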
Use Vite+ (vp) for project commands.
vp install
vp check
vp test
vp pack

vp check --fix: fixes formatting and lint issues.
vp test run tests/index.test.ts: runs one test file.
vp pack --watch: builds in watch mode.
The test suite covers the public extractors and verifies that sliced Buffer inputs pass only their visible bytes to the
underlying parsers.
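The sliced-input concern comes from how typed-array views work: a view created with subarray shares its parent's ArrayBuffer, so reading `.buffer` directly exposes bytes outside the view. Node's Buffer is a Uint8Array subclass and behaves the same way. The sketch below shows one way to hand a parser only the visible bytes; `visibleBytes` is a hypothetical helper, and the package may handle this differently.

```typescript
// Hypothetical helper: returns bytes backed by a buffer that holds exactly
// the visible range of the input.
function visibleBytes(input: ArrayBuffer | Uint8Array): Uint8Array {
  if (input instanceof Uint8Array) {
    // Constructing from a typed array copies only the elements in the view,
    // not the whole backing ArrayBuffer.
    return new Uint8Array(input);
  }
  // A bare ArrayBuffer has no hidden range, so wrapping it is enough.
  return new Uint8Array(input);
}

const backing = new Uint8Array([10, 20, 30, 40, 50]);
const view = backing.subarray(2, 4); // sees [30, 40], shares the 5-byte buffer

console.log(view.buffer.byteLength); // 5: the raw buffer includes hidden bytes
const tight = visibleBytes(view);
console.log(tight.buffer.byteLength); // 2: only the visible bytes remain
```

Copying costs one allocation per call, but it guarantees that a parser reading the underlying buffer never sees neighboring data.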