PDF, DOCX, and HTML text extraction and normalization for academic papers.
Built from cross-project experience across 8,000+ PDFs spanning psychology, medicine, economics, physics, and biology. Achieves 100% accuracy on 29 manually verified ground-truth passages (see BENCHMARKS.md).
Supports three input formats:
- PDF via
pdftotextdefault mode (withpdfplumberSMP recovery) - DOCX via
mammoth(DOCX → HTML → text, preserving Shift+Enter soft breaks) - HTML via
beautifulsoup4+lxml(block/inline-aware tree-walk)
All three formats feed into the same 15-step normalization pipeline and quality scoring.
# PDF only (pdfplumber)
pip install docpluck
# + DOCX support (adds mammoth)
pip install docpluck[docx]
# + HTML support (adds beautifulsoup4 + lxml)
pip install docpluck[html]
# Everything
pip install docpluck[all]System requirement for extract_pdf(): poppler-utils (provides the pdftotext binary). DOCX and HTML are pure Python — no system dependencies.
# Linux / WSL
apt-get install poppler-utils
# macOS
brew install poppler
# Windows
# Download from https://github.com/oschwartz10612/poppler-windows/releases
# Add bin/ to PATHInstall from GitHub (like R's remotes::install_github()):
pip install git+https://github.com/giladfeldman/docpluck.git
# Pinned version
pip install "docpluck>=1.3.0"from docpluck import (
extract_pdf, extract_docx, extract_html,
normalize_text, NormalizationLevel, compute_quality_score,
)
# 1. Extract text from any supported format
with open("paper.pdf", "rb") as f:
text, method = extract_pdf(f.read())
# Or from DOCX:
# with open("paper.docx", "rb") as f:
# text, method = extract_docx(f.read())
# Or from HTML:
# with open("paper.html", "rb") as f:
# text, method = extract_html(f.read())
print(f"Extracted {len(text):,} chars via {method}")
# 2. Normalize for statistical pattern matching
normalized, report = normalize_text(text, NormalizationLevel.academic)
print(f"Steps applied: {report.steps_applied}")
print(f"Changes made: {report.changes_made}")
# 3. Check quality
quality = compute_quality_score(normalized)
print(f"Quality: {quality['score']}/100 ({quality['confidence']})")
if quality["garbled"]:
print("Warning: text may be corrupted (column merge or encoding failure)")For consumers that need tables and figures as structured data — meta-analysis tooling, statistical-claim extraction, dashboards — call extract_pdf_structured():
from docpluck import extract_pdf_structured
with open("paper.pdf", "rb") as f:
result = extract_pdf_structured(f.read())
print(f"{result['page_count']} pages")
print(f"{len(result['tables'])} tables, {len(result['figures'])} figures")
for t in result["tables"]:
print(f" {t['label']} on page {t['page']} ({t['kind']}, confidence={t['confidence']})")
if t["kind"] == "structured":
print(f" {t['n_rows']} rows × {t['n_cols']} cols")# Default: caption-anchored fast path.
extract_pdf_structured(pdf_bytes)
# Thorough: scan every page for uncaptioned tables (slower).
extract_pdf_structured(pdf_bytes, thorough=True)
# Strip table/figure regions from `text` and replace with [Label: caption] markers.
extract_pdf_structured(pdf_bytes, table_text_mode="placeholder")docpluck extract paper.pdf --structured > out.json
docpluck extract paper.pdf --structured --thorough --text-mode placeholder
docpluck extract paper.pdf --structured --html-tables-to ./out/
extract_pdf() (the v1 text-only path) is unchanged. New consumers opt in to the structured path; existing consumers see no behavioral change.
See docs/superpowers/specs/2026-05-06-table-extraction-design.md for the full schema and design rationale.
Extract text from PDF bytes.
Parameters:
pdf_bytes— Raw PDF file content asbytes
Returns: (text, method) tuple where:
text— Extracted plain text. Checktext.startswith("ERROR:")for failure.method— Engine used:"pdftotext_default"— standard extraction (fast, ~400ms)"pdftotext_default+pdfplumber_recovery"— SMP fallback triggered (~9s), used when pdftotext outputsU+FFFDreplacement characters (common in Nature/Cell papers using Mathematical Italic fonts)
Requires: pdftotext binary on PATH.
with open("paper.pdf", "rb") as f:
text, method = extract_pdf(f.read())
if text.startswith("ERROR:"):
raise RuntimeError(f"Extraction failed: {text}")Extract text from DOCX (Word) file bytes via mammoth.
Parameters:
docx_bytes— Raw DOCX file content asbytes
Returns: (text, method) tuple where method is always "mammoth".
How it works: DOCX is converted to HTML first (preserving Shift+Enter soft breaks as <br> tags), then passed through the same block/inline-aware tree-walk used by extract_html(). This preserves paragraph structure, headings, lists, and soft breaks — which mammoth.extract_raw_text() would lose.
Requires: pip install docpluck[docx] (adds mammoth>=1.8.0).
Known limitations:
- OMML equations (Office Math) are silently dropped. Inline stats written as plain text survive; stats inside equation objects do not.
- Tracked changes: only deleted paragraphs are handled minimally.
- Memory: peak usage is ~3–5× file size.
from docpluck import extract_docx
with open("paper.docx", "rb") as f:
text, method = extract_docx(f.read())Extract text from HTML file bytes via beautifulsoup4 + lxml.
Parameters:
html_bytes— Raw HTML file content asbytes(UTF-8 decoded with error replacement)
Returns: (text, method) tuple where method is always "beautifulsoup".
How it works: Custom tree-walk that distinguishes block from inline elements:
- Block elements (
<p>,<div>,<h1>–<h6>,<li>,<td>, etc.) get newlines before and after. - Inline elements (
<a>,<span>,<em>, etc.) get spaces before and after — critical for preventing merged words like"ChanORCID"when adjacent inline elements have no whitespace between them. - Ignored tags (
<script>,<style>,<meta>,<svg>,<iframe>, etc.) are decomposed before walking.
Why not BeautifulSoup.get_text(): get_text() cannot distinguish block from inline elements — it applies a uniform separator everywhere, which either merges paragraphs or inserts spurious whitespace. The BeautifulSoup maintainer has confirmed this will not be fixed. A custom tree-walk is required.
Requires: pip install docpluck[html] (adds beautifulsoup4>=4.12.0 and lxml>=5.0.0).
from docpluck import extract_html, html_to_text
# From bytes
with open("article.html", "rb") as f:
text, method = extract_html(f.read())
# From an already-decoded string
text = html_to_text("<p>Hello <a>world</a></p>")Count pages in a PDF using byte pattern matching. No external binary required. PDF only — returns None is not applicable for DOCX/HTML.
with open("paper.pdf", "rb") as f:
content = f.read()
n = count_pages(content)
print(f"{n} pages")Apply the normalization pipeline at the specified level.
Parameters:
text— Raw extracted textlevel—NormalizationLevel.none|NormalizationLevel.standard|NormalizationLevel.academic
Returns: (normalized_text, report) tuple.
Normalization levels:
| Level | Steps | Use when |
|---|---|---|
none |
— | You want raw text, no modifications |
standard |
S0-S9 | General text processing (NLP, search indexing) |
academic |
S0-S9 + A1-A6 | Statistical pattern matching, meta-analysis |
from docpluck import normalize_text, NormalizationLevel
# Raw text
text, _ = normalize_text(raw, NormalizationLevel.none)
# General cleanup
text, report = normalize_text(raw, NormalizationLevel.standard)
# Full statistical repair (recommended for academic PDFs)
text, report = normalize_text(raw, NormalizationLevel.academic)
print(report.version) # "1.1.0"
print(report.steps_applied) # ["S0_smp_to_ascii", "S1_encoding_validation", ...]
print(report.changes_made) # {"ligatures_expanded": 27, "dashes_normalized": 3, ...}NormalizationReport fields:
| Field | Type | Description |
|---|---|---|
level |
str |
Level used: "none", "standard", or "academic" |
version |
str |
Pipeline version (e.g. "1.1.0") |
steps_applied |
list[str] |
Step codes in order (e.g. ["S1_encoding_validation", "S3_ligature_expansion"]) |
changes_made |
dict[str, int] |
Character-level change counts per step |
Compute extraction quality metrics.
Returns:
{
"score": 85, # 0–100 composite score
"common_word_ratio": 0.142, # fraction of first 2000 words that are common English words
"garbled": False, # True if common_word_ratio < 0.02 (column merge / encoding failure)
"confidence": "high", # "high" (≥80), "medium" (≥50), "low" (<50)
"details": {
"ligatures_remaining": 0, # count of ff/fi/fl ligature chars not yet expanded
"garbled_chars": 0, # count of U+FFFD replacement characters
"non_ascii_ratio": 0.031, # fraction of non-ASCII characters
}
}Interpreting the score:
| Score | Meaning |
|---|---|
| ≥ 80 | High quality — proceed with analysis |
| 50–79 | Medium — check manually if precision matters |
| < 50 | Low — likely garbled (column merge, encoding failure, or non-English) |
quality = compute_quality_score(text)
if quality["garbled"]:
print("Extraction likely failed — skipping this paper")
elif quality["score"] < 50:
print(f"Low quality ({quality['score']}) — verify manually")from docpluck import extract_pdf, normalize_text, NormalizationLevel, compute_quality_score
import re
def extract_stats(pdf_path: str) -> list[dict]:
with open(pdf_path, "rb") as f:
text, method = extract_pdf(f.read())
normalized, report = normalize_text(text, NormalizationLevel.academic)
quality = compute_quality_score(normalized)
if quality["garbled"]:
return [] # Skip garbled papers
# Now apply your statistical patterns to `normalized`
# e.g. find t-tests, F-tests, correlations, p-values
p_values = re.findall(r'p\s*[<=>]\s*\.?\d+', normalized)
return p_valuesfrom docpluck import extract_pdf, normalize_text, NormalizationLevel, compute_quality_score
from pathlib import Path
def process_corpus(pdf_dir: str) -> list[dict]:
results = []
for pdf_path in Path(pdf_dir).glob("**/*.pdf"):
with open(pdf_path, "rb") as f:
text, method = extract_pdf(f.read())
if text.startswith("ERROR:"):
results.append({"file": pdf_path.name, "error": text})
continue
normalized, report = normalize_text(text, NormalizationLevel.academic)
quality = compute_quality_score(normalized)
results.append({
"file": pdf_path.name,
"chars": len(normalized),
"method": method,
"quality": quality["score"],
"garbled": quality["garbled"],
})
return resultsimport httpx
from docpluck import extract_pdf, normalize_text, NormalizationLevel
def extract_from_url(url: str) -> str:
response = httpx.get(url, follow_redirects=True, timeout=30)
response.raise_for_status()
text, method = extract_pdf(response.content)
normalized, _ = normalize_text(text, NormalizationLevel.academic)
return normalized| Artifact | Example (before → after) |
|---|---|
| Null bytes | "Study\x00 results" → "Study results" |
| Ligatures | "significant" → "significant" |
| Unicode minus | "r = −0.73" → "r = -0.73" |
| Soft hyphen (invisible) | "signifi\u00ADcant" → "significant" |
| Non-breaking spaces | "p\u00A0<\u00A0.001" → "p < .001" |
| Full-width digits | "p = 0.001" → "p = 0.001" |
| Curly quotes | "the "effect"" → "the "effect"" |
| Hyphenation | "signi-\nficant" → "significant" |
| Repeated headers | Journal name repeated on every page → stripped |
| Page numbers | Standalone 12 on its own line → stripped |
| Artifact | Example (before → after) |
|---|---|
| Stat line breaks | "p =\n.001" → "p = .001" |
| Dropped decimals | "p = 484" → "p = .484" |
| European decimals | "p = 0,05" → "p = 0.05" |
| CI delimiters | "[0.81; 1.92]" → "[0.81, 1.92]" |
| Greek letters | "η² = 0.12" → "eta2 = 0.12" |
| Superscripts | "r² = 0.54" → "r2 = 0.54" |
| Footnote markers | "p < .001¹" → "p < .001" |
| Requirement | Version | Notes |
|---|---|---|
| Python | ≥ 3.10 | |
| pdfplumber | ≥ 0.11.0 | Core pip dependency — installed automatically |
| poppler-utils | any recent | System package — for extract_pdf() only |
| mammoth | ≥ 1.8.0 | Optional ([docx]) — pure Python, no system deps |
| beautifulsoup4 | ≥ 4.12.0 | Optional ([html]) — pure Python |
| lxml | ≥ 5.0.0 | Optional ([html]) — has prebuilt wheels |
The normalization and quality functions (normalize_text, compute_quality_score) have no system requirements — pure Python, no external binaries. DOCX and HTML extraction are pure Python too; only PDF needs a system binary.
MIT. See LICENSE.
If you use docpluck in research, please cite:
Feldman, G. (2026). docpluck: PDF text extraction and normalization for academic papers.
https://github.com/giladfeldman/docpluck