Clean-room Rust implementation of defuddle, packaged for native applications, editor agents, browser capture flows, and Python consumers.
Raw pages are full of chrome. defuddle-rs keeps the useful part: extracted metadata, cleaned HTML, cleaned markdown, and a normalized result shape that can be reused across real software surfaces.
It ships the same parser core as:
- a Rust crate
- an MCP server
- a Python package
- a browser extension
- a WASM parser for browser-side UIs
Most pages are not mostly content. They are wrappers around content.
Navigation, sidebars, share widgets, footer junk, ad slots, hidden blocks, layout scaffolding, and repeated chrome all make the useful part harder to reuse.
defuddle-rs removes that overhead and keeps what downstream tools actually want:
- Article Body Extraction: isolate the main readable content node
- Metadata Extraction: title, author, date, site, description, image, language
- Markdown Conversion: convert cleaned content into portable markdown
- Native Fetch + Parse: fetch a URL directly or parse raw HTML
- MCP Surface: expose parser operations to local AI tooling over stdio or HTTP
- Python Surface: expose the same parser core to scripts and workflows through UniFFI
- Browser Surface: capture the active tab into an extension side panel and parse locally
- WASM Surface: run the parser in-browser for extension and app UIs
There is no hosted parsing service here. No headless browser. No Node runtime requirement for the core parser.
flowchart TB
subgraph C1["Client Surfaces"]
EXT["Browser Extension<br/>side panel capture UI"]
WEB["Browser App<br/>HTML and URL workspace"]
PY["Python Consumer<br/>UniFFI package"]
MCPCLIENT["MCP Client<br/>Copilot / Claude / Cursor"]
RUSTAPP["Rust Consumer<br/>native crate user"]
end
subgraph C2["Surface Adapters"]
WASM["WASM Adapter<br/>parseJson export"]
PYAPI["Python API<br/>src/python_api.rs"]
MCPSERVER["MCP Server<br/>src/mcp.rs + defuddle-mcp"]
CRATE["Crate API<br/>src/lib.rs"]
end
subgraph C3["Core Extraction Pipeline"]
META["Metadata Extraction<br/>src/metadata.rs"]
STANDARDIZE["DOM Standardization<br/>src/standardize.rs"]
REMOVE["Removal Pipeline<br/>src/removals.rs"]
SCORE["Content Scoring<br/>src/scoring.rs"]
MD["Markdown Conversion<br/>src/markdown.rs"]
end
subgraph C4["Input Sources"]
HTML["Raw HTML"]
FETCH["Native URL Fetch<br/>src/fetch.rs"]
end
subgraph C5["Outputs"]
RESULT["DefuddleResult<br/>metadata + html + markdown + word count"]
end
EXT --> WASM
WEB --> WASM
PY --> PYAPI
MCPCLIENT --> MCPSERVER
RUSTAPP --> CRATE
WASM --> CRATE
PYAPI --> CRATE
MCPSERVER --> CRATE
HTML --> CRATE
FETCH --> CRATE
CRATE --> META
CRATE --> STANDARDIZE
CRATE --> REMOVE
CRATE --> SCORE
CRATE --> MD
META --> RESULT
STANDARDIZE --> RESULT
REMOVE --> RESULT
SCORE --> RESULT
MD --> RESULT
style C1 fill:#e3f2fd,stroke:#1e88e5,stroke-width:2px,color:#111
style C2 fill:#ede7f6,stroke:#5e35b1,stroke-width:2px,color:#111
style C3 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#111
style C4 fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#111
style C5 fill:#fce4ec,stroke:#c2185b,stroke-width:2px,color:#111
| Color | Layer | Meaning |
|---|---|---|
| Blue | Client Surfaces | Where people or tools consume the parser |
| Purple | Surface Adapters | Language, transport, and runtime-specific wrappers |
| Green | Core Extraction Pipeline | The Rust parser internals |
| Orange | Input Sources | Raw HTML or native fetch entrypoints |
| Pink | Outputs | Structured extraction result returned to callers |
Parser Core:
- Mutable DOM Pipeline: uses `dom_query` to parse, mutate, score, and serialize pages
- Metadata Extraction: collects page-level metadata before destructive cleanup
- Removal Pipeline: exact selectors, partial selectors, hidden elements, and low-signal blocks
- Main Content Protection: ancestor guards prevent removal passes from disconnecting the chosen content node
- Markdown Output: preserves headings, tables, code blocks, lists, and links in a portable format
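As a rough illustration of how exact and partial selector passes can differ, the sketch below classifies an element by its class attribute. The selector lists, the `Removal` enum, and `should_remove` are hypothetical names invented for this example; the crate's actual rules live in src/removals.rs and may work differently.

```rust
/// Hypothetical selector lists, for illustration only.
const EXACT_CLASSES: &[&str] = &["sidebar", "share", "footer"];
const PARTIAL_PATTERNS: &[&str] = &["advert", "newsletter", "social"];

/// Why a node was flagged for removal.
#[derive(Debug, PartialEq)]
enum Removal {
    Hidden,
    Exact,
    Partial,
}

/// Decide whether an element should be removed, based on its class
/// attribute and whether it is hidden.
fn should_remove(class_attr: &str, hidden: bool) -> Option<Removal> {
    if hidden {
        return Some(Removal::Hidden);
    }
    // Exact: one whitespace-separated class token equals a listed name.
    if class_attr
        .split_whitespace()
        .any(|token| EXACT_CLASSES.contains(&token))
    {
        return Some(Removal::Exact);
    }
    // Partial: a listed pattern appears anywhere in the attribute value.
    let lowered = class_attr.to_lowercase();
    if PARTIAL_PATTERNS.iter().any(|p| lowered.contains(*p)) {
        return Some(Removal::Partial);
    }
    None
}
```

Partial matching is the riskier of the two passes, since a substring can hit far more elements than intended. That is one reason a guard around the chosen content node matters.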
Runtime Surfaces:
- Rust API: direct `parse` and `fetch_and_parse`
- MCP Server: parser operations available to local AI clients
- Python API: same parser core exported via UniFFI
- WASM API: browser-side parsing for extension and app interfaces
Browser Layer:
- Extension Side Panel: captures the active tab and parses it locally
- Browser App: direct HTML and URL parsing interface using the same WASM core
The parser follows the same general flow as upstream defuddle, but is implemented natively in Rust:
- Parse HTML into a mutable DOM
- Extract metadata and page-level signals
- Standardize the document shape
- Try site-specific extraction paths where available
- Identify the main content candidate
- Remove hidden and explicit junk nodes
- Remove partial-match junk nodes
- Score remaining blocks for content density
- Protect main-content ancestors during cleanup
- Convert the final cleaned content to markdown
That ancestor protection is one of the critical behavioral details. It prevents an over-broad removal selector from deleting the parent chain that still contains the actual article.
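The ancestor guard can be sketched with a toy arena tree. This is a minimal illustration under assumed names (`Node`, `protected_set`, `removable`, `example_tree`), not the crate's DOM types or its actual removal code. It only shows the idea of collecting the content node's parent chain and exempting it from a removal pass.

```rust
use std::collections::HashSet;

/// A toy DOM node stored in an arena; `parent` is an index into the arena.
struct Node {
    parent: Option<usize>,
    class: &'static str,
}

/// Collect the chosen content node and every ancestor above it.
fn protected_set(nodes: &[Node], content: usize) -> HashSet<usize> {
    let mut protected = HashSet::new();
    let mut current = Some(content);
    while let Some(idx) = current {
        protected.insert(idx);
        current = nodes[idx].parent;
    }
    protected
}

/// Return indices of nodes a removal pass may delete: those whose class
/// contains `pattern`, minus the protected ancestor chain.
fn removable(nodes: &[Node], content: usize, pattern: &str) -> Vec<usize> {
    let protected = protected_set(nodes, content);
    nodes
        .iter()
        .enumerate()
        .filter(|(idx, node)| node.class.contains(pattern) && !protected.contains(idx))
        .map(|(idx, _)| idx)
        .collect()
}

/// Example tree:
/// 0 body > 1 div.content-sidebar-layout > 2 article.post
///                                       > 3 aside.sidebar
fn example_tree() -> Vec<Node> {
    vec![
        Node { parent: None, class: "body" },
        Node { parent: Some(0), class: "content-sidebar-layout" },
        Node { parent: Some(1), class: "post" },
        Node { parent: Some(1), class: "sidebar" },
    ]
}
```

With the example tree, `removable(&example_tree(), 2, "sidebar")` yields only the aside at index 3. The layout wrapper at index 1 also matches the pattern, but it survives because it sits on the path from the article to the root.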

    use defuddle_rs::Defuddle;

    // Inside an async function that returns a Result:
    // fetch the page, then pass the raw HTML and its URL to the parser.
    let html = reqwest::get("https://example.com/article")
        .await?
        .text()
        .await?;

    let result = Defuddle::parse(&html, "https://example.com/article")?;
    println!("{}", result.title);
    println!("{}", result.content_markdown);
    println!("{}", result.word_count);

Build and test:

    cargo build --release
    cargo test

Build the MCP server:

    cargo build --release --bin defuddle-mcp

Run over stdio:

    target/release/defuddle-mcp stdio

Run over HTTP:

    target/release/defuddle-mcp http --bind 127.0.0.1:8080 --path /mcp

Current tools:
- `parse_html`
- `fetch_and_parse_url`
- `extract_metadata`
- `extract_markdown`

See MCP.md for config and transport details.
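For orientation, here is a sketch of the JSON-RPC message an MCP client might send to invoke one of these tools. The `tools/call` method with `name` and `arguments` follows the Model Context Protocol. The argument names `html` and `url` are assumptions based on the parser signatures in this README, and the helpers `json_string` and `tools_call_request` are hypothetical. MCP.md remains the authority on the real schema.

```rust
/// Escape a string so it can be embedded in a JSON document.
fn json_string(value: &str) -> String {
    let mut out = String::with_capacity(value.len() + 2);
    out.push('"');
    for ch in value.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Build a single-line `tools/call` request.
/// The `html` and `url` argument names are assumed, not confirmed.
fn tools_call_request(id: u64, tool: &str, html: &str, url: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{},"method":"tools/call","params":{{"name":{},"arguments":{{"html":{},"url":{}}}}}}}"#,
        id,
        json_string(tool),
        json_string(html),
        json_string(url)
    )
}
```

Because newlines in the HTML are escaped, the request stays on one line, which suits the newline-delimited framing used by the stdio transport.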
Generate and install the UniFFI package:

    cargo build --release
    cargo run --bin uniffi-bindgen -- generate \
      --library target/release/libdefuddle_rs.so \
      --language python \
      --out-dir bindings/python/defuddle
    cp target/release/libdefuddle_rs.so bindings/python/defuddle/
    uv venv /tmp/defuddle-py-uv
    UV_CACHE_DIR=/tmp/uv-cache uv pip install --python /tmp/defuddle-py-uv/bin/python -e bindings/python

Smoke test:

    /tmp/defuddle-py-uv/bin/python -c "from pathlib import Path; from defuddle import DefuddleParser; html = Path('tests/fixtures/example.html').read_text(); parser = DefuddleParser(); result = parser.extract_markdown(html, 'https://example.com'); print(result.title); print(result.word_count)"

Build the WASM bundle and extension assets:

    npm run build:wasm
    npm run build:extension

Load unpacked from:

    extension/

Extension-specific details are in extension/README.md.
Build the static app:

    npm run build

This emits `dist/`.
Rust API:
- `Defuddle::parse(html, url)`
- `Defuddle::fetch_and_parse(url)`

Python API:
- `DefuddleParser.parse_html(html, url)`
- `DefuddleParser.fetch_and_parse_url(url)`
- `DefuddleParser.extract_metadata(html, url)`
- `DefuddleParser.extract_markdown(html, url)`

DefuddleResult includes:
- `title`
- `author`
- `published`
- `site`
- `description`
- `image`
- `language`
- `content_html`
- `content_markdown`
- `word_count`
- `schema_org`
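
To make the result shape concrete, here is a hypothetical mirror of the fields listed above. The field names come from this README, but every type is an assumption (for example, which fields are optional and how `schema_org` is represented). The struct is named `DefuddleResultSketch` to keep it distinct from the crate's real `DefuddleResult`, and `summary_line` is an invented helper.

```rust
/// Illustrative mirror of the documented result fields.
/// All field types are assumptions, not the crate's definitions.
#[derive(Debug, Clone, Default)]
struct DefuddleResultSketch {
    title: String,
    author: Option<String>,
    published: Option<String>,
    site: Option<String>,
    description: Option<String>,
    image: Option<String>,
    language: Option<String>,
    content_html: String,
    content_markdown: String,
    word_count: usize,
    schema_org: Option<String>, // assumed here to be raw JSON-LD text
}

/// Render a one-line summary, falling back when the author is absent.
fn summary_line(result: &DefuddleResultSketch) -> String {
    let author = result.author.as_deref().unwrap_or("unknown author");
    format!("{} by {} ({} words)", result.title, author, result.word_count)
}
```

Downstream code would typically treat the metadata fields as best-effort and rely on the content fields and word count being present. Check the crate's own type before depending on any of these choices.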
defuddle-rs/
├── src/ # Rust parser core, MCP server, Python API
├── tests/ # crate, MCP, HTTPS, and Python integration tests
├── bindings/python/ # UniFFI Python package
├── extension/ # browser extension side panel
├── packages/defuddle-wasm/ # WASM wrapper package
├── app/ # browser app
├── demo/ # demo pipeline and assets
├── MCP.md # MCP setup and transport docs
├── PARITY.md # parity notes against upstream fixtures
└── README.md
This repo includes fixture-based parity validation against upstream defuddle, including large and awkward pages like Hacker News threads, MDN docs, Wikipedia, GitHub, and blog content.
See PARITY.md for the parity notes.
This is a clean-room Rust implementation. Upstream defuddle is used as a behavioral and architectural reference, but the code here is not a direct line-by-line translation of the TypeScript source.

