defuddle-rs

Rust + dom_query + reqwest + UniFFI + RMCP + wasm-bindgen

Clean-room Rust implementation of defuddle, packaged for native applications, editor agents, browser capture flows, and Python consumers.

Click the thumbnail above to watch the demo video on YouTube.

Raw pages are full of chrome. defuddle-rs keeps the useful part: extracted metadata, cleaned HTML, cleaned markdown, and a normalized result shape that can be reused across real software surfaces.

It ships the same parser core as:

a Rust crate
an MCP server
a Python package
a browser extension
a WASM parser for browser-side UIs

Overview

Most pages are not mostly content. They are wrappers around content.

Navigation, sidebars, share widgets, footer junk, ad slots, hidden blocks, layout scaffolding, and repeated chrome all make the useful part harder to reuse.

defuddle-rs removes that overhead and keeps what downstream tools actually want:

Article Body Extraction: isolate the main readable content node
Metadata Extraction: title, author, date, site, description, image, language
Markdown Conversion: convert cleaned content into portable markdown
Native Fetch + Parse: fetch a URL directly or parse raw HTML
MCP Surface: expose parser operations to local AI tooling over stdio or HTTP
Python Surface: expose the same parser core to scripts and workflows through UniFFI
Browser Surface: capture the active tab into an extension side panel and parse locally
WASM Surface: run the parser in-browser for extension and app UIs

There is no hosted parsing service here. No headless browser. No Node runtime requirement for the core parser.

Architecture Overview

flowchart TB
    subgraph C1["Client Surfaces"]
        EXT["Browser Extension<br/>side panel capture UI"]
        WEB["Browser App<br/>HTML and URL workspace"]
        PY["Python Consumer<br/>UniFFI package"]
        MCPCLIENT["MCP Client<br/>Copilot / Claude / Cursor"]
        RUSTAPP["Rust Consumer<br/>native crate user"]
    end

    subgraph C2["Surface Adapters"]
        WASM["WASM Adapter<br/>parseJson export"]
        PYAPI["Python API<br/>src/python_api.rs"]
        MCPSERVER["MCP Server<br/>src/mcp.rs + defuddle-mcp"]
        CRATE["Crate API<br/>src/lib.rs"]
    end

    subgraph C3["Core Extraction Pipeline"]
        META["Metadata Extraction<br/>src/metadata.rs"]
        STANDARDIZE["DOM Standardization<br/>src/standardize.rs"]
        REMOVE["Removal Pipeline<br/>src/removals.rs"]
        SCORE["Content Scoring<br/>src/scoring.rs"]
        MD["Markdown Conversion<br/>src/markdown.rs"]
    end

    subgraph C4["Input Sources"]
        HTML["Raw HTML"]
        FETCH["Native URL Fetch<br/>src/fetch.rs"]
    end

    subgraph C5["Outputs"]
        RESULT["DefuddleResult<br/>metadata + html + markdown + word count"]
    end

    EXT --> WASM
    WEB --> WASM
    PY --> PYAPI
    MCPCLIENT --> MCPSERVER
    RUSTAPP --> CRATE

    WASM --> CRATE
    PYAPI --> CRATE
    MCPSERVER --> CRATE

    HTML --> CRATE
    FETCH --> CRATE

    CRATE --> META
    CRATE --> STANDARDIZE
    CRATE --> REMOVE
    CRATE --> SCORE
    CRATE --> MD

    META --> RESULT
    STANDARDIZE --> RESULT
    REMOVE --> RESULT
    SCORE --> RESULT
    MD --> RESULT

    style C1 fill:#e3f2fd,stroke:#1e88e5,stroke-width:2px,color:#111
    style C2 fill:#ede7f6,stroke:#5e35b1,stroke-width:2px,color:#111
    style C3 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#111
    style C4 fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#111
    style C5 fill:#fce4ec,stroke:#c2185b,stroke-width:2px,color:#111

Diagram Legend

| Color | Layer | Meaning |
| --- | --- | --- |
| Blue | Client Surfaces | Where people or tools consume the parser |
| Purple | Surface Adapters | Language, transport, and runtime-specific wrappers |
| Green | Core Extraction Pipeline | The Rust parser internals |
| Orange | Input Sources | Raw HTML or native fetch entrypoints |
| Pink | Outputs | Structured extraction result returned to callers |

Key Components

Parser Core:

Mutable DOM Pipeline: uses dom_query to parse, mutate, score, and serialize pages
Metadata Extraction: collects page-level metadata before destructive cleanup
Removal Pipeline: exact selectors, partial selectors, hidden elements, and low-signal blocks
Main Content Protection: ancestor guards prevent removal passes from disconnecting the chosen content node
Markdown Output: preserves headings, tables, code blocks, lists, and links in a portable format
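The difference between exact and partial selectors in the removal pipeline can be illustrated with a minimal, std-only sketch. Everything here is hypothetical: the `Element` struct, the pattern lists, and the function names are illustration only, and the crate's real selector lists and dom_query-based matching are not shown in this README.

```rust
/// Hypothetical, simplified view of an element: only the attributes
/// a removal pass would typically inspect.
struct Element<'a> {
    tag: &'a str,
    id: &'a str,
    classes: Vec<&'a str>,
    hidden: bool,
}

// Illustrative patterns only (assumption); not the crate's actual lists.
const EXACT_TAGS: &[&str] = &["nav", "footer", "aside"];
const PARTIAL_PATTERNS: &[&str] = &["sidebar", "share", "advert", "newsletter"];

/// Exact pass: the tag name must equal a known junk tag.
fn matches_exact(el: &Element) -> bool {
    EXACT_TAGS.contains(&el.tag)
}

/// Partial pass: the id or any class contains a junk substring,
/// compared case-insensitively.
fn matches_partial(el: &Element) -> bool {
    let id = el.id.to_ascii_lowercase();
    PARTIAL_PATTERNS.iter().any(|&p| {
        id.contains(p)
            || el
                .classes
                .iter()
                .any(|c| c.to_ascii_lowercase().contains(p))
    })
}

/// An element is a removal candidate if it is hidden or matches either pass.
fn should_remove(el: &Element) -> bool {
    el.hidden || matches_exact(el) || matches_partial(el)
}
```

Substring matching is deliberately broad: a pattern like `share` also hits a class such as `shared-content`, which is why partial passes need the main-content protection listed above.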

Runtime Surfaces:

Rust API: direct parse and fetch_and_parse
MCP Server: parser operations available to local AI clients
Python API: same parser core exported via UniFFI
WASM API: browser-side parsing for extension and app interfaces

Browser Layer:

Extension Side Panel: captures the active tab and parses it locally
Browser App: direct HTML and URL parsing interface using the same WASM core

Extraction Pipeline

The parser follows the same general flow as upstream defuddle, implemented natively in Rust:

1. Parse HTML into a mutable DOM
2. Extract metadata and page-level signals
3. Standardize the document shape
4. Try site-specific extraction paths where available
5. Identify the main content candidate
6. Remove hidden and explicit junk nodes
7. Remove partial-match junk nodes
8. Score remaining blocks for content density
9. Protect main-content ancestors during cleanup
10. Convert the final cleaned content to markdown

That ancestor protection is one of the critical behavioral details. It prevents an over-broad removal selector from deleting the parent chain that still contains the actual article.
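The guard described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the crate's code: it uses a hypothetical arena-style tree (`Node` with a parent index) instead of dom_query nodes, and the function names are invented for the example.

```rust
use std::collections::HashSet;

/// Hypothetical arena-style tree node: a label plus the index of its parent.
struct Node {
    name: &'static str,
    parent: Option<usize>,
}

/// Collect the main-content node and every ancestor up to the root.
fn protected_chain(nodes: &[Node], main: usize) -> HashSet<usize> {
    let mut chain = HashSet::new();
    let mut current = Some(main);
    while let Some(idx) = current {
        chain.insert(idx);
        current = nodes[idx].parent;
    }
    chain
}

/// Filter removal candidates so that nothing on the protected chain is
/// dropped. Returns the names of the nodes that are still safe to remove.
fn safe_removals(nodes: &[Node], candidates: &[usize], main: usize) -> Vec<&'static str> {
    let guard = protected_chain(nodes, main);
    candidates
        .iter()
        .copied()
        .filter(|idx| !guard.contains(idx))
        .map(|idx| nodes[idx].name)
        .collect()
}
```

In a tree where a wrapper such as `div.sidebar-layout` contains both the article and a real sidebar, a partial match on `sidebar` would flag both the wrapper and the sidebar. With the guard in place only the sidebar is removed, because the wrapper sits on the article's ancestor chain.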

Quick Start

1. Rust Crate

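use defuddle_rs::Defuddle;

// Assumes the tokio runtime, which reqwest's async client requires, and that
// the crate's error type implements std::error::Error.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch the raw page, then pass the HTML and its URL to the parser.
    let html = reqwest::get("https://example.com/article")
        .await?
        .text()
        .await?;

    let result = Defuddle::parse(&html, "https://example.com/article")?;

    println!("{}", result.title);
    println!("{}", result.content_markdown);
    println!("{}", result.word_count);
    Ok(())
}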

Build and test:

cargo build --release
cargo test

2. MCP Server

Build:

cargo build --release --bin defuddle-mcp

Run over stdio:

target/release/defuddle-mcp stdio

Run over HTTP:

target/release/defuddle-mcp http --bind 127.0.0.1:8080 --path /mcp

Current tools:

parse_html
fetch_and_parse_url
extract_metadata
extract_markdown

See MCP.md for config and transport details.
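For orientation, a tool invocation in MCP is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request for `parse_html` using only the standard library. The argument names `html` and `url` are assumptions modeled on the Python API signature and may not match the server's actual schema. A real session also begins with an `initialize` exchange before any tool call.

```rust
/// Escape a string for embedding inside a JSON string literal.
fn json_escape(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a JSON-RPC 2.0 `tools/call` request for the `parse_html` tool.
/// The `html` and `url` argument names are assumptions, not confirmed here.
fn parse_html_request(id: u64, html: &str, url: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{},"method":"tools/call","params":{{"name":"parse_html","arguments":{{"html":"{}","url":"{}"}}}}}}"#,
        id,
        json_escape(html),
        json_escape(url)
    )
}
```

Over the stdio transport each message is written as a single line of JSON, so the escaped form without raw newlines matters.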

3. Python Bindings

Generate and install the UniFFI package (the paths below assume a Linux build, which produces libdefuddle_rs.so):

cargo build --release
cargo run --bin uniffi-bindgen -- generate \
  --library target/release/libdefuddle_rs.so \
  --language python \
  --out-dir bindings/python/defuddle
cp target/release/libdefuddle_rs.so bindings/python/defuddle/

uv venv /tmp/defuddle-py-uv
UV_CACHE_DIR=/tmp/uv-cache uv pip install --python /tmp/defuddle-py-uv/bin/python -e bindings/python
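The shared library's file name depends on the build platform. As a small sketch of how a helper script could compute it, the standard library exposes the platform's prefix and suffix (`lib` and `.so` on Linux, `lib` and `.dylib` on macOS, an empty prefix and `.dll` on Windows). The function names below are hypothetical, and the path assumes cargo's default target directory.

```rust
use std::env::consts::{DLL_PREFIX, DLL_SUFFIX};
use std::path::{Path, PathBuf};

/// Platform-specific file name for a cdylib built from `crate_name`.
fn cdylib_file_name(crate_name: &str) -> String {
    format!("{}{}{}", DLL_PREFIX, crate_name, DLL_SUFFIX)
}

/// Where `cargo build --release` places the library, assuming the
/// default `target/` directory under the workspace root.
fn release_library_path(workspace_root: &Path, crate_name: &str) -> PathBuf {
    workspace_root
        .join("target")
        .join("release")
        .join(cdylib_file_name(crate_name))
}
```

The resulting path is what would be passed to `--library` and copied next to the generated Python module.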

Smoke test:

/tmp/defuddle-py-uv/bin/python -c "from pathlib import Path; from defuddle import DefuddleParser; html = Path('tests/fixtures/example.html').read_text(); parser = DefuddleParser(); result = parser.extract_markdown(html, 'https://example.com'); print(result.title); print(result.word_count)"

4. Browser Extension

Build the WASM bundle and extension assets:

npm run build:wasm
npm run build:extension

Load unpacked from:

extension/

Extension-specific details are in extension/README.md.

5. Browser App

Build the static app:

npm run build

This emits dist/.

API Surface

Rust

Defuddle::parse(html, url)
Defuddle::fetch_and_parse(url)

Python

DefuddleParser.parse_html(html, url)
DefuddleParser.fetch_and_parse_url(url)
DefuddleParser.extract_metadata(html, url)
DefuddleParser.extract_markdown(html, url)

Result Shape

DefuddleResult includes:

title
author
published
site
description
image
language
content_html
content_markdown
word_count
schema_org
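As a rough picture of how this shape might be consumed, the sketch below mirrors the listed fields in a stand-in struct and renders a short metadata header from it. The field types are assumptions: the Quick Start prints `title`, `content_markdown`, and `word_count` directly, so those are modeled as plain values, while the remaining metadata is modeled as optional. `DefuddleResultSketch` and `front_matter` are hypothetical names, not part of the crate.

```rust
/// Hypothetical mirror of the documented fields. Types are assumptions.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct DefuddleResultSketch {
    pub title: String,
    pub author: Option<String>,
    pub published: Option<String>,
    pub site: Option<String>,
    pub description: Option<String>,
    pub image: Option<String>,
    pub language: Option<String>,
    pub content_html: String,
    pub content_markdown: String,
    pub word_count: usize,
    /// Modeled here as an optional raw JSON string (assumption).
    pub schema_org: Option<String>,
}

/// Render a short header from the metadata, skipping any optional
/// field the page did not provide.
pub fn front_matter(r: &DefuddleResultSketch) -> String {
    let mut lines = vec![format!("title: {}", r.title)];
    let optional = [
        ("author", &r.author),
        ("published", &r.published),
        ("site", &r.site),
        ("language", &r.language),
    ];
    for (key, value) in optional {
        if let Some(v) = value {
            lines.push(format!("{}: {}", key, v));
        }
    }
    lines.push(format!("words: {}", r.word_count));
    lines.join("\n")
}
```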

Repository Layout

defuddle-rs/
├── src/                    # Rust parser core, MCP server, Python API
├── tests/                  # crate, MCP, HTTPS, and Python integration tests
├── bindings/python/        # UniFFI Python package
├── extension/              # browser extension side panel
├── packages/defuddle-wasm/ # WASM wrapper package
├── app/                    # browser app
├── demo/                   # demo pipeline and assets
├── MCP.md                  # MCP setup and transport docs
├── PARITY.md               # parity notes against upstream fixtures
└── README.md

Parity

This repo includes fixture-based parity validation against upstream defuddle, including large and awkward pages like Hacker News threads, MDN docs, Wikipedia, GitHub, and blog content.

See PARITY.md for the parity notes.

Clean-Room Note

This is a clean-room Rust implementation. Upstream defuddle is used as a behavioral and architectural reference, but the code here is not a direct line-by-line translation of the TypeScript source.
