The main motivation is to make the output better suited to LLM ingestion, dataset creation, and reproducible text comparisons. Markdown provides a cleaner, more standardized structure than raw HTML, which is usually full of layout noise, scripts, and transient attributes.
Basic behavior
- <h1> → #, <h2> → ##, <h3> → ###
- <p> → simple text line
- <ul>/<ol> → Markdown lists
- <a> → [text](url)
- <img> → ![alt](src) (optional, configurable)
- <pre><code> → fenced code blocks
- <table> → Markdown table, with CSV as a fallback
- Inline spans or styling without semantic meaning are discarded
- Scripts, styles, and invisible nodes are ignored
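As a rough illustration of the mappings above, here is a minimal sketch built on Python's standard `html.parser`. It is not the planned implementation: the names `MarkdownSketch` and `html_to_markdown` and the `include_images` flag are hypothetical, and tables and invisible nodes other than scripts and styles are left out.

```python
from html.parser import HTMLParser

FENCE = "`" * 3  # built this way so the sketch can sit inside a fenced block
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
SKIPPED = {"script", "style"}


class MarkdownSketch(HTMLParser):
    """Toy converter for a subset of the mappings (hypothetical, not Pydoll API)."""

    def __init__(self, include_images=True):
        super().__init__(convert_charrefs=True)
        self.include_images = include_images
        self.blocks = []     # finished top-level Markdown blocks
        self.items = []      # lines of the list currently being built
        self.buf = []        # inline text of the block being built
        self.prefix = ""     # "# ", "- ", "1. ", or "" for paragraphs
        self.lists = []      # stack of [tag, counter] for open lists
        self.href = None     # URL of the open <a>, if any
        self.link_start = 0  # index in buf where the link text begins
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def _flush(self):
        # Collapse whitespace and emit the buffered text as one block or list item.
        text = " ".join("".join(self.buf).split())
        self.buf = []
        if text:
            (self.items if self.lists else self.blocks).append(self.prefix + text)
            self.prefix = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in HEADINGS:
            self._flush()
            self.prefix = HEADINGS[tag]
        elif tag in ("p", "pre"):
            self._flush()
        elif tag in ("ul", "ol"):
            self._flush()
            self.lists.append([tag, 0])
        elif tag == "li":
            self._flush()
            marker = "- "
            if self.lists:
                self.lists[-1][1] += 1
                if self.lists[-1][0] == "ol":
                    marker = f"{self.lists[-1][1]}. "
            # Four spaces per level nests correctly under both "- " and "1. ".
            self.prefix = "    " * max(len(self.lists) - 1, 0) + marker
        elif tag == "a":
            self.href = attrs.get("href")
            self.link_start = len(self.buf)
        elif tag == "img" and self.include_images:
            self.buf.append(f"![{attrs.get('alt') or ''}]({attrs.get('src') or ''})")
        # Any other tag (span, b, div, ...) is dropped; its text is kept.

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(self.skip_depth - 1, 0)
        elif tag in HEADINGS or tag in ("p", "li"):
            self._flush()
            self.prefix = ""
        elif tag in ("ul", "ol"):
            self._flush()
            if self.lists:
                self.lists.pop()
            if not self.lists and self.items:
                self.blocks.append("\n".join(self.items))
                self.items = []
        elif tag == "a":
            if self.href:
                text = "".join(self.buf[self.link_start:])
                self.buf[self.link_start:] = [f"[{text}]({self.href})"]
            self.href = None
        elif tag == "pre":
            # Keep the raw text (no whitespace collapsing) and fence it.
            code = "".join(self.buf).strip("\n")
            self.buf = []
            self.blocks.append(f"{FENCE}\n{code}\n{FENCE}")

    def handle_data(self, data):
        if not self.skip_depth:
            self.buf.append(data)


def html_to_markdown(html, include_images=True):
    parser = MarkdownSketch(include_images)
    parser.feed(html)
    parser.close()
    parser._flush()       # text left outside any block element
    if parser.items:      # a list that was never closed
        parser.blocks.append("\n".join(parser.items))
    return "\n\n".join(parser.blocks)
```

Calling `html_to_markdown(page_html)` on a string of HTML returns the Markdown blocks joined by blank lines. Passing `include_images=False` drops `<img>` elements, which mirrors the "optional, configurable" note in the list.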
Handling of sidebars, navigation blocks, and asides is not yet decided. Options: drop them entirely, append them at the bottom as “Notes,” or let the user configure them with include/exclude rules. Needs discussion.
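To make the three options concrete for discussion, here is one possible shape for such a setting. Everything in it is an assumption: `AuxMode`, `AuxPolicy`, `arrange`, and the default tag set are hypothetical names, and the input is assumed to be a list of `(source_tag, markdown)` pairs produced by an earlier conversion step.

```python
from dataclasses import dataclass
from enum import Enum


class AuxMode(Enum):
    DROP = "drop"    # discard auxiliary blocks entirely
    NOTES = "notes"  # move them to a trailing "Notes" section


@dataclass(frozen=True)
class AuxPolicy:
    mode: AuxMode = AuxMode.DROP
    # Tags treated as auxiliary (assumed default, open to discussion).
    exclude: frozenset = frozenset({"nav", "aside"})
    # Tags the user wants kept in place even if listed in `exclude`.
    include: frozenset = frozenset()


def arrange(blocks, policy):
    """Order (source_tag, markdown) pairs according to the policy."""
    main, aux = [], []
    for tag, text in blocks:
        is_aux = tag in policy.exclude and tag not in policy.include
        (aux if is_aux else main).append(text)
    if aux and policy.mode is AuxMode.NOTES:
        main.append("## Notes")
        main.extend(aux)
    return "\n\n".join(main)
```

With the default policy, navigation and aside blocks disappear from the output. `AuxPolicy(mode=AuxMode.NOTES)` keeps them under a trailing heading, and `include` lets a user opt specific tags back into the main flow.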
Likely to be implemented as a separate library (e.g. pydoll-markdown-exporter) to keep Pydoll’s core lightweight. Pydoll will call this library internally. A minimal prototype will be released first, covering essential mappings and already useful for RAG/LLM scenarios.