An open source alternative to Screaming Frog SEO Spider, built LLM-native from day one. 269 technical SEO rules across 18 categories, native MCP server, DuckDB + Parquet storage, MIT licensed.
Status: pre-alpha (May 2026).
crawlforge crawls a website and produces an opinionated technical SEO audit
backed by a queryable local database. If you have used Screaming Frog SEO
Spider, the workflow will feel familiar — same kinds of issues (titles, meta
descriptions, canonicals, hreflang, redirects, structured data, sitemaps,
images, mobile, AMP, accessibility…) — but with three architectural
differences explained below.
It is built around three ideas:
- One columnar source of truth. Each crawl is a directory of Parquet
files plus a DuckDB index. Run arbitrary SQL against the data afterwards,
diff two crawls with a JOIN, ship audits to a data warehouse without an
ETL step.
- Rules as code. Each of the 269 detection rules is a small, self-contained
Python file with its own unit test. Adding a new rule takes ~30 lines and a
pytest fixture. The rule contract is documented in docs/RULES_API.md.
- MCP-first. A Model Context Protocol server ships in the box, so Claude
Code / Codex / Cursor can drive a crawl, query its results, summarise issues
and write client-ready audits without any glue code.
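The "diff two crawls with a JOIN" idea can be sketched roughly as follows. This is an illustration under stated assumptions, not crawlforge's own diff feature: it uses the standard-library sqlite3 module as a stand-in for DuckDB so that it runs without dependencies, attaches two in-memory databases under the hypothetical names crawl_a and crawl_b, and relies only on the issues.rule_id, issues.url_id, urls.url_id and urls.url columns that appear in the query examples in this README. The rule IDs are made up.

```python
import sqlite3

# Issues present in crawl_b but not in crawl_a, matched on (rule_id, url).
# URLs are compared as text because url_id values may differ between crawls.
NEW_ISSUES_SQL = """
WITH a AS (
    SELECT i.rule_id, u.url
    FROM crawl_a.issues AS i JOIN crawl_a.urls AS u ON u.url_id = i.url_id
),
b AS (
    SELECT i.rule_id, u.url
    FROM crawl_b.issues AS i JOIN crawl_b.urls AS u ON u.url_id = i.url_id
)
SELECT b.rule_id, b.url
FROM b LEFT JOIN a ON a.rule_id = b.rule_id AND a.url = b.url
WHERE a.rule_id IS NULL
ORDER BY b.rule_id, b.url
"""


def attach_crawl(conn, name, pages):
    """Attach an in-memory database called `name` and seed it.

    `pages` maps each URL to the list of rule IDs that fired on it.
    """
    conn.execute(f"ATTACH DATABASE ':memory:' AS {name}")
    conn.execute(f"CREATE TABLE {name}.urls (url_id INTEGER PRIMARY KEY, url TEXT)")
    conn.execute(f"CREATE TABLE {name}.issues (rule_id TEXT, url_id INTEGER)")
    for url_id, (url, rule_ids) in enumerate(pages.items(), start=1):
        conn.execute(f"INSERT INTO {name}.urls VALUES (?, ?)", (url_id, url))
        conn.executemany(
            f"INSERT INTO {name}.issues VALUES (?, ?)",
            [(rule_id, url_id) for rule_id in rule_ids],
        )


def new_issues(conn):
    """Return (rule_id, url) pairs that appear only in the newer crawl."""
    return conn.execute(NEW_ISSUES_SQL).fetchall()


def demo():
    # isolation_level=None keeps the connection in autocommit mode, which
    # SQLite needs for ATTACH (it cannot run inside an open transaction).
    conn = sqlite3.connect(":memory:", isolation_level=None)
    attach_crawl(conn, "crawl_a", {
        "https://example.com/": ["title.missing"],
        "https://example.com/a": ["canonical.missing"],
    })
    attach_crawl(conn, "crawl_b", {
        "https://example.com/a": ["canonical.missing", "title.too_long"],
        "https://example.com/": [],
    })
    return new_issues(conn)
```

Calling demo() returns the one (rule_id, url) pair that is new in the second crawl. Swapping the two sides of the LEFT JOIN would list issues that were fixed instead.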
There is no GUI. There is no SaaS. There is no telemetry.
- Python 3.11+ with asyncio
- httpx for the crawler (Playwright opt-in for JS rendering, Phase D)
- lxml for HTML extraction
- advertools for sitemap / robots.txt helpers
- extruct for structured-data extraction (JSON-LD, Microdata, RDFa)
- datasketch for MinHash + LSH near-duplicate detection
- DuckDB over Parquet for storage and analytical queries
- FastMCP for the MCP server
- Typer + Rich for the CLI
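For readers unfamiliar with the MinHash technique mentioned in the list above, here is a from-scratch sketch of the signature step using only the standard library. It is not datasketch's API and not crawlforge's implementation, and it leaves out the LSH index entirely. The function names, the 3-word shingle size and the 64 hash functions are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Split text into a set of overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=64):
    """Return one minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_perm):
        # A distinct salt makes blake2b behave like a distinct hash function.
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots, which estimates the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two pages whose shingle sets overlap heavily produce signatures that agree in most slots, so comparing short signatures approximates comparing the full texts.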
git clone https://github.com/mario-hernandez/crawlforge.git
cd crawlforge
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
# Crawl a site (writes crawl.duckdb with all rules evaluated)
crawlforge run https://example.com -n 500 --db crawl.duckdb
# Issue counts by severity
crawlforge query crawl.duckdb "
SELECT severity, COUNT(*) AS n FROM issues GROUP BY severity
"
# Top 20 firing rules
crawlforge query crawl.duckdb "
SELECT rule_id, COUNT(*) AS n FROM issues
GROUP BY rule_id ORDER BY n DESC LIMIT 20
"
# Critical issues with their URL
crawlforge query crawl.duckdb "
SELECT i.rule_id, i.message, u.url
FROM issues i LEFT JOIN urls u ON u.url_id = i.url_id
WHERE severity = 'critical'
"
# Crawl metadata
crawlforge info crawl.duckdb
# List the 269 registered rules
crawlforge rules
Once registered, ask the assistant in plain language:
"Crawl example.com with crawlforge, max 200 URLs, then give me the top 10 critical issues with their URL."
"From my last crawl, list every redirect chain longer than 2 hops."
"Which hreflang rules fired? Group them by source URL."
The MCP server exposes 8 tools: crawl_run, list_crawls, query_urls,
list_issues, find_redirect_chains, find_orphans, audit_summary,
list_rules.
Claude Code:
claude mcp add crawlforge -s user -- /path/to/.venv/bin/python -m crawlforge.mcp
Codex (~/.codex/config.toml):
[mcp_servers.crawlforge]
command = "/path/to/.venv/bin/python"
args = ["-m", "crawlforge.mcp"]
Each crawl populates 7 relational tables in a DuckDB file:
- urls: one row per URL, with extracted on-page metadata (title, meta, headings, canonical, hreflang, content hash, simhash, response headers, …)
- links: internal/external link graph with anchor, rel, position, and in_navigation / in_footer flags
- redirects: chains with hop count and loop detection
- hreflang: declarations from HTML / HTTP header / sitemap
- structured_data: extracted JSON-LD / Microdata / RDFa with validation flags
- resources: images, scripts, stylesheets referenced
- issues: output of the 269 rules (rule_id, severity, category, evidence, message)
Full column reference: docs/SCHEMA.md.
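To make "chains with hop count and loop detection" concrete, the following is a minimal sketch of how a redirect chain could be resolved from individual source-to-target hops. It is not taken from crawlforge's code: the function names, the dict input format and the max_hops cap are assumptions for illustration only.

```python
def follow_redirects(start, redirects, max_hops=10):
    """Walk a {source_url: target_url} map from `start`.

    Returns (chain, is_loop): the URLs visited in order, and whether the
    walk revisited a URL. The hop count is len(chain) - 1.
    """
    chain = [start]
    seen = {start}
    while chain[-1] in redirects and len(chain) - 1 < max_hops:
        target = redirects[chain[-1]]
        chain.append(target)
        if target in seen:
            return chain, True  # came back to a URL already on the chain
        seen.add(target)
    return chain, False


def long_chains(redirects, min_hops=3):
    """Return (chain, is_loop) for chain heads with at least `min_hops` hops.

    Sources that are also targets are skipped so sub-chains are not reported
    twice; a closed cycle with no entry point is therefore not covered here.
    """
    targets = set(redirects.values())
    results = []
    for source in redirects:
        if source in targets:
            continue  # not the head of a chain
        chain, is_loop = follow_redirects(source, redirects)
        if len(chain) - 1 >= min_hops:
            results.append((chain, is_loop))
    return results
```

With min_hops=3 this corresponds to the "longer than 2 hops" question in the prompt examples above.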
- 269 SEO rules across 18 categories
- 309 tests passing, 22 skipped (placeholders awaiting schema enrichment)
- ~18,000 lines of Python
- End-to-end smoke test verified: crawl → extract → persist → all 269 rules execute without crashing on a seeded site
- Phase A — Crawl engine + storage + Rule API + 4 pilot rules
- Phase B — 269 SEO rules across 18 categories
- Phase C — MCP server + full CLI
- Phase D — Playwright JS rendering · Performance & Accessibility rules (~123 more) · GSC / GA4 / PSI integrations · crawl diff · HTML reports · lightweight web UI
- Phase E — PyPI release · Homebrew tap · Docker image
Detail in EXECUTION_PLAN.md. Full rule catalog in
docs/RULES_CATALOG.md.
| # | Category | Total | Active |
|---|---|---|---|
| 1 | Response Codes | 27 | 13 |
| 2 | Security | 14 | 9 |
| 3 | URL | 21 | 16 |
| 4 | Page Titles | 4 | 4 |
| 5 | Meta Description | 9 | 5 |
| 6 | Headings | 14 | 9 |
| 7 | Canonicals | 31 | 18 |
| 8 | Pagination | 7 | 2 |
| 9 | Hreflang | 33 | ~28 |
| 10 | Directives | 28 | ~14 |
| 11 | Sitemaps | 12 | 10 |
| 12 | Structured Data | 6 | 2 |
| 13 | Images | 9 | 5 |
| 14 | Mobile | 15 | 7 |
| 15 | AMP | 5 | 5 |
| 16 | Content | 15 | 5 |
| 17 | Links | 14 | 12 |
| 18 | Orphans | 5 | 4 |
| | Total | 269 | ~173 |
"Active" = enabled_by_default = True. The remaining rules ship as
placeholders that activate as the schema gains the underlying signals (JS
rendering, performance metrics, response headers, GSC/GA4 cross-data, …).
- Each rule lives in its own file under src/crawlforge/rules/<category>/ with a unit test under tests/unit/rules/<category>/. See docs/RULES_API.md for the contract.
- pytest from the repo root must stay green.
- Conventional commits welcome but not required.
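As a rough picture of what "one rule, one file, one test" might look like, here is a hypothetical sketch. The real contract lives in docs/RULES_API.md and may differ substantially: the Page and Issue classes, the check method and the rule identifier below are all invented for illustration. Only the Issue field names (taken from the issues table) and the enabled_by_default flag come from this README.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class Page:
    """Hypothetical stand-in for one row of the urls table."""
    url: str
    title: Optional[str] = None


@dataclass(frozen=True)
class Issue:
    """Mirrors the columns of the issues table listed in this README."""
    rule_id: str
    severity: str
    category: str
    evidence: str
    message: str


class MissingTitleRule:
    """Fires when a page has no title, or a title made only of whitespace."""

    rule_id = "page_titles.missing"  # made-up identifier
    severity = "critical"
    category = "Page Titles"
    enabled_by_default = True

    def check(self, page: Page) -> Iterator[Issue]:
        if not (page.title or "").strip():
            yield Issue(
                rule_id=self.rule_id,
                severity=self.severity,
                category=self.category,
                evidence=repr(page.title),
                message=f"{page.url} has no <title>",
            )


# The matching unit test, written as a plain function that pytest would collect.
def test_missing_title_rule():
    rule = MissingTitleRule()
    assert list(rule.check(Page("https://example.com/", "Home"))) == []
    issues = list(rule.check(Page("https://example.com/empty", "  ")))
    assert [issue.rule_id for issue in issues] == ["page_titles.missing"]
```

Keeping the rule's metadata as class attributes means a registry could filter on enabled_by_default without instantiating anything, though whether crawlforge does so is not stated here.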
MIT — see LICENSE.
Screaming Frog® and Screaming Frog SEO Spider® are trademarks of
Screaming Frog Ltd (UK, company no.
07277243). crawlforge is an independent open source project, not
affiliated with, endorsed by, or sponsored by Screaming Frog Ltd. References
to the product elsewhere in this document are nominative — they identify a
well-known piece of software in the same category so users can quickly
understand what crawlforge is for. No part of crawlforge is derived from
Screaming Frog's source code, configuration, or proprietary catalogs.
Other product, service, and company names are the property of their respective owners and are referenced only for clarity.
The detection rules implemented here encode well-known, publicly documented technical SEO best practices (HTTP semantics, indexing directives, Schema.org, Core Web Vitals, WCAG, etc.). Each rule links to the relevant primary source — typically Google Search Central, MDN Web Docs, web.dev, Schema.org, the W3C, or sitemaps.org — so the underlying standard can be inspected directly.
