mario-hernandez/crawlforge

crawlforge

An open source alternative to Screaming Frog SEO Spider, built LLM-native from day one. 269 technical SEO rules across 18 categories, native MCP server, DuckDB + Parquet storage, MIT licensed.

Status: pre-alpha (May 2026).

What it does

crawlforge crawls a website and produces an opinionated technical SEO audit backed by a queryable local database. If you have used Screaming Frog SEO Spider, the workflow will feel familiar: it flags the same kinds of issues (titles, meta descriptions, canonicals, hreflang, redirects, structured data, sitemaps, images, mobile, AMP, accessibility, …).

Architecturally, it is built around three ideas:

  1. One columnar source of truth. Each crawl is a directory of Parquet files plus a DuckDB index. You can run arbitrary SQL against the data afterwards, diff two crawls with a JOIN, or ship audits to a data warehouse without an ETL step.
  2. Rules as code. Each of the 269 detection rules is a small, self-contained Python file with its own unit test. Adding a new rule takes about 30 lines and a pytest fixture. The rule contract is documented in docs/RULES_API.md.
  3. MCP-first. A Model Context Protocol server ships in the box, so Claude Code / Codex / Cursor can drive a crawl, query its results, summarise issues and write client-ready audits without any glue code.
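
To give a feel for the "rules as code" idea, here is a minimal sketch of what a rule and its unit test might look like. The real contract is defined in docs/RULES_API.md and is not reproduced here: the Page and Issue dataclasses, the check_missing_title function, and the rule id "titles.missing" are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    """Hypothetical view of one crawled URL (not crawlforge's real type)."""
    url: str
    title: Optional[str]


@dataclass(frozen=True)
class Issue:
    """Hypothetical rule output, loosely mirroring the issues table."""
    rule_id: str
    severity: str
    category: str
    message: str


def check_missing_title(page: Page) -> Optional[Issue]:
    """Return an Issue when the page has no usable <title>, else None."""
    if page.title is None or not page.title.strip():
        return Issue(
            rule_id="titles.missing",  # assumed id format
            severity="critical",       # assumed severity for this rule
            category="Page Titles",
            message=f"{page.url} has no <title>",
        )
    return None


def test_check_missing_title() -> None:
    """pytest-style unit test that would sit next to the rule."""
    assert check_missing_title(Page("https://example.com/", "Home")) is None
    issue = check_missing_title(Page("https://example.com/a", "   "))
    assert issue is not None and issue.rule_id == "titles.missing"
```

Keeping each rule a pure function of already-extracted page data is what would make a rule this small and its test this cheap to write.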

There is no GUI. There is no SaaS. There is no telemetry.

Stack

  • Python 3.11+ with asyncio
  • httpx for the crawler (Playwright opt-in for JS rendering — Phase D)
  • lxml for HTML extraction
  • advertools for sitemap / robots.txt helpers
  • extruct for structured-data extraction (JSON-LD, Microdata, RDFa)
  • datasketch for MinHash + LSH near-duplicate detection
  • DuckDB over Parquet for storage and analytical queries
  • FastMCP for the MCP server
  • Typer + Rich for the CLI

Installation

git clone https://github.com/mario-hernandez/crawlforge.git
cd crawlforge
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

CLI usage

# Crawl a site (writes crawl.duckdb with all rules evaluated)
crawlforge run https://example.com -n 500 --db crawl.duckdb

# Issue counts by severity
crawlforge query crawl.duckdb "
  SELECT severity, COUNT(*) AS n FROM issues GROUP BY severity
"

# Top 20 firing rules
crawlforge query crawl.duckdb "
  SELECT rule_id, COUNT(*) AS n FROM issues
  GROUP BY rule_id ORDER BY n DESC LIMIT 20
"

# Critical issues with their URL
crawlforge query crawl.duckdb "
  SELECT i.rule_id, i.message, u.url
  FROM issues i LEFT JOIN urls u ON u.url_id = i.url_id
  WHERE i.severity = 'critical'
"

# Crawl metadata
crawlforge info crawl.duckdb

# List the 269 registered rules
crawlforge rules
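
The same SQL-first approach extends to comparing two crawls with a JOIN. The sketch below is an illustration under stated assumptions, not crawlforge's own diff feature: it uses Python's standard-library sqlite3 module as a stand-in so that it runs without extra dependencies, and it assumes a urls table with url and title columns (check docs/SCHEMA.md for the real names). The aliases old_crawl and new_crawl and the functions diff_crawls and demo are hypothetical.

```python
import sqlite3

# Lists URLs that were removed, added, or retitled between two crawls.
# Table and column names (urls.url, urls.title) are assumptions.
DIFF_SQL = """
SELECT o.url AS url, 'removed' AS change
FROM old_crawl.urls AS o
LEFT JOIN new_crawl.urls AS n ON n.url = o.url
WHERE n.url IS NULL
UNION ALL
SELECT n.url, 'added'
FROM new_crawl.urls AS n
LEFT JOIN old_crawl.urls AS o ON o.url = n.url
WHERE o.url IS NULL
UNION ALL
SELECT n.url, 'title_changed'
FROM new_crawl.urls AS n
JOIN old_crawl.urls AS o ON o.url = n.url
WHERE COALESCE(n.title, '') <> COALESCE(o.title, '')
ORDER BY url
"""


def diff_crawls(con: sqlite3.Connection) -> list[tuple[str, str]]:
    """Run the diff on a connection with both crawls attached."""
    return con.execute(DIFF_SQL).fetchall()


def demo() -> list[tuple[str, str]]:
    """Build two tiny in-memory 'crawls' and diff them."""
    con = sqlite3.connect(":memory:")
    for alias in ("old_crawl", "new_crawl"):
        con.execute(f"ATTACH DATABASE ':memory:' AS {alias}")
        con.execute(f"CREATE TABLE {alias}.urls (url TEXT, title TEXT)")
    con.executemany(
        "INSERT INTO old_crawl.urls VALUES (?, ?)",
        [("/a", "A"), ("/b", "B"), ("/c", "C")],
    )
    con.executemany(
        "INSERT INTO new_crawl.urls VALUES (?, ?)",
        [("/a", "A"), ("/b", "B2"), ("/d", "D")],
    )
    return diff_crawls(con)


if __name__ == "__main__":
    for url, change in demo():
        print(f"{change:14} {url}")
```

With DuckDB, the equivalent would presumably be to ATTACH the two crawl files under the same aliases and run the same statement; that has not been verified against crawlforge's actual schema.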

Using it from Claude Code / Codex / Cursor

Once the MCP server is registered (see Manual registration below), ask the assistant in plain language:

"Crawl example.com with crawlforge, max 200 URLs, then give me the top 10 critical issues with their URL."

"From my last crawl, list every redirect chain longer than 2 hops."

"Which hreflang rules fired? Group them by source URL."

The MCP server exposes 8 tools: crawl_run, list_crawls, query_urls, list_issues, find_redirect_chains, find_orphans, audit_summary, list_rules.
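
For readers curious what a question like "every redirect chain longer than 2 hops" involves, the sketch below walks redirect chains, counts hops, and flags loops. It is not the implementation behind find_redirect_chains; the Chain dataclass, the trace_chain and chains_longer_than functions, and the max_hops safety cap are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chain:
    start: str     # URL where the chain begins
    hops: int      # number of redirects followed
    final: str     # last URL reached
    is_loop: bool  # True when the chain revisits a URL


def trace_chain(start: str, redirects: dict[str, str], max_hops: int = 20) -> Chain:
    """Follow source -> target redirects from start until they stop or loop."""
    seen = {start}
    current = start
    hops = 0
    while current in redirects and hops < max_hops:
        current = redirects[current]
        hops += 1
        if current in seen:
            # We are back at a URL already visited: this is a loop.
            return Chain(start, hops, current, True)
        seen.add(current)
    return Chain(start, hops, current, False)


def chains_longer_than(redirects: dict[str, str], min_hops: int = 2) -> list[Chain]:
    """Return every chain with strictly more than min_hops redirects."""
    chains = (trace_chain(url, redirects) for url in redirects)
    return [chain for chain in chains if chain.hops > min_hops]


if __name__ == "__main__":
    sample = {"/a": "/b", "/b": "/c", "/c": "/d", "/x": "/y", "/y": "/x"}
    for chain in chains_longer_than(sample, min_hops=2):
        print(chain)
```

The seen set is what distinguishes a long but finite chain from a loop; the hop cap is only a backstop against pathological input.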

Manual registration

Claude Code:

claude mcp add crawlforge -s user -- /path/to/.venv/bin/python -m crawlforge.mcp

Codex (~/.codex/config.toml):

[mcp_servers.crawlforge]
command = "/path/to/.venv/bin/python"
args = ["-m", "crawlforge.mcp"]

Data model

Each crawl populates 7 relational tables in a DuckDB file:

  • urls — one row per URL, with extracted on-page metadata (title, meta, headings, canonical, hreflang, content hash, simhash, response headers, …)
  • links — internal/external link graph with anchor, rel, position, and in_navigation / in_footer flags
  • redirects — chains with hop count and loop detection
  • hreflang — declarations from HTML / HTTP header / sitemap
  • structured_data — extracted JSON-LD / Microdata / RDFa with validation flags
  • resources — images, scripts, stylesheets referenced
  • issues — output of the 269 rules: rule_id, severity, category, evidence, message

Full column reference: docs/SCHEMA.md.
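
As an example of what the urls and links tables make possible, the sketch below derives orphan pages using one common definition: a known URL that no other page links to. This is an illustration only, not the logic behind find_orphans or the Orphans rules; the find_orphans function shown here, its parameters, and the (source, target) link representation are hypothetical.

```python
def find_orphans(
    urls: set[str],
    links: list[tuple[str, str]],
    roots: set[str],
) -> list[str]:
    """Return known URLs that receive no link from any other page.

    urls  -- every URL known to the crawl (for example from sitemaps)
    links -- (source, target) pairs for internal links
    roots -- entry points such as the start URL, never reported as orphans
    """
    # A page linking to itself does not count as an inbound link.
    linked = {target for source, target in links if source != target}
    return sorted(urls - linked - roots)


if __name__ == "__main__":
    known = {"/", "/a", "/b", "/c"}
    internal_links = [("/", "/a"), ("/a", "/"), ("/c", "/c")]
    print(find_orphans(known, internal_links, roots={"/"}))
```

Expressed in SQL, the same idea is an anti-join from urls to links on the link target.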

Status

  • 269 SEO rules across 18 categories
  • 309 tests passing, 22 skipped (placeholders awaiting schema enrichment)
  • ~18,000 lines of Python
  • End-to-end smoke test verified: crawl → extract → persist → all 269 rules execute without crashing on a seeded site

Roadmap

  • Phase A — Crawl engine + storage + Rule API + 4 pilot rules
  • Phase B — 269 SEO rules across 18 categories
  • Phase C — MCP server + full CLI
  • Phase D — Playwright JS rendering · Performance & Accessibility rules (~123 more) · GSC / GA4 / PSI integrations · crawl diff · HTML reports · lightweight web UI
  • Phase E — PyPI release · Homebrew tap · Docker image

Detail in EXECUTION_PLAN.md. Full rule catalog in docs/RULES_CATALOG.md.

Coverage by category

 #  Category          Total  Active
 1  Response Codes       27      13
 2  Security             14       9
 3  URL                  21      16
 4  Page Titles           4       4
 5  Meta Description      9       5
 6  Headings             14       9
 7  Canonicals           31      18
 8  Pagination            7       2
 9  Hreflang             33     ~28
10  Directives           28     ~14
11  Sitemaps             12      10
12  Structured Data       6       2
13  Images                9       5
14  Mobile               15       7
15  AMP                   5       5
16  Content              15       5
17  Links                14      12
18  Orphans               5       4
    Total               269    ~173

"Active" means the rule ships with enabled_by_default = True. The remaining rules are placeholders that activate once the schema gains the signals they depend on (JS rendering, performance metrics, response headers, GSC/GA4 cross-data, …).

Contributing

  • Each rule lives in its own file under src/crawlforge/rules/<category>/ with a unit test under tests/unit/rules/<category>/. See docs/RULES_API.md for the contract.
  • Running pytest from the repo root must stay green.
  • Conventional commits welcome but not required.

License

MIT — see LICENSE.

Trademarks & references

Screaming Frog® and Screaming Frog SEO Spider® are trademarks of Screaming Frog Ltd (UK, company no. 07277243). crawlforge is an independent open source project, not affiliated with, endorsed by, or sponsored by Screaming Frog Ltd. References to the product elsewhere in this document are nominative — they identify a well-known piece of software in the same category so users can quickly understand what crawlforge is for. No part of crawlforge is derived from Screaming Frog's source code, configuration, or proprietary catalogs.

Other product, service, and company names are the property of their respective owners and are referenced only for clarity.

The detection rules implemented here encode well-known, publicly documented technical SEO best practices (HTTP semantics, indexing directives, Schema.org, Core Web Vitals, WCAG, etc.). Each rule links to the relevant primary source — typically Google Search Central, MDN Web Docs, web.dev, Schema.org, the W3C, or sitemaps.org — so the underlying standard can be inspected directly.
