Crawl4AI in 15MB. No Python. No Docker. Runs on a Raspberry Pi.
picocrawl is a single static Go binary that turns URLs into LLM-ready Markdown. It ships three surfaces from one executable: a CLI, an HTTP REST server, and an MCP stdio server. JavaScript rendering is delegated to a remote CDP endpoint (no embedded Chromium), and structured extraction is delegated to OpenAI-compatible LLM providers (no embedded model weights).
The script detects your OS and architecture, downloads the matching release binary, verifies its checksum, and installs it. Works on Linux and macOS:

    curl -fsSL https://andreahlert.github.io/picocrawl/install.sh | sh

It installs to /usr/local/bin when writable, otherwise ~/.local/bin.
Override with PICOCRAWL_INSTALL_DIR, or pin a version with
PICOCRAWL_VERSION.
Download the archive for your platform from the
releases page, extract it,
and put the picocrawl binary on your PATH. Windows builds are published as
.zip.
If you have Go 1.26 or newer:

    go install github.com/andreahlert/picocrawl/cmd/picocrawl@latest

Or clone the repository and build from source:

    git clone https://github.com/andreahlert/picocrawl
    cd picocrawl
    make build   # produces bin/picocrawl

A Homebrew tap is planned. The intended form:

    brew install andreahlert/tap/picocrawl   # planned

Common invocations:

    # Crawl one URL, print Markdown to stdout
    picocrawl crawl https://example.com

    # Crawl many URLs from a file, write one Markdown file per URL
    picocrawl crawl -f urls.txt -o ./out

    # Render JavaScript via a configured CDP browser
    picocrawl crawl --browser local-chrome https://example.com

    # Extract structured JSON against a JSON Schema
    picocrawl extract --schema product.json https://example.com/product

    # Run the REST API
    picocrawl serve --bind :8080

    # Run the MCP server over stdio
    picocrawl mcp

| | crawl4ai | picocrawl |
|---|---|---|
| Language | Python | Go |
| Distribution | pip package | single static binary |
| Runtime deps | Python 3 + Playwright + Chromium | none |
| JS rendering | bundled headless browser | remote CDP endpoint (opt-in) |
| LLM extraction | many providers, in-process | OpenAI-compatible HTTP providers |
| Interfaces | Python API + REST | CLI + REST + MCP |
| Footprint | heavy | targets a ~15MB binary |
The 15MB figure is a project target, not a measured guarantee. crawl4ai exposes a richer Python library surface; picocrawl deliberately keeps the supported surface to the CLI, REST, and MCP.
picocrawl reads an optional YAML config file. Default location:
~/.config/picocrawl/config.yaml. Override the path with the PICOCRAWL_CONFIG
environment variable.

    defaults:
      user_agent: "picocrawl/0.x (+https://github.com/andreahlert/picocrawl)"
      timeout: 30s
      concurrency: 4
      cache: true
      respect_robots: true
    browsers:
      - name: local-chrome
        cdp: http://localhost:9222
      - name: browserless
        cdp: wss://chrome.browserless.io
        api_key: $BROWSERLESS_TOKEN
    llms:
      - name: ollama
        base_url: http://localhost:11434/v1
        model: llama3.2:3b
      - name: openai
        base_url: https://api.openai.com/v1
        model: gpt-4o-mini
        api_key: $OPENAI_API_KEY
    storage:
      path: ~/.picocrawl/picocrawl.db
      cache_ttl: 24h
      max_size_mb: 500
    server:
      bind: ":8080"
      auth_token: $PICOCRAWL_TOKEN
      tls_cert: /etc/picocrawl/cert.pem
      tls_key: /etc/picocrawl/key.pem

respect_robots defaults to true; set it to false only for hosts you own.
Any value written as $NAME is resolved from the environment at load time, so
secrets stay out of the file. Settings can also be overridden directly with
PICOCRAWL_-prefixed environment variables (for example PICOCRAWL_SERVER_BIND).
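
To make the $NAME substitution concrete, here is a minimal Python sketch of resolving such values once the YAML has been parsed into dicts and lists. It is an illustration, not picocrawl's implementation: resolve_value and resolve_tree are hypothetical names, and two behaviors are assumptions rather than documented facts: only values that consist entirely of $NAME are substituted, and an unset variable resolves to an empty string.

```python
import os
import re

# Matches a value that is exactly "$NAME" (assumption: no inline "${NAME}" forms).
_ENV_REF = re.compile(r"^\$([A-Za-z_][A-Za-z0-9_]*)$")


def resolve_value(value, env):
    """Replace a "$NAME" string with env[NAME]; leave everything else as-is."""
    if isinstance(value, str):
        match = _ENV_REF.match(value)
        if match:
            # Assumption: an unset variable becomes an empty string.
            return env.get(match.group(1), "")
    return value


def resolve_tree(node, env=None):
    """Walk a parsed config (dicts, lists, scalars) and resolve every value."""
    env = os.environ if env is None else env
    if isinstance(node, dict):
        return {key: resolve_tree(child, env) for key, child in node.items()}
    if isinstance(node, list):
        return [resolve_tree(child, env) for child in node]
    return resolve_value(node, env)


if __name__ == "__main__":
    config = {
        "server": {"bind": ":8080", "auth_token": "$PICOCRAWL_TOKEN"},
        "llms": [{"name": "openai", "api_key": "$OPENAI_API_KEY"}],
    }
    print(resolve_tree(config, {"PICOCRAWL_TOKEN": "s3cret"}))
```

Because the function returns a new structure, the parsed file contents are never mutated and the secret only exists in the resolved copy.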
| Command | Description | Key flags |
|---|---|---|
| crawl [url] | Fetch a URL and print LLM-ready Markdown | -f, --file, -o, --out, --browser, --cache, --concurrency, --timeout, --user-agent, --depth, --same-domain, --max-pages, --allow-private |
| extract <url> | Crawl a URL then extract structured JSON via an LLM and schema | --schema (required), --llm (default openai) |
| serve | Run the REST API server | --bind (default :8080), --auth-token, --browser, --tls-cert, --tls-key |
| mcp | Run the MCP server over stdio | none |
| runs list | List recent crawl runs | none |
| runs resume <run-id> | Resume a tracked run from its pending URLs | none |
A deep crawl (crawl --depth N) is tracked as a run and capped at 200 pages by
default; raise or lower the cap with --max-pages. The HTTP fetcher refuses
loopback, private, and link-local addresses; pass --allow-private to crawl
local services during development.
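
As a rough mental model of how a depth limit and a page cap interact, the Python sketch below runs a bounded traversal over a toy link graph. Everything in it is an assumption made for illustration: the breadth-first order, the convention that the seed URL is depth 0, and the names deep_crawl and get_links are hypothetical and not taken from picocrawl's source. The URLs left in the queue when the cap is hit correspond to the pending URLs that a resumed run would pick up.

```python
from collections import deque


def deep_crawl(seed, get_links, depth, max_pages=200):
    """Breadth-first traversal bounded by link depth and a page cap.

    Returns (visited, pending): the URLs fetched in order, and the URLs
    that were queued but not fetched because the cap was reached.
    """
    visited = []
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue and len(visited) < max_pages:
        url, level = queue.popleft()
        visited.append(url)
        if level >= depth:
            continue  # do not follow links beyond the requested depth
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    pending = [url for url, _ in queue]
    return visited, pending


if __name__ == "__main__":
    # Toy link graph standing in for real pages.
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": []}
    print(deep_crawl("a", lambda u: graph.get(u, []), depth=2, max_pages=2))
```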
serve speaks plain HTTP unless --tls-cert and --tls-key are both given,
in which case it serves HTTPS directly. Without TLS, run it behind a
TLS-terminating proxy so the bearer token is not sent in cleartext.
Run picocrawl <command> --help for the full flag list.
The server is built on a chi router. Routes under /v1 are protected by bearer
auth when an auth token is set (via --auth-token or server.auth_token);
requests must then send Authorization: Bearer <token>. With no token
configured, /v1 routes are open, and serve prints a warning when it binds a
non-loopback address without a token. The fetcher refuses loopback, private,
and link-local targets, so the server cannot be used as an SSRF proxy.
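
The address classes named above can be checked with Python's standard ipaddress module. This is a sketch of the classification only, not picocrawl's fetcher: is_refused is a hypothetical name, and a real implementation would also need to apply the check to every address a hostname resolves to, at connection time, to avoid DNS rebinding.

```python
import ipaddress


def is_refused(address):
    """Return True for loopback, private, and link-local IP addresses."""
    ip = ipaddress.ip_address(address)
    return ip.is_loopback or ip.is_private or ip.is_link_local


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "10.0.0.5", "169.254.169.254", "::1", "8.8.8.8"):
        print(candidate, "refused" if is_refused(candidate) else "allowed")
```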
| Method | Path | Notes |
|---|---|---|
| GET | /healthz | Liveness check, returns 200 |
| POST | /v1/crawl | Crawl a URL. Body: {"url", "browser", "depth", "same_domain", "max_pages"}. With depth 0 returns one crawl result; with depth > 0 it runs a tracked deep crawl and returns {"run_id", "results"} |
| POST | /v1/crawl/batch | Crawl many URLs. Body: {"urls", "concurrency"}. Streams newline-delimited JSON (application/x-ndjson). Capped at 10000 URLs and concurrency 64 |
| GET | /v1/runs | List recent tracked crawl runs |
| GET | /v1/runs/{id} | Fetch a single run by id |
| POST | /v1/runs/{id}/resume | Resume a run from its pending URLs |
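
For callers scripting against these routes, the following Python sketch builds an authenticated JSON POST and parses a newline-delimited JSON stream using only the standard library. The route paths, body keys, and Authorization header come from the text above; the helper names build_request and iter_ndjson, the host and port, and the token value are placeholders chosen for illustration. The shape of each streamed result object is not specified here, so the sketch treats it as opaque JSON.

```python
import json
import urllib.request


def build_request(base_url, path, payload, token=None):
    """Build a JSON POST request, adding bearer auth when a token is given."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def iter_ndjson(lines):
    """Yield one decoded JSON object per non-empty line."""
    for raw in lines:
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        text = text.strip()
        if text:
            yield json.loads(text)


if __name__ == "__main__":
    request = build_request(
        "http://localhost:8080",
        "/v1/crawl/batch",
        {"urls": ["https://example.com"], "concurrency": 4},
        token="example-token",
    )
    print(request.get_method(), request.full_url)
    # Against a running server, the response can be streamed line by line:
    #   with urllib.request.urlopen(request) as response:
    #       for result in iter_ndjson(response):
    #           print(result)
```

Reading the batch response line by line means results can be handled as they arrive instead of waiting for all URLs to finish.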
Start the MCP server over stdio:

    picocrawl mcp

It exposes three tools:

- crawl_url: fetch a URL and return LLM-ready Markdown. Parameters: url (required), browser, depth. With depth > 0 it runs a tracked deep crawl and returns {"run_id", "results"}.
- get_cached: return cached Markdown for a URL if present. Parameter: url (required).
- list_runs: list recent tracked crawl runs, newest first. No parameters.
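
MCP clients normally construct tool calls for you, but it can help to see what travels over stdio. MCP uses JSON-RPC 2.0, and a tool is invoked with the tools/call method after the client has completed the initialize handshake. The Python sketch below only serializes such requests for the three tools listed above; tools_call is a hypothetical helper name, and the argument values are examples.

```python
import json


def tools_call(request_id, tool, arguments):
    """Serialize a JSON-RPC 2.0 tools/call request as a single line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # The stdio transport is newline-delimited, so the message must stay on one line.
    return json.dumps(message, separators=(",", ":"))


if __name__ == "__main__":
    print(tools_call(1, "crawl_url", {"url": "https://example.com", "depth": 0}))
    print(tools_call(2, "get_cached", {"url": "https://example.com"}))
    print(tools_call(3, "list_runs", {}))
```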