Skip to content

atoolz/picocrawl

picocrawl

Crawl4AI in 15MB. No Python. No Docker. Runs on a Raspberry Pi.

picocrawl is a single static Go binary that turns URLs into LLM-ready Markdown. It ships three surfaces from one executable: a CLI, an HTTP REST server, and an MCP stdio server. JavaScript rendering is delegated to a remote CDP endpoint (no embedded Chromium), and structured extraction is delegated to OpenAI-compatible LLM providers (no embedded model weights).

Install

Install script (recommended)

The script detects your OS and architecture, downloads the matching release binary, verifies its checksum, and installs it. Works on Linux and macOS:

curl -fsSL https://andreahlert.github.io/picocrawl/install.sh | sh

It installs to /usr/local/bin when writable, otherwise ~/.local/bin. Override with PICOCRAWL_INSTALL_DIR, or pin a version with PICOCRAWL_VERSION.

Prebuilt binary

Download the archive for your platform from the releases page, extract it, and put the picocrawl binary on your PATH. Windows builds are published as .zip.

Go install

If you have Go 1.26 or newer:

go install github.com/andreahlert/picocrawl/cmd/picocrawl@latest

Build from source

git clone https://github.com/andreahlert/picocrawl
cd picocrawl
make build        # produces bin/picocrawl

Homebrew (planned)

A Homebrew tap is planned. The intended form:

brew install andreahlert/tap/picocrawl   # planned

Quick start

# Crawl one URL, print Markdown to stdout
picocrawl crawl https://example.com

# Crawl many URLs from a file, write one Markdown file per URL
picocrawl crawl -f urls.txt -o ./out

# Render JavaScript via a configured CDP browser
picocrawl crawl --browser local-chrome https://example.com

# Extract structured JSON against a JSON Schema
picocrawl extract --schema product.json https://example.com/product

# Run the REST API
picocrawl serve --bind :8080

# Run the MCP server over stdio
picocrawl mcp

picocrawl vs crawl4ai

crawl4ai picocrawl
Language Python Go
Distribution pip package single static binary
Runtime deps Python 3 + Playwright + Chromium none
JS rendering bundled headless browser remote CDP endpoint (opt-in)
LLM extraction many providers, in-process OpenAI-compatible HTTP providers
Interfaces Python API + REST CLI + REST + MCP
Footprint heavy targets a ~15MB binary

The 15MB figure is a project target, not a measured guarantee. crawl4ai exposes a richer Python library surface; picocrawl deliberately keeps the supported surface to the CLI, REST, and MCP.

Configuration

picocrawl reads an optional YAML config file. Default location: ~/.config/picocrawl/config.yaml. Override the path with the PICOCRAWL_CONFIG environment variable.

defaults:
  user_agent: "picocrawl/0.x (+https://github.com/andreahlert/picocrawl)"
  timeout: 30s
  concurrency: 4
  cache: true
  respect_robots: true

browsers:
  - name: local-chrome
    cdp: http://localhost:9222
  - name: browserless
    cdp: wss://chrome.browserless.io
    api_key: $BROWSERLESS_TOKEN

llms:
  - name: ollama
    base_url: http://localhost:11434/v1
    model: llama3.2:3b
  - name: openai
    base_url: https://api.openai.com/v1
    model: gpt-4o-mini
    api_key: $OPENAI_API_KEY

storage:
  path: ~/.picocrawl/picocrawl.db
  cache_ttl: 24h
  max_size_mb: 500

server:
  bind: ":8080"
  auth_token: $PICOCRAWL_TOKEN
  tls_cert: /etc/picocrawl/cert.pem
  tls_key: /etc/picocrawl/key.pem

respect_robots defaults to true; set it to false only for hosts you own.

Any value written as $NAME is resolved from the environment at load time, so secrets stay out of the file. Settings can also be overridden directly with PICOCRAWL_-prefixed environment variables (for example PICOCRAWL_SERVER_BIND).

CLI reference

Command Description Key flags
crawl [url] Fetch a URL and print LLM-ready Markdown -f, --file, -o, --out, --browser, --cache, --concurrency, --timeout, --user-agent, --depth, --same-domain, --max-pages, --allow-private
extract <url> Crawl a URL then extract structured JSON via an LLM and schema --schema (required), --llm (default openai)
serve Run the REST API server --bind (default :8080), --auth-token, --browser, --tls-cert, --tls-key
mcp Run the MCP server over stdio none
runs list List recent crawl runs none
runs resume <run-id> Resume a tracked run from its pending URLs none

A deep crawl (crawl --depth N) is tracked as a run and capped at 200 pages by default; raise or lower the cap with --max-pages. The HTTP fetcher refuses loopback, private, and link-local addresses; pass --allow-private to crawl local services during development.

serve speaks plain HTTP unless --tls-cert and --tls-key are both given, in which case it serves HTTPS directly. Without TLS, run it behind a TLS-terminating proxy so the bearer token is not sent in cleartext.

Run picocrawl <command> --help for the full flag list.

REST reference

The server is built on a chi router. Routes under /v1 are protected by bearer auth when an auth token is set (via --auth-token or server.auth_token); requests must then send Authorization: Bearer <token>. With no token configured, /v1 routes are open, and serve prints a warning when it binds a non-loopback address without a token. The fetcher refuses loopback, private, and link-local targets, so the server cannot be used as an SSRF proxy.

Method Path Notes
GET /healthz Liveness check, returns 200
POST /v1/crawl Crawl a URL. Body: {"url", "browser", "depth", "same_domain", "max_pages"}. With depth 0 returns one crawl result; with depth > 0 it runs a tracked deep crawl and returns {"run_id", "results"}
POST /v1/crawl/batch Crawl many URLs. Body: {"urls", "concurrency"}. Streams newline-delimited JSON (application/x-ndjson). Capped at 10000 URLs and concurrency 64
GET /v1/runs List recent tracked crawl runs
GET /v1/runs/{id} Fetch a single run by id
POST /v1/runs/{id}/resume Resume a run from its pending URLs

MCP usage

Start the MCP server over stdio:

picocrawl mcp

It exposes three tools:

  • crawl_url, fetch a URL and return LLM-ready Markdown. Parameters: url (required), browser, depth. With depth > 0 it runs a tracked deep crawl and returns {"run_id", "results"}.
  • get_cached, return cached Markdown for a URL if present. Parameter: url (required).
  • list_runs, list recent tracked crawl runs, newest first. No parameters.

License

Apache License 2.0, see LICENSE and NOTICE.

About

Single-binary crawler turning URLs into LLM-ready Markdown. CLI, HTTP REST, and MCP stdio in 15MB.

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors