picocrawl

Crawl4AI in 15MB. No Python. No Docker. Runs on a Raspberry Pi.

picocrawl is a single static Go binary that turns URLs into LLM-ready Markdown. It ships three surfaces from one executable: a CLI, an HTTP REST server, and an MCP stdio server. JavaScript rendering is delegated to a remote CDP endpoint (no embedded Chromium), and structured extraction is delegated to OpenAI-compatible LLM providers (no embedded model weights).

Install

Install script (recommended)

The script detects your OS and architecture, downloads the matching release binary, verifies its checksum, and installs it. Works on Linux and macOS:

curl -fsSL https://andreahlert.github.io/picocrawl/install.sh | sh

It installs to /usr/local/bin when writable, otherwise ~/.local/bin. Override with PICOCRAWL_INSTALL_DIR, or pin a version with PICOCRAWL_VERSION.

Prebuilt binary

Download the archive for your platform from the releases page, extract it, and put the picocrawl binary on your PATH. Windows builds are published as .zip.

Go install

If you have Go 1.26 or newer:

go install github.com/andreahlert/picocrawl/cmd/picocrawl@latest

Build from source

git clone https://github.com/andreahlert/picocrawl
cd picocrawl
make build        # produces bin/picocrawl

Homebrew (planned)

A Homebrew tap is planned. The intended form:

brew install andreahlert/tap/picocrawl   # planned

Quick start

# Crawl one URL, print Markdown to stdout
picocrawl crawl https://example.com

# Crawl many URLs from a file, write one Markdown file per URL
picocrawl crawl -f urls.txt -o ./out

# Render JavaScript via a configured CDP browser
picocrawl crawl --browser local-chrome https://example.com

# Extract structured JSON against a JSON Schema
picocrawl extract --schema product.json https://example.com/product

# Run the REST API
picocrawl serve --bind :8080

# Run the MCP server over stdio
picocrawl mcp

picocrawl vs crawl4ai

	crawl4ai	picocrawl
Language	Python	Go
Distribution	pip package	single static binary
Runtime deps	Python 3 + Playwright + Chromium	none
JS rendering	bundled headless browser	remote CDP endpoint (opt-in)
LLM extraction	many providers, in-process	OpenAI-compatible HTTP providers
Interfaces	Python API + REST	CLI + REST + MCP
Footprint	heavy	targets a ~15MB binary

The 15MB figure is a project target, not a measured guarantee. crawl4ai exposes a richer Python library surface; picocrawl deliberately keeps the supported surface to the CLI, REST, and MCP.

Configuration

picocrawl reads an optional YAML config file. Default location: ~/.config/picocrawl/config.yaml. Override the path with the PICOCRAWL_CONFIG environment variable.

defaults:
  user_agent: "picocrawl/0.x (+https://github.com/andreahlert/picocrawl)"
  timeout: 30s
  concurrency: 4
  cache: true
  respect_robots: true

browsers:
  - name: local-chrome
    cdp: http://localhost:9222
  - name: browserless
    cdp: wss://chrome.browserless.io
    api_key: $BROWSERLESS_TOKEN

llms:
  - name: ollama
    base_url: http://localhost:11434/v1
    model: llama3.2:3b
  - name: openai
    base_url: https://api.openai.com/v1
    model: gpt-4o-mini
    api_key: $OPENAI_API_KEY

storage:
  path: ~/.picocrawl/picocrawl.db
  cache_ttl: 24h
  max_size_mb: 500

server:
  bind: ":8080"
  auth_token: $PICOCRAWL_TOKEN
  tls_cert: /etc/picocrawl/cert.pem
  tls_key: /etc/picocrawl/key.pem

respect_robots defaults to true; set it to false only for hosts you own.

Any value written as $NAME is resolved from the environment at load time, so secrets stay out of the file. Settings can also be overridden directly with PICOCRAWL_-prefixed environment variables (for example PICOCRAWL_SERVER_BIND).

CLI reference

Command	Description	Key flags
`crawl [url]`	Fetch a URL and print LLM-ready Markdown	`-f, --file`, `-o, --out`, `--browser`, `--cache`, `--concurrency`, `--timeout`, `--user-agent`, `--depth`, `--same-domain`, `--max-pages`, `--allow-private`
`extract <url>`	Crawl a URL then extract structured JSON via an LLM and schema	`--schema` (required), `--llm` (default `openai`)
`serve`	Run the REST API server	`--bind` (default `:8080`), `--auth-token`, `--browser`, `--tls-cert`, `--tls-key`
`mcp`	Run the MCP server over stdio	none
`runs list`	List recent crawl runs	none
`runs resume <run-id>`	Resume a tracked run from its pending URLs	none

A deep crawl (crawl --depth N) is tracked as a run and capped at 200 pages by default; raise or lower the cap with --max-pages. The HTTP fetcher refuses loopback, private, and link-local addresses; pass --allow-private to crawl local services during development.

serve speaks plain HTTP unless --tls-cert and --tls-key are both given, in which case it serves HTTPS directly. Without TLS, run it behind a TLS-terminating proxy so the bearer token is not sent in cleartext.

Run picocrawl <command> --help for the full flag list.

REST reference

The server is built on a chi router. Routes under /v1 are protected by bearer auth when an auth token is set (via --auth-token or server.auth_token); requests must then send Authorization: Bearer <token>. With no token configured, /v1 routes are open, and serve prints a warning when it binds a non-loopback address without a token. The fetcher refuses loopback, private, and link-local targets, so the server cannot be used as an SSRF proxy.

Method	Path	Notes
`GET`	`/healthz`	Liveness check, returns `200`
`POST`	`/v1/crawl`	Crawl a URL. Body: `{"url", "browser", "depth", "same_domain", "max_pages"}`. With `depth` 0 returns one crawl result; with `depth` > 0 it runs a tracked deep crawl and returns `{"run_id", "results"}`
`POST`	`/v1/crawl/batch`	Crawl many URLs. Body: `{"urls", "concurrency"}`. Streams newline-delimited JSON (`application/x-ndjson`). Capped at 10000 URLs and concurrency 64
`GET`	`/v1/runs`	List recent tracked crawl runs
`GET`	`/v1/runs/{id}`	Fetch a single run by id
`POST`	`/v1/runs/{id}/resume`	Resume a run from its pending URLs

MCP usage

Start the MCP server over stdio:

picocrawl mcp

It exposes three tools:

crawl_url, fetch a URL and return LLM-ready Markdown. Parameters: url (required), browser, depth. With depth > 0 it runs a tracked deep crawl and returns {"run_id", "results"}.
get_cached, return cached Markdown for a URL if present. Parameter: url (required).
list_runs, list recent tracked crawl runs, newest first. No parameters.

License

Apache License 2.0, see LICENSE and NOTICE.

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
.github/workflows		.github/workflows
cmd/picocrawl		cmd/picocrawl
internal		internal
site		site
testdata		testdata
.gitignore		.gitignore
.goreleaser.yaml		.goreleaser.yaml
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
NOTICE		NOTICE
README.md		README.md
go.mod		go.mod
go.sum		go.sum
install.sh		install.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

picocrawl

Install

Install script (recommended)

Prebuilt binary

Go install

Build from source

Homebrew (planned)

Quick start

picocrawl vs crawl4ai

Configuration

CLI reference

REST reference

MCP usage

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

picocrawl

Install

Install script (recommended)

Prebuilt binary

Go install

Build from source

Homebrew (planned)

Quick start

picocrawl vs crawl4ai

Configuration

CLI reference

REST reference

MCP usage

License

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages