A concurrent web crawler written in Go. Single-run CLI: give it a starting URL and a depth, and it crawls the site concurrently within configured bounds and writes the results to a JSON file.
This is a learning project, built as a warm-up for a larger Go project (Eldrago, a finite element analysis engine). The goals are to exercise Go's concurrency model (goroutines, channels, worker pools, context cancellation), idiomatic error handling, and the standard library — without reaching for heavy frameworks.
- Concurrent fetching with a bounded worker pool
- Configurable crawl depth and concurrency
- Per-request timeouts and graceful shutdown on SIGINT
- URL deduplication and same-host restriction by default
- Robots.txt respected
- Structured JSON output suitable for downstream analysis
go build -o gocrawl ./cmd/gocrawl
./gocrawl -url https://example.com -depth 2 -concurrency 8 -timeout 10s -out results.json

| Flag | Default | Description |
|---|---|---|
| `-url` | (required) | Starting URL |
| `-depth` | `2` | Maximum link depth from the starting URL |
| `-concurrency` | `8` | Number of concurrent workers |
| `-timeout` | `10s` | Per-request HTTP timeout |
| `-out` | `results.json` | Output file path |
| `-same-host` | `true` | Restrict crawling to the starting URL's host |
| `-user-agent` | `gocrawl/0.1` | User-agent string sent with each request |
{
"start_url": "https://example.com",
"started_at": "2026-04-19T10:15:00Z",
"finished_at": "2026-04-19T10:15:12Z",
"pages_crawled": 47,
"pages_failed": 2,
"pages": [
{
"url": "https://example.com/",
"status": 200,
"depth": 0,
"links_found": 14,
"fetched_at": "2026-04-19T10:15:00Z"
}
],
"errors": [
{
"url": "https://example.com/broken",
"error": "context deadline exceeded"
}
]
}

The crawler follows a classic fan-out / fan-in pattern:

           ┌─────────────┐
           │  main loop  │
           │ (scheduler) │
           └──────┬──────┘
                  │ jobs
                  ▼
        ┌──────────────────────┐
        │     jobs channel     │
        │      (buffered)      │
        └──┬──────┬──────┬─────┘
           │      │      │
           ▼      ▼      ▼
        ┌────┐ ┌────┐ ┌────┐
        │ W1 │ │ W2 │ │ Wn │    N workers,
        └──┬─┘ └──┬─┘ └──┬─┘    bounded by -concurrency
           │      │      │
           └──────┼──────┘
                  ▼
        ┌──────────────────────┐
        │   results channel    │
        └──────────┬───────────┘
                  ▼
           ┌─────────────┐
           │ aggregator  │ → JSON output
           └─────────────┘

Key design choices:
- Bounded concurrency via a worker pool, not unbounded goroutines. A crawler that spawns one goroutine per URL will crash on large sites and behave badly toward the target host.
- `context.Context` propagation throughout the call stack, so `Ctrl-C` cancels in-flight requests cleanly rather than leaking connections.
- `Fetcher` interface separates HTTP concerns from crawling logic, which makes the crawler testable without hitting the network.
- URL deduplication via a mutex-guarded map. A `sync.Map` would also work; a plain map with a `sync.RWMutex` is clearer for a project this size.

    gocrawl/
    ├── cmd/
    │   └── gocrawl/
    │       └── main.go            # Entrypoint, flag parsing, signal handling
    ├── internal/
    │   ├── crawler/
    │   │   ├── crawler.go         # Worker pool, scheduling, dedup
    │   │   ├── crawler_test.go
    │   │   └── fetcher.go         # Fetcher interface + HTTP implementation
    │   ├── parser/
    │   │   ├── parser.go          # HTML link extraction
    │   │   └── parser_test.go
    │   └── output/
    │       ├── output.go          # JSON report writer
    │       └── output_test.go
    └── go.mod

go test ./...
go test -race ./... # race detector — important for concurrent code
go test -cover ./...

Tests use `net/http/httptest` to spin up local servers rather than hitting real sites. The `Fetcher` interface is mocked in unit tests where only parsing or scheduling behaviour is under test.
This is deliberately a small project. Things it does not do, by design:
- No JavaScript rendering (static HTML only)
- No persistent state between runs — every run starts fresh
- No distributed crawling
- No content extraction beyond link discovery
- No retry logic on transient failures (one attempt per URL)
Most of my production experience is in TypeScript. This project is a deliberate exercise to build Go reflexes — particularly around the concurrency primitives (goroutines, channels, context, sync) — before starting work on a larger Go project where correctness under concurrency actually matters.
The scope is set to force exposure to the interesting parts of Go without ballooning into a maintained tool. Once it works end-to-end and has decent test coverage, it's done.
MIT