gocrawl

A concurrent web crawler written in Go. It is a single-run CLI: given a starting URL and a maximum depth, it crawls the site concurrently within the configured bounds and writes the results to a JSON file.

This is a learning project, built as a warm-up for a larger Go project (Eldrago, a finite element analysis engine). The goals are to exercise Go's concurrency model (goroutines, channels, worker pools, context cancellation), idiomatic error handling, and the standard library — without reaching for heavy frameworks.

Features

Concurrent fetching with a bounded worker pool
Configurable crawl depth and concurrency
Per-request timeouts and graceful shutdown on SIGINT
URL deduplication and same-host restriction by default
Respects robots.txt
Structured JSON output suitable for downstream analysis

Usage

go build -o gocrawl ./cmd/gocrawl

./gocrawl -url https://example.com -depth 2 -concurrency 8 -timeout 10s -out results.json

Flags

| Flag | Default | Description |
| --- | --- | --- |
| `-url` | (required) | Starting URL |
| `-depth` | `2` | Maximum link depth from the starting URL |
| `-concurrency` | `8` | Number of concurrent workers |
| `-timeout` | `10s` | Per-request HTTP timeout |
| `-out` | `results.json` | Output file path |
| `-same-host` | `true` | Restrict crawling to the starting URL's host |
| `-user-agent` | `gocrawl/0.1` | User-agent string sent with each request |

Example output

{
  "start_url": "https://example.com",
  "started_at": "2026-04-19T10:15:00Z",
  "finished_at": "2026-04-19T10:15:12Z",
  "pages_crawled": 47,
  "pages_failed": 2,
  "pages": [
    {
      "url": "https://example.com/",
      "status": 200,
      "depth": 0,
      "links_found": 14,
      "fetched_at": "2026-04-19T10:15:00Z"
    }
  ],
  "errors": [
    {
      "url": "https://example.com/broken",
      "error": "context deadline exceeded"
    }
  ]
}

Architecture

The crawler follows a classic fan-out / fan-in pattern:

                  ┌─────────────┐
                  │  main loop  │
                  │  (scheduler)│
                  └──────┬──────┘
                         │ jobs
                         ▼
              ┌──────────────────────┐
              │   jobs channel       │
              │   (buffered)         │
              └──┬──────┬──────┬─────┘
                 │      │      │
                 ▼      ▼      ▼
              ┌────┐ ┌────┐ ┌────┐
              │ W1 │ │ W2 │ │ Wn │      N workers,
              └──┬─┘ └──┬─┘ └──┬─┘      bounded by -concurrency
                 │      │      │
                 └──────┼──────┘
                        ▼
              ┌──────────────────────┐
              │  results channel     │
              └──────────┬───────────┘
                         ▼
                  ┌─────────────┐
                  │  aggregator │ → JSON output
                  └─────────────┘

Key design choices:

Bounded concurrency via a worker pool, not unbounded goroutines. A crawler that spawns one goroutine per URL can exhaust memory and file descriptors on large sites, and it floods the target host with simultaneous requests.
context.Context propagation throughout the call stack, so Ctrl-C cancels in-flight requests cleanly rather than leaking connections.
Fetcher interface separates HTTP concerns from crawling logic, which makes the crawler testable without hitting the network.
URL deduplication via a mutex-guarded map. A sync.Map would also work; a plain map with a sync.RWMutex is clearer for a project this size.
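A minimal sketch of the mutex-guarded deduplication map described above, assuming a hypothetical `visitedSet` type. The important detail is that the check and the insert happen under a single write lock. A separate "has it?" call followed by "add it" would let two workers claim the same URL.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// visitedSet is a hypothetical concurrency-safe set of URLs.
type visitedSet struct {
	mu   sync.RWMutex
	seen map[string]struct{}
}

func newVisitedSet() *visitedSet {
	return &visitedSet{seen: make(map[string]struct{})}
}

// TryAdd records url and reports whether it was new. Check and insert share
// one write lock, so exactly one caller wins for any given URL.
func (v *visitedSet) TryAdd(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if _, ok := v.seen[url]; ok {
		return false
	}
	v.seen[url] = struct{}{}
	return true
}

// Len returns the number of distinct URLs seen; a read lock is sufficient.
func (v *visitedSet) Len() int {
	v.mu.RLock()
	defer v.mu.RUnlock()
	return len(v.seen)
}

func main() {
	v := newVisitedSet()
	var claimed int64
	var wg sync.WaitGroup

	// 100 goroutines race to claim the same URL; only one should succeed.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if v.TryAdd("https://example.com/") {
				atomic.AddInt64(&claimed, 1)
			}
		}()
	}
	wg.Wait()

	fmt.Println("claimed:", claimed, "unique URLs:", v.Len())
	// prints "claimed: 1 unique URLs: 1"
}
```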

Project layout

gocrawl/
├── cmd/
│   └── gocrawl/
│       └── main.go          # Entrypoint, flag parsing, signal handling
├── internal/
│   ├── crawler/
│   │   ├── crawler.go       # Worker pool, scheduling, dedup
│   │   ├── crawler_test.go
│   │   └── fetcher.go       # Fetcher interface + HTTP implementation
│   ├── parser/
│   │   ├── parser.go        # HTML link extraction
│   │   └── parser_test.go
│   └── output/
│       ├── output.go        # JSON report writer
│       └── output_test.go
└── go.mod

Running the tests

go test ./...
go test -race ./...      # race detector — important for concurrent code
go test -cover ./...

Tests use net/http/httptest to spin up local servers rather than hitting real sites. The Fetcher interface is mocked in unit tests where only parsing or scheduling behaviour is under test.

Limitations

This is deliberately a small project. Things it does not do, by design:

No JavaScript rendering (static HTML only)
No persistent state between runs — every run starts fresh
No distributed crawling
No content extraction beyond link discovery
No retry logic on transient failures (one attempt per URL)

Why this exists

Most of my production experience is in TypeScript. This project is a deliberate exercise to build Go reflexes — particularly around the concurrency primitives (goroutines, channels, context, sync) — before starting work on a larger Go project where correctness under concurrency actually matters.

The scope is set to force exposure to the interesting parts of Go without ballooning into a maintained tool. Once it works end-to-end and has decent test coverage, it's done.

License

MIT
