Crawl a website and export embedding-ready chunks for RAG pipelines.
By Danke Global
crawl2kb crawls a single domain, extracts structured content (JSON-LD entities, FAQ markup, breadcrumbs), splits it into overlapping chunks, and outputs everything as a JSON bundle or Markdown files — ready for embedding and vector search.
- Go 1.25+ (for building from source)
- Supported platforms: Linux (amd64, arm64), macOS (amd64, arm64), Windows (amd64)
- No external dependencies or runtime requirements — single static binary
go install github.com/danke-global/crawl2kb/cmd/crawl2kb@latest

Or download a prebuilt binary from Releases.
# Crawl and export JSON
crawl2kb https://example-clinic.com -o clinic.json
# Crawl and export Markdown (one file per page)
crawl2kb https://example-clinic.com -o clinic_output --format markdown
# Both formats
crawl2kb https://example-clinic.com -o clinic_output --format both

| Flag | Default | Description |
|---|---|---|
| `-o` | `crawl.json` | Output path (file for json, directory for markdown/both) |
| `--format` | `json` | Output format: json, markdown, both |
| `--max-depth` | `2` | Maximum crawl depth from start URL |
| `--max-pages` | `200` | Maximum pages to crawl |
| `--chunk-size` | `800` | Maximum runes per content chunk (hard limit) |
| `--chunk-overlap` | `64` | Overlapping runes between chunks |
| `--exclude` | | URL pattern to exclude (repeatable) |
| `--user-agent` | `crawl2kb/1.0` | Custom User-Agent |
| `--no-sitemap` | `false` | Disable sitemap.xml seeding |
| `--verbose` | `false` | Verbose logging to stderr |
| `--version` | | Print version and exit |
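
To make the `--chunk-size` and `--chunk-overlap` semantics concrete, here is a minimal Python sketch of a fixed-size sliding window measured in Unicode code points (the unit Go calls a rune). It is an illustration under that assumption only, not crawl2kb's actual splitter, which may also respect section boundaries. `chunk_text` is a hypothetical helper, and its defaults simply mirror the flag defaults in the table above.

```python
def chunk_text(text, chunk_size=800, overlap=64):
    """Split text into windows of at most chunk_size code points.

    Each window starts (chunk_size - overlap) code points after the
    previous one, so neighbouring windows share `overlap` code points.
    Python str indexing is per code point, matching Go's rune count.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    # Non-ASCII sample: length is counted in code points, not bytes.
    sample = "\u00e9" * 2000
    print([len(part) for part in chunk_text(sample)])
```

With the defaults, a 2000-rune input yields windows starting at offsets 0, 736 and 1472, so no window exceeds the 800-rune hard limit.
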
{
  "version": "1",
  "source_url": "https://example-clinic.com",
  "generated_at": "2026-03-21T14:30:00Z",
  "stats": {
    "pages_total": 47,
    "chunks_total": 189,
    "faqs_total": 12,
    "entities_total": 3
  },
  "pages": [
    {
      "url": "https://example-clinic.com/services/dentistry",
      "title": "Dentistry Services — Example Clinic",
      "description": "Professional dental care including cleaning, whitening, and implants.",
      "breadcrumbs": ["Home", "Services", "Dentistry"],
      "markdown": "## Dentistry Services\n\nWe offer a full range of dental services...",
      "faqs": [
        {
          "question": "How do I book an appointment?",
          "answer": "You can book online through our website or call +1 (555) 123-4567."
        },
        {
          "question": "Do you accept insurance?",
          "answer": "Yes, we accept most major dental insurance plans."
        }
      ],
      "structured_data": [
        {
          "type": "MedicalOrganization",
          "properties": {
            "name": "Example Clinic",
            "telephone": "+1 (555) 123-4567",
            "address": "123 Medical Ave, New York, NY 10001",
            "rating": "4.8",
            "hours": "Monday: 08:00-18:00; Tuesday: 08:00-18:00; Wednesday: 08:00-18:00"
          }
        }
      ],
      "chunks": [
        {
          "id": "a1b2c3d4e5f6",
          "index": 0,
          "type": "content",
          "section": "Page summary",
          "text": "## Page summary\nProfessional dental care including cleaning, whitening, and implants."
        },
        {
          "id": "b2c3d4e5f6a1",
          "index": 1,
          "type": "content",
          "section": "Dentistry Services",
          "text": "## Dentistry Services\nWe offer a full range of dental services including preventive care..."
        },
        {
          "id": "c3d4e5f6a1b2",
          "index": 2,
          "type": "faq",
          "section": "faq",
          "text": "## FAQ\nQ: How do I book an appointment?\nA: You can book online through our website or call +1 (555) 123-4567."
        },
        {
          "id": "d4e5f6a1b2c3",
          "index": 3,
          "type": "structured_data",
          "section": "structured_data",
          "text": "## MedicalOrganization\naddress: 123 Medical Ave, New York, NY 10001\nhours: Monday: 08:00-18:00; Tuesday: 08:00-18:00\nname: Example Clinic\nrating: 4.8\ntelephone: +1 (555) 123-4567"
        }
      ]
    }
  ]
}

clinic_output/
├── manifest.json
└── pages/
    ├── index.md
    ├── services-dentistry.md
    ├── services-cardiology.md
    └── about.md
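
For the JSON bundle shown above, a typical next step is to flatten every chunk into a record that an embedding job can consume. The sketch below is one possible shape, using only the field names visible in the sample bundle. `iter_chunk_records`, `load_bundle` and the `metadata` layout are hypothetical and not part of crawl2kb; whether chunk `id` values are unique across pages is not stated here, so the record also keeps the page URL.

```python
import json


def load_bundle(path):
    """Read a crawl2kb JSON bundle from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_chunk_records(bundle):
    """Yield one flat record per chunk, carrying page-level metadata."""
    for page in bundle.get("pages", []):
        for chunk in page.get("chunks", []):
            yield {
                "id": chunk["id"],
                "text": chunk["text"],
                "metadata": {
                    "url": page["url"],
                    "title": page.get("title", ""),
                    "breadcrumbs": page.get("breadcrumbs", []),
                    "type": chunk.get("type", ""),
                    "section": chunk.get("section", ""),
                    "index": chunk.get("index"),
                },
            }


if __name__ == "__main__":
    # Tiny inline bundle so the sketch runs without a crawl.
    demo = {
        "pages": [
            {
                "url": "https://example-clinic.com/services/dentistry",
                "title": "Dentistry Services",
                "breadcrumbs": ["Home", "Services", "Dentistry"],
                "chunks": [
                    {"id": "a1b2c3d4e5f6", "index": 0, "type": "content",
                     "section": "Page summary", "text": "## Page summary\n..."},
                ],
            }
        ]
    }
    for record in iter_chunk_records(demo):
        print(record["id"], record["metadata"]["url"])
```

To run it against a real crawl, replace the inline `demo` with `load_bundle("clinic.json")`. Keeping `type` in the metadata lets a retriever filter or weight `faq` and `structured_data` chunks separately from `content` chunks.
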
- Page content: cleaned HTML sections preserving heading hierarchy
- FAQ pairs: from JSON-LD `FAQPage`, microdata `schema.org/Question`, and `<details>`/`<summary>` accordions
- Structured data: JSON-LD entities (`LocalBusiness`, `Organization`, `Service`, `Product`, etc.) with flattened properties
- Breadcrumbs: from JSON-LD `BreadcrumbList` and HTML nav patterns
- Meta description: `og:description` with `<meta name="description">` fallback
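
As an illustration of the first FAQ source in the list above, the following Python sketch pulls question/answer pairs out of JSON-LD `FAQPage` markup using only the standard library. It follows the schema.org layout (`mainEntity` holding `Question` nodes with an `acceptedAnswer`), but it is an assumption-laden sketch rather than crawl2kb's extractor: `JSONLDCollector` and `extract_faqs` are hypothetical names, and `@graph` wrappers, microdata and `<details>` accordions are not handled.

```python
import json
from html.parser import HTMLParser


class JSONLDCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_faqs(html):
    """Return [{"question": ..., "answer": ...}] from JSON-LD FAQPage nodes."""
    collector = JSONLDCollector()
    collector.feed(html)
    collector.close()

    faqs = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD instead of failing the page
        nodes = data if isinstance(data, list) else [data]
        for node in nodes:
            if not isinstance(node, dict) or node.get("@type") != "FAQPage":
                continue
            questions = node.get("mainEntity", [])
            if isinstance(questions, dict):
                questions = [questions]  # a single Question may not be wrapped in a list
            for question in questions:
                if not isinstance(question, dict):
                    continue
                answer = question.get("acceptedAnswer") or {}
                if isinstance(answer, list):
                    answer = answer[0] if answer else {}
                faqs.append({
                    "question": str(question.get("name", "")).strip(),
                    "answer": str(answer.get("text", "")).strip(),
                })
    return faqs
```

The returned dictionaries use the same `question` and `answer` keys as the `faqs` array in the JSON bundle.
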
- SSRF protection: blocks requests to private IPs (RFC 1918), loopback, link-local, and cloud metadata endpoints
- Single-domain only: never follows links to external domains
- Respects robots.txt: Colly's built-in robots.txt compliance is enabled
- TLS 1.2+: enforces modern TLS with system CA validation
- Rate limiting: 4 concurrent requests, 200ms delay between requests
- No credentials in URLs: rejects `http://user:pass@host` patterns
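
The URL-level checks in the list above (credential rejection and blocking of non-public addresses) can be sketched in Python with the standard `ipaddress` and `urllib.parse` modules. This is a simplified illustration, not crawl2kb's implementation: `check_url` and the `METADATA_HOSTS` set are hypothetical, and only literal IP hosts are inspected. A real crawler would also need to validate the addresses a hostname resolves to at connection time.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical examples of well-known cloud metadata hosts.
METADATA_HOSTS = {"169.254.169.254", "metadata.google.internal"}


def check_url(url: str) -> Optional[str]:
    """Return a rejection reason, or None if the URL passes these checks."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return "unsupported scheme"
    if parts.username is not None or parts.password is not None:
        return "credentials in URL"

    host = parts.hostname
    if not host:
        return "missing host"
    if host in METADATA_HOSTS:
        return "cloud metadata endpoint"

    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Assumption: resolved addresses are checked
        # elsewhere, which this sketch does not do.
        return None
    if ip.is_private or ip.is_loopback or ip.is_link_local:
        return "non-public address"
    return None


if __name__ == "__main__":
    for candidate in (
        "https://example-clinic.com/services",
        "http://user:pass@host/",
        "http://10.0.0.5/admin",
        "http://[::1]/",
    ):
        print(candidate, "->", check_url(candidate))
```

Returning a reason string instead of a boolean makes it easy to log why a URL was skipped when running with `--verbose`.
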
- Document parsing (PDF, DOCX)
- Vector DB exporter presets (Qdrant, Pinecone, Chroma)
- Richer debug/diagnostic metadata
- Public Go package API
MIT