danke-global/crawl2kb

crawl2kb

Crawl a website and export embedding-ready chunks for RAG pipelines.

By Danke Global

crawl2kb crawls a single domain, extracts structured content (JSON-LD entities, FAQ markup, breadcrumbs), splits it into overlapping chunks, and outputs everything as a JSON bundle or Markdown files — ready for embedding and vector search.

Requirements

  • Go 1.25+ (for building from source)
  • Supported platforms: Linux (amd64, arm64), macOS (amd64, arm64), Windows (amd64)
  • No external dependencies or runtime requirements — single static binary

Install

go install github.com/danke-global/crawl2kb/cmd/crawl2kb@latest

Or download a prebuilt binary from Releases.

Quick Start

# Crawl and export JSON
crawl2kb https://example-clinic.com -o clinic.json

# Crawl and export Markdown (one file per page)
crawl2kb https://example-clinic.com -o clinic_output --format markdown

# Both formats
crawl2kb https://example-clinic.com -o clinic_output --format both

Flags

Flag             Default        Description
-o               crawl.json     Output path (file for json, directory for markdown/both)
--format         json           Output format: json, markdown, both
--max-depth      2              Maximum crawl depth from start URL
--max-pages      200            Maximum pages to crawl
--chunk-size     800            Maximum runes per content chunk (hard limit)
--chunk-overlap  64             Overlapping runes between chunks
--exclude                       URL pattern to exclude (repeatable)
--user-agent     crawl2kb/1.0   Custom User-Agent
--no-sitemap     false          Disable sitemap.xml seeding
--verbose        false          Verbose logging to stderr
--version                       Print version and exit
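
To make the interaction between --chunk-size and --chunk-overlap concrete, here is a minimal sketch of a rune-based sliding window. It assumes a plain fixed-step split; crawl2kb's own splitter is not shown in this README and likely also respects section boundaries, so treat `chunkRunes` as a hypothetical helper rather than the tool's implementation.

```go
package main

import "fmt"

// chunkRunes splits text into windows of at most size runes. Each window
// starts size-overlap runes after the previous one, so consecutive chunks
// share overlap runes. Hypothetical helper, not crawl2kb's actual splitter.
func chunkRunes(text string, size, overlap int) []string {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil // invalid parameters; a real tool would report an error
	}
	runes := []rune(text) // count runes, not bytes, so multi-byte text is safe
	step := size - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the last window reached the end of the text
		}
	}
	return chunks
}

func main() {
	for i, c := range chunkRunes("abcdefghij", 4, 1) {
		fmt.Printf("%d: %q\n", i, c)
	}
}
```

With the defaults from the table (800 and 64), each chunk under this scheme would begin 736 runes after the previous one.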

Sample Output

{
  "version": "1",
  "source_url": "https://example-clinic.com",
  "generated_at": "2026-03-21T14:30:00Z",
  "stats": {
    "pages_total": 47,
    "chunks_total": 189,
    "faqs_total": 12,
    "entities_total": 3
  },
  "pages": [
    {
      "url": "https://example-clinic.com/services/dentistry",
      "title": "Dentistry Services — Example Clinic",
      "description": "Professional dental care including cleaning, whitening, and implants.",
      "breadcrumbs": ["Home", "Services", "Dentistry"],
      "markdown": "## Dentistry Services\n\nWe offer a full range of dental services...",
      "faqs": [
        {
          "question": "How do I book an appointment?",
          "answer": "You can book online through our website or call +1 (555) 123-4567."
        },
        {
          "question": "Do you accept insurance?",
          "answer": "Yes, we accept most major dental insurance plans."
        }
      ],
      "structured_data": [
        {
          "type": "MedicalOrganization",
          "properties": {
            "name": "Example Clinic",
            "telephone": "+1 (555) 123-4567",
            "address": "123 Medical Ave, New York, NY 10001",
            "rating": "4.8",
            "hours": "Monday: 08:00-18:00; Tuesday: 08:00-18:00; Wednesday: 08:00-18:00"
          }
        }
      ],
      "chunks": [
        {
          "id": "a1b2c3d4e5f6",
          "index": 0,
          "type": "content",
          "section": "Page summary",
          "text": "## Page summary\nProfessional dental care including cleaning, whitening, and implants."
        },
        {
          "id": "b2c3d4e5f6a1",
          "index": 1,
          "type": "content",
          "section": "Dentistry Services",
          "text": "## Dentistry Services\nWe offer a full range of dental services including preventive care..."
        },
        {
          "id": "c3d4e5f6a1b2",
          "index": 2,
          "type": "faq",
          "section": "faq",
          "text": "## FAQ\nQ: How do I book an appointment?\nA: You can book online through our website or call +1 (555) 123-4567."
        },
        {
          "id": "d4e5f6a1b2c3",
          "index": 3,
          "type": "structured_data",
          "section": "structured_data",
          "text": "## MedicalOrganization\naddress: 123 Medical Ave, New York, NY 10001\nhours: Monday: 08:00-18:00; Tuesday: 08:00-18:00\nname: Example Clinic\nrating: 4.8\ntelephone: +1 (555) 123-4567"
        }
      ]
    }
  ]
}
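
A downstream embedding step might load the bundle along these lines. This is a sketch: the JSON field names come from the sample above, while the Go type names, `parseBundle`, and `chunksOfType` are illustrative and not part of any published crawl2kb package (a public Go API is still on the roadmap).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Chunk, Page and Bundle mirror a subset of the sample output.
// The JSON keys are from the README; the Go names are illustrative.
type Chunk struct {
	ID      string `json:"id"`
	Index   int    `json:"index"`
	Type    string `json:"type"`
	Section string `json:"section"`
	Text    string `json:"text"`
}

type Page struct {
	URL    string  `json:"url"`
	Title  string  `json:"title"`
	Chunks []Chunk `json:"chunks"`
}

type Bundle struct {
	Version   string `json:"version"`
	SourceURL string `json:"source_url"`
	Pages     []Page `json:"pages"`
}

// parseBundle decodes a JSON bundle; fields not modeled above are ignored.
func parseBundle(data string) (Bundle, error) {
	var b Bundle
	err := json.Unmarshal([]byte(data), &b)
	return b, err
}

// chunksOfType collects every chunk whose "type" equals kind,
// for example "content", "faq" or "structured_data".
func chunksOfType(b Bundle, kind string) []Chunk {
	var out []Chunk
	for _, p := range b.Pages {
		for _, c := range p.Chunks {
			if c.Type == kind {
				out = append(out, c)
			}
		}
	}
	return out
}

// sample is a trimmed copy of the output shown above.
const sample = `{
  "version": "1",
  "source_url": "https://example-clinic.com",
  "pages": [{
    "url": "https://example-clinic.com/services/dentistry",
    "title": "Dentistry Services",
    "chunks": [
      {"id": "a1b2c3d4e5f6", "index": 0, "type": "content", "section": "Page summary", "text": "## Page summary\nProfessional dental care."},
      {"id": "c3d4e5f6a1b2", "index": 2, "type": "faq", "section": "faq", "text": "## FAQ\nQ: Do you accept insurance?\nA: Yes."}
    ]
  }]
}`

func main() {
	b, err := parseBundle(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, c := range chunksOfType(b, "faq") {
		fmt.Printf("%s [%s] %q\n", c.ID, c.Section, c.Text)
	}
}
```

Filtering by chunk type makes it possible to embed FAQ and structured-data chunks with different metadata than regular content chunks.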

Markdown Output

clinic_output/
├── manifest.json
└── pages/
    ├── index.md
    ├── services-dentistry.md
    ├── services-cardiology.md
    └── about.md
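
The file names in this listing suggest that each page's URL path is flattened into a single hyphen-joined name, with the site root becoming index.md. The sketch below reproduces that pattern. The rule is inferred from the listing only, so `pageFileName` is a hypothetical helper and the real naming logic may handle collisions, query strings, or unusual characters differently.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// pageFileName maps a page URL to a flat Markdown file name by joining the
// path segments with hyphens. Assumed naming scheme, inferred from the
// directory listing; not confirmed against crawl2kb's source.
func pageFileName(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	// FieldsFunc drops empty segments, so leading and trailing slashes vanish.
	segments := strings.FieldsFunc(u.Path, func(r rune) bool { return r == '/' })
	if len(segments) == 0 {
		return "index.md", nil // site root
	}
	return strings.Join(segments, "-") + ".md", nil
}

func main() {
	for _, raw := range []string{
		"https://example-clinic.com",
		"https://example-clinic.com/services/dentistry",
		"https://example-clinic.com/about/",
	} {
		name, err := pageFileName(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(raw, "->", name)
	}
}
```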

What It Extracts

  • Page content — cleaned HTML sections preserving heading hierarchy
  • FAQ pairs — from JSON-LD FAQPage, microdata schema.org/Question, and <details>/<summary> accordions
  • Structured data — JSON-LD entities (LocalBusiness, Organization, Service, Product, etc.) with flattened properties
  • Breadcrumbs — from JSON-LD BreadcrumbList and HTML nav patterns
  • Meta description: og:description with <meta name="description"> fallback
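
As an illustration of the FAQ extraction listed above, the following sketch pulls question/answer pairs out of a JSON-LD FAQPage block using the standard schema.org shape (`mainEntity`, `name`, `acceptedAnswer.text`). It is a simplified assumption of how such extraction can work, not crawl2kb's code: `extractFAQs` is a hypothetical name, and real pages may wrap entities in `@graph` or use arrays for `@type`, which this version does not handle.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FAQ matches the question/answer pairs shown in the sample output.
type FAQ struct {
	Question string `json:"question"`
	Answer   string `json:"answer"`
}

// faqPage models only the parts of a schema.org FAQPage this sketch reads.
type faqPage struct {
	Type       string `json:"@type"`
	MainEntity []struct {
		Name           string `json:"name"`
		AcceptedAnswer struct {
			Text string `json:"text"`
		} `json:"acceptedAnswer"`
	} `json:"mainEntity"`
}

// extractFAQs returns the question/answer pairs of a single JSON-LD object.
// Objects that are not an FAQPage yield no pairs; incomplete pairs are skipped.
func extractFAQs(jsonLD string) ([]FAQ, error) {
	var page faqPage
	if err := json.Unmarshal([]byte(jsonLD), &page); err != nil {
		return nil, err
	}
	if page.Type != "FAQPage" {
		return nil, nil
	}
	var faqs []FAQ
	for _, q := range page.MainEntity {
		if q.Name == "" || q.AcceptedAnswer.Text == "" {
			continue
		}
		faqs = append(faqs, FAQ{Question: q.Name, Answer: q.AcceptedAnswer.Text})
	}
	return faqs, nil
}

// exampleJSONLD is the kind of payload found in a
// <script type="application/ld+json"> element.
const exampleJSONLD = `{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {"@type": "Question", "name": "Do you accept insurance?",
     "acceptedAnswer": {"@type": "Answer", "text": "Yes, we accept most major dental insurance plans."}},
    {"@type": "Question", "name": "Question without an answer"}
  ]
}`

func main() {
	faqs, err := extractFAQs(exampleJSONLD)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, f := range faqs {
		fmt.Printf("Q: %s\nA: %s\n", f.Question, f.Answer)
	}
}
```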

Safety

  • SSRF protection: blocks requests to private IPs (RFC1918), loopback, link-local, cloud metadata endpoints
  • Single-domain only: never follows links to external domains
  • Respects robots.txt: Colly's built-in robots.txt compliance is enabled
  • TLS 1.2+: enforces modern TLS with system CA validation
  • Rate limiting: 4 concurrent requests, 200ms delay between requests
  • No credentials in URLs: rejects http://user:pass@host patterns

Roadmap

  • Document parsing (PDF, DOCX)
  • Vector DB exporter presets (Qdrant, Pinecone, Chroma)
  • Richer debug/diagnostic metadata
  • Public Go package API

License

MIT
