A small, zero-runtime-dependency Node CLI that crawls a website and emits a valid
sitemap.xml, for sites that ship without one or whose CMS pipeline forgot to generate it.
Created and invented by Matteo Perino (LinkedIn). Maintained by GeoSuite (Matteo Perino).
A sitemap.xml is the foundational discovery surface for both classical
search and the new generation of LLM-mediated search (ChatGPT Search, Perplexity, Gemini, Le Chat, DuckAssist). Without one, crawlers fall back to following links from the homepage — which is unreliable for big sites and silently misses anything not internally linked from the front page.
Most CMS templates ship with a sitemap. Most custom sites don't. This tool
exists for the second case: point it at a URL, get back a sitemap that is
ready to publish at <your-domain>/sitemap.xml.
We deliberately did not build a "next-gen", LLM-powered, schema-aware crawler. It crawls. It writes XML. The whole tool is ~250 lines of vanilla Node with no third-party runtime dependencies.
npm install -g @geosuite/sitemap-builder
# or run without installing:
npx @geosuite/sitemap-builder https://example.com
Requires Node 20+.
# print sitemap to stdout
geosuite-sitemap-builder https://example.com
# write to a file
geosuite-sitemap-builder https://example.com --output sitemap.xml
# bound the crawl
geosuite-sitemap-builder https://example.com \
--max-pages 300 \
--max-depth 3 \
--concurrency 8 \
--budget-s 90
# dump the page list as JSON instead of XML (handy for piping)
geosuite-sitemap-builder https://example.com --json
| Flag | Default | Notes |
|---|---|---|
| --max-pages | 200 | Hard cap 1000. Crawler stops once reached. |
| --max-depth | 3 | Hard cap 6. BFS depth from the start URL. |
| --concurrency | 6 | Parallel HTTP fetches. Hard cap 16. Respect the host. |
| --timeout-ms | 8000 | Per-page request timeout. |
| --budget-s | 60 | Wall-clock cap. Crawl stops when reached and reports hitDeadline. |
| --output PATH | (none) | Write XML to a file. Without this, XML goes to stdout. |
| --json | off | Print the page list as JSON instead of XML. |
| --user-agent | geosuite-sitemap-builder/0.1.0 | Override the UA header. |
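To make the defaults and hard caps in the table concrete, here is a minimal sketch of how requested values could be resolved. The helper name `clampOptions` and the `LIMITS` object are hypothetical, and whether the real CLI clamps an oversized value or rejects it is an assumption; only the numbers come from the table.

```javascript
// Hypothetical helper: defaults and hard caps copied from the flag table.
// The real CLI may reject out-of-range values instead of clamping them.
const LIMITS = {
  maxPages: { fallback: 200, cap: 1000 },
  maxDepth: { fallback: 3, cap: 6 },
  concurrency: { fallback: 6, cap: 16 },
};

function clampOptions(requested = {}) {
  const resolved = {};
  for (const [name, { fallback, cap }] of Object.entries(LIMITS)) {
    const value = Number(requested[name]);
    // Missing, non-integer, or non-positive input falls back to the default.
    const usable = Number.isInteger(value) && value >= 1 ? value : fallback;
    // Anything above the hard cap is pulled down to the cap.
    resolved[name] = Math.min(usable, cap);
  }
  return resolved;
}

console.log(clampOptions({ maxPages: 5000, concurrency: 8 }));
// → { maxPages: 1000, maxDepth: 3, concurrency: 8 }
```

Keeping the caps in one table-like object means the documentation and the code can be compared line by line.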
- Starts at the URL you pass.
- BFS-crawls same-origin <a href> links only (never wanders off the host).
- Drops obvious non-HTML extensions (.png, .css, .pdf, …) so the sitemap doesn't get polluted with assets.
- Skips fragment-only links (#section), mailto:, tel:, and javascript:.
- Stops at any of three caps (whichever fires first):
  - page count (--max-pages)
  - BFS depth (--max-depth)
  - wall-clock budget (--budget-s)
- Renders the discovered URLs as a sitemaps.org-compliant <urlset>.
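The link-filtering rules in the list above can be sketched as a single pure function. This is an illustration under stated assumptions, not the package's actual code: `extractLinks`, the regex, and the asset-extension list are all hypothetical, and a regex this simple will miss unusual markup.

```javascript
// Illustrative only: the extension list and regex are assumptions,
// not the ones the tool ships with.
const ASSET_EXT = /\.(png|jpe?g|gif|svg|css|js|pdf|zip|ico|woff2?)$/i;
const SKIPPED_SCHEMES = /^(mailto:|tel:|javascript:)/i;

function extractLinks(html, pageUrl) {
  const origin = new URL(pageUrl).origin;
  const found = new Set();
  // Naive match on <a ... href="...">; quoted attribute values only.
  const hrefPattern = /<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/gi;
  for (const match of html.matchAll(hrefPattern)) {
    const href = match[1].trim();
    // Skip fragment-only links and non-HTTP schemes.
    if (href.startsWith('#') || SKIPPED_SCHEMES.test(href)) continue;
    let url;
    try {
      url = new URL(href, pageUrl); // resolves relative hrefs
    } catch {
      continue; // unparseable href
    }
    if (url.origin !== origin) continue; // never leave the host
    if (ASSET_EXT.test(url.pathname)) continue; // drop obvious assets
    url.hash = ''; // same page regardless of fragment
    found.add(url.href);
  }
  return [...found];
}
```

Because the function only sees the HTML string returned by the server, links injected later by client-side JavaScript never reach it.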
The output is intentionally minimal: <loc> plus an optional <lastmod>. We
skip <changefreq> and <priority>: the sitemaps.org protocol makes both optional, and the major
search engines have ignored them for years.
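As a rough picture of what "minimal" means here, the sketch below renders a <urlset> with only <loc> and an optional <lastmod>. The names `renderUrlset` and `escapeXml` are hypothetical and this is not the package's `renderSitemapXml`; the namespace URI is the one defined by the sitemaps.org protocol.

```javascript
// Escape the five characters that are special in XML text and attributes.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical renderer: entries are { url, lastmod? } objects.
function renderUrlset(entries) {
  const body = entries.map(({ url, lastmod }) => {
    const lines = [`    <loc>${escapeXml(url)}</loc>`];
    // <lastmod> is emitted only when a value is supplied.
    if (lastmod) lines.push(`    <lastmod>${escapeXml(lastmod)}</lastmod>`);
    return `  <url>\n${lines.join('\n')}\n  </url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...body,
    '</urlset>',
  ].join('\n') + '\n';
}

console.log(renderUrlset([
  { url: 'https://example.com/' },
  { url: 'https://example.com/search?q=a&page=2', lastmod: '2026-01-15' },
]));
```

Escaping matters even in a minimal sitemap: a raw `&` in a query string makes the XML invalid.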
What it does not do:
- JavaScript rendering. The crawler is HTTP + regex. Single-page apps whose links only appear after client-side hydration won't be discovered. Build-time pre-rendering or an SSR layer is the right fix.
- Robots.txt awareness. By default the tool runs against the site owner's own domain, and honoring robots.txt would silently strip the pages they want to publish. (Adding an opt-in --respect-robots flag is on the roadmap.)
- <lastmod> accuracy. Today we don't fill <lastmod> from Last-Modified response headers. Coming in 0.2.
- LLM-powered grouping or summaries. The deterministic 0.1 ships without a network dependency on any model. An opt-in --ai mode is on the roadmap (provide OPENAI_API_KEY or ANTHROPIC_API_KEY to enable).
import { crawlSite, renderSitemapXml } from '@geosuite/sitemap-builder';
const { pages, hitCap, hitDeadline } = await crawlSite('https://example.com', {
maxPages: 100,
maxDepth: 2,
concurrency: 6,
perPageTimeoutMs: 8000,
deadlineMs: Date.now() + 30_000,
});
const xml = renderSitemapXml(pages.map((p) => ({ url: p.url })));
Both functions are pure (modulo the obvious network I/O for crawlSite)
and have no third-party runtime dependencies.
npm test # node --test
npm run lint # node --check on source files
Tests are pure-function: no network, no fixtures bigger than inline strings.
See CONTRIBUTING.md. Issues and PRs welcome — please open an issue first for non-trivial changes so we can discuss scope.
When --ai is combined with --json, the CLI can ask an LLM to group the
discovered pages into open-vocabulary categories ("Blog", "Products",
"Docs", or whatever the site actually publishes; there is no closed taxonomy):
export OPENAI_API_KEY=sk-… # or ANTHROPIC_API_KEY=sk-ant-…
geosuite-sitemap-builder https://example.com --json --ai
The output JSON is the same shape as without --ai, plus a categories
field:
{
"pages": [...],
"hitCap": false,
"hitDeadline": false,
"categories": {
"Marketing": ["https://example.com/", "https://example.com/pricing"],
"Blog": ["https://example.com/blog/post-one", ...],
"Docs": ["https://example.com/docs/intro", ...]
}
}
We send only {url, title, depth} per page, never the body. A typical
200-page run stays well under a cent on small models (gpt-5-mini,
claude-haiku-4-5). --ai has no effect on the XML output (it is ignored
unless --json is also passed).
Privacy: enabling --ai sends each page's URL, title, and depth to the corresponding API. Don't
turn it on against URLs you wouldn't paste into their UI.
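For scripts that consume the --json --ai output, it is often easier to look up a category by URL than the other way round. The sketch below inverts the categories field; `indexByCategory` is a hypothetical helper, and it assumes (without confirmation from the tool) that a page may be missing from every category, so it reports those separately.

```javascript
// Hypothetical consumer of the JSON shape shown above:
// { pages: [{ url, ... }], hitCap, hitDeadline, categories: { name: [url, ...] } }
function indexByCategory(report) {
  const byUrl = new Map();
  // categories may be absent when the crawl ran without --ai.
  for (const [category, urls] of Object.entries(report.categories ?? {})) {
    for (const url of urls) byUrl.set(url, category);
  }
  // Pages that no category lists are collected separately.
  const uncategorized = report.pages
    .map((page) => page.url)
    .filter((url) => !byUrl.has(url));
  return { byUrl, uncategorized };
}
```

A typical use would be to parse the CLI's stdout with `JSON.parse`, pass the result to this function, and flag any uncategorized URLs for manual review.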
sitemap-builder is part of a small family of zero-dependency CLIs we
maintain to make Generative Engine Optimization (GEO) measurable from
the terminal:
- @geosuite/ai-crawler-bots: curated AI bot user-agent list with a CLI that tells you whether GPTBot, ClaudeBot, PerplexityBot and friends can read your site and where the block came from.
- @geosuite/schema-templates: copy-paste-ready schema.org JSON-LD templates with a local validator.
- @geosuite/llms-txt-generator: turn a sitemap.xml into the llms.txt standard from llmstxt.org.
The same checks are also surfaced as a hosted product at trygeosuite.it for teams who want history, alerts, and CTAs wired into their content pipeline.
Created and invented by Matteo Perino — LinkedIn · matte97.p@gmail.com.
Ideated, designed and validated by Matteo Perino. Implementation written with AI assistance, maintained under GeoSuite.
MIT © 2026 Matteo Perino and GeoSuite
- ai-crawler-bots — which AI crawlers can read your site (robots.txt audit + CI gate)
- llms-txt-generator — sitemap.xml → llms.txt (the llmstxt.org standard)
- schema-templates — validated, copy-paste schema.org JSON-LD
- sitemap-builder — crawl a site, emit a valid sitemap.xml
Also from the same author: rlsgrid · pentest-framework · demowright
⭐ If sitemap-builder is useful, give it a star — it helps other people find the toolkit.