northbridge-systems/agent-reachability-probe

agent-reachability-probe

A standalone Python probe that simulates the major LLM crawlers and reports whether they can actually reach a given URL.

For each crawler, the probe records:

  • HTTP status code
  • Content-Length and Content-Type
  • whether robots.txt allows that crawler on the path
  • a simple JS-dependency heuristic (response size relative to a browser baseline)

No third-party dependencies. Python 3.9 or newer, standard library only.
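As an illustration of how a single measurement can be taken with the standard library alone, here is a minimal sketch. It is not the code in probe.py; the function names `fetch_as` and `summarise` and the dictionary keys are assumptions made for this example.

```python
import urllib.error
import urllib.request


def summarise(status, headers, body):
    """Reduce one response to the fields listed above (hypothetical keys)."""
    return {
        "status": status,
        "bytes": len(body),  # bytes actually received
        "content_length": headers.get("Content-Length"),  # may be None
        "content_type": headers.get("Content-Type"),
    }


def fetch_as(url, user_agent, timeout=10.0):
    """Request url while presenting the given User-Agent string."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return summarise(response.status, response.headers, response.read())
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses still carry a status, headers and a body,
        # and those are exactly the cases the probe wants to report.
        return summarise(err.code, err.headers, err.read())
```

Calling `fetch_as` once per crawler User-Agent and once with a browser User-Agent gives the rows that the report compares. Network failures such as timeouts or DNS errors are left unhandled in this sketch.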

Why this exists

Generative engines retrieve content through dedicated crawlers with distinct User-Agent strings. A page that renders perfectly in Chrome may be invisible to GPTBot if it requires JavaScript, returns a different status code for the bot, or is disallowed in robots.txt without the operator's knowledge.

agent-reachability-probe makes those failure modes explicit before the engines route around them silently.

Crawlers covered

GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, Bytespider — plus a Chrome browser baseline for comparison.

User-Agent strings reflect the documented public versions as of early 2026. When operators change their strings, update CRAWLERS in probe.py.

Install

git clone https://github.com/northbridge-systems/agent-reachability-probe.git
cd agent-reachability-probe

That is the entire install. No pip install, no virtual environment required.

Use

python3 probe.py https://example.com/

Output is a text table by default:

URL: https://example.com/

Crawler               Status     Bytes   robots.txt   JS-dep?  Note
-------------------------------------------------------------------
Browser (Chrome)         200    1 256          n/a        no
GPTBot                   200    1 256      allowed        no
ChatGPT-User             200    1 256      allowed        no
OAI-SearchBot            200    1 256      allowed        no
ClaudeBot                200    1 256      allowed        no
PerplexityBot            200    1 256      allowed        no
Google-Extended          200    1 256      allowed        no
CCBot                    200    1 256      allowed        no
Bytespider               200    1 256      allowed        no

Machine-readable JSON output for pipelines:

python3 probe.py https://example.com/ --json

Custom timeout per request:

python3 probe.py https://example.com/ --timeout 30
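For readers who want to wrap or extend the command line, the invocations above could be parsed roughly as follows. This is a sketch using `argparse`, not the parser in probe.py; the function name `build_parser`, the help texts, and the default timeout of 10 seconds are assumptions.

```python
import argparse


def build_parser():
    """Command-line interface matching the three invocations shown above."""
    parser = argparse.ArgumentParser(prog="probe.py")
    parser.add_argument("url", help="URL to probe")
    parser.add_argument(
        "--json",
        action="store_true",
        help="emit JSON instead of the text table",
    )
    parser.add_argument(
        "--timeout",
        type=float,
        default=10.0,  # assumed default; the real value is not documented here
        help="timeout per request, in seconds",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.url, args.json, args.timeout)
```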

How to read the output

Status — anything other than 200 for a specific crawler while the browser baseline returns 200 indicates User-Agent-based gating.

Bytes — values significantly smaller than the browser baseline often indicate a client-side rendered page. The JS-dep? column flags responses smaller than half the baseline size.
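The Status and Bytes rules can be written down in a few lines. The sketch below assumes the two rules exactly as stated above (non-200 against a 200 baseline, and fewer than half the baseline bytes); the function name `classify` and the returned keys are hypothetical and need not match probe.py.

```python
def classify(baseline_status, baseline_bytes, crawler_status, crawler_bytes):
    """Apply the two reading rules to one crawler row."""
    # Gating: the browser gets 200 but this crawler does not.
    ua_gated = baseline_status == 200 and crawler_status != 200
    # JS-dependency heuristic: response smaller than half the baseline.
    js_dep = "likely" if crawler_bytes < baseline_bytes / 2 else "no"
    return {"ua_gated": ua_gated, "js_dep": js_dep}
```

With the sample table above (baseline 200 and 1 256 bytes), a crawler row of 200 and 1 256 bytes yields no flags, while a row of 403 and 300 bytes would be flagged on both counts.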

robots.txt DISALLOWED means the crawler is excluded by your own robots.txt. That is either intentional or the most expensive single configuration error in generative-engine visibility.
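A robots.txt verdict of this kind can be reproduced offline with `urllib.robotparser` from the standard library. The sketch below is one possible way to do it and is not taken from probe.py; the function name `robots_verdict` and the sample rules are assumptions.

```python
from urllib import robotparser


def robots_verdict(robots_txt, crawler_token, url):
    """Return 'allowed' or 'DISALLOWED' for one crawler and one URL."""
    parser = robotparser.RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return "allowed" if parser.can_fetch(crawler_token, url) else "DISALLOWED"


# Hypothetical robots.txt that blocks one crawler and allows everyone else.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    for token in ("GPTBot", "ClaudeBot"):
        print(token, robots_verdict(SAMPLE, token, "https://example.com/page"))
```

Note that `urllib.robotparser` applies its own matching rules, which can differ in edge cases from how an individual crawler interprets the same file.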

JS-dep? likely is a heuristic, not a verdict. Some pages legitimately serve a smaller HTML to crawlers; others fail to render their main content at all. Confirm by viewing the response directly.
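One way to confirm a flagged response by hand is to look at how much readable text the raw HTML contains once scripts and styles are ignored. The helper below is a rough sketch for that purpose and is not part of the probe; the names `VisibleText` and `visible_text` are hypothetical.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the text a non-rendering client would see in the HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

An application shell such as `<div id="root"></div>` followed by a script tag produces an empty string here, which supports the JS-dependency reading; a server-rendered page returns its actual copy.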

Limitations

  • Heuristic only. A response that passes this probe is not guaranteed to be parsed correctly by the downstream model.
  • User-Agent strings of LLM crawlers change. Treat CRAWLERS in probe.py as a maintained list, not a constant.
  • The JS-dependency check is size-based and does not execute JavaScript. For precise render-state checks, use a headless browser probe alongside this one.

Citation

If you use this software in academic work, please cite it via the Cite this repository button in the right sidebar of this page, or directly via CITATION.cff.

Part of the Compliance-GEO methodology

This tool is part of the Compliance-GEO methodology maintained by Northbridge Systems for citation visibility in regulated consumer markets (telecommunications, financial services, insurance, commerce).

Industrial property rights

Filed at the German Patent and Trademark Office (DPMA) on 27 April 2026:

  • Utility model DE 20 2026 001 867.4 — Computer system for evaluating digital publication placements in generative AI systems
  • Patent application DE 10 2026 002 308.4 — Deterministic, reproducible retrieval evaluation across multiple independent generative AI systems with signature-secured override discipline

Author

Tim Heidfeld · CEFA · Diplom-Investmentanalyst
Founder & Principal, Northbridge Systems GmbH · Kiefersfelden, Bavaria, Germany

License

MIT — see LICENSE.
