██████╗ ██████╗ ██████╗ ███████╗ ██████╗ ███████╗██╗ ██╗███████╗██╗ ██╗████████╗
██╔════╝██╔═══██╗██╔══██╗██╔════╝ ██╔══██╗██╔════╝██║ ██║██╔════╝██║ ██║╚══██╔══╝
██║ ██║ ██║██║ ██║█████╗ ██████╔╝█████╗ ██║ ██║███████╗███████║ ██║
██║ ██║ ██║██║ ██║██╔══╝ ██╔══██╗██╔══╝ ╚██╗ ██╔╝╚════██║██╔══██║ ██║
╚██████╗╚██████╔╝██████╔╝███████╗ ██║ ██║███████╗ ╚████╔╝ ███████║██║ ██║ ██║
╚═════╝ ╚═════╝ ╚═════╝ ╚══════╝ ╚═╝ ╚═╝╚══════╝ ╚═══╝ ╚══════╝╚═╝ ╚═╝ ╚═╝
CrawlGuard - A Lightweight Reverse Proxy Engine for Intelligent AI Crawler Protection
CrawlGuard is a lightweight, zero-dependency reverse proxy engine for intelligent AI crawler protection, written in pure Python. It sits in front of your web server as a smart shield, accurately identifying and blocking AI crawlers and malicious bot traffic while letting legitimate users through seamlessly.
As AI technology advances at breakneck speed, an increasing number of AI crawlers (such as GPTBot, ClaudeBot, Bytespider, and others) are scraping website content without authorization, leading to:
- Wasted bandwidth -- AI crawlers often generate far more requests than traditional search engine bots
- Uncompensated content use -- Your articles and data are used to train AI models without your knowledge or consent
- Server overload -- Massive automated requests degrade the experience for real users
- SEO interference -- Excessive low-quality crawling can negatively impact how search engines evaluate your site
| Feature | CrawlGuard | Cloudflare Bot Management | robots.txt |
|---|---|---|---|
| Deployment | Self-hosted, full control | Third-party dependency | Honor-based only |
| Dependencies | Zero, pure standard library | Account + payment required | None |
| Protection | Multi-layer detection + JS challenge + honeypot | Cloud-based rules | Declarative only |
| Data privacy | Data stays on your server | Routed through third party | N/A |
| Customization | Fully customizable | Limited | None |
| Cost | Completely free | Expensive paid plans | Free |
CrawlGuard was born out of a real-world problem we encountered in day-to-day operations: the rampant proliferation of AI crawlers. We noticed that the traditional robots.txt protocol is effectively useless against AI crawlers -- many AI companies' bots simply ignore Disallow rules. Existing commercial bot management solutions are either prohibitively expensive or require routing traffic through a third party, raising serious data privacy concerns. We set out to build an open-source, lightweight, self-hosted AI crawler protection tool that empowers every website owner to safeguard their content with ease.
A built-in User-Agent signature database covering major AI crawlers, including but not limited to:
- OpenAI: GPTBot, ChatGPT-User, OAI-SearchBot
- Anthropic: ClaudeBot, Claude-Web, Claude-SearchBot
- Google AI: Google-Extended, GoogleOther
- ByteDance: Bytespider
- Meta: FacebookBot, Meta-ExternalAgent
- Apple AI: Applebot-Extended
- Common Crawl: CCBot
- Perplexity AI: PerplexityBot, Perplexity-User
- Cohere AI: cohere-ai, CohereBot
- Allen AI: ai2bot, AllenAI
- Amazon: Amazonbot, Amazonbot-Extended
- Generic crawlers/tools: curl, wget, Scrapy, python-requests, and more
- SEO crawlers: AhrefsBot, SemrushBot, MJ12bot, and more
- UA blacklist/whitelist -- Regex-based User-Agent matching with whitelist priority
- IP reputation scoring -- Dynamically tracks per-IP behavior, accumulates penalty scores, and auto-blocks when thresholds are exceeded
- Token bucket rate limiting -- Per-minute and per-hour request rate limits to prevent burst scraping
- Behavioral analysis -- Detects path scanning, sequential enumeration, and other anomalous access patterns
- Honeypot detection -- Injects invisible links into pages; only bots trigger them, providing precise identification of automated tools
- Suspicious header detection -- Optionally block requests missing Referer or Accept headers
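The token bucket limiter described above can be sketched in a few lines of Python (an illustrative model only, not CrawlGuard's actual implementation; the class and parameter names are made up for the example):

```python
import time

class TokenBucket:
    """Illustrative token bucket: capacity caps bursts, rate refills steadily."""

    def __init__(self, rate_per_minute=60, burst_size=10):
        self.rate = rate_per_minute / 60.0   # tokens added per second
        self.capacity = burst_size           # maximum burst
        self.tokens = float(burst_size)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller would respond with HTTP 429

bucket = TokenBucket(rate_per_minute=60, burst_size=10)
print(bucket.allow())  # True until the burst of 10 is spent
```

With these numbers, a client can burst 10 requests immediately, then sustain one request per second.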
- Automatic browser solving -- Generates hash-based crypto puzzles that real browsers solve automatically via JavaScript
- Adjustable difficulty -- Difficulty level 1-10, balancing security and user experience
- Session persistence -- Verified browsers are remembered via cookies to avoid repeated challenges
- Math CAPTCHA fallback -- Optional text-based math challenge as a degradation strategy
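The JS challenge boils down to a hash-based proof of work: the server issues a random nonce, the browser brute-forces a counter whose hash meets a difficulty target, and the server verifies the answer with a single hash. A Python sketch of that flow (function names are illustrative, and a real deployment would do the solving step in browser JavaScript):

```python
import hashlib
import secrets

def issue_challenge():
    """Server side: hand the browser a random nonce."""
    return secrets.token_hex(16)

def solve(nonce, difficulty):
    """Client side (done in JS in a real browser): brute-force a counter
    whose SHA-256 hash has `difficulty` leading zero hex digits."""
    counter = 0
    while True:
        digest = hashlib.sha256(f"{nonce}:{counter}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return counter
        counter += 1

def verify(nonce, counter, difficulty):
    """Server side: a single hash confirms the submitted answer."""
    digest = hashlib.sha256(f"{nonce}:{counter}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = issue_challenge()
answer = solve(nonce, difficulty=3)  # cheap for one visitor, costly at scraping scale
assert verify(nonce, answer, difficulty=3)
```

Raising the difficulty by one hex digit multiplies the expected solving cost by 16, which is how a 1-10 difficulty dial trades security against user latency.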
- Dark theme design -- Modern dark UI with responsive layout
- Live statistics -- Total requests, allowed, blocked, challenged, honeypot triggers, and pass rate
- Leaderboards -- Top blocked IPs and top blocked bot types
- Activity log -- Detailed records of the last 50 requests (time, action, IP, path, UA, reason)
- Auto-refresh -- 10-second auto-refresh with manual refresh option
- Pure Python standard library -- Only uses built-in modules such as `http.server`, `urllib`, `threading`, and `hashlib`
- No third-party packages required -- Ready to run right after `pip install`
- Minimal resource footprint -- Suitable for deployment on resource-constrained VPS instances
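A stdlib-only reverse proxy of the kind described can be sketched with `http.server` and `urllib` (a heavily simplified illustration, assuming an upstream on localhost:3000; CrawlGuard's real proxy layers the detection logic on top of this forwarding step):

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

TARGET = "http://localhost:3000"  # upstream server (assumed for the example)

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A real handler would run the bot checks here before forwarding.
        upstream = Request(TARGET + self.path, headers=dict(self.headers))
        with urlopen(upstream) as resp:
            body = resp.read()
            self.send_response(resp.status)
            for key, value in resp.headers.items():
                # Hop-by-hop headers must not be blindly forwarded.
                if key.lower() not in ("transfer-encoding", "connection"):
                    self.send_header(key, value)
            self.end_headers()
            self.wfile.write(body)

# Uncomment to serve (blocks forever):
# ThreadingHTTPServer(("0.0.0.0", 8080), ProxyHandler).serve_forever()
```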
All features are configurable through config.yaml -- adjust detection strategies, rate limiting parameters, blacklists/whitelists, and more without touching a single line of code.
A full REST API is available for monitoring integration:
| Endpoint | Description |
|---|---|
| GET /api/stats | Get comprehensive statistics |
| GET /api/blocked | Get blocked IPs and bot list |
| GET /api/activity | Get recent activity log |
| GET /api/health | Health check endpoint |
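Any HTTP client can consume these endpoints; for example, a small Python helper polling the stats endpoint (assuming the dashboard runs on its default localhost:8081):

```python
import json
from urllib.request import urlopen

def fetch_stats(base="http://localhost:8081"):
    """Fetch the /api/stats payload and return it as a dict."""
    with urlopen(f"{base}/api/stats") as resp:
        return json.loads(resp.read().decode())

# Example (requires a running dashboard):
# stats = fetch_stats()
# print(stats)
```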
Supports SIGINT (Ctrl+C) and SIGTERM signals, ensuring in-flight requests complete before the server shuts down to prevent data loss.
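Graceful shutdown like this is typically wired up with the `signal` module; a minimal sketch, assuming a threading-based server loop that checks a stop flag between requests:

```python
import signal
import threading

stop_event = threading.Event()

def handle_shutdown(signum, frame):
    # Stop accepting new work; in-flight requests finish before exit.
    stop_event.set()

signal.signal(signal.SIGINT, handle_shutdown)
signal.signal(signal.SIGTERM, handle_shutdown)

# A server loop would then check the flag between requests:
# while not stop_event.is_set():
#     server.handle_request()
```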
- Python 3.8+ (supports 3.8, 3.9, 3.10, 3.11, 3.12)
- No third-party dependencies
Option 1: Install from PyPI (recommended)
pip install crawlguard

Option 2: Install from source
git clone https://github.com/gitstq/CrawlGuard.git
cd CrawlGuard
pip install -e .

# Clone the repository
git clone https://github.com/gitstq/CrawlGuard.git
cd CrawlGuard
# Copy the configuration file
cp config.example.yaml config.yaml
# Start the reverse proxy (listens on port 8080 by default, forwards to localhost:3000)
python -m crawlguard.cli start --target http://localhost:3000
# Start with a custom port
python -m crawlguard.cli start --target http://localhost:3000 --port 8080
# Start the monitoring dashboard (default port 8081)
python -m crawlguard.cli dashboard
# Run self-tests
python -m crawlguard.cli test
# Check current status
python -m crawlguard.cli status

# Start the proxy
crawlguard start --target http://localhost:3000
# Start the dashboard
crawlguard dashboard
# Run self-tests
crawlguard test
# Show version
crawlguard version

CrawlGuard uses a YAML configuration file. Below are the key configuration options:
# Server settings
server:
listen_host: "0.0.0.0" # Address to bind to
listen_port: 8080 # Proxy listen port
target_host: "http://localhost:3000" # Upstream server URL
workers: 1 # Number of worker threads
# Dashboard settings
dashboard:
enabled: true # Enable web dashboard
port: 8081 # Dashboard port (must differ from proxy port)
host: "0.0.0.0" # Dashboard bind address
# Challenge settings
challenges:
js_challenge: true # Enable JavaScript challenge
rate_limit: true # Enable rate limiting
honeypot: true # Enable honeypot detection
user_agent_check: true # Enable UA blacklist/whitelist
captcha_fallback: false # Enable math CAPTCHA fallback
js_difficulty: 3 # JS puzzle difficulty (1-10)
challenge_timeout: 300 # Challenge expiration in seconds
# Rate limiting
rate_limit:
enabled: true
requests_per_minute: 60 # Max requests per minute per IP
requests_per_hour: 1000 # Max requests per hour per IP
burst_size: 10 # Token bucket burst size
# IP reputation system
ip_reputation:
enabled: true
max_score: 100 # Maximum reputation score
block_threshold: 80 # Score threshold to block an IP
decay_interval: 3600 # Score decay interval in seconds
# Security settings
security:
secret_key: "change-me-to-a-random-secret-string" # Secret for token generation
allowed_methods: # Allowed HTTP methods
- GET
- POST
- HEAD
- OPTIONS

| Challenge Type | Description | Best For |
|---|---|---|
| JS Challenge | Generates a crypto puzzle solved automatically by browsers | Most scenarios, transparent to users |
| Rate Limiting | Token bucket algorithm, returns 429 when exceeded | Preventing burst scraping |
| Honeypot | Injects hidden links that only bots trigger | Precise identification of automated tools |
| UA Check | User-Agent blacklist/whitelist matching | Fast blocking of known bots |
| Math CAPTCHA | Simple arithmetic problems solvable by humans | Fallback when JS challenges are unavailable |
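The honeypot row can be illustrated with a short sketch (the trap path /.hidden-trap is a hypothetical example, not CrawlGuard's actual trap URL):

```python
TRAP_PATH = "/.hidden-trap"  # hypothetical trap URL for illustration

def inject_honeypot(html: str) -> str:
    """Append a link that is invisible to humans but followed by naive bots."""
    trap = f'<a href="{TRAP_PATH}" style="display:none" rel="nofollow">.</a>'
    return html.replace("</body>", trap + "</body>")

def is_honeypot_hit(path: str) -> bool:
    """Any request for the trap path came from an automated client."""
    return path == TRAP_PATH

page = inject_honeypot("<html><body><h1>Hello</h1></body></html>")
assert TRAP_PATH in page
assert is_honeypot_hit(TRAP_PATH)
```

Since no human ever sees the hidden link, a single hit on the trap path is enough to flag the client's IP.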
Once the proxy is running, visit http://localhost:8081 to open the monitoring dashboard.
API integration examples:
# Get statistics
curl http://localhost:8081/api/stats
# Get blocked IPs and bots
curl http://localhost:8081/api/blocked
# Get recent activity
curl http://localhost:8081/api/activity
# Health check
curl http://localhost:8081/api/health

# Whitelist -- search engine crawlers that should ALWAYS be allowed
whitelist:
bots:
- Googlebot # Google search crawler
- Bingbot # Bing search crawler
- Baiduspider # Baidu search crawler
- YandexBot # Yandex search crawler
- DuckDuckBot # DuckDuckGo search crawler
ips: [] # Whitelisted IP addresses
user_agents: [] # Additional whitelisted UA patterns
# Blacklist -- AI crawlers that should ALWAYS be blocked
blacklist:
bots:
- GPTBot # OpenAI GPT crawler
- ClaudeBot # Anthropic Claude crawler
- Google-Extended # Google AI training crawler
- Bytespider # ByteDance crawler
- PerplexityBot # Perplexity AI crawler
ips: [] # Blacklisted IP addresses
user_agents: [] # Additional blacklisted UA patterns

Scenario 1: Protecting a Blog
# Blog runs on localhost:8080, CrawlGuard listens on port 80 for public access
python -m crawlguard.cli start --target http://localhost:8080 --port 80

Scenario 2: Protecting an API Server
# config.yaml
server:
listen_port: 443
target_host: "http://api-backend:3000"
challenges:
js_challenge: false # Disable JS challenge for API scenarios
rate_limit: true # Keep rate limiting
honeypot: false # Disable honeypot for API scenarios
rate_limit:
requests_per_minute: 120
requests_per_hour: 5000

Scenario 3: Protecting a Static Site
# Static files served by Nginx, CrawlGuard as a front-facing proxy
python -m crawlguard.cli start --target http://localhost:80 --port 8080 --no-dashboard

- Lightweight first -- Zero external dependencies, pure Python standard library, instant deployment
- Simple to use -- Single command to start, YAML-driven configuration, works out of the box
- Multi-layered defense -- No reliance on a single detection method; UA matching + behavioral analysis + challenge verification + reputation scoring work together
- Data sovereignty -- All data and logs stay on your server, never routed through any third party
- High readability -- Clean, understandable code that is easy for the community to audit and contribute to
- Rapid development -- Fast iteration cycles to respond quickly to new AI crawler threats
- Rich ecosystem -- Easy to extend with machine learning detection and other advanced features in the future
- Cross-platform -- Runs on Linux, macOS, and Windows without modification
- WAF rule engine -- Support for SQL injection, XSS, and other common web attack detection
- Geo-blocking -- Access control based on IP geolocation
- ML-based detection -- Train models on request features to identify unknown bots
- Docker support -- Official Docker images and docker-compose configurations
- Plugin system -- Support for custom detection plugins and response actions
- WebSocket support -- Protect WebSocket connections
- Cluster mode -- Multi-instance state sharing (Redis backend)
- Alert notifications -- Webhook / Email / DingTalk alert integration
We welcome contributions in the following areas:
- Adding new AI bot detection rules
- Improving detection algorithms and strategies
- Writing documentation and tutorials
- Submitting bug reports and fixes
- Sharing deployment experiences and best practices
# Clone and install
git clone https://github.com/gitstq/CrawlGuard.git
cd CrawlGuard
pip install -e .
# Configure
cp config.example.yaml config.yaml
vim config.yaml # Edit configuration
# Start
crawlguard start --target http://your-backend:3000

FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -e .
EXPOSE 8080 8081
COPY config.example.yaml config.yaml
CMD ["python", "-m", "crawlguard.cli", "start", "--target", "http://host.docker.internal:3000"]

# Build the image
docker build -t crawlguard .
# Run the container
docker run -d \
--name crawlguard \
-p 8080:8080 \
-p 8081:8081 \
-v $(pwd)/config.yaml:/app/config.yaml \
crawlguard

[Unit]
Description=CrawlGuard AI Crawler Protection
After=network.target
[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/crawlguard
ExecStart=/usr/bin/python3 -m crawlguard.cli start --config /opt/crawlguard/config.yaml
Restart=always
RestartSec=5
Environment=PYTHONUNBUFFERED=1
[Install]
WantedBy=multi-user.target

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable crawlguard
sudo systemctl start crawlguard
# View logs
sudo journalctl -u crawlguard -f

upstream crawlguard {
server 127.0.0.1:8080;
}
server {
listen 80;
server_name your-domain.com;
location / {
proxy_pass http://crawlguard;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
# Monitoring dashboard (recommend restricting access)
server {
listen 8081;
server_name your-domain.com;
location / {
proxy_pass http://127.0.0.1:8081;
allow 127.0.0.1;
allow your-office-ip;
deny all;
}
}

| Environment | Support |
|---|---|
| Linux (Ubuntu/Debian/CentOS) | Fully supported |
| macOS | Fully supported |
| Windows | Fully supported |
| Python 3.8 | Fully supported |
| Python 3.9 | Fully supported |
| Python 3.10 | Fully supported |
| Python 3.11 | Fully supported |
| Python 3.12 | Fully supported |
Community contributions are highly welcome! Here is how you can get involved.
- Fork this repository
- Create your feature branch: git checkout -b feature/your-feature
- Commit your changes: git commit -m 'feat: add your feature'
- Push to the branch: git push origin feature/your-feature
- Submit a Pull Request
- Use a clear, descriptive title
- Include steps to reproduce and environment details
- For bug reports, attach relevant logs
- For feature requests, describe the use case in detail
- Follow PEP 8 coding conventions
- Write unit tests for new features
- Keep code comments clear and comprehensive
- Use Conventional Commits format for commit messages
# Clone the repository
git clone https://github.com/gitstq/CrawlGuard.git
cd CrawlGuard
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run self-tests
python -m crawlguard.cli test
# Validate configuration
python -m crawlguard.cli start --dry-run

This project is licensed under the MIT License.
MIT License
Copyright (c) 2024 CrawlGuard Team
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
If CrawlGuard has been helpful to you, please give the project a Star ⭐
Made with ❤️ by CrawlGuard Team