gitstq/CrawlGuard


CI Python 3.8+ License: MIT

  ██████╗ ██████╗ ██████╗ ███████╗   ██████╗ ███████╗██╗   ██╗███████╗██╗  ██╗████████╗
 ██╔════╝██╔═══██╗██╔══██╗██╔════╝   ██╔══██╗██╔════╝██║   ██║██╔════╝██║  ██║╚══██╔══╝
 ██║     ██║   ██║██║  ██║█████╗     ██████╔╝█████╗  ██║   ██║███████╗███████║   ██║
 ██║     ██║   ██║██║  ██║██╔══╝     ██╔══██╗██╔══╝  ╚██╗ ██╔╝╚════██║██╔══██║   ██║
 ╚██████╗╚██████╔╝██████╔╝███████╗   ██║  ██║███████╗ ╚████╔╝ ███████║██║  ██║   ██║
  ╚═════╝ ╚═════╝ ╚═════╝ ╚══════╝   ╚═╝  ╚═╝╚══════╝  ╚═══╝  ╚══════╝╚═╝  ╚═╝   ╚═╝

CrawlGuard - A Lightweight, Intelligent AI Crawler Protection Reverse Proxy Engine






🎉 Project Introduction

CrawlGuard is a lightweight, zero-dependency reverse proxy engine, written in pure Python, that intelligently protects your site against AI crawlers. It sits in front of your web server as a shield, accurately identifying and blocking AI crawlers and malicious bot traffic while letting legitimate users through seamlessly.

Pain Points It Solves

As AI technology advances at breakneck speed, an increasing number of AI crawlers (such as GPTBot, ClaudeBot, Bytespider, and others) are scraping website content without authorization, leading to:

  • Wasted bandwidth -- AI crawlers often generate far more requests than traditional search engine bots
  • Uncompensated content use -- Your articles and data are used to train AI models without your knowledge or consent
  • Server overload -- Massive automated requests degrade the experience for real users
  • SEO interference -- Excessive low-quality crawling can negatively impact how search engines evaluate your site

What Sets CrawlGuard Apart

| Feature | CrawlGuard | Cloudflare Bot Management | robots.txt |
| --- | --- | --- | --- |
| Deployment | Self-hosted, full control | Third-party dependency | Honor-based only |
| Dependencies | Zero, pure standard library | Account + payment required | None |
| Protection | Multi-layer detection + JS challenge + honeypot | Cloud-based rules | Declarative only |
| Data privacy | Data stays on your server | Routed through third party | N/A |
| Customization | Fully customizable | Limited | None |
| Cost | Completely free | Expensive paid plans | Free |

Inspiration

CrawlGuard was born out of a real-world problem we encountered in day-to-day operations: the rampant proliferation of AI crawlers. We noticed that the traditional robots.txt protocol is effectively useless against AI crawlers -- many AI companies' bots simply ignore Disallow rules. Existing commercial bot management solutions are either prohibitively expensive or require routing traffic through a third party, raising serious data privacy concerns. We set out to build an open-source, lightweight, self-hosted AI crawler protection tool that empowers every website owner to safeguard their content with ease.


✨ Core Features

🤖 40+ Known AI Bot Detection Rules

A built-in User-Agent signature database covering major AI crawlers, including but not limited to:

  • OpenAI: GPTBot, ChatGPT-User, OAI-SearchBot
  • Anthropic: ClaudeBot, Claude-Web, Claude-SearchBot
  • Google AI: Google-Extended, GoogleOther
  • ByteDance: Bytespider
  • Meta: FacebookBot, Meta-ExternalAgent
  • Apple AI: Applebot-Extended
  • Common Crawl: CCBot
  • Perplexity AI: PerplexityBot, Perplexity-User
  • Cohere AI: cohere-ai, CohereBot
  • Allen AI: ai2bot, AllenAI
  • Amazon: Amazonbot, Amazonbot-Extended
  • Generic crawlers/tools: curl, wget, Scrapy, python-requests, and more
  • SEO crawlers: AhrefsBot, SemrushBot, MJ12bot, and more
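
In practice, signature rules like these reduce to case-insensitive regex matches against the User-Agent header, with the whitelist checked first. A minimal sketch of such a check (the pattern lists and function name here are illustrative, not CrawlGuard's internal API):

```python
import re

# Illustrative subsets of the shipped signature lists.
WHITELIST_PATTERNS = [r"Googlebot", r"Bingbot", r"DuckDuckBot"]
BLACKLIST_PATTERNS = [r"GPTBot", r"ClaudeBot", r"Bytespider", r"python-requests"]

def classify_user_agent(ua: str) -> str:
    """Return 'allow', 'block', or 'unknown'. The whitelist wins over the blacklist."""
    if any(re.search(p, ua, re.IGNORECASE) for p in WHITELIST_PATTERNS):
        return "allow"
    if any(re.search(p, ua, re.IGNORECASE) for p in BLACKLIST_PATTERNS):
        return "block"
    return "unknown"
```

Because the whitelist is evaluated first, a User-Agent matching both lists is still allowed, mirroring the whitelist-priority behavior described in the detection engine below.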

🛡️ Multi-Strategy Detection Engine

  • UA blacklist/whitelist -- Regex-based User-Agent matching with whitelist priority
  • IP reputation scoring -- Dynamically tracks per-IP behavior, accumulates penalty scores, and auto-blocks when thresholds are exceeded
  • Token bucket rate limiting -- Per-minute and per-hour request rate limits to prevent burst scraping
  • Behavioral analysis -- Detects path scanning, sequential enumeration, and other anomalous access patterns
  • Honeypot detection -- Injects invisible links into pages; only bots trigger them, providing precise identification of automated tools
  • Suspicious header detection -- Optionally block requests missing Referer or Accept headers
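
The token bucket limiter described above can be sketched in a few lines; the class and parameter names are illustrative, not CrawlGuard's actual implementation:

```python
import time

class TokenBucket:
    """Allow `rate` requests per second, with bursts up to `burst` (illustrative)."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate            # tokens refilled per second
        self.burst = burst          # bucket capacity
        self.tokens = float(burst)  # start full so a fresh client can burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller would respond 429 Too Many Requests
```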

🔐 JS Crypto Challenge System

  • Automatic browser solving -- Generates hash-based crypto puzzles that real browsers solve automatically via JavaScript
  • Adjustable difficulty -- Difficulty level 1-10, balancing security and user experience
  • Session persistence -- Verified browsers are remembered via cookies to avoid repeated challenges
  • Math CAPTCHA fallback -- Optional text-based math challenge as a degradation strategy
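
One common way to realize such a hash-based puzzle is proof-of-work: the server issues a random seed, and the client must find a number whose hash, combined with the seed, has a required number of leading zeros. This is an illustrative interpretation of the scheme, not CrawlGuard's exact puzzle format:

```python
import hashlib
import secrets

def make_challenge(difficulty: int) -> dict:
    """Issue a puzzle: find `answer` so that sha256(seed + answer)
    starts with `difficulty` leading zero hex digits (illustrative scheme)."""
    return {"seed": secrets.token_hex(8), "difficulty": difficulty}

def solve(challenge: dict) -> int:
    # What the browser-side JavaScript would do: brute-force the nonce.
    target = "0" * challenge["difficulty"]
    n = 0
    while True:
        digest = hashlib.sha256(f"{challenge['seed']}{n}".encode()).hexdigest()
        if digest.startswith(target):
            return n
        n += 1

def verify(challenge: dict, answer: int) -> bool:
    # Server-side verification is a single hash, regardless of difficulty.
    digest = hashlib.sha256(f"{challenge['seed']}{answer}".encode()).hexdigest()
    return digest.startswith("0" * challenge["difficulty"])
```

Under this scheme, each extra difficulty level multiplies the expected search cost by 16, which is the knob that trades security against time spent in the visitor's browser.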

📊 Real-Time Web Dashboard

  • Dark theme design -- Modern dark UI with responsive layout
  • Live statistics -- Total requests, allowed, blocked, challenged, honeypot triggers, and pass rate
  • Leaderboards -- Top blocked IPs and top blocked bot types
  • Activity log -- Detailed records of the last 50 requests (time, action, IP, path, UA, reason)
  • Auto-refresh -- 10-second auto-refresh with manual refresh option

📦 Zero External Dependencies

  • Pure Python standard library -- Uses only built-in modules such as http.server, urllib, threading, and hashlib
  • No third-party packages required -- Ready to run right after pip install
  • Minimal resource footprint -- Suitable for deployment on resource-constrained VPS instances
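
The zero-dependency claim is plausible because the standard library already covers the essentials of a reverse proxy. A stripped-down sketch of the core loop using only http.server and urllib (detection hooks and most header handling omitted; this is not CrawlGuard's actual source):

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

TARGET = "http://localhost:3000"  # upstream server, as in the examples below

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A real deployment would run UA, rate-limit, and honeypot checks here
        # and short-circuit with a 403 or a challenge page when they fail.
        req = Request(TARGET + self.path,
                      headers={"User-Agent": self.headers.get("User-Agent", "")})
        with urlopen(req) as upstream:
            body = upstream.read()
            self.send_response(upstream.status)
            self.send_header("Content-Type",
                             upstream.headers.get("Content-Type", "text/html"))
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

# To serve: ThreadingHTTPServer(("0.0.0.0", 8080), ProxyHandler).serve_forever()
```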

⚙️ YAML-Driven Configuration

All features are configurable through config.yaml -- adjust detection strategies, rate limiting parameters, blacklists/whitelists, and more without touching a single line of code.

🔌 REST API Endpoints

A full REST API is available for monitoring integration:

| Endpoint | Description |
| --- | --- |
| GET /api/stats | Get comprehensive statistics |
| GET /api/blocked | Get blocked IPs and bot list |
| GET /api/activity | Get recent activity log |
| GET /api/health | Health check endpoint |

🔄 Graceful Shutdown

Supports SIGINT (Ctrl+C) and SIGTERM signals, ensuring in-flight requests complete before the server shuts down to prevent data loss.
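
With the standard library's ThreadingHTTPServer, this behavior can be sketched in a few lines: shutdown() stops the serve_forever() loop, and non-daemon handler threads are joined when the server closes (a sketch, not CrawlGuard's actual code):

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()

server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)  # port 0: any free port

def graceful_exit(signum, frame):
    # shutdown() must not be called on the serve_forever() thread, hence the
    # helper thread; in-flight non-daemon handler threads are joined on close.
    threading.Thread(target=server.shutdown).start()

signal.signal(signal.SIGINT, graceful_exit)   # Ctrl+C
signal.signal(signal.SIGTERM, graceful_exit)  # systemctl stop, docker stop
# server.serve_forever() would block here until a signal triggers shutdown().
```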


🚀 Quick Start

Requirements

  • Python 3.8+ (supports 3.8, 3.9, 3.10, 3.11, 3.12)
  • No third-party dependencies

Installation

Option 1: Install from PyPI (recommended)

pip install crawlguard

Option 2: Install from source

git clone https://github.com/gitstq/CrawlGuard.git
cd CrawlGuard
pip install -e .

Get Started

# Clone the repository
git clone https://github.com/gitstq/CrawlGuard.git
cd CrawlGuard

# Copy the configuration file
cp config.example.yaml config.yaml

# Start the reverse proxy (listens on port 8080 by default, forwards to localhost:3000)
python -m crawlguard.cli start --target http://localhost:3000

# Start with a custom port
python -m crawlguard.cli start --target http://localhost:3000 --port 8080

# Start the monitoring dashboard (default port 8081)
python -m crawlguard.cli dashboard

# Run self-tests
python -m crawlguard.cli test

# Check current status
python -m crawlguard.cli status

After pip install

# Start the proxy
crawlguard start --target http://localhost:3000

# Start the dashboard
crawlguard dashboard

# Run self-tests
crawlguard test

# Show version
crawlguard version

📖 Detailed Usage Guide

Configuration File Reference

CrawlGuard uses a YAML configuration file. Below are the key configuration options:

# Server settings
server:
  listen_host: "0.0.0.0"              # Address to bind to
  listen_port: 8080                    # Proxy listen port
  target_host: "http://localhost:3000" # Upstream server URL
  workers: 1                           # Number of worker threads

# Dashboard settings
dashboard:
  enabled: true                        # Enable web dashboard
  port: 8081                           # Dashboard port (must differ from proxy port)
  host: "0.0.0.0"                      # Dashboard bind address

# Challenge settings
challenges:
  js_challenge: true                   # Enable JavaScript challenge
  rate_limit: true                     # Enable rate limiting
  honeypot: true                       # Enable honeypot detection
  user_agent_check: true               # Enable UA blacklist/whitelist
  captcha_fallback: false              # Enable math CAPTCHA fallback
  js_difficulty: 3                     # JS puzzle difficulty (1-10)
  challenge_timeout: 300               # Challenge expiration in seconds

# Rate limiting
rate_limit:
  enabled: true
  requests_per_minute: 60              # Max requests per minute per IP
  requests_per_hour: 1000              # Max requests per hour per IP
  burst_size: 10                       # Token bucket burst size

# IP reputation system
ip_reputation:
  enabled: true
  max_score: 100                       # Maximum reputation score
  block_threshold: 80                  # Score threshold to block an IP
  decay_interval: 3600                 # Score decay interval in seconds

# Security settings
security:
  secret_key: "change-me-to-a-random-secret-string"  # Secret for token generation
  allowed_methods:                     # Allowed HTTP methods
    - GET
    - POST
    - HEAD
    - OPTIONS
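
The ip_reputation block above implies a small amount of per-IP state: a penalty score that grows on bad behavior, decays over time, and trips a block once it crosses the threshold. A sketch under those assumptions (the amount of decay per interval is not specified in the config, so decay_points here is invented for illustration):

```python
import time

class IPReputation:
    """Per-IP penalty scoring with time decay (illustrative, not CrawlGuard's code)."""
    def __init__(self, block_threshold=80, max_score=100,
                 decay_interval=3600, decay_points=10):
        self.block_threshold = block_threshold
        self.max_score = max_score
        self.decay_interval = decay_interval  # seconds per decay step
        self.decay_points = decay_points      # assumed points removed per step
        self.scores = {}                      # ip -> [score, last_seen]

    def _decay(self, ip):
        score, last = self.scores.get(ip, [0, time.monotonic()])
        elapsed = time.monotonic() - last
        score = max(0, score - int(elapsed // self.decay_interval) * self.decay_points)
        self.scores[ip] = [score, time.monotonic()]
        return score

    def penalize(self, ip, points):
        score = min(self.max_score, self._decay(ip) + points)
        self.scores[ip][0] = score
        return score

    def is_blocked(self, ip):
        return self._decay(ip) >= self.block_threshold
```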

Challenge Types

| Challenge Type | Description | Best For |
| --- | --- | --- |
| JS Challenge | Generates a crypto puzzle solved automatically by browsers | Most scenarios, transparent to users |
| Rate Limiting | Token bucket algorithm, returns 429 when exceeded | Preventing burst scraping |
| Honeypot | Injects hidden links that only bots trigger | Precise identification of automated tools |
| UA Check | User-Agent blacklist/whitelist matching | Fast blocking of known bots |
| Math CAPTCHA | Simple arithmetic problems solvable by humans | Fallback when JS challenges are unavailable |
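
The honeypot mechanism is simple enough to sketch end to end: hide a link that no human will see or click, then treat any client that fetches it as a bot. The trap path and function names here are illustrative:

```python
HONEYPOT_PATH = "/.hidden-trap"  # illustrative trap URL

def inject_honeypot(html: str) -> str:
    """Append an invisible link that real users never see or click."""
    trap = f'<a href="{HONEYPOT_PATH}" style="display:none" rel="nofollow">.</a>'
    return html.replace("</body>", trap + "</body>")

blocked_ips = set()

def on_request(ip: str, path: str) -> bool:
    """Return False (block) once an IP has stepped into the trap."""
    if path == HONEYPOT_PATH:
        blocked_ips.add(ip)
    return ip not in blocked_ips
```

Pairing the trap path with a Disallow entry in robots.txt additionally catches crawlers that read robots.txt only to discover interesting paths.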

Dashboard and API

Once the proxy is running, visit http://localhost:8081 to open the monitoring dashboard.

API integration examples:

# Get statistics
curl http://localhost:8081/api/stats

# Get blocked IPs and bots
curl http://localhost:8081/api/blocked

# Get recent activity
curl http://localhost:8081/api/activity

# Health check
curl http://localhost:8081/api/health
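
For programmatic integration beyond curl, the same endpoints can be polled with the standard library. The JSON field names below ('total', 'blocked') are assumptions for illustration, not documented response keys:

```python
import json
from urllib.request import urlopen

def fetch_stats(base_url: str = "http://localhost:8081") -> dict:
    """Fetch dashboard statistics as a dict from the /api/stats endpoint."""
    with urlopen(f"{base_url}/api/stats", timeout=5) as resp:
        return json.load(resp)

def block_rate(stats: dict) -> float:
    """Share of requests blocked, assuming 'blocked' and 'total' counters."""
    total = stats.get("total", 0)
    return stats.get("blocked", 0) / total if total else 0.0
```

A monitoring job could call fetch_stats() on a schedule and alert when block_rate() spikes.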

Whitelist/Blacklist Configuration

# Whitelist -- search engine crawlers that should ALWAYS be allowed
whitelist:
  bots:
    - Googlebot       # Google search crawler
    - Bingbot         # Bing search crawler
    - Baiduspider     # Baidu search crawler
    - YandexBot       # Yandex search crawler
    - DuckDuckBot     # DuckDuckGo search crawler
  ips: []             # Whitelisted IP addresses
  user_agents: []     # Additional whitelisted UA patterns

# Blacklist -- AI crawlers that should ALWAYS be blocked
blacklist:
  bots:
    - GPTBot          # OpenAI GPT crawler
    - ClaudeBot       # Anthropic Claude crawler
    - Google-Extended # Google AI training crawler
    - Bytespider      # ByteDance crawler
    - PerplexityBot   # Perplexity AI crawler
  ips: []             # Blacklisted IP addresses
  user_agents: []     # Additional blacklisted UA patterns
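Conceptually, the lists above resolve to a whitelist-first, case-insensitive substring match on the User-Agent header. An illustrative sketch of that decision order (not CrawlGuard's actual implementation):

```python
# Bot tokens mirror the example config above; matching is a simple
# case-insensitive substring test, whitelist checked before blacklist.
WHITELIST = ["Googlebot", "Bingbot", "Baiduspider", "YandexBot", "DuckDuckBot"]
BLACKLIST = ["GPTBot", "ClaudeBot", "Google-Extended", "Bytespider",
             "PerplexityBot"]

def classify(user_agent: str) -> str:
    ua = user_agent.lower()
    if any(bot.lower() in ua for bot in WHITELIST):
        return "allow"      # search engines pass straight through
    if any(bot.lower() in ua for bot in BLACKLIST):
        return "block"      # known AI crawlers are rejected outright
    return "challenge"      # unknown clients go through the challenge pipeline
```

Checking the whitelist first guarantees that a search-engine crawler is never accidentally caught by a broad blacklist pattern.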

Typical Deployment Scenarios

Scenario 1: Protecting a Blog

# Blog runs on localhost:8080, CrawlGuard listens on port 80 for public access
# (binding ports below 1024 requires root or CAP_NET_BIND_SERVICE)
python -m crawlguard.cli start --target http://localhost:8080 --port 80

Scenario 2: Protecting an API Server

# config.yaml
server:
  listen_port: 443
  target_host: "http://api-backend:3000"
challenges:
  js_challenge: false    # Disable JS challenge for API scenarios
  rate_limit: true       # Keep rate limiting
  honeypot: false        # Disable honeypot for API scenarios
rate_limit:
  requests_per_minute: 120
  requests_per_hour: 5000

Scenario 3: Protecting a Static Site

# Static files served by Nginx, CrawlGuard as a front-facing proxy
python -m crawlguard.cli start --target http://localhost:80 --port 8080 --no-dashboard

💡 Design Philosophy & Roadmap

Design Philosophy

  • Lightweight first -- Zero external dependencies, pure Python standard library, ready to deploy in seconds
  • Simple to use -- Single command to start, YAML-driven configuration, works out of the box
  • Multi-layered defense -- No reliance on a single detection method; UA matching + behavioral analysis + challenge verification + reputation scoring work together
  • Data sovereignty -- All data and logs stay on your server, never routed through any third party

Why Python

  • High readability -- Clean, understandable code that is easy for the community to audit and contribute to
  • Rapid development -- Fast iteration cycles to respond quickly to new AI crawler threats
  • Rich ecosystem -- Easy to extend with machine learning detection and other advanced features in the future
  • Cross-platform -- Runs on Linux, macOS, and Windows without modification

Roadmap

  • WAF rule engine -- Support for SQL injection, XSS, and other common web attack detection
  • Geo-blocking -- Access control based on IP geolocation
  • ML-based detection -- Train models on request features to identify unknown bots
  • Docker support -- Official Docker images and docker-compose configurations
  • Plugin system -- Support for custom detection plugins and response actions
  • WebSocket support -- Protect WebSocket connections
  • Cluster mode -- Multi-instance state sharing (Redis backend)
  • Alert notifications -- Webhook / Email / DingTalk alert integration

Ways to Contribute

We welcome contributions in the following areas:

  • Adding new AI bot detection rules
  • Improving detection algorithms and strategies
  • Writing documentation and tutorials
  • Submitting bug reports and fixes
  • Sharing deployment experiences and best practices

📦 Packaging & Deployment Guide

Direct Deployment

# Clone and install
git clone https://github.com/gitstq/CrawlGuard.git
cd CrawlGuard
pip install -e .

# Configure
cp config.example.yaml config.yaml
vim config.yaml  # Edit configuration

# Start
crawlguard start --target http://your-backend:3000

Docker Deployment

FROM python:3.11-slim

WORKDIR /app
COPY . .

RUN pip install --no-cache-dir -e .

EXPOSE 8080 8081

COPY config.example.yaml config.yaml

CMD ["python", "-m", "crawlguard.cli", "start", "--target", "http://host.docker.internal:3000"]

# Build the image
docker build -t crawlguard .

# Run the container
docker run -d \
  --name crawlguard \
  -p 8080:8080 \
  -p 8081:8081 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  crawlguard

systemd Service Configuration

[Unit]
Description=CrawlGuard AI Crawler Protection
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/crawlguard
ExecStart=/usr/bin/python3 -m crawlguard.cli start --config /opt/crawlguard/config.yaml
Restart=always
RestartSec=5
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable crawlguard
sudo systemctl start crawlguard

# View logs
sudo journalctl -u crawlguard -f

Nginx Reverse Proxy Configuration

upstream crawlguard {
    server 127.0.0.1:8080;
}

server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://crawlguard;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

# Monitoring dashboard (recommend restricting access).
# Note: the dashboard itself already binds port 8081, so Nginx must expose it
# through a different listener -- e.g. a dedicated subdomain on port 80.
server {
    listen 80;
    server_name dashboard.your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:8081;
        allow 127.0.0.1;
        allow your-office-ip;
        deny all;
    }
}
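Behind Nginx as configured above, CrawlGuard sees `127.0.0.1` as the peer address and must recover the real client from `X-Forwarded-For`. A hedged sketch of that extraction — the header is attacker-controllable, so it should only be honored when the request actually arrives from a trusted proxy:

```python
def client_ip(headers: dict, peer_ip: str,
              trusted_proxies: frozenset = frozenset({"127.0.0.1"})) -> str:
    """Return the original client IP, trusting X-Forwarded-For only
    when the TCP peer is a known proxy. Illustrative helper, not
    CrawlGuard's actual code."""
    if peer_ip in trusted_proxies:
        forwarded = headers.get("X-Forwarded-For", "")
        if forwarded:
            # In the single-proxy Nginx setup above, the left-most
            # entry is the original client.
            return forwarded.split(",")[0].strip()
    return peer_ip
```

Without the trusted-proxy check, any direct client could spoof the header and evade IP-based rate limiting and reputation scoring.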

Compatible Environments

| Environment | Support |
|---|---|
| Linux (Ubuntu/Debian/CentOS) | Fully supported |
| macOS | Fully supported |
| Windows | Fully supported |
| Python 3.8 | Fully supported |
| Python 3.9 | Fully supported |
| Python 3.10 | Fully supported |
| Python 3.11 | Fully supported |
| Python 3.12 | Fully supported |

🤝 Contributing Guide

Community contributions are highly welcome! Here is how you can get involved.

Submitting a Pull Request

  1. Fork this repository
  2. Create your feature branch: git checkout -b feature/your-feature
  3. Commit your changes: git commit -m 'feat: add your feature'
  4. Push to the branch: git push origin feature/your-feature
  5. Submit a Pull Request

Submitting an Issue

  • Use a clear, descriptive title
  • Include steps to reproduce and environment details
  • For bug reports, attach relevant logs
  • For feature requests, describe the use case in detail

Code Style

  • Follow PEP 8 coding conventions
  • Write unit tests for new features
  • Keep code comments clear and comprehensive
  • Use Conventional Commits format for commit messages

Development Setup

# Clone the repository
git clone https://github.com/gitstq/CrawlGuard.git
cd CrawlGuard

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run self-tests
python -m crawlguard.cli test

# Validate configuration
python -m crawlguard.cli start --dry-run

📄 License

This project is licensed under the MIT License.

MIT License

Copyright (c) 2024 CrawlGuard Team

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

If CrawlGuard has been helpful to you, please give the project a Star ⭐

Made with ❤️ by CrawlGuard Team
