An easily scalable, distributed web scraping API built with FastAPI and Docker. It features proxy rotation (using Webshare.io proxies) and multiple scraping methods that apply stealth techniques such as fingerprint randomization and header rotation.
- Distributed Architecture: Separate distributor and runner services
- Multiple Scraping Methods:
  - Simple (aiohttp) for basic scraping
  - Advanced (Playwright) for JavaScript-heavy sites
- Proxy Management (using Webshare.io):
  - Automatic proxy rotation
  - Health monitoring
  - Success rate tracking
- Stealth Features:
  - Browser fingerprint randomization
  - Header rotation
  - User agent spoofing
- Health Monitoring:
  - Service health checks
  - Runner registration system
  - Proxy performance tracking
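
The proxy management features listed above (rotation combined with success rate tracking) could be sketched roughly as follows. This is an illustrative sketch only, not the project's actual code: the `ProxyStats` and `ProxyPool` names, the `min_success_rate` threshold, and the random selection strategy are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class ProxyStats:
    """Per-proxy counters (hypothetical structure, not taken from the project)."""
    url: str
    successes: int = 0
    failures: int = 0

    @property
    def success_rate(self) -> float:
        total = self.successes + self.failures
        # An untried proxy counts as healthy so that it gets a chance to be used.
        return 1.0 if total == 0 else self.successes / total


class ProxyPool:
    """Picks proxies at random, preferring those above a minimum success rate."""

    def __init__(self, urls: list[str], min_success_rate: float = 0.5) -> None:
        if not urls:
            raise ValueError("at least one proxy URL is required")
        self._stats = {url: ProxyStats(url) for url in urls}
        self._min_success_rate = min_success_rate

    def pick(self) -> str:
        healthy = [
            s for s in self._stats.values()
            if s.success_rate >= self._min_success_rate
        ]
        # Fall back to the full pool instead of failing when every proxy looks bad.
        candidates = healthy or list(self._stats.values())
        return random.choice(candidates).url

    def report(self, url: str, ok: bool) -> None:
        """Record the outcome of one request made through the given proxy."""
        stats = self._stats[url]
        if ok:
            stats.successes += 1
        else:
            stats.failures += 1
```

A caller would `pick()` a proxy before each request and `report()` the outcome afterwards, so that proxies with a poor track record are gradually avoided.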
    ┌─────────────┐     ┌──────────────┐     ┌──────────┐
    │   Client    │────▶│  Distributor │────▶│ Runner 1 │
    └─────────────┘     │   Service    │     └──────────┘
                        │              │     ┌──────────┐
                        │              │────▶│ Runner 2 │
                        └──────────────┘     └──────────┘
                              │              ┌──────────┐
                              └─────────────▶│ Runner N │
                                             └──────────┘
- Docker and Docker Compose
- Python 3.11+
- Webshare.io API token for proxies
- Clone the repository:

      git clone https://github.com/PeerZ0/ScrapeEngine.git
      cd ScrapeEngine

- Create a .env file:

      WEBSHARE_TOKEN=your_webshare_token
      AUTH_TOKEN=your_auth_token
      DEBUG=false

- Start the services:

      docker-compose up -d

- If you want to deploy more than one runner:

      docker-compose up -d --scale runner=3

Base URL: http://localhost:8080
All protected endpoints require Bearer token authentication. The expected token is the value of the AUTH_TOKEN environment variable:

    Authorization: Bearer <AUTH_TOKEN>

- POST /api/scrape - Initiates a scraping task
  - Request body:

        {
          "url": "https://example.com",
          "full_content": true,
          "stealth": true,
          "method": "aiohttp",
          "cache": true,
          "parse": true
        }

  - Method Options:
    - "aiohttp" - Fast HTTP-only scraping for static content
    - "playwright" - JavaScript rendering for dynamic content (returns rendered page after JS execution)
- GET /health/public - Public health check endpoint
- GET /api/debug/proxies - View proxy status (protected)
- GET /api/debug/runners - View runner status (protected)
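
As a rough illustration of calling the scrape endpoint, the sketch below builds an authenticated POST request using only the Python standard library. The helper names `build_scrape_request` and `send` are hypothetical. The response schema is not documented here, so the sketch returns the raw response bytes without interpreting them.

```python
import json
import urllib.request


def build_scrape_request(
    base_url: str,
    auth_token: str,
    target_url: str,
    method: str = "aiohttp",
) -> urllib.request.Request:
    """Build a POST /api/scrape request with the documented body fields."""
    if method not in ("aiohttp", "playwright"):
        raise ValueError(f"unsupported scraping method: {method!r}")
    payload = {
        "url": target_url,
        "full_content": True,
        "stealth": True,
        "method": method,
        "cache": True,
        "parse": True,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {auth_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> bytes:
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With the services running locally, `send(build_scrape_request("http://localhost:8080", "your_auth_token", "https://example.com"))` would submit one scraping task.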
The runner API is only available inside the Docker network. It is called by the distributor service.
Base URL: http://localhost:8000
- POST /scrape - Internal endpoint for scraping tasks
- GET /health - Health check endpoint
- WEBSHARE_TOKEN: Webshare.io API token
- AUTH_TOKEN: Authentication token for API access
- DEBUG: Enable debug logging (true/false)
The system uses Docker Compose for orchestration. Key configurations:

    services:
      distributor:
        ports:
          - "8080:8080"
        environment:
          - PYTHONUNBUFFERED=1
      runner:
        environment:
          - PYTHONUNBUFFERED=1
          - RUNNER_ID=runner-${HOSTNAME:-runner}
          - DISTRIBUTOR_URL=http://distributor:8080

    ├── Distributor/
    │   ├── app/
    │   │   ├── services/
    │   │   ├── config/
    │   │   ├── models.py
    │   │   └── main.py
    │   └── Dockerfile
    ├── Runner/
    │   ├── app/
    │   │   ├── services/
    │   │   ├── config/
    │   │   ├── models.py
    │   │   └── main.py
    │   └── Dockerfile
    └── docker-compose.yml

The system automatically scales with additional runner instances. To add more runners:

    docker-compose up -d --scale runner=3

- Monitor service health via the /health endpoints
- Check proxy status and available proxies via /api/debug/proxies
- View runner status and registered runners via /api/debug/runners

The system includes:
- Automatic retry mechanisms for failed requests
- Proxy rotation on failures
- Runner health monitoring
- Detailed logging
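
The combination of retries and proxy rotation on failure could look roughly like the sketch below. It is an assumption-laden illustration, not the project's implementation: `fetch_with_retries`, its parameters, and the exponential backoff are hypothetical, and `fetch` stands in for whatever coroutine performs the actual request through a given proxy.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence


async def fetch_with_retries(
    fetch: Callable[[str, str], Awaitable[str]],
    url: str,
    proxies: Sequence[str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> str:
    """Call fetch(url, proxy), switching to the next proxy after each failure."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error: Exception | None = None
    for attempt in range(max_attempts):
        # Rotate: each attempt uses the next proxy, wrapping around the list.
        proxy = proxies[attempt % len(proxies)]
        try:
            return await fetch(url, proxy)
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff between attempts (0.5s, 1s, 2s by default).
                await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"all {max_attempts} attempts failed for {url}") from last_error
```

Passing the fetch coroutine in as a parameter keeps the retry logic independent of the scraping method, so the same wrapper would work for an aiohttp-based or a Playwright-based fetcher.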