Skip to content

peervw/ScrapeEngine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

101 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ScrapeEngine

An easily scalable, distributed web scraping API built with FastAPI and Docker, featuring proxy rotation (from Webshare.io) and multiple scraping methods leveraging advanced stealth techniques.

🚀 Features

  • Distributed Architecture: Separate distributor and runner services
  • Multiple Scraping Methods:
    • Simple (aiohttp) for basic scraping
    • Advanced (Playwright) for JavaScript-heavy sites
  • Proxy Management (using Webshare.io):
    • Automatic proxy rotation
    • Health monitoring
    • Success rate tracking
  • Stealth Features:
    • Browser fingerprint randomization
    • Header rotation
    • User agent spoofing
  • Health Monitoring:
    • Service health checks
    • Runner registration system
    • Proxy performance tracking

Architecture

┌─────────────┐     ┌──────────────┐     ┌──────────┐
│   Client    │────▶│  Distributor │────▶│ Runner 1 │
└─────────────┘     │   Service    │     └──────────┘
                    │              │     ┌──────────┐
                    │              │────▶│ Runner 2 │
                    └──────────────┘     └──────────┘
                          │              ┌──────────┐
                          └─────────────▶│ Runner N │
                                         └──────────┘

📋Prerequisites

  • Docker and Docker Compose
  • Python 3.11+
  • Webshare.io API token for proxies

🛠️ Quick Start

  1. Clone the repository:
git clone https://github.com/PeerZ0/ScrapeEngine.git
cd ScrapeEngine
  1. Create .env file:
WEBSHARE_TOKEN=your_webshare_token
AUTH_TOKEN=your_auth_token
DEBUG=false
  1. Start the services:
docker-compose up -d
  1. If you want to deploy more than one runner:
docker-compose up -d --scale runner=3

🔍 API Endpoints

Distributor Service

Base URL: http://localhost:8080

Authentication

All protected endpoints require Bearer token authentication set in the env variable AUTH_TOKEN:

Authorization: Bearer <AUTH_TOKEN>

Endpoints

  • POST /api/scrape
    • Initiates a scraping task
    • Request body:
{
    "url": "https://example.com",
    "full_content": true,
    "stealth": true,
    "method": "aiohttp",
    "cache": true,
    "parse": true
}

Method Options:

  • "aiohttp" - Fast HTTP-only scraping for static content

  • "playwright" - JavaScript rendering for dynamic content (returns rendered page after JS execution)

  • GET /health/public

    • Public health check endpoint
  • GET /api/debug/proxies

    • View proxy status (protected)
  • GET /api/debug/runners

    • View runner status (protected)

Runner Service

Is only available inside the docker network. Requested by the distributor service.

Base URL: http://localhost:8000

  • POST /scrape
    • Internal endpoint for scraping tasks
  • GET /health
    • Health check endpoint

🛠️ Configuration

Environment Variables

  • WEBSHARE_TOKEN: Webshare.io API token
  • AUTH_TOKEN: Authentication token for API access
  • DEBUG: Enable debug logging (true/false)

Docker Compose Configuration

The system uses Docker Compose for orchestration. Key configurations:

services:
  distributor:
    ports:
      - "8080:8080"
    environment:
      - PYTHONUNBUFFERED=1
    
  runner:
    environment:
      - PYTHONUNBUFFERED=1
      - RUNNER_ID=runner-${HOSTNAME:-runner}
      - DISTRIBUTOR_URL=http://distributor:8080

Development

Project Structure

├── Distributor/
│   ├── app/
│   │   ├── services/
│   │   ├── config/
│   │   └── models.py
│   │   └── main.py
│   └── Dockerfile
├── Runner/
│   ├── app/
│   │   ├── services/
│   │   ├── config/
│   │   └── models.py
│   │   └── main.py
│   └── Dockerfile
└── docker-compose.yml

Adding New Runners

The system automatically scales with additional runner instances. To add more runners:

docker-compose up -d --scale runner=3

Monitoring

  • Monitor service health via /health endpoints
  • Check proxy status and available proxies via /api/debug/proxies
  • View runner status and registered runners via /api/debug/runners

Error Handling

The system includes:

  • Automatic retry mechanisms for failed requests
  • Proxy rotation on failures
  • Runner health monitoring
  • Detailed logging

About

A distributed web scraping system built with FastAPI and Docker, featuring proxy rotation and advanced stealth techniques. The system is designed to be easily scalable and resilient, with separate distributor and runner services.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors