A production-ready Python web scraper with anti-blocking, browser automation, database persistence, and real-time monitoring.
- Intelligent Scraping - Auto-detects best method (requests vs browser)
- Anti-Blocking - Proxy rotation, user-agent rotation, request delays
- Browser Automation - Selenium-based scraping for JavaScript-heavy sites
- Modular Parsers - Easy to add custom site parsers
- Database Support - SQLite/PostgreSQL with SQLAlchemy ORM
- Price History - Track all price changes over time
- Data Export - CSV export functionality
- Price Alerts - Automatic notifications when prices drop
- Web Dashboard - Real-time monitoring interface
- Email Notifications - HTML-formatted price drop alerts
- Price Statistics - Min/max/average price analysis
- Caching - TTL-based in-memory cache
- Rate Limiting - Configurable request throttling
- Request Throttling - Avoid overwhelming servers
- CLI Tool - 9 commands for complete control
- Configuration - YAML-based site configuration
- Logging - Comprehensive logging system
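
To illustrate the TTL-based in-memory cache mentioned in the feature list, here is a minimal sketch. The class name `TTLCache` and its methods are hypothetical and are not taken from `src/cache.py`; the injectable `clock` argument is only there so that expiry can be exercised without sleeping.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical TTL cache sketch; not the project's actual implementation."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable time source, defaults to a monotonic clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Store the value together with the time at which it expires.
        self._store[key] = (self.clock() + self.ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value
```

A scraper could consult such a cache before fetching a URL and store the HTML afterwards, so repeated lookups within the TTL do not hit the target site again.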
price-scraper/
├── config/
│   └── sites.yaml              # Site configurations
├── src/
│   ├── __init__.py
│   ├── models.py               # Data models
│   ├── utils.py                # Utility functions
│   ├── api_client.py           # API client
│   ├── scraper.py              # Main scraper
│   ├── database.py             # Database models & operations
│   ├── alerts.py               # Price alerts & comparison
│   ├── dashboard.py            # Web dashboard (Flask)
│   ├── cache.py                # Caching & rate limiting
│   ├── notifications.py        # Email notifications
│   ├── anti_blocking.py        # Anti-blocking mechanisms
│   ├── browser_scraper.py      # Browser-based scraping
│   ├── intelligent_scraper.py  # Smart scraper selection
│   ├── cli.py                  # CLI tool
│   └── parsers/
│       ├── __init__.py
│       ├── base.py             # Base parser
│       ├── generic.py          # Generic parser
│       └── browser.py          # Browser parser
├── logs/                       # Log files
├── .env                        # Environment variables
├── .env.example                # Example env
├── requirements.txt            # Dependencies
├── main.py                     # Entry point
├── README.md                   # This file
└── ANTI_BLOCKING_GUIDE.md      # Anti-blocking guide
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your settings

Edit config/sites.yaml:
sites:
  - name: "Amazon"
    url: "https://amazon.com/s?k=laptop"
    enabled: true
    parser: "browser"
    selectors:
      product: ".s-result-item"
      price: ".a-price-whole"
      title: "h2 a span"

# Create price alert
python -m src.cli create-alert --site amazon.com --product laptop --price 800
# List all alerts
python -m src.cli list-alerts
# Get price statistics
python -m src.cli stats --site amazon.com --product laptop --days 30
# Check alerts and trigger notifications
python -m src.cli check-alerts
# Start web dashboard
python -m src.cli dashboard --port 5000
# Export data to CSV
python -m src.cli export-data
# Show system status
python -m src.cli status
# Send test email
python -m src.cli send-test-email --email user@example.com
# Delete alert
python -m src.cli delete-alert --alert-id 1

from src.intelligent_scraper import IntelligentScraper
from src.database import Database
from src.alerts import AlertManager
# Scrape with auto-detection
scraper = IntelligentScraper(use_proxies=True, proxy_list=['proxy1', 'proxy2'])
html = scraper.scrape('https://amazon.com/product')
# Manage alerts
db = Database()
alert_manager = AlertManager(db)
alert = alert_manager.create_alert('amazon.com', 'laptop', 800)
# Check for triggered alerts
triggered = alert_manager.check_alerts()

# Proxy rotation
from src.anti_blocking import ProxyRotator
rotator = ProxyRotator(['http://proxy1:8080', 'http://proxy2:8080'])
proxy = rotator.get_proxy_dict()

# User-agent rotation
from src.anti_blocking import UserAgentRotator
ua_rotator = UserAgentRotator()
headers = {'User-Agent': ua_rotator.get_random_user_agent()}

# Browser-based scraping for JavaScript-heavy pages
from src.browser_scraper import BrowserScraper
with BrowserScraper(headless=True) as scraper:
    html = scraper.scrape('https://amazon.com', wait_selector='.product')
    prices = scraper.get_all_elements_text('.price')

# Intelligent scraper
from src.intelligent_scraper import IntelligentScraper
# Automatically uses browser for Amazon, requests for others
scraper = IntelligentScraper()
html = scraper.scrape('https://amazon.com/product')

See ANTI_BLOCKING_GUIDE.md for a detailed guide.
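
As a rough illustration of how proxy rotation, user-agent rotation, and randomized delays can fit together, the sketch below uses only the standard library. All names here (`USER_AGENTS`, `proxy_cycle`, `random_delay`, `request_kwargs`) are hypothetical and are not the API of `src/anti_blocking.py`; the user-agent strings are placeholders, and the delay bounds mirror the `MIN_DELAY`/`MAX_DELAY` values from the example `.env`.

```python
import itertools
import random
from typing import Dict, Iterator, List

# Placeholder user-agent strings; a real list would be longer and kept current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]


def proxy_cycle(proxies: List[str]) -> Iterator[Dict[str, str]]:
    """Yield requests-style proxy dicts in round-robin order."""
    for proxy in itertools.cycle(proxies):
        yield {"http": proxy, "https": proxy}


def random_delay(min_delay: float = 2.0, max_delay: float = 8.0) -> float:
    """Pick a delay in seconds, uniformly between the two bounds."""
    return random.uniform(min_delay, max_delay)


def request_kwargs(proxy_iter: Iterator[Dict[str, str]]) -> Dict[str, object]:
    """Build headers and proxies for the next request."""
    return {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "proxies": next(proxy_iter),
    }
```

The resulting dictionary has the shape that `requests.get(url, **kwargs)` accepts, and a caller would typically `time.sleep(random_delay())` between requests.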
PriceHistory - Track all price changes
db.add_price_history(
    site_name='amazon.com',
    product_name='laptop',
    product_url='https://...',
    price=999.99,
    currency='USD'
)

PriceAlert - Manage price alerts
alert = db.create_alert('amazon.com', 'laptop', 800)

# Get price history
history = db.get_price_history('amazon.com', 'laptop', limit=100)
# Get latest price
latest = db.get_latest_price('amazon.com', 'laptop')
# Get active alerts
alerts = db.get_active_alerts()

Start dashboard:
python -m src.cli dashboard --port 5000
Access at: http://localhost:5000
Features:
- Real-time price monitoring
- Active alerts display
- Price statistics
- Auto-refresh every 30 seconds
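
The price statistics are described as min/max/average analysis. A minimal sketch of that calculation over a list of recorded prices might look like the following; the function name `price_stats` and the returned dictionary keys are assumptions for illustration, not the project's actual interface.

```python
from statistics import mean
from typing import Dict, List, Optional


def price_stats(prices: List[float]) -> Optional[Dict[str, float]]:
    """Return min, max, and average for a price series, or None if it is empty."""
    if not prices:
        return None
    return {
        "min": min(prices),
        "max": max(prices),
        "average": round(mean(prices), 2),  # rounded to cents for display
    }
```

In practice the input list would come from stored price history for one site and product over the chosen number of days.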
Configure SMTP in .env:
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
SENDER_EMAIL=your-email@gmail.com
SENDER_PASSWORD=your-app-password
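
For orientation, the sketch below shows one way these settings could be read and turned into an HTML-formatted alert using only the standard library. The function names `build_alert_email` and `send_email` are hypothetical and the message wording is invented; the project's own `EmailNotifier` may work differently.

```python
import html
import os
import smtplib
from email.message import EmailMessage


def build_alert_email(recipient: str, product: str, target_price: float,
                      current_price: float, site_name: str) -> EmailMessage:
    """Build a price-drop message with a plain-text body and an HTML alternative."""
    msg = EmailMessage()
    msg["Subject"] = f"Price drop: {product} on {site_name}"
    msg["From"] = os.getenv("SENDER_EMAIL", "your-email@gmail.com")
    msg["To"] = recipient
    msg.set_content(
        f"{product} is now {current_price:.2f} on {site_name} "
        f"(target: {target_price:.2f})."
    )
    msg.add_alternative(
        f"<p><b>{html.escape(product)}</b> is now {current_price:.2f} on "
        f"{html.escape(site_name)} (target: {target_price:.2f}).</p>",
        subtype="html",
    )
    return msg


def send_email(msg: EmailMessage) -> None:
    """Send the message over SMTP with STARTTLS, using the .env settings."""
    server_name = os.getenv("SMTP_SERVER", "smtp.gmail.com")
    port = int(os.getenv("SMTP_PORT", "587"))
    with smtplib.SMTP(server_name, port) as server:
        server.starttls()
        server.login(os.environ["SENDER_EMAIL"], os.environ["SENDER_PASSWORD"])
        server.send_message(msg)
```

Building the message separately from sending it keeps the formatting testable without a live SMTP server.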
Send alerts:
from src.notifications import EmailNotifier
notifier = EmailNotifier()
notifier.send_alert(
    'user@example.com',
    'laptop',
    target_price=800,
    current_price=699,
    site_name='amazon.com'
)

# Database
DATABASE_URL=sqlite:///price_scraper.db
# API
API_BASE_URL=https://api.example.com
API_KEY=your-api-key
# Logging
LOG_LEVEL=INFO
# Scraping
MIN_DELAY=2
MAX_DELAY=8
HEADLESS=true
# Email
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
SENDER_EMAIL=your-email@gmail.com
SENDER_PASSWORD=your-password
# Proxies
PROXIES=http://proxy1:8080,http://proxy2:8080

- Always use delays - Minimum 2-5 seconds between requests
- Rotate user agents - Appear as different browsers
- Use proxies - For high-volume scraping
- Respect robots.txt - Check before scraping
- Use browser for JS - When content is dynamically loaded
- Monitor rate limits - Adjust delays if getting 429 errors
- Rotate IPs - Use proxy rotation for blocking-prone sites
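
The "Respect robots.txt" practice can be checked programmatically with Python's standard `urllib.robotparser`. The helper names `robots_url` and `is_allowed` below are hypothetical; the sketch parses robots.txt text that has already been downloaded, so it runs without network access.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the robots.txt location for the site that serves page_url."""
    parts = urlparse(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_allowed(robots_txt: str, page_url: str, user_agent: str = "*") -> bool:
    """Check page_url against the rules in an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/private/report"))
print(is_allowed(rules, "https://example.com/s?k=laptop"))
```

Against a live site, `RobotFileParser` can also fetch the file itself via `set_url(robots_url(page_url))` followed by `read()`.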
| Method | Speed | Reliability | Resource |
|---|---|---|---|
| Simple Requests | Fast | Low | Low |
| Requests + Rotation | Medium | Medium | Low |
| Browser | Slow | High | High |
| Browser + Proxy | Slow | Very High | High |
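
Given the trade-offs in the table, one plausible strategy is to try the fast method first and fall back to the browser only when needed. The sketch below is an assumption about how such a selection could work, not a description of `IntelligentScraper`; `BROWSER_DOMAINS`, `needs_browser`, and `scrape_with_fallback` are hypothetical names, and the two fetchers are passed in as plain callables.

```python
from typing import Callable, Iterable, Optional
from urllib.parse import urlparse

# Assumption: domains known to need a real browser are listed up front.
BROWSER_DOMAINS = {"amazon.com"}


def needs_browser(url: str, browser_domains: Iterable[str] = BROWSER_DOMAINS) -> bool:
    """True if the URL's host is one of the listed domains or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in browser_domains)


def scrape_with_fallback(url: str,
                         fetch_simple: Callable[[str], str],
                         fetch_browser: Callable[[str], str],
                         required_marker: Optional[str] = None) -> str:
    """Use the cheap fetcher unless the domain or the response calls for a browser."""
    if needs_browser(url):
        return fetch_browser(url)
    html = fetch_simple(url)
    # If the expected content is absent, the page is probably rendered by JavaScript.
    if required_marker is not None and required_marker not in html:
        return fetch_browser(url)
    return html
```

Passing the fetchers as arguments keeps the decision logic independent of `requests` or Selenium, so it can be tested with stubs.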
If requests are being blocked:
- Add delays between requests
- Rotate user agents
- Use proxies
- Switch to browser scraping

If you hit rate limits (429 errors):
- Increase delays
- Use more proxies
- Reduce concurrent requests

If content is missing from the scraped HTML:
- Use browser scraper
- Increase wait time
- Check wait_selector

Related files:
- ANTI_BLOCKING_GUIDE.md - Detailed anti-blocking guide
- requirements.txt - All dependencies
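
The advice to increase delays when a site returns 429 errors is often implemented as exponential backoff. The following is a minimal sketch under that assumption; `fetch_with_backoff` is a hypothetical helper, and `fetch` stands in for any callable that returns a `(status_code, body)` pair.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(fetch: Callable[[], Tuple[int, str]],
                       max_retries: int = 4,
                       base_delay: float = 2.0,
                       max_delay: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch(), doubling the wait after each 429 until retries run out."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 429:
            return body
        if attempt == max_retries:
            break  # no retries left, do not sleep again
        sleep(delay)
        delay = min(delay * 2, max_delay)  # cap the wait at max_delay
    raise RuntimeError("still rate limited after retries")
```

The `sleep` parameter exists so the waiting behaviour can be replaced in tests; in normal use the default `time.sleep` applies.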
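from src.parsers.base import BaseParser
from src.models import Product

class MyCustomParser(BaseParser):
    def parse(self, url: str, site_name: str):
        # fetch_page is assumed to return a BeautifulSoup object
        soup = self.fetch_page(url)
        products = []
        # select() accepts CSS selectors; find_all('.product') would match a tag named ".product"
        for item in soup.select('.product'):
            products.append(self.extract_product_data(item))
        return products

    def extract_product_data(self, element):
        # Strip the currency symbol and thousands separators before converting
        price_text = element.select_one('.price').text
        return Product(
            name=element.select_one('.title').text.strip(),
            price=float(price_text.replace('$', '').replace(',', '').strip()),
            url=element.select_one('a')['href']
        )

Register in main.py:
scraper.register_parser('my_site', MyCustomParser())

License: MIT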
# 1. Install
pip install -r requirements.txt
# 2. Configure
cp .env.example .env
# Edit .env
# 3. Create alert
python -m src.cli create-alert --site amazon.com --product laptop --price 800
# 4. Start dashboard
python -m src.cli dashboard
# 5. Check alerts
python -m src.cli check-alerts

Status: Production Ready
Version: 2.0 (Enterprise Edition)
Last Updated: 2026-05-31