jonyszone/price-scrapper

Price Scraper - Enterprise Edition

A production-ready Python web scraper with anti-blocking, browser automation, database persistence, and real-time monitoring.

🚀 Features

Core Scraping

  • ✅ Intelligent Scraping - Auto-detects best method (requests vs browser)
  • ✅ Anti-Blocking - Proxy rotation, user-agent rotation, request delays
  • ✅ Browser Automation - Selenium-based scraping for JavaScript-heavy sites
  • ✅ Modular Parsers - Easy to add custom site parsers

Data Management

  • ✅ Database Support - SQLite/PostgreSQL with SQLAlchemy ORM
  • ✅ Price History - Track all price changes over time
  • ✅ Data Export - CSV export functionality

Monitoring & Alerts

  • ✅ Price Alerts - Automatic notifications when prices drop
  • ✅ Web Dashboard - Real-time monitoring interface
  • ✅ Email Notifications - HTML-formatted price drop alerts
  • ✅ Price Statistics - Min/max/average price analysis

Performance

  • ✅ Caching - TTL-based in-memory cache
  • ✅ Rate Limiting - Configurable request throttling
  • ✅ Request Throttling - Avoid overwhelming servers

Management

  • ✅ CLI Tool - 9 commands for complete control
  • ✅ Configuration - YAML-based site configuration
  • ✅ Logging - Comprehensive logging system

📋 Project Structure

price-scraper/
├── config/
│   └── sites.yaml                 # Site configurations
├── src/
│   ├── __init__.py
│   ├── models.py                  # Data models
│   ├── utils.py                   # Utility functions
│   ├── api_client.py              # API client
│   ├── scraper.py                 # Main scraper
│   ├── database.py                # Database models & operations
│   ├── alerts.py                  # Price alerts & comparison
│   ├── dashboard.py               # Web dashboard (Flask)
│   ├── cache.py                   # Caching & rate limiting
│   ├── notifications.py           # Email notifications
│   ├── anti_blocking.py           # Anti-blocking mechanisms
│   ├── browser_scraper.py         # Browser-based scraping
│   ├── intelligent_scraper.py     # Smart scraper selection
│   ├── cli.py                     # CLI tool
│   └── parsers/
│       ├── __init__.py
│       ├── base.py                # Base parser
│       ├── generic.py             # Generic parser
│       └── browser.py             # Browser parser
├── logs/                          # Log files
├── .env                           # Environment variables
├── .env.example                   # Example env
├── requirements.txt               # Dependencies
├── main.py                        # Entry point
├── README.md                      # This file
└── ANTI_BLOCKING_GUIDE.md         # Anti-blocking guide

🔧 Setup

1. Install Dependencies

pip install -r requirements.txt

2. Configure Environment

cp .env.example .env
# Edit .env with your settings

3. Configure Sites

Edit config/sites.yaml:

sites:
  - name: "Amazon"
    url: "https://amazon.com/s?k=laptop"
    enabled: true
    parser: "browser"
    selectors:
      product: ".s-result-item"
      price: ".a-price-whole"
      title: "h2 a span"

📖 Usage

Command Line Interface

# Create price alert
python -m src.cli create-alert --site amazon.com --product laptop --price 800

# List all alerts
python -m src.cli list-alerts

# Get price statistics
python -m src.cli stats --site amazon.com --product laptop --days 30

# Check alerts and trigger notifications
python -m src.cli check-alerts

# Start web dashboard
python -m src.cli dashboard --port 5000

# Export data to CSV
python -m src.cli export-data

# Show system status
python -m src.cli status

# Send test email
python -m src.cli send-test-email --email user@example.com

# Delete alert
python -m src.cli delete-alert --alert-id 1

Python API

from src.intelligent_scraper import IntelligentScraper
from src.database import Database
from src.alerts import AlertManager

# Scrape with auto-detection
scraper = IntelligentScraper(use_proxies=True, proxy_list=['proxy1', 'proxy2'])
html = scraper.scrape('https://amazon.com/product')

# Manage alerts
db = Database()
alert_manager = AlertManager(db)
alert = alert_manager.create_alert('amazon.com', 'laptop', 800)

# Check for triggered alerts
triggered = alert_manager.check_alerts()

πŸ›‘οΈ Anti-Blocking Features

Proxy Rotation

from src.anti_blocking import ProxyRotator

rotator = ProxyRotator(['http://proxy1:8080', 'http://proxy2:8080'])
proxy = rotator.get_proxy_dict()
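
The internals of ProxyRotator are not shown in this README. As a rough illustration of what round-robin rotation can look like, here is a minimal standalone sketch. The class name RoundRobinProxies and its method are hypothetical, and the returned mapping assumes the `proxies` dict shape that the requests library accepts:

```python
from itertools import cycle


class RoundRobinProxies:
    """Hypothetical stand-in for illustration, not the project's ProxyRotator."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        # cycle() yields the proxies in order and starts over at the end.
        self._pool = cycle(proxies)

    def next_proxy_dict(self):
        # Same mapping shape that requests expects for its `proxies` argument.
        proxy = next(self._pool)
        return {"http": proxy, "https": proxy}


rotator = RoundRobinProxies(["http://proxy1:8080", "http://proxy2:8080"])
first = rotator.next_proxy_dict()
second = rotator.next_proxy_dict()
third = rotator.next_proxy_dict()  # wraps around to the first proxy again
```

Round-robin keeps the load even across proxies; picking at random is an equally simple alternative when ordering does not matter.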

User-Agent Rotation

from src.anti_blocking import UserAgentRotator

ua_rotator = UserAgentRotator()
headers = {'User-Agent': ua_rotator.get_random_user_agent()}

Browser Scraping (Like MCP)

from src.browser_scraper import BrowserScraper

with BrowserScraper(headless=True) as scraper:
    html = scraper.scrape('https://amazon.com', wait_selector='.product')
    prices = scraper.get_all_elements_text('.price')

Intelligent Selection

from src.intelligent_scraper import IntelligentScraper

# Automatically uses browser for Amazon, requests for others
scraper = IntelligentScraper()
html = scraper.scrape('https://amazon.com/product')
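
How IntelligentScraper decides between the two methods is not documented here. One plausible approach is a per-domain lookup, sketched below. The function name choose_method and the BROWSER_DOMAINS set are assumptions made for this example only:

```python
from urllib.parse import urlparse

# Assumed set of domains that need a real browser; the project's own list may differ.
BROWSER_DOMAINS = {"amazon.com"}


def choose_method(url, browser_domains=BROWSER_DOMAINS):
    """Return "browser" for JavaScript-heavy domains, otherwise "requests"."""
    host = (urlparse(url).hostname or "").lower()
    for domain in browser_domains:
        # Match the domain itself and any subdomain such as www.amazon.com.
        if host == domain or host.endswith("." + domain):
            return "browser"
    return "requests"
```

Matching on the hostname suffix rather than a substring avoids treating an unrelated host such as notamazon.com as a browser-only site.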

See ANTI_BLOCKING_GUIDE.md for a detailed guide.

💾 Database

Models

PriceHistory - Track all price changes

db.add_price_history(
    site_name='amazon.com',
    product_name='laptop',
    product_url='https://...',
    price=999.99,
    currency='USD'
)

PriceAlert - Manage price alerts

alert = db.create_alert('amazon.com', 'laptop', 800)

Queries

# Get price history
history = db.get_price_history('amazon.com', 'laptop', limit=100)

# Get latest price
latest = db.get_latest_price('amazon.com', 'laptop')

# Get active alerts
alerts = db.get_active_alerts()
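
The shape of the records returned by these queries is not shown above, so the sketch below works on a plain list of prices instead. It illustrates the min/max/average statistics and a price-drop check; the function names are hypothetical, and the rule that an alert fires at or below the target price is an assumption:

```python
from statistics import mean


def summarize_prices(prices):
    """Return min, max and average for a non-empty list of prices."""
    if not prices:
        raise ValueError("no prices to summarize")
    return {"min": min(prices), "max": max(prices), "avg": round(mean(prices), 2)}


def is_triggered(latest_price, target_price):
    # Assumption: an alert fires once the price is at or below the target.
    return latest_price <= target_price


prices = [999.99, 949.0, 799.0]  # illustrative values, not real data
stats = summarize_prices(prices)
```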

📊 Web Dashboard

Start dashboard:

python -m src.cli dashboard --port 5000

Access at: http://localhost:5000

Features:

  • Real-time price monitoring
  • Active alerts display
  • Price statistics
  • Auto-refresh every 30 seconds

📧 Email Notifications

Configure SMTP in .env:

SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
SENDER_EMAIL=your-email@gmail.com
SENDER_PASSWORD=your-app-password

Send alerts:

from src.notifications import EmailNotifier

notifier = EmailNotifier()
notifier.send_alert(
    'user@example.com',
    'laptop',
    target_price=800,
    current_price=699,
    site_name='amazon.com'
)
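
For readers curious what an HTML-formatted alert might contain, the following sketch builds (but does not send) such a message with the standard library. It is not the EmailNotifier implementation; the function name build_alert_email, the subject line, and the body wording are all made up for this example:

```python
from email.message import EmailMessage


def build_alert_email(sender, recipient, product, target_price, current_price, site_name):
    """Build (but do not send) an HTML price-drop email."""
    msg = EmailMessage()
    msg["Subject"] = f"Price drop: {product} on {site_name}"
    msg["From"] = sender
    msg["To"] = recipient
    # Plain-text body first, then an HTML alternative for clients that render it.
    msg.set_content(
        f"{product} is now {current_price:.2f} (target {target_price:.2f}) on {site_name}."
    )
    msg.add_alternative(
        f"<p><b>{product}</b> is now <b>{current_price:.2f}</b> "
        f"(target {target_price:.2f}) on {site_name}.</p>",
        subtype="html",
    )
    return msg


msg = build_alert_email(
    "your-email@gmail.com", "user@example.com", "laptop", 800, 699, "amazon.com"
)
# Sending would use smtplib.SMTP(SMTP_SERVER, SMTP_PORT) with starttls() and login().
```

Including a plain-text part alongside the HTML keeps the alert readable in mail clients that do not render HTML.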

βš™οΈ Configuration

Environment Variables

# Database
DATABASE_URL=sqlite:///price_scraper.db

# API
API_BASE_URL=https://api.example.com
API_KEY=your-api-key

# Logging
LOG_LEVEL=INFO

# Scraping
MIN_DELAY=2
MAX_DELAY=8
HEADLESS=true

# Email
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
SENDER_EMAIL=your-email@gmail.com
SENDER_PASSWORD=your-app-password

# Proxies
PROXIES=http://proxy1:8080,http://proxy2:8080
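
How the project itself reads these variables is not shown here. The sketch below is one way to parse them with the standard library only, using the values above as defaults. The function name load_settings and the keys of the returned dict are assumptions:

```python
import os


def load_settings(env=os.environ):
    """Read scraper settings from a mapping, using the README values as defaults."""
    # PROXIES is a comma-separated list; drop empty entries and stray spaces.
    proxies = [p.strip() for p in env.get("PROXIES", "").split(",") if p.strip()]
    settings = {
        "database_url": env.get("DATABASE_URL", "sqlite:///price_scraper.db"),
        "min_delay": float(env.get("MIN_DELAY", "2")),
        "max_delay": float(env.get("MAX_DELAY", "8")),
        "headless": env.get("HEADLESS", "true").strip().lower() in ("1", "true", "yes"),
        "smtp_port": int(env.get("SMTP_PORT", "587")),
        "proxies": proxies,
    }
    if settings["min_delay"] > settings["max_delay"]:
        raise ValueError("MIN_DELAY must not exceed MAX_DELAY")
    return settings


settings = load_settings(
    {"PROXIES": "http://proxy1:8080, http://proxy2:8080", "HEADLESS": "false"}
)
```

Passing a plain dict instead of os.environ, as in the last lines, makes the parsing easy to test without touching the real environment.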

🎯 Best Practices

  1. Always use delays - Wait at least 2-5 seconds between requests
  2. Rotate user agents - Appear as different browsers
  3. Use proxies - For high-volume scraping
  4. Respect robots.txt - Check before scraping
  5. Use browser for JS - When content is dynamically loaded
  6. Monitor rate limits - Adjust delays if getting 429 errors
  7. Rotate IPs - Use proxy rotation for blocking-prone sites
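
Two of the practices above, randomized delays and the robots.txt check, can be sketched with the standard library alone. The helper names are hypothetical, the delay bounds mirror the MIN_DELAY and MAX_DELAY defaults, and the robots.txt content is illustrative rather than taken from a real site:

```python
import random
from urllib.robotparser import RobotFileParser


def pick_delay(min_delay=2.0, max_delay=8.0):
    """Return a random pause in seconds within the configured bounds."""
    return random.uniform(min_delay, max_delay)


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative robots.txt content, not taken from a real site.
ROBOTS = "User-agent: *\nDisallow: /private/\n"
delay = pick_delay()
# A real loop would call time.sleep(delay) before each request.
```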

📈 Performance

Method               Speed   Reliability  Resource Usage
Simple Requests      Fast    Low          Low
Requests + Rotation  Medium  Medium       Low
Browser              Slow    High         High
Browser + Proxy      Slow    Very High    High

πŸ” Troubleshooting

Getting 403 Forbidden

  • Add delays between requests
  • Rotate user agents
  • Use proxies
  • Switch to browser scraping

Getting 429 Too Many Requests

  • Increase delays
  • Use more proxies
  • Reduce concurrent requests
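
"Increase delays" is often implemented as exponential backoff. The sketch below computes the wait time before each retry; it is an illustration, not code from this project. The function name and default values are assumptions, and only the seconds form of the Retry-After header is handled:

```python
import random


def backoff_delay(attempt, base=2.0, cap=60.0, retry_after=None):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # Prefer the server's Retry-After value (in seconds) when it sent one.
        return float(retry_after)
    # Exponential growth with full jitter so parallel workers do not retry in lockstep.
    return random.uniform(0, min(cap, base * 2 ** attempt))


delays = [backoff_delay(n) for n in range(5)]
```

The cap keeps a long run of failures from producing unbounded waits, while the jitter spreads retries out when several workers hit the same limit at once.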

JavaScript not loading

  • Use browser scraper
  • Increase wait time
  • Check wait_selector

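📚 Documentation

🤝 Creating Custom Parsers

from src.parsers.base import BaseParser
from src.models import Product

class MyCustomParser(BaseParser):
    def parse(self, url: str, site_name: str):
        # Assumes fetch_page() returns a BeautifulSoup document
        soup = self.fetch_page(url)
        # select() takes CSS selectors; find_all('.product') would search for a tag named ".product"
        return [self.extract_product_data(item) for item in soup.select('.product')]

    def extract_product_data(self, element):
        # Strip the currency symbol and thousands separators before converting
        price_text = element.select_one('.price').get_text(strip=True)
        return Product(
            name=element.select_one('.title').get_text(strip=True),
            price=float(price_text.replace('$', '').replace(',', '')),
            url=element.select_one('a')['href']
        )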

Register in main.py:

scraper.register_parser('my_site', MyCustomParser())

πŸ“ License

MIT

🚀 Quick Start

# 1. Install
pip install -r requirements.txt

# 2. Configure
cp .env.example .env
# Edit .env

# 3. Create alert
python -m src.cli create-alert --site amazon.com --product laptop --price 800

# 4. Start dashboard
python -m src.cli dashboard

# 5. Check alerts
python -m src.cli check-alerts

Status: Production Ready ✅
Version: 2.0 (Enterprise Edition)
Last Updated: 2026-05-31
