ddk311/scrappers

Motorcycle Parts Data Scrapers

A collection of web scrapers for gathering motorcycle parts data from manufacturers' websites. It currently includes scrapers for the JT Sprockets and BS Battery websites, which extract motorcycle model and parts information.

📋 Project Overview

This repository contains two main scraping projects:

  1. JT Sprockets Scraper - Extracts motorcycle sprocket and chain specifications
  2. BS Battery Scraper - Extracts motorcycle battery compatibility and specifications

Both projects are designed for data collection, analysis, and integration with motorcycle parts databases.

🏗️ Project Structure

scrappers/
├── jt-scrapper/                    # JT Sprockets data extraction
│   ├── scraper.py                 # Main scraper for model links
│   ├── model_data_scraper.py      # Detailed model data extraction
│   ├── analyze.py                 # Data analysis utilities
│   ├── consolidate_years.py       # Year consolidation script
│   ├── fix_brands.py              # Brand name correction
│   ├── get_ccm.py                 # Engine displacement extraction
│   ├── requirements.txt           # Python dependencies
│   ├── README.md                  # JT-specific documentation
│   ├── MODEL_DATA_SCRAPER_README.md
│   ├── model_links.txt            # Generated model URLs
│   ├── scraper_progress.json      # Progress tracking
│   └── sprocket_data_consolidated_fixed.csv  # Output data
│
├── bs-battery-scrapper/           # BS Battery data extraction
│   ├── fetch_bs_battery_motorcycles.py    # Motorcycle list extraction
│   ├── fetch_all_oem.py           # OEM battery comparison
│   ├── fetch_bs_battery_polarity.py       # Battery polarity extraction
│   ├── bs_battery_full_data_updated.csv   # Output data
│   ├── BS-BATTERY-README.md       # Quick start guide
│   └── BS-BATTERY-SCRAPER-GUIDE.md        # Detailed documentation
│
└── README.md                      # This file

🚀 Quick Start

Prerequisites

  • Python 3.7 or higher
  • Chrome browser (for BS Battery scraper)
  • Internet connection

Installation

  1. Clone or download the project:
git clone <repository-url>
cd scrappers
  2. Install dependencies for the JT Sprockets scraper:
cd jt-scrapper
pip install -r requirements.txt
  3. Install dependencies for the BS Battery scraper:
cd ../bs-battery-scrapper
pip install requests selenium

🔧 JT Sprockets Scraper

Overview

Extracts detailed motorcycle sprocket and chain specifications from the JT Sprockets catalog.

Usage

  1. Extract model links:
cd jt-scrapper
python scraper.py
  • Generates: model_links.txt, missing_manufacturers.txt
  2. Extract detailed model data:
python model_data_scraper.py
  • Generates: sprocket_data.csv, model_data_scraper.log

Features

  • Scrapes 64 motorcycle manufacturers
  • Extracts sprocket models, sizes, and chain specifications
  • Progress tracking with resume capability
  • Rate limiting for respectful scraping
  • Comprehensive error handling

Output Data

  • Manufacturer, model, and year information
  • Front and rear sprocket specifications
  • Chain type and length data
  • Available size options

🔋 BS Battery Scraper

Overview

Extracts motorcycle battery compatibility and specifications from the BS Battery website.

Usage

  1. Extract motorcycle list:
cd bs-battery-scrapper
python fetch_bs_battery_motorcycles.py
  • Generates: bs_battery_motorcycles.csv
  2. Extract battery specifications:
python fetch_bs_battery_specs.py  # Note: this script is not in the repository yet and needs to be created
  • Generates: bs_battery_full_data.csv

Features

  • Multi-step data collection process
  • Progress tracking with resume capability
  • OEM battery comparison functionality
  • Polarity detection for batteries
  • Selenium-based browser automation

Output Data

  • Motorcycle manufacturer, model, and year ranges
  • Battery article numbers and types
  • Technical specifications (capacity, CCA, dimensions)
  • Polarity information

⚙️ Configuration

JT Sprockets Configuration

Modify constants in the Python files:

# In scraper.py
RATE_LIMIT_DELAY = 1.5  # Seconds between requests
TARGET_MANUFACTURERS = [...]  # List of manufacturers

# In model_data_scraper.py
RATE_LIMIT_DELAY = 2.5  # Seconds between requests
REQUEST_TIMEOUT = 30     # Request timeout
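
As a rough illustration of how these constants might be applied, the sketch below pauses RATE_LIMIT_DELAY seconds between consecutive requests. The helper name fetch_politely and its fetch and sleep parameters are hypothetical and are not taken from scraper.py or model_data_scraper.py; the fetch callable is injected so the pacing logic can be tried without touching the network.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")

RATE_LIMIT_DELAY = 2.5  # seconds between requests (value listed for model_data_scraper.py)
REQUEST_TIMEOUT = 30    # handed to the HTTP client by the caller; assumed to be in seconds


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    delay: float = RATE_LIMIT_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results: List[T] = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # no pause is needed before the very first request
        results.append(fetch(url))
    return results
```

With the requests library, the fetch argument could be something like `lambda url: requests.get(url, timeout=REQUEST_TIMEOUT)`, which is where the timeout constant would come into play.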

BS Battery Configuration

Update API endpoints based on website structure:

# API endpoints may need adjustment based on website updates
BASE_URL = "https://bs-battery.com"
API_ENDPOINTS = {
    'manufacturers': '/wp-admin/admin-ajax.php?action=bf_get_manufacturers',
    'models': '/wp-admin/admin-ajax.php?action=bf_get_models',
    # ... etc
}
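
One possible way to turn this mapping into full request URLs is sketched below. The helper endpoint_url is a hypothetical name, and only the two endpoints listed above are included; the elided entries are left out rather than guessed.

```python
from urllib.parse import urljoin

BASE_URL = "https://bs-battery.com"
API_ENDPOINTS = {
    "manufacturers": "/wp-admin/admin-ajax.php?action=bf_get_manufacturers",
    "models": "/wp-admin/admin-ajax.php?action=bf_get_models",
}


def endpoint_url(name: str) -> str:
    """Return the absolute URL for a configured endpoint (hypothetical helper)."""
    try:
        path = API_ENDPOINTS[name]
    except KeyError:
        # Fail loudly so a renamed or missing endpoint is noticed immediately.
        raise KeyError(f"Unknown endpoint {name!r}; known: {sorted(API_ENDPOINTS)}")
    return urljoin(BASE_URL, path)
```

Keeping the paths in one dictionary means a website update only requires editing the mapping, not every call site.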

📊 Data Analysis Tools

JT Sprockets Analysis

  • analyze.py - Data analysis and validation
  • consolidate_years.py - Year range consolidation
  • fix_brands.py - Brand name standardization
  • get_ccm.py - Engine displacement extraction
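
To give a sense of what year range consolidation can look like, here is a minimal sketch that merges consecutive model years into (start, end) ranges. It is an assumption about the general idea only; the function name and the tuple output format are hypothetical, and consolidate_years.py may work differently.

```python
from typing import Iterable, List, Tuple


def consolidate_years(years: Iterable[int]) -> List[Tuple[int, int]]:
    """Merge consecutive years into inclusive (start, end) ranges."""
    ranges: List[Tuple[int, int]] = []
    for year in sorted(set(years)):  # duplicates and input order do not matter
        if ranges and year == ranges[-1][1] + 1:
            # Year directly follows the current range, so extend it.
            ranges[-1] = (ranges[-1][0], year)
        else:
            # A gap (or the first year) starts a new range.
            ranges.append((year, year))
    return ranges
```

For example, the years 2001, 2002, 2003, 2007, 2008 would collapse into two ranges, 2001 to 2003 and 2007 to 2008.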

BS Battery Analysis

  • fetch_all_oem.py - OEM battery comparison with existing data
  • fetch_bs_battery_polarity.py - Battery polarity detection

🔄 Progress Tracking

Both scrapers implement robust progress tracking:

  • JT Sprockets: scraper_progress.json
  • BS Battery: bs_battery_progress.json, bs_battery_specs_progress.json

Scrapers can be interrupted and resumed without losing progress.
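
The general resume pattern can be sketched as follows. This is not the repository's implementation: the function names and the JSON layout (a single "completed" list) are assumptions made for illustration, and the real progress files may use a different schema.

```python
import json
import os
from typing import Callable, Iterable, Set


def load_progress(path: str) -> Set[str]:
    """Return the set of already-processed items, or an empty set on a first run."""
    try:
        with open(path, encoding="utf-8") as handle:
            return set(json.load(handle)["completed"])
    except FileNotFoundError:
        return set()


def save_progress(path: str, completed: Set[str]) -> None:
    """Write to a temporary file, then swap it in, to avoid half-written files."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump({"completed": sorted(completed)}, handle)
    os.replace(tmp_path, path)


def run_with_resume(items: Iterable[str], process: Callable[[str], None], path: str) -> None:
    """Process each item once, skipping anything recorded by an earlier run."""
    completed = load_progress(path)
    for item in items:
        if item in completed:
            continue
        process(item)
        completed.add(item)
        save_progress(path, completed)  # saved after every item so little work is lost
```

Saving after every item costs a small write per step but means an interrupted run repeats at most the item that was in flight.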

🛠️ Troubleshooting

Common Issues

  1. Rate Limiting Errors

    • Increase delay between requests
    • Check website's robots.txt for guidelines
  2. Website Structure Changes

    • Update HTML selectors in extraction methods
    • Verify API endpoints are current
  3. Network Issues

    • Check internet connection
    • Verify website availability
    • Adjust timeout settings
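
For the robots.txt check mentioned under rate limiting, Python's standard urllib.robotparser module can interpret the file. The sketch below parses a made-up robots.txt string (it is not the content of either target site) so it runs offline; the helper name build_parser and the user agent "example-scraper" are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up example content, used only to demonstrate the parser.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 3
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = parser.can_fetch("example-scraper", "https://example.com/catalog")
blocked = parser.can_fetch("example-scraper", "https://example.com/private/page")
suggested_delay = parser.crawl_delay("example-scraper")  # None if no Crawl-delay is given
```

If a site publishes a Crawl-delay, using the larger of that value and the configured delay is a conservative choice.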

Debug Mode

Enable debug logging by modifying logging configuration:

import logging

# Keep any other existing arguments (such as format or handlers) unchanged.
logging.basicConfig(level=logging.DEBUG)

📈 Performance

JT Sprockets

  • ~2.5 seconds per URL (rate limited)
  • 1,000 URLs: ~42 minutes
  • 5,000 URLs: ~3.5 hours
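
These estimates come from multiplying the URL count by the per-URL delay (1,000 × 2.5 s = 2,500 s, roughly 42 minutes). A small helper, with hypothetical names, makes it easy to redo the arithmetic for other URL counts or delays; it ignores network and parsing time, so real runs will take somewhat longer.

```python
def estimate_runtime_seconds(url_count: int, seconds_per_url: float = 2.5) -> float:
    """Lower-bound runtime: only the rate-limit delay is counted."""
    return url_count * seconds_per_url


def format_duration(seconds: float) -> str:
    """Render a duration as whole hours and minutes, rounded to the nearest minute."""
    total_minutes = round(seconds / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes} min"


small_run = format_duration(estimate_runtime_seconds(1000))  # about 42 minutes
large_run = format_duration(estimate_runtime_seconds(5000))  # about 3.5 hours
```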

BS Battery

  • Variable based on website response times
  • Progress saved after each manufacturer/model

🔒 Legal and Ethical Considerations

  • Respect website terms of service
  • Implement rate limiting to avoid overloading servers
  • Use scraped data responsibly and in compliance with applicable laws
  • Consider caching and data retention policies

🀝 Contributing

When contributing to this project:

  1. Follow existing code style and structure
  2. Add comprehensive documentation for new features
  3. Test scrapers with small datasets first
  4. Update configuration files as needed
  5. Maintain backward compatibility with existing data formats

📄 License

This project is intended for educational and research purposes. Please ensure compliance with website terms of service and applicable laws when using these scrapers.

📞 Support

For issues or questions:

  1. Check the specific scraper's README files
  2. Review log files for detailed error information
  3. Verify input file formats and configurations
  4. Test with small datasets before full runs

Last Updated: 2024-01-15
Project Status: Active Development
Compatibility: Python 3.7+
