A collection of web scrapers for gathering motorcycle parts data from manufacturers' websites. It includes scrapers for the JT Sprockets and BS Battery websites, each designed to extract motorcycle model and parts information.
This repository contains two main scraping projects:
- JT Sprockets Scraper - Extracts motorcycle sprocket and chain specifications
- BS Battery Scraper - Extracts motorcycle battery compatibility and specifications
Both projects are designed for data collection, analysis, and integration with motorcycle parts databases.
scrappers/
├── jt-scrapper/                             # JT Sprockets data extraction
│   ├── scraper.py                           # Main scraper for model links
│   ├── model_data_scraper.py                # Detailed model data extraction
│   ├── analyze.py                           # Data analysis utilities
│   ├── consolidate_years.py                 # Year consolidation script
│   ├── fix_brands.py                        # Brand name correction
│   ├── get_ccm.py                           # Engine displacement extraction
│   ├── requirements.txt                     # Python dependencies
│   ├── README.md                            # JT-specific documentation
│   ├── MODEL_DATA_SCRAPER_README.md
│   ├── model_links.txt                      # Generated model URLs
│   ├── scraper_progress.json                # Progress tracking
│   └── sprocket_data_consolidated_fixed.csv # Output data
│
├── bs-battery-scrapper/                     # BS Battery data extraction
│   ├── fetch_bs_battery_motorcycles.py      # Motorcycle list extraction
│   ├── fetch_all_oem.py                     # OEM battery comparison
│   ├── fetch_bs_battery_polarity.py         # Battery polarity extraction
│   ├── bs_battery_full_data_updated.csv     # Output data
│   ├── BS-BATTERY-README.md                 # Quick start guide
│   └── BS-BATTERY-SCRAPER-GUIDE.md          # Detailed documentation
│
└── README.md                                # This file
- Python 3.7 or higher
- Chrome browser (for BS Battery scraper)
- Internet connection
- Clone or download the project
git clone <repository-url>
cd scrappers
- Install dependencies for the JT Sprockets scraper
cd jt-scrapper
pip install -r requirements.txt
- Install dependencies for the BS Battery scraper
cd ../bs-battery-scrapper
pip install requests selenium

Extracts detailed motorcycle sprocket and chain specifications from the JT Sprockets catalog.
- Extract model links:
cd jt-scrapper
python scraper.py
  - Generates: model_links.txt, missing_manufacturers.txt
- Extract detailed model data:
python model_data_scraper.py
  - Generates: sprocket_data.csv, model_data_scraper.log
- Scrapes 64 motorcycle manufacturers
- Extracts sprocket models, sizes, and chain specifications
- Progress tracking with resume capability
- Rate limiting for respectful scraping
- Comprehensive error handling
- Manufacturer, model, and year information
- Front and rear sprocket specifications
- Chain type and length data
- Available size options
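
The rate limiting listed among the features can be pictured with a small helper. This is only a sketch of the general technique, not the code in scraper.py: the `RateLimiter` class and its `clock` and `sleep` parameters are hypothetical names introduced here for illustration, and the 1.5 second value simply mirrors the RATE_LIMIT_DELAY shown for scraper.py in this README.

```python
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the limiter can be tested
        # without actually waiting (hypothetical design choice).
        self.delay_seconds = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to keep calls delay_seconds apart."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.delay_seconds - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


if __name__ == "__main__":
    limiter = RateLimiter(1.5)
    for url in ["page-1", "page-2", "page-3"]:  # placeholder work items
        limiter.wait()  # call this immediately before each request
        print("would fetch", url)
```

Measuring the time since the previous request, rather than always sleeping for the full delay, means slow responses do not add extra waiting on top of the configured interval.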
Extracts motorcycle battery compatibility and specifications from the BS Battery website.
- Extract motorcycle list:
cd bs-battery-scrapper
python fetch_bs_battery_motorcycles.py
  - Generates: bs_battery_motorcycles.csv
- Extract battery specifications:
python fetch_bs_battery_specs.py # Note: This file needs to be created
  - Generates: bs_battery_full_data.csv
- Multi-step data collection process
- Progress tracking with resume capability
- OEM battery comparison functionality
- Polarity detection for batteries
- Selenium-based browser automation
- Motorcycle manufacturer, model, and year ranges
- Battery article numbers and types
- Technical specifications (capacity, CCA, dimensions)
- Polarity information
Modify constants in the Python files:
# In scraper.py
RATE_LIMIT_DELAY = 1.5 # Seconds between requests
TARGET_MANUFACTURERS = [...] # List of manufacturers
# In model_data_scraper.py
RATE_LIMIT_DELAY = 2.5 # Seconds between requests
REQUEST_TIMEOUT = 30 # Request timeout

Update API endpoints based on website structure:
# API endpoints may need adjustment based on website updates
BASE_URL = "https://bs-battery.com"
API_ENDPOINTS = {
'manufacturers': '/wp-admin/admin-ajax.php?action=bf_get_manufacturers',
'models': '/wp-admin/admin-ajax.php?action=bf_get_models',
# ... etc
}

- analyze.py - Data analysis and validation
- consolidate_years.py - Year range consolidation
- fix_brands.py - Brand name standardization
- get_ccm.py - Engine displacement extraction

- fetch_all_oem.py - OEM battery comparison with existing data
- fetch_bs_battery_polarity.py - Battery polarity detection
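
Returning to the BS Battery endpoint configuration above, the sketch below shows one way such a table could be turned into full request URLs. It is an illustration under assumptions, not the scraper's actual code: `build_url` is a hypothetical helper, and the `manufacturer_id` query parameter in the usage example is made up, since the real parameter names depend on the website.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://bs-battery.com"
API_ENDPOINTS = {
    "manufacturers": "/wp-admin/admin-ajax.php?action=bf_get_manufacturers",
    "models": "/wp-admin/admin-ajax.php?action=bf_get_models",
}


def build_url(name, **params):
    """Join BASE_URL with a named endpoint and append extra query parameters."""
    url = urljoin(BASE_URL, API_ENDPOINTS[name])
    if params:
        # The endpoints already carry ?action=..., so extend with '&'.
        separator = "&" if "?" in url else "?"
        url += separator + urlencode(params)
    return url


if __name__ == "__main__":
    print(build_url("manufacturers"))
    # manufacturer_id is a hypothetical parameter name used for illustration.
    print(build_url("models", manufacturer_id=12))
```

Keeping the paths in one dictionary means a website update only requires editing the table, not every call site.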
Both scrapers implement robust progress tracking:
- JT Sprockets: scraper_progress.json
- BS Battery: bs_battery_progress.json, bs_battery_specs_progress.json
Scrapers can be interrupted and resumed without losing progress.
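
A minimal sketch of how resumable progress tracking of this kind can work is shown below. The JSON layout (a single `completed` list) and the function names `load_progress`, `save_progress`, and `run` are assumptions made for illustration; the actual format of scraper_progress.json and the BS Battery progress files may differ.

```python
import json
import os
import tempfile


def load_progress(path):
    """Return saved progress, or an empty record if no file exists yet."""
    if not os.path.exists(path):
        return {"completed": []}
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def save_progress(path, progress):
    """Write progress to a temp file, then swap it in with os.replace,
    so an interruption mid-write does not corrupt the saved file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(progress, fh, indent=2)
    os.replace(tmp_path, path)


def run(items, path, process):
    """Process each item once, skipping those recorded by an earlier run."""
    progress = load_progress(path)
    done = set(progress["completed"])
    for item in items:
        if item in done:
            continue
        process(item)
        progress["completed"].append(item)
        done.add(item)
        save_progress(path, progress)  # checkpoint after every item
    return progress


if __name__ == "__main__":
    # demo_progress.json is a placeholder file name for this sketch.
    run(["honda", "yamaha"], "demo_progress.json", lambda item: print("scraping", item))
```

Saving after every item costs a small write per step but keeps the amount of repeated work after an interruption to at most one item.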
- Rate Limiting Errors
  - Increase delay between requests
  - Check website's robots.txt for guidelines
- Website Structure Changes
  - Update HTML selectors in extraction methods
  - Verify API endpoints are current
- Network Issues
  - Check internet connection
  - Verify website availability
  - Adjust timeout settings
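
For rate limiting and transient network errors, one common remedy is to retry with a growing delay. The sketch below illustrates that idea and is not taken from the scrapers themselves: `fetch_with_backoff`, its parameters, and the doubling schedule are all assumptions, and `fetch` stands in for whatever function performs the real request.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.5, sleep=time.sleep):
    """Call fetch(url), doubling the wait after each failed attempt.

    Re-raises the last error once max_attempts is exhausted.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            # In real use, narrow this to the HTTP library's error types.
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= 2


if __name__ == "__main__":
    attempts = {"count": 0}

    def unreliable_fetch(url):
        # Stand-in for a real request: fails once, then succeeds.
        attempts["count"] += 1
        if attempts["count"] == 1:
            raise ConnectionError("temporary failure")
        return "response for " + url

    print(fetch_with_backoff(unreliable_fetch, "example-page", base_delay=0.1))
```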
Enable debug logging by modifying logging configuration:
logging.basicConfig(level=logging.DEBUG, ...)

- ~2.5 seconds per URL (rate limited)
- 1,000 URLs: ~42 minutes
- 5,000 URLs: ~3.5 hours
- Variable based on website response times
- Progress saved after each manufacturer/model
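
The runtime figures above follow from multiplying the URL count by the per-URL delay. The helper below reproduces that arithmetic as a rough planning aid; the function names are hypothetical, and the optional `avg_response_seconds` term is an assumption added to account for server response time, which the quoted figures leave out.

```python
def estimate_runtime_seconds(url_count, delay_seconds=2.5, avg_response_seconds=0.0):
    """Estimate total run time as url_count * (delay + average response time)."""
    return url_count * (delay_seconds + avg_response_seconds)


def format_duration(seconds):
    """Format a duration in seconds as hours and minutes, rounded to the minute."""
    total_minutes = round(seconds / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


if __name__ == "__main__":
    for count in (1000, 5000):
        print(count, "URLs:", format_duration(estimate_runtime_seconds(count)))
```

With the default 2.5 second delay this gives 2,500 seconds (about 42 minutes) for 1,000 URLs and 12,500 seconds (about 3.5 hours) for 5,000 URLs, matching the estimates listed above.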
- Respect website terms of service
- Implement rate limiting to avoid overloading servers
- Use scraped data responsibly and in compliance with applicable laws
- Consider caching and data retention policies
When contributing to this project:
- Follow existing code style and structure
- Add comprehensive documentation for new features
- Test scrapers with small datasets first
- Update configuration files as needed
- Maintain backward compatibility with existing data formats
This project is intended for educational and research purposes. Please ensure compliance with website terms of service and applicable laws when using these scrapers.
For issues or questions:
- Check the specific scraper's README files
- Review log files for detailed error information
- Verify input file formats and configurations
- Test with small datasets before full runs
Last Updated: 2024-01-15
Project Status: Active Development
Compatibility: Python 3.7+