Dozens of sites on the internet sell K-Pop merchandise. Some are the official stores of entertainment companies; others are run by private businesses and individuals. The goal of this project is to identify which sites offer the best deals on K-Pop merchandise across all product types: albums, stickers, light sticks, toys, and more.
The sites currently scraped were selected for their popularity among collectors (Kpopalbums.com, Musicplaza.com, etc.). Some of these online stores also have a physical presence (choicemusicla.com). In version 1, official sites were selected as well (thejypshop.com). No estimate exists of how much money is spent at these stores, either as an absolute amount or as a percentage of total yearly spending. However, the frequency with which these sites appear in blogs, forums, and comment sections indicates that they are among the most popular places where Americans spend money on K-Pop merchandise.
Scraping is orchestrated through Google Cloud Run Jobs and Cloud Scheduler. The pipeline:
- Cloud Scheduler triggers weekly batch jobs
- Cloud Run Jobs execute containerized Selenium scrapers
- Raw data is written to Google Cloud Storage (date-partitioned)
- Data is validated, deduplicated, and loaded to BigQuery
- Cleaned dataset is automatically published to Kaggle
See pipeline/README.md for detailed architecture and setup instructions.
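
As a rough illustration of the date-partitioned storage and the validate/deduplicate steps, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the field names (`site`, `product_url`, `name`, `price`), the object path layout, and the helper names are hypothetical and are not taken from the actual pipeline, which may partition and deduplicate differently.

```python
from datetime import date

# Hypothetical schema: the real scrapers may emit different fields.
REQUIRED_FIELDS = ("site", "product_url", "name", "price")


def partition_path(site: str, run_date: date, prefix: str = "raw") -> str:
    """Build a date-partitioned object path for one scrape run (assumed layout)."""
    return f"{prefix}/site={site}/dt={run_date.isoformat()}/products.jsonl"


def is_valid(record: dict) -> bool:
    """Accept a record only if every required field is present and the price is a positive number."""
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return False
    price = record["price"]
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(price, (int, float)) and not isinstance(price, bool) and price > 0


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each (site, product_url) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (record["site"], record["product_url"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique


def clean(records: list[dict]) -> list[dict]:
    """Validate first, then deduplicate, mirroring the order of the pipeline steps."""
    return deduplicate([record for record in records if is_valid(record)])
```

In this sketch, `clean(rows)` would run on the raw rows of one weekly batch before the load step, and `partition_path("musicplaza", date.today())` would give the object name for that batch. Keying duplicates on the product URL is one possible choice; a SKU or a normalized product name could serve the same purpose.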
- ✅ Base Docker image with Chrome + Selenium
- 🚧 Refactoring scrapers for Cloud Run
- ⏳ Setting up data validation and deduplication
- ⏳ Automating Kaggle publishing