🎵 Hype Check - Predicting Spotify Song Popularity Using Machine Learning

Can we predict whether a song will become popular on Spotify by analyzing how it sounds and where it lives across platforms?

Authors: Hanane Mamalik & Kanak Yadav
Program: Ironhack Data Analytics Bootcamp — Berlin, 2026
Presentation: View Google Slides

📌 Project Overview

Hype Check is a supervised machine learning classification project that predicts whether a song will be popular on Spotify (Popularity ≥ 70, roughly the top 30% of tracks) using a combination of audio DNA (danceability, speechiness, tempo, etc.) and platform presence signals (Shazam counts, AirPlay spins, YouTube views, TikTok views).

Four ML algorithms and a voting ensemble were built and benchmarked; the four base models were also tuned:

Model	Default Accuracy	Tuned Accuracy
🏆 XGBoost	0.7674	0.7637
Ensemble (Voting)	—	0.7637
Decision Tree	0.7473	0.7289
KNN	0.6557	0.5788
Random Forest	0.6374	0.6190

Best model: XGBoost (default) at 76.74% accuracy.

📂 Repository Structure

Spotify-Machine-Learning-Project/
│
├── data/
│   ├── processed/
│   │   ├── final_spotify.csv               # Final merged & feature-engineered dataset
│   │   └── spotify_kaggle_dataset_clean.csv  # Cleaned Kaggle dataset
│   └── raw/
│       ├── audio_features.csv              # Audio features from RapidAPI (Musixae)
│       ├── isrc_to_spotify_id.csv          # ISRC → Spotify ID mapping
│       ├── spotify_dataset.csv             # Original Kaggle dataset
│       └── spotify_with_genres.csv         # Dataset with genre labels
│
├── notebooks/
│   ├── 1_kaggle_data_cleaning.ipynb        # Data cleaning & preprocessing
│   ├── 2_data_load_api_preprocessing.ipynb # API data collection & merging
│   └── 3_modeling.ipynb                    # Feature engineering, model building, tuning & evaluation
│
└── README.md

🗃️ Data Sources

Dataset 1: Most Streamed Spotify Songs 2024 (Kaggle)

~4,600 tracks with:

Spotify streams, playlist & chart counts
YouTube views & likes, TikTok views & likes
Shazam counts, AirPlay spins
Genre, ISRC code

Dataset 2: Audio Features via RapidAPI (Musixae)

Fetched using ISRC → Spotify ID → audio features pipeline:

Danceability, Energy, Loudness, Valence, Tempo
Speechiness, Acousticness, Instrumentalness, Liveness
Key, Mode, Time Signature, Duration (ms)

Note: Spotify's native audio features endpoint was deprecated in November 2024. Musixae API was used as the alternative - see this article for how we found and connected to it.
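The ISRC → Spotify ID → audio features chain amounts to two joins. Here is a minimal sketch with pandas, using tiny in-memory frames in place of the CSVs under data/raw/. The column names (isrc, spotify_id and the rest) are assumptions for illustration; the actual files may name them differently.

```python
import pandas as pd

# Toy stand-ins for the three raw tables; column names are assumptions.
kaggle = pd.DataFrame({
    "isrc": ["AAA111", "BBB222", "CCC333"],
    "track": ["Song A", "Song B", "Song C"],
    "shazam_counts": [1200, 300, 50],
})
isrc_map = pd.DataFrame({
    "isrc": ["AAA111", "BBB222"],
    "spotify_id": ["id_a", "id_b"],
})
audio = pd.DataFrame({
    "spotify_id": ["id_a", "id_b"],
    "danceability": [0.71, 0.45],
    "tempo": [120.0, 96.5],
})

# Step 1: attach a Spotify ID to each Kaggle row via its ISRC.
# Step 2: attach audio features via the Spotify ID.
# Inner joins drop tracks that cannot be matched at either step.
merged = (
    kaggle
    .merge(isrc_map, on="isrc", how="inner")
    .merge(audio, on="spotify_id", how="inner")
)

print(merged[["track", "spotify_id", "danceability", "shazam_counts"]])
```

In this toy example "Song C" has no ISRC mapping and drops out. Unmatched tracks of this kind are one plausible reason a merged dataset ends up smaller than its Kaggle source.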

Final Dataset

2,727 tracks × 28 features after merging and cleaning
Target: is_popular (binary) — 1,957 not popular / 770 popular
Train/test split: 80/20 stratified
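To make the class balance and the stratified split concrete, the sketch below rebuilds a label vector with the reported class counts and splits it 80/20 with scikit-learn. The single feature column is a placeholder and the random seed is arbitrary; neither comes from the notebooks.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the class counts reported above: 1,957 not popular, 770 popular.
y = np.array([0] * 1957 + [1] * 770)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not real data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42  # seed is arbitrary
)

# Stratification keeps the popular share nearly identical in both parts.
print(f"popular share overall: {y.mean():.3f}")
print(f"popular share train:   {y_train.mean():.3f}")
print(f"popular share test:    {y_test.mean():.3f}")

# Accuracy of always predicting "not popular" on the full dataset.
majority_baseline = 1 - y.mean()
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

The baseline works out to 1,957 / 2,727 ≈ 0.718. That is a useful yardstick for the accuracy table: the best reported score (0.7674) sits about five percentage points above always predicting "not popular".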

🛠️ Tech Stack

Category	Tools
Language	Python 3.x
Data manipulation	Pandas, NumPy
Visualisation	Matplotlib, Seaborn
Machine Learning	Scikit-learn (KNN, DT, RF, Voting Ensemble), XGBoost
Hyperparameter tuning	GridSearchCV with 5-Fold Cross-Validation
Data collection	Spotify API, Musixae/RapidAPI
Notebook environment	Jupyter Notebook

⚙️ Feature Engineering

18 features across two layers:

Audio DNA (how it sounds)
Danceability, Energy, Valence, Loudness, Tempo, Speechiness, Acousticness, Instrumentalness, Liveness, Key, Mode, Time_signature, Duration_ms

Platform Presence (where it lives)
YouTube_Views, TikTok_Views, Shazam_Counts, AirPlay_Spins, Genre (label-encoded), Artist_Track_Count (engineered)

Target Variable
is_popular = 1 if spotify_popularity ≥ 70, else 0
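A short pandas sketch of the target and the two non-numeric steps listed above. Two assumptions are made here: Artist_Track_Count is read as "number of tracks by the same artist in the dataset", and Genre is encoded with pandas category codes rather than whichever encoder the notebook uses. The toy rows and lower-case column names are illustrative only.

```python
import pandas as pd

# Toy rows; column names are assumptions modelled on the feature names above.
df = pd.DataFrame({
    "artist": ["A", "A", "B", "C", "A"],
    "genre": ["pop", "pop", "rock", "hip hop", "rock"],
    "spotify_popularity": [82, 69, 70, 45, 71],
})

# Binary target: 1 when popularity is at least 70, else 0.
df["is_popular"] = (df["spotify_popularity"] >= 70).astype(int)

# Label-encode genre as integer codes (categories are sorted alphabetically).
df["Genre"] = df["genre"].astype("category").cat.codes

# Assumed definition: number of tracks each artist has in the dataset.
df["Artist_Track_Count"] = df.groupby("artist")["artist"].transform("count")

print(df)
```

Note that the threshold is inclusive: a track with popularity exactly 70 is labelled popular, while 69 is not.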

🔍 Key Findings

What drives song popularity?

Feature importance averaged across Decision Tree, Random Forest, and XGBoost:

Rank	Feature	Importance Score
1	Shazam Counts	0.4546
2	AirPlay Spins	0.1476
3	Speechiness	0.1441
4	Danceability	0.0841
5	Duration_ms	0.0451

Where a song lives matters more than how it sounds. Platform discoverability is by far the dominant predictor: Shazam Counts and AirPlay Spins together account for about 0.60 of the averaged importance. Among the audio features, Speechiness and Danceability still hold meaningful predictive power.
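Averaging importances across tree-based models can be sketched as follows. This runs on synthetic data with placeholder feature names. scikit-learn's GradientBoostingClassifier stands in for XGBoost so the block needs only scikit-learn, and the use of a simple mean is an assumption about how the scores were combined.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder feature names (not the project's data).
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
feature_names = ["f0", "f1", "f2", "f3", "f4"]

models = [
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(n_estimators=100, random_state=0),
    # Stand-in for XGBoost; XGBClassifier also exposes feature_importances_.
    GradientBoostingClassifier(random_state=0),
]

# Each fitted model's importances sum to 1, so their mean does too.
importances = np.mean([m.fit(X, y).feature_importances_ for m in models], axis=0)

for name, score in sorted(zip(feature_names, importances), key=lambda p: -p[1]):
    print(f"{name}: {score:.4f}")
```

Impurity-based importances like these describe how the fitted trees split, not causal influence, so a ranking of this kind is best read as a description of the models.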

Why did tuning hurt?

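GridSearchCV with 5-fold CV was applied to all four models. Tuning lowered accuracy for every model, only slightly for XGBoost (0.7674 to 0.7637) and most sharply for KNN (0.6557 to 0.5788). The default XGBoost (76.7%) outperformed every tuned version, which suggests the default hyperparameters were already near-optimal for this dataset and that tuning slightly overfit the CV folds. GridSearchCV picks the parameters with the best mean cross-validation score, so it does not guarantee a better score on held-out data.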
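A minimal sketch of a default-versus-tuned comparison, using a Decision Tree on synthetic data. The parameter grid is hypothetical and does not reproduce the notebook's search space. The point is only to show where the cross-validation score and the held-out score come from.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 28% positives), not the project's dataset.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.72],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Default model, scored on the held-out test set.
default_model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
default_acc = default_model.score(X_test, y_test)

# Hypothetical grid; the notebook's actual search space may differ.
param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# best_score_ is the mean CV accuracy; the test score is measured separately.
tuned_acc = search.score(X_test, y_test)
print("best params:", search.best_params_)
print(f"mean CV accuracy of best params: {search.best_score_:.4f}")
print(f"test accuracy default: {default_acc:.4f}  tuned: {tuned_acc:.4f}")
```

Comparing best_score_ with the two test accuracies shows whether a tuned configuration that wins in cross-validation also wins on the held-out split.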

Model limitations

Low recall on the Popular class (~28–34%): models are conservative — precise when they call a song popular, but miss many actual hits
Survivorship bias: the dataset skews toward charting/mainstream songs, underrepresenting indie and niche music
Popularity ≠ quality — the model predicts commercial success metrics, not artistic merit

🚀 Getting Started

1. Clone the repository

git clone https://github.com/Kanak2208/Spotify-Machine-Learning-Project.git
cd Spotify-Machine-Learning-Project

2. Install dependencies

pip install pandas numpy matplotlib seaborn scikit-learn xgboost jupyter requests

3. Set up data

Download the Most Streamed Spotify Songs 2024 dataset from Kaggle and place the CSV in data/raw/ (the repository stores it as spotify_dataset.csv).

Audio features were collected via RapidAPI (Musixae). You'll need a RapidAPI key to reproduce the collection step. Alternatively, use the pre-processed file already in data/processed/ (final_spotify.csv).

4. Run the notebooks

jupyter notebook

Open the notebooks in order:

1_kaggle_data_cleaning.ipynb — clean and explore the Kaggle dataset
2_data_load_api_preprocessing.ipynb — fetch audio features via API and merge datasets
3_modeling.ipynb — feature engineering, model training, tuning, and evaluation

If you're using the pre-processed data, you can jump straight to 3_modeling.ipynb.

📊 Evaluation Metrics Used

Accuracy — overall % of correct predictions
Precision — when the model says "popular", how often is it right?
Recall — of all actual popular songs, how many did it catch?
F1 Score — harmonic mean of precision and recall
5-Fold Cross-Validation — used during GridSearchCV for reliable generalisation estimates
Confusion Matrix — per-model breakdown of true/false positives and negatives
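The four headline metrics all follow from the confusion-matrix counts. The sketch below computes them in plain Python. The counts are made up for illustration and are not results from the notebooks.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": (tp + tn) / total, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts for illustration only, not results from the notebooks.
metrics = classification_metrics(tp=30, fp=10, fn=70, tn=290)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these invented counts, accuracy is 0.80 and precision is 0.75 while recall is only 0.30. This is the pattern described under Model limitations: a model can look accurate on an imbalanced dataset and still miss most of the popular songs.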

🔮 Future Work

Address low recall — explore SMOTE, class weighting, or threshold tuning to better identify popular songs
Larger & more balanced dataset — include more independent/underground artists to reduce survivorship bias
NLP on lyrics — add sentiment analysis and lyrical complexity features
Time-series analysis — predict not just if a song will pop, but when
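Two of the recall ideas above, class weighting and threshold tuning, can be sketched with scikit-learn alone. The example uses synthetic data and a Random Forest. The 0.3 threshold is an arbitrary illustrative value, and SMOTE is left out because it needs the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 28% positives), not the project's dataset.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.72],
                           flip_y=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# class_weight="balanced" reweights samples inversely to class frequency.
model = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                               random_state=0).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # estimated P(popular)

recalls = {}
for threshold in (0.5, 0.3):  # 0.3 is an arbitrary illustrative cut-off
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, pred)
    precision = precision_score(y_test, pred, zero_division=0)
    print(f"threshold={threshold}: recall={recalls[threshold]:.3f} "
          f"precision={precision:.3f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases. The cost usually shows up as lower precision, which is why a threshold is normally chosen on a validation split, not the test set.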

🌐 Connect

Kanak Yadav

GitHub: Kanak2208
LinkedIn: kanakyadav22

Hanane Mamalik

GitHub: mhananem
LinkedIn: hanane-mamalik

📄 License

This project is open source and available under the MIT License.
