NBA Contract Value Analyzer

Are NBA teams getting fair value from their contracts?
An end-to-end machine learning system that scrapes current player stats, predicts a market-rate salary for each player, and ranks the league's most overpaid and underpaid deals.

Methodology — Report — Deploy Notes

The Question

The 2024-25 NBA salary cap sits at $140.6 million per team. Front offices are making bets worth tens of millions on player performance, and the public data to evaluate those bets is freely available. Is a player's actual contract consistent with what a statistical model trained on four years of contract data would predict?

This project answers that question for every NBA player on a current deal, surfaces the largest discrepancies, and explains what drives each prediction.
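The discrepancy ranking described above can be sketched in a few lines of pandas. This is a minimal illustration, not the project's actual code: the column names (`actual_salary`, `predicted_log_salary`) and the helper `rank_contract_gaps` are assumptions made for the example, and the players are placeholders.

```python
import numpy as np
import pandas as pd


def rank_contract_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Sort players by actual minus predicted salary (hypothetical column names)."""
    out = df.copy()
    # The model predicts log salary, so exponentiate to get back to dollars.
    out["predicted_salary"] = np.exp(out["predicted_log_salary"])
    # Positive gap: paid more than predicted. Negative gap: paid less.
    out["gap"] = out["actual_salary"] - out["predicted_salary"]
    return out.sort_values("gap", ascending=False).reset_index(drop=True)


# Placeholder players, not real contracts.
players = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "actual_salary": [30_000_000, 5_000_000, 12_000_000],
    "predicted_log_salary": np.log([20_000_000, 15_000_000, 12_500_000]),
})
ranked = rank_contract_gaps(players)
print(ranked[["player", "actual_salary", "predicted_salary", "gap"]])
```

The head of the sorted frame holds the most overpaid deals under this definition, and the tail holds the most underpaid.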

How It Works

1. Scrape current-season per-game and advanced stats from Basketball-Reference.
2. Engineer features: per-36 stats (normalizing for minutes), advanced metrics (PER, BPM, VORP, Win Shares), an age polynomial (career arc), and position dummies (position-specific pay curves).
3. Train a LightGBM model on four seasons of historical salary data with a time-series validation split: seasons 2022-2024 train the model, and the held-out 2025 season evaluates it.
4. Rank players by the gap between predicted and actual salary.
5. Surface results via an interactive Streamlit dashboard.
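As a rough illustration of the feature-engineering step, the sketch below builds per-36 stats, a squared age term, and position dummies with pandas. The column names (`mp_per_game`, `pts`, `trb`, `ast`, `pos`, `age`) and the function `engineer_features` are assumptions for this example. The real pipeline in `src/pipeline/` may name and structure things differently.

```python
import pandas as pd


def engineer_features(stats: pd.DataFrame) -> pd.DataFrame:
    """Per-36 scaling, age polynomial, and position dummies (illustrative only)."""
    feats = stats.copy()

    # Per-36: rescale per-game counting stats to a 36-minute workload.
    # Rows with zero minutes become NaN instead of dividing by zero.
    minutes = feats["mp_per_game"].where(feats["mp_per_game"] > 0)
    for col in ["pts", "trb", "ast"]:
        feats[f"{col}_per36"] = feats[col] / minutes * 36

    # Age polynomial: a squared term lets a model bend the career arc.
    feats["age_sq"] = feats["age"] ** 2

    # Position dummies: one indicator column per listed position.
    dummies = pd.get_dummies(feats["pos"], prefix="pos", dtype=int)
    return pd.concat([feats.drop(columns=["pos"]), dummies], axis=1)


# Placeholder rows, not real player stats.
stats = pd.DataFrame({
    "player": ["Player A", "Player B"],
    "pos": ["PG", "C"],
    "age": [25, 31],
    "mp_per_game": [24.0, 36.0],
    "pts": [18.0, 12.0],
    "trb": [4.0, 10.0],
    "ast": [6.0, 2.0],
})
features = engineer_features(stats)
print(features.filter(like="per36").round(1))
```

Player A's 18 points in 24 minutes scale to 27 points per 36, which is the kind of normalization that keeps bench players comparable with starters.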

Results

Honest disclosure. The numbers below come from a 1,400-row synthetic-data demo run. This is the reproducibility fallback the pipeline uses when Basketball-Reference is unavailable or when the salary sources (Spotrac, HoopsHype) cannot be reached. Both salary sources sit behind Cloudflare bot protection, and bypassing it would require residential proxies or a headless browser with CAPTCHA solving, neither of which belongs in a public portfolio project.

On the real-data path, the pipeline scrapes Basketball-Reference for stats and joins them against a hand-curated salary dataset of roughly 75 rows. Development runs in that mode land in the R² 0.68 to 0.74 range, with higher dollar MAE because real NBA contracts vary more. The methodology is identical on both paths; only the data source differs. The synthetic vs. real-data tradeoff is documented in METHODOLOGY.md §2 and REPORT.md §Data.

Synthetic-run performance on the held-out 2025 season (350 players):

| Metric | Value |
| --- | --- |
| R² (log salary) | 0.741 |
| MAE (log scale) | 0.293 |
| MAE (dollars) | $1,390,130 |
| Training rounds | 90 (early stopping) |
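One way to read the log-scale MAE: an absolute error of 0.293 in log salary corresponds to a multiplicative miss of about exp(0.293) in dollars. The snippet below works through that arithmetic. The $10M reference salary is an arbitrary example, and treating the exponentiated MAE as a "typical" ratio is an approximation, not an exact statement about the dollar MAE in the table.

```python
import math

log_mae = 0.293  # MAE on the log scale, taken from the table above

# An error of +0.293 in log space multiplies the dollar prediction by this factor.
over_factor = math.exp(log_mae)
# An error of -0.293 divides by the same factor.
under_factor = math.exp(-log_mae)

reference_salary = 10_000_000  # arbitrary illustrative salary
print(f"over-prediction factor:  x{over_factor:.2f}")
print(f"under-prediction factor: x{under_factor:.2f}")
print(f"range around $10M: ${reference_salary * under_factor:,.0f}"
      f" to ${reference_salary * over_factor:,.0f}")
```

On this reading, a typical prediction lands within a factor of roughly 1.34 of the actual salary.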

The top five predictors by LightGBM feature gain are Win Shares, Age, VORP, PER, and BPM. Four of the five are composite value metrics, which rank above raw counting stats. That is consistent with how front offices actually evaluate players in contract negotiations.

The roughly 26% of variance in log salary left unexplained (1 - 0.741) reflects real-world factors the model cannot observe: contract timing relative to cap growth, market-size premiums, injury history, and negotiating leverage.

Project Structure

nba-contract-value/
├── src/
│   ├── scraper/                # Basketball-Reference scraper (polite, cached, retry logic)
│   ├── pipeline/               # Feature engineering: per-36, position dummies, age poly
│   ├── model/                  # LightGBM training, Optuna tuning, evaluation
│   └── app/                    # Streamlit dashboard
├── data/
│   ├── raw/                    # Scraped HTML cache and parsed CSVs
│   └── processed/              # Feature table, trained model, predictions CSV
├── notebooks/                  # End-to-end pipeline notebook with embedded outputs
├── tests/                      # 15 pytest unit tests over the feature pipeline
├── docs/                       # Methodology writeup and evaluation report
├── website/                    # Standalone HTML/CSS/JS site
├── huggingface-spaces/         # HF Spaces deploy artifacts (app.py, requirements.txt, README, notes)
├── .github/workflows/ci.yml    # Lint (ruff) + tests (pytest) on every push / PR to main
├── Makefile                    # make all / make scrape / make train / make app / make test
├── LICENSE                     # MIT
├── ruff.toml                   # Lint config — respects existing compact-style code
└── requirements.txt

Quick Start

git clone https://github.com/bass990/nba-contract-value
cd nba-contract-value
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Full pipeline (scrape -> features -> train -> save)
make all

# Dashboard
make app
# Opens at http://localhost:8501

Or run the notebook: notebooks/NBA_Contract_Value_END_TO_END.ipynb

Modeling Decisions

| Decision | What | Why |
| --- | --- | --- |
| Target | log(annual_salary) | Salaries are right-skewed; on the log scale errors are relative, so a 10% miss counts the same on a minimum deal as on a max contract |
| Algorithm | LightGBM | Best tabular performance at this dataset size; handles missingness natively |
| Validation | Time-series split | Random CV would leak future cap information into training on earlier seasons |
| Features | Per-36, advanced stats, age polynomial, position dummies | See `docs/METHODOLOGY.md` for full reasoning |
| Metric | R² + MAE in dollars | R² for the model-quality narrative; MAE for business interpretation |

Full reasoning, including alternatives considered, is in docs/METHODOLOGY.md.
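To make the validation and metric choices concrete, here is a minimal sketch of a season-based split and of MAE computed on both scales. The LightGBM model is deliberately replaced by a mean-of-training baseline so the example stays dependency-light. The column names (`season`, `annual_salary`) and both helper functions are assumptions for illustration, and the salaries are toy values.

```python
import numpy as np
import pandas as pd


def time_split(df: pd.DataFrame, holdout_season: int):
    """Train on every season before the holdout; evaluate on the holdout only."""
    train = df[df["season"] < holdout_season]
    test = df[df["season"] == holdout_season]
    return train, test


def evaluate(y_log_true: np.ndarray, y_log_pred: np.ndarray) -> dict:
    """MAE on the log scale and, after exponentiating, in dollars."""
    mae_log = float(np.mean(np.abs(y_log_true - y_log_pred)))
    mae_dollars = float(np.mean(np.abs(np.exp(y_log_true) - np.exp(y_log_pred))))
    return {"mae_log": mae_log, "mae_dollars": mae_dollars}


# Toy data: two player-seasons per year, salaries in dollars.
contracts = pd.DataFrame({
    "season": [2022, 2022, 2023, 2023, 2024, 2024, 2025, 2025],
    "annual_salary": [2e6, 20e6, 3e6, 25e6, 4e6, 30e6, 5e6, 35e6],
})
contracts["log_salary"] = np.log(contracts["annual_salary"])

train, test = time_split(contracts, holdout_season=2025)

# Stand-in for the trained model: predict the training mean of log salary.
baseline_pred = np.full(len(test), train["log_salary"].mean())
metrics = evaluate(test["log_salary"].to_numpy(), baseline_pred)

print(sorted(train["season"].unique()), sorted(test["season"].unique()))
print(metrics)
```

Because the split is by season and not by random row, nothing from 2025 (including its cap level) can influence what the model sees during training.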

Limitations

The model predicts from on-court production only. Teams also pay for market size, leadership, injury risk, and negotiating dynamics, none of which appear in the stats.
Contracts signed years ago are compared against current market rates. The age and cap-inflation features partially correct for this.
Scraping is inherently fragile. Basketball-Reference's HTML can change; the scraper caches pages to disk to reduce re-scrape frequency.
Salary data coverage is limited to the curated ~75 player-season dataset or a user-provided HoopsHype CSV. Full-league coverage requires a once-per-season manual download.
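The disk-caching and retry behavior mentioned in the scraping limitation can be sketched roughly as follows. This is not the repository's scraper: `cached_fetch`, its parameters, and the hash-based cache file naming are assumptions for illustration. The `fetch` argument is any callable that returns HTML for a URL; in real use it might wrap `requests.get(url, timeout=...).text`.

```python
import hashlib
import tempfile
import time
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch: Callable[[str], str], cache_dir,
                 retries: int = 3, backoff_seconds: float = 1.0) -> str:
    """Return cached HTML if present; otherwise fetch with retries and cache it."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per URL, keyed by its SHA-256 hash.
    path = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")

    last_error = None
    for attempt in range(retries):
        try:
            html = fetch(url)
        except Exception as exc:  # broad on purpose for this sketch
            last_error = exc
            if attempt < retries - 1:
                # Exponential backoff between attempts: 1x, 2x, 4x, ...
                time.sleep(backoff_seconds * 2 ** attempt)
        else:
            path.write_text(html, encoding="utf-8")
            return html
    raise RuntimeError(f"failed to fetch {url} after {retries} attempts") from last_error


# Demo with a fake fetcher that fails once, then succeeds (no network needed).
calls = []

def fake_fetch(url: str) -> str:
    calls.append(url)
    if len(calls) == 1:
        raise ConnectionError("simulated transient failure")
    return "<html>stats</html>"

with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.com/page", fake_fetch, tmp, backoff_seconds=0.0)
    second = cached_fetch("https://example.com/page", fake_fetch, tmp, backoff_seconds=0.0)

print(first == second, len(calls))
```

The second call is served from disk, so the fake fetcher is invoked only twice in total: one failed attempt and one successful retry.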

Tech Stack

Python 3.11+ — requests + beautifulsoup4 for scraping — pandas for ETL — lightgbm + scikit-learn + optuna for modeling — streamlit + plotly for the dashboard

License

MIT. Basketball-Reference data is the property of Sports Reference LLC and is used for personal, non-commercial purposes; NBA team and player names are trademarks of their respective owners.

Author

Mamadou Bassirou Diallo
MS Business Analytics & AI (Data Science), UT Dallas
LinkedIn — GitHub
