bass990/NBA-Contract-Value-Analyzer

NBA Contract Value Analyzer

CI · License: MIT · Python 3.11+

Are NBA teams getting fair value from their contracts?
An end-to-end machine learning system that scrapes current player stats, predicts a market-rate salary for each player, and ranks the league's most overpaid and underpaid deals.

Methodology · Report · Deploy Notes


The Question

The 2024-25 NBA salary cap sits at $140.6 million per team. Front offices are making bets worth tens of millions on player performance, and the public data to evaluate those bets is freely available. Is a player's actual contract consistent with what a statistical model trained on four years of contract data would predict?

This project answers that question for every NBA player on a current deal, surfaces the largest discrepancies, and explains what drives each prediction.

How It Works

  1. Scrape current-season per-game and advanced stats from Basketball-Reference.
  2. Engineer features — per-36 stats (normalizes for minutes), advanced metrics (PER, BPM, VORP, Win Shares), age polynomial (career arc), and position dummies (position-specific pay curves).
  3. Train a LightGBM model on four seasons of historical salary data with a time-series validation split: seasons 2022-2024 train the model, and the held-out 2025 season evaluates it.
  4. Rank players by the gap between predicted and actual salary.
  5. Surface results via an interactive Streamlit dashboard.
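Steps 2 and 4 above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the repository's actual code: the function names (`per_36`, `build_features`, `rank_by_gap`), the dictionary keys, and the sign convention for the gap (predicted minus actual, so positive means underpaid) are all assumptions made for this example. The project itself uses pandas for this work.

```python
POSITIONS = ("PG", "SG", "SF", "PF", "C")


def per_36(stat_per_game: float, minutes_per_game: float) -> float:
    """Scale a per-game stat to a 36-minute basis (0.0 if no minutes played)."""
    if minutes_per_game <= 0:
        return 0.0
    return stat_per_game * 36.0 / minutes_per_game


def build_features(row: dict) -> dict:
    """Build a small feature dict from one player row (hypothetical keys)."""
    features = {
        "pts_per36": per_36(row["pts"], row["mp"]),
        # Age polynomial: the squared term lets a model bend the career arc.
        "age": row["age"],
        "age_sq": row["age"] ** 2,
    }
    # Position dummies: one 0/1 indicator per position.
    for pos in POSITIONS:
        features[f"pos_{pos}"] = 1 if row["pos"] == pos else 0
    return features


def rank_by_gap(players: list[dict]) -> list[dict]:
    """Sort by predicted minus actual salary, most underpaid first."""
    return sorted(players, key=lambda p: p["predicted"] - p["actual"], reverse=True)


example_row = {"pts": 20.0, "mp": 30.0, "age": 25, "pos": "SG"}
example_features = build_features(example_row)

example_players = [
    {"name": "A", "predicted": 10_000_000, "actual": 25_000_000},
    {"name": "B", "predicted": 18_000_000, "actual": 6_000_000},
]
example_ranking = [p["name"] for p in rank_by_gap(example_players)]
```

Reading the ranked list from the top gives the most underpaid deals under this convention; reading it from the bottom gives the most overpaid.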

Results

Honest disclosure. The numbers below come from a 1,400-row synthetic-data demo run. This is the reproducibility fallback the pipeline uses when Basketball-Reference is unavailable or when the salary sources (Spotrac, HoopsHype) can't be reached. Both salary sources sit behind Cloudflare bot protection; bypassing it needs residential proxies or a headless browser plus CAPTCHA solving, neither of which belongs in a public portfolio project.

On the real-data path, the pipeline scrapes Basketball-Reference for stats and joins them against a hand-curated ~75-row salary dataset. Development runs in that mode land in the R² 0.68 – 0.74 range, with higher dollar MAE due to the greater variance in real NBA contracts. The methodology is identical between paths; only the data source differs. The synthetic vs. real-data tradeoff is documented in METHODOLOGY.md §2 and REPORT.md §Data.

Synthetic-run performance on the held-out 2025 season (350 players):

| Metric | Value |
| --- | --- |
| R² (log salary) | 0.741 |
| MAE (log scale) | 0.293 |
| MAE (dollars) | $1,390,130 |
| Training rounds | 90 (early stopping) |
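For readers who want to see how the three error metrics relate to each other, here is a minimal sketch using only the standard library. It assumes the model outputs predictions on the log-salary scale and that dollar MAE is computed after exponentiating them; the function names and the `evaluate` return keys are hypothetical, and the repository may compute these with scikit-learn instead.

```python
import math


def r_squared(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate(actual_salaries: list[float], predicted_log: list[float]) -> dict:
    """Score log-scale predictions against actual dollar salaries."""
    actual_log = [math.log(s) for s in actual_salaries]
    # Back-transform predictions to dollars before computing dollar MAE.
    predicted_dollars = [math.exp(p) for p in predicted_log]
    return {
        "r2_log": r_squared(actual_log, predicted_log),
        "mae_log": mae(actual_log, predicted_log),
        "mae_dollars": mae(actual_salaries, predicted_dollars),
    }
```

R² and log-scale MAE are computed on the scale the model is trained on, while the dollar MAE is the figure a reader can compare directly against contract sizes.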

The top five predictors by LightGBM feature gain are Win Shares, Age, VORP, PER, and BPM. Apart from age, all are composite efficiency stats, which outrank raw counting stats; this is consistent with how front offices actually evaluate players in contract negotiations.

The remaining ~26% of variance (1 − R² = 1 − 0.741 = 0.259) is unexplained and reflects real-world factors the model cannot observe: contract timing relative to cap growth, market-size premiums, injury history, and negotiating leverage.

Project Structure

nba-contract-value/
├── src/
│   ├── scraper/                # Basketball-Reference scraper (polite, cached, retry logic)
│   ├── pipeline/               # Feature engineering: per-36, position dummies, age poly
│   ├── model/                  # LightGBM training, Optuna tuning, evaluation
│   └── app/                    # Streamlit dashboard
├── data/
│   ├── raw/                    # Scraped HTML cache and parsed CSVs
│   └── processed/              # Feature table, trained model, predictions CSV
├── notebooks/                  # End-to-end pipeline notebook with embedded outputs
├── tests/                      # 15 pytest unit tests over the feature pipeline
├── docs/                       # Methodology writeup and evaluation report
├── website/                    # Standalone HTML/CSS/JS site
├── huggingface-spaces/         # HF Spaces deploy artifacts (app.py, requirements.txt, README, notes)
├── .github/workflows/ci.yml    # Lint (ruff) + tests (pytest) on every push / PR to main
├── Makefile                    # make all / make scrape / make train / make app / make test
├── LICENSE                     # MIT
├── ruff.toml                   # Lint config — respects existing compact-style code
└── requirements.txt

Quick Start

git clone https://github.com/bass990/nba-contract-value
cd nba-contract-value
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Full pipeline (scrape -> features -> train -> save)
make all

# Dashboard
make app
# Opens at http://localhost:8501

Or run the notebook: notebooks/NBA_Contract_Value_END_TO_END.ipynb

Modeling Decisions

| Decision | What | Why |
| --- | --- | --- |
| Target | log(annual_salary) | Right-skewed distribution; on the log scale, MAE measures relative error, so misses are weighted comparably across the salary range |
| Algorithm | LightGBM | Best tabular performance at this dataset size; handles missingness natively |
| Validation | Time-series split | Random CV leaks future cap information into past training |
| Features | Per-36, advanced stats, age polynomial, position dummies | See docs/METHODOLOGY.md for full reasoning |
| Metric | R² + MAE in dollars | R² for model quality narrative; MAE for business interpretation |

Full reasoning, including alternatives considered, is in docs/METHODOLOGY.md.
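The target and validation choices can be illustrated with a small sketch. It assumes each row carries a `season` and a `salary` field (hypothetical names) and uses plain lists instead of the project's pandas tables. The point is only that the split is by season rather than random, so no row from the evaluation season can reach training.

```python
import math


def time_split(rows: list[dict], test_season: int) -> tuple[list[dict], list[dict]]:
    """Train on every season before test_season; evaluate on test_season only."""
    train = [r for r in rows if r["season"] < test_season]
    test = [r for r in rows if r["season"] == test_season]
    return train, test


def log_target(rows: list[dict]) -> list[float]:
    """Natural log of annual salary, used as the regression target."""
    return [math.log(r["salary"]) for r in rows]


# Toy rows for illustration only; the salary values are made up.
toy_rows = [
    {"season": 2022, "salary": 2_000_000},
    {"season": 2023, "salary": 5_000_000},
    {"season": 2024, "salary": 12_000_000},
    {"season": 2025, "salary": 30_000_000},
]
toy_train, toy_test = time_split(toy_rows, test_season=2025)
toy_y_train = log_target(toy_train)
```

A random K-fold split over the same rows would mix 2025 contracts into training folds, which is the leakage the table above warns about.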

Limitations

  • The model predicts salary from on-court production alone. Teams also pay for market size, leadership, injury risk, and contract negotiating dynamics, none of which appear in the stats.
  • Contracts signed years ago are compared against current market rates. The age and cap-inflation features partially correct for this.
  • Scraping is inherently fragile. Basketball-Reference's HTML can change; the scraper caches pages to disk to reduce re-scrape frequency.
  • Salary data coverage is limited to the curated ~75 player-season dataset or a user-provided HoopsHype CSV. Full-league coverage requires a once-per-season manual download.
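The disk caching mentioned in the scraping bullet can be sketched as follows. This is an illustration under stated assumptions, not the repository's scraper: `cached_fetch`, its parameters, and the hash-based file naming are hypothetical, and the network call is passed in as a plain callable so the example runs without any HTTP library.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable


def cached_fetch(
    url: str,
    fetch: Callable[[str], str],
    cache_dir: Path,
    delay_seconds: float = 0.0,
) -> str:
    """Return the page for url, reading from disk when a cached copy exists."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_path = cache_dir / f"{key}.html"
    if cache_path.exists():
        return cache_path.read_text(encoding="utf-8")
    # Only pause (to stay polite) when a real request is about to be made.
    time.sleep(delay_seconds)
    html = fetch(url)
    cache_path.write_text(html, encoding="utf-8")
    return html
```

In real use, `fetch` would wrap something like `requests.get(url).text` with retry handling. Cached pages also let parsing code be re-run after an HTML layout change without hitting the site again.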

Tech Stack

Python 3.11+ — requests + beautifulsoup4 for scraping — pandas for ETL — lightgbm + scikit-learn + optuna for modeling — streamlit + plotly for the dashboard

License

MIT. Basketball-Reference data is the property of Sports Reference LLC and is used for personal, non-commercial purposes; NBA team and player names are trademarks of their respective owners.

Author

Mamadou Bassirou Diallo
MS Business Analytics & AI (Data Science), UT Dallas
LinkedIn · GitHub

About

LightGBM model predicting NBA market salary from on-court production, with leakage-safe time-series CV (train 2022-24, test 2025). README leads with the honest data note: a 75-row hand-curated real salary set plus a synthetic reproducibility path. 15 pytest tests, GitHub Actions CI.
