End-to-end Big Data ML project: Predicting NYC taxi trip duration with 1.5M+ records using distributed processing and advanced ML techniques
A complete Big Data Machine Learning pipeline for predicting taxi trip duration in New York City:
- 📊 Dataset: 1.5M+ taxi trip records (~200MB)
- 🔧 Processing: Distributed computing with Dask
- 🤖 Models: Multiple ML algorithms (Linear, RF, XGBoost, LightGBM)
- 📈 Analysis: Comprehensive EDA, feature engineering, and model interpretability
This project was built for a Big Data & Machine Learning course. It demonstrates large-scale data handling, the complete ML lifecycle, and professional documentation.
portfolio-ml-bigdata/
├── data/
│ ├── raw/ # Original Kaggle data (train.csv, test.csv)
│ ├── processed/ # Generated: train_processed.parquet, val_processed.parquet
│ └── sampled/ # Sample submission
├── notebooks/
│ ├── 01_data_exploration.ipynb # EDA
│ ├── 02_feature_engineering.ipynb # Feature creation
│ ├── 03_modeling.ipynb # Model training
│ └── 04_model_interpretation.ipynb # SHAP analysis
├── app/
│ └── streamlit_app.py # Interactive web demo
├── src/ # Python modules
├── models/ # Saved models (.pkl files)
├── visualizations/ # Generated plots
├── docs/ # Technical documentation
└── requirements.txt
# 1. Clone and setup
git clone https://github.com/your-username/portfolio-ml-bigdata.git
cd portfolio-ml-bigdata
python -m venv venv
.\venv\Scripts\activate # Windows
# source venv/bin/activate # macOS/Linux
pip install -r requirements.txt
# 2. Download data from Kaggle
# https://www.kaggle.com/c/nyc-taxi-trip-duration/data
# Place train.csv and test.csv in data/raw/
# 3. Run notebooks in order
jupyter notebook notebooks/
See GETTING_STARTED.md for detailed setup instructions.
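Before opening the notebooks, it can help to confirm that the Kaggle files are where step 2 says to put them. The following is a minimal sketch, not part of the repository: the helper name `missing_raw_files` is hypothetical, and it only assumes the `data/raw/` layout and the two file names listed above.

```python
from pathlib import Path

# File names taken from the setup steps above.
REQUIRED_FILES = ("train.csv", "test.csv")


def missing_raw_files(raw_dir="data/raw"):
    """Return the required file names that are not present in raw_dir."""
    raw = Path(raw_dir)
    return [name for name in REQUIRED_FILES if not (raw / name).is_file()]


if __name__ == "__main__":
    missing = missing_raw_files()
    if missing:
        print("Missing in data/raw/:", ", ".join(missing))
    else:
        print("All raw data files found.")
```

Run it from the repository root so that the relative path `data/raw` resolves correctly.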
| Document | Description |
|---|---|
| Technical Report | Complete methodology and analysis |
| Feature Dictionary | All features explained |
| GETTING_STARTED.md | Setup and execution guide |
The technical documentation is in docs/technical_report.md. The table below maps each course requirement to where it is addressed:
| Requirement | Location |
|---|---|
| 3.1 Dataset Description | docs/technical_report.md Section 1 |
| 3.3 EDA | docs/technical_report.md Section 2 + Notebook 01 |
| 3.4 Preprocessing | docs/technical_report.md Section 3 |
| 3.5 Feature Engineering | docs/technical_report.md Section 4 + docs/feature_dictionary.md |
| 3.6 Modeling | docs/technical_report.md Sections 5-6 + Notebook 03 |
| 3.7 Visualizations | visualizations/ folder + notebook outputs |
| 3.8 Conclusions | docs/technical_report.md Section 9 |
- Core: Python 3.9+, scikit-learn, XGBoost, LightGBM
- Big Data: Dask, PyArrow/Parquet
- Visualization: Matplotlib, Seaborn, Plotly, Folium
- Analysis: SHAP, Optuna
| Model | RMSE | MAE | R² | Training Time |
|---|---|---|---|---|
| XGBoost | 0.3053 | 0.2199 | 0.8234 | 56s |
| LightGBM | 0.3216 | 0.2353 | 0.8040 | 27s |
| Random Forest | 0.3299 | 0.2420 | 0.7938 | 23min |
| Gradient Boosting | 0.3309 | 0.2418 | 0.7926 | 76s |
| Ridge (Baseline) | 0.4932 | 0.3799 | 0.5392 | 8s |
Best Model: XGBoost with R² = 0.8234 (explains 82% of trip duration variance)
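For readers unfamiliar with the three metric columns, here is a minimal pure-Python sketch of their standard definitions. It is illustrative only: the notebooks presumably rely on library implementations (an assumption), and the function name `regression_metrics` is hypothetical. The README does not state the scale of the target, so no attempt is made here to reproduce the numbers in the table.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R² for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Residual sum of squares and the derived error metrics.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n

    # Total sum of squares around the mean of the true values.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    # R² is the share of variance explained: 1 - SS_res / SS_tot.
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "r2": r2}
```

Under this definition, a model that always predicts the mean of `y_true` scores R² = 0 and a perfect model scores R² = 1, which is the sense in which R² = 0.8234 corresponds to 82% of variance explained.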
Key Finding: haversine_distance alone accounts for 78% of feature importance
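The haversine distance is the great-circle distance between the pickup and dropoff coordinates. The sketch below shows the standard formula. It is not taken from the project code: the argument names, the kilometre unit, and the Earth radius of 6371 km are assumptions, and the feature dictionary remains the authoritative definition of `haversine_distance`.

```python
import math

# Mean Earth radius in kilometres (assumed unit for this sketch).
EARTH_RADIUS_KM = 6371.0


def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)

    # Haversine term; clamp to 1.0 to guard against floating-point overshoot.
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    a = min(1.0, a)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

On a dataset of this size the same formula would normally be applied column-wise rather than row by row. The scalar version is shown here only to make the computation explicit.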
Only the LightGBM model is included in the repository, for deployment purposes.
The Streamlit demo uses LightGBM because:
- ✅ Best balance of performance and model size
- ✅ Fast inference time (ideal for web applications)
- ✅ R² = 0.8040 (competitive accuracy)
After cloning the repository:
- Run notebooks 01-03 to train all models locally
- All .pkl files will be generated in models/
- Local deployment will work with all trained models
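Once the .pkl files exist, a saved model can be loaded back for local use. The sketch below assumes the models were written with Python's standard `pickle` module; if the notebooks used `joblib` instead, `joblib.load` would be the counterpart. The helper name `load_model` and the file name in the usage note are hypothetical, since the README does not list the actual file names.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Load a pickled object from path, failing early if the file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    # Only unpickle files you trust: pickle can execute arbitrary code on load.
    with path.open("rb") as fh:
        return pickle.load(fh)
```

A call such as `load_model("models/lightgbm_model.pkl")` (hypothetical file name) would return the trained estimator, provided the libraries used to train it are installed in the active environment.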
MIT License - see LICENSE for details.
- Repository: GitHub - NYC Taxi Big Data ML
- Dataset: Kaggle NYC Taxi Trip Duration
- Author: Carlos Manuel Hernández