Saurabh6266/ML_Project1

🚗 Predicting Fuel Economy

A Multiple Linear Regression Study on the Auto MPG Dataset


Table of Contents

  1. Project Overview
  2. Dataset
  3. Project Structure
  4. Methodology
  5. Key Results
  6. Installation & Usage
  7. Dependencies
  8. Key Findings
  9. Limitations & Future Work
  10. Acknowledgements

Project Overview

The 1970s oil crisis forced a fundamental rethinking of automotive design. For the first time, fuel efficiency became a regulatory, commercial, and consumer priority — and the data from that era tells a fascinating story.

This project uses the UCI Auto MPG dataset to build an interpretable multiple linear regression model that predicts a car's fuel efficiency (miles per gallon) from its mechanical and physical characteristics. The goal goes beyond prediction accuracy: we want to understand the model, quantify the impact of each feature, and draw conclusions about what actually drove fuel economy improvements during this pivotal decade.

Core question:

Given a car's engine size, weight, horsepower, acceleration, model year, and region of origin — can we accurately predict its fuel efficiency in miles per gallon?

Success criterion: Test R² ≥ 0.80 (the model must explain at least 80% of variance in unseen data).


Dataset

| Property | Value |
|----------|-------|
| Name | Auto MPG |
| Source | UCI Machine Learning Repository |
| Records | 398 cars |
| Period | 1970–1982 |
| Target | mpg: miles per gallon (continuous) |

Feature Descriptions

| Feature | Type | Description |
|---------|------|-------------|
| mpg | Float | Miles per gallon; the prediction target |
| cylinders | Integer | Number of engine cylinders (3, 4, 5, 6, or 8) |
| displacement | Float | Engine displacement in cubic inches |
| horsepower | Float | Engine horsepower (6 missing values encoded as '?') |
| weight | Integer | Vehicle weight in pounds |
| acceleration | Float | Time (seconds) from 0 to 60 mph |
| model year | Integer | Last two digits of the model year (70–82) |
| origin | Categorical | Manufacturing region: USA, Europe, or Japan |
| car name | String | Vehicle identifier; dropped before modelling |

Data Quality Issues Addressed

  • horsepower was stored as object dtype with 6 missing values encoded as '?'. Fixed via pd.to_numeric(errors='coerce') and mean imputation.
  • origin was encoded as integers (1, 2, 3). Mapped to meaningful labels and cast to category dtype with explicit ordering (USA as reference level for one-hot encoding).
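The two fixes above can be sketched as follows. This is a minimal illustration rather than the notebook's exact code: the four-row frame is made up, and the mapping 1 = USA, 2 = Europe, 3 = Japan is assumed from the dataset's usual coding.

```python
import pandas as pd

# Tiny made-up sample standing in for auto-mpg.csv (illustrative values only).
df = pd.DataFrame({
    "horsepower": ["130", "165", "?", "150"],
    "origin": [1, 3, 2, 1],
})

# '?' cannot be parsed as a number, so errors="coerce" turns it into NaN.
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")

# Mean imputation: fill NaN with the mean of the observed values.
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].mean())

# Assumed coding: 1 = USA, 2 = Europe, 3 = Japan.
origin_labels = {1: "USA", 2: "Europe", 3: "Japan"}

# Listing USA first makes it the level that one-hot encoding can drop later.
df["origin"] = pd.Categorical(
    df["origin"].map(origin_labels),
    categories=["USA", "Europe", "Japan"],
)
```

Computing the mean after coercion matters: the column is only numeric once the '?' entries have become NaN.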

Project Structure

predicting-fuel-economy/
│
├── Predicting_Fuel_Economy.ipynb   # Main analysis notebook
├── auto-mpg.csv                    # Raw dataset
└── README.md                       # This file

Notebook Structure

The notebook is organised into 14 sequential sections that follow a complete machine learning workflow:

 1. Library Imports
 2. Data Loading
 3. Data Cleaning
    ├── 3.1  Fix origin column (int → category with explicit ordering)
    ├── 3.2  Fix horsepower column ('?' → NaN → mean imputation)
    └── 3.3  Final data quality check
 4. Exploratory Data Analysis (EDA)
    ├── 4.1  Summary statistics with interpretation
    ├── 4.2  Target distribution (histogram + box plot)
    ├── 4.3  Feature relationships (pair plot, coloured by origin)
    ├── 4.4  Fuel economy by region of origin
    └── 4.5  Correlation heatmap — answers "which feature correlates most?"
 5. Feature Engineering
    ├── 5.1  Polynomial terms (weight², weight³, hp², hp³, acc², acc³)
    └── 5.2  One-hot encoding for origin (USA as reference level)
 6. Train-Test Split (80/20, random_state=1)
 7. Baseline Model — weight only (OLS, single feature)
 8. Multiple Regression with K-Fold Cross-Validation
    ├── 8.1  5-fold CV on training set
    └── 8.2  Refit final model on full training data
 9. OLS Assumption Checks
    ├── Residuals vs Fitted plot
    └── Normal Q-Q plot
10. Model Summary & Coefficient Interpretation
11. Test Set Evaluation
12. Predicting Fuel Economy for New Cars
13. Ridge Regression (Bonus)
14. Conclusion

Methodology

1. Data Cleaning

Standard pandas cleaning pipeline with deliberate decisions at each step — to_numeric with errors='coerce' to handle the disguised missing values, mean imputation to preserve all 398 rows, and explicit category ordering to control which level becomes the dummy variable reference.

2. Exploratory Data Analysis

Visual and quantitative exploration to answer three questions before touching the model: What does the target distribution look like? Which features have the strongest relationships with mpg? How severe is the multicollinearity between predictors?
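A hedged sketch of how the second and third questions could be answered numerically. The data here is synthetic and generated only for illustration (it is not the Auto MPG data), so the printed correlations will not match the notebook's heatmap.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; horsepower is deliberately tied to weight.
rng = np.random.default_rng(0)
n = 200
weight = rng.normal(3000, 800, n)
horsepower = 0.04 * weight + rng.normal(0, 20, n)
acceleration = rng.normal(15.5, 2.5, n)
mpg = 46 - 0.0075 * weight + rng.normal(0, 3, n)

df = pd.DataFrame({
    "mpg": mpg,
    "weight": weight,
    "horsepower": horsepower,
    "acceleration": acceleration,
})

corr = df.corr()

# Which features relate most strongly to the target? Rank by |r|.
target_corr = corr["mpg"].drop("mpg").sort_values(key=np.abs, ascending=False)

# How collinear are the predictors? Inspect the predictor-only block.
predictor_corr = corr.drop(index="mpg", columns="mpg")

print(target_corr)
print(predictor_corr.round(2))
```

Large off-diagonal values in `predictor_corr` are the early warning sign for the multicollinearity discussed later.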

3. Feature Engineering

The pair plot reveals clear non-linear (curved) relationships between mpg and weight, horsepower, and acceleration. To capture this curvature within a linear model framework, polynomial terms (squared and cubic) are added for all three features. This allows the model to fit curves while remaining fully interpretable via OLS.

One-hot encoding transforms origin into two binary columns (origin_Europe, origin_Japan), with origin_USA as the reference level — so coefficients are interpreted as mpg differences relative to US-made cars, all else equal.
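Both feature-engineering steps can be sketched in a few lines of pandas. The three rows are illustrative, and the `_sq` / `_cu` column suffixes are hypothetical names chosen for this example; the notebook may name its polynomial columns differently.

```python
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({
    "weight": [3504.0, 2130.0, 2372.0],
    "horsepower": [130.0, 88.0, 95.0],
    "acceleration": [12.0, 14.5, 15.0],
    "origin": pd.Categorical(
        ["USA", "Europe", "Japan"],
        categories=["USA", "Europe", "Japan"],
    ),
})

# Squared and cubic terms let a linear model fit curved relationships.
for col in ["weight", "horsepower", "acceleration"]:
    df[f"{col}_sq"] = df[col] ** 2
    df[f"{col}_cu"] = df[col] ** 3

# drop_first=True drops the first category (USA), making it the reference level.
df = pd.get_dummies(df, columns=["origin"], drop_first=True, dtype=float)

print(df.filter(like="origin"))
```

A US-made car is the row where both `origin_Europe` and `origin_Japan` are 0, which is why the dummy coefficients read as differences from US-made cars.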

4. Validation Strategy

5-fold cross-validation on the training set gives a more stable estimate of generalisation performance than a single validation split, and it does so before the test set is ever touched. The model is then refit on the full training data, and a single final evaluation is performed on the held-out test set.
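The workflow above might look like the sketch below, using scikit-learn on synthetic data. The 80/20 split and `random_state=1` follow this README; shuffling the folds and reusing the same seed for `KFold` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic regression data standing in for the engineered feature matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=1.0, size=300)

# Hold out 20% as a test set that is used exactly once, at the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# 5-fold CV on the training set only (shuffle and seed are assumptions).
cv = KFold(n_splits=5, shuffle=True, random_state=1)
cv_r2 = cross_val_score(LinearRegression(), X_train, y_train, cv=cv, scoring="r2")
print(f"CV R²: {cv_r2.mean():.3f} ± {cv_r2.std():.3f}")

# Refit on the full training data, then evaluate once on the test set.
final_model = LinearRegression().fit(X_train, y_train)
y_pred = final_model.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
test_mae = mean_absolute_error(y_test, y_pred)
print(f"Test R²: {test_r2:.3f}, Test MAE: {test_mae:.3f}")
```

Keeping the test set out of every modelling decision is what makes the final R² an honest estimate.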

5. OLS Assumption Checks

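A Residuals vs Fitted plot checks for homoscedasticity; a Normal Q-Q plot checks for normality of residuals. Both are interpreted in context: mild tail heaviness is noted, but it does not invalidate the coefficient estimates, because the Gauss-Markov theorem does not require normally distributed errors. Normality matters mainly for the exactness of t-tests and confidence intervals in small samples.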
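The quantities behind both diagnostic plots can be computed without plotting, as in this sketch on synthetic data. Comparing residual spread between the lower and upper halves of the fitted values is a rough stand-in for reading the Residuals vs Fitted plot, not a formal test, and it is not taken from the notebook.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data with well-behaved (normal, constant-variance) errors.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Residuals vs fitted: a ratio near 1 is consistent with constant variance.
order = np.argsort(fitted)
low_spread = residuals[order[:100]].std()
high_spread = residuals[order[100:]].std()
spread_ratio = high_spread / low_spread

# Q-Q data: theoretical normal quantiles vs ordered residuals, plus the
# correlation r of the straight-line fit (close to 1 suggests normality).
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)

print(f"spread ratio: {spread_ratio:.2f}, Q-Q correlation: {r:.4f}")
```

Passing `fitted` and `residuals`, or `theoretical_q` and `ordered_resid`, to a scatter plot reproduces the two diagnostic figures.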

6. Ridge Regression (Bonus)

The large condition number in the OLS summary (~4.84 × 10¹²) signals multicollinearity from the polynomial terms. Ridge regression with RidgeCV addresses this by adding an L2 penalty. StandardScaler is applied beforehand because the penalty acts on coefficient magnitudes, which depend on each feature's units; without scaling, features measured on small scales (and therefore carrying large coefficients) would be shrunk disproportionately.
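One way to combine scaling and Ridge is a scikit-learn pipeline, sketched below on synthetic polynomial features. The alpha grid is an assumption for this example; the notebook's grid is not stated in this README.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic weights (lb) with polynomial terms, which are highly correlated.
rng = np.random.default_rng(3)
weight = rng.uniform(1600, 5200, 250)
X = np.column_stack([weight, weight**2, weight**3])
y = 60 - 0.015 * weight + 1.2e-6 * weight**2 + rng.normal(scale=2.0, size=250)

# Candidate penalty strengths (assumed grid, log-spaced).
alphas = np.logspace(-3, 3, 13)

# The scaler is fitted inside the pipeline, so the penalty sees comparable units.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
ridge.fit(X, y)

chosen_alpha = ridge.named_steps["ridgecv"].alpha_
train_r2 = ridge.score(X, y)
print(f"chosen alpha: {chosen_alpha:g}, training R²: {train_r2:.3f}")
```

Putting the scaler in the pipeline also prevents leakage: when the pipeline is fitted on training data only, the test set never influences the scaling statistics.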


Key Results

Model Performance Comparison

| Model | Test R² | Test MAE | Notes |
|-------|---------|----------|-------|
| Baseline (weight only) | 0.695 | ~3.1 mpg | Single-feature OLS |
| OLS (multiple regression) | 0.877 | 1.94 mpg | 14 features, including polynomial terms |
| Ridge (multiple regression) | 0.886 | 1.87 mpg | Same features, L2 regularisation |

The final OLS model exceeds the project target of R² ≥ 0.80, explaining 87.7% of the variance in fuel economy on held-out test data. Ridge provides a marginal improvement (+0.009 R²) but more importantly offers greater coefficient stability due to better handling of multicollinearity.

Cross-Validation Summary

| Metric | CV Mean | CV Std |
|--------|---------|--------|
| R² | ~0.857 | ±0.015 |
| MAE | ~2.0 mpg | ±0.1 mpg |

The low standard deviation across folds suggests the model's performance is stable and does not depend heavily on any particular subset of the training data.


Installation & Usage

Clone & Install

git clone https://github.com/yourusername/predicting-fuel-economy.git
cd predicting-fuel-economy
pip install -r requirements.txt

Run the Notebook

jupyter notebook Predicting_Fuel_Economy.ipynb

Run all cells from top to bottom — the notebook is designed to be executed sequentially, with each section building on the previous one.

Note: The dataset file auto-mpg.csv must be in the same directory as the notebook. The pd.read_csv("auto-mpg.csv") call in Section 2 expects it there.


Dependencies

numpy>=1.21
pandas>=1.3
matplotlib>=3.4
seaborn>=0.12
scipy>=1.7
scikit-learn>=1.0
statsmodels>=0.13
jupyter>=1.0

Install all at once:

pip install numpy pandas matplotlib seaborn scipy scikit-learn statsmodels jupyter

Developed and tested on Python 3.10+. Compatible with Python 3.8 and above.


Key Findings

Which feature correlates most strongly with MPG?

weight — with a correlation of r = −0.832, it is the single strongest mechanical predictor of fuel economy. Heavier cars require more energy to accelerate and maintain speed, directly reducing efficiency.

What is the impact of model year on MPG?

Each additional model year is associated with approximately +0.86 MPG, holding all other features constant (coefficient = 0.8573, p < 0.001, t = 16.19). This is the most statistically significant coefficient in the model and reflects the decade of regulatory pressure following the 1973 oil crisis — CAFE standards and Clean Air Act amendments were systematically pushing manufacturers toward more efficient designs year over year.

In practical terms: a mechanically identical 1982 car is predicted to get ~10 MPG more than the same car from 1970 (0.8573 × 12 ≈ 10.3 mpg).

Does region of origin matter?

Yes. Beyond what the mechanical specs capture, the model estimates European cars at about +1.3 MPG and Japanese cars at about +1.1 MPG relative to otherwise equivalent US-made cars (both significant at p < 0.05). This likely reflects unmeasured design differences in aerodynamics, transmission efficiency, and manufacturing philosophy.
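Because the model is linear in model year and in the origin dummies, these "all else equal" statements reduce to simple arithmetic on the coefficients. The sketch below uses only the coefficient values reported in this README; the variable names are illustrative.

```python
# Coefficients as reported in this README (mpg per unit of each feature).
coef_model_year = 0.8573
coef_origin_europe = 1.3   # relative to USA, the reference level
coef_origin_japan = 1.1    # relative to USA, the reference level

# Same car, model year 82 vs 70: only the model-year term changes.
year_gap = 82 - 70
year_effect = coef_model_year * year_gap
print(f"1982 vs 1970: {year_effect:+.1f} mpg")

# Europe vs Japan, all else equal: the difference of the two dummy coefficients.
europe_vs_japan = coef_origin_europe - coef_origin_japan
print(f"Europe vs Japan: {europe_vs_japan:+.1f} mpg")
```

The Europe-versus-Japan gap is only a point estimate. Both dummies are tested against the USA baseline, so their individual p-values say nothing about whether they differ from each other.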

Is Ridge better than OLS?

Marginally, in terms of predictive accuracy (ΔR² ≈ +0.009). More importantly, Ridge is more stable — the polynomial terms introduce severe multicollinearity (condition number ~4.84 × 10¹²) that inflates OLS coefficient standard errors. Ridge's L2 penalty shrinks correlated coefficients toward each other, producing more reliable estimates without meaningfully sacrificing accuracy.
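To see why polynomial terms produce such a large condition number, the sketch below computes the condition number of a small design matrix before and after standardising. The data is synthetic, so the values will differ from the ~4.84 × 10¹² in the OLS summary; it is the same kind of quantity, not a reproduction of it.

```python
import numpy as np

# Synthetic weights in a plausible range (lb).
rng = np.random.default_rng(4)
weight = rng.uniform(1600, 5200, 300)

# Raw design matrix: intercept, weight, weight², weight³.
X_raw = np.column_stack([np.ones_like(weight), weight, weight**2, weight**3])
cond_raw = np.linalg.cond(X_raw)

# Standardise the polynomial columns (zero mean, unit variance).
poly = np.column_stack([weight, weight**2, weight**3])
poly_std = (poly - poly.mean(axis=0)) / poly.std(axis=0)
X_std = np.column_stack([np.ones_like(weight), poly_std])
cond_std = np.linalg.cond(X_std)

print(f"raw: {cond_raw:.3e}, standardised: {cond_std:.3e}")
```

Standardising removes the part of the problem caused by wildly different column scales. The remaining condition number reflects the correlation among the powers of weight, which is the part Ridge's penalty helps with.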


Limitations & Future Work

Current limitations:

  • The dataset covers only 1970–1982 and 398 cars — results may not generalise to modern vehicles, hybrids, or EVs
  • Mean imputation for the 6 missing horsepower values is simple; k-NN or multiple imputation would be more principled
  • Multicollinearity among the size-related predictors (cylinders, displacement, horsepower, weight) makes individual coefficient interpretation unreliable in the OLS model
  • car name was discarded entirely; manufacturer-level effects beyond regional origin are unexplored
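As an illustration of the k-NN alternative mentioned in the imputation bullet, the sketch below uses scikit-learn's KNNImputer on a made-up five-row sample. The notebook itself does not do this, and `n_neighbors=2` is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: horsepower, weight. Made-up rows; one horsepower value is missing.
X = np.array([
    [130.0, 3504.0],
    [165.0, 3693.0],
    [np.nan, 2046.0],
    [88.0, 2130.0],
    [95.0, 2372.0],
])

# Fill the gap with the average horsepower of the 2 cars closest in weight.
imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(X)

knn_value = imputed[2, 0]
mean_value = np.nanmean(X[:, 0])
print(f"k-NN imputed: {knn_value:.1f}, mean imputed: {mean_value:.1f}")
```

Here the light car borrows its horsepower from the other light cars rather than from the overall mean. With more columns, the features would normally be scaled first so that no single feature dominates the distance.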

Potential extensions:

  • Principal Component Regression (PCR) or Lasso for more robust feature selection under multicollinearity
  • Manufacturer-level fixed effects extracted from car name using regex or fuzzy matching
  • Ensemble methods (Random Forest, Gradient Boosting) for a predictive accuracy comparison at the cost of interpretability
  • Time-series analysis treating model year as a temporal index to better capture the regulatory trend
  • Modern dataset integration with post-1982 EPA fuel economy data to test whether the model year coefficient direction holds
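The manufacturer extension could start from something like the sketch below, which takes the first word of each name as the manufacturer. The names are illustrative examples in the style of the car name column, and the alias table is hypothetical; real data would likely need more entries or fuzzy matching.

```python
import pandas as pd

# Illustrative names in the style of the dataset's car name column.
car_names = pd.Series([
    "chevrolet chevelle malibu",
    "toyota corona mark ii",
    "vw rabbit",
    "volkswagen dasher",
])

# First whitespace-delimited token is treated as the manufacturer.
manufacturer = car_names.str.extract(r"^(\S+)", expand=False)

# Hypothetical alias table to merge abbreviations and spelling variants.
aliases = {"vw": "volkswagen"}
manufacturer = manufacturer.replace(aliases)

print(manufacturer.value_counts())
```

The cleaned column could then be one-hot encoded in the same way as origin, although many manufacturers with few cars each would add a lot of sparse features.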

Acknowledgements

  • Dataset: Quinlan, R. (1993). Auto MPG Data Set. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/9/auto+mpg
  • Guided project framework: Maven Analytics — Predicting Fuel Economy project brief
  • Built as part of a personal data science portfolio to demonstrate end-to-end regression modelling with interpretability focus

Last updated: March 2026
