🚗 Predicting Fuel Economy

A Multiple Linear Regression Study on the Auto MPG Dataset

Project Overview

The 1970s oil crisis forced a fundamental rethinking of automotive design. For the first time, fuel efficiency became a regulatory, commercial, and consumer priority — and the data from that era tells a fascinating story.

This project uses the UCI Auto MPG dataset to build an interpretable multiple linear regression model that predicts a car's fuel efficiency (miles per gallon) from its mechanical and physical characteristics. The goal goes beyond prediction accuracy: we want to understand the model, quantify the impact of each feature, and draw conclusions about what actually drove fuel economy improvements during this pivotal decade.

Core question:

Given a car's engine size, weight, horsepower, acceleration, model year, and region of origin — can we accurately predict its fuel efficiency in miles per gallon?

Success criterion: Test R² ≥ 0.80 (the model must explain at least 80% of variance in unseen data).

Dataset

Property	Value
Name	Auto MPG
Source	UCI Machine Learning Repository
Records	398 cars
Period	1970–1982
Target	`mpg` — Miles Per Gallon (continuous)

Feature Descriptions

Feature	Type	Description
`mpg`	Float	Miles per gallon — the prediction target
`cylinders`	Integer	Number of engine cylinders (3, 4, 5, 6, or 8)
`displacement`	Float	Engine displacement in cubic inches
`horsepower`	Float	Engine horsepower (6 missing values encoded as `'?'`)
`weight`	Integer	Vehicle weight in pounds
`acceleration`	Float	Time (seconds) from 0 to 60 mph
`model year`	Integer	Last two digits of the model year (70–82)
`origin`	Categorical	Manufacturing region: USA, Europe, or Japan
`car name`	String	Vehicle identifier — dropped before modelling

Data Quality Issues Addressed

horsepower was stored as object dtype with 6 missing values encoded as '?'. Fixed via pd.to_numeric(errors='coerce') and mean imputation.
origin was encoded as integers (1, 2, 3). Mapped to meaningful labels and cast to category dtype with explicit ordering (USA as reference level for one-hot encoding).
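The two fixes above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the three-row sample is made up, and the 1/2/3 to USA/Europe/Japan mapping is assumed from the usual UCI documentation.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for auto-mpg.csv (values are illustrative only).
raw = io.StringIO(
    "mpg,horsepower,origin\n"
    "18.0,130,1\n"
    "25.0,?,2\n"
    "32.0,70,3\n"
)
df = pd.read_csv(raw)

# The '?' entries force object dtype; coerce them to NaN, then impute the column mean.
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].mean())

# Assumed code-to-label mapping; USA is listed first so it becomes the reference level.
origin_labels = {1: "USA", 2: "Europe", 3: "Japan"}
df["origin"] = pd.Categorical(
    df["origin"].map(origin_labels),
    categories=["USA", "Europe", "Japan"],
)

print(df.dtypes)
```

Listing the categories explicitly, rather than relying on alphabetical order, is what keeps USA as the dropped level when the column is later one-hot encoded.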

Project Structure

predicting-fuel-economy/
│
├── Predicting_Fuel_Economy.ipynb   # Main analysis notebook
├── auto-mpg.csv                    # Raw dataset
├── requirements.txt                # Python dependencies
└── README.md                       # This file

Notebook Structure

The notebook is organised into 14 sequential sections that follow a complete machine learning workflow:

 1. Library Imports
 2. Data Loading
 3. Data Cleaning
    ├── 3.1  Fix origin column (int → category with explicit ordering)
    ├── 3.2  Fix horsepower column ('?' → NaN → mean imputation)
    └── 3.3  Final data quality check
 4. Exploratory Data Analysis (EDA)
    ├── 4.1  Summary statistics with interpretation
    ├── 4.2  Target distribution (histogram + box plot)
    ├── 4.3  Feature relationships (pair plot, coloured by origin)
    ├── 4.4  Fuel economy by region of origin
    └── 4.5  Correlation heatmap — answers "which feature correlates most?"
 5. Feature Engineering
    ├── 5.1  Polynomial terms (weight², weight³, hp², hp³, acc², acc³)
    └── 5.2  One-hot encoding for origin (USA as reference level)
 6. Train-Test Split (80/20, random_state=1)
 7. Baseline Model — weight only (OLS, single feature)
 8. Multiple Regression with K-Fold Cross-Validation
    ├── 8.1  5-fold CV on training set
    └── 8.2  Refit final model on full training data
 9. OLS Assumption Checks
    ├── Residuals vs Fitted plot
    └── Normal Q-Q plot
10. Model Summary & Coefficient Interpretation
11. Test Set Evaluation
12. Predicting Fuel Economy for New Cars
13. Ridge Regression (Bonus)
14. Conclusion

Methodology

1. Data Cleaning

The cleaning step is a standard pandas pipeline with a deliberate decision at each stage: to_numeric with errors='coerce' converts the disguised missing values to NaN, mean imputation preserves all 398 rows, and explicit category ordering controls which level becomes the dummy-variable reference.

2. Exploratory Data Analysis

Visual and quantitative exploration to answer three questions before touching the model: What does the target distribution look like? Which features have the strongest relationships with mpg? How severe is the multicollinearity between predictors?
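The second and third questions can both be read off a correlation matrix. The sketch below uses synthetic data as a stand-in for the real dataset (the generating coefficients are invented for the demo), so the numbers it prints are not the project's results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in: heavier cars get more horsepower and lower mpg.
weight = rng.uniform(1600, 5000, n)
horsepower = 0.03 * weight + rng.normal(0, 10, n)
acceleration = 22 - 0.05 * horsepower + rng.normal(0, 1.5, n)
mpg = 46 - 0.0075 * weight + rng.normal(0, 3, n)
df = pd.DataFrame(
    {"mpg": mpg, "weight": weight, "horsepower": horsepower, "acceleration": acceleration}
)

corr = df.corr()

# Which features relate most strongly to the target? Rank by absolute correlation.
target_corr = corr["mpg"].drop("mpg").sort_values(key=abs, ascending=False)

# How correlated are the predictors with each other? Drop the target row and column.
predictor_corr = corr.drop(index="mpg", columns="mpg")

print(target_corr.round(3))
print(predictor_corr.round(3))
```

Large off-diagonal values in `predictor_corr` are the early warning for the multicollinearity that the Ridge section deals with later.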

3. Feature Engineering

The pair plot reveals clear non-linear (curved) relationships between mpg and weight, horsepower, and acceleration. To capture this curvature within a linear model framework, polynomial terms (squared and cubic) are added for all three features. The model can then fit curves while staying linear in its coefficients, so it is still estimated by OLS and summarised with an ordinary coefficient table, although the individual polynomial coefficients are harder to read in isolation.

One-hot encoding transforms origin into two binary columns (origin_Europe, origin_Japan), with origin_USA as the reference level — so coefficients are interpreted as mpg differences relative to US-made cars, all else equal.
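A minimal sketch of both feature-engineering steps, assuming a DataFrame shaped like the cleaned data. The `_sq` and `_cu` column suffixes are hypothetical names chosen for this example, and the three rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "weight": [3504, 2130, 2372],
        "horsepower": [130.0, 88.0, 95.0],
        "acceleration": [12.0, 14.5, 15.0],
        "origin": pd.Categorical(
            ["USA", "Europe", "Japan"], categories=["USA", "Europe", "Japan"]
        ),
    }
)

# Squared and cubic terms for the three features with curved relationships to mpg.
for col in ["weight", "horsepower", "acceleration"]:
    df[f"{col}_sq"] = df[col] ** 2
    df[f"{col}_cu"] = df[col] ** 3

# drop_first=True removes the first category (USA), which makes it the reference level.
X = pd.get_dummies(df, columns=["origin"], drop_first=True)

print(list(X.columns))
```

Because USA has no column of its own, a row with zeros in both origin_Europe and origin_Japan is a US-made car, and the two dummy coefficients measure offsets from that baseline.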

4. Validation Strategy

5-fold cross-validation on the training set estimates generalisation performance more stably than a single validation split would, and it does so before the test set is ever touched. The model is then refit on the full training data, and a single final evaluation is performed on the held-out test set.
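The order of operations matters more than the specific calls, so here is a sketch of that order using scikit-learn on synthetic data. Shuffling the folds and reusing random_state=1 for them are assumptions of this example, not details confirmed by the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))                       # synthetic features
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=1.0, size=300)

# Step 1: hold out the test set first (80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Step 2: 5-fold cross-validation on the training portion only.
model = LinearRegression()
cv = KFold(n_splits=5, shuffle=True, random_state=1)
cv_r2 = cross_val_score(model, X_train, y_train, cv=cv, scoring="r2")
print(f"CV R2: {cv_r2.mean():.3f} +/- {cv_r2.std():.3f}")

# Step 3: refit on all training data, then score once on the held-out test set.
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"Test R2: {test_r2:.3f}")
```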

5. OLS Assumption Checks

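A Residuals vs Fitted plot checks for homoscedasticity, and a Normal Q-Q plot checks for normality of residuals. Both are interpreted in context: mild tail heaviness is noted, but it does not invalidate the coefficient estimates, because the Gauss-Markov theorem does not require normally distributed errors for OLS to be the best linear unbiased estimator. Normality matters mainly for the exactness of t-tests and confidence intervals in small samples.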
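The quantities behind both diagnostic plots can be computed without any plotting library. The sketch below fits OLS on synthetic data with NumPy and builds the Q-Q points by hand; the two summary numbers at the end are rough numeric proxies invented for this example, not standard test statistics.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
y = X @ np.array([20.0, -3.0, 1.5]) + rng.normal(scale=2.0, size=n)

# Ordinary least squares fit, then fitted values and residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

# Residuals vs Fitted: plotting `residuals` against `fitted` should show no funnel.
# A crude proxy is the correlation between fitted values and absolute residuals.
spread_trend = np.corrcoef(fitted, np.abs(residuals))[0, 1]

# Normal Q-Q: sorted standardised residuals against theoretical normal quantiles.
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical_q = np.array([NormalDist().inv_cdf(p) for p in probs])
sample_q = np.sort((residuals - residuals.mean()) / residuals.std(ddof=1))
qq_corr = np.corrcoef(theoretical_q, sample_q)[0, 1]  # close to 1 if roughly normal

print(f"spread trend: {spread_trend:.3f}, Q-Q correlation: {qq_corr:.4f}")
```

Plotting `fitted` against `residuals`, and `theoretical_q` against `sample_q`, reproduces the two charts; heavy tails show up as the ends of the Q-Q points bending away from the diagonal.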

6. Ridge Regression (Bonus)

The large condition number in the OLS summary (~4.84 × 10¹²) signals multicollinearity from the polynomial terms, compounded by their very different scales. Ridge regression with RidgeCV addresses this by adding an L2 penalty, with StandardScaler applied beforehand. Scaling is required because the penalty treats every coefficient alike, so on unscaled features the amount of shrinkage would depend on each feature's units rather than on its importance.
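One way to wire the scaler and RidgeCV together is a scikit-learn pipeline, sketched below on synthetic polynomial features. The alpha grid is a hypothetical choice for this example; the notebook's actual grid is not stated here.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
w = rng.uniform(1.6, 5.0, size=200)            # stand-in for weight, in 1000s of lb
X = np.column_stack([w, w**2, w**3])           # strongly collinear polynomial terms
y = 50 - 12 * w + 1.0 * w**2 + rng.normal(scale=2.0, size=200)

# Keeping the scaler inside the pipeline means its mean and variance are
# learned only from whatever data the pipeline is fitted on.
alphas = np.logspace(-3, 3, 13)                # hypothetical search grid
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
ridge.fit(X, y)

best_alpha = ridge.named_steps["ridgecv"].alpha_
r2 = ridge.score(X, y)
print(f"selected alpha: {best_alpha:g}, in-sample R2: {r2:.3f}")
```

In the project workflow the pipeline would be fitted on the training split and scored on the test split, so the scaler never sees test data.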

Key Results

Model Performance Comparison

Model	Test R²	Test MAE	Notes
Baseline (weight only)	0.695	~3.1 mpg	Single-feature OLS
OLS (multiple regression)	0.877	1.94 mpg	14 features, including polynomial terms and origin dummies
Ridge (multiple regression)	0.886	1.87 mpg	Same features, L2 regularisation

The final OLS model exceeds the project target of R² ≥ 0.80, explaining 87.7% of the variance in fuel economy on held-out test data. Ridge provides a marginal improvement (+0.009 R²) but more importantly offers greater coefficient stability due to better handling of multicollinearity.
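For readers who want to see exactly what the two reported metrics measure, here is a small NumPy sketch. The function names and the four mpg values are made up for illustration.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Share of variance explained: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target (mpg)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))


y_test = np.array([18.0, 25.0, 32.0, 21.0])   # made-up mpg values
y_pred = np.array([20.0, 24.0, 30.0, 22.0])   # made-up predictions

print(f"R2 = {r_squared(y_test, y_pred):.3f}, MAE = {mae(y_test, y_pred):.2f} mpg")
```

R² is unitless and compares the model against always predicting the mean, while MAE reads directly as the typical miss in mpg, which is why the table reports both.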

Cross-Validation Summary

Metric	CV Mean	CV Std
R²	~0.857	±0.015
MAE	~2.0 mpg	±0.1 mpg

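Low standard deviation across folds suggests that performance is stable and does not depend heavily on any particular subset of the training data. The test R² of 0.877 also sits close to the CV mean, which is consistent with limited overfitting.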

Installation & Usage

Clone & Install

git clone https://github.com/yourusername/predicting-fuel-economy.git
cd predicting-fuel-economy
pip install -r requirements.txt

Run the Notebook

jupyter notebook Predicting_Fuel_Economy.ipynb

Run all cells from top to bottom — the notebook is designed to be executed sequentially, with each section building on the previous one.

Note: The dataset file auto-mpg.csv must be in the same directory as the notebook. The pd.read_csv("auto-mpg.csv") call in Section 2 expects it there.

Dependencies

numpy>=1.21
pandas>=1.3
matplotlib>=3.4
seaborn>=0.12
scipy>=1.7
scikit-learn>=1.0
statsmodels>=0.13
jupyter>=1.0

Install all at once:

pip install numpy pandas matplotlib seaborn scipy scikit-learn statsmodels jupyter

Developed and tested on Python 3.10+. Expected to be compatible with Python 3.8 and above.

Key Findings

Which feature correlates most strongly with MPG?

weight — with a correlation of r = −0.832, it is the single strongest mechanical predictor of fuel economy. Heavier cars require more energy to accelerate and maintain speed, directly reducing efficiency.

What is the impact of model year on MPG?

Each additional model year is associated with approximately +0.86 MPG, holding all other features constant (coefficient = 0.8573, p < 0.001, t = 16.19). This is the most statistically significant coefficient in the model and reflects the decade of regulatory pressure following the 1973 oil crisis — CAFE standards and Clean Air Act amendments were systematically pushing manufacturers toward more efficient designs year over year.

In practical terms: a mechanically identical 1982 car is predicted to get ~10 MPG more than the same car from 1970 (0.8573 × 12 ≈ 10.3 mpg).
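The arithmetic behind that figure, written out. It assumes the year effect is constant across the whole 1970 to 1982 range, which is what a single linear coefficient implies.

```python
coef_model_year = 0.8573   # OLS coefficient reported above (mpg per model year)
years_apart = 82 - 70      # model-year codes for 1982 and 1970

predicted_gain = coef_model_year * years_apart
print(f"{predicted_gain:.1f} mpg")  # prints "10.3 mpg"
```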

Does region of origin matter?

Yes — beyond what the mechanical specs capture, European cars average +1.3 MPG and Japanese cars average +1.1 MPG relative to equivalent US-made cars (both significant at p < 0.05). This likely reflects unmeasured design differences in aerodynamics, transmission efficiency, and manufacturing philosophy.

Is Ridge better than OLS?

Marginally, in terms of predictive accuracy (ΔR² ≈ +0.009). More importantly, Ridge is more stable — the polynomial terms introduce severe multicollinearity (condition number ~4.84 × 10¹²) that inflates OLS coefficient standard errors. Ridge's L2 penalty shrinks correlated coefficients toward each other, producing more reliable estimates without meaningfully sacrificing accuracy.
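To see how polynomial terms drive a condition number of this size, the sketch below builds a design matrix from a synthetic weight column and compares the condition number before and after standardisation. The data are invented, so the printed values will not match the notebook's 4.84 × 10¹².

```python
import numpy as np

rng = np.random.default_rng(4)
weight = rng.uniform(1600, 5000, size=300)   # pounds, roughly the dataset's range

# Intercept plus weight, weight^2, weight^3 on their raw scales.
X_raw = np.column_stack([np.ones_like(weight), weight, weight**2, weight**3])

# Standardise the non-constant columns (zero mean, unit variance).
poly = X_raw[:, 1:]
poly_std = (poly - poly.mean(axis=0)) / poly.std(axis=0)
X_std = np.column_stack([np.ones_like(weight), poly_std])

cond_raw = np.linalg.cond(X_raw)
cond_std = np.linalg.cond(X_std)
print(f"raw: {cond_raw:.2e}, standardised: {cond_std:.2e}")
```

Standardising removes the part of the problem caused by units, but the remaining condition number typically stays well above 1 because weight, weight², and weight³ are still highly correlated. That residual collinearity is what the L2 penalty is meant to tame.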

Limitations & Future Work

Current limitations:

The dataset covers only 1970–1982 and 398 cars — results may not generalise to modern vehicles, hybrids, or EVs
Mean imputation for the 6 missing horsepower values is simple; k-NN or multiple imputation would be more principled
Multicollinearity among engine features (cylinders, displacement, horsepower, weight) makes individual coefficient interpretation unreliable in the OLS model
car name was discarded entirely; manufacturer-level effects beyond regional origin are unexplored

Potential extensions:

Principal Component Regression (PCR) or Lasso for more robust feature selection under multicollinearity
Manufacturer-level fixed effects extracted from car name using regex or fuzzy matching
Ensemble methods (Random Forest, Gradient Boosting) for a predictive accuracy comparison at the cost of interpretability
Time-series analysis treating model year as a temporal index to better capture the regulatory trend
Modern dataset integration with post-1982 EPA fuel economy data to test whether the model year coefficient direction holds
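As a starting point for the manufacturer extension, the first word of car name can serve as a rough manufacturer label. This is only a sketch: the `manufacturer` helper and the `ALIASES` table are hypothetical, and a real pass would need a longer alias list to cover abbreviations and misspellings.

```python
import re

# Illustrative alias table; extend it after inspecting the actual name values.
ALIASES = {"vw": "volkswagen", "chevy": "chevrolet"}


def manufacturer(car_name: str) -> str:
    """Return the first word of a car name as a rough manufacturer label."""
    match = re.match(r"\s*([a-z\-]+)", car_name.lower())
    if not match:
        return "unknown"
    first_word = match.group(1)
    return ALIASES.get(first_word, first_word)


names = ["chevrolet chevelle malibu", "toyota corona mark ii", "vw rabbit"]
print([manufacturer(name) for name in names])
```

The resulting labels could then be one-hot encoded in the same way as origin, at the cost of many more dummy columns for a 398-row dataset.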

Acknowledgements

Dataset: Quinlan, R. (1993). Auto MPG Data Set. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/9/auto+mpg
Guided project framework: Maven Analytics — Predicting Fuel Economy project brief
Built as part of a personal data science portfolio to demonstrate end-to-end regression modelling with interpretability focus

Last updated: March 2026
