- Project Overview
- Dataset
- Project Structure
- Methodology
- Key Results
- Installation & Usage
- Dependencies
- Key Findings
- Limitations & Future Work
- Acknowledgements
The 1970s oil crisis forced a fundamental rethinking of automotive design. For the first time, fuel efficiency became a regulatory, commercial, and consumer priority — and the data from that era tells a fascinating story.
This project uses the UCI Auto MPG dataset to build an interpretable multiple linear regression model that predicts a car's fuel efficiency (miles per gallon) from its mechanical and physical characteristics. The goal goes beyond prediction accuracy: we want to understand the model, quantify the impact of each feature, and draw conclusions about what actually drove fuel economy improvements during this pivotal decade.
Core question:
Given a car's engine size, weight, horsepower, acceleration, model year, and region of origin — can we accurately predict its fuel efficiency in miles per gallon?
Success criterion: Test R² ≥ 0.80 (the model must explain at least 80% of variance in unseen data).
| Property | Value |
|---|---|
| Name | Auto MPG |
| Source | UCI Machine Learning Repository |
| Records | 398 cars |
| Period | 1970–1982 |
| Target | mpg — Miles Per Gallon (continuous) |
| Feature | Type | Description |
|---|---|---|
| `mpg` | Float | Miles per gallon; the prediction target |
| `cylinders` | Integer | Number of engine cylinders (3, 4, 5, 6, or 8) |
| `displacement` | Float | Engine displacement in cubic inches |
| `horsepower` | Float | Engine horsepower (6 values disguised as '?') |
| `weight` | Integer | Vehicle weight in pounds |
| `acceleration` | Float | Time (seconds) from 0 to 60 mph |
| `model year` | Integer | Last two digits of the model year (70–82) |
| `origin` | Categorical | Manufacturing region: USA, Europe, or Japan |
| `car name` | String | Vehicle identifier; dropped before modelling |
- `horsepower` was stored as `object` dtype with 6 missing values encoded as `'?'`. Fixed via `pd.to_numeric(errors='coerce')` and mean imputation.
- `origin` was encoded as integers (1, 2, 3). Mapped to meaningful labels and cast to `category` dtype with explicit ordering (`USA` as reference level for one-hot encoding).
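The two fixes above can be sketched on a toy frame. This is a minimal illustration rather than the notebook's exact code: the sample values are made up, and the integer-to-region mapping (1 = USA, 2 = Europe, 3 = Japan) is an assumption, since the README does not spell it out.

```python
import pandas as pd

# Toy frame standing in for auto-mpg.csv; values are illustrative only.
df = pd.DataFrame({
    "horsepower": ["130", "165", "?", "150"],
    "origin": [1, 3, 2, 1],
})

# '?' cannot be parsed as a number, so errors="coerce" turns it into NaN.
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")

# Mean imputation: fill the NaN with the mean of the observed values.
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].mean())

# ASSUMPTION: integer codes map to regions as 1=USA, 2=Europe, 3=Japan.
origin_labels = {1: "USA", 2: "Europe", 3: "Japan"}
df["origin"] = pd.Categorical(
    df["origin"].map(origin_labels),
    categories=["USA", "Europe", "Japan"],  # USA listed first, so it becomes the reference level
)
```

Listing `USA` first matters later: dropping the first category during one-hot encoding then leaves USA as the baseline.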
predicting-fuel-economy/
│
├── Predicting_Fuel_Economy.ipynb # Main analysis notebook
├── auto-mpg.csv # Raw dataset
└── README.md # This file
The notebook is organised into 14 sequential sections that follow a complete machine learning workflow:
1. Library Imports
2. Data Loading
3. Data Cleaning
├── 3.1 Fix origin column (int → category with explicit ordering)
├── 3.2 Fix horsepower column ('?' → NaN → mean imputation)
└── 3.3 Final data quality check
4. Exploratory Data Analysis (EDA)
├── 4.1 Summary statistics with interpretation
├── 4.2 Target distribution (histogram + box plot)
├── 4.3 Feature relationships (pair plot, coloured by origin)
├── 4.4 Fuel economy by region of origin
└── 4.5 Correlation heatmap — answers "which feature correlates most?"
5. Feature Engineering
├── 5.1 Polynomial terms (weight², weight³, hp², hp³, acc², acc³)
└── 5.2 One-hot encoding for origin (USA as reference level)
6. Train-Test Split (80/20, random_state=1)
7. Baseline Model — weight only (OLS, single feature)
8. Multiple Regression with K-Fold Cross-Validation
├── 8.1 5-fold CV on training set
└── 8.2 Refit final model on full training data
9. OLS Assumption Checks
├── Residuals vs Fitted plot
└── Normal Q-Q plot
10. Model Summary & Coefficient Interpretation
11. Test Set Evaluation
12. Predicting Fuel Economy for New Cars
13. Ridge Regression (Bonus)
14. Conclusion
Standard pandas cleaning pipeline with deliberate decisions at each step — to_numeric with errors='coerce' to handle the disguised missing values, mean imputation to preserve all 398 rows, and explicit category ordering to control which level becomes the dummy variable reference.
Visual and quantitative exploration to answer three questions before touching the model: What does the target distribution look like? Which features have the strongest relationships with mpg? How severe is the multicollinearity between predictors?
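The second and third questions can both be read off a correlation matrix. A minimal sketch on made-up rows (the real analysis uses the full dataset, so these numbers carry no meaning beyond showing the mechanics):

```python
import pandas as pd

# Toy values for illustration only.
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0, 27.0, 22.0],
    "weight": [3504, 3693, 1800, 2300, 2900],
    "horsepower": [130.0, 165.0, 58.0, 88.0, 95.0],
    "acceleration": [12.0, 11.5, 16.0, 15.0, 14.5],
})

corr = df.corr()  # Pearson correlation by default

# Features ranked by absolute correlation with the target.
ranked = corr["mpg"].drop("mpg").abs().sort_values(ascending=False)

# Correlations among the predictors themselves hint at multicollinearity.
predictor_corr = corr.drop(index="mpg", columns="mpg")
```

Passing `corr` to `seaborn.heatmap` would give the heatmap view listed in Section 4.5 of the notebook outline.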
The pair plot reveals clear non-linear (curved) relationships between mpg and weight, horsepower, and acceleration. To capture this curvature within a linear model framework, polynomial terms (squared and cubic) are added for all three features. This allows the model to fit curves while remaining fully interpretable via OLS.
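A minimal sketch of adding the squared and cubic terms with plain pandas. The column suffixes `_2` and `_3` are hypothetical names chosen for this example, and the three rows are illustrative values.

```python
import pandas as pd

df = pd.DataFrame({
    "weight": [3504.0, 2130.0, 4312.0],
    "horsepower": [130.0, 88.0, 215.0],
    "acceleration": [12.0, 14.5, 8.5],
})

# Add squared and cubic versions of each feature that showed curvature.
# The "_2" / "_3" suffixes are hypothetical naming, not the notebook's.
for col in ["weight", "horsepower", "acceleration"]:
    df[f"{col}_2"] = df[col] ** 2
    df[f"{col}_3"] = df[col] ** 3
```

The model stays linear in its coefficients, which is why OLS can still fit it even though the fitted relationship with each raw feature is a curve.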
One-hot encoding transforms origin into two binary columns (origin_Europe, origin_Japan), with origin_USA as the reference level — so coefficients are interpreted as mpg differences relative to US-made cars, all else equal.
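One way to get this encoding, sketched with `pd.get_dummies`. It assumes `origin` is already a categorical with `USA` as its first category. Whether the notebook uses `get_dummies` or another encoder is not stated here.

```python
import pandas as pd

origin = pd.Categorical(
    ["USA", "Japan", "Europe", "USA"],
    categories=["USA", "Europe", "Japan"],
)
df = pd.DataFrame({"origin": origin})

# drop_first=True drops the first category (USA), making it the reference level.
dummies = pd.get_dummies(df, columns=["origin"], drop_first=True, dtype=int)
```

A US-made car is the row where both `origin_Europe` and `origin_Japan` are 0.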
5-fold cross-validation on the training set provides a low-variance estimate of generalisation performance before the test set is ever touched. The model is then refit on the full training data, and a single final evaluation is performed on the held-out test set.
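The split, cross-validate, refit, evaluate sequence can be sketched as follows. Synthetic data stands in for the engineered feature matrix. The 80/20 split and `random_state=1` come from the notebook outline, while shuffling the folds and seeding `KFold` are assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data standing in for the real features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out 20% as a test set that is not touched until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# 5-fold CV on the training set only (shuffle and seed are assumptions).
model = LinearRegression()
cv = KFold(n_splits=5, shuffle=True, random_state=1)
cv_r2 = cross_val_score(model, X_train, y_train, cv=cv, scoring="r2")

# Refit on the full training data, then evaluate once on the test set.
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
```

`cv_r2.mean()` and `cv_r2.std()` give the kind of figures reported in the cross-validation table.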
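Residuals vs Fitted plot checks for homoscedasticity; Normal Q-Q plot checks for normality of residuals. Both are interpreted in context: mild tail heaviness is noted, but it does not invalidate the coefficient estimates, because the Gauss-Markov theorem does not require normally distributed errors for OLS to be the best linear unbiased estimator.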
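The quantities behind both diagnostic plots can be computed without drawing anything. This sketch uses synthetic data and scikit-learn's `LinearRegression` as a stand-in for the fitted OLS model, so it shows the mechanics only.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data with well-behaved (normal, constant-variance) noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(size=200)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted  # plotted against `fitted` for the Residuals vs Fitted check

# Q-Q data: theoretical normal quantiles vs ordered residuals.
# `r` is the correlation of the Q-Q points; values near 1 suggest normality.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)
```

A scatter of `fitted` against `residuals` with no funnel shape is consistent with constant variance, and Q-Q points bending away from the straight line at the ends indicate heavy tails.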
The large condition number in the OLS summary (~4.84 × 10¹²) signals multicollinearity from the polynomial terms. Ridge regression with RidgeCV addresses this by adding an L2 penalty, with StandardScaler applied beforehand (required because Ridge penalises all coefficients equally — unscaled features would be penalised unfairly).
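A minimal sketch of the scale-then-regularise step on deliberately collinear polynomial features. The data are synthetic and the `alphas` grid is an assumed search range, not the one used in the notebook.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic single feature plus its square and cube: highly collinear columns.
rng = np.random.default_rng(0)
x = rng.uniform(1.5, 5.0, size=200)
X = np.column_stack([x, x**2, x**3])
y = 45 - 8 * x + 0.5 * x**2 + rng.normal(scale=1.0, size=200)

# ASSUMPTION: candidate penalty strengths chosen for this sketch.
alphas = np.logspace(-3, 3, 13)

# Scaling inside the pipeline puts every column on the same footing
# before the L2 penalty is applied.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
ridge.fit(X, y)

best_alpha = ridge.named_steps["ridgecv"].alpha_
```

Keeping the scaler inside the pipeline also means it is fit on training data only whenever the pipeline is cross-validated, which avoids leaking test-set statistics.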
| Model | Test R² | Test MAE | Notes |
|---|---|---|---|
| Baseline (weight only) | 0.695 | ~3.1 mpg | Single-feature OLS |
| OLS — Multiple Regression | 0.877 | 1.94 mpg | 14 features + polynomial terms |
| Ridge — Multiple Regression | 0.886 | 1.87 mpg | Same features, L2 regularisation |
The final OLS model exceeds the project target of R² ≥ 0.80, explaining 87.7% of the variance in fuel economy on held-out test data. Ridge provides a marginal improvement (+0.009 R²) but more importantly offers greater coefficient stability due to better handling of multicollinearity.
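For reference, the two metrics in the table can be computed with scikit-learn as sketched below. The four true and predicted values are made up so the arithmetic is easy to follow by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up mpg values for illustration.
y_true = np.array([18.0, 25.0, 31.0, 14.0])
y_pred = np.array([20.0, 24.0, 29.0, 15.0])

# MAE: mean of |error| = (2 + 1 + 2 + 1) / 4
mae = mean_absolute_error(y_true, y_pred)

# R^2: 1 - SS_res / SS_tot = 1 - 10 / 170
r2 = r2_score(y_true, y_pred)
```

MAE is in the same units as the target (mpg), which makes it the easier of the two to communicate.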
| Metric | CV Mean | CV Std |
|---|---|---|
| R² | ~0.857 | ±0.015 |
| MAE | ~2.0 mpg | ±0.1 mpg |
The low standard deviation across folds suggests the model is stable and is not overfitting to any particular subset of the training data.
git clone https://github.com/yourusername/predicting-fuel-economy.git
cd predicting-fuel-economy
pip install -r requirements.txt
jupyter notebook Predicting_Fuel_Economy.ipynb
Run all cells from top to bottom. The notebook is designed to be executed sequentially, with each section building on the previous one.
Note: The dataset file `auto-mpg.csv` must be in the same directory as the notebook. The `pd.read_csv("auto-mpg.csv")` call in Section 2 expects it there.
numpy>=1.21
pandas>=1.3
matplotlib>=3.4
seaborn>=0.12
scipy>=1.7
scikit-learn>=1.0
statsmodels>=0.13
jupyter>=1.0
Install all at once:
pip install numpy pandas matplotlib seaborn scipy scikit-learn statsmodels jupyter
Developed and tested on Python 3.10+. Compatible with Python 3.8 and above.
weight — with a correlation of r = −0.832, it is the single strongest mechanical predictor of fuel economy. Heavier cars require more energy to accelerate and maintain speed, directly reducing efficiency.
Each additional model year is associated with approximately +0.86 MPG, holding all other features constant (coefficient = 0.8573, p < 0.001, t = 16.19). This is the most statistically significant coefficient in the model and reflects the decade of regulatory pressure following the 1973 oil crisis — CAFE standards and Clean Air Act amendments were systematically pushing manufacturers toward more efficient designs year over year.
In practical terms: a mechanically identical 1982 car is predicted to get ~10 MPG more than the same car from 1970 (0.8573 × 12 ≈ 10.3 mpg).
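The arithmetic behind that figure, written out as a short check using the coefficient reported above:

```python
coef_model_year = 0.8573    # OLS coefficient for model year (mpg per year)
years_apart = 1982 - 1970   # 12 model years

# Predicted mpg difference for two otherwise identical cars.
predicted_gain = coef_model_year * years_apart
print(f"{predicted_gain:.1f} mpg")  # prints "10.3 mpg"
```

This holds only under the model's assumption that the year effect is linear and the same for every car.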
Yes — beyond what the mechanical specs capture, European cars average +1.3 MPG and Japanese cars average +1.1 MPG relative to equivalent US-made cars (both significant at p < 0.05). This likely reflects unmeasured design differences in aerodynamics, transmission efficiency, and manufacturing philosophy.
Marginally, in terms of predictive accuracy (ΔR² ≈ +0.009). More importantly, Ridge is more stable — the polynomial terms introduce severe multicollinearity (condition number ~4.84 × 10¹²) that inflates OLS coefficient standard errors. Ridge's L2 penalty shrinks correlated coefficients toward each other, producing more reliable estimates without meaningfully sacrificing accuracy.
Current limitations:
- The dataset covers only 1970–1982 and 398 cars — results may not generalise to modern vehicles, hybrids, or EVs
- Mean imputation for the 6 missing horsepower values is simple; k-NN or multiple imputation would be more principled
- Multicollinearity among engine features (cylinders, displacement, horsepower, weight) makes individual coefficient interpretation unreliable in the OLS model
- `car name` was discarded entirely; manufacturer-level effects beyond regional origin are unexplored
Potential extensions:
- Principal Component Regression (PCR) or Lasso for more robust feature selection under multicollinearity
- Manufacturer-level fixed effects extracted from `car name` using regex or fuzzy matching
- Ensemble methods (Random Forest, Gradient Boosting) for a predictive accuracy comparison at the cost of interpretability
- Time-series analysis treating model year as a temporal index to better capture the regulatory trend
- Modern dataset integration with post-1982 EPA fuel economy data to test whether the model year coefficient direction holds
- Dataset: Quinlan, R. (1993). Auto MPG Data Set. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/9/auto+mpg
- Guided project framework: Maven Analytics — Predicting Fuel Economy project brief
- Built as part of a personal data science portfolio to demonstrate end-to-end regression modelling with interpretability focus
Last updated: March 2026