A complete implementation of Linear Regression from scratch, with the model and its gradient descent training written in NumPy only, plus a full machine learning pipeline covering data ingestion, preprocessing, training, evaluation, and visualization.
- Overview
- Features
- Project Structure
- Installation
- Usage
- Implementation Details
- Pipeline Architecture
- Results
- Mathematical Foundation
- Contributing
- License
This project demonstrates a complete end-to-end machine learning pipeline for Linear Regression, built from scratch in Python and NumPy. Writing the model by hand, rather than calling a pre-built estimator, gives direct insight into:
- How gradient descent optimization works
- The mathematics behind linear regression
- Building production-ready ML pipelines
- Best practices in code organization and documentation
Dataset: Boston Housing Dataset (506 samples, 13 features)
- Target: Median house value (MEDV)
- Key Features: RM (average rooms per dwelling), LSTAT (% lower-status population), PTRATIO (pupil-teacher ratio), and more
- Linear Regression from Scratch: No sklearn for model training, pure NumPy implementation
- Gradient Descent Optimization: Custom implementation with learning curve tracking
- Complete Data Pipeline: Ingestion → Preprocessing → Training → Evaluation
- Feature Scaling: StandardScaler for normalization
- Comprehensive Metrics: MSE, RMSE, MAE, R² Score
- Rich Visualizations: Learning curves, residual plots, prediction vs actual
- Modular Design: Clean, reusable, well-documented code
- Multiple visualization types for model analysis
- Configurable hyperparameters (learning rate, iterations)
- Training progress tracking with cost history
- Professional-grade plots with seaborn styling
- Extensive documentation and docstrings
LinearRegressionModel/
├── config/
│   └── config.yaml                  # Configuration parameters
├── src/
│   ├── __init__.py
│   ├── linear_regression.py         # Core Linear Regression implementation
│   ├── data_ingestion.py            # Data loading and sanity checks
│   ├── data_preprocessing.py        # Train/test split and scaling
│   ├── model_training.py            # Model training orchestration
│   ├── model_evaluation.py          # Performance metrics calculation
│   ├── prediction.py                # Prediction utilities
│   └── visualise.py                 # Visualization functions
├── notebooks/
│   └── LinearRegressionModel.ipynb  # Jupyter notebook version
├── main.py                          # Main pipeline execution script
├── requirements.txt                 # Python dependencies
└── README.md                        # This file
- Python 3.8 or higher
- pip package manager
git clone https://github.com/iamhero2709/LinearRegressionModel.git
cd LinearRegressionModel
# On Linux/Mac
python -m venv venv
source venv/bin/activate
# On Windows
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
Execute the entire end-to-end pipeline with a single command:
python main.py
This will:
- Load the Boston Housing dataset
- Perform data sanity checks
- Preprocess and split data (train/test)
- Train the Linear Regression model
- Evaluate performance metrics
- Generate visualizations
- Display predictions
================================================================================
LINEAR REGRESSION FROM SCRATCH - END-TO-END PIPELINE
================================================================================
STEP 1: DATA INGESTION
--------------------------------------------------------------------------------
================================================================================
DATA SANITY CHECK
================================================================================
...
STEP 2: DATA PREPROCESSING
--------------------------------------------------------------------------------
Training set size: 404 samples
Testing set size: 102 samples
Features scaled using StandardScaler
...
STEP 3: MODEL TRAINING
--------------------------------------------------------------------------------
Model training completed!
Final cost: 10.8234
...
STEP 4: MODEL EVALUATION
--------------------------------------------------------------------------------
Training Set Performance:
MSE : 21.6468
RMSE : 4.6525
MAE : 3.2891
R2 : 0.7408
Test Set Performance:
MSE : 24.2910
RMSE : 4.9286
MAE : 3.3411
R2 : 0.6685
...
Final Results Summary:
Test R² Score: 0.6685
Test RMSE: 4.9286
Test MAE: 3.3411
Alternatively, explore the implementation interactively:
jupyter notebook notebooks/LinearRegressionModel.ipynb
The core implementation uses gradient descent to learn the optimal parameters.
class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []

    def fit(self, X, y):
        """Train the model using gradient descent"""
        # Initialize parameters
        # Perform gradient descent
        # Track cost history

    def predict(self, X):
        """Make predictions"""
        return X @ self.weights + self.bias

Key Methods:
- `fit(X, y)`: Trains the model using gradient descent
- `predict(X)`: Makes predictions on new data
- `compute_cost(y_true, y_pred)`: Calculates the cost J(θ), which is half the MSE
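The skeleton above leaves the body of `fit` as comments. A minimal sketch of how those three steps could be filled in is shown below. It follows the cost function and update rule given in this README, but the zero initialisation and the synthetic demo data are assumptions made for illustration, not necessarily what `src/linear_regression.py` does.

```python
import numpy as np


class LinearRegression:
    """Linear regression trained with batch gradient descent (illustrative sketch)."""

    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []

    def compute_cost(self, y_true, y_pred):
        # J = (1 / 2m) * sum((y_pred - y_true)^2), i.e. half the MSE.
        m = len(y_true)
        return float(np.sum((y_pred - y_true) ** 2) / (2 * m))

    def fit(self, X, y):
        m, n = X.shape
        # Assumption: parameters start at zero; the repository may differ.
        self.weights = np.zeros(n)
        self.bias = 0.0
        self.cost_history = []
        for _ in range(self.n_iterations):
            y_pred = X @ self.weights + self.bias
            error = y_pred - y
            # Record the cost of the parameters used for this prediction.
            self.cost_history.append(self.compute_cost(y, y_pred))
            # Gradients of J with respect to the weights and the bias.
            grad_w = (X.T @ error) / m
            grad_b = error.mean()
            self.weights -= self.learning_rate * grad_w
            self.bias -= self.learning_rate * grad_b
        return self

    def predict(self, X):
        return X @ self.weights + self.bias


# Synthetic, noise-free demo data: y = 3*x0 - 2*x1 + 5.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0
model = LinearRegression(learning_rate=0.1, n_iterations=500).fit(X_demo, y_demo)
```

On this toy problem the learned weights and bias should approach the generating values, and `model.cost_history` gives the data for a learning curve.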
Data Ingestion (src/data_ingestion.py):
- Fetches Boston Housing dataset from OpenML
- Performs comprehensive sanity checks
- Validates data integrity
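The kind of sanity checks described for the ingestion step could look roughly like the sketch below. The helper name `sanity_check`, the particular checks, and the tiny demo array are hypothetical; the actual checks in `src/data_ingestion.py` are not listed in this README.

```python
import numpy as np


def sanity_check(X, y):
    """Return a small report of basic integrity checks (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return {
        "n_samples": X.shape[0],
        "n_features": X.shape[1],
        # Count NaNs in both the features and the target.
        "missing_values": int(np.isnan(X).sum() + np.isnan(y).sum()),
        # Rows that repeat an earlier row exactly.
        "duplicate_rows": int(X.shape[0] - np.unique(X, axis=0).shape[0]),
        # Features and target must describe the same number of samples.
        "shapes_match": X.shape[0] == y.shape[0],
    }


# Tiny made-up array standing in for the real dataset.
X_demo = np.array([[1.0, 2.0], [1.0, 2.0], [3.0, np.nan]])
y_demo = np.array([10.0, 10.0, 12.0])
report = sanity_check(X_demo, y_demo)
```

Returning a dictionary rather than printing keeps the check easy to assert on in tests or to log from `main.py`.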
Data Preprocessing (src/data_preprocessing.py):
- Splits features and target variable
- Creates train/test split (80/20 by default)
- Applies StandardScaler normalization
- Ensures reproducibility with random seed
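The project uses scikit-learn's StandardScaler for scaling. As a rough NumPy-only stand-in that shows what the split and scaling steps compute, consider the sketch below. The function names and the seed value 42 are assumptions for illustration. The test size is rounded up so that 506 samples give the 404/102 split shown in the sample output.

```python
import numpy as np


def train_test_split_np(X, y, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed and take the first block as the test set."""
    rng = np.random.default_rng(seed)  # fixed seed makes the split reproducible
    indices = rng.permutation(X.shape[0])
    n_test = int(np.ceil(X.shape[0] * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


def standardize(X_train, X_test):
    """Scale both sets with the mean and std computed on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)  # population std (ddof=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # avoid dividing by zero
    return (X_train - mu) / sigma, (X_test - mu) / sigma


# Placeholder data with the same number of rows as the Boston dataset.
X_demo = np.arange(506 * 3, dtype=float).reshape(506, 3)
y_demo = np.arange(506, dtype=float)
X_tr, X_te, y_tr, y_te = train_test_split_np(X_demo, y_demo)
X_tr_s, X_te_s = standardize(X_tr, X_te)
```

Computing the mean and standard deviation on the training rows only keeps information from the test set out of the preprocessing step.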
Model Training (src/model_training.py):
- Orchestrates the training process
- Configurable hyperparameters
- Tracks and displays training progress
Model Evaluation (src/model_evaluation.py):
- Calculates multiple metrics (MSE, RMSE, MAE, R²)
- Evaluates both training and test sets
- Detects overfitting automatically
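The four metrics can be computed directly from predictions, as in the sketch below. The function names are hypothetical, and the overfitting check shown here (the gap between training and test R²) is only one plausible approach; this README does not say how the project detects overfitting.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))                  # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total variation
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(np.mean(np.abs(residuals))),
        "R2": 1.0 - ss_res / ss_tot,
    }


def r2_gap(train_metrics, test_metrics):
    """Train R2 minus test R2; a large positive gap hints at overfitting."""
    return train_metrics["R2"] - test_metrics["R2"]


y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 9.0])
metrics = regression_metrics(y_true_demo, y_pred_demo)

# Gap computed from the R2 values reported in this README.
gap = r2_gap({"R2": 0.7408}, {"R2": 0.6685})
```

What counts as a "large" gap is a judgment call; any threshold would be an additional assumption.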
Generates professional-quality plots:
- Learning Curve: Cost vs iterations
- Predictions vs Actual: Scatter plot with perfect prediction line
- Residual Analysis: Residual plot and distribution
┌─────────────────┐
│ Data Ingestion  │
│  (Boston Data)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Data Inspection │
│ (Sanity Check)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Preprocessing  │
│  • Train/Test   │
│  • Scaling      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Model Training  │
│ • Initialize θ  │
│ • Grad Descent  │
│ • Cost Tracking │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Evaluation    │
│  • MSE, RMSE    │
│  • MAE, R²      │
│  • Overfitting  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Visualization  │
│ • Learning Curve│
│ • Pred vs Act   │
│ • Residuals     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Predictions   │
│  (New Samples)  │
└─────────────────┘
| Metric | Training Set | Test Set |
|---|---|---|
| MSE | 21.65 | 24.29 |
| RMSE | 4.65 | 4.93 |
| MAE | 3.29 | 3.34 |
| R² Score | 0.74 | 0.67 |
- Good R² Score (0.67): The model explains ~67% of the variance in the test data
- No Severe Overfitting: Training and test metrics are close (R² 0.74 vs 0.67, RMSE 4.65 vs 4.93)
- Reasonable Error: Test RMSE of ~4.93, i.e. about $4,930, since MEDV is in $1000s
- Improvement Possible: Could benefit from feature engineering or polynomial features
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^T x$$
Where:
- $\theta$ = parameters (weights + bias)
- $x$ = input features
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
Where:
- $m$ = number of training examples
- $h_\theta(x^{(i)})$ = predicted value
- $y^{(i)}$ = actual value
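$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
Where:
- $\alpha$ = learning rate
- $\partial J(\theta) / \partial \theta_j$ = gradient of cost function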
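The second form of the update rule follows from the first by differentiating the cost function with the chain rule. A short derivation, assuming the usual convention $x_0^{(i)} = 1$ so that the bias $\theta_0$ is handled like any other parameter:

```latex
\frac{\partial J(\theta)}{\partial \theta_j}
  = \frac{1}{2m} \sum_{i=1}^{m} 2 \left(h_\theta(x^{(i)}) - y^{(i)}\right)
    \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)},
\qquad \text{since } \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j} = x_j^{(i)}.
```

The factor of 2 from the square cancels the 1/2 in the cost, which is why the cost is defined with 1/(2m) rather than 1/m.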
$$x_{\text{scaled}} = \frac{x - \mu}{\sigma}$$
Where:
- $\mu$ = mean of feature
- $\sigma$ = standard deviation of feature
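As a small worked example with made-up values, take a feature with the three observations 2, 4 and 6, and use the population standard deviation (dividing by the number of samples, as StandardScaler does):

```latex
\mu = \frac{2 + 4 + 6}{3} = 4, \qquad
\sigma = \sqrt{\frac{(2-4)^2 + (4-4)^2 + (6-4)^2}{3}} = \sqrt{\tfrac{8}{3}} \approx 1.633

x_{\text{scaled}} = \left( \frac{2 - 4}{1.633},\ \frac{4 - 4}{1.633},\ \frac{6 - 4}{1.633} \right)
\approx (-1.225,\ 0,\ 1.225)
```

After scaling, the feature has mean 0 and standard deviation 1, which puts all features on a comparable scale for gradient descent.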
- PEP 8 Compliant: Follows Python style guidelines
- Comprehensive Docstrings: Every function documented
- Type Hints: Clear parameter and return types
- Modular Design: Separation of concerns
- Error Handling: Robust exception management
- Clean Code: Readable and maintainable
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Add support for polynomial features
- Implement regularization (Ridge, Lasso)
- Add more visualization types
- Improve documentation
- Add unit tests
- Support for other datasets
This project is licensed under the MIT License - see the LICENSE file for details.
iamhero2709
- GitHub: @iamhero2709
- Boston Housing Dataset: Harrison, D. and Rubinfeld, D.L. (1978)
- OpenML: For providing easy access to datasets
- NumPy: For numerical computing capabilities
- scikit-learn: For preprocessing utilities and metrics
- Andrew Ng - Machine Learning Course (Coursera)
- "Pattern Recognition and Machine Learning" - Christopher Bishop
- "The Elements of Statistical Learning" - Hastie, Tibshirani, Friedman
⭐ Star this repo if you find it helpful!
Made with ❤️ by iamhero2709