🚀 Linear Regression from Scratch - Complete End-to-End Pipeline

A complete implementation of Linear Regression from scratch: the model and its gradient descent training use only NumPy, inside a full machine learning pipeline covering data ingestion, preprocessing, training, evaluation, and visualization.



🎯 Overview

This project demonstrates a complete end-to-end machine learning pipeline for Linear Regression, with the model itself built from scratch in Python and NumPy. Compared with calling a pre-built estimator, implementing it by hand gives insight into:

  • How gradient descent optimization works
  • The mathematics behind linear regression
  • Building production-ready ML pipelines
  • Best practices in code organization and documentation

Dataset: Boston Housing Dataset (506 samples, 13 features)

  • Target: Median house value (MEDV)
  • Key Features: RM (average rooms per dwelling), LSTAT (% lower-status population), PTRATIO (pupil-teacher ratio), and more

✨ Features

Core Components

  • ✅ Linear Regression from Scratch: No sklearn for model training, pure NumPy implementation
  • ✅ Gradient Descent Optimization: Custom implementation with learning curve tracking
  • ✅ Complete Data Pipeline: Ingestion → Preprocessing → Training → Evaluation
  • ✅ Feature Scaling: StandardScaler for normalization
  • ✅ Comprehensive Metrics: MSE, RMSE, MAE, R² Score
  • ✅ Rich Visualizations: Learning curves, residual plots, prediction vs actual
  • ✅ Modular Design: Clean, reusable, well-documented code

Additional Features

  • 📊 Multiple visualization types for model analysis
  • 🔧 Configurable hyperparameters (learning rate, iterations)
  • 📈 Training progress tracking with cost history
  • 🎨 Professional-grade plots with seaborn styling
  • 📝 Extensive documentation and docstrings

📁 Project Structure

LinearRegressionModel/
├── config/
│   └── config.yaml              # Configuration parameters
├── src/
│   ├── __init__.py
│   ├── linear_regression.py     # Core Linear Regression implementation
│   ├── data_ingestion.py        # Data loading and sanity checks
│   ├── data_preprocessing.py    # Train/test split and scaling
│   ├── model_training.py        # Model training orchestration
│   ├── model_evaluation.py      # Performance metrics calculation
│   ├── prediction.py            # Prediction utilities
│   └── visualise.py             # Visualization functions
├── notebooks/
│   └── LinearRegressionModel.ipynb  # Jupyter notebook version
├── main.py                      # Main pipeline execution script
├── requirements.txt             # Python dependencies
└── README.md                    # This file

🔧 Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Step 1: Clone the Repository

git clone https://github.com/iamhero2709/LinearRegressionModel.git
cd LinearRegressionModel

Step 2: Create Virtual Environment (Recommended)

# On Linux/Mac
python -m venv venv
source venv/bin/activate

# On Windows
python -m venv venv
venv\Scripts\activate

Step 3: Install Dependencies

pip install -r requirements.txt

🚀 Usage

Run the Complete Pipeline

Execute the entire end-to-end pipeline with a single command:

python main.py

This will:

  1. ✅ Load the Boston Housing dataset
  2. ✅ Perform data sanity checks
  3. ✅ Preprocess and split data (train/test)
  4. ✅ Train the Linear Regression model
  5. ✅ Evaluate performance metrics
  6. ✅ Generate visualizations
  7. ✅ Display predictions

Expected Output

================================================================================
          LINEAR REGRESSION FROM SCRATCH - END-TO-END PIPELINE          
================================================================================

STEP 1: DATA INGESTION
--------------------------------------------------------------------------------
================================================================================
DATA SANITY CHECK
================================================================================
...

STEP 2: DATA PREPROCESSING
--------------------------------------------------------------------------------
Training set size: 404 samples
Testing set size: 102 samples
✓ Features scaled using StandardScaler
...

STEP 3: MODEL TRAINING
--------------------------------------------------------------------------------
✓ Model training completed!
  Final cost: 10.8234
...

STEP 4: MODEL EVALUATION
--------------------------------------------------------------------------------
Training Set Performance:
  MSE     : 21.6468
  RMSE    : 4.6525
  MAE     : 3.2891
  R2      : 0.7408

Test Set Performance:
  MSE     : 24.2910
  RMSE    : 4.9286
  MAE     : 3.3411
  R2      : 0.6685
...

📊 Final Results Summary:
  Test R² Score: 0.6685
  Test RMSE: 4.9286
  Test MAE: 3.3411

Using the Jupyter Notebook

Alternatively, explore the implementation interactively:

jupyter notebook notebooks/LinearRegressionModel.ipynb

🧠 Implementation Details

1. Linear Regression Class (src/linear_regression.py)

The core implementation uses Gradient Descent to learn optimal parameters.

class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []
    
    def fit(self, X, y):
        """Train the model using gradient descent"""
        # Initialize parameters
        # Perform gradient descent
        # Track cost history
    
    def predict(self, X):
        """Make predictions"""
        return X @ self.weights + self.bias

Key Methods:

  • fit(X, y): Trains the model using gradient descent
  • predict(X): Makes predictions on new data
  • compute_cost(y_true, y_pred): Calculates the cost J(θ), which is half the MSE (hence a final cost of 10.8234 against a training MSE of 21.6468)
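
The outline above leaves the body of fit() as comments. The following is a minimal sketch of how batch gradient descent could fill it in, not the repository's actual code. The class name LinearRegressionSketch, the zero initialization, and the synthetic demo data are assumptions made for illustration.

```python
import numpy as np


class LinearRegressionSketch:
    """Illustrative batch gradient descent; not the repository's actual code."""

    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []

    def compute_cost(self, y_true, y_pred):
        # Half mean squared error: J = (1/2m) * sum((y_pred - y_true)^2)
        return float(np.mean((y_pred - y_true) ** 2) / 2.0)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        m, n = X.shape
        # Assumption: parameters start at zero
        self.weights = np.zeros(n)
        self.bias = 0.0
        self.cost_history = []
        for _ in range(self.n_iterations):
            error = self.predict(X) - y
            # Gradients of J with respect to the weights and the bias
            grad_w = X.T @ error / m
            grad_b = error.mean()
            self.weights -= self.learning_rate * grad_w
            self.bias -= self.learning_rate * grad_b
            # Record the cost after each update to build the learning curve
            self.cost_history.append(self.compute_cost(y, self.predict(X)))
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.weights + self.bias


# Synthetic data (not the Boston dataset): y = 3*x1 - 2*x2 + 5 plus small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

model = LinearRegressionSketch(learning_rate=0.1, n_iterations=500).fit(X_demo, y_demo)
print(model.weights, model.bias, model.cost_history[-1])
```

On this synthetic data the learned weights and bias should land close to 3, -2, and 5, and cost_history should decrease, which is the curve the learning-curve plot draws.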

2. Data Pipeline

Data Ingestion (src/data_ingestion.py)

  • Fetches Boston Housing dataset from OpenML
  • Performs comprehensive sanity checks
  • Validates data integrity

Data Preprocessing (src/data_preprocessing.py)

  • Splits features and target variable
  • Creates train/test split (80/20 by default)
  • Applies StandardScaler normalization
  • Ensures reproducibility with random seed
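
As a rough illustration of the seeded 80/20 split, here is a NumPy-only sketch. The function name train_test_split_sketch, the seed value, and the placeholder arrays are assumptions; the project's own split may be implemented differently (for example with scikit-learn) and would select different rows.

```python
import numpy as np


def train_test_split_sketch(X, y, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and split them into train/test sets."""
    X = np.asarray(X)
    y = np.asarray(y)
    n_samples = X.shape[0]
    n_test = int(np.ceil(test_size * n_samples))  # round the test share up
    indices = np.random.default_rng(seed).permutation(n_samples)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Placeholder arrays with the dataset's shape (506 samples, 13 features)
X_demo = np.arange(506 * 13, dtype=float).reshape(506, 13)
y_demo = np.arange(506, dtype=float)

X_train, X_test, y_train, y_test = train_test_split_sketch(X_demo, y_demo)
print(X_train.shape, X_test.shape)  # (404, 13) (102, 13)
```

Rounding the test share up gives 102 test rows and 404 training rows for 506 samples, which matches the set sizes in the expected output. Reusing the same seed reproduces the same split.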

3. Training & Evaluation

Model Training (src/model_training.py)

  • Orchestrates the training process
  • Configurable hyperparameters
  • Tracks and displays training progress

Model Evaluation (src/model_evaluation.py)

  • Calculates multiple metrics (MSE, RMSE, MAE, R²)
  • Evaluates both training and test sets
  • Detects overfitting automatically
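
The four metrics can be computed directly with NumPy, as in the sketch below. The names regression_metrics and looks_overfit are hypothetical, and the 0.1 gap threshold is an arbitrary assumption: this README does not say how the project detects overfitting.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R² for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))                 # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total variation
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(np.mean(np.abs(residuals))),
        "R2": 1.0 - ss_res / ss_tot,
    }


def looks_overfit(train_r2, test_r2, max_gap=0.1):
    # Hypothetical rule: flag the model when train R² exceeds test R² by more than max_gap
    return (train_r2 - test_r2) > max_gap


metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(metrics)
print(looks_overfit(0.7408, 0.6685))  # R² values from the expected output
```

With the reported R² values the gap is about 0.07, so this particular rule would not flag overfitting.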

4. Visualization (src/visualise.py)

Generates professional-quality plots:

  • Learning Curve: Cost vs iterations
  • Predictions vs Actual: Scatter plot with perfect prediction line
  • Residual Analysis: Residual plot and distribution

🏗️ Pipeline Architecture

┌─────────────────┐
│  Data Ingestion │
│   (Boston Data) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Data Inspection │
│  (Sanity Check) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Preprocessing   │
│ • Train/Test    │
│ • Scaling       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Model Training  │
│ • Initialize θ  │
│ • Grad Descent  │
│ • Cost Tracking │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Evaluation    │
│ • MSE, RMSE     │
│ • MAE, R²       │
│ • Overfitting   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Visualization   │
│ • Learning Curve│
│ • Pred vs Act   │
│ • Residuals     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Predictions    │
│ (New Samples)   │
└─────────────────┘

📊 Results

Performance Metrics

| Metric   | Training Set | Test Set |
|----------|--------------|----------|
| MSE      | 21.65        | 24.29    |
| RMSE     | 4.65         | 4.93     |
| MAE      | 3.29         | 3.34     |
| R² Score | 0.74         | 0.67     |

Key Insights

  • ✅ Good R² Score (0.67): The model explains ~67% of the variance in the test data
  • ✅ No Severe Overfitting: Training R² (0.74) is only modestly higher than test R² (0.67), and RMSE rises from 4.65 to 4.93
  • ✅ Reasonable Error: Test RMSE of ~4.93, i.e. about $4,930, since house values are in $1000s
  • ⚠️ Improvement Possible: Could benefit from feature engineering or polynomial features

📐 Mathematical Foundation

1. Hypothesis Function

h(x) = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ
     = θᵀx

Where:

  • θ = parameters (bias θ₀ and weights θ₁ ... θₙ)
  • x = input features, with x₀ = 1 so that θᵀx includes the bias term
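
The two forms of the hypothesis give the same predictions. The short sketch below checks this with made-up parameter values (not learned ones): once with weights and bias kept separate, as predict() does, and once with a column of ones prepended so that everything is a single product with θ.

```python
import numpy as np

# Illustrative values, not learned parameters
weights = np.array([2.0, -1.0])   # θ₁, θ₂
bias = 0.5                        # θ₀
X = np.array([[1.0, 3.0],
              [4.0, 2.0]])

# Form 1: weights and bias kept separate, as in predict()
pred_separate = X @ weights + bias

# Form 2: prepend a column of ones (x₀ = 1) so that h(x) = θᵀx
theta = np.concatenate(([bias], weights))
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
pred_theta = X_aug @ theta

print(pred_separate, pred_theta)  # both are [-0.5, 6.5]
```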

2. Cost Function (Mean Squared Error)

J(θ) = (1/2m) Σ(hθ(xⁱ) - yⁱ)²

Where:

  • m = number of training examples
  • hθ(xⁱ) = predicted value
  • yⁱ = actual value
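
Because of the 1/2m factor, J(θ) is half of the MSE reported as a metric. The sketch below is a small worked example on three made-up points; the function body is an assumption about how compute_cost could be written, not the repository's code.

```python
import numpy as np


def compute_cost(y_true, y_pred):
    """J(θ) = (1/2m) Σ (prediction - target)², i.e. half the MSE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    m = y_true.shape[0]
    return float(np.sum((y_pred - y_true) ** 2) / (2 * m))


y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([2.0, 2.0, 5.0])

cost = compute_cost(y_true, y_pred)            # (1 + 0 + 4) / (2 * 3) = 5/6
mse = float(np.mean((y_pred - y_true) ** 2))   # (1 + 0 + 4) / 3 = 5/3
print(cost, mse)
```

The halving does not change where the minimum is; it only cancels the factor of 2 that appears when the squared error is differentiated.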

3. Gradient Descent Update Rule

θⱼ := θⱼ - α × (∂J(θ)/∂θⱼ)
θⱼ := θⱼ - α × (1/m) Σ(hθ(xⁱ) - yⁱ) × xⱼⁱ

Where:

  • α = learning rate
  • ∂J(θ)/∂θⱼ = gradient of cost function
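
One way to check the update rule is to compare the analytic gradient (1/m) Σ(hθ(xⁱ) - yⁱ) × xⱼⁱ with a finite-difference estimate of ∂J/∂θⱼ. The sketch below does that on random data and then applies a single update step. All function names, the data, and α = 0.1 are illustrative assumptions.

```python
import numpy as np


def cost(theta, X, y):
    # J(θ) = (1/2m) Σ (θᵀx - y)²
    m = X.shape[0]
    return float(np.sum((X @ theta - y) ** 2) / (2 * m))


def analytic_gradient(theta, X, y):
    # Vectorized form of (1/m) Σ (hθ(xⁱ) - yⁱ) × xⱼⁱ for every j at once
    m = X.shape[0]
    return X.T @ (X @ theta - y) / m


def numerical_gradient(theta, X, y, eps=1e-6):
    # Central differences: perturb one parameter at a time
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad


rng = np.random.default_rng(1)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])  # x₀ = 1 column for θ₀
y = rng.normal(size=50)
theta = rng.normal(size=4)

alpha = 0.1
theta_next = theta - alpha * analytic_gradient(theta, X, y)  # one update step

gradients_match = np.allclose(
    analytic_gradient(theta, X, y), numerical_gradient(theta, X, y), atol=1e-6
)
cost_decreased = cost(theta_next, X, y) < cost(theta, X, y)
print(gradients_match, cost_decreased)
```

If the learning rate is too large the cost can increase instead, which is why the learning curve is worth inspecting after training.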

4. Feature Scaling (Z-score Normalization)

x_scaled = (x - μ) / σ

Where:

  • μ = mean of feature
  • σ = standard deviation of feature
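
The formula can be applied per feature column with NumPy, as in this sketch. The helper names fit_scaler and transform are hypothetical; the project itself uses scikit-learn's StandardScaler. The statistics are computed on the training rows only and then reused for the test rows, so no information from the test set leaks into training.

```python
import numpy as np


def fit_scaler(X_train):
    """Compute per-feature mean and standard deviation from the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses
    std = np.where(std == 0, 1.0, std)  # leave constant features unscaled
    return mean, std


def transform(X, mean, std):
    return (X - mean) / std


X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

mean, std = fit_scaler(X_train)
X_train_scaled = transform(X_train, mean, std)
X_test_scaled = transform(X_test, mean, std)  # reuse the training statistics
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0), X_test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, which puts features with very different ranges on a comparable footing for gradient descent.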

🔍 Code Quality

  • ✅ PEP 8 Compliant: Follows Python style guidelines
  • ✅ Comprehensive Docstrings: Every function documented
  • ✅ Type Hints: Clear parameter and return types
  • ✅ Modular Design: Separation of concerns
  • ✅ Error Handling: Robust exception management
  • ✅ Clean Code: Readable and maintainable

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Ideas for Contributions

  • Add support for polynomial features
  • Implement regularization (Ridge, Lasso)
  • Add more visualization types
  • Improve documentation
  • Add unit tests
  • Support for other datasets

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.


👤 Author

iamhero2709


🙏 Acknowledgments

  • Boston Housing Dataset: Harrison, D. and Rubinfeld, D.L. (1978)
  • OpenML: For providing easy access to datasets
  • NumPy: For numerical computing capabilities
  • scikit-learn: For preprocessing utilities and metrics

📚 References

  1. Andrew Ng - Machine Learning Course (Coursera)
  2. "Pattern Recognition and Machine Learning" - Christopher Bishop
  3. "The Elements of Statistical Learning" - Hastie, Tibshirani, Friedman


⭐ Star this repo if you find it helpful!

Made with ❤️ by iamhero2709
