🚀 Linear Regression from Scratch - Complete End-to-End Pipeline

A complete implementation of Linear Regression from scratch using only NumPy, with a full machine learning pipeline including data ingestion, preprocessing, training, evaluation, and visualization.

📋 Table of Contents

Overview
Features
Project Structure
Installation
Usage
Implementation Details
Pipeline Architecture
Results
Mathematical Foundation
Contributing
License

🎯 Overview

This project demonstrates a complete end-to-end machine learning pipeline for Linear Regression, built from scratch using Python and NumPy. Instead of calling a pre-built estimator, the implementation exposes every step, which gives insight into:

How gradient descent optimization works
The mathematics behind linear regression
Building production-ready ML pipelines
Best practices in code organization and documentation

Dataset: Boston Housing Dataset (506 samples, 13 features)

Target: Median house value (MEDV)
Key Features: RM (average number of rooms per dwelling), LSTAT (percentage of lower-status population), PTRATIO (pupil-teacher ratio), and more

✨ Features

Core Components

✅ Linear Regression from Scratch: No sklearn for model training, pure NumPy implementation
✅ Gradient Descent Optimization: Custom implementation with learning curve tracking
✅ Complete Data Pipeline: Ingestion → Preprocessing → Training → Evaluation
✅ Feature Scaling: StandardScaler for standardization (zero mean, unit variance)
✅ Comprehensive Metrics: MSE, RMSE, MAE, R² Score
✅ Rich Visualizations: Learning curves, residual plots, prediction vs actual
✅ Modular Design: Clean, reusable, well-documented code

Additional Features

📊 Multiple visualization types for model analysis
🔧 Configurable hyperparameters (learning rate, iterations)
📈 Training progress tracking with cost history
🎨 Professional-grade plots with seaborn styling
📝 Extensive documentation and docstrings

📁 Project Structure

LinearRegressionModel/
├── config/
│   └── config.yaml              # Configuration parameters
├── src/
│   ├── __init__.py
│   ├── linear_regression.py     # Core Linear Regression implementation
│   ├── data_ingestion.py        # Data loading and sanity checks
│   ├── data_preprocessing.py    # Train/test split and scaling
│   ├── model_training.py        # Model training orchestration
│   ├── model_evaluation.py      # Performance metrics calculation
│   ├── prediction.py            # Prediction utilities
│   └── visualise.py             # Visualization functions
├── notebooks/
│   └── LinearRegressionModel.ipynb  # Jupyter notebook version
├── main.py                      # Main pipeline execution script
├── requirements.txt             # Python dependencies
└── README.md                    # This file

🔧 Installation

Prerequisites

Python 3.8 or higher
pip package manager

Step 1: Clone the Repository

git clone https://github.com/iamhero2709/LinearRegressionModel.git
cd LinearRegressionModel

Step 2: Create Virtual Environment (Recommended)

# On Linux/Mac
python -m venv venv
source venv/bin/activate

# On Windows
python -m venv venv
venv\Scripts\activate

Step 3: Install Dependencies

pip install -r requirements.txt

🚀 Usage

Run the Complete Pipeline

Execute the entire end-to-end pipeline with a single command:

python main.py

This will:

✅ Load the Boston Housing dataset
✅ Perform data sanity checks
✅ Preprocess and split data (train/test)
✅ Train the Linear Regression model
✅ Evaluate performance metrics
✅ Generate visualizations
✅ Display predictions

Expected Output

================================================================================
          LINEAR REGRESSION FROM SCRATCH - END-TO-END PIPELINE          
================================================================================

STEP 1: DATA INGESTION
--------------------------------------------------------------------------------
================================================================================
DATA SANITY CHECK
================================================================================
...

STEP 2: DATA PREPROCESSING
--------------------------------------------------------------------------------
Training set size: 404 samples
Testing set size: 102 samples
✓ Features scaled using StandardScaler
...

STEP 3: MODEL TRAINING
--------------------------------------------------------------------------------
✓ Model training completed!
  Final cost: 10.8234
...

STEP 4: MODEL EVALUATION
--------------------------------------------------------------------------------
Training Set Performance:
  MSE     : 21.6468
  RMSE    : 4.6525
  MAE     : 3.2891
  R2      : 0.7408

Test Set Performance:
  MSE     : 24.2910
  RMSE    : 4.9286
  MAE     : 3.3411
  R2      : 0.6685
...

📊 Final Results Summary:
  Test R² Score: 0.6685
  Test RMSE: 4.9286
  Test MAE: 3.3411

Using the Jupyter Notebook

Alternatively, explore the implementation interactively:

jupyter notebook notebooks/LinearRegressionModel.ipynb

🧠 Implementation Details

1. Linear Regression Class (`src/linear_regression.py`)

The core implementation uses Gradient Descent to learn optimal parameters.

class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []
    
    def fit(self, X, y):
        """Train the model using gradient descent"""
        # Initialize parameters
        # Perform gradient descent
        # Track cost history
    
    def predict(self, X):
        """Make predictions"""
        return X @ self.weights + self.bias

Key Methods:

fit(X, y): Trains the model using gradient descent
predict(X): Makes predictions on new data
compute_cost(y_true, y_pred): Calculates the cost J(θ) defined in the Mathematical Foundation section (half the mean squared error)
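The skeleton above leaves the body of `fit` as comments. One way it could be filled in is sketched below, assuming batch gradient descent, zero initialisation, and a cost recorded once per iteration. These choices are illustrative assumptions and may differ from what `src/linear_regression.py` actually does.

```python
import numpy as np


class LinearRegression:
    """Batch gradient descent on the half-MSE cost (illustrative sketch)."""

    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []

    def compute_cost(self, y_true, y_pred):
        # J = (1 / 2m) * sum((y_pred - y_true)^2)
        m = len(y_true)
        return float(np.sum((y_pred - y_true) ** 2) / (2 * m))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        m, n = X.shape
        # Assumption: parameters start at zero; the repository may differ.
        self.weights = np.zeros(n)
        self.bias = 0.0
        self.cost_history = []
        for _ in range(self.n_iterations):
            error = X @ self.weights + self.bias - y
            # Gradients of J with respect to the weights and the bias.
            grad_w = X.T @ error / m
            grad_b = error.mean()
            self.weights -= self.learning_rate * grad_w
            self.bias -= self.learning_rate * grad_b
            # Record the cost after the update for the learning curve.
            self.cost_history.append(self.compute_cost(y, self.predict(X)))
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.weights + self.bias


# Synthetic check: recover y = 3*x1 - 2*x2 + 5 from noiseless data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0
model = LinearRegression(learning_rate=0.1, n_iterations=500).fit(X_demo, y_demo)
print(model.weights, model.bias)
```

On this noiseless toy problem the learned weights and bias should approach 3, -2, and 5, and `cost_history` should decrease towards zero.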

2. Data Pipeline

Data Ingestion (`src/data_ingestion.py`)

Fetches Boston Housing dataset from OpenML
Performs comprehensive sanity checks
Validates data integrity
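The README does not list which sanity checks are run. A minimal sketch of the kind of checks that fit this description is shown below. The function name `sanity_check` and the contents of the report are hypothetical, and the sketch works on plain NumPy arrays rather than on whatever object the ingestion step returns.

```python
import numpy as np


def sanity_check(X, y):
    """Return a small report on shape, missing values, and constant columns."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim != 2:
        raise ValueError("X must be 2-D (samples x features)")
    if X.shape[0] != y.shape[0]:
        raise ValueError("X and y must have the same number of rows")
    return {
        "n_samples": X.shape[0],
        "n_features": X.shape[1],
        "missing_per_feature": np.isnan(X).sum(axis=0).tolist(),
        "missing_target": int(np.isnan(y).sum()),
        # A constant column has zero variance, carries no signal,
        # and would cause a division by zero in z-score scaling.
        "constant_features": np.flatnonzero(np.nanstd(X, axis=0) == 0).tolist(),
    }


# Toy data: column 1 is constant, column 2 has one missing value.
X_demo = np.array([[1.0, 7.0, np.nan], [2.0, 7.0, 0.5], [3.0, 7.0, 0.1]])
y_demo = np.array([10.0, 12.0, 14.0])
report = sanity_check(X_demo, y_demo)
print(report)
```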

Data Preprocessing (`src/data_preprocessing.py`)

Splits features and target variable
Creates train/test split (80/20 by default)
Applies StandardScaler normalization
Ensures reproducibility with random seed
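The preprocessing steps above can be sketched in plain NumPy as follows. The helper names `split_train_test` and `standardize` are hypothetical stand-ins for the scikit-learn utilities the project uses, and the seed value 42 is an assumption. The scaling statistics are computed from the training set only, so no information from the test set leaks into training.

```python
import numpy as np


def split_train_test(X, y, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then cut off the test portion."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    # Round the test size up, so 506 samples give 102 test rows.
    n_test = int(np.ceil(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


def standardize(X_train, X_test):
    """Scale both sets with the mean and std of the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid dividing by zero for constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma


# Random stand-in data with the same shape as the Boston Housing features.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(506, 13))
y_demo = rng.normal(size=506)

X_train, X_test, y_train, y_test = split_train_test(X_demo, y_demo)
X_train_s, X_test_s = standardize(X_train, X_test)
print(X_train.shape, X_test.shape)
```

With 506 rows and `test_size=0.2`, this produces the same 404/102 split sizes as the expected output shown earlier. The scaled test features will generally not have exactly zero mean and unit variance, which is expected.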

3. Training & Evaluation

Model Training (`src/model_training.py`)

Orchestrates the training process
Configurable hyperparameters
Tracks and displays training progress

Model Evaluation (`src/model_evaluation.py`)

Calculates multiple metrics (MSE, RMSE, MAE, R²)
Evaluates both training and test sets
Detects overfitting automatically
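For reference, the four metrics can be computed directly with NumPy, as in the sketch below. The function names are hypothetical. The overfitting check is an assumption: the README does not say how overfitting is detected, so this sketch simply compares the train/test R² gap against an arbitrary tolerance of 0.1.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R² for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))                  # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total variation
    return {
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "MAE": float(np.mean(np.abs(residuals))),
        "R2": 1.0 - ss_res / ss_tot,
    }


def r2_gap(train_metrics, test_metrics):
    """Difference between training and test R² (larger means worse generalisation)."""
    return train_metrics["R2"] - test_metrics["R2"]


metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(metrics)

# R² values reported in this README; the 0.1 tolerance is an assumed threshold.
gap = r2_gap({"R2": 0.7408}, {"R2": 0.6685})
is_overfit = gap > 0.1
print(round(gap, 4), is_overfit)
```

For the toy data the residuals are 1, 0, -1, 0, which gives MSE 0.5, MAE 0.5, and R² 0.9.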

4. Visualization (`src/visualise.py`)

Generates professional-quality plots:

Learning Curve: Cost vs iterations
Predictions vs Actual: Scatter plot with perfect prediction line
Residual Analysis: Residual plot and distribution

🏗️ Pipeline Architecture

┌─────────────────┐
│  Data Ingestion │
│   (Boston Data) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Data Inspection │
│  (Sanity Check) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Preprocessing   │
│ • Train/Test    │
│ • Scaling       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Model Training  │
│ • Initialize θ  │
│ • Grad Descent  │
│ • Cost Tracking │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Evaluation    │
│ • MSE, RMSE     │
│ • MAE, R²       │
│ • Overfitting   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Visualization   │
│ • Learning Curve│
│ • Pred vs Act   │
│ • Residuals     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Predictions    │
│ (New Samples)   │
└─────────────────┘

📊 Results

Performance Metrics

| Metric | Training Set | Test Set |
|---|---|---|
| MSE | 21.65 | 24.29 |
| RMSE | 4.65 | 4.93 |
| MAE | 3.29 | 3.34 |
| R² Score | 0.74 | 0.67 |

Key Insights

✅ Good R² Score (0.67): The model explains ~67% of variance in test data
✅ No Severe Overfitting: Training and test metrics are close (R² of 0.74 vs 0.67, RMSE of 4.65 vs 4.93)
✅ Reasonable Error: Test RMSE of ~4.93, a typical error of roughly $4,930 because MEDV is measured in $1000s
⚠️ Improvement Possible: Could benefit from feature engineering or polynomial features

📐 Mathematical Foundation

1. Hypothesis Function

h(x) = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ
     = θᵀx

Where:

θ = parameters (bias θ₀ plus weights θ₁ ... θₙ)
x = input features, with x₀ = 1 prepended so that θᵀx includes the bias term
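The two forms of the hypothesis give the same predictions. The sketch below uses made-up numbers to compare the separate weights-plus-bias form used by `predict` with the compact θᵀx form, where a column of ones is prepended to the feature matrix.

```python
import numpy as np

X = np.array([[2.0, 3.0], [4.0, 1.0]])   # two samples, two features
weights = np.array([0.5, -1.0])          # θ₁, θ₂
bias = 4.0                               # θ₀

# Form 1: weights and bias kept separate, as in predict().
pred_separate = X @ weights + bias

# Form 2: prepend x₀ = 1 to every sample so that θᵀx includes the bias.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
theta = np.concatenate([[bias], weights])
pred_theta = X_aug @ theta

print(pred_separate, pred_theta)
```

Both forms give 2.0 for the first sample (4 + 0.5·2 - 1·3) and 5.0 for the second (4 + 0.5·4 - 1·1).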

2. Cost Function (Mean Squared Error)

J(θ) = (1/2m) Σ(hθ(xⁱ) - yⁱ)²

Where:

m = number of training examples (the extra factor 1/2 cancels when differentiating, so J is half the mean squared error)
hθ(xⁱ) = predicted value
yⁱ = actual value
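A small worked example shows how J relates to the reported MSE. Because of the 1/2 factor, the cost is half the mean squared error. This is consistent with the expected output earlier in this README, where the final cost of 10.8234 is half the training MSE of 21.6468. The numbers below are made up for illustration.

```python
import numpy as np


def compute_cost(y_true, y_pred):
    """J = (1 / 2m) * sum of squared errors."""
    m = len(y_true)
    return float(np.sum((y_pred - y_true) ** 2) / (2 * m))


y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

cost = compute_cost(y_true, y_pred)            # (1 + 0 + 1 + 0) / (2 * 4)
mse = float(np.mean((y_pred - y_true) ** 2))   # (1 + 0 + 1 + 0) / 4
print(cost, mse)
```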

3. Gradient Descent Update Rule

θⱼ := θⱼ - α × (∂J(θ)/∂θⱼ)
θⱼ := θⱼ - α × (1/m) Σ(hθ(xⁱ) - yⁱ) × xⱼⁱ

Where:

α = learning rate
∂J(θ)/∂θⱼ = gradient of cost function
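A common way to confirm that the analytic gradient matches the cost function is to compare it with a finite-difference estimate. The sketch below does this on random data and then applies a single update step. The function names and the data are illustrative assumptions, and θ includes the bias through an x₀ = 1 column.

```python
import numpy as np


def cost(theta, X_aug, y):
    """J(θ) = (1 / 2m) * sum((X_aug @ θ - y)^2)."""
    m = len(y)
    return np.sum((X_aug @ theta - y) ** 2) / (2 * m)


def analytic_gradient(theta, X_aug, y):
    # ∂J/∂θⱼ = (1/m) Σ (h(xⁱ) - yⁱ) xⱼⁱ, computed for all j at once.
    m = len(y)
    return X_aug.T @ (X_aug @ theta - y) / m


def numerical_gradient(theta, X_aug, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X_aug, y) - cost(theta - step, X_aug, y)) / (2 * eps)
    return grad


rng = np.random.default_rng(1)
X_aug = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])  # x₀ = 1 column
y = rng.normal(size=50)
theta = rng.normal(size=4)

grad_a = analytic_gradient(theta, X_aug, y)
grad_n = numerical_gradient(theta, X_aug, y)

# One update step: θ := θ - α ∇J(θ)
alpha = 0.1
theta_next = theta - alpha * grad_a
print(np.max(np.abs(grad_a - grad_n)), cost(theta, X_aug, y), cost(theta_next, X_aug, y))
```

The two gradients should agree to within floating-point error, and the cost after the update should be lower than before it.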

4. Feature Scaling (Z-score Normalization)

x_scaled = (x - μ) / σ

Where:

μ = mean of feature
σ = standard deviation of feature

🔍 Code Quality

✅ PEP 8 Compliant: Follows Python style guidelines
✅ Comprehensive Docstrings: Every function documented
✅ Type Hints: Clear parameter and return types
✅ Modular Design: Separation of concerns
✅ Error Handling: Robust exception management
✅ Clean Code: Readable and maintainable

🤝 Contributing

Contributions are welcome! Here's how you can help:

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Ideas for Contributions

Add support for polynomial features
Implement regularization (Ridge, Lasso)
Add more visualization types
Improve documentation
Add unit tests
Support for other datasets

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

👤 Author

iamhero2709

GitHub: @iamhero2709

🙏 Acknowledgments

Boston Housing Dataset: Harrison, D. and Rubinfeld, D.L. (1978)
OpenML: For providing easy access to datasets
NumPy: For numerical computing capabilities
scikit-learn: For preprocessing utilities and metrics

📚 References

Andrew Ng - Machine Learning Course (Coursera)
"Pattern Recognition and Machine Learning" - Christopher Bishop
"The Elements of Statistical Learning" - Hastie, Tibshirani, Friedman

🔗 Related Projects

⭐ Star this repo if you find it helpful!

Made with ❤️ by iamhero2709
