This project represents a deep dive into the "engine" of Machine Learning. Beyond just using libraries, I focused on implementing algorithms from scratch, mastering optimization techniques, and tackling the core ML challenge: overfitting.
- Custom ML Implementation: Building regression models from scratch using NumPy.
- Optimization: Stochastic Gradient Descent (SGD) and Analytical closed-form solutions.
- Regularization: Mastering L1 (Lasso), L2 (Ridge), and ElasticNet to control model complexity.
- Advanced Preprocessing: Feature normalization (`MinMaxScaler`, `StandardScaler`) and outlier handling.
- Feature Engineering: Parsing complex text data into binary features and exploring polynomial transformations.
I derived the analytical solution for linear regression in vector form and explored how L1 and L2 penalties transform the loss function. This stage was crucial for understanding why Lasso acts as a feature selector: its penalty drives some weights exactly to zero, while Ridge only shrinks them.
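For reference, here is a minimal NumPy sketch of that closed-form (normal-equation) solution with an optional L2 term; the function name and the convention that a bias column is already appended to `X` are my own assumptions, not necessarily the notebook's exact code:

```python
import numpy as np

def closed_form_solution(X, y, alpha=0.0):
    """Analytical solution for linear/ridge regression.

    alpha = 0 gives ordinary least squares: w = (X^T X)^{-1} X^T y.
    alpha > 0 adds the L2 penalty:          w = (X^T X + alpha * I)^{-1} X^T y.
    Assumes a bias column of ones has already been appended to X.
    """
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)
```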
I moved beyond sklearn by implementing my own LinearRegression class:
- Developed SGD with deterministic behavior for reproducibility.
- Coded the R² (Coefficient of Determination) manually to deeply understand variance explanation.
- Implemented Ridge, Lasso, and ElasticNet by extending the loss function.

Comparing my custom code against `sklearn` confirmed the accuracy of my mathematical logic; a simplified sketch of the class is shown below.
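The sketch below assumes a mini-batch SGD update with optional L1/L2 penalties; the class name, hyperparameters, and API are hypothetical, and the notebook's actual implementation may differ:

```python
import numpy as np

class MyLinearRegression:
    """Illustrative SGD-trained linear regression with ElasticNet-style penalties."""

    def __init__(self, lr=0.01, n_epochs=100, l1=0.0, l2=0.0, batch_size=32, random_state=42):
        self.lr = lr
        self.n_epochs = n_epochs
        self.l1 = l1
        self.l2 = l2
        self.batch_size = batch_size
        self.random_state = random_state  # fixed seed -> deterministic, reproducible runs

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        n_samples, n_features = X.shape
        self.w_ = np.zeros(n_features)
        self.b_ = 0.0
        for _ in range(self.n_epochs):
            order = rng.permutation(n_samples)  # shuffle once per epoch
            for start in range(0, n_samples, self.batch_size):
                idx = order[start:start + self.batch_size]
                Xb, yb = X[idx], y[idx]
                error = Xb @ self.w_ + self.b_ - yb
                # MSE gradient plus L1/L2 penalty gradients (ElasticNet when both > 0)
                grad_w = ((2 / len(idx)) * Xb.T @ error
                          + self.l1 * np.sign(self.w_) + 2 * self.l2 * self.w_)
                grad_b = (2 / len(idx)) * error.sum()
                self.w_ -= self.lr * grad_w
                self.b_ -= self.lr * grad_b
        return self

    def predict(self, X):
        return X @ self.w_ + self.b_

    def r2_score(self, X, y):
        # R^2 = 1 - SS_res / SS_tot: the share of target variance explained by the model
        residual = y - self.predict(X)
        ss_res = np.sum(residual ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1 - ss_res / ss_tot
```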
Data processing became more granular. I extracted the top 20 most frequent apartment highlights (e.g., 'Elevator', 'FitnessCenter') from raw text features and transformed them into binary flags, expanding the dataset to 22 high-impact features.
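As a rough pandas sketch of this kind of binary-flag expansion (the column name, separator, and `has_` prefix are assumptions for illustration, not the notebook's exact parsing):

```python
import pandas as pd

# Toy data: a raw text column listing apartment highlights
df = pd.DataFrame({"features": ["Elevator; FitnessCenter", "Pool; Elevator", ""]})

# Count individual highlights and keep the 20 most frequent ones
exploded = (
    df["features"]
    .str.split(";")
    .explode()
    .str.strip()
    .replace("", pd.NA)
    .dropna()
)
top_20 = exploded.value_counts().head(20).index

# One binary flag column per frequent highlight
for highlight in top_20:
    df[f"has_{highlight}"] = df["features"].str.contains(highlight, regex=False).astype(int)
```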
I explored why linear models are sensitive to feature scales. By manually implementing MinMaxScaler and StandardScaler, I observed how normalization accelerates gradient descent convergence and makes model coefficients truly interpretable.
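As a minimal sketch, here are stateless versions of both scalers (unlike sklearn's, they do not store the fit statistics needed to transform a test split consistently):

```python
import numpy as np

def min_max_scale(X):
    """Rescale each feature to [0, 1]: (x - min) / (max - min)."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def standard_scale(X):
    """Center each feature to zero mean and unit variance: (x - mean) / std."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```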
To see theory in action, I intentionally overfitted a model using 10th-degree polynomial features. This experiment vividly demonstrated how weights explode during overfitting and how regularization (tuning the Alpha parameter) effectively "tames" the model, restoring its generalization power.
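A small reproduction of that experiment, sketched with sklearn on toy sine data (the notebook's actual data and alpha values may differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

# Toy 1-D data with noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.2, size=30)

# 10th-degree polynomial expansion invites overfitting
X_poly = PolynomialFeatures(degree=10, include_bias=False).fit_transform(x)

unregularized = LinearRegression().fit(X_poly, y)
regularized = Ridge(alpha=1.0).fit(X_poly, y)

# Exploding weights are the signature of overfitting; the L2 penalty tames them
print("max |w| without regularization:", np.abs(unregularized.coef_).max())
print("max |w| with Ridge(alpha=1.0): ", np.abs(regularized.coef_).max())
```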
The project concluded with a rigorous comparison of all models, including naive baselines. I investigated advanced tricks like Target Log-transformation to handle skewed distributions and learned the critical distinction of why outliers should only be removed from training data.
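In sketch form, assuming `X_train`, `y_train`, `X_test`, and a regressor `model` already exist (the quantile cutoff is purely illustrative):

```python
import numpy as np

# Outlier filtering is applied to the training split only, so the test
# distribution stays untouched and the evaluation remains honest.
mask = y_train < np.quantile(y_train, 0.99)   # hypothetical 99th-percentile cutoff
X_train_clean, y_train_clean = X_train[mask], y_train[mask]

# Skewed target: train on log1p(y), invert predictions with expm1
model.fit(X_train_clean, np.log1p(y_train_clean))
y_pred = np.expm1(model.predict(X_test))      # back to the original target scale
```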
This module was a turning point: I transitioned from "tuning parameters" to understanding the geometry and logic of the learning process. I now possess a clear intuition of what happens to data the moment it enters a model.
- Clone the repository:

  ```bash
  git clone https://github.com/knight99rus/ML2_Supervised_learning.git
  cd ML2_Supervised_learning
  ```

- Create and activate a virtual environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # For Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install jupyter pandas numpy scikit-learn
  ```

- Download data:
  - Read the task on the Kaggle competition page.
  - Download the `test.json` file.

- Launch Jupyter Notebook:

  ```bash
  jupyter notebook
  ```

  Open and execute the cells in the `project02.ipynb` file.