Course Project: CS-433 Machine Learning - Project 1 (Fall 2025)
Institution: École Polytechnique Fédérale de Lausanne (EPFL)
Grade Obtained: 8.5/10 (10% of final course grade)
Authors: Lluc Santamaria Riba, Francisco Badia Laguillo, Aleksandra Zawadzka
This repository contains our complete submission for the first project of the CS-433 Machine Learning course at EPFL. The project focused on developing a machine learning model to predict coronary heart disease risk using real-world health surveillance data from the U.S. Centers for Disease Control and Prevention (CDC).
Project Resources:
- 📋 Official Project Assignment
- 📄 Report.pdf - Our detailed technical report submitted for grading
- 📄 project1_description.pdf - Official project description and requirements
- 📖 codebook.pdf - Complete dataset documentation and feature descriptions
- 💾 dataset/ - Original training and test data files
This project was submitted on 31 October 2025 and represented 10% of our final grade in the course. We are now making it publicly available to showcase our work and contribute to the machine learning community.
This project received two independent reviews from course staff. Below are the complete reviews.

First Review:
Baselines and Ablation Studies (48/50)
- Justification (10/10): Strong motivation. Excellent discussion of data leakage issues (feature selection, standardization) and how they were corrected. Clear reasoning for class imbalance solutions.
- Comparison (20/20): Good.
- Cross-validation (10/10): Properly implemented 5-fold CV throughout. Well-documented.
- Hyperparameter optimization (8/10): Figure 2 shows learning rate ablation. Fold-specific threshold optimization described. Could be more comprehensive.
Additional Contributions (20/20)
- Contribution (5/5): Loss balancing for class imbalance; Fold-specific threshold optimization; Feature expansion analysis; Data leakage correction
- Motivation (5/5): Clear justification for each approach.
- Assessment (10/10): Good ablation studies.
Reproducibility (8/10)
- Footnote admits run.py produces different results than best submission
- Best hyperparameters only partially specified
Scientific Evidence (10/10)
Writing Quality (10/10)
Scores:
- Report Score: 9 (90%)
- Code Score: A (Outstanding, Extra bonus)
Second Review:

Code Feedback:
- The code passes all tests.
- Documentation is good.
- Code readability is good.
Report Feedback: This is a great and well-structured report that demonstrates a solid understanding of preprocessing, model selection, and evaluation. However, only logistic regression is tested and the performance is relatively low.
Scores:
- Report Score: 8 (80%)
- Code Score: B (Full score)
This project develops a machine learning model to predict the risk of coronary heart disease (MICHD) using data from the U.S. Centers for Disease Control and Prevention's Behavioral Risk Factor Surveillance System (BRFSS). The model enables risk assessment based on individual behaviors and health status.
Final Model Performance:
- Accuracy: 0.868
- F1-Score: 0.424
- Model: Logistic Regression with threshold adjustment
Contents:
- About This Project
- Overview
- Dataset
- Project Structure
- Installation
- Usage
- Methodology
- Results
- Key Findings
- Requirements
- Project Files
The BRFSS dataset contains:
- 328,135 samples across 321 features
- Target variable: MICHD diagnosis (y=1 for diagnosed, y=-1 for not diagnosed)
- Severe class imbalance: Only 10% positive samples
- 45% missing values across all entries
- 182 features with >10% NaN ratio
- 95 categorical features (≤10 unique values)
- 44 continuous features
- 64 calculated variables (prefixed with underscore)
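The categorical/continuous/calculated split above can be reproduced with a simple unique-value heuristic. A minimal NumPy sketch (the `classify_features` helper is hypothetical, not the repository's own code):

```python
import numpy as np

def classify_features(x, names, max_categorical=10):
    """Heuristically sort columns into categorical (<= max_categorical unique
    values), continuous, and BRFSS calculated groups (hypothetical helper)."""
    categorical, continuous, calculated = [], [], []
    for j, name in enumerate(names):
        if name.startswith("_"):              # calculated variables use a "_" prefix
            calculated.append(name)
            continue
        col = x[:, j]
        uniques = np.unique(col[~np.isnan(col)])
        if len(uniques) <= max_categorical:   # few distinct values -> categorical
            categorical.append(name)
        else:
            continuous.append(name)
    return categorical, continuous, calculated

x = np.array([[1.0, 2.3], [2.0, 5.1], [1.0, 7.8], [np.nan, 1.2],
              [2.0, 3.3], [1.0, 9.9], [2.0, 0.4], [1.0, 6.6]])
cat, cont, calc = classify_features(x, ["SMOKER", "_BMI"])
```

Here `SMOKER` lands in the categorical bucket (two distinct values) and `_BMI` in the calculated one, purely by the naming/cardinality rules described above.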
Data files (place in dataset/ directory):
- x_train.csv - Training features
- y_train.csv - Training labels
- x_test.csv - Test features
- sample_submission.csv - Submission format template
project-1-ai_force/
├── run.py # Main script - loads data and generates predictions
├── implementations.py # Core ML algorithms (required by project spec)
├── models.py # Model classes (LinearRegression, LogisticRegression)
├── methods.py # Training methods (Cross Validation, Threshold Search)
├── dataset.py # Dataset class with data-splitting utilities
├── data_cleaning.py # Data preprocessing pipeline
├── helpers.py # Utility functions
├── dataset_metadata.json # Feature metadata (types, mappings, vocabularies)
├── dataset/ # Data directory (CSV files)
├── submissions/ # Generated predictions
└── README.md # This file
- Python 3.13+
- pip
- Clone the repository:

```bash
git clone https://github.com/CS-433/project-1-ai_force/
cd project-1-ai_force
```

- Create and activate a virtual environment:

```bash
python -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate
```

- Install dependencies:

```bash
pip install numpy
```

- Place the dataset files in the dataset/ directory.
Run the main script to load the dataset, train the best model, and generate predictions:

```bash
python run.py
```

This will:
- Load and preprocess the data using dataset_metadata.json
- Train the optimized logistic regression model
- Generate predictions on the test set
- Save results to submissions/submission_<timestamp>.csv

Note: Training takes approximately 40 minutes, as it loads the large dataset, trains the model, and tunes the decision threshold.
The data cleaning pipeline can be used independently:
```python
from data_cleaning import Data

# Load and clean data
data = Data()
data.load_from_csv('dataset/', 'dataset_metadata.json')

# Optional: Apply polynomial feature expansion
data.feature_expansion(degree=2)

# Add intercept term
data.add_intercept()

# Save preprocessed data for faster loading
data.save_to_numpy_file('data.npz')
```

A model can then be trained with 5-fold cross-validation:

```python
from data_cleaning import Data
from dataset import Dataset
from models import LogisticRegressionGD

data = Data()
data.load_from_numpy_file("data.npz")

# Create dataset with cross-validation
dataset = Dataset(data.x_train, data.y_train,
                  num_cont_features=data.num_cont_features,
                  seed=42)

# Train logistic regression
model = LogisticRegressionGD(max_iters=300, gamma=0.1)

# 5-fold cross-validation
for x_tr, y_tr, x_val, y_val, mean, std in dataset.k_fold_generator(k=5):
    model.fit(x_tr, y_tr)
    predictions = model.predict(x_val)
    # Evaluate...
```

To reproduce the plots and metrics reported in the paper, run the experiment scripts in the experiments/ directory:
```bash
PYTHONPATH=$PYTHONPATH:. python experiments/exp_feature_expansion/exp.py
```

This generates a CSV with training/validation losses, F1-score, and accuracy versus polynomial degree for linear regression.
```bash
PYTHONPATH=$PYTHONPATH:. python experiments/exp_gamma_LogR/exp.py
```

This generates a CSV with accuracy and F1-score at different learning rates (gamma) for logistic regression with optimized thresholds.
To reproduce specific entries in the performance comparison table, run the corresponding scripts:
```bash
# Linear Regression - No Treatment
PYTHONPATH=$PYTHONPATH:. python experiments/individual_values/LR.py

# Linear Regression - Loss Balancing
PYTHONPATH=$PYTHONPATH:. python experiments/individual_values/LR_balanced.py

# Linear Regression - Threshold Adjustment
PYTHONPATH=$PYTHONPATH:. python experiments/individual_values/LR_th.py

# Linear Regression - Both Treatments
PYTHONPATH=$PYTHONPATH:. python experiments/individual_values/LR_balanced_th.py

# Logistic Regression - No Treatment
PYTHONPATH=$PYTHONPATH:. python experiments/individual_values/LogR.py

# Logistic Regression - Loss Balancing
PYTHONPATH=$PYTHONPATH:. python experiments/individual_values/LogR_balanced.py

# Logistic Regression - Threshold Adjustment
PYTHONPATH=$PYTHONPATH:. python experiments/individual_values/LogR_th.py

# Logistic Regression - Both Treatments
PYTHONPATH=$PYTHONPATH:. python experiments/individual_values/LogR_balanced_th.py
```

Note: Each experiment script performs 5-fold cross-validation and may take several minutes to complete. Results are displayed in the console and/or saved as CSV files in the respective experiment directories.
Feature Filtering:
- Removed 182 features with >10% NaN ratio
- Removed zero-variance (homogeneous) features
- Retained 139 reliable features
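The two filtering rules above (NaN ratio and zero variance) can be sketched in a few lines of NumPy. This is a hypothetical `filter_features` helper illustrating the idea, not the pipeline in data_cleaning.py:

```python
import numpy as np

def filter_features(x, nan_threshold=0.10):
    """Drop columns with more than nan_threshold NaNs or with zero variance."""
    nan_ratio = np.isnan(x).mean(axis=0)
    keep = nan_ratio <= nan_threshold
    # zero-variance check on the non-NaN entries of each remaining column
    for j in np.where(keep)[0]:
        vals = x[:, j][~np.isnan(x[:, j])]
        if vals.size == 0 or np.all(vals == vals[0]):
            keep[j] = False
    return x[:, keep], keep

x = np.array([[1.0, np.nan, 5.0, 2.0],
              [2.0, np.nan, 5.0, 2.0],
              [3.0, 1.0,    5.0, 2.0],
              [4.0, 2.0,    5.0, 3.0]])
x_filtered, keep = filter_features(x)  # drops the NaN-heavy and constant columns
```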
Feature Processing:
- Categorical features: One-hot encoded, NaNs treated as separate category
- Continuous features: NaN values mapped to column mean
- Calculated variables: Retained for capturing feature interactions
- 71 features: Manually processed with domain knowledge
- Remaining features: Algorithmically processed
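The two per-type treatments (NaN as its own one-hot category, mean imputation for continuous columns) can be sketched as follows; both helper names are illustrative, not the project's API:

```python
import numpy as np

def one_hot_with_nan(col):
    """One-hot encode a categorical column, treating NaN as its own category."""
    vals = np.where(np.isnan(col), np.inf, col)   # map NaN to a sentinel value
    cats = np.unique(vals)
    return (vals[:, None] == cats[None, :]).astype(float)

def impute_mean(col):
    """Replace NaNs in a continuous column with the column mean."""
    return np.where(np.isnan(col), np.nanmean(col), col)

cat = np.array([1.0, 2.0, np.nan, 1.0])
onehot = one_hot_with_nan(cat)     # three columns: category 1, category 2, NaN
cont = np.array([1.0, np.nan, 3.0])
imputed = impute_mean(cont)        # the NaN becomes the mean of 1.0 and 3.0
```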
Data Structure (after preprocessing):
[Intercept | Continuous Features | One-Hot Encoded Categorical Features]
Standardization:
- Applied z-score standardization (μ=0, σ=1)
- Computed exclusively on training partition to prevent data leakage
- Applied only to continuous features (preserves one-hot encoding)
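The leakage-free standardization can be sketched as below, assuming the [Intercept | Continuous | One-Hot] layout described above so that the continuous block sits right after the intercept column (hypothetical helper, not the repository's code):

```python
import numpy as np

def standardize_continuous(x_train, x_val, num_cont):
    """Z-score the continuous block using statistics computed on the training
    split only; intercept and one-hot columns are left untouched."""
    cont = slice(1, 1 + num_cont)             # skip the leading intercept column
    mean = x_train[:, cont].mean(axis=0)      # training-split statistics only
    std = x_train[:, cont].std(axis=0)
    std[std == 0] = 1.0                       # guard against constant columns
    x_train, x_val = x_train.copy(), x_val.copy()
    x_train[:, cont] = (x_train[:, cont] - mean) / std
    x_val[:, cont] = (x_val[:, cont] - mean) / std
    return x_train, x_val

x_tr = np.array([[1.0, 1.0, 0.0], [1.0, 3.0, 1.0]])
x_va = np.array([[1.0, 2.0, 1.0]])
x_tr_s, x_va_s = standardize_continuous(x_tr, x_va, num_cont=1)
```

The validation fold is transformed with the training fold's mean and standard deviation, which is exactly the leakage correction the reviews highlight.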
Models Evaluated:
- Linear Regression (Gradient Descent)
- Logistic Regression
Training Configuration:
- 5-fold Cross-Validation
- Maximum 200 iterations (300 for final model)
- Reproducible random seed
Three approaches were tested:
- Loss Balancing: Multiply the loss by $\frac{N_{neg}}{N_{pos}}$ for positive samples
- Threshold Adjustment: Optimize the decision threshold to maximize F1-score
- Both: Combined approach
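Neither helper below comes from the repository; this is a minimal NumPy sketch of the two ideas, assuming labels mapped to {0, 1} and a simple grid for the threshold search:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def balanced_grad(x, y, w):
    """Gradient of cross-entropy with positive samples up-weighted by
    N_neg / N_pos (loss balancing; y in {0, 1})."""
    ratio = (y == 0).sum() / (y == 1).sum()
    weights = np.where(y == 1, ratio, 1.0)
    p = sigmoid(x @ w)
    return x.T @ (weights * (p - y)) / len(y)

def best_threshold(scores, y, grid=None):
    """Grid-search the decision threshold that maximizes F1 (y in {0, 1})."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    def f1(t):
        pred = (scores >= t).astype(int)
        tp = ((pred == 1) & (y == 1)).sum()
        fp = ((pred == 1) & (y == 0)).sum()
        fn = ((pred == 0) & (y == 1)).sum()
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return max(grid, key=f1)

scores = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([1, 1, 0, 0])
t = best_threshold(scores, labels)   # any threshold separating 0.2 from 0.8
```

In the project the threshold search is run per fold (fold-specific threshold optimization), which this sketch would support by calling `best_threshold` on each fold's validation scores.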
Tested polynomial feature expansion up to degree 7. Results showed no improvement, indicating the relationship between features and target is primarily linear.
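The expansion itself amounts to appending element-wise powers of each feature column; a minimal sketch (hypothetical `poly_expand` helper, standing in for Data.feature_expansion):

```python
import numpy as np

def poly_expand(x, degree):
    """Append element-wise powers x^2 ... x^degree of every column."""
    return np.hstack([x ** d for d in range(1, degree + 1)])

x = np.array([[1.0, 2.0], [3.0, 4.0]])
expanded = poly_expand(x, degree=2)   # columns: x1, x2, x1^2, x2^2
```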
| Model | Treatment | F1-Score | Accuracy |
|---|---|---|---|
| Linear Regression | No Treatment | 0.000 | 0.910 |
| Linear Regression | Loss Balancing | 0.347 | 0.737 |
| Linear Regression | Threshold Adjustment | 0.402 | 0.858 |
| Linear Regression | Both | 0.410 | 0.880 |
| Logistic Regression | No Treatment | 0.420 | 0.892 |
| Logistic Regression | Loss Balancing | 0.395 | 0.894 |
| Logistic Regression | Threshold Adjustment | 0.424 | 0.868 |
| Logistic Regression | Both | 0.412 | 0.859 |
Best Model: Logistic Regression with Threshold Adjustment
Logistic regression proved highly robust across a wide range of learning rates (γ), maintaining good F1-scores. Performance deteriorates only at extremes:
- Very low gamma: Insufficient convergence within iteration limit
- Very high gamma: Optimization begins to diverge
- Class Imbalance is Critical: Initial models predicted all negative due to 90:10 class imbalance
- Threshold Adjustment Works: Model-agnostic technique that improved both linear and logistic regression
- Logistic Regression is Robust: Inherently resistant to class imbalance, performs better overall
- Loss Balancing for Linear Models: Essential for linear regression but doesn't help logistic regression
- Linear Relationships Dominate: Feature expansion provided no benefit, suggesting primarily linear patterns
- Data Leakage Prevention: Critical to compute standardization parameters exclusively from training data
Priority: Reduce False Negative rate (currently 45% of MICHD cases missed)
For real-world patient diagnosis, minimizing missed diagnoses is critical as it represents the greatest patient risk.
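The quoted false-negative rate can be computed directly from predictions in the project's {1, -1} label convention. A minimal sketch (hypothetical helper name):

```python
import numpy as np

def false_negative_rate(y_true, y_pred):
    """Fraction of actual positive (MICHD) cases the model misses."""
    fn = ((y_pred == -1) & (y_true == 1)).sum()
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    return fn / (fn + tp)

y_true = np.array([1, 1, 1, 1, -1, -1])
y_pred = np.array([1, 1, -1, -1, -1, 1])
fnr = false_negative_rate(y_true, y_pred)   # 2 of 4 positives missed -> 0.5
```

Lowering this rate (e.g. via cost-sensitive learning from the list below) would trade some precision for fewer missed diagnoses.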
Potential Improvements:
- Ensemble methods
- Advanced feature engineering
- Deep learning approaches
- Cost-sensitive learning optimized for recall
- External data augmentation
- Python 3.13+
- NumPy
This repository includes all materials from our original submission:
- Report.pdf - Our complete technical report (submitted for grading)
- ML_project_1.pdf - Official project requirements and specifications
- codebook.pdf - Comprehensive dataset documentation with feature descriptions
- README.md - This file (project overview and usage guide)
The dataset/ directory contains:
- x_train.csv - Training features (328,135 samples × 321 features)
- y_train.csv - Training labels
- x_test.csv - Test features
- sample_submission.csv - Submission format template
Note: Dataset files can also be downloaded from the official project page.
- run.py - Main execution script
- implementations.py - Core ML algorithms (per project specification)
- models.py - Model implementations
- methods.py - Training utilities
- dataset.py - Data handling
- data_cleaning.py - Preprocessing pipeline
- helpers.py - Utility functions
- dataset_metadata.json - Feature metadata
This project is part of the CS-433 Machine Learning course at EPFL.
- U.S. CDC Behavioral Risk Factor Surveillance System (BRFSS)
- British Heart Foundation Statistics (2025)