
friendly-waffle

stroke-risk-prediction

Stroke Risk Prediction using Machine Learning

This project aims to develop a machine learning-based stroke risk prediction model using a publicly available Kaggle dataset. Our analysis evaluates multiple classification algorithms under both imbalanced and resampled conditions to identify key predictors of stroke, such as age, BMI, hypertension, and heart disease.

📁 Project Structure

├── final_report.Rmd        # Main reproducible report in R Markdown
├── final_report.pdf        # Rendered PDF report
├── data/
│   └── stroke_data.csv     # Dataset from Kaggle (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset)
├── figures/
│   ├── eda_plot1.png       # EDA visualizations
│   └── model_results.png   # Model performance comparison
├── README.md               # This file


Dataset

  • Source: Kaggle, Stroke Prediction Dataset (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset)
  • Features: age, gender, hypertension, heart disease, marital status, work type, residence type, average glucose level, BMI, smoking status
  • Target: stroke (binary classification)
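
As a starting point for exploring the data, the following is a minimal sketch of loading the CSV and checking the class balance. It assumes the target column is named `stroke` and coded 0/1, and that missing BMI values are written as the string `N/A`; the helper name `load_stroke_data` is hypothetical and does not come from the report.

```r
# Hypothetical helper (not from the report): read the CSV and prepare the target.
# Assumptions: a column named `stroke` coded 0/1, and missing values written as "N/A".
load_stroke_data <- function(path = "data/stroke_data.csv") {
  df <- read.csv(path, na.strings = c("N/A", "NA", ""), stringsAsFactors = TRUE)
  stopifnot("stroke" %in% names(df))
  # Turn the 0/1 target into a labelled factor for classification
  df$stroke <- factor(df$stroke, levels = c(0, 1), labels = c("no", "yes"))
  df
}

# Only runs when the file is present at the path shown in the project structure
if (file.exists("data/stroke_data.csv")) {
  stroke <- load_stroke_data()
  print(round(prop.table(table(stroke$stroke)), 3))  # share of each class
  print(colSums(is.na(stroke)))                      # missing values per column
}
```

Printing the class proportions first makes the imbalance in the target visible before any modeling is attempted.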

🚀 How to Reproduce

To reproduce the results locally:

  1. Clone this repository
  2. Make sure R (>= 4.0) is installed
  3. Install required packages:
install.packages(c("tidyverse", "caret", "pROC", "xgboost", "rpart", "rpart.plot", "doParallel", "GGally", "epitools", "patchwork"))

## Methods

We used the following models:
- Logistic Regression
- Decision Tree
- Random Forest
- XGBoost (Gradient Boosting)

To address class imbalance (stroke cases make up only ~5% of the records), we applied **SMOTE** (Synthetic Minority Oversampling Technique) during model training and evaluation.
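
To illustrate the idea behind SMOTE, here is a minimal base-R sketch that creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. It is not the implementation used in the report; the function name `smote_sketch` and the parameters `n_new` and `k` are hypothetical, and the example values are made up for illustration.

```r
# Illustrative SMOTE-style oversampling for numeric features only.
# `smote_sketch`, `n_new`, and `k` are hypothetical names, not taken from the report.
smote_sketch <- function(X_min, n_new, k = 5) {
  X_min <- as.matrix(X_min)
  n <- nrow(X_min)
  stopifnot(n >= 2, n_new >= 1)
  k <- min(k, n - 1)                    # cannot use more neighbours than other points
  d <- as.matrix(dist(X_min))           # pairwise Euclidean distances
  diag(d) <- Inf                        # a point is not its own neighbour
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X_min),
                  dimnames = list(NULL, colnames(X_min)))
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1)            # pick a minority sample at random
    nn <- order(d[base, ])[seq_len(k)]  # its k nearest minority neighbours
    nb <- nn[sample.int(k, 1)]          # choose one of those neighbours
    gap <- runif(1)                     # position along the connecting segment
    synth[i, ] <- X_min[base, ] + gap * (X_min[nb, ] - X_min[base, ])
  }
  synth
}

# Made-up minority-class rows, for illustration only
set.seed(1)
minority <- cbind(age = c(60, 65, 70, 80), bmi = c(28, 31, 27, 30))
new_rows <- smote_sketch(minority, n_new = 6, k = 2)
```

Because each synthetic row lies on a line segment between two real minority rows, this interpolation only makes sense for numeric features; categorical features such as work type or smoking status would need separate handling.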

## Evaluation Metrics

Models were evaluated using:
- AUC (Area Under the Curve)
- Accuracy
- F1-score

Final model selection was based on performance aggregated across these metrics.
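
The three metrics can be computed from true labels and predicted scores as in the sketch below. It uses base R only so that it runs without extra packages; the function name `classification_metrics` and the 0.5 threshold are assumptions for illustration, and the example labels and scores are made up.

```r
# Accuracy and F1 from thresholded predictions; AUC from the raw scores.
# `classification_metrics` and `threshold` are hypothetical names, not from the report.
classification_metrics <- function(truth, score, threshold = 0.5) {
  stopifnot(length(truth) == length(score), all(truth %in% c(0, 1)))
  pred <- as.integer(score >= threshold)
  tp <- sum(pred == 1 & truth == 1)     # true positives
  fp <- sum(pred == 1 & truth == 0)     # false positives
  fn <- sum(pred == 0 & truth == 1)     # false negatives
  accuracy <- mean(pred == truth)
  f1 <- if (2 * tp + fp + fn > 0) 2 * tp / (2 * tp + fp + fn) else NA_real_
  # AUC via the rank-sum (Mann-Whitney) identity; tied scores get average ranks
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  auc <- (sum(rank(score)[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  c(accuracy = accuracy, f1 = f1, auc = auc)
}

# Made-up labels and scores, for illustration only
truth <- c(0, 0, 0, 0, 1, 1)
score <- c(0.1, 0.2, 0.6, 0.3, 0.7, 0.4)
m <- classification_metrics(truth, score)
print(round(m, 3))  # accuracy 0.667, f1 0.500, auc 0.875
```

With a rare positive class, accuracy alone can look high even when few stroke cases are detected, which is one reason to read it alongside F1 and AUC.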

## Key Findings

- **Random Forest with SMOTE** outperformed the other models.
- **Age** was the most influential predictor.
- **BMI**, **average glucose level**, **hypertension**, and **heart disease** were also important.
- **Marital status** appeared predictive, but this association is likely confounded by age.

## Authors

- Yanzhi Hua: EDA, logistic regression, XGBoost implementation
- Chen Yang: SMOTE balancing, random forest modeling, results synthesis


## Required R Packages

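```r
# DMwR provides SMOTE(); it has been archived on CRAN, so install.packages() may not find it on current R versions.
install.packages(c("tidyverse", "caret", "xgboost", "DMwR", "pROC", "randomForest", "knitr"))
```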