friendly-waffle

stroke-risk-prediction

# Stroke Risk Prediction using Machine Learning

This project aims to develop a machine learning-based stroke risk prediction model using a publicly available Kaggle dataset. Our analysis evaluates multiple classification algorithms under both imbalanced and resampled conditions to identify key predictors of stroke, such as age, BMI, hypertension, and heart disease.

## 📁 Project Structure

    ├── final_report.Rmd        # Main reproducible report in R Markdown
    ├── final_report.pdf        # Rendered PDF report
    ├── data/
    │   └── stroke_data.csv     # Dataset from Kaggle (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset)
    ├── figures/
    │   ├── eda_plot1.png       # EDA visualizations
    │   └── model_results.png   # Model performance comparison
    └── README.md               # This file

## Dataset

- Source: Kaggle, Stroke Prediction Dataset (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset)
- Features: age, gender, hypertension, heart disease, marital status, work type, residence type, average glucose level, BMI, smoking status
- Target: stroke (binary classification)
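As a rough illustration of the dataset description above, the following sketch loads a CSV and reports the class balance of the target. The helper names `load_stroke_data` and `class_balance` are hypothetical, and the assumptions that the target column is called `stroke` and that missing BMI values appear as the literal string `N/A` should be checked against the actual file. A tiny synthetic file stands in for the real data so the snippet runs on its own.

```r
# Hypothetical helper: read the CSV and turn the target into a factor.
# Assumption: missing values are written as "N/A" (or left blank).
load_stroke_data <- function(path) {
  df <- read.csv(path, na.strings = c("N/A", "NA", ""))
  df$stroke <- factor(df$stroke, levels = c(0, 1), labels = c("no", "yes"))
  df
}

# Hypothetical helper: proportion of records in each class.
class_balance <- function(df) {
  prop.table(table(df$stroke))
}

# Synthetic stand-in for data/stroke_data.csv (illustrative values only).
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    age    = c(67, 34, 80, 45),
    bmi    = c("36.6", "N/A", "28.1", "22.0"),
    stroke = c(1, 0, 0, 0)
  ),
  tmp, row.names = FALSE, quote = FALSE
)

demo <- load_stroke_data(tmp)
balance <- class_balance(demo)
print(balance)
```

With the real file, the call would be `load_stroke_data("data/stroke_data.csv")`, assuming the path shown in the project structure.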

## 🚀 How to Reproduce

To reproduce the results locally:

1. Clone this repository
2. Make sure R (>= 4.0) is installed
3. Install required packages:

install.packages(c("tidyverse", "caret", "pROC", "xgboost", "rpart", "rpart.plot", "doParallel", "GGally", "epitools", "patchwork"))
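If some of these packages are already present, a small check can avoid reinstalling them. The sketch below is one possible approach; `missing_packages` is a hypothetical helper, and the commented-out rendering step assumes that the `rmarkdown` package is installed and that `final_report.Rmd` sits in the repository root, neither of which is stated in this README.

```r
required <- c("tidyverse", "caret", "pROC", "xgboost", "rpart",
              "rpart.plot", "doParallel", "GGally", "epitools", "patchwork")

# Hypothetical helper: return the packages that cannot be loaded.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}

# Assumed final step (not described in this README): render the report.
# rmarkdown::render("final_report.Rmd", output_format = "pdf_document")
```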

## Methods

We used the following models:
- Logistic Regression
- Decision Tree
- Random Forest
- XGBoost (Gradient Boosting)
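
To make the first model in this list concrete, here is a minimal logistic regression sketch using base R's `glm`. It is not the project's actual code: the data are simulated, the coefficients in the simulation are arbitrary, and only two of the predictors are included to keep the example short.

```r
set.seed(42)

# Simulated data (illustrative only, not the Kaggle dataset).
n <- 2000
sim <- data.frame(
  age          = runif(n, 20, 90),
  hypertension = rbinom(n, 1, 0.1)
)
# Arbitrary "true" relationship, chosen so that events are rare.
sim$stroke <- rbinom(n, 1, plogis(-7 + 0.08 * sim$age + 0.5 * sim$hypertension))

# Logistic regression: binomial family with the default logit link.
fit <- glm(stroke ~ age + hypertension, data = sim, family = binomial())

# Predicted stroke probability for two hypothetical patients.
new_patients <- data.frame(age = c(40, 80), hypertension = c(0, 1))
risk <- unname(predict(fit, newdata = new_patients, type = "response"))
print(round(risk, 3))
```

The tree-based models would follow the same fit-then-predict pattern through their own packages.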

To address class imbalance (stroke cases make up only ~5% of records), we applied **SMOTE** (Synthetic Minority Oversampling Technique) during model training and evaluation.
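
The core idea of SMOTE can be sketched in base R as follows. This is a simplified illustration for numeric features only, not the implementation used in the report: `smote_numeric` is a hypothetical function, and the example rows are made up. Each synthetic point is placed at a random position on the line segment between a minority-class record and one of its `k` nearest minority-class neighbours. Resampling of this kind is commonly restricted to the training split so that evaluation reflects the original class balance.

```r
# Simplified SMOTE for numeric features (hypothetical helper, base R only).
# X_min: matrix or data frame containing only minority-class rows.
smote_numeric <- function(X_min, n_new, k = 5) {
  X_min <- as.matrix(X_min)
  n <- nrow(X_min)
  stopifnot(n >= 2)
  k <- min(k, n - 1)              # cannot have more neighbours than other rows

  d <- as.matrix(dist(X_min))     # pairwise Euclidean distances
  diag(d) <- Inf                  # a point is not its own neighbour

  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X_min),
                  dimnames = list(NULL, colnames(X_min)))
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1)                   # random minority record
    neighbours <- order(d[base, ])[seq_len(k)] # its k nearest neighbours
    nb <- neighbours[sample.int(k, 1)]         # pick one neighbour
    gap <- runif(1)                            # position along the segment
    synth[i, ] <- X_min[base, ] + gap * (X_min[nb, ] - X_min[base, ])
  }
  synth
}

# Made-up minority-class rows, for illustration only.
set.seed(1)
minority <- cbind(age = c(70, 75, 80, 68), bmi = c(30, 28, 33, 29))
synthetic <- smote_numeric(minority, n_new = 6, k = 2)
print(synthetic)
```

Categorical features such as work type or smoking status need a different treatment (for example a nominal-aware variant), which this sketch does not cover.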

## Evaluation Metrics

Models were evaluated using:
- AUC (Area Under the Curve)
- Accuracy
- F1-score

Final model selection was based on performance aggregated across these metrics.
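
For reference, the three metrics can be computed from true labels and predicted probabilities as in the sketch below. `classification_metrics` is a hypothetical helper written in base R, the 0.5 threshold is an assumption, and the labels and probabilities are made up. AUC is computed with the rank-sum (Mann-Whitney) formulation, which equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```r
# Hypothetical helper: AUC, accuracy and F1 from 0/1 labels and probabilities.
classification_metrics <- function(truth, prob, threshold = 0.5) {
  stopifnot(length(truth) == length(prob), all(truth %in% c(0, 1)))

  pred <- as.integer(prob >= threshold)
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)

  accuracy <- mean(pred == truth)
  # F1 = 2TP / (2TP + FP + FN); undefined when there are no positives at all.
  f1 <- if (2 * tp + fp + fn > 0) 2 * tp / (2 * tp + fp + fn) else NA_real_

  # AUC via ranks; tied scores receive average ranks.
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  r <- rank(prob)
  auc <- (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

  c(auc = auc, accuracy = accuracy, f1 = f1)
}

# Made-up labels and scores, for illustration only.
truth <- c(0, 0, 0, 0, 1, 1)
prob  <- c(0.1, 0.2, 0.6, 0.3, 0.7, 0.4)
m <- classification_metrics(truth, prob)
print(m)
```

On heavily imbalanced data, accuracy alone can look high for a model that never predicts a stroke, which is one reason to read it alongside AUC and F1.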

## Key Findings

- **Random Forest with SMOTE** outperformed other models.
- **Age** was the most influential predictor.
- **BMI**, **average glucose**, **hypertension**, and **heart disease** were also important.
- **Marital status** appeared predictive but likely acted as a confounder due to age.

## Authors

- Yanzhi Hua – EDA, logistic regression, XGBoost implementation
- Chen Yang – SMOTE balancing, random forest modeling, results synthesis


## Required R Packages

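```r
# Note: DMwR (which provides SMOTE) has been archived on CRAN,
# so it may need to be installed from the CRAN archive instead.
install.packages(c("tidyverse", "caret", "xgboost", "DMwR", "pROC", "randomForest", "knitr"))
```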
