This repository contains Python scripts for predicting diabetes using the 2015 Behavioral Risk Factor Surveillance System (BRFSS) dataset. The scripts preprocess the data with Principal Component Analysis (PCA) and implement various classification techniques, including Logistic Regression, Naive Bayes, Support Vector Machines (SVM), and ensemble voting (hard and soft). The goal is to classify individuals as diabetic or non-diabetic based on health indicators.
- Data Preprocessing: Standardizes data and applies PCA for dimensionality reduction.
- Classification Models:
  - Logistic Regression with gradient descent.
  - Gaussian Naive Bayes with class weighting.
  - Support Vector Machine (SVM) using Sequential Minimal Optimization (SMO).
  - Ensemble voting (hard and soft) combining multiple models.
- Evaluation Metrics: Accuracy, precision, recall, and F1-score.
- Balanced Dataset: Handles class imbalance by sampling equal numbers of diabetic and non-diabetic cases.
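
The preprocessing steps listed above (class balancing, standardization, PCA) can be sketched with NumPy alone. This is a minimal illustration, not the repository's actual code: the function names `balance_classes`, `standardize`, and `pca` are hypothetical, the labels are assumed to be 0 (non-diabetic) and 1 (diabetic), balancing is assumed to mean undersampling the larger class, and the data is synthetic rather than BRFSS.

```python
import numpy as np


def balance_classes(X, y, seed=0):
    """Undersample so both classes appear equally often (labels assumed 0/1)."""
    rng = np.random.default_rng(seed)
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))
    keep = np.concatenate([
        rng.choice(idx0, size=n, replace=False),
        rng.choice(idx1, size=n, replace=False),
    ])
    rng.shuffle(keep)  # avoid having all of one class first
    return X[keep], y[keep]


def standardize(X):
    """Scale each feature to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled
    return (X - mean) / std


def pca(X, n_components):
    """Project standardized data onto its top principal components."""
    cov = np.cov(X, rowvar=False)             # (n_features, n_features)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return X @ eigvecs[:, order]


# Synthetic, imbalanced stand-in data (shapes chosen arbitrarily).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))
y = (rng.random(1000) < 0.15).astype(int)

X_bal, y_bal = balance_classes(X, y)
X_std = standardize(X_bal)
X_red = pca(X_std, n_components=3)
```

Standardizing before PCA keeps features with large numeric ranges from dominating the principal components.
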
- Python 3.8+
- Libraries:
  - `numpy`
  - `collections` (for `Counter` in ensemble voting)
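
To show where `Counter` fits, here is a minimal sketch of hard and soft voting over the outputs of several classifiers. The function names `hard_vote` and `soft_vote`, the 0.5 threshold, and the example predictions are assumptions made for illustration; the repository's scripts may combine models differently.

```python
from collections import Counter

import numpy as np


def hard_vote(predictions):
    """Majority vote over per-model label arrays (one 1-D array per model)."""
    stacked = np.stack(predictions, axis=1)  # (n_samples, n_models)
    # On a tie, Counter.most_common returns the label it encountered first.
    return np.array(
        [Counter(row.tolist()).most_common(1)[0][0] for row in stacked]
    )


def soft_vote(probabilities, threshold=0.5):
    """Average each model's P(diabetic) and threshold the mean."""
    mean_prob = np.mean(np.stack(probabilities, axis=1), axis=1)
    return (mean_prob >= threshold).astype(int)


# Made-up outputs from three models on four samples.
preds = [
    np.array([1, 0, 1, 0]),
    np.array([1, 1, 0, 0]),
    np.array([0, 1, 1, 0]),
]
probs = [
    np.array([0.9, 0.4, 0.6, 0.1]),
    np.array([0.8, 0.7, 0.3, 0.2]),
    np.array([0.4, 0.6, 0.7, 0.3]),
]

hard = hard_vote(preds)
soft = soft_vote(probs)
```

Using an odd number of models avoids ties in hard voting for a binary problem; soft voting requires every model to output a probability or a comparable score.
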
- Clone this repository:
    git clone https://github.com/your-username/your-repo-name.git
    cd your-repo-name