This repository contains solved classification exercises using Machine Learning models in Python with scikit-learn and the ISLP textbook package. The source code explores different classification algorithms and how measured accuracy changes when a model is evaluated on a held-out test split versus on the same full dataset it was trained on.
The main script `exercicios_logistica.py` consists of 3 practical parts:
- Random Forest and Breast Cancer Dataset: Loads native `sklearn` data, trains the model using an 80%/20% train/test split, plots a confusion matrix heatmap, and outputs a complete classification report.
- Logistic Regression with Train/Test Split: Re-implements 4 classic exercises from the `ISLP` library, fitting each model on a training sample and validating it on unseen rows of the table, with categorical columns binarized into indicator (dummy) columns via `get_dummies`.
  - `ISLP::Default` (Target: `student`)
  - `ISLP::Smarket` (Target: `Direction`)
  - `ISLP::Weekly` (Target: `Direction`)
  - `ISLP::Caravan` (Target: `Purchase`)
- Basic Logistic Regression: Removes `train_test_split` to train and verify accuracy on the entire dataset, illustrating the statistical concept of "training error underestimation": accuracy looks better than it really is when the model predicts samples it has already seen.
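The ideas behind the three parts can be sketched in a few lines. This is a minimal illustration, not the contents of `exercicios_logistica.py`: the `random_state` values and the `stratify` argument are assumptions, the `toy` frame is a hypothetical stand-in for an ISLP table, and a Random Forest is used for the no-split comparison as well (the script itself uses logistic regression there) so that only scikit-learn's bundled data is needed.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Part 1 style: native sklearn data with an 80%/20% train/test split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # assumed settings
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted
test_acc = accuracy_score(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

# Part 2 style binarization: get_dummies turns Yes/No text columns into
# indicator columns. `toy` is a hypothetical frame, not real ISLP data.
toy = pd.DataFrame(
    {
        "balance": [729.5, 817.2, 1073.5, 529.3],
        "default": ["No", "No", "Yes", "No"],
        "student": ["No", "Yes", "No", "Yes"],
    }
)
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())

# Part 3 style: no split. The model is scored on the rows it was fitted on.
full_model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
full_acc = accuracy_score(y, full_model.predict(X))

print(f"held-out accuracy: {test_acc:.3f}")
print(f"no-split accuracy: {full_acc:.3f}")
```

In a run like this, the no-split accuracy is typically at or very near 100% because the model is asked to predict rows it has already seen, while the held-out figure is lower. That gap is the underestimation of error that Part 3 refers to.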
To run this project, make sure you have Python 3.10+ and execute the command below in your terminal to fetch the core requirements:

    pip install ISLP pandas scikit-learn matplotlib seaborn

Once done, run the main file:

    python exercicios_logistica.py

(Note that the Logistic Regression reports from Parts 2 and 3 print in the terminal only after you close the Matplotlib figure window that opens in Part 1.)
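The wait described in the note comes from the interactive figure window, which holds the script until it is closed. One possible way around it, shown here as a sketch rather than as what the script does, is to select Matplotlib's non-interactive `Agg` backend and save the heatmap to a file. The confusion matrix counts and the output path below are hypothetical, and plain Matplotlib stands in for the seaborn heatmap to keep the example small.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: no window opens, nothing blocks
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical confusion matrix counts (rows: actual, columns: predicted).
cm = np.array([[40, 2], [3, 69]])

fig, ax = plt.subplots()
im = ax.imshow(cm, cmap="Blues")
for (i, j), value in np.ndenumerate(cm):
    ax.text(j, i, str(value), ha="center", va="center")  # annotate each cell
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
fig.colorbar(im, ax=ax)

# Save instead of showing; the path is an assumption for this sketch.
out_path = Path(tempfile.gettempdir()) / "confusion_matrix.png"
fig.savefig(out_path)
plt.close(fig)

print(f"Heatmap written to {out_path}; later reports can print right away.")
```

The trade-off is that the figure is no longer displayed during the run, so it has to be opened from disk afterwards.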