GitHub - duncanian303-cloud/ToxicityML: Machine learning projects

Toxicity Prediction using Random Forest Classifier

This project aims to build a classification model to predict toxicity based on molecular descriptors. We use a Random Forest Classifier, applying feature selection and hyperparameter tuning to optimize its performance.

Data Loading and Initial Exploration

First, we load the data.csv file into a pandas DataFrame and perform an initial inspection to understand its structure and content.

df.head(): Displays the first 5 rows of the dataset.
df.describe(): Provides a statistical summary of the numerical columns.
df['Class'].value_counts() and countplot: Analyzes the distribution of the target variable 'Class' (NonToxic vs. Toxic) and visualizes it to check for class imbalance.

Initial Observation: The dataset contains molecular descriptors and a 'Class' column (NonToxic/Toxic). There's a class imbalance, with significantly more 'NonToxic' samples than 'Toxic' samples.
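The inspection steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `summarize` and `plot_class_counts` and the tiny stand-in table are assumptions, and the real project would load `data.csv` with `pd.read_csv` instead.

```python
import pandas as pd


def summarize(df, target="Class"):
    """Return the first rows, the numeric summary, and the class counts."""
    return df.head(), df.describe(), df[target].value_counts()


def plot_class_counts(df, target="Class"):
    """Draw a count plot of the target column (assumes seaborn is installed)."""
    import seaborn as sns  # imported lazily so the rest runs without seaborn
    return sns.countplot(data=df, x=target)


# Hypothetical stand-in table; the project itself would use
# df = pd.read_csv("data.csv")
demo = pd.DataFrame({
    "desc_a": [0.1, 0.4, 0.3, 0.9, 0.5, 0.2],
    "desc_b": [1.0, 2.0, 1.5, 3.0, 2.5, 0.5],
    "Class": ["NonToxic", "NonToxic", "Toxic", "NonToxic", "Toxic", "NonToxic"],
})

head, stats, counts = summarize(demo)
print(counts)  # class distribution, used to spot imbalance
```

Comparing the two values in `counts` is usually enough to see the imbalance before plotting anything.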

Data Preparation

Feature and Target Split: The dataset is split into features (X, all columns except 'Class' and 'Class_encoded') and the target variable (y, the 'Class' column).
Train-Test Split: The data is further divided into training (70%) and testing (30%) sets so that the model is evaluated on data it did not see during training.
- X_train shape: (119, 1203)
- X_test shape: (52, 1203)
- y_train shape: (119,)
- y_test shape: (52,)
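A split like the one described above might look like this. It is a sketch under assumptions: the helper name `split_features_target`, the `random_state` value, and the random stand-in table are not taken from the source. The stand-in uses 171 rows and 1203 descriptor columns only so that the resulting shapes can be compared with the ones listed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_features_target(df, target="Class", extra_drop=("Class_encoded",),
                          test_size=0.3, seed=42):
    """Split a descriptor table into train/test features and labels."""
    # errors="ignore" lets this run even if 'Class_encoded' is absent.
    X = df.drop(columns=[target, *extra_drop], errors="ignore")
    y = df[target]
    # seed=42 is an assumption; the source does not state the split's seed.
    return train_test_split(X, y, test_size=test_size, random_state=seed)


# Hypothetical stand-in data with the same dimensions as the README reports.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((171, 1203)),
                    columns=[f"d{i}" for i in range(1203)])
demo["Class"] = rng.choice(["NonToxic", "Toxic"], size=171)

X_train, X_test, y_train, y_test = split_features_target(demo)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Passing `stratify=y` to `train_test_split` would keep the class ratio equal in both sets; whether the original notebook does this is not stated.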

Feature Selection using Chi-squared

Chi-squared feature selection is applied to reduce dimensionality and select the most statistically significant features. Since Chi-squared requires non-negative input, a MinMaxScaler is first used to scale the features between 0 and 1.

MinMaxScaler: Scales features to a non-negative range.
SelectKBest(chi2, k=400): Selects the top 400 features based on the Chi-squared statistical test.

Results of Feature Selection:

Original number of features: 1203
Number of features after Chi-squared selection: 400
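The scaling and selection steps could be sketched as below. The helper name `chi2_select` and the random stand-in arrays are assumptions. Fitting the scaler and selector on the training split only is also an assumption about good practice, since the source does not say how the notebook fits them.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler


def chi2_select(X_train, X_test, y_train, k=400):
    """Scale features to [0, 1], then keep the k best by the Chi-squared test."""
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
    X_test_scaled = scaler.transform(X_test)        # reuse the training ranges

    selector = SelectKBest(chi2, k=k)
    X_train_sel = selector.fit_transform(X_train_scaled, y_train)
    X_test_sel = selector.transform(X_test_scaled)
    return X_train_sel, X_test_sel, selector


# Hypothetical stand-in data; normal draws include negative values,
# which chi2 would reject without the scaling step.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(119, 1203))
X_test_demo = rng.normal(size=(52, 1203))
y_train_demo = rng.choice(["NonToxic", "Toxic"], size=119)

X_train_chi2_selected, X_test_chi2_selected, selector = chi2_select(
    X_train_demo, X_test_demo, y_train_demo
)
print(X_train_chi2_selected.shape, X_test_chi2_selected.shape)
```

`selector.get_support()` returns a boolean mask over the original columns, which is useful for recovering the names of the retained descriptors.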

Model Training and Evaluation (Before Hyperparameter Tuning)

A RandomForestClassifier is trained using its default parameters on the Chi-squared selected features. This provides a baseline performance for the model.

Model: RandomForestClassifier(random_state=42)
Training Data: X_train_chi2_selected, y_train
Testing Data: X_test_chi2_selected, y_test

Model Performance Before Tuning:

Accuracy: 0.596 (approximately 59.6%)

This initial accuracy serves as a benchmark against which the performance after hyperparameter tuning will be compared.
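A baseline of this kind can be sketched as follows. The helper name `baseline_accuracy` and the synthetic data from `make_classification` are assumptions for illustration, so the printed accuracy will not match the 0.596 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def baseline_accuracy(X_train, y_train, X_test, y_test, seed=42):
    """Fit a default-parameter random forest and return it with its test accuracy."""
    clf = RandomForestClassifier(random_state=seed)
    clf.fit(X_train, y_train)
    return clf, accuracy_score(y_test, clf.predict(X_test))


# Hypothetical stand-in for the Chi-squared selected features.
X_demo, y_demo = make_classification(n_samples=171, n_features=20, random_state=0)
clf, acc = baseline_accuracy(X_demo[:119], y_demo[:119], X_demo[119:], y_demo[119:])
print(f"baseline accuracy: {acc:.3f}")
```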

Hyperparameter Tuning (RandomizedSearchCV)

To improve the model's performance, hyperparameter tuning is performed using RandomizedSearchCV. This method efficiently searches a defined hyperparameter space for the best combination.

Model: RandomForestClassifier(random_state=42)
Hyperparameter Grid (param_grid):
- n_estimators: [100, 200, 300]
- max_depth: [None, 10, 20]
- min_samples_split: [2, 5, 10]
Search Strategy: RandomizedSearchCV with n_iter=10 (number of parameter settings sampled) and cv=5 (5-fold cross-validation).
Training Data: X_train_chi2_selected, y_train

Best Score from Hyperparameter Tuning:

model.best_score_: np.float64(0.6804347826086957) (approximately 68.04%)
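The search described above might be set up along these lines. The grid, `n_iter`, and `cv` values are taken from this section. The helper name `tune_random_forest`, the seed passed to the search itself, and the small synthetic dataset are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Grid as listed in this section (27 possible combinations).
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}


def tune_random_forest(X_train, y_train, n_iter=10, cv=5, seed=42):
    """Sample n_iter settings from param_grid and score each with cv-fold CV."""
    search = RandomizedSearchCV(
        estimator=RandomForestClassifier(random_state=seed),
        param_distributions=param_grid,
        n_iter=n_iter,
        cv=cv,
        random_state=seed,  # assumption: the source does not say if the search is seeded
    )
    search.fit(X_train, y_train)
    return search


# Hypothetical stand-in for X_train_chi2_selected, y_train.
X_demo, y_demo = make_classification(n_samples=80, n_features=10, random_state=0)
model = tune_random_forest(X_demo, y_demo)
print(model.best_params_)
print(model.best_score_)  # mean cross-validated accuracy of the best setting
```

With the default `refit=True`, the returned object is refit on the full training data using the best setting, so `model.predict` can be called on it directly.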

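The best cross-validation score (about 68.0%) is higher than the baseline test accuracy (59.6%). The two figures are measured on different data (folds of the training set versus the held-out test set), so the gap is only a rough indication that tuning was beneficial.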

Model Evaluation (After Hyperparameter Tuning)

The final model, with the best hyperparameters found by RandomizedSearchCV, is evaluated on the unseen test set (X_test_chi2_selected).

Predictions: y_pred = model.predict(X_test_chi2_selected)
Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, Classification Report, Confusion Matrix.
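The metrics listed above can be computed with scikit-learn roughly as follows. The helper name `evaluate` and the ten hand-made labels are assumptions for illustration and are not the project's predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def evaluate(y_true, y_pred, positive="Toxic", labels=("NonToxic", "Toxic")):
    """Collect accuracy, positive-class metrics, report, and confusion matrix."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, pos_label=positive, zero_division=0),
        "recall": recall_score(y_true, y_pred, pos_label=positive, zero_division=0),
        "f1": f1_score(y_true, y_pred, pos_label=positive, zero_division=0),
        "report": classification_report(y_true, y_pred, labels=list(labels), zero_division=0),
        # Rows are true classes, columns are predicted classes, in `labels` order.
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=list(labels)),
    }


# Hypothetical labels: 6 NonToxic and 4 Toxic samples.
y_true = ["NonToxic"] * 6 + ["Toxic"] * 4
y_pred = ["NonToxic"] * 5 + ["Toxic"] * 3 + ["NonToxic"] * 2

results = evaluate(y_true, y_pred)
print(results["report"])
print(results["confusion_matrix"])
```

Setting `pos_label` explicitly matters here because the labels are strings, so scikit-learn cannot infer which class counts as positive.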

Model Performance After Tuning:

Accuracy: 0.63
Precision (Toxic): 0.20
Recall (Toxic): 0.06
F1-Score (Toxic): 0.10

Classification Report:

              precision    recall  f1-score   support

    NonToxic       0.68      0.89      0.77        36
       Toxic       0.20      0.06      0.10        16

    accuracy                           0.63        52
   macro avg       0.44      0.48      0.43        52
weighted avg       0.53      0.63      0.56        52

Confusion Matrix: (Heatmap visualization would be displayed here)

Analysis of Performance After Tuning: Overall accuracy rose slightly, from 59.6% at baseline to 63%, but the model still struggles with the 'Toxic' class. A precision of 0.20 means that only one in five samples predicted as 'Toxic' is actually toxic. A recall of 0.06 means the model finds only about 6% of the truly toxic samples, which with a support of 16 corresponds to a single correct detection. The confusion matrix reflects this as a high number of false negatives for the 'Toxic' class.
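Since the heatmap itself is not reproduced here, the confusion-matrix counts can be recovered from the classification report. The sketch below searches for integer counts that agree with the reported 'Toxic' precision (0.20) and recall (0.06) given the supports of 36 and 16. The helper name `consistent_counts` is an assumption, and the result is a reconstruction from the rounded report values, not output copied from the notebook.

```python
def consistent_counts(n_neg, n_pos, precision, recall, digits=2):
    """Return every (tn, fp, fn, tp) whose rounded precision and recall match."""
    matches = []
    for tp in range(n_pos + 1):
        for fp in range(n_neg + 1):
            if tp + fp == 0:
                continue  # precision is undefined when nothing is predicted positive
            same_precision = round(tp / (tp + fp), digits) == precision
            same_recall = round(tp / n_pos, digits) == recall
            if same_precision and same_recall:
                matches.append((n_neg - fp, fp, n_pos - tp, tp))
    return matches


# Supports and 'Toxic' metrics as printed in the classification report.
matches = consistent_counts(n_neg=36, n_pos=16, precision=0.20, recall=0.06)
print(matches)  # [(32, 4, 15, 1)]

tn, fp, fn, tp = matches[0]
print(round((tn + tp) / 52, 2))  # 0.63, matching the reported accuracy
```

Only one combination fits: 1 true positive, 4 false positives, 15 false negatives, and 32 true negatives, which is consistent with every row of the report.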

Comparison (Before vs. After Tuning):

Accuracy (Before Tuning): ~59.6%
Accuracy (After Tuning): ~63%

Hyperparameter tuning produced a modest increase in test accuracy, but the critical metrics for the minority 'Toxic' class (precision, recall, F1-score) remain very low. Because 36 of the 52 test samples are 'NonToxic', a model that always predicts 'NonToxic' would reach about 69% accuracy, so accuracy alone overstates how well the tuned model works. It still isn't effective at identifying the 'Toxic' samples.

Conclusion and Next Steps

The project implemented a Random Forest Classifier with Chi-squared feature selection and hyperparameter tuning. Tuning improved overall accuracy slightly, but the model's ability to identify 'Toxic' samples remains very poor, which is likely due in large part to the class imbalance.
