duncanian303-cloud/ToxicityML

Toxicity Prediction using Random Forest Classifier

This project aims to build a classification model to predict toxicity based on molecular descriptors. We use a Random Forest Classifier, applying feature selection and hyperparameter tuning to optimize its performance.

1. Data Loading and Initial Exploration

First, we load the data.csv file into a pandas DataFrame and perform an initial inspection to understand its structure and content.

  • df.head(): Displays the first 5 rows of the dataset.
  • df.describe(): Provides a statistical summary of the numerical columns.
  • df['Class'].value_counts() and count plot: Shows the distribution of the target variable 'Class' (NonToxic vs. Toxic) and visualizes it to check for class imbalance.

Initial Observation: The dataset contains molecular descriptors and a 'Class' column (NonToxic/Toxic). The classes are imbalanced, with significantly more 'NonToxic' samples than 'Toxic' samples.
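The inspection steps above can be sketched as follows. This is a minimal illustration, not the project's notebook code: the helper name summarize, the toy column names desc_a and desc_b, and their values are hypothetical stand-ins so the snippet runs without data.csv.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "Class") -> pd.Series:
    """Print a quick overview of the frame and return the class counts."""
    print(df.head())       # first 5 rows
    print(df.describe())   # summary statistics for numerical columns
    counts = df[target].value_counts()
    print(counts)          # class distribution, used to spot imbalance
    return counts


# Tiny stand-in for data.csv (hypothetical descriptor names and values).
toy = pd.DataFrame(
    {
        "desc_a": [0.1, 0.4, 0.3, 0.9, 0.5],
        "desc_b": [1.0, 2.5, 0.7, 3.1, 1.8],
        "Class": ["NonToxic", "NonToxic", "Toxic", "NonToxic", "Toxic"],
    }
)
counts = summarize(toy)

# With the real file one would instead run:
#   df = pd.read_csv("data.csv"); summarize(df)
# and the count plot could be drawn with seaborn.countplot(x="Class", data=df).
```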

2. Data Preparation

  • Feature and Target Split: The dataset is split into features (X, all columns except 'Class' and 'Class_encoded') and the target variable (y, the 'Class' column).
  • Train-Test Split: The data is further divided into training (70%) and testing (30%) sets, so the model is evaluated on samples it did not see during training.
    • X_train shape: (119, 1203)
    • X_test shape: (52, 1203)
    • y_train shape: (119,)
    • y_test shape: (52,)
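
A sketch of the split on synthetic data with the same dimensions (171 rows in total, as implied by the shapes above, and 1203 descriptor columns). The random values, the generated column names, and random_state=42 are assumptions for illustration. The README does not say whether the original split was stratified, so none is applied here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 171, 1203  # 119 + 52 rows, 1203 descriptors

# Synthetic stand-in for the real descriptors (values are random).
df = pd.DataFrame(
    rng.random((n_samples, n_features)),
    columns=[f"d{i}" for i in range(n_features)],
)
df["Class"] = rng.choice(["NonToxic", "Toxic"], size=n_samples, p=[0.7, 0.3])
df["Class_encoded"] = (df["Class"] == "Toxic").astype(int)

# Features: everything except the two label columns. Target: 'Class'.
X = df.drop(columns=["Class", "Class_encoded"])
y = df["Class"]

# 70/30 split; random_state is an assumption made for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With 171 rows and test_size=0.3, scikit-learn rounds the test set up to 52 rows, which leaves 119 for training and matches the shapes listed above.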

3. Feature Selection using Chi-squared

Chi-squared feature selection is applied to reduce dimensionality and select the most statistically significant features. Since Chi-squared requires non-negative input, a MinMaxScaler is first used to scale the features between 0 and 1.

  • MinMaxScaler: Scales features to a non-negative range.
  • SelectKBest(chi2, k=400): Selects the top 400 features based on the Chi-squared statistical test.

Results of Feature Selection:

  • Original number of features: 1203
  • Number of features after Chi-squared selection: 400
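
The scaling and selection steps might look like the sketch below. It uses random stand-in arrays with the shapes reported above, and it assumes the scaler and selector are fitted on the training split only and then applied to the test split; the README does not state which split they were fitted on.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Random stand-ins with the shapes from the train-test split.
X_train = rng.normal(size=(119, 1203))
X_test = rng.normal(size=(52, 1203))
y_train = rng.integers(0, 2, size=119)

# chi2 rejects negative values, so scale every feature into [0, 1] first.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Test values outside the training range would fall outside [0, 1]; clip them.
X_test_scaled = np.clip(scaler.transform(X_test), 0.0, 1.0)

# Keep the 400 features with the highest chi-squared scores.
selector = SelectKBest(chi2, k=400)
X_train_chi2_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_chi2_selected = selector.transform(X_test_scaled)

print(X_train.shape[1], "->", X_train_chi2_selected.shape[1])
```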

4. Model Training and Evaluation (Before Hyperparameter Tuning)

A RandomForestClassifier is trained using its default parameters on the Chi-squared selected features. This provides a baseline performance for the model.

  • Model: RandomForestClassifier(random_state=42)
  • Training Data: X_train_chi2_selected, y_train
  • Testing Data: X_test_chi2_selected, y_test

Model Performance Before Tuning:

  • Accuracy: 0.596 (approximately 59.6%)

This initial accuracy serves as a benchmark against which the performance after hyperparameter tuning will be compared.
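
A minimal sketch of the baseline step, using random stand-in arrays in place of the selected features. Because the data are synthetic, the printed accuracy will not reproduce the 0.596 reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Random stand-ins for the chi-squared selected features and labels.
X_train_chi2_selected = rng.random((119, 400))
X_test_chi2_selected = rng.random((52, 400))
y_train = rng.choice(["NonToxic", "Toxic"], size=119, p=[0.7, 0.3])
y_test = rng.choice(["NonToxic", "Toxic"], size=52, p=[0.7, 0.3])

# Default hyperparameters; only the seed is fixed.
baseline = RandomForestClassifier(random_state=42)
baseline.fit(X_train_chi2_selected, y_train)

baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test_chi2_selected))
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```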

5. Hyperparameter Tuning (RandomizedSearchCV)

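To improve the model's performance, hyperparameter tuning is performed using RandomizedSearchCV. Rather than trying every combination in the search space, it samples a fixed number of combinations and scores each one with cross-validation. Here the grid has 3 × 3 × 3 = 27 possible combinations, of which 10 are sampled.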

  • Model: RandomForestClassifier(random_state=42)
  • Hyperparameter Grid (param_grid):
    • n_estimators: [100, 200, 300]
    • max_depth: [None, 10, 20]
    • min_samples_split: [2, 5, 10]
  • Search Strategy: RandomizedSearchCV with n_iter=10 (number of parameter settings sampled) and cv=5 (5-fold cross-validation).
  • Training Data: X_train_chi2_selected, y_train

Best Score from Hyperparameter Tuning:

  • model.best_score_: 0.6804 (approximately 68.04%)

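The best mean cross-validation score (about 68%) is higher than the baseline test accuracy (59.6%), but the two numbers are not directly comparable: the first is averaged over folds of the training set, while the second is measured on the held-out test set. The test-set evaluation in the next section gives the like-for-like comparison.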
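
The search described above could be set up roughly as follows. The data are random stand-ins (with only 20 columns to keep the sketch fast), and the random_state passed to RandomizedSearchCV is an assumption; the README only fixes the seed of the classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
# Random stand-ins; the real run used the 400 selected features.
X_train_chi2_selected = rng.random((119, 20))
y_train = rng.choice(["NonToxic", "Toxic"], size=119, p=[0.7, 0.3])

# Search space from the README: 3 x 3 x 3 = 27 combinations.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

model = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=10,        # sample 10 of the 27 combinations
    cv=5,             # 5-fold cross-validation on the training set
    random_state=42,  # assumption: makes the sampled combinations repeatable
)
model.fit(X_train_chi2_selected, y_train)

print(model.best_params_)
print(model.best_score_)  # mean cross-validated accuracy of the best setting
```

By default the search refits the best setting on the full training set, so model.predict can be called directly afterwards.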

6. Model Evaluation (After Hyperparameter Tuning)

The final model, with the best hyperparameters found by RandomizedSearchCV, is evaluated on the unseen test set (X_test_chi2_selected).

  • Predictions: y_pred = model.predict(X_test_chi2_selected)
  • Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, Classification Report, Confusion Matrix.
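
As a rough sketch, the listed metrics can be computed with scikit-learn as shown below. The ten labels are a hand-made toy example, not the project's predictions, and treating 'Toxic' as the positive class is an assumption that matches how the results are reported in this README.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Toy labels for illustration only (6 NonToxic, 4 Toxic).
y_test = ["NonToxic"] * 6 + ["Toxic"] * 4
y_pred = [
    "NonToxic", "NonToxic", "NonToxic", "NonToxic", "NonToxic", "Toxic",
    "Toxic", "NonToxic", "NonToxic", "NonToxic",
]
labels = ["NonToxic", "Toxic"]

accuracy = accuracy_score(y_test, y_pred)
# Metrics for the 'Toxic' class only.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, pos_label="Toxic", average="binary"
)
# Rows are actual classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_test, y_pred, labels=labels)

print(classification_report(y_test, y_pred, labels=labels))
print(cm)
# A heatmap could be drawn with, for example, seaborn.heatmap(cm, annot=True).
```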

Model Performance After Tuning:

  • Accuracy: 0.63
  • Precision (Toxic): 0.20
  • Recall (Toxic): 0.06
  • F1-Score (Toxic): 0.10

Classification Report:

              precision    recall  f1-score   support

    NonToxic       0.68      0.89      0.77        36
       Toxic       0.20      0.06      0.10        16

    accuracy                           0.63        52
   macro avg       0.44      0.48      0.43        52
weighted avg       0.53      0.63      0.56        52

Confusion Matrix: (Heatmap visualization would be displayed here)

Analysis of Performance After Tuning: While the overall accuracy improved slightly to 63% compared to the baseline, the model still struggles significantly with predicting the 'Toxic' class. The low precision (0.20) means that when the model predicts a sample is 'Toxic', it is correct only 20% of the time. The even lower recall (0.06) means that the model identifies only about 6% of the actual 'Toxic' samples, roughly 1 of the 16 in the test set. This is evident from the confusion matrix, which shows a high number of false negatives for the 'Toxic' class.
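
The arithmetic behind these figures can be checked with a short calculation. The four counts below are not read from the original heatmap; they are inferred from the rounded values and supports in the classification report (36 'NonToxic' and 16 'Toxic' test samples) and should be treated as a reconstruction.

```python
# Counts inferred from the classification report (a reconstruction).
tn, fp = 32, 4   # actual NonToxic: predicted NonToxic / predicted Toxic
fn, tp = 15, 1   # actual Toxic:    predicted NonToxic / predicted Toxic

total = tn + fp + fn + tp

precision = tp / (tp + fp)   # share of 'Toxic' predictions that are right
recall = tp / (tp + fn)      # share of actual 'Toxic' samples that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

With these counts, 15 of the 16 'Toxic' samples are false negatives, which is what drives the recall down.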

Comparison (Before vs. After Tuning):

  • Accuracy (Before Tuning): ~59.6%
  • Accuracy (After Tuning): ~63%

Hyperparameter tuning resulted in a modest increase in test accuracy (from about 59.6% to 63%), but the critical metrics for the minority 'Toxic' class (precision, recall, F1-score) remain very low. For context, always predicting 'NonToxic' would score 36/52 ≈ 69% on this test set, so accuracy alone overstates the model's usefulness. The model may be slightly better overall, but it still isn't effective at identifying the 'Toxic' samples.

7. Conclusion and Next Steps

The project implemented a Random Forest Classifier with feature selection and hyperparameter tuning. While hyperparameter tuning improved the overall accuracy slightly, the model's ability to identify 'Toxic' samples remains very poor, which is likely due in large part to the class imbalance and the small size of the dataset (171 samples in total).
