Toxicity Prediction using Random Forest Classifier
This project aims to build a classification model to predict toxicity based on molecular descriptors. We use a Random Forest Classifier, applying feature selection and hyperparameter tuning to optimize its performance.
- Data Loading and Initial Exploration
First, we load the data.csv file into a pandas DataFrame and perform an initial inspection to understand its structure and content.
- df.head(): Displays the first 5 rows of the dataset.
- df.describe(): Provides a statistical summary of the numerical columns.
- df['Class'].value_counts() and count plot: Analyzes the distribution of the target variable 'Class' (NonToxic vs. Toxic) and visualizes it to check for class imbalance.
Initial Observation: The dataset contains molecular descriptors and a 'Class' column (NonToxic/Toxic). There is a class imbalance, with significantly more 'NonToxic' samples than 'Toxic' samples.
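The inspection steps above can be sketched as follows. This is a minimal illustration, not the project's notebook: the tiny DataFrame and its column names `desc_1` and `desc_2` are hypothetical stand-ins for data.csv, and only the 'Class' column name comes from the text.

```python
import pandas as pd

# Hypothetical stand-in for data.csv: two descriptor columns plus 'Class'.
# With the real file this would be: df = pd.read_csv("data.csv")
df = pd.DataFrame(
    {
        "desc_1": [0.12, 1.40, 0.88, 2.05, 0.33, 1.10],
        "desc_2": [5.0, 3.2, 4.1, 2.8, 4.9, 3.7],
        "Class": ["NonToxic", "NonToxic", "Toxic", "NonToxic", "NonToxic", "Toxic"],
    }
)

print(df.head())      # first 5 rows
print(df.describe())  # summary statistics for the numeric columns

# Absolute and relative class frequencies reveal any imbalance.
class_counts = df["Class"].value_counts()
class_share = df["Class"].value_counts(normalize=True)
print(class_counts)
print(class_share.round(2))
```

The count plot is left out here to keep the sketch free of plotting dependencies; the relative frequencies carry the same information.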
- Data Preparation
- Feature and Target Split: The dataset is split into features (X, all columns except 'Class' and 'Class_encoded') and the target variable (y, the 'Class' column).
- Train-Test Split: The data is further divided into training (70%) and testing (30%) sets so that the model can be evaluated on data it did not see during training.
- X_train shape: (119, 1203)
- X_test shape: (52, 1203)
- y_train shape: (119,)
- y_test shape: (52,)
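A sketch of the split, under stated assumptions: the data is synthetic (random values with the same dimensions, 171 rows being the sum of the reported 119 and 52), the encoding of 'Class_encoded' as Toxic = 1 is a guess, and `random_state=42` is assumed rather than taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the project's dimensions:
# 171 samples, 1203 descriptor columns, plus 'Class' and 'Class_encoded'.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((171, 1203)), columns=[f"d{i}" for i in range(1203)])
df["Class"] = rng.choice(["NonToxic", "Toxic"], size=171, p=[0.7, 0.3])
df["Class_encoded"] = (df["Class"] == "Toxic").astype(int)  # assumed encoding

# Features: everything except the two label columns. Target: 'Class'.
X = df.drop(columns=["Class", "Class_encoded"])
y = df["Class"]

# test_size=0.3 on 171 rows yields 52 test rows (the test count is rounded up).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # random_state is an assumption
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With 171 rows, a 30% test fraction gives the 119/52 split listed above.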
- Feature Selection using Chi-squared
Chi-squared feature selection is applied to reduce dimensionality and select the most statistically significant features. Since Chi-squared requires non-negative input, a MinMaxScaler is first used to scale the features between 0 and 1.
- MinMaxScaler: Scales features to a non-negative range.
- SelectKBest(chi2, k=400): Selects the top 400 features based on the Chi-squared statistical test.
Results of Feature Selection:
- Original number of features: 1203
- Number of features after Chi-squared selection: 400
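The scaling and selection steps might look like the sketch below. The arrays are synthetic placeholders with the shapes reported earlier, and fitting the scaler and selector on the training split only is an assumption made here to avoid leaking test information; the text does not say how the project did it.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training and test splits.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(119, 1203))  # includes negative values
X_test = rng.normal(size=(52, 1203))
y_train = rng.choice(["NonToxic", "Toxic"], size=119, p=[0.7, 0.3])

# Scale to [0, 1] so that chi2 receives non-negative input.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Keep the 400 features with the highest chi-squared scores.
selector = SelectKBest(chi2, k=400)
X_train_chi2_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_chi2_selected = selector.transform(X_test_scaled)

print(X_train_chi2_selected.shape, X_test_chi2_selected.shape)
```

`selector.get_support()` returns a boolean mask over the original 1203 columns, which is useful for mapping the selected features back to descriptor names.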
- Model Training and Evaluation (Before Hyperparameter Tuning)
A RandomForestClassifier is trained using its default parameters on the Chi-squared selected features. This provides a baseline performance for the model.
- Model: RandomForestClassifier(random_state=42)
- Training Data: X_train_chi2_selected, y_train
- Testing Data: X_test_chi2_selected, y_test
Model Performance Before Tuning:
- Accuracy: 0.596 (approximately 59.6%)
This initial accuracy serves as a benchmark against which the performance after hyperparameter tuning will be compared.
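A baseline of this kind could be produced as sketched here. The data comes from `make_classification` and is purely illustrative, so the printed accuracy will not match the 0.596 reported above; only the estimator settings and variable names follow the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 400 selected features (171 samples in total).
X, y = make_classification(
    n_samples=171, n_features=400, n_informative=20, weights=[0.7, 0.3], random_state=0
)
X_train_chi2_selected, X_test_chi2_selected, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Default hyperparameters; only the random seed is fixed.
baseline = RandomForestClassifier(random_state=42)
baseline.fit(X_train_chi2_selected, y_train)

baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test_chi2_selected))
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```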
- Hyperparameter Tuning (RandomizedSearchCV)
To improve the model's performance, hyperparameter tuning is performed using RandomizedSearchCV. This method efficiently searches a defined hyperparameter space for the best combination.
- Model: RandomForestClassifier(random_state=42)
- Hyperparameter Grid (param_grid):
  - n_estimators: [100, 200, 300]
  - max_depth: [None, 10, 20]
  - min_samples_split: [2, 5, 10]
- Search Strategy: RandomizedSearchCV with n_iter=10 (number of parameter settings sampled) and cv=5 (5-fold cross-validation).
- Training Data: X_train_chi2_selected, y_train
Best Score from Hyperparameter Tuning:
- model.best_score_: 0.6804347826086957 (approximately 68.04%)
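The best cross-validation score (about 68.0%) is higher than the baseline test accuracy (59.6%). The two figures are not directly comparable, however: best_score_ is the mean score over the 5 validation folds of the training data, while the baseline was measured on the held-out test set. The gap therefore suggests, but does not establish, that tuning was beneficial.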
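The search described in this section can be sketched as follows. The grid, `n_iter`, and `cv` values come from the text; the small synthetic training set and the `random_state` passed to the search are assumptions added so the example runs quickly and repeatably.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic stand-in for X_train_chi2_selected / y_train.
X_train, y_train = make_classification(
    n_samples=119, n_features=30, n_informative=8, weights=[0.7, 0.3], random_state=0
)

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

model = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=10,        # 10 of the 27 possible combinations are sampled
    cv=5,             # 5-fold cross-validation on the training data
    random_state=42,  # assumption: makes the sampled combinations repeatable
)
model.fit(X_train, y_train)

print(model.best_params_)
print(f"Best mean CV score: {model.best_score_:.4f}")
```

By default the search refits the best configuration on the full training data, so `model.predict` can be called directly afterwards.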
- Model Evaluation (After Hyperparameter Tuning)
The final model, with the best hyperparameters found by RandomizedSearchCV, is evaluated on the unseen test set (X_test_chi2_selected).
- Predictions: y_pred = model.predict(X_test_chi2_selected)
- Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, Classification Report, Confusion Matrix.
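The metrics listed above map onto scikit-learn calls roughly as in this sketch. The two label lists are hypothetical and unrelated to the project's results; they only stand in for y_test and the output of model.predict.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels standing in for y_test and model.predict(X_test_chi2_selected).
y_test = ["NonToxic", "NonToxic", "NonToxic", "NonToxic", "Toxic", "Toxic", "Toxic", "NonToxic"]
y_pred = ["NonToxic", "NonToxic", "Toxic", "NonToxic", "Toxic", "NonToxic", "NonToxic", "NonToxic"]

labels = ["NonToxic", "Toxic"]
accuracy = accuracy_score(y_test, y_pred)

# pos_label selects 'Toxic' as the class of interest for the binary metrics.
precision = precision_score(y_test, y_pred, pos_label="Toxic")
recall = recall_score(y_test, y_pred, pos_label="Toxic")
f1 = f1_score(y_test, y_pred, pos_label="Toxic")

# Rows are actual classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_test, y_pred, labels=labels)

print(classification_report(y_test, y_pred, labels=labels))
print(cm)
```

Passing `labels` explicitly fixes the row and column order of the confusion matrix, which makes the false negatives for 'Toxic' easy to locate (second row, first column).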
Model Performance After Tuning:
- Accuracy: 0.63
- Precision (Toxic): 0.20
- Recall (Toxic): 0.06
- F1-Score (Toxic): 0.10
Classification Report:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| NonToxic | 0.68 | 0.89 | 0.77 | 36 |
| Toxic | 0.20 | 0.06 | 0.10 | 16 |
| accuracy | | | 0.63 | 52 |
| macro avg | 0.44 | 0.48 | 0.43 | 52 |
| weighted avg | 0.53 | 0.63 | 0.56 | 52 |
Confusion Matrix: (Heatmap visualization would be displayed here)
Analysis of Performance After Tuning: Overall accuracy rose slightly, from 59.6% at baseline to 63%, but the model still struggles to predict the 'Toxic' class. The low precision (0.20) means that when the model predicts a sample is 'Toxic', it is correct only 20% of the time. The even lower recall (0.06) means the model identifies only about 6% of the actual 'Toxic' samples, which with a support of 16 corresponds to roughly 1 sample. This is evident from the confusion matrix, which shows a high number of false negatives for the 'Toxic' class.
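The arithmetic behind these figures can be checked with a few lines of plain Python. The four counts below are inferred from the rounded values in the classification report (supports of 36 and 16, Toxic precision 0.20 and recall 0.06); they are not read from the original confusion matrix, so treat them as a reconstruction.

```python
# Counts inferred from the classification report, not from the original heatmap.
tn, fp = 32, 4   # actual NonToxic samples (support 36)
fn, tp = 15, 1   # actual Toxic samples (support 16)

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_toxic = tp / (tp + fp)   # share of 'Toxic' predictions that are right
recall_toxic = tp / (tp + fn)      # share of actual 'Toxic' samples that are found
f1_toxic = 2 * precision_toxic * recall_toxic / (precision_toxic + recall_toxic)

# Accuracy of a trivial classifier that always predicts 'NonToxic'.
majority_accuracy = (tn + fp) / total

print(
    f"accuracy={accuracy:.2f} precision={precision_toxic:.2f} "
    f"recall={recall_toxic:.4f} f1={f1_toxic:.2f} majority={majority_accuracy:.2f}"
)
```

Under these inferred counts, always predicting 'NonToxic' would score about 69% accuracy on the test set, above the tuned model's 63%, which is why accuracy alone is a weak yardstick here.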
Comparison (Before vs. After Tuning):
- Accuracy (Before Tuning): ~59.6%
- Accuracy (After Tuning): ~63%
Hyperparameter tuning resulted in a modest increase in overall test accuracy (about 3.4 percentage points), but the critical metrics for the minority 'Toxic' class (precision, recall, F1-score) are very low after tuning. This suggests that while the model might be slightly better overall, it still is not effective at identifying the 'Toxic' samples.
- Conclusion
The project implemented a Random Forest Classifier with Chi-squared feature selection and hyperparameter tuning. Tuning improved overall accuracy slightly, but the model's ability to identify 'Toxic' samples remains very poor, most likely in large part because of the class imbalance.