This project focuses on predicting the absorption maxima (λₘₐₓ) of organic chromophores based solely on their molecular structure. Using a supervised machine learning approach, I developed a complete pipeline that takes SMILES representations as input and outputs predicted absorption wavelengths in nanometers. The model is trained on a large, experimentally measured dataset and built around a Random Forest regressor, which has proven effective for structured, high-dimensional input like molecular fingerprints.
The pipeline begins by preprocessing the dataset: cleaning missing values, canonicalizing SMILES, and generating 1024-bit Morgan fingerprints to numerically encode molecular structure. A stratified train/test split ensures balanced distribution of outliers and avoids data leakage. After training, the model’s performance is evaluated using standard regression metrics, including Mean Squared Error (MSE) and R² score. An additional outlier detection step is integrated to identify predictions with unusually high error, helping improve the model’s robustness and interpretation.
Overall, this project demonstrates how machine learning can offer a fast, low-cost alternative to experimental UV–Vis measurements. While inspired by open-source examples, all code and pipeline logic were developed independently and adapted to suit the specific dataset and goals of this work.
- Loads and cleans raw dataset
- Canonicalizes SMILES using RDKit
- Converts spectral peaks into a consistent format
- Flags outliers based on IQR
- Outputs cleaned dataset ready for modeling
- Generates 1024-bit Morgan fingerprints
- Splits data into train/test while preserving outlier distribution
- Trains a Random Forest regressor with tuned hyperparameters
- Evaluates performance and saves metrics
- Model: Random Forest Regressor
- Features: Morgan fingerprints (radius=2, nBits=1024)
- Target: Absorption maximum (λₘₐₓ in nm)
n_estimators: 134max_depth: 78max_features: 0.3007min_samples_split: 2bootstrap: True
- Mean Squared Error (MSE)
- Coefficient of Determination (R²)
- Parity plot and error histogram (saved in
/figures/)
- Test R²: 0.9339
- Test MSE: 748.52 nm²
- Improved results after integrating outlier flagging logic
Download the experimental chromophore dataset from: DB for Chromophore – Figshare
Save the .csv file inside:
/dataset/raw_dataset/
python script/run_pipeline.py
This will process the data, train the model, and save outputs to processed_dataset/ and figures/.