Name
rf_parameter_configs
CYP2D6_Veith-rf_morgan-al.sh
CYP2D6_Veith-rf_morgan-pl.sh
CYP2D6_Veith-rf_morgan-smal.sh
CYP3A4_Veith-rf_morgan-al.sh
CYP3A4_Veith-rf_morgan-pl.sh
CYP3A4_Veith-rf_morgan-smal.sh
MDR1_MDCK_classification2-rf_morgan-al.sh
MDR1_MDCK_classification2-rf_morgan-pl.sh
MDR1_MDCK_classification2-rf_morgan-smal.sh
PAMPA_NCATS-rf_morgan-al.sh
PAMPA_NCATS-rf_morgan-pl.sh
PAMPA_NCATS-rf_morgan-smal.sh
README.md
ames-rf_morgan-al.sh
ames-rf_morgan-pl.sh
ames-rf_morgan-smal.sh
build_processed_results.py
pgp_broccatelli-dmpnn-al.sh
pgp_broccatelli-molformer-al.sh
pgp_broccatelli-rf_morgan-al.sh
pgp_broccatelli-rf_morgan-pl.sh
pgp_broccatelli-rf_morgan-smal.sh
rf_parameter_sets-rf_morgan-al.sh
rf_parameter_sets-rf_morgan-smal.sh
rf_parameter_sets-smal-params.csv
simpd-rf_morgan-al.sh
simpd-rf_morgan-pl.sh
simpd-rf_morgan-smal.sh
simpd-smal-params.csv
stratified_shuffle-rf_morgan-al.sh
stratified_shuffle-rf_morgan-smal.sh
stratified_shuffle-smal-params.csv

Reproduction Scripts

One bash script per (dataset, model, learning_type) combination, named:

<dataset>-<model>-<learning_type>.sh

Where learning_type is one of:

| Code | Meaning |
|---|---|
| `al` | Active learning (informative selection: `--select_method explorative`) |
| `pl` | Passive learning (random selection: `--select_method random`) |
| `smal` | Short-term Memory Active Learning (AL with `--forget_method` and `--n_forget`) |

Each run script loops over the seeds, error rates, and (for SMAL) forget methods used in the manuscript, calling molalkit_run once per combination. The SIMPD experiments are grouped differently: there is one script per learning type, and each script loops over all committed datasets/CHEMBL*.csv files.
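The loop structure described above can be pictured with a minimal sketch. The seed, error-rate, and forget-method values below are placeholders chosen for illustration, not the manuscript's settings, and the sketch only prints each combination where a real script would call molalkit_run.

```shell
# Placeholder values (assumptions), NOT the settings used in the manuscript.
SEEDS="0 1 2"
ERROR_RATES="0.0 0.1"
FORGET_METHODS="forget_A forget_B"   # hypothetical forget-method names

n_runs=0
for seed in $SEEDS; do
  for err in $ERROR_RATES; do
    for fm in $FORGET_METHODS; do
      # A real script would invoke molalkit_run here; this sketch only prints.
      echo "seed=${seed} error_rate=${err} forget_method=${fm}"
      n_runs=$((n_runs + 1))
    done
  done
done
echo "total runs: ${n_runs}"
```

Because the loops are nested, the number of runs is the product of the list lengths, which is worth estimating before starting a script.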

Prerequisites

Install MolALKit v1.2.0 following the instructions in the main README. The molalkit_run CLI must be on your PATH, and python -c "import molalkit" must succeed (used by some scripts to locate packaged dataset CSVs).
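A quick way to confirm both prerequisites before launching a long run is a small preflight check. The sketch below tests only the two conditions named above; the function names have_cmd and preflight are illustrative and are not part of MolALKit.

```shell
# Returns 0 if the named command is found on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Checks the two prerequisites named above and reports what is missing.
preflight() {
  rc=0
  if ! have_cmd molalkit_run; then
    echo "molalkit_run not found on PATH" >&2
    rc=1
  fi
  if ! python -c "import molalkit" >/dev/null 2>&1; then
    echo "python cannot import molalkit" >&2
    rc=1
  fi
  return "$rc"
}

preflight || echo "Install MolALKit before running the scripts." >&2
```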

DMPNN scripts additionally require a MolALKit environment with Chemprop support and a CUDA-capable GPU for the manuscript-scale runs.

MolFormer scripts additionally require the IBM MolFormer pretrained checkpoint file. Download the checkpoint from https://github.com/IBM/molformer and pass its path with MOLFORMER_CKPT:

MOLFORMER_CKPT=/path/to/N-Step-Checkpoint_3_30000.ckpt bash scripts/pgp_broccatelli-molformer-al.sh

Running

From the repository root:

bash scripts/ames-rf_morgan-al.sh

Outputs are written under ./results-initial/<active_learning|passive_learning|smal>/.... Override the destination with the OUT_ROOT environment variable:

OUT_ROOT=/path/to/output bash scripts/ames-rf_morgan-smal.sh
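Environment overrides of this kind are commonly implemented with shell default-value expansion. The sketch below shows that idiom; it is an assumption about how the scripts resolve OUT_ROOT, not a copy of their code, and resolve_out_root is a hypothetical helper name.

```shell
# Assumption: the scripts resolve OUT_ROOT roughly like this.
# $1 is the default root, used when OUT_ROOT is unset or empty.
resolve_out_root() {
  echo "${OUT_ROOT:-$1}"
}

# Prints the value of OUT_ROOT if it is set, otherwise the default.
resolve_out_root ./results-initial
```

Note that a variable set on the command line, as in the example above, applies only to that one invocation.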

The DMPNN and MolFormer pgp_broccatelli active-learning scripts default to model-specific output roots, ./results-dmpnn/active_learning and ./results-molformer/active_learning respectively. Both also accept OUT_ROOT.

SIMPD scripts default to ./results-simpd/<active_learning|passive_learning|smal>. Each SIMPD dataset CSV contains a split column. The scripts preserve this split by creating temporary MolALKit input files with SMILES and Y columns, using the train rows for active learning and the test rows for validation. Override the dataset location with DATA_ROOT if needed:

DATA_ROOT=/path/to/simpd_csvs OUT_ROOT=/path/to/output bash scripts/simpd-rf_morgan-al.sh
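The split-preserving step can be illustrated with a short awk sketch. It assumes a plain comma-separated file with no quoted commas, input columns named smiles, label, and split, and split values train and test. Those names are assumptions for illustration, since only the output columns SMILES and Y are stated here, and the actual scripts may do this differently.

```shell
# split_simpd_csv IN_CSV TRAIN_OUT TEST_OUT
# Assumptions: no quoted commas; input columns "smiles", "label", "split";
# split values "train" and "test". Output files have SMILES and Y columns.
split_simpd_csv() {
  awk -F',' -v train_out="$2" -v test_out="$3" '
    NR == 1 {
      # Map header names to column positions, then write output headers.
      for (i = 1; i <= NF; i++) col[$i] = i
      print "SMILES,Y" > train_out
      print "SMILES,Y" > test_out
      next
    }
    $(col["split"]) == "train" { print $(col["smiles"]) "," $(col["label"]) > train_out }
    $(col["split"]) == "test"  { print $(col["smiles"]) "," $(col["label"]) > test_out }
  ' "$1"
}

# Example call with hypothetical paths:
# split_simpd_csv datasets/CHEMBL_EXAMPLE.csv /tmp/train.csv /tmp/test.csv
```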

The SIMPD SMAL script reads per-dataset f_min_train_size and max_iter values from scripts/simpd-smal-params.csv. Override with FMIN_TABLE only if you have regenerated those manuscript parameters.
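A per-dataset lookup of this kind might look like the sketch below. The key column name dataset and the helper name lookup_smal_param are assumptions; check the header of scripts/simpd-smal-params.csv for the real layout before relying on it.

```shell
# lookup_smal_param TABLE DATASET COLUMN
# Assumption: TABLE has a header row and a key column named "dataset".
# Prints the requested value, or exits non-zero if the dataset is absent.
lookup_smal_param() {
  awk -F',' -v ds="$2" -v want="$3" '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["dataset"]) == ds { print $(col[want]); found = 1; exit }
    END { if (!found) exit 1 }
  ' "$1"
}

# Example usage (CHEMBL_EXAMPLE is a hypothetical dataset name):
# FMIN_TABLE="${FMIN_TABLE:-scripts/simpd-smal-params.csv}"
# f_min=$(lookup_smal_param "$FMIN_TABLE" CHEMBL_EXAMPLE f_min_train_size)
# max_iter=$(lookup_smal_param "$FMIN_TABLE" CHEMBL_EXAMPLE max_iter)
```

Looking columns up by header name rather than by position keeps the sketch working if the column order in the table changes.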

Stratified-shuffle label-error scripts default to ./results-stratified_shuffle/<active_learning|smal>:

bash scripts/stratified_shuffle-rf_morgan-al.sh
bash scripts/stratified_shuffle-rf_morgan-smal.sh

The stratified-shuffle SMAL script reads f_min_train_size and max_iter from scripts/stratified_shuffle-smal-params.csv.

RF parameter-set scripts default to ./results-rf-parameter-sets. They use the custom RandomForest/Morgan configs in scripts/rf_parameter_configs/; the active-learning script writes manifest.csv, and the SMAL script writes smal_manifest.csv for scripts/build_processed_results.py.

bash scripts/rf_parameter_sets-rf_morgan-al.sh
bash scripts/rf_parameter_sets-rf_morgan-smal.sh

The RF parameter-set SMAL script reads f_min_train_size and max_iter from scripts/rf_parameter_sets-smal-params.csv.

The scripts are intentionally simple sequential loops: they do not submit SLURM jobs, parallelize, or skip already-completed runs. For HPC use, wrap each molalkit_run invocation in a job for your own scheduler.
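If you adapt the loops for a scheduler, a small wrapper can supply the skip-if-done behavior that the scripts deliberately omit. The sketch below uses a marker file per run. That convention and the name run_once are assumptions of this example and are not something the provided scripts create or check.

```shell
# run_once MARKER_FILE COMMAND [ARGS...]
# Runs COMMAND only if MARKER_FILE does not exist yet, and creates the
# marker only when COMMAND succeeds, so failed runs are retried next time.
run_once() {
  marker="$1"
  shift
  if [ -e "$marker" ]; then
    echo "skip: $marker exists"
    return 0
  fi
  "$@" && : > "$marker"
}

# Example usage (the marker path and the arguments are placeholders):
# run_once "$OUT_ROOT/seed_0.done" molalkit_run ...
```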

Currently included experiments

| Model | Datasets | Learning types |
|---|---|---|
| RandomForest/Morgan | ames, CYP2D6_Veith, CYP3A4_Veith, MDR1_MDCK_classification2, PAMPA_NCATS, pgp_broccatelli | al, pl, smal |
| RandomForest/Morgan | SIMPD ChEMBL datasets (`CHEMBL*.csv`) | al, pl, smal |
| RandomForest/Morgan | ames, CYP2D6_Veith, CYP3A4_Veith, MDR1_MDCK_classification2, PAMPA_NCATS, pgp_broccatelli with stratified-shuffle label errors | al, smal |
| RandomForest/Morgan RF parameter sweep | MDR1_MDCK_classification2, PAMPA_NCATS, pgp_broccatelli | al, smal |
| DMPNN | pgp_broccatelli | al |
| MolFormer | pgp_broccatelli | al |
