IDEA-XL/FormulAI-BCS

BCS Reproduction Workspace

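This repository implements the following pipeline end to end:

  • Data preparation and a fixed test split for the four base regression tasks
  • Single-model training
  • Model ensembling: blend / stacking
  • External BCS classification validation

Shared Python environment:

/cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python

Prefix every command with:

PYTHONPATH=.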

Tasks And Data

Four regression tasks are currently supported:

  • logs
  • logp
  • logd
  • logpapp

Corresponding data files:

  • data/logs.csv
  • data/logp.csv
  • data/logd.csv
  • data/logpapp.csv

External BCS validation set:

  • data/BCS_test.json

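Supplementary logS data merge script:

  • src/data/merge_logs_cui.py
  • By default, merges data/solubility/Cui.csv into data/logs.csv
  • The merge key is the canonical SMILES
  • Invalid SMILES are dropped, and the final data/logs.csv is guaranteed to contain no duplicate SMILES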
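
The merge rules above can be sketched as follows. This is a minimal illustration, not the contents of merge_logs_cui.py: the function names, the (smiles, value) record layout, and the "first record wins" conflict policy are assumptions. The default canonicalizer assumes RDKit is installed.

```python
from typing import Callable, Iterable, Optional


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Return the RDKit canonical SMILES, or None if the input is invalid."""
    from rdkit import Chem  # imported lazily so the merge logic has no hard dependency

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def merge_by_canonical_smiles(
    base: Iterable[tuple[str, float]],
    extra: Iterable[tuple[str, float]],
    canonicalize: Callable[[str], Optional[str]] = rdkit_canonical,
) -> dict[str, float]:
    """Merge (smiles, value) records keyed by canonical SMILES.

    Assumed policy: invalid SMILES are skipped, and on a duplicate key the
    record seen first (i.e. from `base`) wins.
    """
    merged: dict[str, float] = {}
    for smiles, value in list(base) + list(extra):
        key = canonicalize(smiles)
        if key is None:
            continue  # drop invalid SMILES
        merged.setdefault(key, value)  # keep the first record per molecule
    return merged
```

Keying on the canonical form means that two different spellings of the same molecule collapse to a single row, which is what keeps the merged file free of duplicate SMILES.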

Repo Structure

主要目录:

  • configs/models/: 模型级默认配置
  • configs/tasks/: 任务级配置
  • src/data/: 数据清洗、split、图构建
  • src/features/: 1D / UniMol 特征
  • src/train/: 单模型训练入口
  • src/ensemble/: 回归 ensemble
  • src/bcs/: external BCS 验证
  • outputs/: 所有中间结果与模型输出
  • plan/: 复现实验计划

Config Layout

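Configs use a two-level inheritance structure:

  • configs/models/<model>/base.toml
  • configs/tasks/<task>/<variant>.toml

Design principles:

  • Hyperparameters shared across tasks for a given model live in configs/models/
  • Task-specific data paths, target columns, and split rules live in configs/tasks/
  • A task config inherits via base_config = "../../models/<model>/base.toml"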

Current model base configs:

  • configs/models/attentivefp/base.toml
  • configs/models/xgboost/base.toml
  • configs/models/lightgbm/base.toml
  • configs/models/unimol/base.toml
  • configs/models/unimol_ft/base.toml

Current task configs:

  • configs/tasks/logs/attentivefp.toml
  • configs/tasks/logs/xgboost.toml
  • configs/tasks/logs/lightgbm.toml
  • configs/tasks/logs/unimol.toml
  • configs/tasks/logs/unimol_ft_full.toml
  • configs/tasks/logs/unimol_ft_head.toml
  • configs/tasks/logp/attentivefp.toml
  • configs/tasks/logp/xgboost.toml
  • configs/tasks/logp/lightgbm.toml
  • configs/tasks/logp/unimol.toml
  • configs/tasks/logp/unimol_ft_full.toml
  • configs/tasks/logp/unimol_ft_head.toml
  • configs/tasks/logd/attentivefp.toml
  • configs/tasks/logd/xgboost.toml
  • configs/tasks/logd/lightgbm.toml
  • configs/tasks/logd/unimol.toml
  • configs/tasks/logd/unimol_ft_full.toml
  • configs/tasks/logpapp/attentivefp.toml
  • configs/tasks/logpapp/xgboost.toml
  • configs/tasks/logpapp/lightgbm.toml
  • configs/tasks/logpapp/unimol.toml
  • configs/tasks/logpapp/unimol_ft_full.toml

Notes:

  • unimol.toml is the frozen encoder + MLP head; its model name is unimol_mlp
  • unimol_ft_full.toml is the full finetune; its model name is unimol_ft
  • unimol_ft_head.toml freezes the backbone and trains only the head

Data Preparation

Entry point:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.data --config <task_config.toml>

Example:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.data --config configs/tasks/logp/attentivefp.toml

Output files:

  • outputs/prepared/<task>/seed_42/normalized.csv
  • outputs/prepared/<task>/seed_42/split.csv
  • outputs/prepared/<task>/seed_42/test_set.csv
  • outputs/prepared/<task>/seed_42/summary.json

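Current split rules:

  • If the raw data already has a train/test split, the original test set is kept
  • The remaining train pool is then split 8:1 into a new train/valid
  • If the raw data has no test set, it is randomly split 8:1:1 into train/valid/test
  • Once prepared, each task's test set stays fixed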
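
The split rules above can be sketched as follows. This is an illustration only: assign_splits is a hypothetical helper, and the rounding behavior and the use of Python's random module with a fixed seed are assumptions rather than the repository's implementation.

```python
import random
from typing import Optional, Sequence


def assign_splits(
    n_rows: int,
    original_is_test: Optional[Sequence[bool]] = None,
    seed: int = 42,
) -> list[str]:
    """Return a 'train' / 'valid' / 'test' label for each row index."""
    rng = random.Random(seed)
    labels = [""] * n_rows

    if original_is_test is not None and any(original_is_test):
        # Keep the original test rows; only the remaining pool is re-split.
        pool = [i for i in range(n_rows) if not original_is_test[i]]
        test = [i for i in range(n_rows) if original_is_test[i]]
        rng.shuffle(pool)
        n_valid = round(len(pool) / 9)  # 8:1 train/valid
        valid, train = pool[:n_valid], pool[n_valid:]
    else:
        pool = list(range(n_rows))
        rng.shuffle(pool)
        n_valid = n_test = round(n_rows / 10)  # 8:1:1 train/valid/test
        valid = pool[:n_valid]
        test = pool[n_valid:n_valid + n_test]
        train = pool[n_valid + n_test:]

    for name, indices in (("train", train), ("valid", valid), ("test", test)):
        for i in indices:
            labels[i] = name
    return labels
```

Seeding the generator makes the assignment reproducible, which is one simple way to keep a test set fixed across reruns.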

Known special case:

  • logpapp uses the original Dataset column to fix the test set
  • Corresponding config:
    • original_split_col = "Dataset"
    • original_test_values = ["Te"]

Training

Entry point:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config <task_config.toml>

Validate the config only:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config <task_config.toml> --dry-run

Examples:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/attentivefp.toml
PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/xgboost.toml
PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/lightgbm.toml
PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/unimol.toml
PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/unimol_ft_full.toml

Training output directories:

  • outputs/<task>/attentivefp_pyg/<timestamp>/
  • outputs/<task>/xgboost/<timestamp>/
  • outputs/<task>/lightgbm/<timestamp>/
  • outputs/<task>/unimol_mlp/<timestamp>/
  • outputs/<task>/unimol_ft/<timestamp>/

Common output files:

  • metrics.json
  • resolved_config.json
  • history.csv
  • test_predictions.csv

Model files:

  • AttentiveFP: best_model.pt
  • XGBoost: best_model.json
  • LightGBM: best_model.txt
  • UniMol MLP: best_model.pt
  • UniMol FT: model_0.pth, config.yaml

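Notes:

  • The single-model main pipeline does not currently run 5-fold CV
  • Training progress is printed to the terminal in real time
  • All single models are trained on the fixed train / valid / test split
  • unimol_ft has been fixed so that test data is no longer mixed into training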

Implemented Models

1. AttentiveFP

Implementation:

  • src/models/attentivefp.py
  • src/train/attentivefp.py
  • src/data/featurization.py
  • src/data/legacy_featurizer.py

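Framework:

  • torch_geometric

Default hyperparameters come from:

  • configs/models/attentivefp/base.toml

Current graph inputs:

  • Node features come from legacy_featurizer.atom_features
  • Edge features come from legacy_featurizer.bond_features
  • The graph cache is stored at outputs/cache/<task>_graphs.pt

Current atom features: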

  • atom type
  • degree
  • formal charge
  • radical electrons
  • hybridization
  • aromaticity
  • total H count
  • chirality

Current bond features:

  • bond type
  • conjugation
  • ring membership
  • stereochemistry

Note:

  • Atom-level ring membership and explicit valence are not currently expanded into separate feature dimensions

2. XGBoost

Implementation:

  • src/train/tabular.py
  • src/features/tabular.py

Model name:

  • xgboost

Input features:

  • ECFP1024
  • radius = 3
  • RDKit descriptors

3. LightGBM

Implementation:

  • src/train/tabular.py
  • src/features/tabular.py

Model name:

  • lightgbm

Input features:

  • ECFP1024
  • radius = 3
  • RDKit descriptors

4. UniMol Frozen Repr + MLP Head

Implementation:

  • src/features/unimol.py
  • src/models/unimol_mlp.py
  • src/train/unimol.py

Model name:

  • unimol_mlp

Logic:

  • Molecular representations are extracted with UniMolRepr
  • Currently supported:
    • feature_mode = "cls"
    • feature_mode = "cls_atom_pool", i.e. CLS || atom_mean || atom_max
  • Features can be cached to outputs/cache/<task>_<feature_spec>.pkl
  • An MLP regressor is trained on the cached representations
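
To make the cls_atom_pool layout concrete, here is a small sketch of the concatenation CLS || atom_mean || atom_max. The function name and the plain-list inputs are illustrative assumptions; the real feature code presumably operates on arrays returned by UniMolRepr.

```python
from typing import Sequence


def cls_atom_pool(
    cls_vec: Sequence[float],
    atom_reprs: Sequence[Sequence[float]],
) -> list[float]:
    """Concatenate CLS, per-dimension atom mean, and per-dimension atom max."""
    if not atom_reprs:
        raise ValueError("need at least one atom representation")
    columns = list(zip(*atom_reprs))  # one tuple per embedding dimension
    atom_mean = [sum(col) / len(col) for col in columns]
    atom_max = [max(col) for col in columns]
    return list(cls_vec) + atom_mean + atom_max
```

For an embedding of width d, the resulting feature vector has width 3d, compared with d for feature_mode = "cls".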

5. UniMol Full Finetune

Implementation:

  • src/train/unimol_ft.py

Model name:

  • unimol_ft

Logic:

  • Finetunes on the fixed train / valid / test split
  • unimol_ft_full.toml is used for full finetuning
  • unimol_ft_head.toml trains only the head, via freeze_layers

Current default base config:

  • model_size = "164m"
  • batch_size = 8
  • learning_rate = 5e-5

1D Feature Set

1D baseline feature implementation:

  • src/features/tabular.py

Current combined features:

  • Morgan fingerprint / ECFP
    • fpSize = 1024
    • radius = 3
  • RDKit descriptors

The current descriptor set includes:

  • mw
  • tpsa
  • hbd
  • hba
  • rotatable_bonds
  • aromatic_ring_count
  • heavy_atom_count
  • fraction_csp3
  • formal_charge
  • mol_logp
  • mol_mr
  • balaban_j
  • bertz_ct
  • hall_kier_alpha
  • kappa1
  • kappa2
  • kappa3
  • chi0
  • chi1
  • chi0n
  • chi1n
  • chi2n
  • chi3n
  • chi4n
  • chi0v
  • chi1v
  • chi2v
  • chi3v
  • chi4v
  • labute_asa
  • ring_count
  • aliphatic_ring_count
  • saturated_ring_count
  • acidic_site_count_proxy
  • basic_site_count_proxy

Ensemble

Entry point:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble ...

Two modes are currently supported:

  • blend
  • stacking

Supported base model names:

  • attentivefp_pyg
  • xgboost
  • lightgbm
  • unimol_ft

Notes:

  • blend supports all four model types above
  • stacking currently supports only:
    • attentivefp_pyg
    • xgboost
    • lightgbm

Blend

Example, automatically picking the latest run for a task:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble --task logp --mode blend

Blend only the specified models:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble \
  --task logp \
  --mode blend \
  --base-models xgboost lightgbm

Specify exact runs:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble \
  --task logp \
  --mode blend \
  --base-models xgboost lightgbm \
  --run-dirs \
    outputs/logp/xgboost/20260508_225912 \
    outputs/logp/lightgbm/20260508_230740

Run for all tasks:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble --all-tasks --mode blend

blend currently outputs two sets of results:

  • uniform: equal-weight average
  • tuned: non-negative weights learned on the valid set
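
The two blend variants can be sketched as follows. The source only states that tuned learns non-negative weights on the valid set; the least-squares objective, the coordinate-descent solver, and the fact that the weights are not forced to sum to 1 are all assumptions made for illustration, and the function names are hypothetical.

```python
from typing import Sequence


def uniform_blend(preds: Sequence[Sequence[float]]) -> list[float]:
    """Equal-weight average of per-model prediction vectors."""
    return [sum(column) / len(preds) for column in zip(*preds)]


def fit_nonnegative_weights(
    valid_preds: Sequence[Sequence[float]],
    valid_targets: Sequence[float],
    n_sweeps: int = 500,
) -> list[float]:
    """Least-squares blend weights constrained to be >= 0 (coordinate descent)."""
    n_models = len(valid_preds)
    weights = [1.0 / n_models] * n_models
    for _ in range(n_sweeps):
        for j, pred_j in enumerate(valid_preds):
            norm = sum(p * p for p in pred_j)
            if norm == 0.0:
                weights[j] = 0.0
                continue
            # Residual of the blend with model j left out.
            residual = [
                y - sum(weights[k] * valid_preds[k][i] for k in range(n_models) if k != j)
                for i, y in enumerate(valid_targets)
            ]
            best = sum(p * r for p, r in zip(pred_j, residual)) / norm
            weights[j] = max(0.0, best)  # clip to keep the weight non-negative
    return weights


def apply_weights(
    preds: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> list[float]:
    """Weighted sum of per-model prediction vectors."""
    return [sum(w * p for w, p in zip(weights, column)) for column in zip(*preds)]
```

Weights are fitted on valid predictions and then applied unchanged to test predictions, so the test set never influences the blend.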

Stacking

Example:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble \
  --task logp \
  --mode stacking \
  --num-folds 3

Use only a subset of base models:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble \
  --task logp \
  --mode stacking \
  --num-folds 3 \
  --base-models attentivefp_pyg xgboost lightgbm

Switch the meta model:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble \
  --task logp \
  --mode stacking \
  --num-folds 3 \
  --meta-model ridge \
  --ridge-alpha 1.0

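Logic:

  • The fixed train + valid sets form the meta-train pool
  • An outer K-fold is run over that pool
  • The base models are retrained within each fold
  • OOF predictions are generated on each outer-valid fold
  • The fixed test set is used only for final evaluation and never for fitting the meta model
  • Currently supported meta models: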
    • ridge
    • linear
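
The out-of-fold (OOF) step described above can be sketched as follows. This is a toy illustration under stated assumptions: base models are represented by hypothetical "fitter" callables, folds are assigned by index modulo K rather than by whatever scheme the repository uses, and fit_mean is a stand-in base model.

```python
from typing import Callable, Sequence

Predictor = Callable[[Sequence[float]], list[float]]
Fitter = Callable[[Sequence[float], Sequence[float]], Predictor]


def out_of_fold_predictions(
    features: Sequence[float],
    targets: Sequence[float],
    fitters: dict[str, Fitter],
    num_folds: int = 3,
) -> dict[str, list[float]]:
    """Build one OOF prediction column per base model over the train + valid pool."""
    n = len(features)
    fold_of = [i % num_folds for i in range(n)]  # deterministic fold assignment
    oof = {name: [0.0] * n for name in fitters}
    for fold in range(num_folds):
        train_idx = [i for i in range(n) if fold_of[i] != fold]
        held_idx = [i for i in range(n) if fold_of[i] == fold]
        for name, fit in fitters.items():
            # Refit the base model without the held-out fold, then predict on it.
            predict = fit([features[i] for i in train_idx], [targets[i] for i in train_idx])
            for i, value in zip(held_idx, predict([features[i] for i in held_idx])):
                oof[name][i] = value
    return oof


def fit_mean(_x: Sequence[float], y: Sequence[float]) -> Predictor:
    """Toy base model: always predicts the training mean."""
    mean = sum(y) / len(y)
    return lambda xs: [mean] * len(xs)
```

Every pool sample receives a prediction from a model that never saw it, so the meta model (ridge or linear) can be fitted on the OOF columns without leaking training targets.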

stacking outputs typically include:

  • metrics.json
  • oof_predictions.csv
  • test_fold_predictions.csv
  • test_predictions.csv
  • meta_model.pkl
  • folds/fold_<k>/<model>/...

External BCS Validation

Entry point:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs ...

Currently supported modes:

  • single
  • single-family
  • uniform
  • blend
  • stacking

Current Classification Logic

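external BCS currently works as follows:

  • The solubility dimension uses the predicted logS
  • logS is first converted to mg/mL via C_s = 10^{logS} * MW (with logS in log10 mol/L and MW in g/mol, the result is in g/L, which equals mg/mL)
  • Default threshold: C_s >= 0.1 mg/mL counts as high solubility
  • Permeability thresholds: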
    • logP > 1.72
    • logD > -0.1954
    • logPapp > -5.097

Four-class rule:

  • high solubility + high permeability -> class 1
  • low solubility + high permeability -> class 2
  • high solubility + low permeability -> class 3
  • low solubility + low permeability -> class 4
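
The conversion and the four-class rule can be written out as a short sketch. The thresholds and class mapping are taken from the text above; the function names, the dictionary layout, and the reading that a value above the permeability threshold means high permeability are assumptions, and this is not the src.bcs implementation.

```python
PERMEABILITY_THRESHOLDS = {"logp": 1.72, "logd": -0.1954, "logpapp": -5.097}


def solubility_mg_per_ml(log_s: float, mol_weight: float) -> float:
    """C_s = 10^logS * MW; mol/L times g/mol gives g/L, which equals mg/mL."""
    return (10.0 ** log_s) * mol_weight


def bcs_class(
    log_s: float,
    mol_weight: float,
    permeability_value: float,
    permeability_task: str = "logp",
    solubility_threshold_mg_ml: float = 0.1,
) -> int:
    """Map predicted logS and one permeability surrogate to BCS class 1-4."""
    high_sol = solubility_mg_per_ml(log_s, mol_weight) >= solubility_threshold_mg_ml
    high_perm = permeability_value > PERMEABILITY_THRESHOLDS[permeability_task]
    if high_perm:
        return 1 if high_sol else 2
    return 3 if high_sol else 4
```

As a worked example, logS = -2 with MW = 300 g/mol gives 10^-2 * 300 = 3 mg/mL, which is above 0.1 mg/mL and therefore high solubility; combined with logP = 2.0 > 1.72 the compound falls in class 1.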

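Label handling rules:

  • Binary (per-dimension) evaluation keeps ambiguous labels as their set of compatible classes
  • Four-class evaluation keeps only samples with a single definite label
  • Ambiguous labels such as 1,3 are excluded from four-class accuracy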
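
One possible reading of these label rules is sketched below. The interpretation of "compatible set" (a binary prediction counts as correct if any class in the label agrees with it) is an assumption, as are all function names; the actual evaluation code may differ.

```python
def parse_label(raw: str) -> set[int]:
    """'1,3' -> {1, 3}; a single definite label yields a one-element set."""
    return {int(part) for part in raw.split(",") if part.strip()}


def class_is_high_solubility(bcs: int) -> bool:
    """Classes 1 and 3 are the high-solubility classes."""
    return bcs in (1, 3)


def solubility_is_compatible(label: set[int], predicted_high: bool) -> bool:
    """Binary view: correct if any class in the label agrees with the prediction."""
    return any(class_is_high_solubility(c) == predicted_high for c in label)


def four_class_accuracy(labels: list[set[int]], predictions: list[int]) -> float:
    """Accuracy over samples whose label is a single definite class."""
    scored = [
        (next(iter(label)), pred)
        for label, pred in zip(labels, predictions)
        if len(label) == 1
    ]
    if not scored:
        raise ValueError("no samples with a single definite label")
    return sum(truth == pred for truth, pred in scored) / len(scored)
```

Under this reading, a label of 1,3 is unambiguous on the solubility axis (both classes are high solubility) but ambiguous on the permeability axis, which is why it can still contribute to binary evaluation while being skipped for four-class accuracy.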

Single And Single-Family

Single combination:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs \
  --mode single \
  --permeability-task logp \
  --logs-model-name xgboost \
  --permeability-model-name lightgbm

Single model family:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs --mode single-family --model-name attentivefp_pyg
PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs --mode single-family --model-name xgboost
PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs --mode single-family --model-name lightgbm
PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs --mode single-family --model-name unimol_ft

Uniform / Blend / Stacking

uniform, automatically using the specified base models:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs \
  --mode uniform \
  --base-models xgboost lightgbm

blend, specifying the primary permeability task:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs \
  --mode blend \
  --permeability-task logd \
  --base-models attentivefp_pyg xgboost lightgbm

stacking, which requires a stacking run already trained with the same combination of base models:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs \
  --mode stacking \
  --base-models attentivefp_pyg xgboost lightgbm

Notes:

  • src.bcs --base-models is aligned with src.ensemble
  • uniform / blend / stacking all support specifying a subset of participating models
  • stacking automatically looks up the stacking run trained with the same set of base_models

Overriding Solubility Threshold

Override the default 0.1 mg/mL threshold:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs \
  --mode single-family \
  --model-name attentivefp_pyg \
  --solubility-threshold-mg-ml 0.05

Or give a logS threshold directly:

PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs \
  --mode single-family \
  --model-name attentivefp_pyg \
  --solubility-threshold-logs -2.6

The two options cannot be specified together.
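
A sketch of how two mutually exclusive threshold flags could be enforced with argparse is shown below. Only the flag names come from the commands above; the parser setup, the helper names, and the assumption that the logS threshold is applied as logS >= threshold are illustrative and not taken from src.bcs.

```python
import argparse
from typing import Optional, Sequence


def parse_threshold_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the two solubility threshold flags, rejecting them if combined."""
    parser = argparse.ArgumentParser(prog="bcs-threshold-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--solubility-threshold-mg-ml", type=float, default=None)
    group.add_argument("--solubility-threshold-logs", type=float, default=None)
    return parser.parse_args(argv)


def passes_solubility_threshold(
    log_s: float,
    mol_weight: float,
    args: argparse.Namespace,
) -> bool:
    """Apply whichever threshold was given; fall back to 0.1 mg/mL."""
    if args.solubility_threshold_logs is not None:
        return log_s >= args.solubility_threshold_logs
    if args.solubility_threshold_mg_ml is not None:
        threshold = args.solubility_threshold_mg_ml
    else:
        threshold = 0.1
    return (10.0 ** log_s) * mol_weight >= threshold
```

The two thresholds are not interchangeable in general: a mg/mL cutoff depends on each compound's molecular weight, whereas a logS cutoff does not, which is a good reason to accept only one of them per run.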

Output Layout

Current output directory conventions:

  • Data preparation: outputs/prepared/<task>/seed_<seed>/
  • Single-model training: outputs/<task>/<model_name>/<timestamp>/
  • blend: outputs/<task>/ensemble/<timestamp>/
  • stacking: outputs/<task>/stacking/<timestamp>/
  • external BCS: outputs/external_bcs/<mode>/...
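
Given the timestamped run directories above, picking the latest run for a task and model could look like the sketch below. It assumes timestamps follow the YYYYMMDD_HHMMSS pattern seen in the example run paths and that "latest" is decided by directory name; latest_run_dir is a hypothetical helper, not the repository's lookup code.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(outputs_root: Path, task: str, model_name: str) -> Optional[Path]:
    """Return the newest outputs/<task>/<model_name>/<timestamp>/ directory, if any."""
    model_dir = outputs_root / task / model_name
    if not model_dir.is_dir():
        return None
    runs = [child for child in model_dir.iterdir() if child.is_dir()]
    # YYYYMMDD_HHMMSS names sort chronologically when compared as strings.
    return max(runs, key=lambda run: run.name, default=None)
```

Sorting by name rather than by filesystem modification time keeps the result stable even if an older run directory is touched or copied later.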

Common additional caches:

  • Graph cache: outputs/cache/<task>_graphs.pt
  • 1D feature cache: outputs/cache/<task>_<feature_spec>.pkl
  • UniMol feature cache: outputs/cache/<task>_<feature_spec>.pkl

Typical Workflow

Using logp as an example:

  1. Prepare the data
PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.data --config configs/tasks/logp/attentivefp.toml
  2. Train the single models
PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/attentivefp.toml
PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/xgboost.toml
PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/lightgbm.toml
  3. Run blend
PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble --task logp --mode blend
  4. Run stacking
PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble --task logp --mode stacking --num-folds 3
  5. Run external BCS
PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs --mode single-family --model-name xgboost

Notes

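  • The default device is set in each configs/models/*/base.toml
  • The base configs currently default to cuda:0
  • To switch GPUs, edit [run].device in the corresponding config
  • For unimol_ft, learning rate and batch size strongly affect GPU memory use and training stability; retune them for your GPU
  • blend currently outputs two sets of results: uniform and tuned
  • external BCS currently uses C_s >= 0.1 mg/mL as the default high-solubility threshold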
