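The repository currently implements the following pipeline end to end:
- Data preparation and a fixed test split for the four base regression tasks
- Single-model training
- Model ensembling: blend / stacking
- External BCS classification validation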
Shared Python environment:
    /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python
Prefix every command with:
    PYTHONPATH=.
Four regression tasks are currently supported:
- logs
- logp
- logd
- logpapp
Corresponding data files:
- data/logs.csv
- data/logp.csv
- data/logd.csv
- data/logpapp.csv
External BCS validation set:
- data/BCS_test.json
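Supplementary logS data merge script:
- src/data/merge_logs_cui.py
- By default, merges data/solubility/Cui.csv into data/logs.csv
- Uses the canonical SMILES as the merge key
- Drops invalid SMILES and guarantees that the final data/logs.csv contains no duplicate SMILES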
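The merge behaviour described above can be sketched as follows. This is a minimal illustration, not the actual contents of src/data/merge_logs_cui.py: the function name merge_by_canonical_smiles, the smiles column name, the toy canonicalizer, and the rule that rows already in data/logs.csv win over duplicates from Cui.csv are all assumptions. In the real script the canonicalizer would presumably be an RDKit round trip (Chem.MolFromSmiles followed by Chem.MolToSmiles), where an invalid SMILES yields no molecule.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, str]


def merge_by_canonical_smiles(
    base_rows: Iterable[Row],
    extra_rows: Iterable[Row],
    canonicalize: Callable[[str], Optional[str]],
    smiles_col: str = "smiles",  # hypothetical column name
) -> List[Row]:
    """Merge two row sets keyed on canonical SMILES; the first occurrence wins."""
    merged: Dict[str, Row] = {}
    for row in list(base_rows) + list(extra_rows):
        key = canonicalize(row[smiles_col])
        if key is None:        # invalid SMILES: drop the row
            continue
        if key not in merged:  # duplicate: keep the earlier row (assumed policy)
            merged[key] = {**row, smiles_col: key}
    return list(merged.values())


def toy_canonicalize(smiles: str) -> Optional[str]:
    """Stand-in for a real canonicalizer: trims whitespace, rejects empty input."""
    cleaned = smiles.strip()
    return cleaned or None


# Toy rows with made-up values, for illustration only.
base = [{"smiles": "CCO", "y": "base"}]
extra = [
    {"smiles": " CCO ", "y": "duplicate"},
    {"smiles": "", "y": "invalid"},
    {"smiles": "c1ccccc1", "y": "new"},
]
merged = merge_by_canonical_smiles(base, extra, toy_canonicalize)
```

Keying the dictionary on the canonical form is what guarantees that the merged result has no duplicate SMILES, whatever the input spelling.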
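Main directories:
- configs/models/: model-level default configs
- configs/tasks/: task-level configs
- src/data/: data cleaning, splitting, graph construction
- src/features/: 1D / UniMol features
- src/train/: single-model training entry points
- src/ensemble/: regression ensembling
- src/bcs/: external BCS validation
- outputs/: all intermediate results and model outputs
- plan/: experiment reproduction plans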
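Configs use a two-level inheritance structure:
- configs/models/<model>/base.toml
- configs/tasks/<task>/<variant>.toml
Design principles:
- Hyperparameters shared by a model across tasks live in configs/models/
- Task-specific data paths, target columns, and split rules live in configs/tasks/
- A task config inherits from its model config via base_config = "../../models/<model>/base.toml"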
Current model base configs:
- configs/models/attentivefp/base.toml
- configs/models/xgboost/base.toml
- configs/models/lightgbm/base.toml
- configs/models/unimol/base.toml
- configs/models/unimol_ft/base.toml
Current task configs:
- configs/tasks/logs/attentivefp.toml
- configs/tasks/logs/xgboost.toml
- configs/tasks/logs/lightgbm.toml
- configs/tasks/logs/unimol.toml
- configs/tasks/logs/unimol_ft_full.toml
- configs/tasks/logs/unimol_ft_head.toml
- configs/tasks/logp/attentivefp.toml
- configs/tasks/logp/xgboost.toml
- configs/tasks/logp/lightgbm.toml
- configs/tasks/logp/unimol.toml
- configs/tasks/logp/unimol_ft_full.toml
- configs/tasks/logp/unimol_ft_head.toml
- configs/tasks/logd/attentivefp.toml
- configs/tasks/logd/xgboost.toml
- configs/tasks/logd/lightgbm.toml
- configs/tasks/logd/unimol.toml
- configs/tasks/logd/unimol_ft_full.toml
- configs/tasks/logpapp/attentivefp.toml
- configs/tasks/logpapp/xgboost.toml
- configs/tasks/logpapp/lightgbm.toml
- configs/tasks/logpapp/unimol.toml
- configs/tasks/logpapp/unimol_ft_full.toml
Notes:
- unimol.toml: frozen encoder + MLP head; the model name is unimol_mlp
- unimol_ft_full.toml: full finetune; the model name is unimol_ft
- unimol_ft_head.toml: frozen backbone, only the head is trained
Data preparation entry point:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.data --config <task_config.toml>
Example:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.data --config configs/tasks/logp/attentivefp.toml
Output files:
- outputs/prepared/<task>/seed_42/normalized.csv
- outputs/prepared/<task>/seed_42/split.csv
- outputs/prepared/<task>/seed_42/test_set.csv
- outputs/prepared/<task>/seed_42/summary.json
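Current split rules:
- If the raw data already has a train/test partition, the original test set is kept
- The remaining train pool is then split 8:1 into new train/valid sets
- If the raw data has no test set, it is split randomly 8:1:1 into train/valid/test
- Once prepared, each task's test set stays fixed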
Known special case:
- logpapp fixes its test set using the original Dataset column
- Corresponding config:
    original_split_col = "Dataset"
    original_test_values = ["Te"]
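The split rules above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code in src/data/: the function name assign_splits, its signature, the rounding of fold sizes, and the "Tr" value in the usage example are hypothetical, while the seed 42 and the "Te" test value follow the paths and config shown above.

```python
import random
from typing import List, Optional, Sequence


def assign_splits(
    n_rows: int,
    original_split: Optional[Sequence[str]] = None,
    test_values: Sequence[str] = ("Te",),
    seed: int = 42,
) -> List[str]:
    """Return one of 'train' / 'valid' / 'test' for every row."""
    rng = random.Random(seed)
    labels = ["train"] * n_rows
    if original_split is not None:
        # Keep the original test rows; only the remaining pool is re-split.
        pool = []
        for i, value in enumerate(original_split):
            if value in test_values:
                labels[i] = "test"
            else:
                pool.append(i)
        rng.shuffle(pool)
        n_valid = round(len(pool) / 9)  # 8:1 train/valid
        for i in pool[:n_valid]:
            labels[i] = "valid"
    else:
        order = list(range(n_rows))
        rng.shuffle(order)
        n_holdout = round(n_rows / 10)  # 8:1:1 train/valid/test
        for i in order[:n_holdout]:
            labels[i] = "test"
        for i in order[n_holdout:2 * n_holdout]:
            labels[i] = "valid"
    return labels


random_labels = assign_splits(100)
kept_labels = assign_splits(100, original_split=["Tr"] * 90 + ["Te"] * 10)
```

Seeding a local random.Random instance keeps the split reproducible, which is one simple way to make sure the test set stays fixed across runs.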
Single-model training entry point:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config <task_config.toml>
Validate the config only:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config <task_config.toml> --dry-run
Examples:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/attentivefp.toml
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/xgboost.toml
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/lightgbm.toml
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/unimol.toml
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/unimol_ft_full.toml
Training output directories:
- outputs/<task>/attentivefp_pyg/<timestamp>/
- outputs/<task>/xgboost/<timestamp>/
- outputs/<task>/lightgbm/<timestamp>/
- outputs/<task>/unimol_mlp/<timestamp>/
- outputs/<task>/unimol_ft/<timestamp>/
Common output files:
- metrics.json
- resolved_config.json
- history.csv
- test_predictions.csv
Model files:
- AttentiveFP: best_model.pt
- XGBoost: best_model.json
- LightGBM: best_model.txt
- UniMol MLP: best_model.pt
- UniMol FT: model_0.pth and config.yaml
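Notes:
- The main single-model pipeline does not run 5-fold CV
- Training progress is printed to the terminal in real time
- Every single model is trained on the fixed train / valid / test split
- unimol_ft has been fixed so that test data is no longer mixed into training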
AttentiveFP implementation:
- src/models/attentivefp.py
- src/train/attentivefp.py
- src/data/featurization.py
- src/data/legacy_featurizer.py
Framework:
- torch_geometric
Default hyperparameters come from:
- configs/models/attentivefp/base.toml
Current graph input:
- Node features come from legacy_featurizer.atom_features
- Edge features come from legacy_featurizer.bond_features
- The graph cache is stored at outputs/cache/<task>_graphs.pt
Atom features currently include:
- atom type
- degree
- formal charge
- radical electrons
- hybridization
- aromaticity
- total H count
- chirality
Bond features currently include:
- bond type
- conjugation
- ring membership
- stereochemistry
Note:
- Atom-level ring membership and explicit valence are not yet expanded into separate feature dimensions
XGBoost implementation:
- src/train/tabular.py
- src/features/tabular.py
Model name:
- xgboost
Input features:
- ECFP1024
- radius = 3
- RDKit descriptors
LightGBM implementation:
- src/train/tabular.py
- src/features/tabular.py
Model name:
- lightgbm
Input features:
- ECFP1024
- radius = 3
- RDKit descriptors
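UniMol MLP implementation:
- src/features/unimol.py
- src/models/unimol_mlp.py
- src/train/unimol.py
Model name:
- unimol_mlp
Logic:
- Molecular representations are extracted with UniMolRepr
- Supported feature modes:
  - feature_mode = "cls"
  - feature_mode = "cls_atom_pool", i.e. CLS || atom_mean || atom_max
- Features can be cached to outputs/cache/<task>_<feature_spec>.pkl
- An MLP regressor is trained on the cached representations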
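The cls_atom_pool layout, CLS || atom_mean || atom_max, can be illustrated with plain lists. The function below is a hypothetical sketch: the real feature code in src/features/unimol.py most likely works on arrays returned by UniMolRepr, and the name cls_atom_pool and its signature are assumptions made here for illustration.

```python
from typing import List, Sequence


def cls_atom_pool(cls: Sequence[float], atoms: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate CLS || atom_mean || atom_max for a single molecule."""
    if not atoms:
        raise ValueError("at least one atom embedding is required")
    columns = list(zip(*atoms))  # one tuple per embedding dimension
    atom_mean = [sum(col) / len(col) for col in columns]
    atom_max = [max(col) for col in columns]
    return list(cls) + atom_mean + atom_max


# Two atoms with 2-dimensional embeddings (toy numbers).
features = cls_atom_pool([0.5, -1.0], [[1.0, 2.0], [3.0, 0.0]])
# features == [0.5, -1.0, 2.0, 1.0, 3.0, 2.0]
```

The pooled vector has a fixed length regardless of the atom count, which is what lets a plain MLP consume it.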
UniMol finetune implementation:
- src/train/unimol_ft.py
Model name:
- unimol_ft
Logic:
- Finetuning runs on the fixed train / valid / test split
- unimol_ft_full.toml is used for full finetuning
- unimol_ft_head.toml trains only the head, via freeze_layers
Current default base config:
- model_size = "164m"
- batch_size = 8
- learning_rate = 5e-5
1D baseline feature implementation:
- src/features/tabular.py
Current combined features:
- Morgan fingerprint / ECFP
  - fpSize = 1024
  - radius = 3
- RDKit descriptors
The current descriptor set includes:
- mw
- tpsa
- hbd
- hba
- rotatable_bonds
- aromatic_ring_count
- heavy_atom_count
- fraction_csp3
- formal_charge
- mol_logp
- mol_mr
- balaban_j
- bertz_ct
- hall_kier_alpha
- kappa1
- kappa2
- kappa3
- chi0
- chi1
- chi0n
- chi1n
- chi2n
- chi3n
- chi4n
- chi0v
- chi1v
- chi2v
- chi3v
- chi4v
- labute_asa
- ring_count
- aliphatic_ring_count
- saturated_ring_count
- acidic_site_count_proxy
- basic_site_count_proxy
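Ensemble entry point:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble ...
Two modes are currently supported:
- blend
- stacking
Supported base model names:
- attentivefp_pyg
- xgboost
- lightgbm
- unimol_ft
Notes:
- blend supports all four model types above
- stacking currently supports only attentivefp_pyg, xgboost, and lightgbm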
Blend example, automatically picking the latest runs for a task:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble --task logp --mode blend
Blend only the specified models:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble \
      --task logp \
      --mode blend \
      --base-models xgboost lightgbm
Specify concrete runs:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble \
      --task logp \
      --mode blend \
      --base-models xgboost lightgbm \
      --run-dirs \
      outputs/logp/xgboost/20260508_225912 \
      outputs/logp/lightgbm/20260508_230740
Run on all tasks:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble --all-tasks --mode blend
Blend currently outputs two sets of results:
- uniform: equal-weight average
- tuned: non-negative weights learned on the valid set
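The difference between the uniform and tuned results can be made concrete with a small sketch. The optimizer actually used by src.ensemble is not documented here, so the projected gradient descent below is only one simple way to obtain non-negative weights; the function names, step count, and learning rate are assumptions, and the sketch does not force the weights to sum to 1.

```python
from typing import List, Sequence


def blend(preds: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted sum of predictions; preds[m][i] is model m on sample i."""
    n_samples = len(preds[0])
    return [sum(w * p[i] for w, p in zip(weights, preds)) for i in range(n_samples)]


def uniform_weights(n_models: int) -> List[float]:
    """Equal weight for every base model."""
    return [1.0 / n_models] * n_models


def tune_weights(
    valid_preds: Sequence[Sequence[float]],
    valid_targets: Sequence[float],
    steps: int = 2000,
    lr: float = 0.01,
) -> List[float]:
    """Fit non-negative weights on the valid set by projected gradient descent on MSE."""
    n = len(valid_targets)
    weights = uniform_weights(len(valid_preds))
    for _ in range(steps):
        blended = blend(valid_preds, weights)
        residual = [b - y for b, y in zip(blended, valid_targets)]
        grads = [2.0 * sum(r * p[i] for i, r in enumerate(residual)) / n for p in valid_preds]
        # Projection step: clip at zero so that no weight becomes negative.
        weights = [max(0.0, w - lr * g) for w, g in zip(weights, grads)]
    return weights


# Toy valid-set data: model A is exact, model B is anti-correlated with the target.
targets = [1.0, 2.0, 3.0]
model_a = [1.0, 2.0, 3.0]
model_b = [3.0, 2.0, 1.0]
uniform_blend = blend([model_a, model_b], uniform_weights(2))
tuned = tune_weights([model_a, model_b], targets)
```

On this toy data the tuned weights move almost entirely onto the exact model, whereas the uniform blend averages both models regardless of their quality.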
Stacking example:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble \
      --task logp \
      --mode stacking \
      --num-folds 3
Use only some of the base models:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble \
      --task logp \
      --mode stacking \
      --num-folds 3 \
      --base-models attentivefp_pyg xgboost lightgbm
Switch the meta model:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble \
      --task logp \
      --mode stacking \
      --num-folds 3 \
      --meta-model ridge \
      --ridge-alpha 1.0
Logic:
- The fixed train + valid sets together form the meta-train pool
- An outer K-fold split is applied to that pool
- The base models are retrained inside each fold
- The outer-valid part of each fold produces the OOF predictions
- The fixed test set is used only for the final evaluation and never for fitting the meta model
- Supported meta models:
  - ridge
  - linear
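The out-of-fold (OOF) step of the stacking logic can be sketched with toy base models. Everything here is illustrative: oof_predictions, fit_mean, and fit_line are hypothetical names, the folds are interleaved rather than shuffled, and the two one-dimensional regressors merely stand in for the real base models. A meta model such as ridge would then be fitted on the rows of the returned matrix.

```python
from typing import Callable, List, Sequence

Model = Callable[[float], float]
FitFn = Callable[[Sequence[float], Sequence[float]], Model]


def fit_mean(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Toy base model: always predicts the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Toy base model: ordinary least squares on a single feature."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def oof_predictions(
    xs: Sequence[float],
    ys: Sequence[float],
    fit_fns: Sequence[FitFn],
    num_folds: int = 3,
) -> List[List[float]]:
    """Return oof[i][m]: model m's prediction for sample i, fitted without sample i."""
    n = len(xs)
    oof = [[0.0] * len(fit_fns) for _ in range(n)]
    for fold in range(num_folds):
        held_out = set(range(fold, n, num_folds))  # interleaved outer-valid fold
        train_idx = [i for i in range(n) if i not in held_out]
        for m, fit in enumerate(fit_fns):
            # Each base model is refit from scratch inside every fold.
            model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
            for i in held_out:
                oof[i][m] = model(xs[i])
    return oof


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]
oof = oof_predictions(xs, ys, [fit_mean, fit_line], num_folds=3)
```

Because every OOF value comes from a model that never saw that sample, the meta model is trained on predictions that resemble what the base models produce on unseen data.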
Stacking outputs typically include:
- metrics.json
- oof_predictions.csv
- test_fold_predictions.csv
- test_predictions.csv
- meta_model.pkl
- folds/fold_<k>/<model>/...
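External BCS entry point:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs ...
Supported modes:
- single
- single-family
- uniform
- blend
- stacking
External BCS validation currently works as follows:
- The solubility dimension uses the predicted logS
- logS is first converted to mg/mL via C_s = 10^{logS} * MW
- Default threshold: C_s >= 0.1 mg/mL counts as high solubility
- Permeability thresholds:
  - logP > 1.72
  - logD > -0.1954
  - logPapp > -5.097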
Four-class rules:
- high solubility + high permeability -> class 1
- low solubility + high permeability -> class 2
- high solubility + low permeability -> class 3
- low solubility + low permeability -> class 4
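The thresholds and the four-class rules combine into a short decision function. The sketch below assumes that logS is log10 of the molar solubility (mol/L) and that MW is in g/mol, so that 10^logS * MW is in g/L, which equals mg/mL. The function and constant names are hypothetical and are not taken from src/bcs/.

```python
# Permeability cut-offs as documented above, keyed by task name.
PERMEABILITY_THRESHOLDS = {"logp": 1.72, "logd": -0.1954, "logpapp": -5.097}


def solubility_mg_ml(log_s: float, mol_weight: float) -> float:
    """C_s = 10^logS * MW; mol/L times g/mol gives g/L, i.e. mg/mL."""
    return 10.0 ** log_s * mol_weight


def bcs_class(
    log_s: float,
    mol_weight: float,
    permeability: float,
    task: str = "logp",
    threshold_mg_ml: float = 0.1,
) -> int:
    """Map predicted solubility and permeability to BCS class 1 to 4."""
    high_sol = solubility_mg_ml(log_s, mol_weight) >= threshold_mg_ml
    high_perm = permeability > PERMEABILITY_THRESHOLDS[task]
    if high_perm:
        return 1 if high_sol else 2
    return 3 if high_sol else 4


# logS = -2 with MW = 300 gives about 3 mg/mL (high solubility);
# logS = -5 with MW = 300 gives about 0.003 mg/mL (low solubility).
example_classes = [
    bcs_class(-2.0, 300.0, 2.0),
    bcs_class(-5.0, 300.0, 2.0),
    bcs_class(-2.0, 300.0, 1.0),
    bcs_class(-5.0, 300.0, 1.0),
]
```

Note that the solubility comparison uses >= while the permeability comparisons use a strict >, matching the rules as written above.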
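Label handling rules:
- Evaluation along each binary dimension keeps ambiguous labels through their set of compatible classes
- Four-class evaluation keeps only samples with a single definite label
- Ambiguous labels such as 1,3 are excluded from four-class accuracy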
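One plausible reading of these label rules is sketched below; the actual behaviour of src/bcs/ may differ. It assumes that labels are comma-separated class numbers, that classes 1 and 3 are the high-solubility classes and classes 1 and 2 the high-permeability classes (as in the four-class rules), and that an ambiguous label still counts on a binary dimension when all of its candidate classes agree there. All function names are hypothetical.

```python
from typing import Optional, Sequence, Set

HIGH_SOLUBILITY = {1, 3}
HIGH_PERMEABILITY = {1, 2}


def parse_label(raw: str) -> Set[int]:
    """Turn '1' or '1,3' into the set of candidate classes."""
    return {int(part) for part in raw.split(",")}


def binary_target(labels: Set[int], positive: Set[int]) -> Optional[bool]:
    """True or False when every candidate class agrees on the dimension, else None."""
    if labels <= positive:
        return True
    if labels.isdisjoint(positive):
        return False
    return None


def four_class_accuracy(raw_labels: Sequence[str], predictions: Sequence[int]) -> float:
    """Accuracy over samples that carry exactly one definite class."""
    definite = [
        (next(iter(labels)), pred)
        for labels, pred in ((parse_label(r), p) for r, p in zip(raw_labels, predictions))
        if len(labels) == 1
    ]
    if not definite:
        raise ValueError("no sample has a single definite label")
    return sum(truth == pred for truth, pred in definite) / len(definite)


ambiguous = parse_label("1,3")
sol_target = binary_target(ambiguous, HIGH_SOLUBILITY)     # both candidates are high solubility
perm_target = binary_target(ambiguous, HIGH_PERMEABILITY)  # candidates disagree
accuracy = four_class_accuracy(["1", "1,3", "2"], [1, 3, 4])
```

Under this reading a 1,3 sample still contributes to the solubility evaluation, is skipped for permeability, and is left out of the four-class accuracy.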
Single combination:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs \
      --mode single \
      --permeability-task logp \
      --logs-model-name xgboost \
      --permeability-model-name lightgbm
Single model family:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs --mode single-family --model-name attentivefp_pyg
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs --mode single-family --model-name xgboost
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs --mode single-family --model-name lightgbm
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs --mode single-family --model-name unimol_ft
uniform, automatically using the specified base models:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs \
      --mode uniform \
      --base-models xgboost lightgbm
blend, with an explicit primary permeability task:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs \
      --mode blend \
      --permeability-task logd \
      --base-models attentivefp_pyg xgboost lightgbm
stacking, which requires a stacking run already trained for the same combination of base models:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs \
      --mode stacking \
      --base-models attentivefp_pyg xgboost lightgbm
Notes:
- src.bcs --base-models is aligned with src.ensemble
- uniform / blend / stacking all support selecting a subset of the participating models
- stacking automatically looks up the stacking run trained with the same set of base_models
Override the default 0.1 mg/mL threshold:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs \
      --mode single-family \
      --model-name attentivefp_pyg \
      --solubility-threshold-mg-ml 0.05
Or give a logS threshold directly:
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs \
      --mode single-family \
      --model-name attentivefp_pyg \
      --solubility-threshold-logs -2.6
The two options cannot be specified together.
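A constraint like this is commonly enforced with an argparse mutually exclusive group. The sketch below shows that pattern for the two threshold flags; it is an assumption about how such a check could be written, not a copy of the src.bcs argument parser, and build_parser and resolve_threshold are hypothetical names.

```python
import argparse
from typing import Tuple


def build_parser() -> argparse.ArgumentParser:
    """Parser with the two solubility threshold flags in one exclusive group."""
    parser = argparse.ArgumentParser(prog="src.bcs")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--solubility-threshold-mg-ml", type=float, default=None)
    group.add_argument("--solubility-threshold-logs", type=float, default=None)
    return parser


def resolve_threshold(args: argparse.Namespace) -> Tuple[str, float]:
    """Return (kind, value), falling back to the documented 0.1 mg/mL default."""
    if args.solubility_threshold_logs is not None:
        return ("logs", args.solubility_threshold_logs)
    if args.solubility_threshold_mg_ml is not None:
        return ("mg_ml", args.solubility_threshold_mg_ml)
    return ("mg_ml", 0.1)


logs_choice = resolve_threshold(
    build_parser().parse_args(["--solubility-threshold-logs", "-2.6"])
)
default_choice = resolve_threshold(build_parser().parse_args([]))
```

With this setup, passing both flags makes argparse print a usage error and exit before any model is loaded.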
Current output directory conventions:
- Data preparation: outputs/prepared/<task>/seed_<seed>/
- Single-model training: outputs/<task>/<model_name>/<timestamp>/
- blend: outputs/<task>/ensemble/<timestamp>/
- stacking: outputs/<task>/stacking/<timestamp>/
- external BCS: outputs/external_bcs/<mode>/...
Common additional caches:
- Graph cache: outputs/cache/<task>_graphs.pt
- 1D feature cache: outputs/cache/<task>_<feature_spec>.pkl
- UniMol feature cache: outputs/cache/<task>_<feature_spec>.pkl
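Given the outputs/<task>/<model_name>/<timestamp>/ layout, picking the latest run of a model can be as simple as sorting directory names. The helper below is a hypothetical sketch, not the lookup used by src.ensemble; it assumes the timestamp format seen in the example paths (such as 20260508_225912), which sorts chronologically when compared as text.

```python
from pathlib import Path
from typing import Optional


def latest_run(outputs_root: Path, task: str, model_name: str) -> Optional[Path]:
    """Return the newest outputs/<task>/<model_name>/<timestamp>/ directory, if any."""
    model_dir = outputs_root / task / model_name
    if not model_dir.is_dir():
        return None
    # YYYYMMDD_HHMMSS names sort chronologically under plain string ordering.
    runs = sorted(p for p in model_dir.iterdir() if p.is_dir())
    return runs[-1] if runs else None
```

For example, latest_run(Path("outputs"), "logp", "xgboost") would return the most recent xgboost run directory for logp, or None when the model has not been trained yet.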
Full workflow, using logp as an example:
- Prepare the data
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.data --config configs/tasks/logp/attentivefp.toml
- Train the single models
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/attentivefp.toml
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/xgboost.toml
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.train --config configs/tasks/logp/lightgbm.toml
- Run blend
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble --task logp --mode blend
- Run stacking
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.ensemble --task logp --mode stacking --num-folds 3
- Run external BCS validation
    PYTHONPATH=. /cto_labs/AIDD/CODE/weinadi/BCS/.venv/bin/python -m src.bcs --mode single-family --model-name xgboost
Notes:
- The default device is set in each configs/models/*/base.toml
- The base configs currently default to cuda:0
- To switch GPUs, edit [run].device in the corresponding config
- The unimol_ft learning rate and batch size strongly affect GPU memory use and training stability, so retune them for your GPU
- blend currently outputs both the uniform and the tuned results
- external BCS currently uses C_s >= 0.1 mg/mL as the default high-solubility threshold