Snakemake workflow for preparing HIV sequence datasets and fine-tuning ProtBERT/HIVBERT models.
- Conda (mamba recommended), Python 3.10, and a machine with a CUDA-capable GPU; the environment installs CUDA 12.1 builds of PyTorch.
git lfs install
git clone https://github.com/Lottie-641/CS284AProject.git
cd CS284AProject
conda create -n hiv python=3.10 -y
conda activate hiv
conda env update -n hiv -f environment.yml
- (Optional) Check that CUDA is available:
python - <<'PY'
import torch; print('CUDA available:', torch.cuda.is_available())
PY
- Please note that several large files are stored under data/. Make sure all files in this directory are downloaded successfully.
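One way to confirm that the large files under data/ were actually downloaded is to look for leftover Git LFS pointer stubs. The sketch below is not part of the repository; the helper name `find_lfs_pointers` is hypothetical. It relies only on the fact that an LFS pointer file begins with the line `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

# First line of every Git LFS pointer file (pointer spec v1).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(root="data"):
    """Return files under `root` that are still LFS pointer stubs.

    Hypothetical helper for illustration; not shipped with the project.
    """
    stubs = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Read only as many bytes as the prefix, so large files stay cheap to check.
        with path.open("rb") as fh:
            head = fh.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            stubs.append(path)
    return stubs


if __name__ == "__main__":
    missing = find_lfs_pointers("data")
    if missing:
        print("Still LFS pointers (try `git lfs pull`):")
        for p in missing:
            print("  ", p)
    else:
        print("No LFS pointer stubs found under data/.")
```

If any paths are listed, running `git lfs pull` inside the repository should fetch the real contents.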
- Default end-to-end run (pretraining + all downstream folds):
bash run_models.sh all
- Run individual groups:
bash run_models.sh pretrain          # HIV-BERT genome pretraining
bash run_models.sh protbert          # ProtBERT fine-tuning across folds
bash run_models.sh hivbert           # HIV-BERT fine-tuning across folds
bash run_models.sh weighted-class    # HIV-BERT with class weights
bash run_models.sh weighted-focal    # HIV-BERT with focal loss
- The scripts wrap Snakemake; you can adjust resources via environment variables, e.g.
CORES=4 GPU_RES=12 CUDA_VISIBLE_DEVICES=0 bash run_models.sh hivbert
- To drive Snakemake manually for the default rule all:
snakemake --use-conda --cores 1 --resources gpu=12
- After running the script, a datasets/ folder is generated automatically, in which all raw data are preprocessed and split into training and validation sets.
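To make the relationship between the resource variables and the manual Snakemake call more concrete, here is a minimal sketch of how a wrapper could translate `CORES` and `GPU_RES` into Snakemake flags. This is an assumption for illustration, not the actual contents of run_models.sh: the function name `build_snakemake_cmd` is hypothetical, and the defaults (`1` core, `gpu=12`) are simply copied from the manual command above.

```python
import os
import shlex


def build_snakemake_cmd(env=None, targets=()):
    """Assemble a Snakemake command line from CORES and GPU_RES.

    Hypothetical mapping; the real run_models.sh may behave differently.
    """
    env = os.environ if env is None else env
    cores = str(env.get("CORES", "1"))       # assumed default, from the manual command
    gpu_res = str(env.get("GPU_RES", "12"))  # assumed default, from the manual command
    cmd = [
        "snakemake",
        "--use-conda",
        "--cores", cores,
        "--resources", f"gpu={gpu_res}",
    ]
    cmd.extend(targets)  # optional target rules or files; none means the default rule
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(shlex.join(build_snakemake_cmd()))
```

`CUDA_VISIBLE_DEVICES` needs no translation in such a wrapper: it is read by the CUDA runtime itself, so exporting it before the call is enough to restrict which GPUs are visible.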
Please find the model checkpoints and training logs at the following Google Drive link: https://drive.google.com/file/d/1CmWM9HJ5rvPnG6g2yq4tWF4fpfE5nnIQ/view?usp=sharing.
You can run the Jupyter notebooks under results/ to reproduce the visualizations and test the trained models.