Predicting Biologically Relevant HIV-1 Phenotypes From Viral Sequence Using Transformer Models

Snakemake workflow for preparing HIV sequence datasets and fine-tuning ProtBERT/HIVBERT models.

Prerequisites

  • Conda (mamba recommended) and a machine with a CUDA-capable GPU; the environment uses Python 3.10 and installs CUDA 12.1 builds of PyTorch.

Environment setup

  1. git lfs install
  2. git clone https://github.com/Lottie-641/CS284AProject.git
  3. cd CS284AProject
  4. conda create -n hiv python=3.10 -y
  5. conda activate hiv
  6. conda env update -n hiv -f environment.yml
  7. (Optional) Check that PyTorch can see the GPU: python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
  8. Several large files are stored under data/. Before running the workflow, confirm that every file in this directory downloaded successfully.
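
Because the clone is preceded by git lfs install, the large files under data/ are presumably tracked with Git LFS; in that case an incomplete download typically leaves small text pointer files in place of the real data. The sketch below, which rests on that assumption, scans a directory for files that begin with the standard Git LFS pointer header. The function name find_lfs_pointers is hypothetical and not part of this repository.

```python
from pathlib import Path

# First line of every Git LFS pointer file (per the Git LFS pointer spec).
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(root="data"):
    """Return files under `root` that still look like Git LFS pointers."""
    pointers = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            # Pointer files are tiny text files that start with the header.
            if handle.read(len(LFS_HEADER)) == LFS_HEADER:
                pointers.append(path)
    return pointers


if __name__ == "__main__":
    # Assumption: the script is run from the repository root.
    leftover = find_lfs_pointers("data")
    if leftover:
        print("Not downloaded yet (try `git lfs pull`):")
        for path in leftover:
            print("  ", path)
    else:
        print("No Git LFS pointer files found under data/.")
```

If any paths are reported, running git lfs pull from the repository root should fetch the actual file contents.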

Running the workflow

  • Default end-to-end run (pretraining + all downstream folds):
    • bash run_models.sh all
  • Run individual groups:
    • bash run_models.sh pretrain # HIV-BERT genome pretraining
    • bash run_models.sh protbert # ProtBERT fine-tuning across folds
    • bash run_models.sh hivbert # HIV-BERT fine-tuning across folds
    • bash run_models.sh weighted-class # HIV-BERT with class weights
    • bash run_models.sh weighted-focal # HIV-BERT with focal loss
  • The script wraps Snakemake; resources can be adjusted through environment variables, for example: CORES=4 GPU_RES=12 CUDA_VISIBLE_DEVICES=0 bash run_models.sh hivbert
  • To run Snakemake directly for the default rule all: snakemake --use-conda --cores 1 --resources gpu=12
  • Running the script automatically generates a datasets/ folder containing the raw data, preprocessed and split into training and validation sets.
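
As a rough illustration of the resource settings above, the following sketch builds a Snakemake command line from the CORES and GPU_RES environment variables. It assumes the wrapper maps CORES to --cores and GPU_RES to --resources gpu=, mirroring the manual command; the actual logic in run_models.sh may differ, and build_snakemake_command and its defaults are hypothetical.

```python
import os
import shlex


def build_snakemake_command(env=None, targets=()):
    """Assemble a snakemake argv list from CORES / GPU_RES (assumed mapping)."""
    env = os.environ if env is None else env
    # Defaults are copied from the manual command in this README; the
    # defaults used by run_models.sh itself are an unknown.
    cores = env.get("CORES", "1")
    gpu_res = env.get("GPU_RES", "12")
    return [
        "snakemake",
        "--use-conda",
        "--cores", str(cores),
        "--resources", f"gpu={gpu_res}",
        *targets,  # with no targets, Snakemake runs its default rule
    ]


if __name__ == "__main__":
    cmd = build_snakemake_command({"CORES": "4", "GPU_RES": "12"})
    # Print a shell-quoted command line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if it is later handed to subprocess.run.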

Model checkpoints

Model checkpoints and training logs are available on Google Drive: https://drive.google.com/file/d/1CmWM9HJ5rvPnG6g2yq4tWF4fpfE5nnIQ/view?usp=sharing

Testing and visualizations

You can run the Jupyter notebooks under results/ to reproduce the visualizations and test the trained models.
