Predicting Biologically Relevant HIV-1 Phenotypes From Viral Sequence Using Transformer Models

Snakemake workflow for preparing HIV sequence datasets and fine-tuning ProtBERT/HIVBERT models.

Prerequisites

Conda (mamba recommended), Python 3.10, and a machine with a CUDA-capable GPU. The environment installs CUDA 12.1 builds of PyTorch.

Environment setup

git lfs install
git clone https://github.com/Lottie-641/CS284AProject.git
cd CS284AProject
conda create -n hiv python=3.10 -y
conda activate hiv
conda env update -n hiv -f environment.yml
(Optional) python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
Note that several large files are stored under data/. After cloning, confirm that every file in this directory was downloaded successfully.
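One way to check the download is to look for files that are still Git LFS pointer stubs. The sketch below assumes the large files under data/ are tracked with Git LFS (suggested by the git lfs install step, but not stated outright); the helper names is_lfs_pointer and find_unfetched are hypothetical and not part of this repository.

```python
from pathlib import Path

# Git LFS pointer files are small text stubs that begin with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file looks like an unfetched Git LFS pointer stub."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_unfetched(data_dir: str = "data") -> list[Path]:
    """List files under data_dir that are still pointers rather than real content."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file() and is_lfs_pointer(p))


if __name__ == "__main__":
    missing = find_unfetched()
    if missing:
        print("Still LFS pointers (try `git lfs pull`):")
        for path in missing:
            print("  ", path)
    else:
        print("No LFS pointer stubs found under data/.")
```

If any stubs are reported, running git lfs pull inside the clone should fetch the real content.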

Running the workflow

Default end-to-end run (pretraining + all downstream folds):
- bash run_models.sh all
Run individual groups:
- bash run_models.sh pretrain # HIV-BERT genome pretraining
- bash run_models.sh protbert # ProtBERT fine-tuning across folds
- bash run_models.sh hivbert # HIV-BERT fine-tuning across folds
- bash run_models.sh weighted-class # HIV-BERT with class weights
- bash run_models.sh weighted-focal # HIV-BERT with focal loss
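For readers unfamiliar with the last two groups, the sketch below shows the textbook forms of inverse-frequency class weights and focal loss in plain Python. It is illustrative only: this README does not state the exact weighting scheme or the gamma and alpha values the workflow uses, so the function names and defaults here are assumptions.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample, given the predicted probability of its true class.

    FL(p) = -alpha * (1 - p)^gamma * log(p); gamma=0 gives weighted cross-entropy.
    """
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


if __name__ == "__main__":
    # Toy imbalanced label set: the rare class receives the larger weight.
    print(inverse_frequency_weights([0, 0, 0, 1]))
    # A confident correct prediction is down-weighted relative to cross-entropy.
    print(focal_loss(0.9, gamma=0.0), focal_loss(0.9, gamma=2.0))
```

Class weights rescale the loss per class, whereas focal loss rescales it per sample according to how confidently that sample is already classified; both aim to keep rare phenotype classes from being swamped by common ones.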
The scripts wrap Snakemake. Resources can be adjusted through environment variables, for example: CORES=4 GPU_RES=12 CUDA_VISIBLE_DEVICES=0 bash run_models.sh hivbert
To run Snakemake directly for the default rule (all): snakemake --use-conda --cores 1 --resources gpu=12
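As a rough illustration of how such a wrapper might translate environment variables into Snakemake flags, the sketch below assembles the command in Python. The mapping of CORES to --cores and GPU_RES to --resources gpu=, and the fallback values, are assumptions inferred from the two commands above; run_models.sh itself may behave differently, and build_snakemake_command is a hypothetical helper.

```python
import os
import shlex


def build_snakemake_command(env=None):
    """Assemble a Snakemake invocation from CORES and GPU_RES."""
    env = os.environ if env is None else env
    cores = env.get("CORES", "1")       # assumed default, taken from the manual command
    gpu_res = env.get("GPU_RES", "12")  # assumed default, taken from the manual command
    return [
        "snakemake", "--use-conda",
        "--cores", str(cores),
        "--resources", f"gpu={gpu_res}",
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so it can be inspected first.
    print(shlex.join(build_snakemake_command()))
```

CUDA_VISIBLE_DEVICES is not handled here because it is read directly by the CUDA runtime rather than passed to Snakemake as a flag.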
Running the script automatically generates a datasets/ folder containing the preprocessed raw data, split into training and validation sets.

Model checkpoints

Model checkpoints and training logs are available on Google Drive: https://drive.google.com/file/d/1CmWM9HJ5rvPnG6g2yq4tWF4fpfE5nnIQ/view?usp=sharing

Test and visualizations

You can run the Jupyter notebooks under results/ to reproduce the visualizations and test the trained models.
