This repository contains the manuscript-facing implementation of gVAE, a genomic variational autoencoder framework for stable and interpretable representation learning in high-dimensional genotype data with moderate sample sizes.
The code supports model training, latent representation extraction, SNP prioritization, downstream prediction, SNP-to-gene mapping, pathway enrichment, disease-gene relevance analysis, and reproducibility utilities used in the accompanying manuscript.
Genome-wide genotype matrices are high-dimensional, sparse, and often available for cohorts with limited sample sizes. The goal of this repository is to provide a reproducible implementation of a representation learning workflow that:
1. trains VAE/gVAE models on genotype data,
2. extracts stable latent representations,
3. identifies latent-variable-associated SNPs using attribution methods,
4. maps prioritized SNPs to genes,
5. evaluates disease relevance and drug-target support,
6. performs pathway enrichment analysis, and
7. supports downstream classification and regression analyses.
The repository is organized around a small set of manuscript-facing scripts, shared model utilities, reproducibility files, and cluster execution templates.
The main implementation lives in gvae/.
gvae/
Shared architecture and utilities
- gvae/__init__.py: package initializer.
- gvae/model.py: shared gVAE, Vanilla VAE, and beta-VAE model definitions.
- gvae/metrics.py: shared reconstruction and prediction metric utilities.
Model training and representation learning
- gvae/gvae.py: main model-training entry point.
- gvae/latent_classification.py: downstream classification and regression from latent features.
Interpretability and biological analysis
- gvae/snp_prioritization.py: SHAP-based SNP prioritization from latent variables.
- gvae/gene-pathway_enrichment.py: SNP-to-gene mapping, pathway enrichment, and disease-gene relevance analysis.
- gvae/build_target_support_table.py: gene-level disease and drug-target support summaries.
- gvae/gwas-xai.R: matched-budget comparison of GWAS-ranked and gVAE-XAI-prioritized signals.
Cluster execution templates
- gvae/gvae.slurm: SLURM template for model training.
- gvae/gene-pathway_enrichment.slurm: SLURM template for enrichment analysis.
- gvae/gwas-xai.slurm: SLURM template for GWAS-XAI comparison.
These files provide example cluster-job configurations and should be edited to match the local computing environment, data paths, memory limits, and runtime requirements.
- README.md: repository overview and usage guide.
- reproducibility.md: reproducibility notes and recommended workflow.
- requirements.txt: Python package requirements.
- environment.yml: Conda environment specification.
- pyproject.toml: package metadata and editable-install configuration.
- Makefile: helper commands for installation, cleanup, and checks.
- CITATION.cff: citation metadata.
- LICENSE: software license.
The repository can be installed using either conda or pip.
Using conda:
conda env create -f environment.yml
conda activate gvae

Using pip in a virtual environment:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

For an editable install of the package:
pip install -e .

After installation, the main Python workflows can be run from the repository root using module-style commands such as:
python -m gvae.gvae
python -m gvae.snp_prioritization

Users can verify that the package imports correctly with:
python -c "from gvae.model import GVAE; from gvae.metrics import evaluate_r_square; print('gVAE imports OK')"

The command-line interfaces can be inspected with:
python -m gvae.gvae --help
python -m gvae.snp_prioritization --help

A minimal model-training test can be run on a small PLINK BED dataset using:
python -m gvae.gvae \
--disease TEST \
--bed_prefix /path/to/example \
--latent_dim 4 \
--num_sample 5 \
--num_layer 2 \
--epochs 2 \
--batch_size 16 \
--feature_mode none \
--output_dir test_outputs

This command is intended only to verify installation, imports, data loading, model construction, training, and output writing. Manuscript-scale analyses require the full genotype, GWAS, SNP-to-gene, and pathway resources described below.
The scripts are designed for genotype and annotation files commonly used in genome-wide association studies and genomic representation learning.
Typical inputs include:
Genotype matrix:
<DISEASE>_filtered.csv
Phenotype file:
<DISEASE>_origin.phen
Variant annotation files:
<DISEASE>_origin.tped
<DISEASE>.bim
GWAS association file:
<DISEASE>_gwas.assoc
SNP-to-gene mapping file:
cS2G or other SNP-to-gene mapping table
Pathway resources:
Enrichr libraries or GMT files
Disease-gene resources:
DisGeNET TSV file or API access
The exact file paths should be adjusted in the command-line arguments or SLURM scripts.
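As an illustration of how the GWAS association file can drive GWAS-ranked SNP selection, the following is a minimal sketch, not the repository's implementation. It assumes a PLINK-style whitespace-delimited .assoc table whose header contains SNP and P columns; the function name top_gwas_snps and the toy rows are hypothetical.

```python
import io
import math


def top_gwas_snps(assoc_lines, d):
    """Return the d SNP IDs with the smallest P-values from PLINK-style .assoc text."""
    rows = iter(assoc_lines)
    header = next(rows).split()
    snp_col, p_col = header.index("SNP"), header.index("P")
    scored = []
    for line in rows:
        fields = line.split()
        if len(fields) <= max(snp_col, p_col):
            continue  # skip blank or truncated lines
        try:
            p = float(fields[p_col])
        except ValueError:
            continue  # non-numeric entries such as "NA" are skipped
        if math.isnan(p):
            continue
        scored.append((p, fields[snp_col]))
    scored.sort()  # ascending P-value, i.e. strongest association first
    return [snp for _, snp in scored[:d]]


# Toy stand-in for <DISEASE>_gwas.assoc; the values are made up for illustration.
example = io.StringIO(
    " CHR SNP BP A1 F_A F_U A2 CHISQ P OR\n"
    " 1 rs1 1000 A 0.10 0.12 G 0.50 0.48 0.82\n"
    " 1 rs2 2000 C 0.30 0.20 T 12.1 5e-4 1.71\n"
    " 2 rs3 3000 G 0.25 0.25 A NA NA NA\n"
    " 2 rs4 4000 T 0.40 0.31 C 7.9 0.005 1.48\n"
)
print(top_gwas_snps(example, d=2))  # ['rs2', 'rs4']
```

For a real file, the same function could be called on an open file handle, since it only needs an iterable of lines.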
python -m gvae.gvae \
--disease T2D \
--bed_prefix /path/to/plink/T2D \
--latent_dim 100 \
--num_sample 150 \
--num_layer 4 \
--epochs 50 \
--batch_size 256

python -m gvae.snp_prioritization \
--disease T2D \
--base_path /path/to/genotype/files \
--latent_dim 100 \
--num_samples 150 \
--num_layers 4 \
--shap_top_k 10 \
--tped_file /path/to/T2D_origin.tped \
--output_dir /path/to/xai_outputs

python gvae/latent_classification.py \
--disease T2D \
--base_path /path/to/genotype/files \
--model_type gvae \
--latent_dim 100 \
--num_samples 150 \
--num_layers 4 \
--feature_mode gwas_top \
--downsample_d 50000 \
--assoc_path /path/to/T2D_gwas.assoc \
--tped_file /path/to/T2D_origin.tped \
--train_vae_epochs 50 \
--vae_batch_size 256 \
--batch_size 256 \
--epochs 120 \
--cache_latents \
--out_root /path/to/latent_classification_outputs \
--make_plots

python gvae/gene-pathway_enrichment.py \
--disease T2D \
--base_dir /path/to/xai_outputs \
--s2g_path /path/to/snp_to_gene.tsv \
--bim_file /path/to/T2D.bim \
--run_gene_analysis \
--disgenet_mode tsv \
--disgenet_tsv /path/to/disgenet.tsv \
--disgenet_disease_name "type 2 diabetes" \
--out_root /path/to/gene_pathway_outputs

Depending on the script, outputs may include:
Model outputs:
trained model weights
reconstruction summaries
robustness summaries
latent representations
XAI outputs:
top SNPs per latent variable
SNP attribution summaries
q25/q75 latent feature files
SHAP-weighted genotype matrices
Prediction outputs:
classification or regression metrics
training histories
latent feature caches
performance plots
Gene/pathway outputs:
SNP-to-gene mapped tables
pathway enrichment tables
LV-by-pathway heatmaps
LV bubble plots
DisGeNET disease-gene relevance summaries
target-support tables
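To make the "top SNPs per latent variable" output concrete, here is a small sketch of the ranking step only. It does not compute SHAP values and is not taken from gvae/snp_prioritization.py; it assumes an attribution matrix with one row per latent variable and one non-negative score per SNP is already available. The function name, the LV labels, and the toy numbers are hypothetical.

```python
def top_k_snps_per_lv(attributions, snp_ids, k):
    """Rank SNPs within each latent variable and keep the k highest-scoring ones.

    attributions[lv][j] is an importance score (for example a mean absolute
    attribution) of SNP j for latent variable lv.
    """
    result = {}
    for lv, scores in enumerate(attributions):
        # Indices of SNPs sorted from highest to lowest score for this LV.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        result[f"LV{lv}"] = [snp_ids[j] for j in ranked[:k]]
    return result


snp_ids = ["rs1", "rs2", "rs3", "rs4"]
attributions = [
    [0.02, 0.40, 0.10, 0.05],  # scores for LV0 (made-up values)
    [0.30, 0.01, 0.00, 0.25],  # scores for LV1 (made-up values)
]
print(top_k_snps_per_lv(attributions, snp_ids, k=2))
# {'LV0': ['rs2', 'rs3'], 'LV1': ['rs1', 'rs4']}
```

Here k plays the role that shap_top_k plays in the command-line examples.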
The repository includes:
reproducibility.md
environment.yml
requirements.txt
pyproject.toml
Makefile
These files document the computational environment and provide installation or workflow helpers. Paths in the example scripts and SLURM files should be adjusted to the local system.
For reviewer-facing reproducibility, the recommended workflow is:
- create the documented environment,
- prepare genotype, phenotype, GWAS, and annotation files,
- run the model-training script,
- run SNP attribution,
- run gene/pathway enrichment,
- run matched downstream analyses or support-table construction.
Throughout the repository:
- LV denotes latent variable.
- LD denotes latent dimension in configuration names.
- NS denotes the number of posterior latent samples used for gVAE quantile aggregation.
- L denotes the number of encoder/decoder hidden layers.
- shap_top_k denotes the number of top SNPs retained per latent variable in attribution outputs.
- q25/q75 denote the posterior latent quantiles defining the reported gVAE feature representation.
- GWAS-top SNP filtering denotes structured filtering based on GWAS ranking, not random SNP downsampling.
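The NS and q25/q75 notation can be illustrated with a short sketch. It assumes the standard VAE reparameterization z = mu + sigma * eps with a diagonal Gaussian posterior, draws NS samples per latent dimension, and summarizes them by their 25th and 75th percentiles. The function name quantile_features, the toy mu and log_var values, and the concatenation into a single feature vector are assumptions for illustration, not the repository's code.

```python
import math
import random
import statistics


def quantile_features(mu, log_var, num_samples, rng):
    """Sample each latent dimension num_samples times and return (q25, q75) lists."""
    q25, q75 = [], []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)  # standard deviation from log-variance
        draws = [m + sigma * rng.gauss(0.0, 1.0) for _ in range(num_samples)]
        # n=4 yields the three quartile cut points; keep the first and third.
        lo, _, hi = statistics.quantiles(draws, n=4, method="inclusive")
        q25.append(lo)
        q75.append(hi)
    return q25, q75


rng = random.Random(0)           # fixed seed so repeated runs agree
mu = [0.0, 1.5, -2.0]            # toy posterior means for LD = 3
log_var = [0.0, -1.0, 0.5]       # toy posterior log-variances
q25, q75 = quantile_features(mu, log_var, num_samples=150, rng=rng)
features = q25 + q75             # one possible layout: 2 * LD values per individual
print(len(features))             # 6
```

Summarizing many posterior draws by quantiles, rather than using a single draw, reduces the run-to-run variability that a single stochastic sample would introduce.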
Run gVAE analyses from a single configuration file.
In addition to the manuscript-facing implementation in gvae/, this repository includes a configuration-driven software pipeline in software/. The pipeline provides a simplified entry point for users who want to run selected gVAE analyses on their own data without manually calling each script.
- Model training using the shared gVAE architecture.
- SHAP-based SNP prioritization from learned latent variables.
- Latent-space prediction for classification or regression tasks.
- SNP-to-gene and pathway enrichment for biological interpretation.
- GWAS-XAI matched-budget comparison for benchmarking prioritized signals.
- Smoke tests for installation, imports, and command-line interfaces.
Run selected steps:
python software/gvae_pipeline.py \
--config software/config_template.yaml \
--steps smoke train xai

Preview commands without running them:
python software/gvae_pipeline.py \
--config software/config_template.yaml \
--steps train xai enrich \
--dry-run

Run all configured steps:
python software/gvae_pipeline.py \
--config software/config_template.yaml \
--steps all

Available steps:
smoke      check imports and command-line interfaces
train      train gVAE models
xai        run SHAP-based SNP prioritization
predict    run latent-space classification or regression
enrich     run SNP-to-gene, pathway, and disease-gene analysis
gwas_xai   run GWAS versus gVAE-XAI matched-budget comparison
all        run all configured steps in order
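One way a driver script could interpret the step names listed above is sketched below. This is a hypothetical illustration, not the code in software/gvae_pipeline.py: it assumes that "all" expands to every step and that selected steps run in the listed order regardless of how they were typed. STEP_ORDER and resolve_steps are made-up names.

```python
# Assumed canonical order, taken from the order in which the steps are listed.
STEP_ORDER = ["smoke", "train", "xai", "predict", "enrich", "gwas_xai"]


def resolve_steps(requested):
    """Expand 'all' and return the requested steps in canonical order, without duplicates."""
    unknown = [s for s in requested if s != "all" and s not in STEP_ORDER]
    if unknown:
        raise ValueError(f"unknown step(s): {', '.join(unknown)}")
    if "all" in requested:
        return list(STEP_ORDER)
    wanted = set(requested)
    return [s for s in STEP_ORDER if s in wanted]


print(resolve_steps(["xai", "train"]))  # ['train', 'xai']
print(resolve_steps(["all"]))
```

Ordering the steps centrally means a later step such as xai is never attempted before the training step it depends on.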
Please cite this repository using the metadata in:
CITATION.cff
This repository is released under the MIT License. See:
LICENSE
For questions about the manuscript code or reproducibility workflow, please contact the repository maintainer through GitHub.

