Skip to content

1234-Ariel-code/gvae

Repository files navigation

gVAE: Genomic Variational Autoencoder

gVAE overview image


About

This repository contains the manuscript-facing implementation of gVAE, a genomic variational autoencoder framework for stable and interpretable representation learning in high-dimensional genotype data with moderate sample sizes.

The code supports model training, latent representation extraction, SNP prioritization, downstream prediction, SNP-to-gene mapping, pathway enrichment, disease-gene relevance analysis, and reproducibility utilities used in the accompanying manuscript.


Overview

gVAE workflow overview

Genome-wide genotype matrices are high-dimensional, sparse, and often available for cohorts with limited sample sizes. The goal of this repository is to provide a reproducible implementation of a representation learning workflow that: 1. trains VAE/gVAE models on genotype data, 2. extracts stable latent representations, 3. identifies latent-variable-associated SNPs using attribution methods, 4. maps prioritized SNPs to genes, 5. evaluates disease relevance and drug-target support, 6. performs pathway enrichment analysis, and 7. supports downstream classification and regression analyses.


Repository structure

The repository is organized around a small set of manuscript-facing scripts, shared model utilities, reproducibility files, and cluster execution templates.

Core Python package

The main implementation lives in gvae/.

gvae/

Shared architecture and utilities

Model training and representation learning

Interpretability and biological analysis

Cluster execution templates

These files provide example cluster-job configurations and should be edited to match the local computing environment, data paths, memory limits, and runtime requirements.

Repository-level files


Installation

The repository can be installed using either conda or pip.

Option 1: Conda environment

conda env create -f environment.yml
conda activate gvae

Option 2: Python requirements

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Option 3: Editable package installation

pip install -e .

After installation, the main Python workflows can be run from the repository root using module-style commands such as:

python -m gvae.gvae
python -m gvae.snp_prioritization

Quick smoke test

Users can verify that the package imports correctly with:

python -c "from gvae.model import GVAE; from gvae.metrics import evaluate_r_square; print('gVAE imports OK')"

The command-line interfaces can be inspected with:

python -m gvae.gvae --help
python -m gvae.snp_prioritization --help

A minimal model-training test can be run on a small PLINK BED dataset using:

python -m gvae.gvae \
  --disease TEST \
  --bed_prefix /path/to/example \
  --latent_dim 4 \
  --num_sample 5 \
  --num_layer 2 \
  --epochs 2 \
  --batch_size 16 \
  --feature_mode none \
  --output_dir test_outputs

This command is intended only to verify installation, imports, data loading, model construction, training, and output writing. Manuscript-scale analyses require the full genotype, GWAS, SNP-to-gene, and pathway resources described below.


Expected data inputs

The scripts are designed for genotype and annotation files commonly used in genome-wide association studies and genomic representation learning.

Typical inputs include:

Genotype matrix:
  <DISEASE>_filtered.csv

Phenotype file:
  <DISEASE>_origin.phen

Variant annotation files:
  <DISEASE>_origin.tped
  <DISEASE>.bim

GWAS association file:
  <DISEASE>_gwas.assoc

SNP-to-gene mapping file:
  cS2G or other SNP-to-gene mapping table

Pathway resources:
  Enrichr libraries or GMT files

Disease-gene resources:
  DisGeNET TSV file or API access

The exact file paths should be adjusted in the command-line arguments or SLURM scripts.


Example workflows

1. Train gVAE model

python -m gvae.gvae \
  --disease T2D \
  --bed_prefix /path/to/plink/T2D \
  --latent_dim 100 \
  --num_sample 150 \
  --num_layer 4 \
  --epochs 50 \
  --batch_size 256

2. Prioritize SNPs using latent-variable attribution

python -m gvae.snp_prioritization \
  --disease T2D \
  --base_path /path/to/genotype/files \
  --latent_dim 100 \
  --num_samples 150 \
  --num_layers 4 \
  --shap_top_k 10 \
  --tped_file /path/to/T2D_origin.tped \
  --output_dir /path/to/xai_outputs

3. Run latent-space classification or regression

python gvae/latent_classification.py \
  --disease T2D \
  --base_path /path/to/genotype/files \
  --model_type gvae \
  --latent_dim 100 \
  --num_samples 150 \
  --num_layers 4 \
  --feature_mode gwas_top \
  --downsample_d 50000 \
  --assoc_path /path/to/T2D_gwas.assoc \
  --tped_file /path/to/T2D_origin.tped \
  --train_vae_epochs 50 \
  --vae_batch_size 256 \
  --batch_size 256 \
  --epochs 120 \
  --cache_latents \
  --out_root /path/to/latent_classification_outputs \
  --make_plots

4. Run SNP-to-gene and pathway enrichment analysis

python gvae/gene-pathway_enrichment.py \
  --disease T2D \
  --base_dir /path/to/xai_outputs \
  --s2g_path /path/to/snp_to_gene.tsv \
  --bim_file /path/to/T2D.bim \
  --run_gene_analysis \
  --disgenet_mode tsv \
  --disgenet_tsv /path/to/disgenet.tsv \
  --disgenet_disease_name "type 2 diabetes" \
  --out_root /path/to/gene_pathway_outputs

Output structure

Depending on the script, outputs may include:

Model outputs:
  trained model weights
  reconstruction summaries
  robustness summaries
  latent representations

XAI outputs:
  top SNPs per latent variable
  SNP attribution summaries
  q25/q75 latent feature files
  SHAP-weighted genotype matrices

Prediction outputs:
  classification or regression metrics
  training histories
  latent feature caches
  performance plots

Gene/pathway outputs:
  SNP-to-gene mapped tables
  pathway enrichment tables
  LV-by-pathway heatmaps
  LV bubble plots
  DisGeNET disease-gene relevance summaries
  target-support tables

Reproducibility

The repository includes:

reproducibility.md
environment.yml
requirements.txt
pyproject.toml
Makefile

These files document the computational environment and provide installation or workflow helpers. Paths in the example scripts and SLURM files should be adjusted to the local system.

For reviewer-facing reproducibility, the recommended workflow is:

  1. create the documented environment,
  2. prepare genotype, phenotype, GWAS, and annotation files,
  3. run the model-training script,
  4. run SNP attribution,
  5. run gene/pathway enrichment,
  6. run matched downstream analyses or support-table construction.

Notes on terminology

Throughout the repository:

  • LV denotes latent variable.
  • LD denotes latent dimension in configuration names.
  • NS denotes the number of posterior latent samples used for gVAE quantile aggregation.
  • L denotes the number of encoder/decoder hidden layers.
  • shap_top_k denotes the number of top SNPs retained per latent variable in attribution outputs.
  • q25/q75 denote the posterior latent quantiles defining the reported gVAE feature representation.
  • GWAS-top SNP filtering denotes structured filtering based on GWAS ranking, not random SNP downsampling.

User-facing software pipeline

Run gVAE analyses from a single configuration file.

gVAE pipeline config template smoke test software documentation

In addition to the manuscript-facing implementation in gvae/, this repository includes a configuration-driven software pipeline in software/. The pipeline provides a simplified entry point for users who want to run selected gVAE analyses on their own data without manually calling each script.

What the software pipeline supports

  • Model training using the shared gVAE architecture.
  • SHAP-based SNP prioritization from learned latent variables.
  • Latent-space prediction for classification or regression tasks.
  • SNP-to-gene and pathway enrichment for biological interpretation.
  • GWAS-XAI matched-budget comparison for benchmarking prioritized signals.
  • Smoke tests for installation, imports, and command-line interfaces.

Pipeline files

File Role Description
software/gvae_pipeline.py workflow runner Configuration-driven pipeline runner for training, SNP attribution, prediction, enrichment, and GWAS-XAI comparison.
software/config_template.yaml full analysis template Full configuration template for running user-defined gVAE analyses.
software/config_smoke_test.yaml smoke test template Minimal configuration for checking installation, imports, and basic execution.
software/README.md usage guide User-facing guide for configuring and running the gVAE software pipeline.

Example usage

Run selected steps:

python software/gvae_pipeline.py \
  --config software/config_template.yaml \
  --steps smoke train xai

Preview commands without running them:

python software/gvae_pipeline.py \
  --config software/config_template.yaml \
  --steps train xai enrich \
  --dry-run

Run all configured steps:

python software/gvae_pipeline.py \
  --config software/config_template.yaml \
  --steps all

Available pipeline steps

smoke      check imports and command-line interfaces
train      train gVAE models
xai        run SHAP-based SNP prioritization
predict    run latent-space classification or regression
enrich     run SNP-to-gene, pathway, and disease-gene analysis
gwas_xai   run GWAS versus gVAE-XAI matched-budget comparison
all        run all configured steps in order

Citation

Please cite this repository using the metadata in:

CITATION.cff

License

This repository is released under the MIT License. See:

LICENSE

Contact

For questions about the manuscript code or reproducibility workflow, please contact the repository maintainer through GitHub.

About

gVAE: Stable and interpretable representation learning for high-dimensional genomic cohorts with moderate sample sizes

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors