GUT-FORMer is a Transformer-based model that jointly encodes microbiome taxonomy (which organisms are present) and functional pathways (what they are doing) into a shared latent space. It learns to reconstruct both modalities simultaneously, enabling cross-modal prediction and compact microbiome embeddings.
Developed by TomaszLab.
Both pyenv and Poetry can be installed via Homebrew with brew install pyenv poetry.
pyenv install 3.12 # if not already installed
make install

Place input files in the data/ directory:
data/
├── taxonomy_{dataset}.csv # samples × species (pipe-separated taxonomy strings)
├── pathways_{dataset}.csv # samples × pathways
└── metadata_{dataset}.csv # sample metadata, must include study_condition column
All files are indexed by sample ID. The {dataset} name is passed via --dataset to all scripts.
A sample dataset (--dataset sample) is included in the repository to get started quickly.
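
Before training on your own data, it can help to confirm that the three files line up. The sketch below is not part of the repository: `check_dataset` is a hypothetical helper, and it assumes comma-separated files with a header row and the sample ID in the first column (the README only states that the files are indexed by sample ID).

```python
import csv
from pathlib import Path


def read_sample_ids(path):
    """Return (header, sample IDs), assuming the first column holds the sample ID."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        ids = [row[0] for row in reader if row]
    return header, ids


def check_dataset(data_dir, dataset):
    """Return a list of problems found in the three input files (empty if none)."""
    data_dir = Path(data_dir)
    problems = []
    ids = {}
    for kind in ("taxonomy", "pathways", "metadata"):
        path = data_dir / f"{kind}_{dataset}.csv"
        if not path.exists():
            problems.append(f"missing file: {path.name}")
            continue
        header, ids[kind] = read_sample_ids(path)
        if kind == "metadata" and "study_condition" not in header:
            problems.append("metadata has no study_condition column")
    # Compare sample IDs only when all three files could be read.
    if len(ids) == 3:
        reference = set(ids["taxonomy"])
        for kind in ("pathways", "metadata"):
            if set(ids[kind]) != reference:
                problems.append(f"{kind} sample IDs differ from taxonomy")
    return problems
```

Calling `check_dataset("data", "sample")` should return an empty list if the bundled sample dataset follows the assumed layout.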
poetry run python src/gut_former/run_training.py

Options:
| Argument | Default | Description |
|---|---|---|
| --dataset | sample | Dataset name |
| --epochs | 55 | Number of training epochs |
| --embedding_dim | 128 | Embedding dimension |
| --latent_dim | 64 | Latent space dimension |
| --batch_size | 16 | Batch size |
| --learning_rate | 1.89e-4 | Learning rate |
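
To illustrate how these options and defaults behave on the command line, here is a hypothetical `argparse` parser that mirrors the table. It is a sketch only: `build_parser` is an invented name, and the actual parser in run_training.py may be defined differently.

```python
import argparse


def build_parser():
    """Hypothetical parser reproducing the documented training options and defaults."""
    parser = argparse.ArgumentParser(description="Training options (illustrative)")
    parser.add_argument("--dataset", default="sample", help="Dataset name")
    parser.add_argument("--epochs", type=int, default=55, help="Number of training epochs")
    parser.add_argument("--embedding_dim", type=int, default=128, help="Embedding dimension")
    parser.add_argument("--latent_dim", type=int, default=64, help="Latent space dimension")
    parser.add_argument("--batch_size", type=int, default=16, help="Batch size")
    parser.add_argument("--learning_rate", type=float, default=1.89e-4, help="Learning rate")
    return parser


# Options that are not passed keep the defaults from the table.
args = build_parser().parse_args(["--dataset", "my_dataset", "--epochs", "100"])
print(args.dataset, args.epochs, args.latent_dim)  # my_dataset 100 64
```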
Outputs saved to output/:
- {date}_{dataset}_checkpoint.pt: model weights
- {date}_{dataset}_training_stats.csv: per-epoch metrics
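
When several checkpoints accumulate in output/, a small helper can pick the most recent one for a dataset. This is a sketch under one assumption: that {date} is an eight-digit YYYYMMDD string, as in the example file name 20260412_sample_checkpoint.pt. `latest_checkpoint` is a hypothetical helper, not a function in the repository.

```python
import re
from pathlib import Path

# Assumed naming scheme: YYYYMMDD_{dataset}_checkpoint.pt
CHECKPOINT_PATTERN = re.compile(r"^(\d{8})_(.+)_checkpoint\.pt$")


def latest_checkpoint(output_dir, dataset):
    """Return the newest checkpoint path for `dataset`, or None if there is none."""
    candidates = []
    for path in Path(output_dir).glob("*_checkpoint.pt"):
        match = CHECKPOINT_PATTERN.match(path.name)
        if match and match.group(2) == dataset:
            candidates.append((match.group(1), path))
    if not candidates:
        return None
    # YYYYMMDD strings sort chronologically, so the largest date is the most recent.
    return max(candidates, key=lambda item: item[0])[1]
```

The returned path can then be passed to --checkpoint.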
Continue training from an existing checkpoint (replace the checkpoint path with your own):
nohup poetry run python src/gut_former/run_training.py \
--dataset my_dataset \
--epochs 1000 \
--checkpoint output/{date}_{dataset}_checkpoint.pt \
> output/training.log 2>&1 &

Follow progress:
tail -f output/training.log

Run inference with a trained checkpoint:
poetry run python src/gut_former/run_inference.py \
--dataset sample \
--checkpoint output/20260412_sample_checkpoint.pt

Outputs saved to output/:
- latent_{dataset}.csv: latent embeddings
- pred_taxonomy_{dataset}.csv: predicted taxonomy
- pred_pathways_{dataset}.csv: predicted pathways
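
One way to use the latent embeddings is to look up samples with similar profiles. The sketch below ranks samples by cosine similarity. It assumes, without confirmation from the repository, that latent_{dataset}.csv has a header row, one row per sample, the sample ID in the first column, and numeric latent dimensions in the remaining columns. `load_latent` and `nearest_samples` are hypothetical helpers.

```python
import csv
import math


def load_latent(path):
    """Read a latent CSV into {sample_id: [floats]} (layout assumed, see above)."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        return {row[0]: [float(x) for x in row[1:]] for row in reader if row}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_samples(latent, query_id, k=5):
    """Return the IDs of the k samples most similar to `query_id`."""
    query = latent[query_id]
    scores = [
        (cosine_similarity(query, vector), sample_id)
        for sample_id, vector in latent.items()
        if sample_id != query_id
    ]
    return [sample_id for _, sample_id in sorted(scores, reverse=True)[:k]]
```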
Non-negative matrix factorization (NMF) decomposes microbiome data into latent signatures. It supports both pathway and taxonomy data, with optional filtering by health status.
poetry run python src/gut_former/nmf/run_nmf.py --type pathways --dataset sample

Options:
| Argument | Default | Description |
|---|---|---|
| --type | required | Data type: pathways or taxonomy |
| --dataset | sample | Dataset name |
| --n_signatures | 8 | Number of NMF signatures |
| --n_runs | 100 | Runs per seed; the best run is selected by validation metrics |
| --filter | all | Sample subset: all, healthy, non-healthy |
Examples:
# Pathway NMF, all samples
poetry run python src/gut_former/nmf/run_nmf.py --type pathways --dataset sample
# Taxonomy NMF, healthy samples only
poetry run python src/gut_former/nmf/run_nmf.py --type taxonomy --dataset sample --filter healthy

Outputs saved to output/nmf/:
- H_{n}_{dataset}_{type}_{filter}.csv: signature matrix
- W_train_{n}_{dataset}_{type}_{filter}.csv: sample loadings, train set
- W_val_{n}_{dataset}_{type}_{filter}.csv: sample loadings, val set
- nmf_stats_{n}_{dataset}_{type}_{filter}.csv: per-run metrics
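
To make the roles of W and H concrete: NMF approximates a non-negative data matrix V (samples × features) as W·H, where H holds the signatures and W the per-sample loadings. The sketch below uses the classic multiplicative update rules for the squared-error objective, written without dependencies for illustration. It is not necessarily the algorithm run_nmf.py uses; a real pipeline would typically rely on NumPy or scikit-learn.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, n_signatures, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative V into W (samples x k) and H (k x features)."""
    rng = random.Random(seed)
    n_samples, n_features = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_signatures)] for _ in range(n_samples)]
    h = [[rng.random() + eps for _ in range(n_features)] for _ in range(n_signatures)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_features)]
             for i in range(n_signatures)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_signatures)]
             for i in range(n_samples)]
    return w, h


def reconstruction_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is what makes the signatures interpretable as additive components.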
Bi-Cross-Validation selects the optimal number of NMF signatures for pathway data by evaluating reconstruction quality on held-out data blocks across a range of ranks. For taxonomy, the number of signatures is established in the literature.
poetry run python src/gut_former/nmf/run_bicv.py --dataset sample

Options:
| Argument | Default | Description |
|---|---|---|
| --dataset | sample | Dataset name |
| --n_signatures_min | 2 | Minimum number of signatures to test |
| --n_signatures_max | 15 | Maximum number of signatures to test |
| --n_runs | 50 | NMF runs per fold per signature |
Output saved to output/bicv/:
- bicv_nmf_stats.csv: metrics for all folds, signatures, and runs
The optimal signature count is printed at the end of the run.
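
For readers unfamiliar with the "held-out data blocks" mentioned above: in the usual description of bi-cross-validation, a subset of rows and a subset of columns are held out together, which splits the matrix into four blocks. The model is fitted on the block that contains neither, and its prediction of the fully held-out block is scored. The sketch below only shows that partition. `bicv_blocks` is a hypothetical helper, and how run_bicv.py actually forms its folds is not documented here.

```python
def bicv_blocks(matrix, held_rows, held_cols):
    """Split a matrix (list of rows) into the four bi-cross-validation blocks.

    A: held-out rows x held-out columns (the block to predict)
    B: held-out rows x retained columns
    C: retained rows x held-out columns
    D: retained rows x retained columns (the block the model is fitted on)
    """
    held_rows, held_cols = set(held_rows), set(held_cols)
    n_cols = len(matrix[0])
    out_cols = [j for j in range(n_cols) if j in held_cols]
    kept_cols = [j for j in range(n_cols) if j not in held_cols]

    def take(row_is_held, cols):
        # Keep rows whose held-out status matches, restricted to the given columns.
        return [
            [row[j] for j in cols]
            for i, row in enumerate(matrix)
            if (i in held_rows) == row_is_held
        ]

    a = take(True, out_cols)
    b = take(True, kept_cols)
    c = take(False, out_cols)
    d = take(False, kept_cols)
    return a, b, c, d
```

Repeating this over several row and column folds, and over each candidate number of signatures, yields the per-rank reconstruction errors from which a rank can be chosen.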
To start fresh (remove virtual environment and reinstall):
make clean
make install

See CONTRIBUTING.md for development guidelines.