GUT-FORMer

GUT-FORMer is a Transformer-based model that jointly encodes microbiome taxonomy (which organisms are present) and functional pathways (what they are doing) into a shared latent space. It learns to reconstruct both modalities simultaneously, enabling cross-modal prediction and compact microbiome embeddings.

Developed by TomaszLab.

🚀 Setup

📋 Prerequisites

pyenv — Python version management
Poetry — Dependency management

Both can be installed with Homebrew: brew install pyenv poetry

⚙️ Installation

pyenv install 3.12  # if not already installed
make install

📂 Data

Place input files in the data/ directory:

data/
├── taxonomy_{dataset}.csv     # samples × species (pipe-separated taxonomy strings)
├── pathways_{dataset}.csv     # samples × pathways
└── metadata_{dataset}.csv     # sample metadata, must include study_condition column

All files are indexed by sample ID. The {dataset} name is passed via --dataset to all scripts.

A sample dataset (--dataset sample) is included in the repository to get started quickly.
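Before training on your own data, it can help to confirm that the three files line up. The sketch below is not part of the repository; `check_dataset` and `read_sample_ids` are hypothetical helpers. It assumes the sample ID is the first column of each CSV and that all three files should contain exactly the same set of sample IDs.

```python
import csv
from pathlib import Path


def read_sample_ids(path):
    """Return the values of the first column (assumed to hold sample IDs)."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        return [row[0] for row in reader if row]


def check_dataset(data_dir, dataset):
    """Raise ValueError if the three input files for `dataset` are inconsistent."""
    data_dir = Path(data_dir)
    ids = {
        kind: set(read_sample_ids(data_dir / f"{kind}_{dataset}.csv"))
        for kind in ("taxonomy", "pathways", "metadata")
    }
    # Assumption: every sample must appear in all three files.
    if not (ids["taxonomy"] == ids["pathways"] == ids["metadata"]):
        mismatched = set.union(*ids.values()) - set.intersection(*ids.values())
        raise ValueError(f"sample IDs not present in all files: {sorted(mismatched)}")

    # The metadata file must include a study_condition column.
    with open(data_dir / f"metadata_{dataset}.csv", newline="") as handle:
        header = next(csv.reader(handle))
    if "study_condition" not in header:
        raise ValueError("metadata file has no study_condition column")
    return len(ids["metadata"])
```

Calling check_dataset("data", "sample") returns the number of samples when the files are consistent and raises ValueError otherwise.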

🧠 Model

Train from scratch

poetry run python src/gut_former/run_training.py

Options:

| Argument | Default | Description |
|---|---|---|
| `--dataset` | `sample` | Dataset name |
| `--epochs` | `55` | Number of training epochs |
| `--embedding_dim` | `128` | Embedding dimension |
| `--latent_dim` | `64` | Latent space dimension |
| `--batch_size` | `16` | Batch size |
| `--learning_rate` | `1.89e-4` | Learning rate |
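When comparing several settings, it can be convenient to generate the command lines programmatically. The sketch below is a hypothetical helper (`training_command` is not part of the repository); it only uses the flags documented in this README and rejects anything else.

```python
import shlex

SCRIPT = "src/gut_former/run_training.py"

# Flags documented in this README; anything else is rejected.
ALLOWED = {
    "dataset", "epochs", "embedding_dim", "latent_dim",
    "batch_size", "learning_rate", "checkpoint",
}


def training_command(**options):
    """Build the argument list for one training run."""
    unknown = set(options) - ALLOWED
    if unknown:
        raise ValueError(f"undocumented options: {sorted(unknown)}")
    command = ["poetry", "run", "python", SCRIPT]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


# Print one command per latent dimension; run them by hand or via subprocess.run.
for latent_dim in (32, 64, 128):
    print(shlex.join(training_command(dataset="sample", latent_dim=latent_dim)))
```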

Outputs saved to output/:

{date}_{dataset}_checkpoint.pt — model weights
{date}_{dataset}_training_stats.csv — per-epoch metrics
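Because checkpoint names start with a date, the most recent one for a dataset can be located by name. The following is a minimal sketch, assuming {date} is a zero-padded YYYYMMDD string (as in 20260412) so that string order matches date order; `latest_checkpoint` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def latest_checkpoint(output_dir, dataset):
    """Return the newest checkpoint path for `dataset`, or None if there is none."""
    suffix = f"_{dataset}_checkpoint.pt"
    candidates = [
        path
        for path in Path(output_dir).glob(f"*{suffix}")
        # Keep only names whose prefix is purely numeric, e.g. 20260412,
        # so that "my_sample" checkpoints are not picked up for "sample".
        if path.name[: -len(suffix)].isdigit()
    ]
    # Assumption: {date} is zero-padded YYYYMMDD, so string order is date order.
    return max(candidates, key=lambda path: path.name, default=None)
```

The returned path can then be passed to --checkpoint for fine-tuning or inference.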

Fine-tuning

Continue training from an existing checkpoint (replace the checkpoint path with your own):

nohup poetry run python src/gut_former/run_training.py \
  --dataset my_dataset \
  --epochs 1000 \
  --checkpoint output/{date}_{dataset}_checkpoint.pt \
  > output/training.log 2>&1 &

Follow progress:

tail -f output/training.log

Inference

poetry run python src/gut_former/run_inference.py \
  --dataset sample \
  --checkpoint output/20260412_sample_checkpoint.pt

Outputs saved to output/:

latent_{dataset}.csv — latent embeddings
pred_taxonomy_{dataset}.csv — predicted taxonomy
pred_pathways_{dataset}.csv — predicted pathways
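One common use of the latent embeddings is to find samples with similar microbiome profiles. The sketch below assumes that latent_{dataset}.csv has a header row, the sample ID in the first column, and one numeric column per latent dimension; the actual layout may differ. `load_latent` and `nearest_samples` are hypothetical helpers.

```python
import csv
import math


def load_latent(path):
    """Map sample ID -> latent vector (ID in first column, floats after)."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        return {row[0]: [float(x) for x in row[1:]] for row in reader if row}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_samples(latent, query_id, k=5):
    """Return the k samples most similar to query_id as (sample_id, score) pairs."""
    query = latent[query_id]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in latent.items()
        if other != query_id
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]
```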

🧬 NMF

Non-negative matrix factorization (NMF) decomposes microbiome data into latent signatures. The script supports both pathway and taxonomy data, with optional filtering by health status.

poetry run python src/gut_former/nmf/run_nmf.py --type pathways --dataset sample

Options:

| Argument | Default | Description |
|---|---|---|
| `--type` | required | Data type: `pathways` or `taxonomy` |
| `--dataset` | `sample` | Dataset name |
| `--n_signatures` | `8` | Number of NMF signatures |
| `--n_runs` | `100` | Runs per seed; the best run is selected by validation metrics |
| `--filter` | `all` | Sample subset: `all`, `healthy`, `non-healthy` |

Examples:

# Pathway NMF, all samples
poetry run python src/gut_former/nmf/run_nmf.py --type pathways --dataset sample

# Taxonomy NMF, healthy samples only
poetry run python src/gut_former/nmf/run_nmf.py --type taxonomy --dataset sample --filter healthy

Outputs saved to output/nmf/:

H_{n}_{dataset}_{type}_{filter}.csv — signature matrix
W_train_{n}_{dataset}_{type}_{filter}.csv — sample loadings, train set
W_val_{n}_{dataset}_{type}_{filter}.csv — sample loadings, val set
nmf_stats_{n}_{dataset}_{type}_{filter}.csv — per-run metrics
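To make the roles of W and H concrete: NMF approximates a non-negative samples × features matrix V by the product W·H, where each row of H is a signature and each row of W holds one sample's loadings on those signatures. The sketch below illustrates this with the classic multiplicative-update rule in plain Python. It is an illustration only; the solver, initialization, and train/validation handling used by run_nmf.py are not documented here and may differ.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W.H (the reconstruction error)."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5


def nmf(v, n_signatures, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative V (samples x features) into W (samples x k), H (k x features)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_signatures)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(n_signatures)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(n_signatures)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_signatures)]
             for i in range(n)]
    return w, h
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is what makes the signatures interpretable as additive parts.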

🔍 biCV — Rank Selection

Bi-cross-validation (biCV) selects the number of NMF signatures for pathway data by evaluating reconstruction quality on held-out data blocks across a range of ranks. For taxonomy data, the number of signatures is taken from the literature instead.

poetry run python src/gut_former/nmf/run_bicv.py --dataset sample

Options:

| Argument | Default | Description |
|---|---|---|
| `--dataset` | `sample` | Dataset name |
| `--n_signatures_min` | `2` | Minimum number of signatures to test |
| `--n_signatures_max` | `15` | Maximum number of signatures to test |
| `--n_runs` | `50` | NMF runs per fold per signature |

Output saved to output/bicv/:

bicv_nmf_stats.csv — metrics for all folds, signatures, and runs

The optimal signature count is printed at the end of the run.
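The "held-out data blocks" in bi-cross-validation are formed by holding out a subset of rows (samples) and a subset of columns (pathways) at the same time: the factorization is fitted on the rows and columns that remain, and scored on the held-out block. The sketch below shows only this partitioning step. `bicv_blocks` is a hypothetical helper, and the fold counts and shuffling are assumptions, since the fold scheme used by run_bicv.py is not documented here.

```python
import random


def bicv_blocks(n_rows, n_cols, n_row_folds=2, n_col_folds=2, seed=0):
    """Yield (held_out_rows, held_out_cols, fit_rows, fit_cols) index lists."""
    rng = random.Random(seed)
    rows, cols = list(range(n_rows)), list(range(n_cols))
    rng.shuffle(rows)
    rng.shuffle(cols)
    # Split the shuffled indices into disjoint row folds and column folds.
    row_folds = [sorted(rows[i::n_row_folds]) for i in range(n_row_folds)]
    col_folds = [sorted(cols[j::n_col_folds]) for j in range(n_col_folds)]
    for held_rows in row_folds:
        for held_cols in col_folds:
            held_r, held_c = set(held_rows), set(held_cols)
            # The model is fitted on rows and columns outside the held-out block.
            fit_rows = [r for r in range(n_rows) if r not in held_r]
            fit_cols = [c for c in range(n_cols) if c not in held_c]
            yield held_rows, held_cols, fit_rows, fit_cols
```

Each matrix cell falls in exactly one held-out block, so averaging the held-out error over all blocks scores every entry once per candidate rank.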

🔧 Troubleshooting

To start fresh, remove the virtual environment and reinstall:

make clean
make install

🤝 Contributing

See CONTRIBUTING.md for development guidelines.
