GFMBench-API is an extensible benchmarking suite for assessing genomic foundation models (GFMs) across a diverse set of downstream tasks, including classification, variant effect prediction, and zero-shot evaluation.
-
Create a virtual environment (choose one option):
Option A: Using pip (venv)
python -m venv gfmbench_env source gfmbench_env/bin/activate # On Windows: gfmbench_env\Scripts\activate
Option B: Using conda
conda create -n gfmbench_env python=3.10 conda activate gfmbench_env
-
Install required Python dependencies:
pip install -r requirements.txt
-
(Optional, GPU users) Check CUDA availability:
python -c "import torch; print(torch.cuda.is_available())" -
(Optional, for DNABERT2 with flash attention):
pip install flash-attn --no-build-isolation
Note: The requirements in
requirements.txtare configured for running examples with DNABERT2 and DNABERT models. For other models or algorithms, additional dependencies may be required and should be installed separately.
GFMBench-API provides:
- A core API package for unified GFM evaluation (
gfmbench_api/). - A suite of standard benchmark tasks covering supervised and zero-shot scenarios.
- Consistent interfaces for models and tasks, enabling both out-of-the-box use and customized evaluation pipelines.
- Example scripts and templates (
usage_examples/) to get started quickly or for rapid prototyping.
gfmbench_api_rep/
├── gfmbench_api/ # Main API package
│ ├── benchmark_report/ # CSV report utilities
│ ├── metrics/ # Built-in metrics (AUROC, AUPRC, etc.)
│ ├── tasks/ # Task definitions
│ │ ├── base/ # Base task/model classes
│ │ └── concrete/ # 20+ ready-to-use tasks
│ └── utils/ # Misc utilities (data I/O, download helpers)
├── usage_examples/ # Getting started scripts and toy models
│ ├── run_benchmark.py
│ ├── trainers/
│ └── sanity_models/
├── logs/ # Logs (autocreated)
└── requirements.txt
GFMBench-API supports evaluation on 20 unique tasks, grouped as:
| Task Class | Description |
|---|---|
| GuePromoterAllTask | Binary classification of promoter vs non-promoter DNA sequences. |
| GueSpliceSiteTask | Three-class classification of splice sites as donor, acceptor, or non-splice. |
| GueTranscriptionFactorTask | Binary classification of transcription factor binding sites from ChIP-seq data. |
| VariantBenchmarksCodingTask | Binary classification of coding variants as benign or pathogenic. |
| VariantBenchmarksNonCodingTask | Binary classification of non-coding variants as benign or pathogenic. |
| VariantBenchmarksExpressionTask | Binary classification of variants affecting gene expression. |
| VariantBenchmarksCommonVsRareTask | Binary classification distinguishing common variants from synthetic rare controls. |
| VariantBenchmarksMEQTLTask | Binary classification of variants affecting DNA methylation rates. |
| VariantBenchmarksSQTLTask | Binary classification of variants affecting alternative splicing. |
| LRBCausalEqtlTask | Binary classification of variants causally influencing gene expression with tissue context. |
| Task Class | Description |
|---|---|
| VepevalClinvarTask | Zero-shot pathogenicity prediction for ClinVar SNVs using embedding-distance scoring. |
| IndelClinvarTask | Zero-shot pathogenicity prediction for ClinVar insertions and deletions. |
| BendVEPExpression | Zero-shot prediction of expression effects for non-coding variants. |
| BendVEPDisease | Zero-shot prediction of disease effects for non-coding variants. |
| SonglabClinvarTask | Zero-shot pathogenicity prediction for ClinVar SNVs using likelihood-based scoring. |
| BRCA1Task | Zero-shot prediction of functional impact for BRCA1 variants (LOF, intermediate, functional). |
| TraitGymComplexTask | Zero-shot prediction of complex trait-associated variants. |
| TraitGymMendelianTask | Zero-shot prediction of Mendelian disease-associated variants. |
| LrbVariantEffectPathogenicOmimTask | Zero-shot prediction of pathogenic variants associated with Mendelian diseases. |
| LoleveCausalEqtlTask | Zero-shot prediction of causal expression-modulating variants (indels) in promoters. |
All task classes expose a consistent interface (see gfmbench_api/tasks/base/):
get_task_name()get_task_attributes()— metadata (e.g. number of labels, dataset splits)get_finetune_dataset()eval_test_set(model)eval_validation_set(model)eval_cross_validation_fold(model, train_indices)
Simply implement the methods below in your model class; inheritance is not required. The API uses duck typing and will only call methods needed by the specific tasks you run.
| Method Signature | Description |
|---|---|
infer_sequence_to_labels_probs(sequences, ...) |
Takes a list of DNA sequences and returns probabilities for each class label for classification tasks. |
infer_variant_ref_sequences_to_labels_probs(variant_sequences, ref_sequences, ...) |
Takes lists of variant and reference sequences and returns probabilities for variant effect classification tasks. |
infer_sequence_to_sequence(sequences, ...) |
Takes a list of DNA sequences and returns per-nucleotide probabilities, per-position embeddings, and a single sequence-level embedding. |
sequence_pos_to_prob_pos(sequences, pos) |
Takes a list of DNA sequences and a position index, returns the corresponding output position indices accounting for tokenization differences. |
infer_masked_sequence_to_token_probs(sequences, variant_pos, variant_letters, reference_letters, ...) |
Takes sequences and masks the variant position, returns probabilities for the variant and reference nucleotides at the masked position. |
- Any not-implemented methods can simply return
Noneand metrics depending on them will be skipped. - See
gfmbench_api/tasks/base/base_gfm_model.pyfor detailed docstrings.
This reference script demonstrates a standard workflow:
- Loading a model (built-in or your own)
- Running a configurable set of tasks
- Supervised fine-tuning or zero-shot evaluation
- Auto-generating benchmark CSVs
| Argument | Type | Default | Notes |
|---|---|---|---|
--model |
str | DNABERT2 | Model string key (see MODEL_REGISTRY) |
--checkpoint_path |
str | None | Path to model checkpoint |
--mlm_head_path |
str | None | Optional: path to MLM head checkpoint |
--report_algo_name |
str | temp | Column name for this model in benchmark CSV |
--csv_path |
str | ... | Where benchmark report is saved |
--linear_prob |
flag | False | If set: train only projection layer |
--epochs |
int | 3 | Num epochs for fine-tuning |
--disable_safe_model_call |
flag | False | Bypass try/except if set |
- Root data directory
⚠️ Update around line 164:root_data_dir_path = "/path/to/your/data/"
- Sanity check mode (100-sample subsets)
sanity_check_mode = True # Set to False for full eval
- Task configuration
Edit parameters likemax_sequence_length,batch_size. - Training parameters
Changenum_epochs,learning_rate, etc. in the training_params dict. - Task list
Edit/add the tasks in thetasks = [...]list. - Model registry
Add your models to theMODEL_REGISTRYdict.
Default (no checkpoint):
python usage_examples/run_benchmark.py \
--model DNABERT2 \
--report_algo_name dna_bert2_baseline \
--csv_path results/baseline_results.csvWith a custom model checkpoint:
python usage_examples/run_benchmark.py \
--model DNABERT2 \
--checkpoint_path /path/to/model.pt \
--report_algo_name my_custom_model \
--csv_path results/my_results.csv \
--epochs 5Linear probe (freeze backbone, train only projection):
python usage_examples/run_benchmark.py \
--model DNABERT2 \
--linear_prob \
--report_algo_name linear_probe \
--csv_path results/linear_probe_results.csvAll task data is stored under a single root directory (customize the path!):
root_data_dir_path = "/path/to/your/data/"Each task creates a subdirectory, e.g.:
<root_data_dir_path>/
├── gue_promoter_all/
│ ├── train.csv
│ ├── dev.csv
│ └── test.csv
├── clinvar_vepeval/
│ └── ...
└── ...
- Task subdirectory names correspond to class names.
- If files are missing, they will be automatically downloaded on first use.
| Task Type | Key columns | Example |
|---|---|---|
| Classification | sequence, label |
ATCCGA..., 1 |
| Variant effect | ref_sequence, alt_sequence, label |
A...T..., 0 |
GFMBench-API uses the following standard symbols for DNA sequences in datasets:
| Symbol | Meaning | Usage |
|---|---|---|
A, T, C, G |
Standard DNA nucleotides | Used to represent the four DNA bases (adenine, thymine, cytosine, guanine) |
N |
Unknown/ambiguous nucleotide | Used when the nucleotide at a position is unknown or ambiguous |
P |
Padding character | Used to pad sequences when extracting windows near chromosome boundaries |
Notes:
- All sequences are case-insensitive (automatically converted to uppercase)
- The
Ppadding character should be added when handling boundary cases - Unknown nucleotides (
N) may appear in reference genome data or when sequence information is incomplete - When creating custom tasks, ensure your sequences use only these accepted symbols
Results are auto-saved in a single CSV with this schema:
| task | metric | model1 | model2 | ... |
|---|---|---|---|---|
| gue_promoter_all | accuracy | 0.85 | 0.82 | ... |
| gue_promoter_all | auroc | 0.91 | 0.88 | ... |
| ... | ... | ... | ... | ... |
- Each (task, metric) is a row.
- Each model name (as passed to
--report_algo_name) is a column. - If a result is missing, its cell is
NO_RESULTS. - Incremental saving ensures safe restarts/interruption recovery.
Logs are written to logs/ for all major steps (run progress, model errors, fine-tune loss curves, auto-downloading messages, etc).
from gfmbench_api.tasks.concrete.gue_promoter_all_task import GuePromoterAllTask
from gfmbench_api.benchmark_report import BenchmarkReport
task = GuePromoterAllTask(
root_data_dir_path="/your/data/path",
task_config={"max_sequence_length": 2500, "batch_size": 16},
)
my_model = MyCustomModel(device="cuda") # Implements the required inference methods
# Optionally, supervised fine-tuning:
if task.get_task_attributes()["has_finetuning_data"]:
dataset = task.get_finetune_dataset()
# ... Your custom training code ...
# After fine-tuning, wrap model with projection layer for classification (my_model = wrapped_model)
# Evaluate
my_model.eval()
results = task.eval_test_set(my_model)
print(results)
# Save results
report = BenchmarkReport(csv_path="results.csv")
report.add_scores("gue_promoter_all", "my_model", results)
report.save_csv()To add a new concrete task, inherit from the appropriate base class based on your task type and implement the required methods.
Note: If your task has unique functionality that doesn't fit the standard task types, you can inherit directly from BaseGFMTask and implement all required methods from scratch.
Choose the appropriate base class based on your task:
-
Supervised Single-Sequence Classification: Inherit from
BaseGFMSupervisedSingleSeqTask- For tasks with single DNA sequences and categorical labels (e.g., promoter prediction, splice site detection)
-
Supervised Variant Effect Prediction: Inherit from
BaseGFMSupervisedVariantEffectTask- For tasks with paired reference/variant sequences and categorical labels (e.g., pathogenic variant classification)
-
Zero-Shot SNV Variant Effect: Inherit from
BaseGFMZeroShotSNVTask- For zero-shot evaluation of single-nucleotide variants (SNVs) with equal-length reference and variant sequences
-
Zero-Shot General Indel: Inherit from
BaseGFMZeroShotGeneralIndelTask- For zero-shot evaluation of insertions/deletions with variable-length sequences
All concrete tasks must implement:
get_task_name() -> str: Return the task name (must match the data directory name)_get_default_max_seq_len() -> int: Return the default maximum sequence length_create_datasets()or_create_test_dataset(): Create and return datasets- Supervised tasks: Return
Tuple[Optional[Dataset], Optional[Dataset], Dataset](train, validation, test) - Zero-shot tasks: Return
Dataset(test only) - Important: When creating datasets, you must account for:
self.max_sequence_length: Truncate sequences if they exceed this length (usetruncate_sequence_from_ends()fromgfmbench_api.utils.preprocutils)self.max_num_samples: Limit the number of samples per split if specified (for sanity testing with smaller datasets)
- Supervised tasks: Return
get_conditional_input_meta_data_frame() -> Optional[pd.DataFrame]: Return metadata DataFrame if task uses conditional inputs, otherwise returnNone
Additional methods for supervised tasks:
_get_num_labels() -> int: Return the number of classification labels
from gfmbench_api.tasks.base.base_gfm_supervised_single_seq_task import BaseGFMSupervisedSingleSeqTask
import os
import pandas as pd
import torch
import numpy as np
class MyCustomTask(BaseGFMSupervisedSingleSeqTask):
def get_task_name(self) -> str:
return "my_custom_task"
def _get_default_max_seq_len(self) -> int:
return 512
def _get_num_labels(self) -> int:
return 2 # Binary classification
def _create_datasets(self):
data_dir = os.path.join(self.root_data_dir_path, self.get_task_name())
train_df = pd.read_csv(os.path.join(data_dir, "train.csv"))
val_df = pd.read_csv(os.path.join(data_dir, "dev.csv")) if os.path.exists(os.path.join(data_dir, "dev.csv")) else None
test_df = pd.read_csv(os.path.join(data_dir, "test.csv"))
# Data process: Account for self.max_num_samples and self.max_sequence_length
train_dataset = [(seq, label, np.array([])) for seq, label in zip(train_df['sequence'], train_df['label'])]
validation_dataset = [(seq, label, np.array([])) for seq, label in zip(val_df['sequence'], val_df['label'])] if val_df is not None else None
test_dataset = [(seq, label, np.array([])) for seq, label in zip(test_df['sequence'], test_df['label'])]
return train_dataset, validation_dataset, test_dataset
def get_conditional_input_meta_data_frame(self):
return Nonefrom gfmbench_api.tasks.base.base_gfm_zeroshot_snv_task import BaseGFMZeroShotSNVTask
import os
import pandas as pd
import numpy as np
class MyZeroShotTask(BaseGFMZeroShotSNVTask):
def get_task_name(self) -> str:
return "my_zero_shot_task"
def _get_default_max_seq_len(self) -> int:
return 1024
def _create_test_dataset(self):
data_dir = os.path.join(self.root_data_dir_path, self.get_task_name())
test_df = pd.read_csv(os.path.join(data_dir, "test.csv"))
# Data process: Account for self.max_num_samples and self.max_sequence_length
test_dataset = [
(var_seq, ref_seq, label, np.array([]))
for var_seq, ref_seq, label in zip(test_df['variant_sequence'], test_df['reference_sequence'], test_df['label'])
]
return test_dataset
def _get_variant_position_in_sequence(self) -> int:
return self.max_sequence_length // 2
def get_conditional_input_meta_data_frame(self):
return NoneThis project will download and install additional third-party open source software projects, including datasets licensed with non-commercial terms. Review the license terms of these open source projects before use.
Third-party components and attributions: see THIRD_PARTY_NOTICES.md (Python dependencies, Hugging Face assets, reference genomes, and other data URLs). Example: LicenseRef-UCSC-Genome-Browser — https://genome.ucsc.edu/license/
NVIDIA-authored code in this repository is licensed under the Apache License, Version 2.0. The full text is in LICENSE.
Each contributed .py file includes NVIDIA SPDX-FileCopyrightText, SPDX-License-Identifier: Apache-2.0, and the short Apache-2.0 notice (through “limitations under the License”), then third-party URL notices scoped to that file (or the line stating that the module does not embed third-party data download URLs). Python package attributions remain in THIRD_PARTY_NOTICES.md. The full Apache-2.0 text is in LICENSE.
@article{larey2026gfmbench, title={GFMBench-API: A Standardized Interface for Benchmarking Genomic Foundation Models}, author={Ariel Larey, Elay Dahan, Amit Bleiweiss, Raizy Kellerman, Guy Leib, Omri Nayshool, Dan Ofer, Tal Zinger, Dan Dominissini, Gideon Rechavi, Nicole Bussola, Simon Lee, Shane O’Connell, Dung Hoang, Marissa Wirth, Alexander W. Charney, Yoli Shavit, Nati Daniel}, journal={bioRxiv}, pages={2026--02}, year={2026}, publisher={Cold Spring Harbor Laboratory} }