GFMBench-API

GFMBench-API is an extensible benchmarking suite for assessing genomic foundation models (GFMs) across a diverse set of downstream tasks, including classification, variant effect prediction, and zero-shot evaluation.

Quick Start

Installation

Create a virtual environment (choose one option):

Option A: Using pip (venv)

python -m venv gfmbench_env
source gfmbench_env/bin/activate  # On Windows: gfmbench_env\Scripts\activate

Option B: Using conda

conda create -n gfmbench_env python=3.10
conda activate gfmbench_env

Install required Python dependencies:
```
pip install -r requirements.txt
```

(Optional, GPU users) Check CUDA availability:

python -c "import torch; print(torch.cuda.is_available())"

(Optional, for DNABERT2 with flash attention):
```
pip install flash-attn --no-build-isolation
```
Note: The requirements in requirements.txt are configured for running examples with DNABERT2 and DNABERT models. For other models or algorithms, additional dependencies may be required and should be installed separately.

What’s Included?

GFMBench-API provides:

A core API package for unified GFM evaluation (gfmbench_api/).
A suite of standard benchmark tasks covering supervised and zero-shot scenarios.
Consistent interfaces for models and tasks, enabling both out-of-the-box use and customized evaluation pipelines.
Example scripts and templates (usage_examples/) to get started quickly or for rapid prototyping.

Repository Organization

gfmbench_api_rep/
├── gfmbench_api/              # Main API package
│   ├── benchmark_report/      # CSV report utilities
│   ├── metrics/               # Built-in metrics (AUROC, AUPRC, etc.)
│   ├── tasks/                 # Task definitions
│   │   ├── base/              # Base task/model classes
│   │   └── concrete/          # 20+ ready-to-use tasks
│   └── utils/                 # Misc utilities (data I/O, download helpers)
├── usage_examples/            # Getting started scripts and toy models
│   ├── run_benchmark.py
│   ├── trainers/
│   └── sanity_models/
├── logs/                      # Logs (autocreated)
└── requirements.txt

Supported Benchmarks & Tasks

GFMBench-API supports evaluation on 20 unique tasks, grouped as:

Supervised Classification & Variant Prediction

Task Class	Description
GuePromoterAllTask	Binary classification of promoter vs non-promoter DNA sequences.
GueSpliceSiteTask	Three-class classification of splice sites as donor, acceptor, or non-splice.
GueTranscriptionFactorTask	Binary classification of transcription factor binding sites from ChIP-seq data.
VariantBenchmarksCodingTask	Binary classification of coding variants as benign or pathogenic.
VariantBenchmarksNonCodingTask	Binary classification of non-coding variants as benign or pathogenic.
VariantBenchmarksExpressionTask	Binary classification of variants affecting gene expression.
VariantBenchmarksCommonVsRareTask	Binary classification distinguishing common variants from synthetic rare controls.
VariantBenchmarksMEQTLTask	Binary classification of variants affecting DNA methylation rates.
VariantBenchmarksSQTLTask	Binary classification of variants affecting alternative splicing.
LRBCausalEqtlTask	Binary classification of variants causally influencing gene expression with tissue context.

Zero-Shot Variant Effect Prediction

Task Class	Description
VepevalClinvarTask	Zero-shot pathogenicity prediction for ClinVar SNVs using embedding-distance scoring.
IndelClinvarTask	Zero-shot pathogenicity prediction for ClinVar insertions and deletions.
BendVEPExpression	Zero-shot prediction of expression effects for non-coding variants.
BendVEPDisease	Zero-shot prediction of disease effects for non-coding variants.
SonglabClinvarTask	Zero-shot pathogenicity prediction for ClinVar SNVs using likelihood-based scoring.
BRCA1Task	Zero-shot prediction of functional impact for BRCA1 variants (LOF, intermediate, functional).
TraitGymComplexTask	Zero-shot prediction of complex trait-associated variants.
TraitGymMendelianTask	Zero-shot prediction of Mendelian disease-associated variants.
LrbVariantEffectPathogenicOmimTask	Zero-shot prediction of pathogenic variants associated with Mendelian diseases.
LoleveCausalEqtlTask	Zero-shot prediction of causal expression-modulating variants (indels) in promoters.

How Model & Task Interfaces Work

Task API

All task classes expose a consistent interface (see gfmbench_api/tasks/base/):

get_task_name()
get_task_attributes() — metadata (e.g. number of labels, dataset splits)
get_finetune_dataset()
eval_test_set(model)
eval_validation_set(model)
eval_cross_validation_fold(model, train_indices)

Model Integration

Simply implement the methods below in your model class; inheritance is not required. The API uses duck typing and will only call methods needed by the specific tasks you run.

Method Signature	Description
`infer_sequence_to_labels_probs(sequences, ...)`	Takes a list of DNA sequences and returns probabilities for each class label for classification tasks.
`infer_variant_ref_sequences_to_labels_probs(variant_sequences, ref_sequences, ...)`	Takes lists of variant and reference sequences and returns probabilities for variant effect classification tasks.
`infer_sequence_to_sequence(sequences, ...)`	Takes a list of DNA sequences and returns per-nucleotide probabilities, per-position embeddings, and a single sequence-level embedding.
`sequence_pos_to_prob_pos(sequences, pos)`	Takes a list of DNA sequences and a position index, returns the corresponding output position indices accounting for tokenization differences.
`infer_masked_sequence_to_token_probs(sequences, variant_pos, variant_letters, reference_letters, ...)`	Takes sequences and masks the variant position, returns probabilities for the variant and reference nucleotides at the masked position.

Any not-implemented methods can simply return None and metrics depending on them will be skipped.
See gfmbench_api/tasks/base/base_gfm_model.py for detailed docstrings.

Example: Running Benchmarks

The `usage_examples/run_benchmark.py` script

This reference script demonstrates a standard workflow:

Loading a model (built-in or your own)
Running a configurable set of tasks
Supervised fine-tuning or zero-shot evaluation
Auto-generating benchmark CSVs

Common CLI Arguments

Argument	Type	Default	Notes
`--model`	str	DNABERT2	Model string key (see `MODEL_REGISTRY`)
`--checkpoint_path`	str	None	Path to model checkpoint
`--mlm_head_path`	str	None	Optional: path to MLM head checkpoint
`--report_algo_name`	str	temp	Column name for this model in benchmark CSV
`--csv_path`	str	...	Where benchmark report is saved
`--linear_prob`	flag	False	If set: train only projection layer
`--epochs`	int	3	Num epochs for fine-tuning
`--disable_safe_model_call`	flag	False	Bypass try/except if set

Key Variables to Adjust in `run_benchmark.py`

Root data directory
⚠️ Update around line 164:
```
root_data_dir_path = "/path/to/your/data/"
```

Sanity check mode (100-sample subsets)

sanity_check_mode = True  # Set to False for full eval

Task configuration
Edit parameters like max_sequence_length, batch_size.
Training parameters
Change num_epochs, learning_rate, etc. in the training_params dict.
Task list
Edit/add the tasks in the tasks = [...] list.
Model registry
Add your models to the MODEL_REGISTRY dict.

Example usage

Default (no checkpoint):

python usage_examples/run_benchmark.py \
    --model DNABERT2 \
    --report_algo_name dna_bert2_baseline \
    --csv_path results/baseline_results.csv

With a custom model checkpoint:

python usage_examples/run_benchmark.py \
    --model DNABERT2 \
    --checkpoint_path /path/to/model.pt \
    --report_algo_name my_custom_model \
    --csv_path results/my_results.csv \
    --epochs 5

Linear probe (freeze backbone, train only projection):

python usage_examples/run_benchmark.py \
    --model DNABERT2 \
    --linear_prob \
    --report_algo_name linear_probe \
    --csv_path results/linear_probe_results.csv

Data Preparation

Root Data Directory

All task data is stored under a single root directory (customize the path!):

root_data_dir_path = "/path/to/your/data/"

Task Data Structure

Each task creates a subdirectory, e.g.:

<root_data_dir_path>/
├── gue_promoter_all/
│   ├── train.csv
│   ├── dev.csv
│   └── test.csv
├── clinvar_vepeval/
│   └── ...
└── ...

Task subdirectory names correspond to class names.
If files are missing, they will be automatically downloaded on first use.

Dataset Format

Task Type	Key columns	Example
Classification	`sequence`, `label`	`ATCCGA...`, 1
Variant effect	`ref_sequence`, `alt_sequence`, label	`A...T...`, 0

Sequence Symbol Conventions

GFMBench-API uses the following standard symbols for DNA sequences in datasets:

Symbol	Meaning	Usage
`A`, `T`, `C`, `G`	Standard DNA nucleotides	Used to represent the four DNA bases (adenine, thymine, cytosine, guanine)
`N`	Unknown/ambiguous nucleotide	Used when the nucleotide at a position is unknown or ambiguous
`P`	Padding character	Used to pad sequences when extracting windows near chromosome boundaries

Notes:

All sequences are case-insensitive (automatically converted to uppercase)
The P padding character should be added when handling boundary cases
Unknown nucleotides (N) may appear in reference genome data or when sequence information is incomplete
When creating custom tasks, ensure your sequences use only these accepted symbols

Benchmark Output

Results are auto-saved in a single CSV with this schema:

task	metric	model1	model2	...
gue_promoter_all	accuracy	0.85	0.82	...
gue_promoter_all	auroc	0.91	0.88	...
...	...	...	...	...

Each (task, metric) is a row.
Each model name (as passed to --report_algo_name) is a column.
If a result is missing, its cell is NO_RESULTS.
Incremental saving ensures safe restarts/interruption recovery.

Logs are written to logs/ for all major steps (run progress, model errors, fine-tune loss curves, auto-downloading messages, etc).

Customizing

Example: Minimal Custom Pipeline

from gfmbench_api.tasks.concrete.gue_promoter_all_task import GuePromoterAllTask
from gfmbench_api.benchmark_report import BenchmarkReport

task = GuePromoterAllTask(
    root_data_dir_path="/your/data/path",
    task_config={"max_sequence_length": 2500, "batch_size": 16},
)

my_model = MyCustomModel(device="cuda")  # Implements the required inference methods

# Optionally, supervised fine-tuning:
if task.get_task_attributes()["has_finetuning_data"]:
    dataset = task.get_finetune_dataset()
    # ... Your custom training code ...
    # After fine-tuning, wrap model with projection layer for classification (my_model = wrapped_model)

# Evaluate
my_model.eval()
results = task.eval_test_set(my_model)
print(results)

# Save results
report = BenchmarkReport(csv_path="results.csv")
report.add_scores("gue_promoter_all", "my_model", results)
report.save_csv()

Adding Tasks

To add a new concrete task, inherit from the appropriate base class based on your task type and implement the required methods.

Note: If your task has unique functionality that doesn't fit the standard task types, you can inherit directly from BaseGFMTask and implement all required methods from scratch.

Task Type Hierarchy

Choose the appropriate base class based on your task:

Supervised Single-Sequence Classification: Inherit from BaseGFMSupervisedSingleSeqTask
- For tasks with single DNA sequences and categorical labels (e.g., promoter prediction, splice site detection)
Supervised Variant Effect Prediction: Inherit from BaseGFMSupervisedVariantEffectTask
- For tasks with paired reference/variant sequences and categorical labels (e.g., pathogenic variant classification)
Zero-Shot SNV Variant Effect: Inherit from BaseGFMZeroShotSNVTask
- For zero-shot evaluation of single-nucleotide variants (SNVs) with equal-length reference and variant sequences
Zero-Shot General Indel: Inherit from BaseGFMZeroShotGeneralIndelTask
- For zero-shot evaluation of insertions/deletions with variable-length sequences

Required Methods to Implement

All concrete tasks must implement:

get_task_name() -> str: Return the task name (must match the data directory name)
_get_default_max_seq_len() -> int: Return the default maximum sequence length
_create_datasets() or _create_test_dataset(): Create and return datasets
- Supervised tasks: Return Tuple[Optional[Dataset], Optional[Dataset], Dataset] (train, validation, test)
- Zero-shot tasks: Return Dataset (test only)
- Important: When creating datasets, you must account for:
  - self.max_sequence_length: Truncate sequences if they exceed this length (use truncate_sequence_from_ends() from gfmbench_api.utils.preprocutils)
  - self.max_num_samples: Limit the number of samples per split if specified (for sanity testing with smaller datasets)
get_conditional_input_meta_data_frame() -> Optional[pd.DataFrame]: Return metadata DataFrame if task uses conditional inputs, otherwise return None

Additional methods for supervised tasks:

_get_num_labels() -> int: Return the number of classification labels

Example: Supervised Single-Sequence Task

from gfmbench_api.tasks.base.base_gfm_supervised_single_seq_task import BaseGFMSupervisedSingleSeqTask
import os
import pandas as pd
import torch
import numpy as np

class MyCustomTask(BaseGFMSupervisedSingleSeqTask):
    def get_task_name(self) -> str:
        return "my_custom_task"
    
    def _get_default_max_seq_len(self) -> int:
        return 512
    
    def _get_num_labels(self) -> int:
        return 2  # Binary classification
    
    def _create_datasets(self):
        data_dir = os.path.join(self.root_data_dir_path, self.get_task_name())
        train_df = pd.read_csv(os.path.join(data_dir, "train.csv"))
        val_df = pd.read_csv(os.path.join(data_dir, "dev.csv")) if os.path.exists(os.path.join(data_dir, "dev.csv")) else None
        test_df = pd.read_csv(os.path.join(data_dir, "test.csv"))
        
        # Data process: Account for self.max_num_samples and self.max_sequence_length
        train_dataset = [(seq, label, np.array([])) for seq, label in zip(train_df['sequence'], train_df['label'])]
        validation_dataset = [(seq, label, np.array([])) for seq, label in zip(val_df['sequence'], val_df['label'])] if val_df is not None else None
        test_dataset = [(seq, label, np.array([])) for seq, label in zip(test_df['sequence'], test_df['label'])]
        
        return train_dataset, validation_dataset, test_dataset
    
    def get_conditional_input_meta_data_frame(self):
        return None

Example: Zero-Shot SNV Task

from gfmbench_api.tasks.base.base_gfm_zeroshot_snv_task import BaseGFMZeroShotSNVTask
import os
import pandas as pd
import numpy as np

class MyZeroShotTask(BaseGFMZeroShotSNVTask):
    def get_task_name(self) -> str:
        return "my_zero_shot_task"
    
    def _get_default_max_seq_len(self) -> int:
        return 1024
    
    def _create_test_dataset(self):
        data_dir = os.path.join(self.root_data_dir_path, self.get_task_name())
        test_df = pd.read_csv(os.path.join(data_dir, "test.csv"))
        
        # Data process: Account for self.max_num_samples and self.max_sequence_length
        test_dataset = [
            (var_seq, ref_seq, label, np.array([]))
            for var_seq, ref_seq, label in zip(test_df['variant_sequence'], test_df['reference_sequence'], test_df['label'])
        ]
        
        return test_dataset
    
    def _get_variant_position_in_sequence(self) -> int:
        return self.max_sequence_length // 2
    
    def get_conditional_input_meta_data_frame(self):
        return None

NOTICE

This project will download and install additional third-party open source software projects, including datasets licensed with non-commercial terms. Review the license terms of these open source projects before use.

Third-party components and attributions: see THIRD_PARTY_NOTICES.md (Python dependencies, Hugging Face assets, reference genomes, and other data URLs). Example: LicenseRef-UCSC-Genome-Browser — https://genome.ucsc.edu/license/

License

NVIDIA-authored code in this repository is licensed under the Apache License, Version 2.0. The full text is in LICENSE.

Each contributed .py file includes NVIDIA SPDX-FileCopyrightText, SPDX-License-Identifier: Apache-2.0, and the short Apache-2.0 notice (through “limitations under the License”), then third-party URL notices scoped to that file (or the line stating that the module does not embed third-party data download URLs). Python package attributions remain in THIRD_PARTY_NOTICES.md. The full Apache-2.0 text is in LICENSE.

Citation

@article{larey2026gfmbench, title={GFMBench-API: A Standardized Interface for Benchmarking Genomic Foundation Models}, author={Ariel Larey, Elay Dahan, Amit Bleiweiss, Raizy Kellerman, Guy Leib, Omri Nayshool, Dan Ofer, Tal Zinger, Dan Dominissini, Gideon Rechavi, Nicole Bussola, Simon Lee, Shane O’Connell, Dung Hoang, Marissa Wirth, Alexander W. Charney, Yoli Shavit, Nati Daniel}, journal={bioRxiv}, pages={2026--02}, year={2026}, publisher={Cold Spring Harbor Laboratory} }

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
gfmbench_api		gfmbench_api
scripts		scripts
usage_examples		usage_examples
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
THIRD_PARTY_NOTICES.md		THIRD_PARTY_NOTICES.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

GFMBench-API

Quick Start

Installation

What’s Included?

Repository Organization

Supported Benchmarks & Tasks

Supervised Classification & Variant Prediction

Zero-Shot Variant Effect Prediction

How Model & Task Interfaces Work

Task API

Model Integration

Example: Running Benchmarks

The usage_examples/run_benchmark.py script

Common CLI Arguments

Key Variables to Adjust in run_benchmark.py

Example usage

Data Preparation

Root Data Directory

Task Data Structure

Dataset Format

Sequence Symbol Conventions

Benchmark Output

Customizing

Example: Minimal Custom Pipeline

Adding Tasks

Task Type Hierarchy

Required Methods to Implement

Example: Supervised Single-Sequence Task

Example: Zero-Shot SNV Task

NOTICE

License

Citation

About

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

The `usage_examples/run_benchmark.py` script

Key Variables to Adjust in `run_benchmark.py`

Packages