STEAM — Search with TEA against Many

STEAM is heavily adapted from Foldseek (van Kempen et al., Nature Biotechnology 2024), replacing Foldseek's 3Di structural alphabet with TEA. Because TEA sequences are generated from amino acid sequences alone, STEAM can be applied to any protein sequence; no 3D structure is required. Like Foldseek, STEAM is built on the MMseqs2 framework.

Requirements

CMake >= 3.15
GCC >= 7 or Clang
For TEA sequence generation: TEA (pip install git+https://github.com/PickyBinders/tea.git)

Installation

# Install build dependencies (if needed)
mamba install -c conda-forge cmake gxx_linux-64

# Build
git clone --recursive https://github.com/PickyBinders/steam.git
cd steam
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j4

The binary will be at build/src/steam.

Quick start

1. Generate TEA sequences

Convert amino acid sequences to the TEA structural alphabet using the tea_convert tool:

tea_convert -f proteins.fasta -o proteins_tea.fasta

This requires a GPU and the TEA package. The output is a FASTA file with TEA sequences in the same order as the input.
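Because the later steps take the TEA and AA FASTA files as a pair, it can help to check that the two files line up before searching. The following is a minimal sketch, not part of STEAM or TEA: `read_fasta` and `check_pairing` are hypothetical helper names. It assumes that tea_convert keeps the input identifiers and that a TEA sequence has one letter per residue; pass `check_length=False` if the second assumption does not hold for your data.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Use the first whitespace-delimited token as the identifier.
            tokens = line[1:].split()
            header, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_pairing(tea_lines: Iterable[str], aa_lines: Iterable[str],
                  check_length: bool = True) -> List[str]:
    """Return a list of problems; an empty list means the files pair up."""
    tea = list(read_fasta(tea_lines))
    aa = list(read_fasta(aa_lines))
    problems = []
    if len(tea) != len(aa):
        problems.append(f"record count differs: {len(tea)} TEA vs {len(aa)} AA")
    for i, ((tea_id, tea_seq), (aa_id, aa_seq)) in enumerate(zip(tea, aa), start=1):
        if tea_id != aa_id:
            problems.append(f"record {i}: id mismatch ({tea_id!r} vs {aa_id!r})")
        elif check_length and len(tea_seq) != len(aa_seq):
            # Assumption: one TEA letter per amino acid residue.
            problems.append(
                f"record {i} ({aa_id}): length differs "
                f"({len(tea_seq)} TEA vs {len(aa_seq)} AA)"
            )
    return problems
```

Typical use would be `check_pairing(open("proteins_tea.fasta"), open("proteins.fasta"))`, treating any returned message as a reason to regenerate the TEA file.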

2. Search

steam easy-search query_tea.fasta query_aa.fasta \
                   target_tea.fasta target_aa.fasta \
                   result.m8 tmp

Useful flags:

Flag	Default	Notes
`-e`	100	E-value threshold
`--max-seqs`	2000	Maximum results per query from prefiltering

3. Cluster

easy-cluster runs cascaded clustering (sensitive) and easy-linclust runs linear-time clustering (faster, less sensitive). Both take paired TEA/AA FASTAs:

# Cascaded clustering
steam easy-cluster proteins_tea.fasta proteins_aa.fasta clusterResult tmp

# Linear-time clustering (large datasets)
steam easy-linclust proteins_tea.fasta proteins_aa.fasta clusterResult tmp

Both commands write three files, named with the clusterResult prefix:

clusterResult_cluster.tsv — <representative> <member> adjacency list
clusterResult_rep_seq.fasta — one AA sequence per cluster representative
clusterResult_all_seqs.fasta — FASTA grouped by cluster
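The cluster TSV can be loaded into a dictionary for downstream analysis. This is a sketch under two assumptions carried over from MMseqs2 conventions rather than stated in this README: the two columns are tab-separated, and each representative also appears as a member of its own cluster. `read_clusters` and `largest_clusters` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def read_clusters(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each representative to the list of its members."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: <representative> and <member> are separated by a tab.
        representative, member = line.split("\t")[:2]
        clusters[representative].append(member)
    return dict(clusters)


def largest_clusters(clusters: Dict[str, List[str]], n: int = 10) -> List[Tuple[str, int]]:
    """Return up to n (representative, size) pairs, largest cluster first."""
    ranked = sorted(clusters.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(representative, len(members)) for representative, members in ranked[:n]]
```

For example, `largest_clusters(read_clusters(open("clusterResult_cluster.tsv")))` lists the ten most populated clusters.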

Useful flags:

Flag	Default	Notes
`--min-seq-id`	0	Minimum sequence identity for cluster members
`-c`	0.8	Minimum coverage
`--cov-mode`	0	0=bidirectional, 1=target, 2=query
`--cluster-reassign`	off	Cascaded only: corrects criteria-violations from cascaded merging
`--single-step-cluster`	off	Cascaded only: skip cascading, single pass
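To make the interaction of `-c` and `--cov-mode` concrete, the sketch below evaluates the coverage criterion for a single alignment. It assumes STEAM inherits the MMseqs2 definition, where coverage is the aligned span divided by the sequence length, and it uses 1-based inclusive coordinates as in the BLAST-tab output. `passes_coverage` is a hypothetical name, and the real clustering applies further criteria beyond this check.

```python
def passes_coverage(qstart: int, qend: int, qlen: int,
                    tstart: int, tend: int, tlen: int,
                    min_cov: float = 0.8, cov_mode: int = 0) -> bool:
    """Return True if an alignment satisfies the coverage threshold."""
    # Fraction of each sequence spanned by the alignment (inclusive coordinates).
    query_cov = (qend - qstart + 1) / qlen
    target_cov = (tend - tstart + 1) / tlen
    if cov_mode == 0:  # bidirectional: both sequences must be covered
        return query_cov >= min_cov and target_cov >= min_cov
    if cov_mode == 1:  # target coverage only
        return target_cov >= min_cov
    if cov_mode == 2:  # query coverage only
        return query_cov >= min_cov
    raise ValueError(f"unsupported cov_mode: {cov_mode}")
```

A short query fully aligned inside a much longer target fails mode 0 and mode 1 but passes mode 2, which is why the choice of mode matters for datasets with fragments or multi-domain proteins.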

Commands

Command	Description
`easy-search`	Search FASTA pairs against FASTA pairs or a pre-built database
`easy-cluster`	Cluster paired TEA/AA FASTAs (cascaded, sensitive)
`easy-linclust`	Cluster paired TEA/AA FASTAs (linear-time, faster)
`createdb`	Create a STEAM database from paired TEA/AA FASTA files
`search`	Search pre-built databases (faster for repeated searches)
`cluster`	Cluster a pre-built database (cascaded)
`linclust`	Cluster a pre-built database (linear-time)
`convertalis`	Convert alignment results to various output formats
`createsubdb`	Subset a STEAM database (keeps `_aa` companion in sync)

Database workflow

To search or cluster the same database repeatedly, build it once and reuse it:

# Create database (one time)
steam createdb target_tea.fasta target_aa.fasta targetDB

# Search against pre-built database (fast, repeatable)
steam easy-search query_tea.fasta query_aa.fasta targetDB result.m8 tmp

# Cluster the pre-built database
steam cluster targetDB clusterDB tmp

Output format

Default BLAST-tab format (same as MMseqs2/BLAST -outfmt 6):

query  target  fident  alnlen  mismatch  gapopen  qstart  qend  tstart  tend  evalue  bits
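The default columns can be read into typed records with a few lines of Python. This is a sketch, assuming tab-separated fields with no header line; `parse_hits` and the column sets are hypothetical names, not part of STEAM. Columns that are not listed in `INT_COLUMNS` or `FLOAT_COLUMNS` are kept as strings.

```python
from typing import Dict, Iterable, Iterator, Sequence, Union

DEFAULT_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]
INT_COLUMNS = {"alnlen", "mismatch", "gapopen", "qstart", "qend", "tstart", "tend"}
FLOAT_COLUMNS = {"fident", "evalue", "bits"}

Hit = Dict[str, Union[str, int, float]]


def parse_hits(lines: Iterable[str],
               columns: Sequence[str] = DEFAULT_COLUMNS) -> Iterator[Hit]:
    """Yield one dictionary per alignment line, converting numeric columns."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != len(columns):
            raise ValueError(
                f"line {number}: expected {len(columns)} fields, got {len(fields)}"
            )
        hit: Hit = {}
        for name, value in zip(columns, fields):
            if name in INT_COLUMNS:
                hit[name] = int(value)
            elif name in FLOAT_COLUMNS:
                hit[name] = float(value)
            else:
                hit[name] = value  # identifiers and any unrecognised column
        yield hit
```

For a custom column layout, pass the same column names in the same order through the `columns` argument.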

Custom output with --format-output supports the following TEA-specific columns:

Column	Description
`tfident`	TEA fractional identity
`tpident`	TEA percent identity
`qteaseq`	Query TEA full sequence
`tteaseq`	Target TEA full sequence
`qteaaln`	Query TEA aligned sequence
`tteaaln`	Target TEA aligned sequence

Standard MMseqs2 output columns (fident, alnlen, qcov, tcov, evalue, raw, bits, etc.) are also available.

Scoring

The alignment score at each position is the sum of:

MATCHA score: substitution score from the TEA alphabet matrix
AA score: BLOSUM62 substitution score, weighted by --aa-weight (default 1.4)
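The two-term sum can be written out as follows. This is an illustrative sketch only: the substitution scores are passed in as plain numbers instead of being looked up in the real TEA and BLOSUM62 matrices, gap penalties are left out, and any integer scaling or rounding done by the actual implementation is not modelled. `position_score` and `alignment_score` are hypothetical names.

```python
from typing import Sequence


def position_score(tea_score: float, aa_score: float, aa_weight: float = 1.4) -> float:
    """Combine the TEA-matrix score and the weighted BLOSUM62 score for one column."""
    return tea_score + aa_weight * aa_score


def alignment_score(tea_scores: Sequence[float], aa_scores: Sequence[float],
                    aa_weight: float = 1.4) -> float:
    """Sum the per-position scores over the aligned columns (gaps not modelled)."""
    if len(tea_scores) != len(aa_scores):
        raise ValueError("need one TEA score and one AA score per aligned position")
    return sum(
        position_score(tea, aa, aa_weight)
        for tea, aa in zip(tea_scores, aa_scores)
    )
```

With `aa_weight=0` the score reduces to the TEA term alone, which is a convenient way to see how much the amino acid term contributes to a given alignment.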

E-value computation

STEAM uses a log-linear E-value model following Edgar & Sahakyan (2025). E-values are computed as:

E(s) = (H/Q) * 10^(m*s + c)

where s is the raw alignment score, H/Q is the average number of reported hits per query (computed at runtime from prefilter results), and m and c are parameters fitted on SCOP40c.
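The formula is easy to evaluate directly, and it can be inverted to find the raw score needed to reach a given E-value. The sketch below assumes nothing beyond the formula above; the values of `m` and `c` used in the usage note are placeholders chosen for illustration, not the parameters fitted on SCOP40c. `evalue` and `score_for_evalue` are hypothetical names.

```python
import math


def evalue(raw_score: float, hits_per_query: float, m: float, c: float) -> float:
    """E(s) = (H/Q) * 10^(m*s + c)."""
    return hits_per_query * 10 ** (m * raw_score + c)


def score_for_evalue(target_evalue: float, hits_per_query: float,
                     m: float, c: float) -> float:
    """Invert the model: the raw score s at which E(s) equals target_evalue."""
    if target_evalue <= 0 or hits_per_query <= 0:
        raise ValueError("E-value and hits per query must be positive")
    if m == 0:
        raise ValueError("m must be non-zero to invert the model")
    return (math.log10(target_evalue / hits_per_query) - c) / m
```

With the placeholder values `m = -0.05` and `c = 1.0` and an average of 50 hits per query, `evalue(100, 50.0, -0.05, 1.0)` evaluates 50 * 10^(-4). Note that `m` must be negative for the E-value to fall as the score rises, and that because H/Q is computed at runtime, the same raw score can map to different E-values in different runs.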
