STEAM is heavily adapted from Foldseek (Van Kempen et al., Nature Biotechnology 2024), replacing Foldseek's 3Di structural alphabet with TEA. This means STEAM can be applied to any protein sequence, no 3D structure required. Like Foldseek, STEAM is built on the MMseqs2 framework.
- CMake >= 3.15
- GCC >= 7 or Clang
- For TEA sequence generation: TEA (pip install git+https://github.com/PickyBinders/tea.git)
# Install build dependencies (if needed)
mamba install -c conda-forge cmake gxx_linux-64
# Build
git clone --recursive https://github.com/PickyBinders/steam.git
cd steam
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j4

The binary will be at build/src/steam.
Convert amino acid sequences to the TEA structural alphabet using the tea_convert tool:
tea_convert -f proteins.fasta -o proteins_tea.fasta

This requires a GPU and the TEA package. The output is a FASTA file with TEA sequences in the same order as the input.
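Because every STEAM command takes the TEA and AA FASTAs as a pair, it can be worth checking that the two files really list the same records in the same order before searching. The sketch below is one way to do that. It assumes tea_convert keeps the original FASTA headers, which the text above does not state, and the function names are hypothetical helpers, not part of STEAM or TEA.

```python
def fasta_ids(text):
    """Return record IDs (first word after '>') in file order."""
    ids = []
    for line in text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            ids.append(parts[0] if parts else "")
    return ids


def first_order_mismatch(aa_text, tea_text):
    """Return the index of the first record whose ID differs, or None.

    Assumption: the TEA FASTA carries the same headers as the AA FASTA.
    """
    aa_ids = fasta_ids(aa_text)
    tea_ids = fasta_ids(tea_text)
    for i, (aa_id, tea_id) in enumerate(zip(aa_ids, tea_ids)):
        if aa_id != tea_id:
            return i
    if len(aa_ids) != len(tea_ids):
        # One file has extra records after the shared prefix.
        return min(len(aa_ids), len(tea_ids))
    return None
```

To use it on real files, pass in the contents of proteins.fasta and proteins_tea.fasta read as text. A return value of None means the headers line up.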
steam easy-search query_tea.fasta query_aa.fasta \
target_tea.fasta target_aa.fasta \
result.m8 tmp

Useful flags:
| Flag | Default | Notes |
|---|---|---|
| -e | 100 | E-value threshold |
| --max-seqs | 2000 | Maximum results per query from prefiltering |
easy-cluster runs cascaded clustering (sensitive) and easy-linclust runs linear-time clustering (faster, less sensitive). Both take paired TEA/AA FASTAs:
# Cascaded clustering
steam easy-cluster proteins_tea.fasta proteins_aa.fasta clusterResult tmp
# Linear-time clustering (large datasets)
steam easy-linclust proteins_tea.fasta proteins_aa.fasta clusterResult tmp

Both commands write three output files named after the clusterResult prefix:
- clusterResult_cluster.tsv: `<representative> <member>` adjacency list
- clusterResult_rep_seq.fasta: one AA sequence per cluster representative
- clusterResult_all_seqs.fasta: FASTA grouped by cluster
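The adjacency list in clusterResult_cluster.tsv can be turned into a cluster-to-members mapping with a few lines of Python. This is a minimal sketch that assumes two whitespace-separated columns per line (representative, then member), as described above. `read_clusters` is a hypothetical helper name.

```python
def read_clusters(text):
    """Group members by representative, keeping file order."""
    clusters = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"expected 2 columns, got {len(fields)}: {line!r}")
        representative, member = fields
        clusters.setdefault(representative, []).append(member)
    return clusters


def cluster_sizes(clusters):
    """Return (representative, size) pairs, largest cluster first."""
    return sorted(((rep, len(members)) for rep, members in clusters.items()),
                  key=lambda item: item[1], reverse=True)
```

Reading the whole file as text and passing it to `read_clusters` is enough for moderately sized results. For very large clusterings a streaming variant would be preferable.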
Useful flags:
| Flag | Default | Notes |
|---|---|---|
| --min-seq-id | 0 | Minimum sequence identity for cluster members |
| -c | 0.8 | Minimum coverage |
| --cov-mode | 0 | 0=bidirectional, 1=target, 2=query |
| --cluster-reassign | off | Cascaded only: corrects criteria violations from cascaded merging |
| --single-step-cluster | off | Cascaded only: skip cascading, single pass |
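How -c and --cov-mode interact can be illustrated with a small check. This is only a sketch of the three modes listed in the table, under the assumption that coverage is the number of aligned residues divided by sequence length. How STEAM counts aligned residues internally is not stated here, and `passes_coverage` is a hypothetical name.

```python
def passes_coverage(aligned_q, qlen, aligned_t, tlen, min_cov=0.8, cov_mode=0):
    """Apply a coverage threshold under one of the three coverage modes.

    cov_mode 0: both query and target must be covered (bidirectional)
    cov_mode 1: only the target must be covered
    cov_mode 2: only the query must be covered
    """
    qcov = aligned_q / qlen  # fraction of the query inside the alignment
    tcov = aligned_t / tlen  # fraction of the target inside the alignment
    if cov_mode == 0:
        return qcov >= min_cov and tcov >= min_cov
    if cov_mode == 1:
        return tcov >= min_cov
    if cov_mode == 2:
        return qcov >= min_cov
    raise ValueError(f"unsupported cov_mode: {cov_mode}")
```

For example, a 100-residue query aligned over 90 residues to a 200-residue target passes in mode 2 (query coverage 0.9) but fails in modes 0 and 1 (target coverage 0.45).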
| Command | Description |
|---|---|
| easy-search | Search FASTA pairs against FASTA pairs or a pre-built database |
| easy-cluster | Cluster paired TEA/AA FASTAs (cascaded, sensitive) |
| easy-linclust | Cluster paired TEA/AA FASTAs (linear-time, faster) |
| createdb | Create a STEAM database from paired TEA/AA FASTA files |
| search | Search pre-built databases (faster for repeated searches) |
| cluster | Cluster a pre-built database (cascaded) |
| linclust | Cluster a pre-built database (linear-time) |
| convertalis | Convert alignment results to various output formats |
| createsubdb | Subset a STEAM database (keeps _aa companion in sync) |
For searching or clustering the same database multiple times, pre-build it:
# Create database (one time)
steam createdb target_tea.fasta target_aa.fasta targetDB
# Search against pre-built database (fast, repeatable)
steam easy-search query_tea.fasta query_aa.fasta targetDB result.m8 tmp
# Cluster the pre-built database
steam cluster targetDB clusterDB tmp

Default BLAST-tab format (same as MMseqs2/BLAST -outfmt 6):
query target fident alnlen mismatch gapopen qstart qend tstart tend evalue bits
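The default twelve-column output is easy to load for downstream filtering. The following sketch parses tab-separated result lines into dictionaries using the column names listed above. It assumes tab separation and no header line, and `parse_m8` is a hypothetical helper. The `columns` argument can be swapped for a custom column list when a different output format was requested.

```python
M8_COLUMNS = ("query", "target", "fident", "alnlen", "mismatch", "gapopen",
              "qstart", "qend", "tstart", "tend", "evalue", "bits")
INT_COLUMNS = {"alnlen", "mismatch", "gapopen", "qstart", "qend", "tstart", "tend"}
FLOAT_COLUMNS = {"fident", "evalue", "bits"}


def parse_m8(text, columns=M8_COLUMNS):
    """Parse tab-separated hit lines into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
        row = {}
        for name, value in zip(columns, fields):
            if name in INT_COLUMNS:
                row[name] = int(value)
            elif name in FLOAT_COLUMNS:
                row[name] = float(value)
            else:
                row[name] = value  # IDs and sequences stay as strings
        rows.append(row)
    return rows


def best_hit_per_query(rows):
    """Keep the lowest-E-value hit for each query."""
    best = {}
    for row in rows:
        current = best.get(row["query"])
        if current is None or row["evalue"] < current["evalue"]:
            best[row["query"]] = row
    return best
```

Columns not listed in `INT_COLUMNS` or `FLOAT_COLUMNS` are kept as strings, so TEA sequence columns pass through unchanged.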
Custom output with --format-output adds TEA-specific output columns:
| Column | Description |
|---|---|
| tfident | TEA fractional identity |
| tpident | TEA percent identity |
| qteaseq | Query TEA full sequence |
| tteaseq | Target TEA full sequence |
| qteaaln | Query TEA aligned sequence |
| tteaaln | Target TEA aligned sequence |
Standard MMseqs2 output columns (fident, alnlen, qcov, tcov, evalue, raw, bits, etc.) are also available.
The alignment score at each position is the sum of:
- MATCHA score: substitution score from the TEA alphabet matrix
- AA score: BLOSUM62 substitution score, weighted by --aa-weight (default 1.4)
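The two-part score above can be written out as a short sketch. The TEA matrix values and the letters "a" and "b" below are invented purely for illustration, since the real TEA substitution matrix is not reproduced here. The three amino acid entries are standard BLOSUM62 values. The function names are hypothetical, and the sketch assumes the TEA and AA sequences are paired position by position, ignoring gaps.

```python
# Hypothetical TEA scores for illustration only (not the real matrix).
TOY_TEA_MATRIX = {("a", "a"): 6, ("a", "b"): -2, ("b", "b"): 5}
# A three-entry subset of BLOSUM62.
BLOSUM62_SUBSET = {("A", "A"): 4, ("A", "R"): -1, ("R", "R"): 5}


def lookup(matrix, x, y):
    """Symmetric lookup in a matrix that stores each pair once."""
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]


def position_score(tea_q, tea_t, aa_q, aa_t, tea_matrix, aa_matrix, aa_weight=1.4):
    """TEA substitution score plus weighted amino acid substitution score."""
    return lookup(tea_matrix, tea_q, tea_t) + aa_weight * lookup(aa_matrix, aa_q, aa_t)


def ungapped_score(q_tea, t_tea, q_aa, t_aa, tea_matrix, aa_matrix, aa_weight=1.4):
    """Sum per-position scores over an ungapped alignment of equal-length strings."""
    if not (len(q_tea) == len(t_tea) == len(q_aa) == len(t_aa)):
        raise ValueError("all four sequences must have the same length")
    return sum(
        position_score(tq, tt, aq, at, tea_matrix, aa_matrix, aa_weight)
        for tq, tt, aq, at in zip(q_tea, t_tea, q_aa, t_aa)
    )
```

With these toy values, a position that matches in both alphabets ("a"/"a" and A/A) scores 6 + 1.4 × 4 = 11.6, which shows how --aa-weight scales the amino acid contribution relative to the TEA contribution.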
STEAM uses a log-linear E-value model following Edgar & Sahakyan (2025). E-values are computed as:
E(s) = (H/Q) * 10^(m*s + c)
where s is the raw alignment score, H/Q is the average number of reported hits per query (computed at runtime from prefilter results), and m and c are parameters fitted on SCOP40c.
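As a worked illustration of the formula, the sketch below evaluates E(s) for given parameters. The fitted values of m and c are not listed in this document, so they are passed in as arguments, and the numbers in the usage note are placeholders rather than STEAM's actual parameters. The function names are hypothetical.

```python
def hits_per_query(total_hits, num_queries):
    """Average number of reported hits per query (the H/Q term)."""
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    return total_hits / num_queries


def evalue(score, h_over_q, m, c):
    """Log-linear E-value: E(s) = (H/Q) * 10^(m*s + c)."""
    return h_over_q * 10.0 ** (m * score + c)
```

With placeholder parameters m = -0.1 and c = 2, a raw score of 100 and H/Q = 2 gives E = 2 × 10^(-10 + 2) = 2 × 10^-8. A negative m means that higher scores yield smaller E-values, as expected.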