This repository contains the code accompanying our preprint, "Rewriting protein alphabets with language models". A web server with TEA-converted datasets is available here.
python -m pip install git+https://github.com/PickyBinders/tea.git

- Tested on Python 3.11, 3.12 and 3.13
- Typical installation time: 2 min
The tea_convert command reads protein sequences from a FASTA file and writes a new tea-FASTA file. It supports confidence-based sequence output, in which low-confidence positions are written in lowercase, and has options for saving logits and entropy. If --save_avg_entropy is set, each FASTA identifier also carries the average entropy of its sequence, in the format <key>|H=<avg_entropy>.
usage: tea_convert [-h] -f FASTA_FILE -o OUTPUT_FILE [-l] [-H] [-r] [-c] [-t ENTROPY_THRESHOLD]

options:
  -h, --help            show this help message and exit
  -f FASTA_FILE, --fasta_file FASTA_FILE
                        Input FASTA file containing protein amino acid sequences
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Output FASTA file for generated tea sequences
  -l, --save_logits     Save per-residue logits to .pt file
  -H, --save_avg_entropy
                        Save average entropy values in FASTA identifiers
  -r, --save_residue_entropy
                        Save per-residue entropy values to .pt file
  -c, --lowercase_entropy
                        Save residues with entropy > threshold in lowercase
  -t ENTROPY_THRESHOLD, --entropy_threshold ENTROPY_THRESHOLD
                        Entropy threshold for lowercase conversion

from tea.model import Tea
from transformers import AutoTokenizer, AutoModel
from transformers import BitsAndBytesConfig
import torch
import re
tea = Tea.from_pretrained("PickyBinders/tea")
tea.eval()
device = next(tea.parameters()).device
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
esm2 = AutoModel.from_pretrained(
    "facebook/esm2_t33_650M_UR50D",
    torch_dtype="auto",
    quantization_config=bnb_config,
    add_pooling_layer=False,
).to(device)
esm2.eval()
sequence_examples = ["PRTEINO", "SEQWENCE"]
# Replace the residues U, Z, O, B and J with X, then separate residues with spaces
sequence_examples = [" ".join(list(re.sub(r"[UZOBJ]", "X", sequence))) for sequence in sequence_examples]
# Tokenize the batch, padding every sequence to the length of the longest one
ids = tokenizer(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)
with torch.no_grad():
    x = esm2(
        input_ids=input_ids, attention_mask=attention_mask
    ).last_hidden_state.to(device)
    results = tea.to_sequences(embeddings=x, input_ids=input_ids, return_avg_entropy=True, return_logits=False, return_residue_entropy=False)
results

For fast sequence searches and alignments, we recommend checking out STEAM. This tool is designed to leverage both TEA representations and standard amino acid information, allowing you to run comprehensive dual-character sequence screening against large datasets. You can find the repository and usage instructions at github.com/PickyBinders/steam.
