Skip to content

PickyBinders/tea

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

The Embedded Alphabet (TEA)

Model Architecture

This repository contains the code accompanying our pre-print: Rewriting protein alphabets with language models. A web server with TEA converted datasets is available here.

Installation

python -m pip install git+https://github.com/PickyBinders/tea.git
  • Tested on Python 3.11, 3.12 and 3.13
  • Typical installation time: 2min

Sequence Conversion with TEA

The tea_convert command takes protein sequences from a FASTA file and generates new tea-FASTA. It supports confidence-based sequence output where low-confidence positions are displayed in lowercase, and has options for saving logits and entropy. If --save_avg_entropy is set, the FASTA identifiers will contain the average entropy of the sequence in the format <key>|H=<avg_entropy>.

usage: tea_convert [-h] -f FASTA_FILE -o OUTPUT_FILE [-l] [-H] [-r] [-c] [-t ENTROPY_THRESHOLD]

options:
  -h, --help            show this help message and exit
  -f FASTA_FILE, --fasta_file FASTA_FILE
                        Input FASTA file containing protein amino acid sequences
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Output FASTA file for generated tea sequences
  -l, --save_logits     Save per-residue logits to .pt file
  -H, --save_avg_entropy
                        Save average entropy values in FASTA identifiers
  -r, --save_residue_entropy
                        Save per-residue entropy values to .pt file
  -c, --lowercase_entropy
                        Save residues with entropy > threshold in lowercase
  -t ENTROPY_THRESHOLD, --entropy_threshold ENTROPY_THRESHOLD
                        Entropy threshold for lowercase conversion

Using the huggingface model

from tea.model import Tea
from transformers import AutoTokenizer, AutoModel
from transformers import BitsAndBytesConfig
import torch
import re

tea = Tea.from_pretrained("PickyBinders/tea")
tea.eval()
device = next(tea.parameters()).device
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
esm2 = AutoModel.from_pretrained(
        "facebook/esm2_t33_650M_UR50D",
        torch_dtype="auto",
        quantization_config=bnb_config,
        add_pooling_layer=False,
    ).to(device)
esm2.eval()
sequence_examples = ["PRTEINO", "SEQWENCE"]
sequence_examples = [" ".join(list(re.sub(r"[UZOBJ]", "X", sequence))) for sequence in sequence_examples]
ids = tokenizer(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)
with torch.no_grad():
    x = esm2(
        input_ids=input_ids, attention_mask=attention_mask
    ).last_hidden_state.to(device)
    results = tea.to_sequences(embeddings=x, input_ids=input_ids, return_avg_entropy=True, return_logits=False, return_residue_entropy=False)
results

Search with TEA against Many

In order to perform fast sequence searches and generate alignments, we recommend checking out STEAM. This tool is designed to leverage both TEA representations and standard amino acid information, allowing you to execute comprehensive dual-character sequence screening against large datasets. You can find the repository and usage instructions at github.com/PickyBinders/steam.

About

"Rewriting protein alphabets with language models"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages