esm-embed


Fast, multi-layer embedding extractor for the ESM-2 and ESM-C protein language models.

Extract mean-pooled residue embeddings from any ESM model in a single forward pass, with first-class support for multi-layer extraction, bfloat16 weights, Flash Attention 2, and SLURM array jobs.

Note

This library grew out of several personal research projects and working scripts that were consolidated, migrated, and refactored into one place (hence the quick repo publication time). It is not meant to be a general-purpose ESM SDK; the scope is intentionally narrow: efficient, multi-layer embedding extraction for large-scale representation experiments. If something looks familiar, it probably is. PRs are welcome :)


Models

| Model | Family | Layers | Embedding dim | Source |
|---|---|---|---|---|
| esm2_8M | ESM-2 | 6 | 320 | HuggingFace |
| esm2_35M | ESM-2 | 12 | 480 | HuggingFace |
| esm2_150M | ESM-2 | 30 | 640 | HuggingFace |
| esm2_650M | ESM-2 | 33 | 1280 | HuggingFace |
| esmc_300m | ESM-C | 30 | 960 | evolutionaryscale/esm, BioHub |
| esmc_600m | ESM-C | 36 | 1152 | evolutionaryscale/esm, BioHub |

Note

The initial code was based on the evolutionaryscale/esm SDK. Since the transfer of ESM-C weights to BioHub there may be some divergence. If you encounter any discrepancies please open an issue or submit a pull request.


Installation

Requires Python ≥ 3.13 and uv.

git clone https://github.com/Aaryesh-AD/esm-embed
cd esm-embed
uv sync
source .venv/bin/activate

Verify the install:

# Run using uv
uv run esm-embed

# or directly, inside the activated virtual environment
esm-embed --help
esm-embed models

No GPU? Everything works on CPU, but expect embedding to be ~50–200× slower. ESM-C weights are downloaded from BioHub on first use and cached under ~/.cache/esm/. No API token is required.


Quick Start

CLI

# Embed a FASTA (default model: esm2_650M, default layer)
esm-embed embed proteins.fasta

# Specific model + layer
esm-embed embed proteins.fasta --model esm2_150M --layer 24 --output out.npy

# Multi-layer ablation (saves one .npy per layer)
esm-embed embed proteins.fasta \
    --model esm2_650M \
    --layers 9,18,27,30,33 \
    --output-dir ./embeddings/

# ESM-C
esm-embed embed proteins.fasta --model esmc_300m --batch-size 8

# bfloat16 + Flash Attention 2 (ESM-2 only)
esm-embed embed proteins.fasta --model esm2_650M --half --sdpa

# List models / inspect ablation layers
esm-embed models
esm-embed info esm2_650M

Python API

from esm_embed import embed, embed_multilayer

seqs = ["ACDEFGHIKLMNPQRSTVWY", "MKTIIALSYIFCLVFA"]

# Single-layer (last recommended layer per model)
embs = embed(seqs, model="esm2_650M")
print(embs.shape)   # (2, 1280)

# Multi-layer: all ablation layers in one forward pass
layer_embs = embed_multilayer(seqs, model="esm2_650M")
# {9: (2, 1280), 18: (2, 1280), 27: (2, 1280), 30: (2, 1280), 33: (2, 1280)}

# Custom layers
layer_embs = embed_multilayer(seqs, model="esm2_650M", layers=[18, 33])

Speed Options (ESM-2)

from esm_embed import ESM2Embedder

embedder = ESM2Embedder(
    "esm2_650M",
    half=True,      # bfloat16 weights: halves VRAM, ~21% faster
    use_sdpa=True,  # Flash Attention 2: 2–4× faster on sequences > 256 aa
)
embs = embedder.embed(seqs, batch_size=32)

Scripts

| Script | Purpose |
|---|---|
| scripts/embed_sequences.py | Embed one FASTA or CSV (any model, any layers) |
| scripts/embed_batch.py | Multi-protein × multi-model batch runner (local GPU, resume-aware) |
| scripts/verify_embeddings.py | Check output shapes, NaN/Inf, completeness |
| slurm/embed_array.sbatch | SLURM array job (one task per protein) |
| slurm/submit_all_models.sh | Submit one array per model to SLURM |

Batch Runner (local GPU)

For large-scale runs with many proteins and multiple models:

# proteins.csv must have 'id' and 'filename' columns
uv run python scripts/embed_batch.py \
    --proteins proteins.csv \
    --dms-dir  data/sequences/ \
    --output-dir embeddings/ \
    --mode ablation          # or 'primary' for one layer per model

# Single model, resume from idx 60
uv run python scripts/embed_batch.py \
    --proteins proteins.csv \
    --model esm2_650M \
    --start-idx 60

# Dry-run: see plan without embedding
uv run python scripts/embed_batch.py --proteins proteins.csv --dry-run

The batch runner:

  • Loads each model once and streams all proteins through it (vs SLURM, which re-loads per task)
  • Auto-detects already-done .npy files and skips them (full resume)
  • Prefetches tokenisation in a background thread while the GPU runs the forward pass
  • Halves batch size and retries on OOM
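
The halve-and-retry behaviour in the last bullet can be sketched as follows. This is a minimal illustration and not the code in scripts/embed_batch.py: `embed_with_backoff`, `OutOfMemory`, and `fake_embed` are hypothetical names, and with PyTorch one would typically catch `torch.cuda.OutOfMemoryError` instead of the stand-in exception.

```python
from typing import Callable, List, Sequence


class OutOfMemory(RuntimeError):
    """Hypothetical stand-in for a GPU out-of-memory error."""


def embed_with_backoff(
    seqs: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[list]],
    batch_size: int = 32,
) -> List[list]:
    """Embed seqs in batches, halving the batch size whenever embed_fn runs out of memory."""
    results: List[list] = []
    start = 0
    while start < len(seqs):
        batch = seqs[start:start + batch_size]
        try:
            results.extend(embed_fn(batch))
        except OutOfMemory:
            if batch_size == 1:
                raise  # cannot shrink further; surface the error
            batch_size //= 2  # retry the same window with a smaller batch
            continue
        start += len(batch)
    return results


def fake_embed(batch: Sequence[str]) -> List[list]:
    """Toy embedder that only 'fits' four sequences at a time."""
    if len(batch) > 4:
        raise OutOfMemory("batch too large")
    return [[float(len(s))] for s in batch]


out = embed_with_backoff(["A" * n for n in range(1, 11)], fake_embed, batch_size=16)
print(len(out))  # 10
```

The reduced batch size is kept for the rest of the run in this sketch, which avoids hitting the same error again on every later batch.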

SLURM (HPC Clusters)

Note

The SLURM script is a template based on Georgia Tech's PACE cluster. You may need to modify resource requests, array indexing, and module loading for your HPC environment.

# Ablation run for all 6 models
MODE=ablation bash slurm/submit_all_models.sh

# Monitor
squeue -u $USER

Edit slurm/embed_array.sbatch to set #SBATCH --array=0-N to match your protein count.
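
A small helper can derive the array range from the protein list. The sketch below assumes that task indices are zero-based and map one-to-one onto CSV rows, so N is the row count minus one; check this against your copy of slurm/embed_array.sbatch. The function name and the example rows (other than the GFP_AVIC id) are hypothetical.

```python
import csv
import io


def slurm_array_directive(csv_text: str) -> str:
    """Return the #SBATCH --array line for a proteins CSV with 'id' and 'filename' columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {"id", "filename"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"proteins CSV is missing columns: {sorted(missing)}")
    n_proteins = sum(1 for _ in reader)
    if n_proteins == 0:
        raise ValueError("proteins CSV has no rows")
    # Assumption: one task per protein, zero-based, so the last index is count - 1.
    return f"#SBATCH --array=0-{n_proteins - 1}"


# Hypothetical CSV contents for illustration only.
example = "id,filename\nGFP_AVIC,gfp.fasta\nprotein_b,b.fasta\nprotein_c,c.fasta\n"
directive = slurm_array_directive(example)
print(directive)  # #SBATCH --array=0-2
```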


Output Format

All embeddings are saved as .npy files:

{protein_id}_{model}_layer{k}.npy   →   shape (N, D)   dtype float32

where N is the number of sequences and D is the model embedding dimension.

Example directory:

embeddings/
  GFP_AVIC_esm2_650M_layer9.npy    (3809, 1280)
  GFP_AVIC_esm2_650M_layer18.npy   (3809, 1280)
  GFP_AVIC_esm2_650M_layer33.npy   (3809, 1280)
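
Because file names follow a fixed pattern, a resume check can be as simple as listing the expected files that are not on disk yet. The sketch below is an assumption about how such a check could look, not the logic in scripts/embed_batch.py; `missing_outputs` is a hypothetical helper, and it only tests for existence, not for valid contents.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def missing_outputs(
    output_dir: Path, protein_id: str, model: str, layers: Iterable[int]
) -> List[Path]:
    """List expected {protein_id}_{model}_layer{k}.npy files that do not exist yet."""
    expected = [output_dir / f"{protein_id}_{model}_layer{k}.npy" for k in layers]
    return [p for p in expected if not p.is_file()]


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    # Pretend layer 9 was already written by an earlier run (empty placeholder file).
    (out_dir / "GFP_AVIC_esm2_650M_layer9.npy").touch()
    todo = [p.name for p in missing_outputs(out_dir, "GFP_AVIC", "esm2_650M", [9, 18, 33])]

print(todo)  # ['GFP_AVIC_esm2_650M_layer18.npy', 'GFP_AVIC_esm2_650M_layer33.npy']
```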

Load with NumPy:

import numpy as np
embs = np.load("embeddings/GFP_AVIC_esm2_650M_layer33.npy")
print(embs.shape)   # (3809, 1280)
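
Before using a file downstream, it can be worth checking it against the format described above (2-D, float32, expected embedding dimension, no NaN or Inf). The following is a minimal sketch of such a check using NumPy only; `check_embedding_file` is a hypothetical helper and is not the implementation of scripts/verify_embeddings.py.

```python
import os
import tempfile

import numpy as np


def check_embedding_file(path: str, expected_dim: int) -> list:
    """Return a list of problems found in one .npy embedding file (empty if OK)."""
    problems = []
    embs = np.load(path)
    if embs.ndim != 2:
        problems.append(f"expected a 2-D array, got shape {embs.shape}")
    elif embs.shape[1] != expected_dim:
        problems.append(f"expected embedding dim {expected_dim}, got {embs.shape[1]}")
    if embs.dtype != np.float32:
        problems.append(f"expected float32, got {embs.dtype}")
    if not np.isfinite(embs).all():
        problems.append("array contains NaN or Inf values")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.npy")
    bad = os.path.join(tmp, "bad.npy")
    np.save(good, np.zeros((4, 1280), dtype=np.float32))
    broken = np.zeros((4, 1280), dtype=np.float32)
    broken[0, 0] = np.nan  # inject one invalid value
    np.save(bad, broken)
    good_report = check_embedding_file(good, expected_dim=1280)
    bad_report = check_embedding_file(bad, expected_dim=1280)

print(good_report)  # []
print(bad_report)   # ['array contains NaN or Inf values']
```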

Default Ablation Layer Sets

These layer indices are extracted when --layers is not specified. They correspond to roughly 25%, 50%, 75%, and 100% of model depth, following Valeriani et al. (NeurIPS 2023).

| Model | Ablation layers |
|---|---|
| esm2_8M | 2, 4, 5, 6 |
| esm2_35M | 4, 8, 10, 12 |
| esm2_150M | 8, 16, 24, 30 |
| esm2_650M | 9, 18, 27, 30, 33 |
| esmc_300m | 7, 14, 21, 29 |
| esmc_600m | 8, 17, 26, 35 |
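
To see where each default layer sits relative to total depth, the two tables in this README can be combined as below. The dictionaries are copied from the Models and ablation tables; the helper `relative_depths` is hypothetical, and the sketch treats every index as a plain fraction of the layer count without assuming anything about how the library numbers ESM-C layers.

```python
# Layer counts from the Models table.
NUM_LAYERS = {
    "esm2_8M": 6, "esm2_35M": 12, "esm2_150M": 30,
    "esm2_650M": 33, "esmc_300m": 30, "esmc_600m": 36,
}

# Default ablation layers from the table above.
ABLATION_LAYERS = {
    "esm2_8M": [2, 4, 5, 6],
    "esm2_35M": [4, 8, 10, 12],
    "esm2_150M": [8, 16, 24, 30],
    "esm2_650M": [9, 18, 27, 30, 33],
    "esmc_300m": [7, 14, 21, 29],
    "esmc_600m": [8, 17, 26, 35],
}


def relative_depths(model: str) -> list:
    """Return each default ablation layer as a fraction of the model's depth."""
    depth = NUM_LAYERS[model]
    return [round(k / depth, 2) for k in ABLATION_LAYERS[model]]


for name in NUM_LAYERS:
    print(name, relative_depths(name))
# e.g. esm2_650M [0.27, 0.55, 0.82, 0.91, 1.0]
```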

Performance Notes

ESM-2

| Optimisation | Speedup | How to enable |
|---|---|---|
| bfloat16 weights | ~21% faster, half VRAM | half=True or --half |
| SDPA / Flash Attention 2 | 2–4× on seqs > 256 aa | use_sdpa=True or --sdpa |
| Tokenisation prefetching | hides CPU bottleneck on small models | automatic in embed_batch.py |
| torch.compile() | ~20–30% extra (after warm-up) | compile=True in batch runner |

Recommended for most CUDA setups: half=True + use_sdpa=True.
torch.compile() is disabled by default because SDPA already provides the dominant speedup, and enabling both causes a 6+ min warm-up.

ESM-C

ESM-C embeddings are extracted by calling ESMC.forward() directly with a (B, L) token tensor, bypassing the EvolutionaryScale SDK's encode() + logits() wrapper which forces batch size 1. This gives ~2.4× speedup at batch size 8 on sequences of ~500 residues.
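
Batching variable-length sequences into a single (B, L) tensor requires padding. The sketch below shows only that padding step in plain Python; it does not call the ESM-C model, `pad_token_batch` is a hypothetical helper, and `PAD_ID` is a placeholder value that should be replaced by the tokenizer's real padding id.

```python
from typing import List, Sequence, Tuple

PAD_ID = 1  # hypothetical padding token id; use the tokenizer's actual value


def pad_token_batch(
    token_ids: Sequence[Sequence[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad variable-length token id lists into a (B, L) batch plus a 0/1 mask."""
    max_len = max(len(ids) for ids in token_ids)
    batch = [list(ids) + [pad_id] * (max_len - len(ids)) for ids in token_ids]
    mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in token_ids]
    return batch, mask


# Two toy tokenised sequences of different lengths (ids are illustrative only).
batch, mask = pad_token_batch([[0, 5, 6, 7, 2], [0, 8, 2]])
print(batch)  # [[0, 5, 6, 7, 2], [0, 8, 2, 1, 1]]
print(mask)   # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The nested lists can be converted to a tensor before the forward pass, and the mask marks which positions hold real tokens so that padding can be ignored when pooling.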

Recommended batch sizes for 8 GB VRAM: esmc_300m: 8  |  esmc_600m: 8


Pooling Convention

All embeddings are mean-pooled over residue positions, excluding the BOS (position 0) and EOS tokens, following Valeriani et al. (NeurIPS 2023). Pooling is fully vectorised on the GPU (no Python loop per sequence).
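
The convention can be illustrated with a NumPy sketch. It is not the library's GPU code: `mean_pool_residues` is a hypothetical function, and it assumes each row is laid out as BOS, residues, EOS, then padding, with `lengths` giving the residue count per sequence.

```python
import numpy as np


def mean_pool_residues(hidden: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    """Mean-pool (B, L, D) hidden states over residues only.

    Assumed layout per row: BOS at position 0, residues at 1..length,
    EOS at length + 1, padding afterwards.
    """
    hidden = np.asarray(hidden, dtype=np.float32)
    lengths = np.asarray(lengths)
    positions = np.arange(hidden.shape[1])[None, :]               # (1, L)
    # True only at residue positions; BOS, EOS, and padding are masked out.
    mask = (positions >= 1) & (positions <= lengths[:, None])     # (B, L)
    summed = (hidden * mask[:, :, None]).sum(axis=1)              # (B, D)
    return summed / lengths[:, None].astype(np.float32)


# Toy input: the hidden state at position t is filled with the value t.
hidden = np.tile(np.arange(5, dtype=np.float32)[None, :, None], (2, 1, 2))  # (2, 5, 2)
lengths = np.array([3, 1])  # first sequence has 3 residues, second has 1
pooled = mean_pool_residues(hidden, lengths)
print(pooled)  # [[2. 2.] [1. 1.]]
```

The mask replaces a per-sequence Python loop, which is the same idea as the vectorised pooling described above.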


Tests

uv run pytest tests/ -v

Tests use esm2_8M (32 MB) on CPU to keep CI fast. The suite checks:

  • Output shapes and dtype
  • No NaN / Inf values
  • Multi-layer key consistency
  • Single-layer vs multi-layer numerical agreement

Note

These tests are not representative of a full-scale testing suite. We recommend adding test cases tailored to your specific use case.


Citation

If you use this code in your research, please cite the relevant ESM papers.

ESM-2
@ARTICLE{Lin2023-tw,
  title     = "Evolutionary-scale prediction of atomic-level protein structure
               with a language model",
  author    = "Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and
               Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil,
               Robert and Kabeli, Ori and Shmueli, Yaniv and Dos Santos Costa,
               Allan and Fazel-Zarandi, Maryam and Sercu, Tom and Candido,
               Salvatore and Rives, Alexander",
  journal   = "Science",
  publisher = "American Association for the Advancement of Science (AAAS)",
  volume    =  379,
  number    =  6637,
  pages     = "1123--1130",
  month     =  mar,
  year      =  2023,
  language  = "en"
}

ESM-C
@misc{candido2026language,
  title  = {Language Modeling Materializes a World Model of Protein Biology},
  author = {Candido, Salvatore and Hayes, Thomas and Derry, Alexander and Rao, Roshan
            and Lin, Zeming and Verkuil, Robert and Wu, Bryan and Lee, Jin Sub
            and Bruguera, Elise S. and Keval, Jehan A. and Kopylov, Mykhailo
            and Pak, John E. and Wu, Wesley and Thomas, Neil and Mataraso, Samson
            and Hsu, Alvin and Trotman-Grant, Ashton C. and Fatras, Kilian
            and dos Santos Costa, Allan and Badkundri, Rohil and Ak{\i}n, Halil
            and Oktay, Deniz and Deaton, Jonathan and Montabana, Elizabeth
            and Sitwala, Hrishita and Yu, Yue and Wiggert, Marius
            and Carlin, Dylan Alexander and Goering, Anthony W. and Blazejewski, Tomasz
            and Sandora, McCullen and Hla, Michael and Jia, Tina Z.
            and Kloker, Leon H. and Sofroniew, Nicholas J. and Uehara, Masatoshi
            and Pannu, Jassi and Bachas, Sharrol and Liu, Daniel S.
            and Sercu, Tom and Rives, Alexander},
  year   = {2026},
  url    = {https://biohub.ai/papers/esm_protein.pdf},
  note   = {Preprint}
}

ESM-C weights
@software{evolutionaryscale_2024,
  author = {{EvolutionaryScale Team}},
  title = {evolutionaryscale/esm},
  year = {2024},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.14219303},
  URL = {https://doi.org/10.5281/zenodo.14219303}
}

Pooling convention
@INPROCEEDINGS{Valeriani2023-tr,
  title         = "The geometry of hidden representations of large transformer
                   models",
  author        = "Valeriani, Lucrezia and Doimo, Diego and Cuturello,
                   Francesca and Laio, Alessandro and Ansuini, Alessio and
                   Cazzaniga, Alberto",
  month         =  feb,
  year          =  2023,
  copyright     = "http://creativecommons.org/licenses/by/4.0/",
  archivePrefix = "arXiv",
  primaryClass  = "cs.LG",
  eprint        = "2302.00294"
}

License

Distributed under the MIT License.
