GitHub - Aaryesh-AD/esm-embed: Extract sequence embeddings from ESM protein language models with minimal setup.

Fast, multi-layer protein language model embeddings extractor for ESM-2 and ESM-C.

Extract mean-pooled residue embeddings from any ESM model in a single forward pass, with first-class support for multi-layer extraction, bfloat16 weights, Flash Attention 2, and SLURM array jobs.

Note

This library grew out of several personal research projects and working scripts that were consolidated, migrated, compiled, and refactored into one place (hence the quick repo publication time). It is not meant to be a general-purpose ESM SDK; the scope is intentionally narrow: efficient, multi-layer embedding extraction for large-scale representation experiments. If something looks familiar, it probably is. PRs are welcome :)

Models

Model	Family	Layers	Embedding dim	Source
`esm2_8M`	ESM-2	6	320	HuggingFace
`esm2_35M`	ESM-2	12	480	HuggingFace
`esm2_150M`	ESM-2	30	640	HuggingFace
`esm2_650M`	ESM-2	33	1280	HuggingFace
`esmc_300m`	ESM-C	30	960	evolutionaryscale/esm → BioHub
`esmc_600m`	ESM-C	36	1152	evolutionaryscale/esm → BioHub

Note

The initial code was based on the evolutionaryscale/esm SDK. Since the transfer of ESM-C weights to BioHub there may be some divergence. If you encounter any discrepancies please open an issue or submit a pull request.

Installation

Requires Python ≥ 3.13 and uv.

git clone https://github.com/Aaryesh-AD/esm-embed
cd esm-embed
uv sync
source .venv/bin/activate

Verify the install:

# Run using uv
uv run esm-embed

# or directly
esm-embed --help
esm-embed models

No GPU? Everything works on CPU — expect ~50–200× slower embedding. ESM-C weights are downloaded from BioHub on first use and cached under ~/.cache/esm/. No API token is required.

Quick Start

CLI

# Embed a FASTA (default model: esm2_650M, default layer)
esm-embed embed proteins.fasta

# Specific model + layer
esm-embed embed proteins.fasta --model esm2_150M --layer 24 --output out.npy

# Multi-layer ablation (saves one .npy per layer)
esm-embed embed proteins.fasta \
    --model esm2_650M \
    --layers 9,18,27,30,33 \
    --output-dir ./embeddings/

# ESM-C
esm-embed embed proteins.fasta --model esmc_300m --batch-size 8

# bfloat16 + Flash Attention 2 (ESM-2 only)
esm-embed embed proteins.fasta --model esm2_650M --half --sdpa

# List models / inspect ablation layers
esm-embed models
esm-embed info esm2_650M

Python API

from esm_embed import embed, embed_multilayer

seqs = ["ACDEFGHIKLMNPQRSTVWY", "MKTIIALSYIFCLVFA"]

# Single-layer (last recommended layer per model)
embs = embed(seqs, model="esm2_650M")
print(embs.shape)   # (2, 1280)

# Multi-layer: all ablation layers in one forward pass
layer_embs = embed_multilayer(seqs, model="esm2_650M")
# {9: (2, 1280), 18: (2, 1280), 27: (2, 1280), 30: (2, 1280), 33: (2, 1280)}

# Custom layers
layer_embs = embed_multilayer(seqs, model="esm2_650M", layers=[18, 33])

Speed Options (ESM-2)

from esm_embed import ESM2Embedder

embedder = ESM2Embedder(
    "esm2_650M",
    half=True,      # bfloat16 weights: halves VRAM, ~21% faster
    use_sdpa=True,  # Flash Attention 2: 2–4× faster on sequences > 256 aa
)
embs = embedder.embed(seqs, batch_size=32)
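
The "halves VRAM" note for bfloat16 follows from storage size: 2 bytes per parameter instead of 4. A back-of-envelope sketch, assuming the nominal parameter counts implied by the model names; it covers weights only, so activations and framework overhead come on top.

```python
# Nominal parameter counts read off the model names (an approximation).
NOMINAL_PARAMS = {
    "esm2_8M": 8e6,
    "esm2_35M": 35e6,
    "esm2_150M": 150e6,
    "esm2_650M": 650e6,
}
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(model, dtype):
    """Approximate memory taken by the weights alone, in GB (10^9 bytes)."""
    return NOMINAL_PARAMS[model] * BYTES_PER_PARAM[dtype] / 1e9


for model in NOMINAL_PARAMS:
    fp32 = weight_memory_gb(model, "float32")
    bf16 = weight_memory_gb(model, "bfloat16")
    print(f"{model}: {fp32:.2f} GB float32 -> {bf16:.2f} GB bfloat16")
```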

Scripts

Script	Purpose
`scripts/embed_sequences.py`	Embed one FASTA or CSV (any model, any layers)
`scripts/embed_batch.py`	Multi-protein × multi-model batch runner (local GPU, resume-aware)
`scripts/verify_embeddings.py`	Check output shapes, NaN/Inf, completeness
`slurm/embed_array.sbatch`	SLURM array job (one task per protein)
`slurm/submit_all_models.sh`	Submit one array per model to SLURM

Batch Runner (local GPU)

For large-scale runs with many proteins and multiple models:

# proteins.csv must have 'id' and 'filename' columns
uv run python scripts/embed_batch.py \
    --proteins proteins.csv \
    --dms-dir  data/sequences/ \
    --output-dir embeddings/ \
    --mode ablation          # or 'primary' for one layer per model

# Single model, resume from idx 60
uv run python scripts/embed_batch.py \
    --proteins proteins.csv \
    --model esm2_650M \
    --start-idx 60

# Dry-run: see plan without embedding
uv run python scripts/embed_batch.py --proteins proteins.csv --dry-run
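
Because the runner expects 'id' and 'filename' columns, it can help to validate the CSV before launching a long job. A minimal sketch using only the standard library; `read_protein_table` is a hypothetical helper and the example row is made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

REQUIRED_COLUMNS = {"id", "filename"}


def read_protein_table(csv_path):
    """Return the CSV rows as dicts, failing early if a required column is absent."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"{csv_path} is missing columns: {sorted(missing)}")
        return list(reader)


# Demo on a throwaway file with one illustrative row.
with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "proteins.csv"
    table.write_text("id,filename\nGFP_AVIC,GFP_AVIC.fasta\n")
    rows = read_protein_table(table)

print(rows[0]["id"], rows[0]["filename"])
```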

The batch runner:

Loads each model once and streams all proteins through it (vs SLURM, which re-loads per task)
Auto-detects already-done .npy files and skips them (full resume)
Prefetches tokenisation in a background thread while the GPU runs the forward pass
Halves batch size and retries on OOM
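
The halve-and-retry behaviour in the last item can be sketched as follows. This illustrates the general pattern and is not the runner's actual code: `run_with_oom_backoff` and `fake_forward` are hypothetical, and the built-in `MemoryError` stands in for the GPU out-of-memory exception (with PyTorch one would typically pass `torch.cuda.OutOfMemoryError`).

```python
def run_with_oom_backoff(items, batch_size, run_batch, oom_error=MemoryError):
    """Process items in batches, halving the batch size whenever run_batch runs out of memory."""
    results = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(run_batch(batch))
        except oom_error:
            if batch_size == 1:
                raise  # cannot shrink further; surface the error
            batch_size = max(1, batch_size // 2)
            continue  # retry the same slice with the smaller batch
        start += len(batch)
    return results, batch_size


def fake_forward(batch):
    """Toy forward pass that 'runs out of memory' above 4 items."""
    if len(batch) > 4:
        raise MemoryError("simulated OOM")
    return [x * 2 for x in batch]


results, final_batch_size = run_with_oom_backoff(list(range(10)), 16, fake_forward)
print(results, final_batch_size)
```

The reduced batch size is kept for the remaining items, so later batches do not repeat the failed attempt.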

SLURM (HPC Clusters)

Note

The SLURM script is a template based on Georgia Tech's PACE cluster. You may need to modify resource requests, array indexing, and module loading for your HPC environment.

# Ablation run for all 6 models
MODE=ablation bash slurm/submit_all_models.sh

# Monitor
squeue -u $USER

Edit slurm/embed_array.sbatch to set #SBATCH --array=0-N to match your protein count.
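
With one task per protein, each array task has to map its index to a protein. A minimal sketch, assuming zero-based indexing so that N equals the protein count minus one. `SLURM_ARRAY_TASK_ID` is the variable SLURM sets for each array task; the helper names are hypothetical and the template script may resolve the index differently.

```python
import os


def array_spec(n_proteins):
    """Value for '#SBATCH --array=' that covers n_proteins zero-based tasks."""
    if n_proteins < 1:
        raise ValueError("need at least one protein")
    return f"0-{n_proteins - 1}"


def protein_for_task(protein_ids, env=os.environ):
    """Pick the protein for the current array task (defaults to index 0 outside SLURM)."""
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    if not 0 <= task_id < len(protein_ids):
        raise IndexError(f"task {task_id} is outside 0..{len(protein_ids) - 1}")
    return protein_ids[task_id]


ids = ["protA", "protB", "protC"]  # illustrative ids
print(array_spec(len(ids)))
print(protein_for_task(ids, env={"SLURM_ARRAY_TASK_ID": "2"}))
```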

Output Format

All embeddings are saved as .npy files:

{protein_id}_{model}_layer{k}.npy   →   shape (N, D)   dtype float32

where N is the number of sequences and D is the model embedding dimension.

Example directory:

embeddings/
  GFP_AVIC_esm2_650M_layer9.npy    (3809, 1280)
  GFP_AVIC_esm2_650M_layer18.npy   (3809, 1280)
  GFP_AVIC_esm2_650M_layer33.npy   (3809, 1280)

Load with NumPy:

import numpy as np
embs = np.load("embeddings/GFP_AVIC_esm2_650M_layer33.npy")
print(embs.shape)   # (3809, 1280)
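
After loading, a quick sanity check can catch truncated or corrupted outputs. The sketch below is in the spirit of the checks listed for scripts/verify_embeddings.py (shape, dtype, NaN/Inf) but is not that script; `check_embeddings` is a hypothetical helper run here on synthetic arrays.

```python
import numpy as np


def check_embeddings(embs, expected_dim, expected_n=None):
    """Return a list of human-readable problems; an empty list means the array looks fine."""
    if embs.ndim != 2:
        return [f"expected a 2-D array, got {embs.ndim} dimension(s)"]
    problems = []
    if embs.shape[1] != expected_dim:
        problems.append(f"expected embedding dim {expected_dim}, got {embs.shape[1]}")
    if expected_n is not None and embs.shape[0] != expected_n:
        problems.append(f"expected {expected_n} sequences, got {embs.shape[0]}")
    if embs.dtype != np.float32:
        problems.append(f"expected float32, got {embs.dtype}")
    if not np.isfinite(embs).all():
        problems.append("contains NaN or Inf values")
    return problems


good = np.zeros((3, 1280), dtype=np.float32)
bad = good.copy()
bad[0, 0] = np.nan

print(check_embeddings(good, expected_dim=1280, expected_n=3))
print(check_embeddings(bad, expected_dim=1280))
```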

Default Ablation Layer Sets

These are the layer indices extracted when --layers is not specified. They correspond approximately to 25%, 50%, 75%, and 100% of model depth, following Valeriani et al. (NeurIPS 2023); esm2_650M additionally includes layer 30.

Model	Ablation layers
`esm2_8M`	2, 4, 5, 6
`esm2_35M`	4, 8, 10, 12
`esm2_150M`	8, 16, 24, 30
`esm2_650M`	9, 18, 27, 30, 33
`esmc_300m`	7, 14, 21, 29
`esmc_600m`	8, 17, 26, 35

Performance Notes

ESM-2

Optimisation	Speedup	How to enable
bfloat16 weights	~21% faster, half VRAM	`half=True` or `--half`
SDPA / Flash Attention 2	2–4× on seqs > 256 aa	`use_sdpa=True` or `--sdpa`
Tokenisation prefetching	hides CPU bottleneck on small models	automatic in `embed_batch.py`
`torch.compile()`	~20–30% extra (after warm-up)	`compile=True` in batch runner

Recommended for most CUDA setups: half=True + use_sdpa=True.
torch.compile() is disabled by default because SDPA already provides the dominant speedup, and enabling both causes a 6+ min warm-up.

ESM-C

ESM-C embeddings are extracted by calling ESMC.forward() directly with a (B, L) token tensor, bypassing the EvolutionaryScale SDK's encode() + logits() wrapper, which forces batch size 1. This gives ~2.4× speedup at batch size 8 on sequences of ~500 residues.
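
Batching variable-length sequences into a single (B, L) tensor requires padding them to a common length. A minimal, framework-free sketch of that step; the token ids and `pad_id` value below are made up for illustration (a real tokenizer supplies them), and `pad_token_batch` is a hypothetical helper.

```python
def pad_token_batch(token_lists, pad_id):
    """Right-pad each token list to the longest one and return (padded, lengths)."""
    lengths = [len(tokens) for tokens in token_lists]
    max_len = max(lengths)
    padded = [list(tokens) + [pad_id] * (max_len - len(tokens))
              for tokens in token_lists]
    return padded, lengths


# Illustrative ids only: 0 = BOS, 2 = EOS, 1 = PAD, others = residues.
batch = [
    [0, 5, 6, 7, 8, 2],
    [0, 9, 10, 2],
]
padded, lengths = pad_token_batch(batch, pad_id=1)

print(padded)   # every row now has length 6
print(lengths)  # original token counts, needed later to ignore padding
```

The original lengths are kept alongside the padded rows so that padding positions can be excluded when pooling.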

Recommended batch sizes for 8 GB VRAM: esmc_300m: 8 | esmc_600m: 8

Pooling Convention

All embeddings are mean-pooled over residue positions, excluding the BOS (position 0) and EOS tokens, following Valeriani et al. (NeurIPS 2023). Pooling is fully vectorised on the GPU (no Python loop per sequence).
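
The convention can be illustrated with a masked mean. The sketch below uses NumPy in place of GPU tensors and assumes each row is laid out as [BOS, residues..., EOS, padding...]; `mean_pool_residues` is a hypothetical name and this is not the package's implementation.

```python
import numpy as np


def mean_pool_residues(hidden, seq_lengths):
    """Mean over residue positions only.

    hidden:      (B, L, D) per-token hidden states
    seq_lengths: (B,) number of residues per sequence (no BOS/EOS/padding)
    """
    hidden = np.asarray(hidden, dtype=np.float32)
    lengths = np.asarray(seq_lengths)
    positions = np.arange(hidden.shape[1])[None, :]              # (1, L)
    # Residues occupy positions 1..length; position 0 is BOS, length + 1 is EOS.
    mask = (positions >= 1) & (positions <= lengths[:, None])    # (B, L)
    summed = (hidden * mask[:, :, None]).sum(axis=1)             # (B, D)
    return (summed / lengths[:, None]).astype(np.float32)


# Toy batch: residue positions hold 1.0, everything else a large sentinel.
lengths = [4, 2]
hidden = np.full((2, 6, 3), 100.0, dtype=np.float32)
hidden[0, 1:5] = 1.0   # 4 residues
hidden[1, 1:3] = 1.0   # 2 residues, then EOS and padding

pooled = mean_pool_residues(hidden, lengths)
print(pooled.shape, pooled[0], pooled[1])
```

The mask handles the whole batch at once, which is what makes a per-sequence Python loop unnecessary.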

Tests

uv run pytest tests/ -v

Tests use esm2_8M (32 MB) on CPU to keep CI fast. The suite checks:

Output shapes and dtype
No NaN / Inf values
Multi-layer key consistency
Single-layer vs multi-layer numerical agreement

Note

These tests are not representative of a full-scale testing suite. We recommend adding test cases tailored to your specific use case.

Citation

If you use this code in your research, please cite the relevant ESM papers.

ESM-2

@ARTICLE{Lin2023-tw,
  title     = "Evolutionary-scale prediction of atomic-level protein structure
               with a language model",
  author    = "Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and
               Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil,
               Robert and Kabeli, Ori and Shmueli, Yaniv and Dos Santos Costa,
               Allan and Fazel-Zarandi, Maryam and Sercu, Tom and Candido,
               Salvatore and Rives, Alexander",
  journal   = "Science",
  publisher = "American Association for the Advancement of Science (AAAS)",
  volume    =  379,
  number    =  6637,
  pages     = "1123--1130",
  month     =  mar,
  year      =  2023,
  language  = "en"
}

ESM-C

@misc{candido2026language,
  title  = {Language Modeling Materializes a World Model of Protein Biology},
  author = {Candido, Salvatore and Hayes, Thomas and Derry, Alexander and Rao, Roshan
            and Lin, Zeming and Verkuil, Robert and Wu, Bryan and Lee, Jin Sub
            and Bruguera, Elise S. and Keval, Jehan A. and Kopylov, Mykhailo
            and Pak, John E. and Wu, Wesley and Thomas, Neil and Mataraso, Samson
            and Hsu, Alvin and Trotman-Grant, Ashton C. and Fatras, Kilian
            and dos Santos Costa, Allan and Badkundri, Rohil and Ak{\i}n, Halil
            and Oktay, Deniz and Deaton, Jonathan and Montabana, Elizabeth
            and Sitwala, Hrishita and Yu, Yue and Wiggert, Marius
            and Carlin, Dylan Alexander and Goering, Anthony W. and Blazejewski, Tomasz
            and Sandora, McCullen and Hla, Michael and Jia, Tina Z.
            and Kloker, Leon H. and Sofroniew, Nicholas J. and Uehara, Masatoshi
            and Pannu, Jassi and Bachas, Sharrol and Liu, Daniel S.
            and Sercu, Tom and Rives, Alexander},
  year   = {2026},
  url    = {https://biohub.ai/papers/esm_protein.pdf},
  note   = {Preprint}
}

ESM-C weights

@software{evolutionaryscale_2024,
  author = {{EvolutionaryScale Team}},
  title = {evolutionaryscale/esm},
  year = {2024},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.14219303},
  URL = {https://doi.org/10.5281/zenodo.14219303}
}

Pooling convention

@INPROCEEDINGS{Valeriani2023-tr,
  title         = "The geometry of hidden representations of large transformer
                   models",
  author        = "Valeriani, Lucrezia and Doimo, Diego and Cuturello,
                   Francesca and Laio, Alessandro and Ansuini, Alessio and
                   Cazzaniga, Alberto",
  month         =  feb,
  year          =  2023,
  copyright     = "http://creativecommons.org/licenses/by/4.0/",
  archivePrefix = "arXiv",
  primaryClass  = "cs.LG",
  eprint        = "2302.00294"
}

License

Distributed under the MIT License.
