Companion repository for:
- Shen, Mingzhe, Dayhoff II, Guy W., et al. "Protein Electrostatic Properties are Fine-Tuned Through Evolution." bioRxiv (2025): 2025-09. (https://www.biorxiv.org/content/10.1101/2025.04.17.649309v1)
If you use the KaML models, command-line interface, or precomputed human-wide pKa predictions in your work, please cite:
Shen, M., Dayhoff II, G.W., et al., 2025. Protein Electrostatic
Properties are Fine-Tuned Through Evolution. bioRxiv (2025)
Shen, Mingzhe, et al. "KaMLs for Predicting Protein pKa Values and
Ionization States: Are Trees All You Need?." JCTC 21.3 (2025): 1446-1458.
BibTeX:
@article{shen2025kamlesm,
title={Protein Electrostatic Properties are Fine-Tuned Through Evolution},
author={Shen, Mingzhe and Dayhoff II, Guy W and Shen, Jana},
journal={bioRxiv},
pages={2025--04},
year={2025},
publisher={Cold Spring Harbor Laboratory}
}
@article{shen2025kamls,
title={KaMLs for Predicting Protein pKa Values and Ionization States: Are Trees All You Need?},
author={Shen, Mingzhe and Kortzak, Daniel and Ambrozak, Simon and Bhatnagar, Shubham and Buchanan, Ian and Liu, Ruibin and Shen, Jana},
journal={Journal of Chemical Theory and Computation},
volume={21},
number={3},
pages={1446--1458},
year={2025},
publisher={ACS Publications}
}
License Content © 2025 Mingzhe Shen, Guy W. Dayhoff II and Jana Shen, licensed under CC BY-NC 4.0.
Unless otherwise noted, this repository’s content is © 2025 Guy W. Dayhoff II (on behalf of all authors) and is licensed under Creative Commons Attribution–NonCommercial 4.0 International (CC BY-NC 4.0).
Plain English: You may copy, adapt, and share the repository’s content for non-commercial purposes as long as you provide proper attribution. For any commercial use, please contact the authors for permission.
We provide a browser-based KaML interface for both on-demand predictions and interactive exploration of a precomputed human proteome atlas:
Home: https://kaml.computchem.org
Atlas: https://kaml.computchem.org/human
The web tools run the same KaML models described in the manuscript and
use ESM-2 (esm2_t33_650M_UR50D) and/or ESM-C (esmc-6b-2024-12) via
the EvolutionaryScale Forge API for protein embeddings where on-demand
inference is required.
Web-based inference: https://kaml.computchem.org
The web-based interface provides a simple front end to the KaML inference pipeline for user-supplied sequences.
Requirements for KaML-ESM2:
- A modern web browser.
Requirements for KaML-ESMC:
- An active ESM Forge API token issued by EvolutionaryScale.
The KaML web interface does not issue ESM Forge tokens. Each user must obtain and manage their own token directly from EvolutionaryScale.
Obtaining an ESM Forge API token:
- Visit the EvolutionaryScale Forge site: https://forge.evolutionaryscale.ai
- Sign up or sign in with your account.
- In the leftmost menu, under the API header, select 'API Keys'.
- In the textbox with the 'API Key Name' placeholder text, name your key, e.g. esm.
- Create a new Forge API token and copy the token string.
Treat this token as a secret (similar to a password); do not publish or commit it.
Using the token in the KaML web interface:
- Open: https://kaml.computchem.org
- Select ESMC for either channel, i.e. Acidic or Basic.
- Enter your protein sequence, UniProtID, or PDBID-CHAINID into the input box.
- Paste your ESM Forge API token into the token field.
- Submit to run predictions.
The Forge token is used only to obtain ESM-C embeddings for your sequences via the Forge API; KaML then applies the pre-trained KaML-ESMC heads to produce residue-level predictions.
Please follow EvolutionaryScale’s terms of use and your institution’s policies when requesting, storing, and using Forge API tokens.
The /human view provides interactive access to a precomputed
KaML “human atlas” of predictions across the human proteome.
Key points:
- Predictions in this atlas were precomputed using the KaML-ESM2 models and ESM-2 embeddings described in the manuscript.
- Exploration of the atlas (searching, browsing, viewing scores) does not require an ESM Forge token, because no new embeddings or model inference are run client-side.
- The atlas is intended as a convenient starting point for exploring residue-level predictions without installing local software.
Typical usage:
- Open: https://kaml.computchem.org/human
- Search or browse for a protein of interest by UniProtKB accession or gene name.
- Inspect the per-residue KaML-ESM2 predictions.
The precomputed human proteome-wide atlas integrates three key types of information, which correspond to sections in the web interface:
Command-line interface for running KaML-ESM2, KaML-ESMC, and KaML-CBT2 residue-level predictions on protein sequences using either ESM-2 or ESM-C embeddings and pre-trained weights.
- kamlCLI.py: Main command-line interface.
- env/wizard.sh: Simple installation wizard that creates a Python virtual environment, installs dependencies, and downloads pretrained weights from Zenodo.
- env/wts/: Default location for model weight directories. The CLI expects task-specific subdirectories here (i.e. esm2, esmC, CBtree2).
- NVIDIA GPU with >= 24GB VRAM (e.g. RTX 4090)
- 40 GB disk space
- 128 GB system memory
- Python 3.10 or newer
- POSIX-like environment (Linux / macOS)
- Packages: $(project_root)/env/KaML-ESM_env.txt
You can install these manually:
pip install -r $(path_to)/KaML-ESM_env.txt
Note: if you install manually, you must add $(project_root)/bin to your PATH variable.
Alternatively (recommended), use the provided wizard.
Clone the repository:
git clone https://github.com/JanaShenLab/KaML-ESM.git
cd KaML-ESM
Run the installation wizard:
bash env/wizard.sh
The wizard:
- creates a virtual environment named "KaML"
- activates it
- upgrades pip
- installs core Python dependencies
- is the place to add commands to download weights from Zenodo into env/wts/
To re-activate the environment later:
source activate
By default, kamlCLI.py looks for model weights under:
env/wts/
with task-specific subdirectories (for example):
- env/wts/esm2/acidic
- env/wts/esm2/basic
- env/wts/esmC/acidic
- env/wts/esmC/basic
- env/wts/CBtree2
You can override the root directory for weights with:
export KAML_WTS_DIR=$(path_to)/wts
and the CLI will use that directory instead of env/wts/.
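The lookup described above can be checked before a long run. The following is a minimal sketch, not part of kamlCLI.py: the helper names `resolve_weights_root` and `missing_weight_dirs` are hypothetical, and the list of expected subdirectories is taken from the examples in this README rather than from the CLI source.

```python
import os
from pathlib import Path

# Subdirectories named in this README; the real CLI may expect others.
EXPECTED_SUBDIRS = (
    "esm2/acidic",
    "esm2/basic",
    "esmC/acidic",
    "esmC/basic",
    "CBtree2",
)


def resolve_weights_root(project_root=".", environ=None):
    """Return KAML_WTS_DIR if it is set and non-empty, else <project_root>/env/wts."""
    environ = os.environ if environ is None else environ
    override = environ.get("KAML_WTS_DIR")
    if override:
        return Path(override).expanduser()
    return Path(project_root) / "env" / "wts"


def missing_weight_dirs(weights_root):
    """List the expected subdirectories that do not exist under weights_root."""
    root = Path(weights_root)
    return [sub for sub in EXPECTED_SUBDIRS if not (root / sub).is_dir()]


if __name__ == "__main__":
    root = resolve_weights_root()
    print(f"weights root: {root}")
    for sub in missing_weight_dirs(root):
        print(f"missing: {root / sub}")
```

Run from the repository root, this prints the directory that would be used and any subdirectory that has not been downloaded yet.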
NOTE: This section ONLY pertains to KaML-ESMC models.
The CLI requires an ESM Forge token to compute ESM-C embeddings.
- To obtain an ESM Forge token see Web-based inference.
Token handling:
- First run:
  - Provide a token via --forge-token, either as the raw string or a path to a file that contains the token.
  - The token is cached to a user-level file (by default: ~/.esm_forge_token).
- Subsequent runs:
  - If --forge-token is omitted, the cached token is used.
  - If no cached token exists, the CLI will prompt for one interactively (input is hidden) and then cache it.
You can override the cache location with:
export KAML_FORGE_TOKEN_FILE=$(path_to)/token_file
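The resolution order described above can be sketched in a few lines of Python. This illustrates the documented behaviour and is not the CLI's actual code: the function names are hypothetical, and two details are assumptions, namely that a token passed explicitly always refreshes the cache and that the cache file is written with owner-only permissions.

```python
import os
from getpass import getpass
from pathlib import Path


def token_cache_path(environ=None):
    """Cache file: KAML_FORGE_TOKEN_FILE if set, else ~/.esm_forge_token."""
    environ = os.environ if environ is None else environ
    override = environ.get("KAML_FORGE_TOKEN_FILE")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".esm_forge_token"


def resolve_forge_token(cli_value=None, environ=None, prompt=getpass):
    """Resolve a token from --forge-token, then the cache, then a hidden prompt."""
    cache = token_cache_path(environ)
    if cli_value:
        candidate = os.path.expanduser(cli_value)
        if os.path.isfile(candidate):
            # --forge-token pointed at a file that contains the token.
            token = Path(candidate).read_text().strip()
        else:
            # --forge-token was the raw token string.
            token = cli_value.strip()
    elif cache.is_file():
        # Cached token from an earlier run; nothing new to store.
        return cache.read_text().strip()
    else:
        token = prompt("ESM Forge token: ").strip()
    # Assumption: any newly supplied token replaces the cached one.
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(token + "\n")
    cache.chmod(0o600)  # assumption: keep the secret readable by the owner only
    return token
```

Whatever mechanism is used, the cache file holds a secret and should stay out of version control.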
source env/KaML/bin/activate
python kamlCLI.py --help
If you do not pass --nofold, KaML may need to fold a structure. Folding uses ESM Forge, so you must have a Forge token available (via --forge-token or the cached token mechanism) unless you provide a structure:
- Provide a structure directly: --pdb ... or --pdbid ...
- Or provide a precomputed structure directory: --structs ... containing <unique_id>.pdb
Practical rule: If you are running on --seq, --uniprot, or --fasta and you don’t pass --nofold, you should assume a Forge token is required unless you are sure a matching structure will be found via --structs.
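The practical rule can be written down as a small decision function. This is a sketch of the rule as stated in this README, not logic extracted from kamlCLI.py; the function name and its parameters are hypothetical. It also folds in the earlier statement that ESM-C embeddings always need a Forge token.

```python
def forge_token_likely_needed(input_kind, nofold=False, structs_has_match=False,
                              acidic="esm2", basic="esm2"):
    """Apply the README's practical rule for whether a Forge token is needed.

    input_kind is one of 'seq', 'uniprot', 'fasta', 'pdb', 'pdbid'.
    structs_has_match means a matching <unique_id>.pdb exists under --structs.
    """
    if input_kind not in ("seq", "uniprot", "fasta", "pdb", "pdbid"):
        raise ValueError(f"unknown input kind: {input_kind!r}")
    if "esmC" in (acidic, basic):
        return True   # ESM-C embeddings are obtained through Forge
    if input_kind in ("pdb", "pdbid"):
        return False  # a structure is provided directly
    if nofold or structs_has_match:
        return False  # folding is skipped or a precomputed structure is used
    return True       # sequence input that may need folding via Forge
```

For a multi-FASTA run the answer is per record: one record without a matching structure is enough to require a token.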
- --seq <AASEQ>: amino-acid sequence string
- --uniprot <UNIPROT_ID>: fetch sequence from UniProt
- --pdb <PATH>: use a local PDB file
- --pdbid <PDB_ID>: fetch a PDB from RCSB
- --fasta <PATH>: process a multi-FASTA file (one output subdir per record)

- --outdir <DIR>: output directory (default: output/)
- --structs <DIR>: directory of precomputed structures; KaML looks for <unique_id>.pdb here before folding
- --nofold: skip folding (sequence-only inference when no structure is provided)
- --nproc <INT>: parallel workers for multi-FASTA (default: 1)
- --nthreads <INT>: threads per sequence for ensemble inference (default: 1)
- --acidic {esm2,esmC}: acidic channel model preference (default: esm2)
- --basic {esm2,esmC}: basic channel model (default: esm2)
- --nocbtree: skip CBTree2 predictions
- --debug: enable debug logging
- --skip_safety: skip Forge safety filter (requires permission)
- --forge-token <TOKEN_OR_FILE>: Forge token string or a path to a file containing it (used/cached when Forge is needed)
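When KaML is driven from a script rather than typed by hand, it can help to assemble the argument list programmatically. The sketch below uses only flags listed above; the function `build_kaml_command` is hypothetical and not shipped with the repository, and it assumes that the five input flags are mutually exclusive, which this README does not state explicitly.

```python
def build_kaml_command(*, seq=None, uniprot=None, pdb=None, pdbid=None, fasta=None,
                       outdir="output", structs=None, nofold=False, nocbtree=False,
                       acidic=None, basic=None, nproc=None, nthreads=None,
                       forge_token=None, script="kamlCLI.py"):
    """Return an argv list for kamlCLI.py built from the documented flags."""
    inputs = [("--seq", seq), ("--uniprot", uniprot), ("--pdb", pdb),
              ("--pdbid", pdbid), ("--fasta", fasta)]
    chosen = [(flag, value) for flag, value in inputs if value is not None]
    # Assumption: exactly one input source is given per run.
    if len(chosen) != 1:
        raise ValueError("provide exactly one of seq, uniprot, pdb, pdbid, fasta")
    for name, value in (("acidic", acidic), ("basic", basic)):
        if value not in (None, "esm2", "esmC"):
            raise ValueError(f"{name} must be 'esm2' or 'esmC'")

    cmd = ["python", script, chosen[0][0], str(chosen[0][1])]
    if structs is not None:
        cmd += ["--structs", str(structs)]
    if nofold:
        cmd.append("--nofold")
    if nocbtree:
        cmd.append("--nocbtree")
    # Optional flags are emitted only when set, so CLI defaults still apply.
    for flag, value in (("--acidic", acidic), ("--basic", basic),
                        ("--nproc", nproc), ("--nthreads", nthreads),
                        ("--forge-token", forge_token)):
        if value is not None:
            cmd += [flag, str(value)]
    cmd += ["--outdir", str(outdir)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root with the KaML environment active.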
# Single sequence, no folding
python kamlCLI.py \
  --seq "ACDEFGHIKLMNPQRSTVWY" \
  --nofold \
  --outdir output

# Single sequence, folding via ESM Forge (token required)
python kamlCLI.py \
  --seq "ACDEFGHIKLMNPQRSTVWY" \
  --forge-token /path/to/forge_token.txt \
  --outdir output

# Local PDB file
python kamlCLI.py \
  --pdb /path/to/structure.pdb \
  --outdir output

# PDB fetched from RCSB
python kamlCLI.py \
  --pdbid 1I0E \
  --outdir output

# Multi-FASTA, no folding, 4 parallel workers
python kamlCLI.py \
  --fasta proteins.fasta \
  --nofold \
  --nproc 4 \
  --outdir output

# Multi-FASTA, folding via ESM Forge (token required)
python kamlCLI.py \
  --fasta proteins.fasta \
  --nproc 4 \
  --forge-token /path/to/forge_token.txt \
  --outdir output

# Multi-FASTA with precomputed structures
python kamlCLI.py \
  --fasta proteins.fasta \
  --structs /path/to/structs_dir \
  --nproc 4 \
  --outdir output

# KaML-ESMC for both channels (token required for ESM-C embeddings)
python kamlCLI.py \
  --seq "ACDEFGHIKLMNPQRSTVWY" \
  --nofold \
  --basic esmC \
  --acidic esmC \
  --forge-token /path/to/forge_token.txt \
  --outdir output

# Skip CBTree2 predictions
python kamlCLI.py \
  --seq "ACDEFGHIKLMNPQRSTVWY" \
  --nofold \
  --nocbtree \
  --outdir output

KaML writes outputs to an output directory (default: ./output/). It does not print a tab-separated prediction table to standard output.
You control the output location with --outdir:
python kamlCLI.py \
  --seq "MQLKPMEINPEMLNKVLSRLGVAGQWRFVDVLGLEEESLGSVPAPACALLLLFPLTAQHENFRKKQIEELKGQEVSPKVYFMKQTIGNSCGTIGLIHAVANNQDKLGFEDGSVLKQFLSETEKMSPEDRAKCFEKNEAIQAAHDAVAQEGQCRVDDKVNFHFILFNNVDGHLYELDGRMPFPVNHGASSEDTLLKDAAKVCREFTEREQGEVRFSAVALCKAA" \
  --nofold \
  --outdir output

For a single sequence run, KaML writes:
- <outdir>/predictions.csv
- <outdir>/predicted_structure.pdb (only if a structure was provided or generated)
For multi-FASTA runs (--fasta), KaML creates one subdirectory per sequence under <outdir>/, using <unique_id> from the FASTA header (first whitespace-delimited token), and writes the same files inside each subdirectory.
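To locate results after a multi-FASTA run, the expected subdirectory names can be derived from the FASTA headers using the rule above (first whitespace-delimited token). This is a minimal sketch with hypothetical helper names. It assumes the leading `>` is dropped and that KaML uses the identifier verbatim; whether characters such as `|` or `/` are sanitized is not stated here, so check against the actual output directory.

```python
from pathlib import PurePosixPath


def fasta_unique_ids(lines):
    """Return the first whitespace-delimited token of each FASTA header line."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # skip headers that carry no identifier
                ids.append(tokens[0])
    return ids


def expected_prediction_files(lines, outdir="output"):
    """Paths where predictions.csv is expected, one per FASTA record."""
    return [
        str(PurePosixPath(outdir) / uid / "predictions.csv")
        for uid in fasta_unique_ids(lines)
    ]
```

For example, `expected_prediction_files(open("proteins.fasta"), "output")` lists one path per record, which makes it easy to spot records that produced no output.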
- If you do not pass --nofold and you do not provide a structure (--pdb, --pdbid, or --structs), KaML will attempt to fold via ESM Forge (Forge token required).
- At the end of the run, KaML prints a completion message indicating the output directory, e.g.:
KaML run complete. Outputs written to ./output/
The first line of predictions.csv is a header:
Residue_ID Pred_pKa Pred_Shift Error Conf_pKa
...
Each subsequent line corresponds to a residue position. Note: Conf_pKa is only present when CBTree2 has been invoked.
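A predictions file with this header can be loaded without extra dependencies. The sketch below is an illustration with hypothetical function names. The exact delimiter is an assumption: the header is shown space-separated above while the file extension is .csv, so the parser accepts either. Residue_ID is kept as a string because its format is not specified here, and Conf_pKa is simply absent from a row when the column is missing.

```python
def split_fields(line):
    """Split on commas when present, otherwise on runs of whitespace."""
    if "," in line:
        return [field.strip() for field in line.split(",")]
    return line.split()


def to_number(value):
    """Convert to float where possible; leave other values as strings."""
    try:
        return float(value)
    except ValueError:
        return value


def parse_predictions(text):
    """Return one dict per residue, keyed by the header names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = split_fields(lines[0])
    rows = []
    for line in lines[1:]:
        row = {}
        for name, value in zip(header, split_fields(line)):
            # Assumption: Residue_ID is an identifier, not a measurement.
            row[name] = value if name == "Residue_ID" else to_number(value)
        rows.append(row)
    return rows
```

Typical use is `rows = parse_predictions(open("output/predictions.csv").read())`, after which `row.get("Conf_pKa")` returns None for runs made with --nocbtree.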
Pre-trained Ensemble Weights:
- KaML-ESM2 ensemble weights v1.0.0 (Zenodo), DOI: https://doi.org/10.5281/zenodo.17943825
- KaML-ESMC ensemble weights v1.0.0 (Zenodo), DOI: https://doi.org/10.5281/zenodo.17943447
- KaML-CBT2 weights v1.0.0 (Zenodo), DOI: https://doi.org/10.5281/zenodo.17943947