SLM: ShortList Model

🏆 Update (2025.9): Our paper ShortListing Model: A Streamlined Simplex Diffusion for Discrete Variable Generation has been accepted at NeurIPS 2025!

This repository contains the implementation of the ShortList Model (SLM) and its applications.

Environment

conda env create -f slm.yaml

Because there are many dependencies, we also provide a quicker route that starts from a pytorch:2.1.0-cu11.8.0-py3.9 base:

# build slm base
pip install datasets einops fsspec git-lfs h5py hydra-core lightning nvitop omegaconf packaging pandas rich seaborn scikit-learn timm transformers triton
# check versions
# flash_attn:2.7.2.post1+cu11torch2.1cxx11abiFALSE-cp39
# causal_conv1d:1.5.0.post8+cu11torch2.1cxx11abiFALSE-cp39
# mamba_ssm:2.2.4+cu11torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
pip install pika torchdiffeq
# build slm protein
pip install evodiff biopython
# build slm dna
pip install pyBigWig pytabix selene_sdk pyranges cooltools
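The "check versions" comment above can be turned into a quick script. This is a minimal sketch, not part of the repository: it assumes the installed distribution names match the wheel names in that comment, and it compares only the public version, ignoring the local build tag (the part after `+`).

```python
from importlib import metadata

# Versions copied from the "check versions" comment; the distribution names
# are an assumption based on the wheel file names.
EXPECTED = {
    "flash_attn": "2.7.2.post1",
    "causal_conv1d": "1.5.0.post8",
    "mamba_ssm": "2.2.4",
}


def installed_versions(packages):
    """Return {name: installed version, or None if the package is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(expected=EXPECTED):
    """Print one line per package, flagging missing or mismatched versions."""
    for name, version in installed_versions(expected).items():
        want = expected[name]
        # Strip a local build tag such as "+cu11torch2.1..." before comparing.
        ok = version is not None and version.split("+")[0] == want
        status = "OK" if ok else "CHECK"
        print(f"{name}: installed={version} expected={want} [{status}]")


if __name__ == "__main__":
    report()
```

A package flagged with CHECK is either missing or at a different version than the one listed above.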

Preparation

Text8 Dataset

Download the preprocessed text8 dataset, or set the data directory via cache_dir in ./configs/data/text8.yaml. In the latter case the dataset is downloaded automatically.

Uniref50 Dataset

Download the Uniref50 data as described in the EvoDiff repo. The directory should contain the following files:

---uniref50
  | consensus.fasta 
  | lengths_and_offsets.npz
  | splits.json
  | uniref50.tar.gz

Then set cache_dir in ./configs/data/uniref50.yaml to the uniref50 directory.
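Before starting a long training run, it can help to confirm that the directory matches the layout listed above. The helper below is a minimal sketch and not part of the repository; `missing_files` is a hypothetical name, and the file list is copied from the layout above.

```python
from pathlib import Path

# File names copied from the Uniref50 layout listed above.
UNIREF50_FILES = [
    "consensus.fasta",
    "lengths_and_offsets.npz",
    "splits.json",
    "uniref50.tar.gz",
]


def missing_files(data_dir, expected=UNIREF50_FILES):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    import sys

    missing = missing_files(sys.argv[1] if len(sys.argv) > 1 else "uniref50")
    print("all files present" if not missing else f"missing: {missing}")
```

The same function works for the other datasets if a different `expected` list is passed in.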

DNA Enhancer Dataset

Download the dataset and organize the files as follows:

---dna_data
  | DeepMEL2_data.pkl
  | DeepFlyBrain_data.pkl

After unzipping, the two .pkl files are under General/data.

Then set cache_dir in both ./configs/data/Mel.yaml and ./configs/data/FB.yaml to the dna_data directory.
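Moving the two enhancer files from the unzipped archive into the dna_data layout can be scripted. This is a minimal sketch under the assumption that the archive unpacks to a folder containing General/data, as noted above; `collect_enhancer_files` is a hypothetical helper, not a function from this repository.

```python
import shutil
from pathlib import Path

# File names copied from the dna_data layout above.
ENHANCER_FILES = ["DeepMEL2_data.pkl", "DeepFlyBrain_data.pkl"]


def collect_enhancer_files(unzipped_root, target_dir, names=ENHANCER_FILES):
    """Copy the .pkl files from <unzipped_root>/General/data into target_dir."""
    source = Path(unzipped_root) / "General" / "data"
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = source / name
        if not src.is_file():
            raise FileNotFoundError(f"expected file not found: {src}")
        shutil.copy2(src, target / name)
        copied.append(target / name)
    return copied
```

For example, `collect_enhancer_files("./unzipped", "./dna_data")` would leave both files directly under ./dna_data, which is the directory that cache_dir should point to.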

DNA Promoter Dataset

Download the dataset and organize the files as follows:

---promoter_design
  | .agg.minus.bw.bedgraph.bw
  | .agg.plus.bw.bedgraph.bw
  ...
  ...
  | Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai
  | Homo_sapiens.GRCh38.dna.primary_assembly.fa.mmap
  | target.sei.names

Tip: you need to gunzip the .gz files in the raw directory.
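If the gunzip command-line tool is not available, the decompression step in the tip above can be done with the Python standard library. This is a minimal sketch; `gunzip_all` is a hypothetical helper, and it only looks at .gz files directly inside the given directory, not in subdirectories.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_all(raw_dir, keep_original=True):
    """Decompress every *.gz file in raw_dir next to its compressed copy."""
    written = []
    for gz_path in sorted(Path(raw_dir).glob("*.gz")):
        out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # streams, so large files are fine
        if not keep_original:
            gz_path.unlink()
        written.append(out_path)
    return written
```

Keeping the originals is the default here so that a failed or interrupted run does not lose data.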

GPT2 Model

GPT2 is used only for evaluation. Download gpt2-large and set its directory in ./configs/config.yaml: eval.gen_ppl_eval_model_name_or_path.
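The config key name suggests that GPT2 scores generated text by perplexity. Assuming that reading, the sketch below shows the standard definition, the exponential of the mean negative log-likelihood per token. It takes per-token natural-log probabilities as input and is only an illustration; the repository's evaluation code may batch, truncate, or average differently.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum_i log p(x_i | x_<i)), where the log-probabilities
    would come from the evaluation model (assumed here to be gpt2-large).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

Lower values mean the evaluation model finds the generated text more predictable.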

Training

Text8 experiments

Run make train_text8 to start training. Sentences are sampled every 500 steps.

Protein experiments

Run make train_uniref50 to start training. Sequences are sampled every 500 steps.

DNA experiments

For DNA-Enhancer Design, run make train_fb and make train_mel to train SLM on the DeepFlyBrain and DeepMEL2 datasets, respectively.

For DNA-Promoter Design, run make train_promoter instead.

Load from checkpoints and Sampling

We have released SLM checkpoints on GenSI's Hugging Face page. Download the ckpt files and replace CKPT_PATH in the corresponding makefile targets.

Run make sample_uniref50 to sample protein sequences.

Run make sample_fb and make sample_mel to sample DNA enhancer sequences.

Reference

@article{song2025shortlisting,
  title={ShortListing Model: A Streamlined Simplex Diffusion for Discrete Variable Generation},
  author={Song, Yuxuan and Zhang, Zhe and Pei, Yu and Gong, Jingjing and Yu, Qiying and Zhang, Zheng and Wang, Mingxuan and Zhou, Hao and Liu, Jingjing and Ma, Wei-Ying},
  journal={arXiv preprint arXiv:2508.17345},
  year={2025}
}
