IDLM: Inverse-distilled Diffusion Language Models

Few-step text generation from distilled diffusion.


What is IDLM?

Diffusion Language Models can generate high-quality text, but their iterative reverse-diffusion sampling makes inference slow. IDLM speeds them up by distilling a pretrained many-step diffusion language model into a few-step generator.

Instead of simply matching every teacher step, IDLM takes an Inverse Distillation view of the problem, adapted to discrete token spaces. The paper reports 4×–64× fewer inference steps while preserving the teacher model’s sample entropy and generative perplexity.
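
To make the few-step idea concrete, here is a toy sketch of few-step unmasking in a masked diffusion language model. It is pure Python with a stand-in denoiser callback; the callback signature and the unmasking schedule are illustrative assumptions, not the repo's API.

```python
import random

MASK = "<mask>"

def few_step_unmask(seq_len, num_steps, denoiser, rng=random.Random(0)):
    # Toy few-step ancestral sampler for a masked diffusion LM: start fully
    # masked and, at each reverse step, unmask a fraction of the remaining
    # masked positions using the denoiser's predictions. The real IDLM
    # student is a learned few-step generator; `denoiser(tokens, i)` here is
    # a placeholder that returns a token for position i.
    tokens = [MASK] * seq_len
    for step in range(num_steps, 0, -1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Reveal roughly 1/step of what is left, so everything is unmasked
        # by the final step.
        k = max(1, round(len(masked) / step))
        for i in rng.sample(masked, min(k, len(masked))):
            tokens[i] = denoiser(tokens, i)
    return tokens
```

With 4 steps and 16 positions, each step reveals about a quarter of the sequence, which is the whole point of few-step generation: far fewer denoiser calls than one-token-at-a-time decoding.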


Highlights

  • Fast diffusion-language generation with 4, 8, 16, and 32 step sampling recipes.
  • Inverse distillation for discrete tokens, including the training logic for the student and auxiliary (“fake”) models.
  • Hydra-powered experiments for easy configuration and reproducibility.
  • PyTorch Lightning training loop with checkpointing, logging, and distributed training support.
  • Ready-to-run scripts for MDLM, DUO, and DCD-style recipes.
  • Evaluation utilities for NLL, BPD, perplexity, generative perplexity, and sample entropy.
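
The evaluation quantities are closely related; a minimal sketch of those relationships (not the repo's metrics.py, whose exact implementation may differ) is:

```python
import math
from collections import Counter

def nll_to_bpd(nll_nats: float) -> float:
    # Bits-per-dimension: convert an average per-token NLL from nats to bits.
    return nll_nats / math.log(2)

def nll_to_ppl(nll_nats: float) -> float:
    # Perplexity is the exponential of the average per-token NLL.
    return math.exp(nll_nats)

def sample_entropy(token_ids) -> float:
    # Empirical unigram entropy (in nats) of a generated sequence; one
    # simple proxy for sample diversity.
    n = len(token_ids)
    counts = Counter(token_ids)
    return -sum(c / n * math.log(c / n) for c in counts.values())
```

Generative perplexity is the same `exp(NLL)` quantity, but with the NLL assigned to the student's samples by a separate scoring model rather than by the student itself.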

Repository layout

IDLM/
├── configs/                 # Hydra configs: data, model, algo, strategy, callbacks, etc.
│   ├── algo/                # ar, mdlm, duo, duo_base, d3pm, sedd
│   ├── data/                # OpenWebText configs
│   ├── model/               # tiny / small / medium model configs
│   ├── noise/               # diffusion noise schedules
│   └── config.yaml          # main experiment config
├── integral/                # precomputed tokenizer / integration assets
├── models/                  # DiT backbone, EMA utilities, attention tests
├── scripts/                 # training and generation recipes
├── algo.py                  # model families and IDLM distillation logic
├── dataloader.py            # tokenizers, datasets, dataloaders
├── main.py                  # Hydra + Lightning entry point
├── metrics.py               # perplexity, entropy, BPD, NLL metrics
├── trainer_base.py          # shared training / sampling base classes
├── utils.py                 # logging and helper utilities
├── requirements.txt         # environment note / dependency list
└── LICENSE
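
As an aside on configs/noise: a log-linear schedule, common in masked diffusion language models, can be sketched as below. The schedule family actually used in this repo is an assumption; check the configs/noise files.

```python
import math

def loglinear_sigma(t: float, eps: float = 1e-3) -> float:
    # Log-linear noise schedule: sigma(t) = -log(1 - (1 - eps) * t), so the
    # keep probability alpha(t) = exp(-sigma(t)) = 1 - (1 - eps) * t decays
    # linearly from 1 at t = 0 down to eps at t = 1.
    return -math.log(1.0 - (1.0 - eps) * t)

def mask_prob(t: float, eps: float = 1e-3) -> float:
    # Probability that any given token is masked at diffusion time t.
    return 1.0 - math.exp(-loglinear_sigma(t, eps))
```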

Getting Started

1. Clone the repository

git clone https://github.com/David-cripto/IDLM.git
cd IDLM

2. Create an environment

To get started, create a conda environment containing the required dependencies.

conda create -n idlm python=3.12
conda activate idlm
conda install nvidia/label/cuda-12.4.0::cuda-toolkit
pip install -r requirements.txt
pip install flash_attn==2.7.4.post1

Checkpoints

Pretrained IDLM checkpoints are hosted on Hugging Face; see “Checkpoints referenced by the scripts” below.

Train IDLM

The repository includes three training recipes. Before running the scripts, set the cache_dir parameter in configs/data/openwebtext-split.yaml to the desired cache path.
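
For reference, the relevant fragment of that config might look like the following; the key name comes from the instructions above, and the path value is a placeholder to replace:

```yaml
# configs/data/openwebtext-split.yaml (fragment)
cache_dir: /path/to/dataset/cache   # set this before running the training scripts
```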

MDLM teacher → IDLM-MDLM student

bash scripts/train_idlm_mdlm.sh

DUO teacher → IDLM-Duo student

bash scripts/train_idlm_duo.sh

DCD teacher → IDLM-DCD student

bash scripts/train_idlm_dcd.sh

These scripts use Hydra overrides for batch size, dataset, teacher checkpoint, algorithm, sampling steps, precision, logging name, and validation frequency. Use them as strong starting points, then tune the overrides for your compute budget.


Generate samples

The generation scripts sweep over 4, 8, 16, and 32 sampling steps.

Before running them, set eval.generated_samples_path to a real JSON output path.

IDLM-MDLM checkpoint

mkdir -p samples

python -m main \
  mode=sample_eval \
  loader.batch_size=2 \
  loader.eval_batch_size=8 \
  data=openwebtext-split \
  algo=mdlm \
  algo.backbone=hf_dit \
  eval.checkpoint_path=kekchpek/idlm-mdlm \
  sampling.steps=16 \
  sampling.num_sample_batches=10 \
  sampling.predictor=ancestral_cache \
  sampling.noise_removal=ancestral \
  +wandb.offline=true \
  eval.generated_samples_path=samples/idlm_mdlm_16steps.json

IDLM-Duo checkpoint

mkdir -p samples

python -m main \
  mode=sample_eval \
  loader.batch_size=2 \
  loader.eval_batch_size=8 \
  data=openwebtext-split \
  algo=duo \
  algo.backbone=hf_dit \
  eval.checkpoint_path=kekchpek/idlm-duo \
  sampling.steps=16 \
  sampling.num_sample_batches=10 \
  sampling.noise_removal=greedy \
  +wandb.offline=true \
  eval.generated_samples_path=samples/idlm_duo_16steps.json

IDLM-DCD checkpoint

mkdir -p samples

python -m main \
  mode=sample_eval \
  loader.batch_size=2 \
  loader.eval_batch_size=8 \
  data=openwebtext-split \
  algo=duo \
  algo.backbone=hf_dit \
  eval.checkpoint_path=kekchpek/idlm-dcd \
  sampling.steps=4 \
  sampling.num_sample_batches=10 \
  sampling.noise_removal=greedy \
  +wandb.offline=true \
  eval.generated_samples_path=samples/idlm_dcd_4steps.json

Or simply run the provided scripts:

bash scripts/generation_idlm_mdlm.sh
bash scripts/generation_idlm_duo.sh
bash scripts/generation_idlm_dcd.sh
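
The step sweep can also be scripted yourself. The sketch below assembles `python -m main` invocations from the same override keys used in the commands above; only a subset of the overrides is shown, so add the remaining ones (loader sizes, predictor, wandb) for a real run.

```python
import shlex

def generation_cmd(checkpoint: str, algo: str, steps: int, out_path: str):
    # Build one sample_eval invocation; override keys mirror the README
    # commands above.
    return [
        "python", "-m", "main",
        "mode=sample_eval",
        "data=openwebtext-split",
        f"algo={algo}",
        "algo.backbone=hf_dit",
        f"eval.checkpoint_path={checkpoint}",
        f"sampling.steps={steps}",
        f"eval.generated_samples_path={out_path}",
    ]

# Sweep the 4/8/16/32-step recipes for the IDLM-MDLM checkpoint.
for steps in (4, 8, 16, 32):
    cmd = generation_cmd("kekchpek/idlm-mdlm", "mdlm", steps,
                         f"samples/idlm_mdlm_{steps}steps.json")
    print(shlex.join(cmd))
```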

Generated sample files contain:

{
  "generative_ppl": 0.0,
  "entropy": 0.0,
  "generated_seqs": []
}
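
A small helper, assuming only the schema shown above, can summarize such a file:

```python
import json
from pathlib import Path

def summarize_samples(path):
    # Read a generated-samples file (schema shown above) and return its
    # headline metrics plus the number of generated sequences.
    record = json.loads(Path(path).read_text())
    return {
        "generative_ppl": record["generative_ppl"],
        "entropy": record["entropy"],
        "num_seqs": len(record["generated_seqs"]),
    }
```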

Checkpoints referenced by the scripts

Training scripts reference teacher checkpoints such as:

kuleshov-group/mdlm-owt
s-sahoo/duo
s-sahoo/duo-distilled

Generation scripts reference IDLM checkpoints such as:

kekchpek/idlm-mdlm
kekchpek/idlm-duo
kekchpek/idlm-dcd

Make sure your environment can access the required checkpoints before running training or generation.


Outputs

By default, Hydra writes experiment outputs under:

outputs/<dataset>/<date>/<time>/

TensorBoard logs are written under:

tb_logs/

Checkpoints are written according to the checkpointing config in configs/config.yaml.
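
To locate the most recent run directory under that layout, a small helper might look like this (it assumes Hydra's default zero-padded date/time folder names, so lexicographic order matches chronological order):

```python
from pathlib import Path

def latest_run_dir(outputs_root="outputs"):
    # Hydra writes runs as outputs/<dataset>/<date>/<time>/; sorting the
    # depth-3 directories lexicographically yields chronological order.
    runs = sorted(p for p in Path(outputs_root).glob("*/*/*") if p.is_dir())
    return runs[-1] if runs else None
```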


Citation

If you find this repository useful, please cite:

@article{li2026idlm,
  title={IDLM: Inverse-distilled Diffusion Language Models},
  author={Li, David and Gushchin, Nikita and Abulkhanov, Dmitry and Moulines, Eric and Oseledets, Ivan and Panov, Maxim and Korotin, Alexander},
  journal={arXiv preprint arXiv:2602.19066},
  year={2026}
}

Acknowledgements

Our codebase is inspired by recent discrete diffusion language model projects, namely MDLM and DUO.


License

This project is released under the MIT License. See LICENSE for details.

About

(ICML 2026) IDLM: Inverse-distilled Diffusion Language Models
