IDLM: Inverse-distilled Diffusion Language Models

Few-step text generation from distilled diffusion.


What is IDLM?

Diffusion Language Models can generate high-quality text, but their iterative reverse-diffusion sampling makes inference slow. IDLM speeds them up by distilling a pretrained many-step diffusion language model into a few-step generator.

Instead of simply matching every teacher step, IDLM takes an Inverse Distillation view of the problem, adapted to discrete token spaces. The paper reports 4×–64× fewer inference steps while preserving the teacher model’s sample entropy and generative perplexity.
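
To make the few-step idea concrete, here is a toy sketch of few-step unmasking in a masked diffusion language model. It is pure Python with a stand-in denoiser callback; the callback signature and the unmasking schedule are illustrative assumptions, not the repo's API.

```python
import random

MASK = "<mask>"

def few_step_unmask(seq_len, num_steps, denoiser, rng=random.Random(0)):
    # Toy few-step ancestral sampler for a masked diffusion LM: start fully
    # masked and, at each reverse step, unmask a fraction of the remaining
    # masked positions using the denoiser's predictions. The real IDLM
    # student is a learned few-step generator; `denoiser(tokens, i)` here is
    # a placeholder that returns a token for position i.
    tokens = [MASK] * seq_len
    for step in range(num_steps, 0, -1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Reveal roughly 1/step of what is left, so everything is unmasked
        # by the final step.
        k = max(1, round(len(masked) / step))
        for i in rng.sample(masked, min(k, len(masked))):
            tokens[i] = denoiser(tokens, i)
    return tokens
```

With 4 steps and 16 positions, each step reveals about a quarter of the sequence, which is the whole point of few-step generation: far fewer denoiser calls than one-token-at-a-time decoding.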


Highlights

  • Fast diffusion-language generation with 4, 8, 16, and 32 step sampling recipes.
  • Inverse distillation for discrete tokens, including the training logic for the student and auxiliary (“fake”) models.
  • Hydra-powered experiments for easy configuration and reproducibility.
  • PyTorch Lightning training loop with checkpointing, logging, and distributed training support.
  • Ready-to-run scripts for MDLM, DUO, and DCD-style recipes.
  • Evaluation utilities for NLL, BPD, perplexity, generative perplexity, and sample entropy.
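
The evaluation quantities are closely related; a minimal sketch of those relationships (not the repo's metrics.py, whose exact implementation may differ) is:

```python
import math
from collections import Counter

def nll_to_bpd(nll_nats: float) -> float:
    # Bits-per-dimension: convert an average per-token NLL from nats to bits.
    return nll_nats / math.log(2)

def nll_to_ppl(nll_nats: float) -> float:
    # Perplexity is the exponential of the average per-token NLL.
    return math.exp(nll_nats)

def sample_entropy(token_ids) -> float:
    # Empirical unigram entropy (in nats) of a generated sequence; one
    # simple proxy for sample diversity.
    n = len(token_ids)
    counts = Counter(token_ids)
    return -sum(c / n * math.log(c / n) for c in counts.values())
```

Generative perplexity is the same `exp(NLL)` quantity, but with the NLL assigned to the student's samples by a separate scoring model rather than by the student itself.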

Repository layout

IDLM/
├── configs/                 # Hydra configs: data, model, algo, strategy, callbacks, etc.
│   ├── algo/                # ar, mdlm, duo, duo_base, d3pm, sedd
│   ├── data/                # OpenWebText configs
│   ├── model/               # tiny / small / medium model configs
│   ├── noise/               # diffusion noise schedules
│   └── config.yaml          # main experiment config
├── integral/                # precomputed tokenizer / integration assets
├── models/                  # DiT backbone, EMA utilities, attention tests
├── scripts/                 # training and generation recipes
├── algo.py                  # model families and IDLM distillation logic
├── dataloader.py            # tokenizers, datasets, dataloaders
├── main.py                  # Hydra + Lightning entry point
├── metrics.py               # perplexity, entropy, BPD, NLL metrics
├── trainer_base.py          # shared training / sampling base classes
├── utils.py                 # logging and helper utilities
├── requirements.txt         # environment note / dependency list
└── LICENSE
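
As an aside on configs/noise: a log-linear schedule, common in masked diffusion language models, can be sketched as below. The schedule family actually used in this repo is an assumption; check the configs/noise files.

```python
import math

def loglinear_sigma(t: float, eps: float = 1e-3) -> float:
    # Log-linear noise schedule: sigma(t) = -log(1 - (1 - eps) * t), so the
    # keep probability alpha(t) = exp(-sigma(t)) = 1 - (1 - eps) * t decays
    # linearly from 1 at t = 0 down to eps at t = 1.
    return -math.log(1.0 - (1.0 - eps) * t)

def mask_prob(t: float, eps: float = 1e-3) -> float:
    # Probability that any given token is masked at diffusion time t.
    return 1.0 - math.exp(-loglinear_sigma(t, eps))
```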

Getting Started

1. Clone the repository

git clone https://github.com/David-cripto/IDLM.git
cd IDLM

2. Create an environment

To get started, create a conda environment containing the required dependencies.

conda create -n idlm python=3.12
conda activate idlm
conda install nvidia/label/cuda-12.4.0::cuda-toolkit
pip install -r requirements.txt
pip install flash_attn==2.7.4.post1

Checkpoints

Pretrained IDLM checkpoints are hosted on Hugging Face; see “Checkpoints referenced by the scripts” below.

Train IDLM

The repository includes three training recipes. Before running the scripts, set the cache_dir parameter in configs/data/openwebtext-split.yaml to the desired cache path.
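
For reference, the relevant fragment of that config might look like the following; the key name comes from the instructions above, and the path value is a placeholder to replace:

```yaml
# configs/data/openwebtext-split.yaml (fragment)
cache_dir: /path/to/dataset/cache   # set this before running the training scripts
```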

MDLM teacher → IDLM-MDLM student

bash scripts/train_idlm_mdlm.sh

DUO teacher → IDLM-Duo student

bash scripts/train_idlm_duo.sh

DCD teacher → IDLM-DCD student

bash scripts/train_idlm_dcd.sh

These scripts use Hydra overrides for batch size, dataset, teacher checkpoint, algorithm, sampling steps, precision, logging name, and validation frequency. Use them as strong starting points, then tune the overrides for your compute budget.


Generate samples

The generation scripts sweep over 4, 8, 16, and 32 sampling steps.

Before running them, set eval.generated_samples_path to a real JSON output path.

IDLM-MDLM checkpoint

mkdir -p samples

python -m main \
  mode=sample_eval \
  loader.batch_size=2 \
  loader.eval_batch_size=8 \
  data=openwebtext-split \
  algo=mdlm \
  algo.backbone=hf_dit \
  eval.checkpoint_path=kekchpek/idlm-mdlm \
  sampling.steps=16 \
  sampling.num_sample_batches=10 \
  sampling.predictor=ancestral_cache \
  sampling.noise_removal=ancestral \
  +wandb.offline=true \
  eval.generated_samples_path=samples/idlm_mdlm_16steps.json

IDLM-Duo checkpoint

mkdir -p samples

python -m main \
  mode=sample_eval \
  loader.batch_size=2 \
  loader.eval_batch_size=8 \
  data=openwebtext-split \
  algo=duo \
  algo.backbone=hf_dit \
  eval.checkpoint_path=kekchpek/idlm-duo \
  sampling.steps=16 \
  sampling.num_sample_batches=10 \
  sampling.noise_removal=greedy \
  +wandb.offline=true \
  eval.generated_samples_path=samples/idlm_duo_16steps.json

IDLM-DCD checkpoint

mkdir -p samples

python -m main \
  mode=sample_eval \
  loader.batch_size=2 \
  loader.eval_batch_size=8 \
  data=openwebtext-split \
  algo=duo \
  algo.backbone=hf_dit \
  eval.checkpoint_path=kekchpek/idlm-dcd \
  sampling.steps=4 \
  sampling.num_sample_batches=10 \
  sampling.noise_removal=greedy \
  +wandb.offline=true \
  eval.generated_samples_path=samples/idlm_dcd_4steps.json

Or simply run the provided scripts:

bash scripts/generation_idlm_mdlm.sh
bash scripts/generation_idlm_duo.sh
bash scripts/generation_idlm_dcd.sh
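
The step sweep can also be scripted yourself. The sketch below assembles `python -m main` invocations from the same override keys used in the commands above; only a subset of the overrides is shown, so add the remaining ones (loader sizes, predictor, wandb) for a real run.

```python
import shlex

def generation_cmd(checkpoint: str, algo: str, steps: int, out_path: str):
    # Build one sample_eval invocation; override keys mirror the README
    # commands above.
    return [
        "python", "-m", "main",
        "mode=sample_eval",
        "data=openwebtext-split",
        f"algo={algo}",
        "algo.backbone=hf_dit",
        f"eval.checkpoint_path={checkpoint}",
        f"sampling.steps={steps}",
        f"eval.generated_samples_path={out_path}",
    ]

# Sweep the 4/8/16/32-step recipes for the IDLM-MDLM checkpoint.
for steps in (4, 8, 16, 32):
    cmd = generation_cmd("kekchpek/idlm-mdlm", "mdlm", steps,
                         f"samples/idlm_mdlm_{steps}steps.json")
    print(shlex.join(cmd))
```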

Generated sample files contain:

{
  "generative_ppl": 0.0,
  "entropy": 0.0,
  "generated_seqs": []
}
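
A small helper, assuming only the schema shown above, can summarize such a file:

```python
import json
from pathlib import Path

def summarize_samples(path):
    # Read a generated-samples file (schema shown above) and return its
    # headline metrics plus the number of generated sequences.
    record = json.loads(Path(path).read_text())
    return {
        "generative_ppl": record["generative_ppl"],
        "entropy": record["entropy"],
        "num_seqs": len(record["generated_seqs"]),
    }
```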

Checkpoints referenced by the scripts

Training scripts reference teacher checkpoints such as:

kuleshov-group/mdlm-owt
s-sahoo/duo
s-sahoo/duo-distilled

Generation scripts reference IDLM checkpoints such as:

kekchpek/idlm-mdlm
kekchpek/idlm-duo
kekchpek/idlm-dcd

Make sure your environment can access the required checkpoints before running training or generation.


Outputs

By default, Hydra writes experiment outputs under:

outputs/<dataset>/<date>/<time>/

TensorBoard logs are written under:

tb_logs/

Checkpoints are written according to the checkpointing config in configs/config.yaml.
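
To locate the most recent run directory under that layout, a small helper might look like this (it assumes Hydra's default zero-padded date/time folder names, so lexicographic order matches chronological order):

```python
from pathlib import Path

def latest_run_dir(outputs_root="outputs"):
    # Hydra writes runs as outputs/<dataset>/<date>/<time>/; sorting the
    # depth-3 directories lexicographically yields chronological order.
    runs = sorted(p for p in Path(outputs_root).glob("*/*/*") if p.is_dir())
    return runs[-1] if runs else None
```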


Citation

If you find this repository useful, please cite:

@article{li2026idlm,
  title={IDLM: Inverse-distilled Diffusion Language Models},
  author={Li, David and Gushchin, Nikita and Abulkhanov, Dmitry and Moulines, Eric and Oseledets, Ivan and Panov, Maxim and Korotin, Alexander},
  journal={arXiv preprint arXiv:2602.19066},
  year={2026}
}

Acknowledgements

Our codebase is inspired by recent discrete diffusion language model projects, namely MDLM and DUO.


License

This project is released under the MIT License. See LICENSE for details.

About

(ICML 2026) IDLM: Inverse-distilled Diffusion Language Models
