Masked Diffusion LM Lab

This repo contains my current Infinite Jest diffusion language model experiment.

The active model starts from RoBERTa-large and gets posttrained on data/infinite_jest.txt with a masked denoising objective. It is diffusion-only: no causal attention mask and no next-token prediction objective.

The Infinite Jest corpus is doing the style work. There is no DFW prompt template, style-transfer layer, or postprocessing pass. Every clean target in training comes from the book, and the model learns to reconstruct those targets from corrupted versions of the same text. That is how the run pushes the denoiser toward the book's local distribution: long clauses, bureaucratic flatness, tennis/AA/institutional vocabulary, strange compression, and the kind of sentence drift that shows up before the model falls apart.

The result should be read narrowly. This is not a general language model and it is not trying to recover exact paragraphs from the book. It is a small RoBERTa-style denoiser posttrained into a text diffusion generator whose samples are shaped by the Infinite Jest training distribution.

Current checkpoint:

outputs/roberta-large-infinite-jest-mdlm-subs-selfcond-step500-preserved

Current write-up:

docs/current-dlm-writeup.md

Setup

python3 -m venv .venv
.venv/bin/python -m pip install --upgrade pip
.venv/bin/python -m pip install -r requirements.txt

Corpus

Fetch or extract the authorized Infinite Jest text into the training path:

make fetch

The text is written to:

data/infinite_jest.txt

Train

The current train target continues posttraining from the previous Infinite Jest diffusion checkpoint and writes a new Hugging Face (HF) masked-diffusion checkpoint.

make train

Default training configuration:

objective: mdlm-subs
corruption: mixed
uniform corruption fraction: 0.75
mask distribution: high
full mask fraction: 0.35
loss weighting: mdlm
self-conditioning probability: 0.25
self-conditioning strength: 0.5
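
For orientation, the loss weighting entry above most likely refers to the continuous-time MDLM weighting, in which the cross-entropy at corrupted positions is scaled by 1/t for a linear schedule with masking rate t. The sketch below is an assumption about that form and not the repo's code; the function name mdlm_weighted_loss and the t_min clamp are hypothetical.

```python
import math

def mdlm_weighted_loss(log_probs_of_target, is_corrupted, t, t_min=1e-3):
    """Negative log-likelihood over corrupted positions, scaled by 1/t.

    log_probs_of_target: per-position log-probability of the clean token.
    is_corrupted: per-position flags; only corrupted positions are scored.
    t: corruption rate in (0, 1] drawn for this example (linear schedule assumed).
    """
    if not 0.0 < t <= 1.0:
        raise ValueError("t must be in (0, 1]")
    # Hypothetical clamp: keeps the weight finite when t is very small.
    weight = 1.0 / max(t, t_min)
    nll = -sum(lp for lp, flag in zip(log_probs_of_target, is_corrupted) if flag)
    return weight * nll


# Usage: three positions, the first and last corrupted, t = 0.5.
example = mdlm_weighted_loss(
    [math.log(0.5), math.log(0.25), math.log(0.9)],
    [True, False, True],
    t=0.5,
)
```

Lightly corrupted examples (small t) contribute few scored positions, and the 1/t factor compensates for that, which is why some form of clamp is usually needed in practice.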

During posttraining, sampled continuation spans from data/infinite_jest.txt are corrupted with masks and random vocabulary tokens. The model sees the corrupted span and learns to predict the original book tokens at the corrupted positions. At sampling time, a prompt is held fixed and the continuation canvas is refined in parallel over repeated denoising steps.
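
The corruption step described above can be sketched roughly as follows. This is a minimal illustration and not the repo's implementation: it assumes that the uniform corruption fraction is the share of corrupted positions that receive a random vocabulary token instead of a mask, and the function name corrupt_span is hypothetical. The mask id and vocabulary size are those of the standard RoBERTa tokenizer.

```python
import random

MASK_ID = 50264     # <mask> id in the standard RoBERTa vocabulary
VOCAB_SIZE = 50265  # standard RoBERTa vocabulary size

def corrupt_span(token_ids, t, uniform_fraction=0.75, rng=None):
    """Corrupt each position independently with probability t.

    A corrupted position becomes a random vocabulary token with probability
    uniform_fraction (assumed meaning of the config value) and the mask token
    otherwise. Returns the corrupted ids and per-position flags marking the
    positions a loss would score.
    """
    rng = rng or random.Random()
    corrupted, is_corrupted = [], []
    for tok in token_ids:
        if rng.random() < t:
            if rng.random() < uniform_fraction:
                # The random draw may coincide with the original token or a
                # special token; a real pipeline might filter those out.
                corrupted.append(rng.randrange(VOCAB_SIZE))
            else:
                corrupted.append(MASK_ID)
            is_corrupted.append(True)
        else:
            corrupted.append(tok)
            is_corrupted.append(False)
    return corrupted, is_corrupted


# Usage: corrupt roughly half of a 50-token span with a fixed seed.
clean_ids = list(range(100, 150))
noisy_ids, flags = corrupt_span(clean_ids, t=0.5, rng=random.Random(0))
```

The model would then be asked to predict clean_ids at the positions where flags is true, with uncorrupted positions left as context.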

Sample

make sample

Override the prompt if needed:

make sample PROMPT="Serious juniors never pick up tennis balls with their hands."

The sampler uses uniform canvas initialization, uniform re-noising, entropy-based retention, cosine unmasking, and self-conditioning.
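
As a rough illustration of two of these pieces, the sketch below pairs a cosine schedule with entropy-based retention: at each step the most uncertain positions are re-noised and the rest are kept. The exact schedule and retention rule used by the repo's sampler are not shown here, so this is one common reading, and all function names are hypothetical.

```python
import math

def cosine_noisy_fraction(step, total_steps):
    """Fraction of canvas positions still treated as noise after `step` steps."""
    return math.cos(0.5 * math.pi * step / total_steps)

def entropy(probs):
    """Shannon entropy (in nats) of one position's predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def positions_to_renoise(per_position_probs, step, total_steps):
    """Indices of the highest-entropy positions to re-noise at this step.

    Positions not returned are retained, so confident predictions settle
    early and uncertain ones keep being resampled.
    """
    n = len(per_position_probs)
    n_noisy = round(cosine_noisy_fraction(step, total_steps) * n)
    by_entropy = sorted(
        range(n), key=lambda i: entropy(per_position_probs[i]), reverse=True
    )
    return sorted(by_entropy[:n_noisy])


# Usage: a four-position canvas with toy two-token distributions.
toy_probs = [[0.5, 0.5], [0.99, 0.01], [0.7, 0.3], [1.0, 0.0]]
renoise_mid = positions_to_renoise(toy_probs, step=8, total_steps=12)
renoise_end = positions_to_renoise(toy_probs, step=12, total_steps=12)
```

With a cosine schedule the noisy fraction falls slowly at first and quickly near the end, so most positions are committed in the later steps.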

GIF

make gif

The default output is:

assets/token-diffusion-mdlm-subs-selfcond-step500.gif

Ablations

make ablations

The ablation script compares:

mask-refine
uniform-refine
uniform-selfcond

The latest report is:

outputs/evals/hf-diffusion-ablations-mdlm-subs-step500.json
