This repo contains my current Infinite Jest diffusion language model experiment.
The active model starts from RoBERTa-large and gets posttrained on data/infinite_jest.txt with a masked denoising objective. It is diffusion-only: no causal attention mask and no next-token prediction objective.
The Infinite Jest corpus is doing the style work. There is no DFW prompt template, style-transfer layer, or postprocessing pass. Every clean target in training comes from the book, and the model learns to reconstruct those targets from corrupted versions of the same text. That is how the run pushes the denoiser toward the book's local distribution: long clauses, bureaucratic flatness, tennis/AA/institutional vocabulary, strange compression, and the kind of sentence drift that shows up before the model falls apart.
The result should be read narrowly. This is not a general language model and it is not trying to recover exact paragraphs from the book. It is a small RoBERTa-style denoiser posttrained into a text diffusion generator whose samples are conditioned by the Infinite Jest training distribution.
Current checkpoint:
outputs/roberta-large-infinite-jest-mdlm-subs-selfcond-step500-preserved
Current write-up:
Current GIFs:
python3 -m venv .venv
.venv/bin/python -m pip install --upgrade pip
.venv/bin/python -m pip install -r requirements.txtFetch or extract the authorized Infinite Jest text into the training path:
make fetchThis downloads or extracts the authorized source into:
data/infinite_jest.txt
The current training target posttrains the previous Infinite Jest diffusion checkpoint and writes a new HF masked-diffusion checkpoint.
make trainDefault training configuration:
objective: mdlm-subs
corruption: mixed
uniform corruption fraction: 0.75
mask distribution: high
full mask fraction: 0.35
loss weighting: mdlm
self-conditioning probability: 0.25
self-conditioning strength: 0.5
During posttraining, sampled continuation spans from data/infinite_jest.txt are corrupted with masks and random vocabulary tokens. The model sees the corrupted span and learns to predict the original book tokens at the corrupted positions. At sampling time, a prompt is held fixed and the continuation canvas is refined in parallel over repeated denoising steps.
make sampleOverride the prompt if needed:
make sample PROMPT="Serious juniors never pick up tennis balls with their hands."The sampler uses uniform canvas initialization, uniform re-noising, entropy-based retention, cosine unmasking, and self-conditioning.
make gifThe default output is:
assets/token-diffusion-mdlm-subs-selfcond-step500.gif
make ablationsThe ablation script compares:
mask-refine
uniform-refine
uniform-selfcond
The latest report is:
outputs/evals/hf-diffusion-ablations-mdlm-subs-step500.json

