Supporting code for A Compression View of LLMs. See article.md for the full argument.
The code consists of eight small scripts wired in a pipe: an LLM predicts the next token, the surprises are ANS-encoded, and you can run it backward to decompress or to make up text by supplying correction bytes on stdin instead of a .bin file. This is not meant for real compression work.
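The idea in the paragraph above can be sketched without downloading a model. This is a toy illustration, not the repo's code: a character bigram counter stands in for the LLM, and the names `build_toy_model`, `to_ranks`, and `from_ranks` are made up for this example. Each character is replaced by its position in the model's ranked prediction list, so well-predicted text turns into mostly small numbers, which is the kind of skew an entropy coder can exploit. Running the same loop backward with the ranks restores the text, and feeding all zeros just follows the model's first choice.

```python
from collections import Counter, defaultdict


def build_toy_model(training_text):
    """Return ranked(context): candidates ordered most likely first.

    A bigram counter over characters, used here only as a stand-in for an LLM.
    """
    vocab = sorted(set(training_text))
    follow = defaultdict(Counter)
    for prev, nxt in zip(training_text, training_text[1:]):
        follow[prev][nxt] += 1

    def ranked(context):
        counts = follow[context[-1]] if context else Counter()
        # Descending count, then alphabetical, so the order is deterministic.
        return sorted(vocab, key=lambda ch: (-counts[ch], ch))

    return ranked


def to_ranks(text, ranked):
    """Compression direction: replace each character by its rank under the model.

    Assumes every character of `text` occurs in the training text.
    """
    return [ranked(text[:i]).index(ch) for i, ch in enumerate(text)]


def from_ranks(ranks, ranked, prompt=""):
    """Decompression or generation: pick the candidate at each given rank."""
    out = prompt
    for r in ranks:
        out += ranked(out)[r]
    return out


ranked = build_toy_model("the cat sat on the mat. the cat ate the rat.")
message = "the rat sat"
ranks = to_ranks(message, ranked)
restored = from_ranks(ranks, ranked)                  # lossless round trip
greedy = from_ranks([0] * 8, ranked, prompt="th")     # all zeros: first choice each step
print(ranks)
print(restored)
print(greedy)
```

The real pipeline works on tokens rather than characters and adds an entropy-coding stage after this step; the sketch only shows why a good predictor makes the correction stream cheap to encode.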
Needs Python 3.9+, PyTorch, transformers, NumPy, Matplotlib (requirements.txt). The first run downloads weights from Hugging Face.
Same layout as the article. Default model is gpt2; use -n for another Hugging Face causal LM.
The run.sh scripts put the scripts directory on PYTHONPATH. From the repo root, python3 -m scripts.tokensOfStr etc. also works.
Opening of A Study in Scarlet in examples/compressor/.
cd examples/compressor
./run.sh AStudyInScarlet.txt

Writes AStudyInScarlet.bin, files under AStudyInScarlet/, and AStudyInScarlet.decomp.txt to check the lossless round-trip. modelToInx and modelOfInx are slow: one forward pass per token. The size table in the article uses AStudyInScarlet2.txt (same text, second copy in the tree).
Gzip baseline with the period case-flip trick: examples/gzipFilp.
"Generating Some Text" in the article. Prompt on -c, corrections from stdin. All zeros means always take the model's first choice in the top-k list:
cd examples/generate
./run.sh

Run the compressor example first, then:
cd figures/symbolHist && ./run.sh
cd figures/heatmap && ./run.sh

Both read examples/compressor/AStudyInScarlet/AStudyInScarlet.ttout.json (figures 4 and 5 in the article).
Compression: tokensOfStr.py, modelToInx.py, rangeEncode.py, binOfRange.py
Decompression: binToRange.py, rangeDecode.py, modelOfInx.py, tokensToStr.py
tokensOfData.py is only for the generation demo. Stages speak JSON on purpose so you can poke at or plot the intermediates.
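One way to poke at an intermediate file without knowing its layout is a small schema-agnostic summary. The sketch below is an assumption-laden helper, not part of the repo: `describe` and `describe_file` are hypothetical names, and it assumes each intermediate file holds a single JSON document.

```python
import json


def describe(value, depth=0, max_depth=2):
    """Return a short description of a decoded JSON value without assuming a schema."""
    if isinstance(value, dict):
        if depth >= max_depth:
            return f"object with {len(value)} keys"
        inner = ", ".join(
            f"{key}: {describe(item, depth + 1, max_depth)}"
            for key, item in value.items()
        )
        return "{" + inner + "}"
    if isinstance(value, list):
        head = describe(value[0], depth + 1, max_depth) if value else "nothing"
        return f"list of {len(value)} (first item: {head})"
    # Scalars: report the Python type the JSON value decoded to.
    return type(value).__name__


def describe_file(path):
    """Load one JSON document from `path` and summarize its structure."""
    with open(path, encoding="utf-8") as f:
        return describe(json.load(f))
```

For instance, after running the compressor example, `print(describe_file("examples/compressor/AStudyInScarlet/AStudyInScarlet.ttout.json"))` from the repo root should show the overall shape of that file before you decide how to plot it.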
The article quotes gpt2 on the Scarlet excerpt: .bin about 25% of raw .txt, gzip -9 about 39%. The .bin does not include the model weights. Phi-2 should do better; re-run ./run.sh after changing model.sh and compare your *.bin size.
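To compare your own run against those figures, a few lines of Python are enough. This is a minimal sketch with hypothetical helper names (`size_ratio`, `gzip9_ratio`); it uses Python's `gzip` module at level 9 as a stand-in for the `gzip -9` command, so the result may differ by a few header bytes, and it assumes the input file is not empty.

```python
import gzip
import os


def size_ratio(compressed_path, original_path):
    """Size of the compressed file as a fraction of the original file."""
    return os.path.getsize(compressed_path) / os.path.getsize(original_path)


def gzip9_ratio(original_path):
    """Fraction of the original size left after gzip at compression level 9."""
    with open(original_path, "rb") as f:
        raw = f.read()
    return len(gzip.compress(raw, compresslevel=9)) / len(raw)
```

From examples/compressor, `size_ratio("AStudyInScarlet.bin", "AStudyInScarlet.txt")` gives the model-based ratio and `gzip9_ratio("AStudyInScarlet.txt")` gives the baseline for the same input.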