A simple testbed project covering various optimization tricks used in modern transformer architectures. It was abandoned for a while but is now under active development. It originated from the Let's build GPT:... video by Andrej Karpathy, and the default startup parameters are close to those shown at some point in the video.
The project is focused primarily on tiny models and training/inference on CPU.
- model weights saving and loading (May 2026)
- enum dispatch based model construction managed from CLI (May 13, 2026)
- RMSnorm (May 13, 2026)
- BPE tokenizer
- parallel transformer block (May 13, 2026)
- rotary position embedding
- polar position embedding
- Mixture-of-Experts
- random feature attention
- Taylor series based softmax approximation
- sliced ReLU attention
- QK-norm
- MQA
- GQA
- MLA
- KV-caching
- speculative decoding
- Muon optimizer (???)
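
As a rough illustration of one item from the list above, RMSnorm can be sketched in a few lines of plain Python. This is not the project's implementation; the function name `rms_norm`, the `gain` parameter, and the `eps` default are assumptions chosen for the example.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then apply a per-element gain.

    Hypothetical helper for illustration only; eps guards against
    division by zero for an all-zero input.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]

# With unit gain the output has a mean square of approximately 1.
y = rms_norm([3.0, 4.0], [1.0, 1.0])
```

Unlike LayerNorm, this variant does not subtract the mean or add a bias, which saves a little work per token and is one reason it suits small CPU-bound models.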