A small, self-contained LLM inference engine for Apple Silicon —
built from scratch, in the open, to learn and teach how inference engineering works.
A failed star (a brown dwarf) is smaller than a dwarf star: not enough mass
to sustain fusion. This project is the smaller sibling of
Dwarf Star (ds4), antirez's
self-contained inference engine for DeepSeek-V4. Where ds4 targets big MoE
models on 96GB+ Macs, Failed Star runs a tiny model on a 64GB MacBook Pro
(M5) — and trades raw capability for something else: every line is meant to be
read, understood, and learned from.
The goal is understanding inference, by building it. Reading about attention is one thing; writing the kernel that computes it and watching tokens stream out of your own code is another. This repo is the second thing.
Three sources form its spine, cross-referenced throughout the docs. (The prerequisites point to a wider set of optional brush-up and go-deeper resources — those fill gaps; these three are what the docs lean on.)
- The concepts: Inference Engineering by Philip Kiely (Baseten, 2026). The "why" and the vocabulary. (Peruse the free interactive guide, or get your own copy from Baseten Books; this repo expects a local copy at `Inference Engineering.pdf`, which is ignored, so bring your own.)
- A real implementation: ds4, cloned into `reference/ds4/`. The "how a pro does it." Working code doesn't lie.
- Architecture context: Sebastian Raschka's free articles (the architecture comparison, gallery, and workflow for understanding LLMs). His book is a good optional extra, not a dependency; see the prerequisites.
- Host language: Rust. Model loading, tokenizer, orchestration, sampling, KV cache — all Rust.
- GPU kernels: MSL (Metal Shading Language), hand-written, one operation per file, just like ds4's `metal/` shaders.
- Metal via raw FFI / the Objective-C runtime, with no convenience wrapper crate. We send messages to Metal ourselves so nothing is hidden. Tight, like ds4.
- First model: Qwen3-0.6B, a tiny dense model with GQA, RoPE, SwiGLU, and RMSNorm; small enough to inspect and debug while still looking like a real modern LLM.
- Correctness via golden vectors: match logits from the model's official implementation. (Python appears only as a one-shot oracle, never as a second engine.)
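A golden-vector check of this kind can be sketched in a few lines of Rust. This is a minimal illustration, not the repo's actual test harness: the names `LogitDiff`, `compare_logits`, and `matches_golden` are hypothetical, and the tolerance is an assumption you would tune per dtype and per milestone.

```rust
/// Summary of how far our logits are from the golden reference.
/// (Hypothetical type for this sketch; not taken from the repo.)
#[derive(Debug)]
pub struct LogitDiff {
    /// Largest absolute element-wise difference (NaN if any element is NaN).
    pub max_abs_err: f32,
    /// Index at which that largest difference occurs.
    pub worst_index: usize,
    /// Whether both vectors choose the same top token.
    pub same_argmax: bool,
}

/// Index of the largest value; returns 0 for an empty slice.
fn argmax(xs: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in xs.iter().enumerate() {
        if x > xs[best] {
            best = i;
        }
    }
    best
}

/// Compare two logit vectors element by element.
/// Returns None when the lengths differ or the vectors are empty.
pub fn compare_logits(ours: &[f32], golden: &[f32]) -> Option<LogitDiff> {
    if ours.len() != golden.len() || ours.is_empty() {
        return None;
    }
    let mut max_abs_err = 0.0f32;
    let mut worst_index = 0;
    for (i, (a, b)) in ours.iter().zip(golden).enumerate() {
        let err = (a - b).abs();
        // A NaN error must never be silently skipped, so it always wins.
        if err.is_nan() || err > max_abs_err {
            max_abs_err = err;
            worst_index = i;
        }
    }
    Some(LogitDiff {
        max_abs_err,
        worst_index,
        same_argmax: argmax(ours) == argmax(golden),
    })
}

/// True when every logit is within `tol` of the reference and the top
/// token agrees. `tol` is an assumed, caller-chosen tolerance.
pub fn matches_golden(ours: &[f32], golden: &[f32], tol: f32) -> bool {
    match compare_logits(ours, golden) {
        Some(d) => d.same_argmax && d.max_abs_err <= tol,
        None => false,
    }
}
```

Reporting the worst index alongside the maximum error is a deliberate choice: when a check fails, knowing which vocabulary entry drifted the most is usually the fastest way to start debugging.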
fs/
├── README.md ← you are here
├── PLAN.md ← the milestone curriculum (M0 … M7+)
├── PROGRESS.md ← running session log; start here each session
├── Inference Engineering.pdf ← local copy of the book (ignored; bring your own)
├── src/ ← Rust engine + thin CLI
├── scripts/ ← uv-managed Python oracle/data scripts
├── tests/golden/ ← committed golden fixtures for verification
├── tools/ ← site/sync helper scripts
├── docs/ ← the learning site + notes (served at /fs via Pages)
│ ├── index.html ← learning-site landing page (rich HTML)
│ ├── prerequisites.md ← what to know before diving in (read this first)
│ ├── 00-map.md ← THE BIG PICTURE of an inference engine
│ ├── 01-tokenizer.md ← M0 writeup (.md + rich .html version)
│ ├── dev-loop.md ← how to resume work after a break
│ ├── testing.md ← verification strategy and golden-vector plan
│ ├── diagrams.html ← shared diagram gallery
│ ├── RESOURCES.md ← cross-reference index (book §§, ds4 files, Raschka)
│ ├── learnings/ ← bite-sized notes on what we figured out & why
│ └── assets/ ← logo + site assets
├── reference/ds4/ ← antirez's ds4 — pinned git submodule (read-only ref)
└── models/ ← downloaded model assets (ignored; generated locally)
- Read `docs/prerequisites.md`: the honest "what to know before you dive in" (spoiler: inference is the forward pass only, with no training and no calculus), with brush-up resources and a knowledge map.
- Read `docs/00-map.md`: the end-to-end picture of an inference engine, with an "abstraction ladder" so you can stop digging at whatever depth interests you.
- Skim `PLAN.md`: the milestones.
- Each session, open `PROGRESS.md` to see what's next.
- If resuming development, use `docs/dev-loop.md` and `docs/testing.md` for the local checks and verification strategy.
🌱 M0 (Tokenizer): ✅ done. M1 (Load the weights): in progress. The
byte-level BPE tokenizer is implemented and verified: `fs tokenize` /
`fs detokenize` run end-to-end against Qwen3-0.6B, loading vocab + merges + regex
+ special tokens from the single `tokenizer.json` (14/14 golden cases pass; see
`docs/01-tokenizer.md`). Next step: parse the safetensors weights and
`config.json` so `fs inspect model/` can print the architecture and tensor table.
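To make that next step concrete, here is a minimal sketch of the first part of reading a safetensors file. The format begins with an 8-byte little-endian unsigned integer giving the length of a UTF-8 JSON header, followed by the header itself and then the raw tensor bytes. The function name `read_safetensors_header` and the size guard `MAX_HEADER_BYTES` are assumptions made for illustration, not names from this repo.

```rust
use std::io::{self, Read};

/// Upper bound on the header size this sketch will allocate.
/// An arbitrary guard chosen for illustration, not part of the format.
const MAX_HEADER_BYTES: u64 = 100 * 1024 * 1024;

/// Read the JSON header of a safetensors stream and return it as a string.
///
/// Layout: 8 bytes (u64, little-endian) = header length N,
/// then N bytes of UTF-8 JSON, then the tensor data buffer.
pub fn read_safetensors_header<R: Read>(mut reader: R) -> io::Result<String> {
    // 1. The first 8 bytes hold the header length.
    let mut len_bytes = [0u8; 8];
    reader.read_exact(&mut len_bytes)?;
    let header_len = u64::from_le_bytes(len_bytes);

    // 2. Refuse absurd lengths before allocating anything.
    if header_len > MAX_HEADER_BYTES {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "safetensors header length exceeds sanity limit",
        ));
    }

    // 3. Read exactly that many bytes and validate them as UTF-8.
    let mut header = vec![0u8; header_len as usize];
    reader.read_exact(&mut header)?;
    String::from_utf8(header).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

In practice you would pass in a `std::fs::File` opened on the model's weights file and hand the returned string to a JSON parser of your choice. Each header entry describes one tensor with its `dtype`, `shape`, and `data_offsets` into the data buffer, which is enough to print a tensor table without touching the weights themselves.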
Milestones (the full curriculum, with cross-links, lives in PLAN.md):
- M0 — Tokenizer — text ↔ token IDs, verified against the real vocab
- M1 — Load the weights ← current
- M2 — Forward pass → logits
- M3 — Sampling → generation
- M4 — KV cache
- M5 — Quantization
- M6 — Metal acceleration
- M7+ — Stretch goals
This is a slow, multi-session learning project. It is not (yet) fast, capable, or finished — that's the point. Local models keep getting better; the bet is that a clean, well-documented small engine becomes more useful to more people over time.