Skip to content

curtisalexander/fs

Repository files navigation

Failed Star logo: a sad, hunched cartoon star in a faint brown-dwarf glow

Failed Star (fs)

A small, self-contained LLM inference engine for Apple Silicon —
built from scratch, in the open, to learn and teach how inference engineering works.

📖 Read the learning site →

A failed star (a brown dwarf) is smaller than a dwarf star: not enough mass to sustain fusion. This project is the smaller sibling of Dwarf Star (ds4), antirez's self-contained inference engine for DeepSeek-V4. Where ds4 targets big MoE models on 96GB+ Macs, Failed Star runs a tiny model on a 64GB MacBook Pro (M5) — and trades raw capability for something else: every line is meant to be read, understood, and learned from.

Why this exists

The goal is understanding inference, by building it. Reading about attention is one thing; writing the kernel that computes it and watching tokens stream out of your own code is another. This repo is the second thing.

Three sources form its spine, cross-referenced throughout the docs. (The prerequisites point to a wider set of optional brush-up and go-deeper resources — those fill gaps; these three are what the docs lean on.)

  1. The conceptsInference Engineering by Philip Kiely (Baseten, 2026). The "why" and the vocabulary. (Peruse the free interactive guide, or get your own copy from Baseten Books; Inference Engineering.pdf is in this repo.)
  2. A real implementationds4, cloned into reference/ds4/. The "how a pro does it." Working code doesn't lie.
  3. Architecture context — Sebastian Raschka's free articles: the architecture comparison, gallery, and workflow for understanding LLMs. (His book is a good optional extra, not a dependency — see the prerequisites.)

How it's built

  • Host language: Rust. Model loading, tokenizer, orchestration, sampling, KV cache — all Rust.
  • GPU kernels: MSL (Metal Shading Language) — hand-written, one operation per file, just like ds4's metal/ shaders.
  • Metal via raw FFI / the Objective-C runtime — no convenience wrapper crate. We send messages to Metal ourselves so nothing is hidden. Tight, like ds4.
  • First model: Qwen3-0.6B — a tiny dense model with GQA, RoPE, SwiGLU, and RMSNorm; small enough to inspect and debug while still looking like a real modern LLM.
  • Correctness via golden vectors: match logits from the model's official implementation. (Python appears only as a one-shot oracle, never as a second engine.)

Repo layout

fs/
├── README.md                  ← you are here
├── PLAN.md                    ← the milestone curriculum (M0 … M7+)
├── PROGRESS.md                ← running session log; start here each session
├── Inference Engineering.pdf  ← local copy of the book (ignored; bring your own)
├── src/                       ← Rust engine + thin CLI
├── scripts/                   ← uv-managed Python oracle/data scripts
├── tests/golden/              ← committed golden fixtures for verification
├── tools/                     ← site/sync helper scripts
├── docs/                       ← the learning site + notes (served at /fs via Pages)
│   ├── index.html             ← learning-site landing page (rich HTML)
│   ├── prerequisites.md       ← what to know before diving in (read this first)
│   ├── 00-map.md              ← THE BIG PICTURE of an inference engine
│   ├── 01-tokenizer.md        ← M0 writeup (.md + rich .html version)
│   ├── dev-loop.md            ← how to resume work after a break
│   ├── testing.md             ← verification strategy and golden-vector plan
│   ├── diagrams.html          ← shared diagram gallery
│   ├── RESOURCES.md           ← cross-reference index (book §§, ds4 files, Raschka)
│   ├── learnings/             ← bite-sized notes on what we figured out & why
│   └── assets/                ← logo + site assets
├── reference/ds4/             ← antirez's ds4 — pinned git submodule (read-only ref)
└── models/                    ← downloaded model assets (ignored; generated locally)

Where to start

  1. Read docs/prerequisites.md — the honest "what to know before you dive in" (spoiler: inference is the forward pass only — no training, no calculus), with brush-up resources and a knowledge-map.
  2. Read docs/00-map.md — the end-to-end picture of an inference engine, with an "abstraction ladder" so you can stop digging at whatever depth interests you.
  3. Skim PLAN.md — the milestones.
  4. Each session, open PROGRESS.md to see what's next.
  5. If resuming development, use docs/dev-loop.md and docs/testing.md for the local checks and verification strategy.

Status

🌱 M0 — Tokenizer: ✅ done. M1 — Load the weights: in progress. The byte-level BPE tokenizer is implemented and verified — fs tokenize / fs detokenize run end-to-end against Qwen3-0.6B, loading vocab + merges + regex

  • special tokens from the single tokenizer.json (14/14 golden cases pass; see docs/01-tokenizer.md). Next step: parse the safetensors weights and config.json so fs inspect model/ can print the architecture and tensor table.

Milestones (the full curriculum, with cross-links, lives in PLAN.md):

  • M0 — Tokenizer — text ↔ token IDs, verified against the real vocab
  • M1 — Load the weightscurrent
  • M2 — Forward pass → logits
  • M3 — Sampling → generation
  • M4 — KV cache
  • M5 — Quantization
  • M6 — Metal acceleration
  • M7+ — Stretch goals

This is a slow, multi-session learning project. It is not (yet) fast, capable, or finished — that's the point. Local models keep getting better; the bet is that a clean, well-documented small engine becomes more useful to more people over time.

About

A small, self-contained LLM inference engine for Apple Silicon — built from scratch, in the open, to learn and teach how inference engineering works.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors