I make LLMs run well on real, constrained hardware — on-device, edge, Apple Silicon — and build the products around them.
The recurring question in my work: what actually limits LLM inference on a machine you own, and how do those limits change as you scale? Background: hardware (e-paper boards shipped to 20+ countries), Rust systems tooling, and AI-native full-stack development.
On-device / inference (Apple's MLX)
- Contributor to mlx-lm. Merged: #1349 — enables text-mode loading of Gemma 4 (gemma4_unified) checkpoints on MLX.
- #1329 (approved) — root-caused why Mistral/Devstral (tekken-v13) models emit Ġ instead of spaces on Apple Silicon, and fixed the detokenizer routing. The writeup.
- First merged contribution to vLLM on Apple Silicon (#382).
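For context on the Ġ symptom, here is a minimal sketch of the GPT-2-style byte-to-character table that byte-level BPE tokenizers commonly use. It is an illustration only, not the code from #1329, and it assumes the affected tokenizer follows the GPT-2 convention; the helper names `encode_surface` and `decode_surface` are hypothetical. In this table the space byte (0x20) is stored as Ġ (U+0120), so any decode path that skips the reverse mapping leaks Ġ into the output.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Byte -> printable character table in the style of GPT-2 byte-level BPE."""
    # Bytes that are already printable map to themselves.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in keep}
    # Every other byte (space, control characters, ...) is shifted to 256 and up.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def encode_surface(text: str) -> str:
    """Hypothetical helper: the surface form a byte-level vocabulary stores."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def decode_surface(surface: str) -> str:
    """Hypothetical helper: the byte-level decode step that restores real spaces."""
    raw = bytes(CHAR_TO_BYTE[c] for c in surface)
    return raw.decode("utf-8", errors="replace")


surface = encode_surface("hello world")
print(surface)                  # surface form, with the space stored as U+0120
print(decode_surface(surface))  # original text, with the space restored
```

If the surface string is printed directly, or routed through a decoder that does not apply the reverse table, the reader sees Ġ where the space should be.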
On Device — measuring the bottlenecks
- ondevice-bench — an open-source, verifiable LLM benchmark that runs on the laptop in front of you, not a datacenter GPU. Code is executed and checked; no rubric scoring.
- Recent findings, measured on a 16 GB M3 MacBook Air:
  - Turning a model's reasoning off makes Qwen3-4B beat Llama-3.1-8B — at half the RAM.
  - Gemma 4 12B: the multimodal tax — 11 GB and 2.7 tok/s for no text-quality gain.
  - Three bugs in my own benchmark — how a false fail looks exactly like a real one.
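To make "executed and checked" concrete, here is a minimal sketch of execution-based grading. It is not the ondevice-bench harness: the function name `run_and_check`, its parameters, and the pass rule (exit code 0 within a timeout) are assumptions made for illustration, and it does no sandboxing.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_check(candidate: str, checks: str, timeout_s: float = 10.0) -> bool:
    """Return True only if candidate code plus its checks exit cleanly in time.

    Hypothetical grader: `candidate` is model-generated Python source and
    `checks` is a block of plain assert statements appended after it.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "case.py"
        script.write_text(candidate + "\n\n" + checks + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],  # fresh interpreter per case
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False
    # A failed assert, a syntax error, or a crash all give a nonzero exit code.
    return result.returncode == 0


good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
print(run_and_check(good, "assert add(2, 3) == 5"))
print(run_and_check(bad, "assert add(2, 3) == 5"))
```

In this sketch a timeout, a harness mistake in `checks`, and a genuinely wrong answer all collapse into the same `False`, so a real harness would likely want to record the failure reason alongside the verdict.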
Other work
- paperd.ink — open-source ESP32 e-paper dev board, in makers' hands across 20+ countries.
- vcfkit — genomics CLI in Rust; 4× faster than bcftools, single static binary.
- Hacker Newspaper — comments-first mobile Hacker News reader.
Writing about on-device LLMs at prasadkhake.com · On Device.