Production-inspired LLM inference engine built from scratch in PyTorch, modeled on the vLLM architecture. Implements KV caching, continuous batching, paged memory management, and async serving, with benchmarks at each optimization layer.
11.1× system throughput — continuous batching + paged cache under 64 concurrent users
122 tests across 15 modules | GPT-2 124M on A100 80GB PCIe
Motivation
Production LLM serving systems (vLLM, TGI, TensorRT-LLM) are 100K+ line codebases mixing C++, CUDA, and Python. Understanding why they make specific design decisions — KV caching, continuous batching, paged memory — is difficult from reading production code alone.
This project builds the inference stack from scratch, implementing each optimization incrementally:
Transformer forward pass — understand the computation graph
Async serving — FastAPI server with backpressure and routing
Load testing — end-to-end benchmarks under concurrent load (11.1× system throughput)
Each layer builds on the previous one. Benchmarks at each checkpoint quantify the impact, showing what matters in LLM inference and why.
Problem Statement
This project isolates and benchmarks each inference optimization individually — measuring what matters and by how much.
Batching is the single largest win. 118× from bs=1 to bs=512 — GPU utilization goes from <5% to saturated. Everything else is secondary.
KV cache gain is sequence-length-dependent. At 200 tokens: 1.02×. At 1000: 2.22×. The optimization only matters when recomputation cost dominates.
Paged cache only wins on memory above batch ~32. Below that, contiguous allocation uses less memory (no block metadata overhead), with the crossover at ~24–32 sequences.
Python scatter/gather is the paged cache bottleneck: it costs ~1.4–1.8× in throughput versus contiguous. Production systems solve this with fused PagedAttention CUDA kernels.
Paged memory stays nearly flat regardless of batch size (2,681→3,220 MB) because it only allocates blocks actually used. The standard contiguous cache grows linearly and OOMs at batch 1024.
Long prompts don't hurt paged serving. 9.5× throughput at long sequences vs 11.1× at short — vectorized cache updates scale well.
Kernel dispatch changes with KV cache. Baseline is sgemm-dominated (full matrix × matrix); KV cache shifts to gemv (matrix × vector) — 47% CUDA time reduction.
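The memory findings above can be sanity-checked with back-of-envelope arithmetic. The sketch below uses the GPT-2 124M shape (12 layers, 12 heads, head_dim 64, 1024-token context). The fp32 cache dtype, the block size of 16, and the function names are assumptions made for illustration, not values taken from this project. It ignores block-table metadata and any fixed pool overhead, so it does not model the small-batch crossover and is not meant to reproduce the reported MB figures.

```python
import math

# GPT-2 124M shape: 12 layers, 12 heads, head_dim 64, 1024-token context.
N_LAYERS, N_HEADS, HEAD_DIM, MAX_SEQ_LEN = 12, 12, 64, 1024
DTYPE_BYTES = 4  # assumption: fp32 cache entries
BLOCK_SIZE = 16  # assumption: tokens per paged block

# Bytes of K and V stored for one token across all layers.
BYTES_PER_TOKEN = 2 * N_LAYERS * N_HEADS * HEAD_DIM * DTYPE_BYTES


def contiguous_kv_bytes(batch_size: int) -> int:
    """Pre-allocated cache: every slot reserves MAX_SEQ_LEN tokens."""
    return batch_size * MAX_SEQ_LEN * BYTES_PER_TOKEN


def paged_kv_bytes(seq_lens: list[int]) -> int:
    """Paged cache: each sequence holds only the blocks it has touched."""
    blocks = sum(math.ceil(n / BLOCK_SIZE) for n in seq_lens)
    return blocks * BLOCK_SIZE * BYTES_PER_TOKEN


if __name__ == "__main__":
    MIB = 1024 ** 2
    for batch in (1, 64, 1024):
        contiguous = contiguous_kv_bytes(batch) / MIB
        paged = paged_kv_bytes([200] * batch) / MIB  # 200-token sequences
        print(f"batch={batch:5d}  contiguous={contiguous:9.0f} MiB  paged={paged:9.0f} MiB")
```

Under these assumptions the contiguous cache reserves 72 MiB per sequence whether or not the tokens are ever generated, so batch 1024 would need 72 GiB for the cache alone. That is in the range where an 80 GB card runs out of memory.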
Failure Analysis
| What broke | Root cause | Lesson |
| --- | --- | --- |
| Paged cache slower than contiguous at small batch | Block metadata + Python-level scatter/gather dominates when fragmentation isn't the bottleneck | |
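To make the scatter/gather cost concrete, here is a toy sketch of a paged store. Plain Python lists stand in for the pool tensors, and the class and method names (`PagedKVStore`, `append`, `gather`, `release`) are hypothetical rather than this project's API. Every write and read goes through a block-table lookup, which is one plausible reading of the overhead described above.

```python
class PagedKVStore:
    """Toy paged store: fixed-size blocks, a free list, per-sequence block tables."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        # Physical pool: each block holds up to block_size entries.
        self.pool = [[None] * block_size for _ in range(num_blocks)]
        # Free list ordered so that pop() hands out block 0 first.
        self.free = list(range(num_blocks - 1, -1, -1))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append(self, seq_id, entry):
        """Scatter: write one token's entry into the sequence's current block."""
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("block pool exhausted")
            table.append(self.free.pop())
        block = table[n // self.block_size]
        self.pool[block][n % self.block_size] = entry
        self.lengths[seq_id] = n + 1

    def gather(self, seq_id):
        """Gather: rebuild the logical sequence from non-contiguous blocks."""
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.get(seq_id, [])
        size = self.block_size
        return [self.pool[table[i // size]][i % size] for i in range(n)]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free list."""
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    store = PagedKVStore(num_blocks=4, block_size=2)
    for seq_id, entry in [("a", "a0"), ("b", "b0"), ("a", "a1"), ("a", "a2")]:
        store.append(seq_id, entry)
    print(store.block_tables["a"], store.gather("a"))
```

With interleaved appends, sequence `a` ends up in physical blocks 0 and 2, yet `gather` still returns its tokens in logical order. A fused kernel would read the blocks in place instead of rebuilding the sequence on every step.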
KV Cache — pre-allocated tensors (B, n_heads, max_seq_len, head_dim) per layer; decode appends one token's K/V per step
Paged KV Cache — block pool (num_blocks, n_heads, block_size, head_dim), free-list allocator, block table for logical→physical mapping, PagedCacheContext adapter for drop-in compatibility
Continuous Batching — iteration-level scheduler that evicts completed sequences and fills vacant slots per decode step; ContinuousKVCache with reset_slot() for reuse
Serving Layer — FastAPI with asyncio.Semaphore (503 at capacity), asyncio.wait_for (504 on timeout), background generation loop, per-request futures
Profiling — torch.profiler with CUDA event timing, GPU utilization via pynvml, MFU calculation
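The iteration-level scheduling described for continuous batching can be sketched as a small simulation that counts decode steps only. `Request`, `make_requests`, `run_continuous`, and `run_static` are hypothetical names for illustration, not this project's scheduler or its `ContinuousKVCache`. Prefill cost and per-step latency differences are ignored.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    tokens_left: int  # decode steps still needed


def make_requests(lengths):
    return [Request(i, n) for i, n in enumerate(lengths)]


def run_continuous(requests, num_slots):
    """Iteration-level scheduling: refill vacant slots before every decode step."""
    waiting = deque(requests)
    slots = [None] * num_slots
    finished_at = {}
    step = 0
    while waiting or any(s is not None for s in slots):
        # Fill vacant slots from the waiting queue.
        for i in range(num_slots):
            if slots[i] is None and waiting:
                slots[i] = waiting.popleft()
        # One decode step: every active sequence emits one token.
        step += 1
        for i, req in enumerate(slots):
            if req is None:
                continue
            req.tokens_left -= 1
            if req.tokens_left <= 0:
                finished_at[req.req_id] = step
                slots[i] = None  # evict; a real cache would reset this slot here
    return step, finished_at


def run_static(requests, num_slots):
    """Static batching baseline: each batch runs until its longest member finishes."""
    lengths = [r.tokens_left for r in requests]
    return sum(max(lengths[i:i + num_slots]) for i in range(0, len(lengths), num_slots))


if __name__ == "__main__":
    lengths = [2, 8, 2, 2, 8, 2]
    steps, _ = run_continuous(make_requests(lengths), num_slots=2)
    print("continuous:", steps, "static:", run_static(make_requests(lengths), 2))
```

For this mixed workload the continuous scheduler finishes in 14 decode steps against 18 for static batching, because short requests no longer wait for the longest sequence in their batch.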
Paged KV Cache vs PagedAttention
| | Paged KV Cache (this project) | PagedAttention (vLLM) |
| --- | --- | --- |
| What | Memory management layer: KV entries stored in fixed-size blocks | Complete attention algorithm: fused CUDA kernel operating directly on non-contiguous blocks |