Skip to content

guoqingbao/xinfer

Repository files navigation

xInfer
Blazing-fast LLM inference in pure Rust. No PyTorch. No Python runtime. Just fast, portable, production-ready inference.
English | 简体中文


✨ Why xInfer?

Feature Details
0️⃣ Zero Python dependencies Pure Rust backend — no PyTorch, no CUDA Python bindings
Fast Native Flash Attention, FlashInfer, CUDA Graphs, continuous batching, prefix caching, PD disaggregation. Up to 197 tok/s decode for 30B+ models on consumer GPUs
🪶 Tiny footprint Core scheduling + attention logic in < 5 000 lines of Rust
🌍 Cross-platform CUDA (Linux/Windows), Metal (macOS). Same binary, same API
🏭 Production-ready OpenAI/Anthropic-compatible APIs, built-in ChatGPT-style Web UI, MCP tool calling, structured outputs, embedding + tokenizer endpoints
🗜️ Aggressive KV compression TurboQuant (2–4 bit KV cache) extends context up to 4.3× with minimal quality loss. Run 30B+ MoE models with millions of context on single 24/32 GB GPUs
🔥 V100 + NVFP4 First-ever NVFP4 + low-bit KV cache on V100 — no hardware FP4 needed, coherent output on legacy GPUs
🐍 Lightweight Python bindings Optional PyO3 wheel when you need a Python entry point

🚀 Quick Start

📦 Install

Option 1 — Install DEB or Python package

curl -sSL https://guoqingbao.github.io/xinfer/install.sh | bash

Option 2 — npm

npm install -g xinfer-ai

▶️ Run

Using HuggingFace Model ID:

xinfer --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo4 --ui-server

Using local model path:

xinfer --w /home/Qwen3.6-35B-A3B --d 0,1 --ui-server

Python usage:

# python3 -m xinfer.chat
python3 -m xinfer.server --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo4 --ui-server

Tip: Open http://IP:8001 for the built-in chat UI, or use http://IP:8000/v1/ as your API Base URL.


🗜️ KV Cache Compression

Add --kvcache-dtype to compress KV cache and extend context length:

Flag (--kvcache-dtype) Compression Quality GPU Requirement
(default) 1× (BF16) Baseline All
fp8 Near-lossless SM70+ / Apple M1
turbo8 2.6× 79–100% throughput SM70+ / Apple M1
turbo4 3.7× Best balance SM70+ / Apple M1
turbo3 4.7× Max compression SM70+

📈 Performance

Tested on V100-32G, A100-40G, Hopper-80G and RTX 5090

Model Format Size Decoding Speed
Ministral-3-3B (Multimodal) ISQ (BF16→Q4K) 3B 193.67 tokens/s
Qwen3-VL-8B-Instruct (Multimodal) Q8_0 8B 112.51 tokens/s
Llama-3.1-8B ISQ (BF16→Q4K) 8B 133.10 tokens/s
DeepSeek-R1-0528-Qwen3-8B Q4_K_M 8B 139.25 tokens/s
GLM-4-9B-0414 Q4_K_M 9B 77.48 tokens/s
QwQ-32B Q4_K_M 32B 46.02 tokens/s
Qwen3-30B-A3B NVFP4 30B (MoE) 197.29 tokens/s (RTX 5090)
Qwen3-30B-A3B NVFP4 30B (MoE) 72.86 tokens/s (V100, Software FP4)
Qwen3.5-27B (Multimodal) Q4_K_M 27B (Dense) 49.33 tokens/s
Qwen3.5-27B/Qwen3.6-27B FP8 27B (Dense) 45 tokens/s (Hopper)
Qwen3.6-35B-A3B (Multimodal) FP8 35B (MoE) 110 tokens/s (Hopper)
GLM4.7 Flash NVFP4 30B (MoE) 79 tokens/s (Hopper, Software FP4)
Gemma4-31B ISQ (BF16→Q4K) 31B (Dense) 47 tokens/s (Hopper)
Gemma4-26B-A4B NVFP4 26B (MoE) 137.23 tokens/s (RTX 5090)
MiniMax-M2.5 NVFP4 229B (MoE) 64.50 tokens/s (Hopper, Software FP4, TP=2)
Apple Silicon (M4)
Model Batch Size Output Tokens Time (s) Throughput (tokens/s)
Qwen3-0.6B (BF16) 128 63488 83.13s 763.73
Qwen3-0.6B (BF16) 32 15872 23.53s 674.43
Qwen3-0.6B (BF16) 1 456 9.23s 49.42
Qwen3-4B (Q4_K_M) 1 1683 52.62s 31.98
Qwen3-8B (Q2_K) 1 1300 80.88s 16.07
Qwen3.5-4B (Q3_K_M) 1 1592 69.04s 23.06
Qwen3.5-2B (NVFP4) 1 1883 60.76s 30.99
Qwen3.5-2B (NVFP4) 2 3942 81.96s 48.10

Full benchmarks →


🧠 Supported Models

  • ✅ LLaMa (LLaMa2, LLaMa3, LLaMa4, IQuest-Coder)
  • ✅ Qwen (Qwen2, Qwen3)
  • ✅ Qwen2/Qwen3 MoE
  • ✅ Qwen3 Next
  • ✅ Qwen3.5/3.6 Dense/MoE (27B, 35B, 122B, 397B, Multimodal model)
  • ✅ Mistral v1, v2
  • ✅ Mistral-3-VL Reasoning (3B, 8B, 14B, Multimodal model)
  • ✅ GLM4 (0414, Not ChatGLM)
  • ✅ GLM4 MoE (4.6/4.7)
  • ✅ GLM4.7 Flash
  • ✅ DeepSeek V3/R1/V3.2
  • ✅ Phi3 / Phi4 (Phi-3, Phi-4, Phi-4-mini, etc.)
  • ✅ Gemma3/Gemma4 (Multimodal model)
  • ✅ Qwen3-VL (Dense, Multimodal model)
  • ✅ MiroThinker-v1.5 (30B, 235B)

Formats: Safetensors (BF16, FP8-blockwise, GPTQ, AWQ, MXFP4, NVFP4) | GGUF (all quant types) | ISQ (on-the-fly quantization)


TurboQuant KV Cache — Run 30B+ Models on Consumer GPUs

TurboQuant compresses KV cache to 2–4 bits via Walsh-Hadamard transform rotation + per-head absmax quantization. Max context tokens with turbo4:

Model KV budget BF16 turbo4 Gain
Qwen3.6-35B-A3B (NVFP4) 7 GB (24 GB GPU) 700k 2.7M 3.9×
15 GB (32 GB GPU) 1.5M 5.8M 3.9×
Qwen3.6-27B (FP8) 7 GB 112k 434k 3.9×
15 GB 240k 930k 3.9×
Qwen3-30B-A3B (Q4_K_M) 7 GB 74k 281k 3.8×
15 GB 160k 602k 3.8×
Gemma4-26B-A4B (NVFP4) 7 GB 32k 125k 3.9×
15 GB 70k 271k 3.9×

Hybrid models (Qwen3.6) have fewer full attention layers, making TurboQuant especially effective. MLA models (DeepSeek, GLM4.7 Flash) use fp8 instead. The KV budget in the table is the theoretical maximum; actual usage can only utilize up to 90% of the KV budget (--kv-fraction 0.9), leaving room for runtime and batching buffers.

# 35B MoE on single 24/32 GB GPU
xinfer --m unsloth/Qwen3.6-35B-A3B-NVFP4 --kvcache-dtype turbo4

# Production precision
xinfer --m Qwen/Qwen3.6-35B-A3B-FP8 --kvcache-dtype fp8

# 27B Dense + turbo4
xinfer --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo4

# 30B MoE GGUF + turbo4
xinfer --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF \
  --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --kvcache-dtype turbo4

# Metal/MacOS
xinfer --m unsloth/Qwen3.5-4B-GGUF --f Qwen3.5-4B-Q4_K_M.gguf

📘 Usage

For Python installaion, running model with python3 -m xinfer.server

For Docker builds, refer to Run xInfer in Docker →

Running Models

Tip: By default, xInfer starts an OpenAI-compatible API server at http://localhost:8000. Add --ui-server to also launch the built-in ChatGPT-style Web UI at http://localhost:8001.

# FP8 model (sm90+ with cutlass) + web UI
xinfer --m Qwen/Qwen3.6-27B-FP8 --ui-server

# Unquantized Safetensors (multi-GPU)
xinfer --d 0,1 --m Qwen/Qwen3-30B-A3B-Instruct-2507 --kvcache-dtype fp8

# ISQ on-the-fly quantization
xinfer --m Qwen/Qwen3.6-35B-A3B --isq q4k

# NVFP4 model
xinfer --m unsloth/Qwen3.6-27B-NVFP4

# MXFP4
xinfer --m olka-fi/Qwen3.5-4B-MXFP4

# GGUF model (4-bit KvCache)
xinfer --m unsloth/Qwen3.5-27B-GGUF --f Qwen3.5-27B-Q4_K_M.gguf --kvcache-dtype turbo4

# FP8 on Metal
xinfer --m Qwen/Qwen3.5-4B-FP8

# Gemma4 26B (NVFP4)
xinfer --m unsloth/gemma-4-26b-a4b-it-NVFP4

# MLA model (GLM4.7 Flash)
xinfer --m GadflyII/GLM-4.7-Flash-NVFP4

# Interactive CLI chat
xinfer --i --m unsloth/Qwen3.5-27B-GGUF --f Qwen3.5-27B-Q4_K_M.gguf
ISQ (on-the-fly quantization) + KV cache compression
# ISQ Q4K + FP8 KV cache
xinfer --m Qwen/Qwen3.6-35B-A3B --isq q4k --kvcache-dtype fp8

# ISQ Q4K + TurboQuant KV cache
xinfer --m Qwen/Qwen3.6-35B-A3B --isq q4k --kvcache-dtype turbo4

# Metal ISQ
xinfer --w /path/Qwen3-4B --isq q6k
GGUF models
# Single GPU — GGUF
xinfer --m unsloth/Qwen3.5-27B-GGUF --f Qwen3.5-27B-Q4_K_M.gguf

# Multi-GPU — GGUF
xinfer --d 0,1 --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf
TurboQuant KV cache (2–4 bit) — see TurboQuant section
# turbo4: 4-bit K+V — 3.7× compression, best tradeoff
xinfer --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo4

# turbo3: 3-bit K + 4-bit V — 4.7× compression
xinfer --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo3

# turbo8: FP8 K + 4-bit V — 2.6× compression, highest quality
xinfer --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo8

# 35B MoE (NVFP4 + turbo4) — fits on single 24 GB GPU
xinfer --m unsloth/Qwen3.6-35B-A3B-NVFP4 --kvcache-dtype turbo4

# 30B MoE (GGUF Q4_K_M + turbo4) — consumer GPU
xinfer --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF \
  --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --kvcache-dtype turbo4
Multimodal models (Qwen3-VL, Gemma4, Mistral3-VL)
# Upload images via built-in Chat UI or send image_url in API requests

# Qwen3.6 35B MoE (FP8, multimodal)
xinfer --m Qwen/Qwen3.6-35B-A3B-FP8 --ui-server

# Qwen3-VL 8B (GGUF)
xinfer --m unsloth/Qwen3-VL-8B-Instruct-GGUF --f Qwen3-VL-8B-Instruct-Q8_0.gguf --ui-server

# Gemma4 26B MoE (NVFP4, multimodal)
xinfer --m unsloth/gemma-4-26b-a4b-it-NVFP4 --ui-server

# Mistral-3 VL 3B (BF16, multimodal)
xinfer --m mistralai/Ministral-3-3B --ui-server

📘 Build from source code

Option 1 — Cargo

# Prerequisites: Rust compiler, CUDA Toolkit (optional) or Metal Xcode command line tool
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo apt-get install -y git build-essential libssl-dev pkg-config

export XINFER_REPO="https://github.com/guoqingbao/xinfer"
# MacOS/Metal: replace features to `metal`
# SM_70/SM_75 (e.g., V100): remove `flashinfer` and `cutlass` features
cargo install --git $XINFER_REPO xinfer --features cuda,nccl,flashinfer,cutlass

Option 2 — Docker

# Turing/V100 (sm_70/sm_75): remove `flashinfer` and `cutlass` features
./build_docker.sh "cuda,nccl,flashinfer,cutlass"

See Docker guide →

Build Python wheel from source
pip install maturin maturin[patchelf]

# FlashInfer backend (SM80+)
./build.sh --release --features cuda,nccl,flashinfer,cutlass,python

# Flash Attention backend
./build.sh --release --features cuda,nccl,flashattn,cutlass,python

# macOS Metal
maturin build --release --features metal,python

# Install
pip install target/wheels/xinfer*.whl --force-reinstall

See more Python examples →


🔀 Prefill-Decode Disaggregation

Split prefill (prompt processing) and decode (token generation) across GPUs or machines. Eliminates decode stalls during long-context prefilling. PD Server and PD Client must use same KvCache type (--kvcache-dtype). API request(s) must send to PD Client and the PD Server only process internal prefill requests sent from PD Client.

Mode Config Use Case
Local IPC (default, no flag) Same machine, CUDA
File IPC --pd-url file:///path Docker containers, shared volume
Remote TCP --pd-url tcp://host:port Different machines

Local IPC (multirank)

# PD Server (prefill GPU, default port 7000)
xinfer --d 0,1 --m Qwen/Qwen3-30B-A3B-Instruct-2507 --pd-server

# PD Client (decode GPU + API)
xinfer --d 2,3 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --ui-server --port 8000 --pd-client

Multinode (tcp mode)

# Server machine (192.168.1.100)
target/release/xinfer --d 0,1 --m Qwen/... --pd-server --pd-url tcp://0.0.0.0:8100

# Client machine
target/release/xinfer --d 0,1 --w /path/... --pd-client --pd-url tcp://192.168.1.100:8100 --ui-server --port 8000

Metal/macOS requires --pd-url (no LocalIPC support).

Multi-container (file:// mode)
mkdir -p /tmp/pd-sockets

# Server container
docker run --gpus '"device=0,1"' -v /tmp/pd-sockets:/sockets ...
target/release/xinfer --d 0,1 --m Qwen/... --pd-server --pd-url file:///sockets

# Client container
docker run --gpus '"device=2,3"' -v /tmp/pd-sockets:/sockets ...
target/release/xinfer --d 0,1 --w /path/... --pd-client --pd-url file:///sockets --ui-server --port 8000

🔌 MCP Tool Calling

xinfer --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF \
  --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --ui-server --mcp-config ./mcp.json

MCP documentation →


🔌 Structured Outputs

Constraint-based generation via llguidance — Lark grammars, regex, JSON Schema.

Structured outputs documentation →


📚 Documentation

Guide Description
Get Started Build, run, and configure
Docker Container builds and deployment
Performance Full benchmark tables
Prefix Cache Automatic KV cache reuse
Multimodal Vision-language models
Embedding Text embedding API
Tokenizer API Tokenize / detokenize endpoints
Tool Parsing Tool call detection and parsing
MCP Integration Model Context Protocol
Guided Decoding Structured outputs
Rust Crate Use as a library
Add a Model Port a new architecture (AI-assisted)
Test a Model Validate model quality (AI-assisted)

Using Agents under xInfer backend: xbot · OpenCode · Kilo Code · Claude Code · Goose


⚙️ CLI Reference

Flag Description
--m HuggingFace model ID (auto-download)
--w Local Safetensors model path
--f GGUF file path (or filename when --m is given)
--d Device IDs (e.g. --d 0,1)
--ui-server API server + built-in ChatGPT-style web UI
--server API server only (no web UI)
--i Interactive CLI chat
--isq On-the-fly quantization: q2k, q3k, q4k, q5k, q6k, q8_0
--kvcache-dtype KV cache quantization: fp8, turbo8, turbo4, turbo3
--max-num-seqs Max concurrent requests (default: 32, macOS: 8)
--max-tokens Max tokens per response (default: 16384)
--kv-fraction GPU memory fraction for KV cache
--cpu-mem-fold CPU swap memory ratio (default: 0.2)
--pd-server Run as PD prefill server
--pd-client Run as PD decode client
--pd-url PD connection URL (tcp://, http://, file://)
--disable-prefix-cache Disable prefix caching
--prefix-cache-max-tokens Cap prefix cache size
--prefill-chunk-size Cap prefill chunk size (default: CUDA 8K, Metal: 4k)
--disable-cuda-graph Disable CUDA graph capture
--yarn-scaling-factor YARN RoPE context extension factor
--temperature Sampling temperature (0–1)
--top-k / --top-p Top-k / nucleus sampling
--presence-penalty Penalize repeated tokens (−2 to 2)
--frequency-penalty Penalize frequent tokens (−2 to 2)
--mcp-config MCP servers JSON config
--mcp-command / --mcp-args Single MCP server command + args

📽️ Demo

Qwen3-32B-A3B-Rust-Server-Mode-2.mp4


🛠️ Roadmap

  • Batched inference (Metal)
  • GGUF format support
  • FlashAttention (CUDA)
  • CUDA Graph
  • OpenAI-compatible API (streaming support)
  • Continuous batching
  • Multi-gpu inference (Safetensors, GPTQ, AWQ, GGUF)
  • Speedup prompt processing on Metal/macOS
  • Chunked Prefill
  • Prefix cache (available on CUDA when prefix-cache enabled)
  • Model loading from hugginface hub
  • Model loading from ModelScope (China)
  • Prefix cache for Metal/macOS
  • FP8 KV Cache (CUDA, all backends including FlashInfer on SM80+)
  • FP8 KV Cache (Metal)
  • FP8 KV Cache (with FlashInfer, SM80+)
  • TurboQuant KV Cache (2-4 bit compression with WHT rotation)
  • FP8 Models (CUDA: MoE, Dense; Metal: Dense)
  • Additional model support (Kimi K2, GLM 5.1 etc.)
  • CPU KV Cache Offloading
  • Prefill-decode Disaggregation (CUDA)
  • Prefill-decode Disaggregation (Metal)
  • Built-in ChatGPT-like Web Server
  • Embedding API
  • Tokenize/Detokenize API
  • MCP Integration & Tool Calling
  • Prefix Caching
  • Claude/Anthropic-compatible API Server
  • Support CUDA 13
  • Support FlashInfer backend
  • Support DeepGEMM backend (Hopper)
  • MXFP4/NVFP4 Model Support
  • Support Turboquant (4-bit, 3-bit) KvCache
  • TentorRT-LLM

📚 References

Star History

Star History Chart

Like this project? Give it a ⭐ and contribute!

About

Blazing-fast LLM inference in pure Rust. No PyTorch and Python runtime.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors