ARLE
Pure-Rust runtime for serving, local agents, On-Policy Distillation, and evaluation. arle serve is the OpenAI-compatible serving path; arle is the unified front door.
Quick Start · HTTP API · Support Matrix · Onboarding · Architecture · Roadmap · Changelog
# Apple Silicon — Homebrew
brew install cklxx/tap/arle
# Apple Silicon or Linux x86_64 — one-line installer
curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | sh
# Linux + NVIDIA — Docker, no compile
docker run --rm --gpus all -p 8000:8000 -v /path/to/Qwen3.5-4B:/model:ro \
ghcr.io/cklxx/arle:latest serve --backend cuda --model-path /model
# From source (any backend)
cargo build --release --features cuda --bin arle # Linux + NVIDIA
cargo build --release --no-default-features --features metal,no-cuda,cli --bin arle # Apple Silicon

Full install matrix + uninstall: docs/install.md.
Serve:
arle serve --backend cuda --model-path /path/to/Qwen3.5-4B --port 8000
arle serve --backend metal --model-path mlx-community/Qwen3.5-0.8B-MLX-4bit --port 8000

Talk to it (OpenAI-compatible):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
print(client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[{"role": "user", "content": "Hello from ARLE"}],
).choices[0].message.content)

Local agent / self-check:
arle # interactive REPL with python/shell tools
arle run --prompt "Summarize this repo" --model-path /path/to/Qwen3.5-4B
arle --doctor --json # CI-friendly self-check

More copy-paste: examples/.
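For environments without the openai package, the same chat request can be sketched with the Python standard library. This assumes the server exposes the usual OpenAI-style `/chat/completions` route under the `/v1` base URL used above and needs no API key; `build_chat_request` and `send` are illustrative helper names, not part of ARLE.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same base URL as the client example above


def build_chat_request(prompt, model="qwen3.5-4b", base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",  # assumed OpenAI-style path
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and return the first choice's text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


# Usage, with `arle serve` already listening on port 8000:
#   print(send(build_chat_request("Hello from ARLE")))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.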
| Backend | Platform | Status | Headline |
|---|---|---|---|
| CUDA | Linux + NVIDIA | Stable | 197 tok/s on L4 (Qwen3.5-4B BF16, c=16) |
| Metal | Apple Silicon | Beta | 85.6 tok/s on M4 Pro (Qwen3.6 35B-A3B 4-bit) |
| Metal DFlash | Apple Silicon | Beta | Bit-identical spec decode for Qwen3.5 |
| OPD train (CUDA) | Linux + NVIDIA | Beta | 2.49–2.91× faster than HF TRL GKDTrainer; LoRA fits 4 GB cards |
| CPU | Portable | Dev-only | Smoke tests only |
Models: Qwen3.5 family (0.8B / 4B / 30B-A3B / 35B) on CUDA + Metal. DeepSeek-V4-Flash is in active multi-GPU bring-up (TP=8 / EP=8 FP8 MoE on 8×H20: official DSA + DeepGEMM prefill default-on at ~23 ms, B=1 decode down to 15 ms/token via MTP batched verify). Qwen3.6 is #2 on the roadmap (ROADMAP).
Full numbers and tier policy: support-matrix · stability-policy.
Agent and RL workloads waste compute re-processing the same prompt + history + tool output every turn. ARLE fixes this once and shares the fix across serving and training:
- KV stays hot across turns. Prior-turn KV is kept on GPU; spills to host / disk / cluster only when memory pressures it.
- Shared prefixes are cheap. Pages are reused across requests with the same prefix — no duplicate compute, no duplicate memory.
- In-memory KV is bounded. Metal auto-sizes the live prefix snapshot tier from available memory, live KV shape, and whether weights are wired; --kv-memory-max-bytes 0 disables it.
- Local disk KV is bounded. Metal keeps SSD prefix snapshots under ~/.cache/arle/metal_kv with a 20 GiB budget and LRU watermark eviction; use --no-kv-disk to disable or --kv-disk-max-bytes to override.
- Disk KV is segment-backed. Metal commits small manifests plus sequential segment files; 64 KiB CRC32C-checked chunks dedupe prefix extensions without creating thousands of tiny files. Small session tails stay in memory and hit SSD only at a 64-token checkpoint cadence to avoid recurrent-state write amplification.
- One runtime, three surfaces. Serving, the local agent, and OPD training all run on the same Rust + model code. The OPD teacher is the production server.
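The shared-prefix point in the list above can be illustrated with a toy page table. This is a minimal sketch and not ARLE's implementation: the page size, the `PrefixPageCache` class, and the `(parent, tokens)` keying scheme are all assumptions chosen to show why two requests with a common prefix only pay for their differing pages.

```python
from dataclasses import dataclass, field

PAGE_TOKENS = 4  # illustrative page size, not ARLE's real value


@dataclass
class PrefixPageCache:
    """Toy page table: each full page is keyed by (parent page id, its tokens)."""

    pages: dict = field(default_factory=dict)  # (parent_id, tokens) -> page_id

    def admit(self, token_ids):
        """Return (reused, created) page counts for one request's prompt."""
        reused = created = 0
        parent = -1  # sentinel id for the empty prefix
        full = len(token_ids) - len(token_ids) % PAGE_TOKENS  # ignore a partial tail page
        for start in range(0, full, PAGE_TOKENS):
            key = (parent, tuple(token_ids[start:start + PAGE_TOKENS]))
            if key in self.pages:
                reused += 1
            else:
                self.pages[key] = len(self.pages)
                created += 1
            parent = self.pages[key]
        return reused, created


cache = PrefixPageCache()
system = list(range(8))  # shared 8-token prefix = 2 pages
first = cache.admit(system + [100, 101, 102, 103])
second = cache.admit(system + [200, 201, 202, 203])
print(first, second)
```

Chaining each key through its parent page id means a page is only reused when the entire prefix before it matches, which is the property a prefix radix structure provides.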
Quantized KV is available on CUDA (--kv-cache-dtype int8|fp8|tq4). Metal uses
the model-native KV dtype today; MLX-side quantized KV is a separate follow-up.
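To give a feel for what an int8 KV dtype trades away, here is a generic symmetric absmax quantizer in plain Python. The source does not describe how `--kv-cache-dtype int8` is implemented, so treat this as a textbook scheme under that assumption; the function names and the per-row scale are illustrative only.

```python
def quantize_int8(values):
    """Symmetric absmax quantization of one row to int8 codes plus a float scale."""
    peak = max(abs(v) for v in values)
    scale = peak / 127.0 if peak > 0 else 1.0  # avoid dividing by zero for an all-zero row
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


row = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(row)
restored = dequantize_int8(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(row, restored))
print(codes, worst_error)
```

Each value is stored in one byte instead of two (for BF16), at the cost of a rounding error of at most half a quantization step per element.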
Benchmark data: TTFT/TPOT steady sweep · Metal memory accounting note · Metal KV memory budget · Metal segmented SSD KV · Metal SSD KV write-amplification cadence.
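The bounded disk tier described in the list above (a byte budget with LRU watermark eviction) can be sketched as follows. Only the 20 GiB default comes from this README; the `LruByteBudget` class, the 0.8 low-watermark ratio, and the snapshot sizes are hypothetical values picked for illustration.

```python
from collections import OrderedDict


class LruByteBudget:
    """Toy byte-budgeted store: evict least-recently-used entries down to a low watermark."""

    def __init__(self, max_bytes, low_watermark=0.8):
        self.max_bytes = max_bytes
        self.low_bytes = int(max_bytes * low_watermark)  # hypothetical ratio
        self.entries = OrderedDict()  # key -> size in bytes, oldest first
        self.used = 0

    def touch(self, key):
        """Mark a snapshot as recently used; return False on a miss."""
        if key not in self.entries:
            return False
        self.entries.move_to_end(key)
        return True

    def put(self, key, size):
        """Insert a snapshot, then evict if over budget. Returns the evicted keys."""
        if key in self.entries:
            self.used -= self.entries.pop(key)
        self.entries[key] = size
        self.used += size
        evicted = []
        if self.used > self.max_bytes:
            # Evict oldest first, but never the entry that was just inserted.
            while self.used > self.low_bytes and len(self.entries) > 1:
                old_key, old_size = self.entries.popitem(last=False)
                self.used -= old_size
                evicted.append(old_key)
        return evicted


GIB = 1024 ** 3
store = LruByteBudget(max_bytes=20 * GIB)  # 20 GiB, the default budget quoted above
for name in ("a", "b", "c"):
    store.put(name, 6 * GIB)               # 18 GiB used, still under budget
store.touch("a")                           # "a" becomes the most recent entry
evicted = store.put("d", 6 * GIB)          # 24 GiB exceeds the budget and triggers eviction
print(evicted, store.used // GIB)
```

Evicting down to a low watermark rather than just under the limit avoids re-triggering eviction on every subsequent write.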
flowchart TB
subgraph Surface["Entry surfaces"]
Serve["arle serve<br/>OpenAI HTTP"]
Agent["arle<br/>local agent / REPL"]
Train["arle train opd<br/>teacher rollouts + distillation"]
end
subgraph Runtime["Shared Rust runtime"]
Router["OpenAI router<br/>continuous scheduler"]
Model["Qwen model authority<br/>weights / tokenizer / decode"]
KV["KV memory plane<br/>prefix radix + paged KV + residency tiers"]
end
subgraph Backend["Execution backends"]
Metal["Metal<br/>MLX bridge + packed varlen decode"]
CUDA["CUDA<br/>TileLang AOT + custom kernels"]
end
Serve --> Router
Agent --> Router
Train --> Router
Router --> Model
Model <--> KV
Model --> Metal
Model --> CUDA
KV --> Metal
KV --> CUDA
Model -. teacher samples .-> Train
Deep dive: onboarding (30 min) · architecture · codebase-map.
arle is the single binary:
| Command | What it does |
|---|---|
| arle (no args) | Interactive agent REPL with python and shell tools. |
| arle run --prompt "…" | One-shot agent prompt. --no-tools to disable tools. |
| arle serve --backend … | OpenAI-compatible HTTP server. |
| arle train opd | On-Policy Distillation: teacher on the serving runtime (infer-api), student in train. CUDA path. Usage manual. |
| arle --doctor [--json] | Backend / hardware / model-resolution self-check. |
Operators wanting only serving can run arle serve — the same HTTP contract, without touching the agent / train surfaces.
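Scripts that drive the one-shot command from the table above can build the argument list in one place. The sketch below uses only flags that appear in this README (`--prompt`, `--model-path`, `--no-tools`); the helper names are illustrative, and it assumes `arle run` writes its answer to stdout, which the source does not state.

```python
import shutil
import subprocess


def arle_run_argv(prompt, model_path, tools=True):
    """Build the argv for a one-shot `arle run` call using flags from this README."""
    argv = ["arle", "run", "--prompt", prompt, "--model-path", model_path]
    if not tools:
        argv.append("--no-tools")
    return argv


def run_once(prompt, model_path, tools=True):
    """Run the prompt and return captured stdout; raises if arle is not on PATH."""
    if shutil.which("arle") is None:
        raise FileNotFoundError("arle binary not found on PATH")
    done = subprocess.run(
        arle_run_argv(prompt, model_path, tools),
        capture_output=True,
        text=True,
        check=True,  # surface a non-zero exit code as CalledProcessError
    )
    return done.stdout  # assumption: the answer is printed to stdout


# Usage, with arle installed and a local model directory:
#   print(run_once("Summarize this repo", "/path/to/Qwen3.5-4B"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when the prompt contains spaces or quotes.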
2026-06-08 — DeepSeek-V4-Flash B=1 latency: prefill 23 ms, decode 27 → 15 ms (8×H20, TP=8 / EP=8, FP8 MoE). The official DSA indexer flattened decode across context (legacy csa_select 124 ms → ~26 ms @4k, 4.8×); the MLA / output projections moved from scalar GEMV to tensor-core DeepGEMM (−94% per stage → prefill ~23 ms); and MTP depth-1 batched verify amortized the serial critical path for +71% decode tok/s (39.9 → 64.2), byte-identical. Eight per-kernel levers (whole-step CUDA graph, mhc_params uint4, M=1 GEMV, comm-overlap, …) all washed — B=1 decode is GPU-bound on the critical path, so 15 ms is the sound single-request ceiling; 6 ms needs tree-EAGLE + mega-kernel fusion or batching (M=N). FINAL report.
2026-06-02 — Metal Qwen3.6 A/B refreshed. ARLE and mlx-lm are in the same steady TPOT band from 128 to 12k input tokens; the README chart now shows TTFT + steady TPOT only. Wins entry.
2026-06-02 — Metal SSD KV is segment-backed. Prefix snapshots now use CRC32C-checked 64 KiB chunks in sequential segment files; short generated tails stay in memory until the 64-token SSD checkpoint cadence. Segmented KV · WA cadence.
Older history: CHANGELOG.md.
- docs/http-api.md · HTTP contract & streaming
- docs/support-matrix.md · backend / model / quant tiers
- docs/architecture.md · package boundaries
- docs/codebase-map.md · workspace layout & execution paths
- docs/environment.md · env vars & runtime knobs
- docs/troubleshooting.md · common build/runtime errors
- docs/comparison.md · vs vLLM / SGLang / mistral.rs / llama.cpp
- CONTRIBUTING.md · contributor setup & validation
- examples/ · copy-paste smoke paths
- docs/index.md · maintainer-facing PARA index

