
ARLE
Pure-Rust runtime for serving, local agents, On-Policy Distillation, and evaluation. arle serve is the OpenAI-compatible serving path; arle is the unified front door.

Quick Start · HTTP API · Support Matrix · Onboarding · Architecture · Roadmap · Changelog


Quick Start

```sh
# Apple Silicon — Homebrew
brew install cklxx/tap/arle

# Apple Silicon or Linux x86_64 — one-line installer
curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | sh

# Linux + NVIDIA — Docker, no compile
docker run --rm --gpus all -p 8000:8000 -v /path/to/Qwen3.5-4B:/model:ro \
  ghcr.io/cklxx/arle:latest serve --backend cuda --model-path /model

# From source (any backend)
cargo build --release --features cuda --bin arle     # Linux + NVIDIA
cargo build --release --no-default-features --features metal,no-cuda,cli --bin arle  # Apple Silicon
```

Full install matrix + uninstall: docs/install.md.

Serve:

```sh
arle serve --backend cuda  --model-path /path/to/Qwen3.5-4B --port 8000
arle serve --backend metal --model-path mlx-community/Qwen3.5-0.8B-MLX-4bit --port 8000
```
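Scripts that launch `arle serve` in the background usually need to wait before sending the first request. The sketch below is one way to do that with a plain TCP probe. `wait_for_port` is a hypothetical helper, not part of ARLE, and it assumes only that the server listens on the `--port` value shown above. An open port does not guarantee the model has finished loading, so treat this as a first gate rather than a full readiness check.

```python
import socket
import time


def wait_for_port(host="localhost", port=8000, timeout_s=60.0, interval_s=0.5):
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses.

    Hypothetical helper: it checks that something is listening, nothing more.
    Returns True on the first successful connect, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError (e.g. connection refused)
            # while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

Calling `wait_for_port(port=8000)` right after starting the server returns `True` once the listener is up, or `False` after the timeout.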

Talk to it (OpenAI-compatible):

```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
print(client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[{"role": "user", "content": "Hello from ARLE"}],
).choices[0].message.content)
```
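For environments without the `openai` package, the same call can be sketched with the standard library alone. This assumes the server exposes the standard OpenAI route `/v1/chat/completions`, which is the path the client above posts to, and returns the usual `choices[0].message.content` shape. `build_chat_request` and `chat` are illustrative names, not ARLE APIs.

```python
import json
import urllib.request


def build_chat_request(prompt, model="qwen3.5-4b",
                       base_url="http://localhost:8000/v1"):
    """Build (but do not send) a POST to the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the first choice's text.

    Assumes an OpenAI-style response body; requires a running server.
    """
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running locally, `print(chat("Hello from ARLE"))` should behave like the SDK example above.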

Local agent / self-check:

```sh
arle                              # interactive REPL with python/shell tools
arle run --prompt "Summarize this repo" --model-path /path/to/Qwen3.5-4B
arle --doctor --json              # CI-friendly self-check
```
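In CI, the JSON self-check can be wrapped in a small script. The sketch below makes no assumption about the report's schema, which this README does not specify: it only captures the exit code and tries to parse stdout as JSON. `run_self_check` is a hypothetical helper, and treating a non-zero exit code as failure is an assumption to verify against the real CLI.

```python
import json
import subprocess


def run_self_check(cmd=("arle", "--doctor", "--json")):
    """Run the self-check command and return (exit_code, report).

    report is the parsed JSON from stdout, or None if stdout is not JSON.
    The report's fields are not assumed here; inspect them before relying
    on any particular key.
    """
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    try:
        report = json.loads(proc.stdout)
    except json.JSONDecodeError:
        report = None
    return proc.returncode, report
```

A CI step could fail the build when the exit code is non-zero or when `report` is `None`, and archive the parsed report otherwise.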

More copy-paste: examples/.

Status at a glance

| Backend | Platform | Status | Headline |
| --- | --- | --- | --- |
| CUDA | Linux + NVIDIA | Stable | 197 tok/s on L4 (Qwen3.5-4B BF16, c=16) |
| Metal | Apple Silicon | Beta | 85.6 tok/s on M4 Pro (Qwen3.6 35B-A3B 4-bit) |
| Metal DFlash | Apple Silicon | Beta | Bit-identical spec decode for Qwen3.5 |
| OPD train (CUDA) | Linux + NVIDIA | Beta | 2.49–2.91× faster than HF TRL `GKDTrainer`; LoRA fits 4 GB cards |
| CPU | Portable | Dev-only | Smoke tests only |

Models: Qwen3.5 family (0.8B / 4B / 30B-A3B / 35B) on CUDA + Metal. DeepSeek-V4-Flash is in active multi-GPU bring-up (TP=8 / EP=8 FP8 MoE on 8×H20), with the official DSA + DeepGEMM prefill default-on at ~23 ms and B=1 decode down to 15 ms/token via MTP batched verify. Qwen3.6 is #2 on the ROADMAP.

Full numbers and tier policy: support-matrix · stability-policy.

Why ARLE

Agent and RL workloads waste compute re-processing the same prompt + history + tool output every turn. ARLE fixes this once and shares the fix across serving and training:

- KV stays hot across turns. Prior-turn KV is kept on GPU and spills to host / disk / cluster only under memory pressure.
- Shared prefixes are cheap. Pages are reused across requests with the same prefix, so there is no duplicate compute and no duplicate memory.
- In-memory KV is bounded. Metal auto-sizes the live prefix snapshot tier from available memory, live KV shape, and whether weights are wired; `--kv-memory-max-bytes 0` disables it.
- Local disk KV is bounded. Metal keeps SSD prefix snapshots under `~/.cache/arle/metal_kv` with a 20 GiB budget and LRU watermark eviction; use `--no-kv-disk` to disable or `--kv-disk-max-bytes` to override.
- Disk KV is segment-backed. Metal commits small manifests plus sequential segment files; 64 KiB CRC32C-checked chunks dedupe prefix extensions without creating thousands of tiny files. Small session tails stay in memory and hit SSD only at a 64-token checkpoint cadence to avoid recurrent-state write amplification.
- One runtime, three surfaces. Serving, the local agent, and OPD training all run on the same Rust + model code. The OPD teacher is the production server.
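To make the shared-prefix idea concrete, here is a toy model of page-level prefix reuse. It is a sketch under stated assumptions, not ARLE's Rust implementation: the page size `PAGE_TOKENS`, the class `PrefixPageCache`, and the choice to key each page by the full token prefix that ends at it are all hypothetical. The real runtime uses a prefix radix plus paged KV, as the architecture diagram indicates.

```python
PAGE_TOKENS = 4  # hypothetical page size, chosen small for illustration


class PrefixPageCache:
    """Toy page cache: one entry per full page, keyed by the whole prefix."""

    def __init__(self):
        # Maps a token prefix (as a tuple) to a placeholder for that page's KV.
        self.pages = {}

    def _page_ends(self, tokens):
        # End offsets of every full page in the sequence.
        return range(PAGE_TOKENS, len(tokens) + 1, PAGE_TOKENS)

    def match(self, tokens):
        """Return how many leading tokens are already covered by cached pages."""
        hit = 0
        for end in self._page_ends(tokens):
            if tuple(tokens[:end]) not in self.pages:
                break
            hit = end
        return hit

    def insert(self, tokens):
        """Cache every full page of tokens; return the number of new pages."""
        created = 0
        for end in self._page_ends(tokens):
            key = tuple(tokens[:end])
            if key not in self.pages:
                self.pages[key] = f"kv[{end - PAGE_TOKENS}:{end}]"
                created += 1
        return created
```

In this model, a second request that shares the first eight tokens of an earlier one gets `match(...) == 8`, and inserting it only creates pages for the tokens past the shared prefix.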

Quantized KV is available on CUDA (--kv-cache-dtype int8|fp8|tq4). Metal uses the model-native KV dtype today; MLX-side quantized KV is a separate follow-up.

Benchmark data: TTFT/TPOT steady sweep · Metal memory accounting note · Metal KV memory budget · Metal segmented SSD KV · Metal SSD KV write-amplification cadence.
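The bounded disk tier can be pictured as an LRU map with a byte budget. The sketch below is illustrative only: `LruByteBudget` is a hypothetical name, and the watermark rule (once the budget is exceeded, evict least-recently-used entries until usage falls to a lower watermark) is an assumption about what "LRU watermark eviction" means, not a description of ARLE's code.

```python
from collections import OrderedDict


class LruByteBudget:
    """Toy LRU index with a byte budget and a low watermark (assumed policy)."""

    def __init__(self, max_bytes, low_watermark_pct=80):
        self.max_bytes = max_bytes
        # Eviction target once the budget is exceeded.
        self.low_bytes = max_bytes * low_watermark_pct // 100
        self.entries = OrderedDict()  # key -> size in bytes, oldest first
        self.used = 0

    def touch(self, key):
        """Mark key as recently used; return True if it was present."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        return False

    def put(self, key, size):
        """Record an entry and return the keys evicted to stay within budget."""
        if key in self.entries:
            self.used -= self.entries.pop(key)
        self.entries[key] = size
        self.used += size
        evicted = []
        if self.used > self.max_bytes:
            # Evict oldest entries first, but never the one just written.
            while self.used > self.low_bytes and len(self.entries) > 1:
                old_key, old_size = self.entries.popitem(last=False)
                self.used -= old_size
                evicted.append(old_key)
        return evicted
```

Evicting down to a lower watermark rather than just below the budget means one large write does not trigger an eviction on every subsequent write.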

```mermaid
flowchart TB
  subgraph Surface["Entry surfaces"]
    Serve["arle serve<br/>OpenAI HTTP"]
    Agent["arle<br/>local agent / REPL"]
    Train["arle train opd<br/>teacher rollouts + distillation"]
  end

  subgraph Runtime["Shared Rust runtime"]
    Router["OpenAI router<br/>continuous scheduler"]
    Model["Qwen model authority<br/>weights / tokenizer / decode"]
    KV["KV memory plane<br/>prefix radix + paged KV + residency tiers"]
  end

  subgraph Backend["Execution backends"]
    Metal["Metal<br/>MLX bridge + packed varlen decode"]
    CUDA["CUDA<br/>TileLang AOT + custom kernels"]
  end

  Serve --> Router
  Agent --> Router
  Train --> Router
  Router --> Model
  Model <--> KV
  Model --> Metal
  Model --> CUDA
  KV --> Metal
  KV --> CUDA
  Model -. teacher samples .-> Train
```

Deep dive: onboarding (30 min) · architecture · codebase-map.

Entry surfaces

arle is the single binary:

| Command | What it does |
| --- | --- |
| `arle` (no args) | Interactive agent REPL with `python` and `shell` tools. |
| `arle run --prompt "…"` | One-shot agent prompt. `--no-tools` to disable tools. |
| `arle serve --backend …` | OpenAI-compatible HTTP server. |
| `arle train opd` | On-Policy Distillation — teacher on the serving runtime (`infer-api`), student in `train`. CUDA path. Usage manual. |
| `arle --doctor [--json]` | Backend / hardware / model-resolution self-check. |

Operators wanting only serving can run arle serve — the same HTTP contract, without touching the agent / train surfaces.

Latest Updates

2026-06-08 — DeepSeek-V4-Flash B=1 latency: prefill 23 ms, decode 27 → 15 ms (8×H20, TP=8 / EP=8, FP8 MoE). Three changes account for the gain. The official DSA indexer flattened decode latency across context lengths (legacy csa_select 124 ms → ~26 ms @4k, 4.8×). The MLA / output projections moved from scalar GEMV to tensor-core DeepGEMM (−94% per stage, bringing prefill to ~23 ms). MTP depth-1 batched verify amortized the serial critical path for +71% decode tok/s (39.9 → 64.2) with byte-identical output. Eight per-kernel levers (whole-step CUDA graph, mhc_params uint4, M=1 GEMV, comm-overlap, …) all washed out with no net gain: B=1 decode is GPU-bound on the critical path, so 15 ms is the sound single-request ceiling, and reaching 6 ms needs tree-EAGLE + mega-kernel fusion or batching (M=N). FINAL report.

2026-06-02 — Metal Qwen3.6 A/B refreshed. ARLE and mlx-lm are in the same steady TPOT band from 128 to 12k input tokens; the README chart now shows TTFT + steady TPOT only. Wins entry.

2026-06-02 — Metal SSD KV is segment-backed. Prefix snapshots now use CRC32C-checked 64 KiB chunks in sequential segment files; short generated tails stay in memory until the 64-token SSD checkpoint cadence. Segmented KV · WA cadence.
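As a rough picture of how checksummed chunks in a sequential segment can dedupe prefix extensions, consider the toy store below. Several details are assumptions made for illustration: chunks are content-addressed by SHA-256, the "segment file" is an in-memory `bytearray`, and the checksum is `zlib.crc32` (plain CRC-32), swapped in because the Python standard library has no CRC32C. `ChunkStore` is a hypothetical name and does not mirror ARLE's on-disk format.

```python
import hashlib
import zlib

CHUNK_BYTES = 64 * 1024  # 64 KiB chunks, as in the entry above


class ChunkStore:
    """Toy append-only segment with a chunk index and per-chunk checksums."""

    def __init__(self):
        self.segment = bytearray()  # stands in for a sequential segment file
        self.index = {}             # digest -> (offset, length, crc32)

    def put_snapshot(self, blob):
        """Append unseen chunks and return the manifest (ordered digests)."""
        manifest = []
        for start in range(0, len(blob), CHUNK_BYTES):
            chunk = blob[start:start + CHUNK_BYTES]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.index:
                # New content: append once and remember where it lives.
                self.index[digest] = (len(self.segment), len(chunk),
                                      zlib.crc32(chunk))
                self.segment += chunk
            manifest.append(digest)
        return manifest

    def read_snapshot(self, manifest):
        """Reassemble a snapshot, verifying each chunk's checksum."""
        out = bytearray()
        for digest in manifest:
            offset, length, crc = self.index[digest]
            chunk = bytes(self.segment[offset:offset + length])
            if zlib.crc32(chunk) != crc:
                raise IOError("chunk checksum mismatch")
            out += chunk
        return bytes(out)
```

Storing a snapshot and then a longer snapshot that extends it appends only the new tail chunk, while both manifests remain independently readable.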

Older history: CHANGELOG.md.

Documentation map

- docs/http-api.md · HTTP contract & streaming
- docs/support-matrix.md · backend / model / quant tiers
- docs/architecture.md · package boundaries
- docs/codebase-map.md · workspace layout & execution paths
- docs/environment.md · env vars & runtime knobs
- docs/troubleshooting.md · common build/runtime errors
- docs/comparison.md · vs vLLM / SGLang / mistral.rs / llama.cpp
- CONTRIBUTING.md · contributor setup & validation
- examples/ · copy-paste smoke paths
- docs/index.md · maintainer-facing PARA index

License

MIT
