ARLE
Pure-Rust runtime for serving, local agents, On-Policy Distillation, and evaluation. arle serve is the OpenAI-compatible serving path; arle is the unified front door.

Quick Start · HTTP API · Support Matrix · Onboarding · Architecture · Roadmap · Changelog

English · 简体中文


Quick Start

# Apple Silicon — Homebrew
brew install cklxx/tap/arle

# Apple Silicon or Linux x86_64 — one-line installer
curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | sh

# Linux + NVIDIA — Docker, no compile
docker run --rm --gpus all -p 8000:8000 -v /path/to/Qwen3.5-4B:/model:ro \
  ghcr.io/cklxx/arle:latest serve --backend cuda --model-path /model

# From source (any backend)
cargo build --release --features cuda --bin arle     # Linux + NVIDIA
cargo build --release --no-default-features --features metal,no-cuda,cli --bin arle  # Apple Silicon

Full install matrix + uninstall: docs/install.md.

Serve:

arle serve --backend cuda  --model-path /path/to/Qwen3.5-4B --port 8000
arle serve --backend metal --model-path mlx-community/Qwen3.5-0.8B-MLX-4bit --port 8000

Talk to it (OpenAI-compatible):

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
print(client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[{"role": "user", "content": "Hello from ARLE"}],
).choices[0].message.content)
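
Because the server speaks the OpenAI HTTP contract, the same call can be made without the SDK. The following is a minimal standard-library sketch, assuming the usual `/chat/completions` route under the `base_url` shown above and the standard OpenAI response shape; `build_chat_request` and `send` are hypothetical helper names, not part of ARLE.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text):
    """Build (but do not send) an OpenAI-style chat-completions POST."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the first choice's message text.

    Assumes the response follows the OpenAI chat-completions schema.
    """
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running locally, `send(build_chat_request("http://localhost:8000/v1", "qwen3.5-4b", "Hello from ARLE"))` should return the same text as the SDK call above.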

Local agent / self-check:

arle                              # interactive REPL with python/shell tools
arle run --prompt "Summarize this repo" --model-path /path/to/Qwen3.5-4B
arle --doctor --json              # CI-friendly self-check

More copy-paste: examples/.


Status at a glance

| Backend | Platform | Status | Headline |
|---|---|---|---|
| CUDA | Linux + NVIDIA | Stable | 197 tok/s on L4 (Qwen3.5-4B BF16, c=16) |
| Metal | Apple Silicon | Beta | 85.6 tok/s on M4 Pro (Qwen3.6 35B-A3B 4-bit) |
| Metal DFlash | Apple Silicon | Beta | Bit-identical spec decode for Qwen3.5 |
| OPD train (CUDA) | Linux + NVIDIA | Beta | 2.49–2.91× faster than HF TRL GKDTrainer; LoRA fits 4 GB cards |
| CPU | Portable | Dev-only | Smoke tests only |

Models: Qwen3.5 family (0.8B / 4B / 30B-A3B / 35B) on CUDA + Metal. DeepSeek-V4-Flash is in active multi-GPU bring-up (TP=8 / EP=8 FP8 MoE on 8×H20): official DSA + DeepGEMM prefill is default-on at ~23 ms, and B=1 decode is down to 15 ms/token via MTP batched verify. Qwen3.6 is #2 (ROADMAP).

Full numbers and tier policy: support-matrix · stability-policy.


Why ARLE

Agent and RL workloads waste compute re-processing the same prompt + history + tool output every turn. ARLE fixes this once and shares the fix across serving and training:

  • KV stays hot across turns. Prior-turn KV is kept on GPU; it spills to host, disk, or cluster only under memory pressure.
  • Shared prefixes are cheap. Pages are reused across requests with the same prefix — no duplicate compute, no duplicate memory.
  • In-memory KV is bounded. Metal auto-sizes the live prefix snapshot tier from available memory, live KV shape, and whether weights are wired; --kv-memory-max-bytes 0 disables it.
  • Local disk KV is bounded. Metal keeps SSD prefix snapshots under ~/.cache/arle/metal_kv with a 20 GiB budget and LRU watermark eviction; use --no-kv-disk to disable or --kv-disk-max-bytes to override.
  • Disk KV is segment-backed. Metal commits small manifests plus sequential segment files; 64 KiB CRC32C-checked chunks dedupe prefix extensions without creating thousands of tiny files. Small session tails stay in memory and hit SSD only at a 64-token checkpoint cadence to avoid recurrent-state write amplification.
  • One runtime, three surfaces. Serving, the local agent, and OPD training all run on the same Rust + model code. The OPD teacher is the production server.
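
The shared-prefix bullet above can be illustrated with a toy page cache. This is a sketch under stated assumptions, not ARLE's implementation: the architecture diagram names a prefix radix plus paged KV, whereas this example swaps in hash-chained page keys for brevity. `PAGE_TOKENS`, `PrefixPageCache`, and `admit` are hypothetical names, and the stored "page" is a placeholder rather than real KV tensors.

```python
import hashlib

PAGE_TOKENS = 4  # hypothetical page size, kept tiny for readability


class PrefixPageCache:
    """Toy prefix cache: a page is identified by its tokens plus its whole prefix."""

    def __init__(self):
        self.pages = {}  # chained key -> placeholder for one KV page

    @staticmethod
    def _key(parent_key, tokens):
        # Chaining the parent key in makes identical tokens at a different
        # prefix hash to a different page.
        h = hashlib.sha256()
        h.update(parent_key)
        h.update(",".join(map(str, tokens)).encode())
        return h.digest()

    def admit(self, token_ids):
        """Return (reused_pages, new_pages) for the full pages of one request."""
        reused = new = 0
        parent = b""
        full = len(token_ids) - len(token_ids) % PAGE_TOKENS
        for start in range(0, full, PAGE_TOKENS):
            page_tokens = token_ids[start:start + PAGE_TOKENS]
            key = self._key(parent, page_tokens)
            if key in self.pages:
                reused += 1          # prefix hit: no recompute, no second copy
            else:
                self.pages[key] = tuple(page_tokens)  # stands in for computed KV
                new += 1
            parent = key
        return reused, new
```

Two requests that share an eight-token system prompt reuse its two pages; only the pages past the divergence point are computed and stored again.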

Quantized KV is available on CUDA (--kv-cache-dtype int8|fp8|tq4). Metal uses the model-native KV dtype today; MLX-side quantized KV is a separate follow-up.
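
To make the memory trade-off of a quantized KV cache concrete, here is a generic symmetric int8 quantizer in plain Python. It is an illustrative sketch only: the README does not describe the actual `int8`, `fp8`, or `tq4` layouts, so the per-vector scale and the function names `quantize_int8` / `dequantize_int8` are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-vector int8 quantization. Returns (codes, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero vector.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from int8 codes and their scale."""
    return [c * scale for c in codes]
```

Each element then occupies one byte instead of the two bytes of BF16, plus one scale per vector, at the cost of a rounding error of at most half a quantization step per element.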

Benchmark data: TTFT/TPOT steady sweep · Metal memory accounting note · Metal KV memory budget · Metal segmented SSD KV · Metal SSD KV write-amplification cadence.

Figure: ARLE Metal vs mlx-lm TTFT and TPOT sweep.

flowchart TB
  subgraph Surface["Entry surfaces"]
    Serve["arle serve<br/>OpenAI HTTP"]
    Agent["arle<br/>local agent / REPL"]
    Train["arle train opd<br/>teacher rollouts + distillation"]
  end

  subgraph Runtime["Shared Rust runtime"]
    Router["OpenAI router<br/>continuous scheduler"]
    Model["Qwen model authority<br/>weights / tokenizer / decode"]
    KV["KV memory plane<br/>prefix radix + paged KV + residency tiers"]
  end

  subgraph Backend["Execution backends"]
    Metal["Metal<br/>MLX bridge + packed varlen decode"]
    CUDA["CUDA<br/>TileLang AOT + custom kernels"]
  end

  Serve --> Router
  Agent --> Router
  Train --> Router
  Router --> Model
  Model <--> KV
  Model --> Metal
  Model --> CUDA
  KV --> Metal
  KV --> CUDA
  Model -. teacher samples .-> Train

Deep dive: onboarding (30 min) · architecture · codebase-map.


Entry surfaces

arle is the single binary:

| Command | What it does |
|---|---|
| arle (no args) | Interactive agent REPL with python and shell tools. |
| arle run --prompt "…" | One-shot agent prompt. --no-tools to disable tools. |
| arle serve --backend … | OpenAI-compatible HTTP server. |
| arle train opd | On-Policy Distillation: teacher on the serving runtime (infer-api), student in train. CUDA path. Usage manual. |
| arle --doctor [--json] | Backend / hardware / model-resolution self-check. |

Operators wanting only serving can run arle serve — the same HTTP contract, without touching the agent / train surfaces.


Latest Updates

2026-06-08 — DeepSeek-V4-Flash B=1 latency: prefill 23 ms, decode 27 → 15 ms (8×H20, TP=8 / EP=8, FP8 MoE). Three changes account for the gain. The official DSA indexer flattened decode latency across context lengths (legacy csa_select 124 ms → ~26 ms @4k, 4.8×). The MLA / output projections moved from scalar GEMV to tensor-core DeepGEMM (−94% per stage → prefill ~23 ms). MTP depth-1 batched verify amortized the serial critical path for +71% decode tok/s (39.9 → 64.2), with byte-identical output. Eight per-kernel levers (whole-step CUDA graph, mhc_params uint4, M=1 GEMV, comm-overlap, …) all washed out with no net gain, because B=1 decode is GPU-bound on the critical path. 15 ms is therefore the realistic single-request limit; reaching 6 ms needs tree-EAGLE + mega-kernel fusion or batching (M=N). FINAL report.

Figure: DeepSeek-V4-Flash B=1 latency optimization journey (decode context-scaling fix, prefill DeepGEMM projections, and the MTP-amortized decode wall).
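
The idea behind depth-1 draft-and-verify, and why its output can stay byte-identical to plain greedy decoding, can be shown with a toy model. This is a sketch under stated assumptions and not ARLE's MTP kernel path: `target_next` and `draft_next` are hypothetical stand-ins for the full model and a cheap draft head, and the single "batched pass" is simulated by two function calls.

```python
def target_next(context):
    """Toy stand-in for the full model's greedy next token."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Toy draft head: agrees with the target except when len(context) % 3 == 0."""
    guess = target_next(context)
    return guess if len(context) % 3 else (guess + 1) % 50


def greedy_decode(prompt, n_tokens):
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n_tokens):
    """Depth-1 draft + verify. Returns (tokens, target_passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_tokens:
        draft = draft_next(out)
        # One batched target pass scores two positions at once: the token
        # after `out`, and the token after `out + [draft]`.
        passes += 1
        verified = target_next(out)
        bonus = target_next(out + [draft])
        out.append(verified)  # always the target's own token
        if draft == verified and len(out) - len(prompt) < n_tokens:
            out.append(bonus)  # draft accepted: second token comes for free
    return out[len(prompt):], passes
```

Every emitted token is the target model's own greedy choice, so the sequence matches `greedy_decode` exactly; accepted drafts only reduce the number of serial target passes.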

2026-06-02 — Metal Qwen3.6 A/B refreshed. ARLE and mlx-lm are in the same steady TPOT band from 128 to 12k input tokens; the README chart now shows TTFT + steady TPOT only. Wins entry.

2026-06-02 — Metal SSD KV is segment-backed. Prefix snapshots now use CRC32C-checked 64 KiB chunks in sequential segment files; short generated tails stay in memory until the 64-token SSD checkpoint cadence. Segmented KV · WA cadence.
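
A rough sketch of how chunk-level dedupe lets an extended prefix reuse the chunks of a shorter one is shown below. It is an illustration under stated assumptions, not ARLE's on-disk format: `SegmentStore` is a hypothetical in-memory stand-in for a segment file plus manifest, chunks are keyed by SHA-256 of their content, and `zlib.crc32` (plain CRC-32) substitutes for CRC32C, which the Python standard library does not provide.

```python
import hashlib
import zlib

CHUNK_BYTES = 64 * 1024  # 64 KiB chunk size, as stated in the README


class SegmentStore:
    """Append-only chunk store with content dedupe and checksum verification."""

    def __init__(self):
        self.segment = bytearray()  # stands in for one sequential segment file
        self.index = {}             # content hash -> (offset, length, checksum)

    def put_snapshot(self, blob):
        """Append only unseen chunks; return the manifest (list of chunk hashes)."""
        manifest = []
        for start in range(0, len(blob), CHUNK_BYTES):
            chunk = blob[start:start + CHUNK_BYTES]
            key = hashlib.sha256(chunk).digest()
            if key not in self.index:
                self.index[key] = (len(self.segment), len(chunk), zlib.crc32(chunk))
                self.segment += chunk
            manifest.append(key)
        return manifest

    def get_snapshot(self, manifest):
        """Reassemble a snapshot, verifying each chunk's checksum on read."""
        out = bytearray()
        for key in manifest:
            offset, length, checksum = self.index[key]
            chunk = bytes(self.segment[offset:offset + length])
            if zlib.crc32(chunk) != checksum:
                raise IOError("chunk failed checksum verification")
            out += chunk
        return bytes(out)
```

Storing a snapshot and then a longer snapshot that extends it appends only the new tail chunk; both manifests stay small lists of chunk keys.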

Older history: CHANGELOG.md.


Documentation map


License

MIT
