Rapid-MLX: faster Apple Silicon backend for Goose (2-4x vs Ollama) #8436
raullenchai started this conversation in Show and tell
Summary
Rapid-MLX is an OpenAI-compatible inference server built on Apple's MLX framework. It works with Goose via the Ollama provider: just point `OLLAMA_HOST` at Rapid-MLX.

Setup
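The setup amounts to redirecting Goose's Ollama provider at the Rapid-MLX server. A minimal sketch, where the host and port are placeholders (the post does not document which address Rapid-MLX listens on):

```shell
# Point Goose's Ollama provider at Rapid-MLX instead of Ollama.
# The address below is a placeholder; use wherever your Rapid-MLX server listens.
export OLLAMA_HOST="http://localhost:11434"
echo "Ollama-compatible endpoint: $OLLAMA_HOST"
```

With the variable set, Goose's existing Ollama provider needs no other changes.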
Why faster?
Rapid-MLX uses Apple's MLX framework with native Metal compute kernels, purpose-built for unified memory. In benchmarks on a Mac Studio M3 Ultra against Ollama 0.20.4 (MLX backend), generation was 2-4x faster.
Tested with Goose, with `OLLAMA_HOST` pointing to Rapid-MLX. Also includes a prompt cache (sub-100ms TTFT on follow-up turns), 17 tool-call parsers, and reasoning separation for thinking models.
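Since the server speaks an OpenAI-compatible protocol, any standard chat-completions client payload should work against it. A sketch of the request body such a client would send (the model id and endpoint path here are assumptions, not values confirmed by the post):

```python
import json

# Hypothetical endpoint; OpenAI-compatible servers conventionally expose this path.
ENDPOINT = "http://localhost:11434/v1/chat/completions"

# Standard OpenAI-style chat request body; the model id is a placeholder.
payload = {
    "model": "some-mlx-model",
    "messages": [{"role": "user", "content": "Summarize this repo."}],
    "stream": True,  # streaming keeps perceived time-to-first-token low
}

body = json.dumps(payload)
print(body)
```

Posting that body to the endpoint with any HTTP client would exercise the server directly once it is running.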