What mlx-knife is, at a glance.
Current Version: 2.0.6 (stable)
Release Notes: See CHANGELOG.md for detailed changes, fixes, and migration guides.
A text-first model CLI for Apple Silicon. Every standard text model
(Llama, Mistral, Qwen, Phi, Gemma, etc.) works out of the box via list,
clone, run, convert, quantize, and serve.
Alongside text, mlx-knife has curated support for a verified set of
vision and audio model types (Whisper, Pixtral, Gemma-3, Qwen2-VL,
VibeVoice, and others). The current per-release list with status per
operation lives in docs/MODEL-COVERAGE.md.
mlx-knife is not a universal multimodal wrapper. Vision and audio support tracks a chosen
subset of mlx-vlm / mlx-audio upstream capabilities — not every model
those libraries load will be accepted here. Model types outside the
verified set are rejected explicitly at convert --quantize, not
silently converted (which would destroy the multimodal config and
produce a broken workspace).
- Run Models - Native MLX execution with streaming, chat modes, vision, and audio
- Audio Transcription (STT) - Whisper speech-to-text via --audio flag
- Vision with EXIF Metadata - Image analysis with automatic GPS/date/camera extraction
- Clone & Convert - Local model workflows without HuggingFace round-trips
- Model Repair - Fix broken mlx-vlm models with --repair-index
- List & Health - Browse cache, verify integrity, check MLX runtime compatibility
- Resumable Downloads - Interrupted clone/pull operations continue automatically
- Safe Vision Chunking - Automatic batching prevents Metal OOM crashes
- Unix Pipes (Beta) - Chain models without temp files (cat | mlx-run model -)
- Privacy - No background network or telemetry; explicit HuggingFace interactions only
Next up is 2.0.7, the feature release (in active development; 2.0.6 delivered integrity and capability-honesty fixes, 2.0.7 adds new capabilities):
- Embeddings (experimental). Generate OpenAI-style text-embedding vectors for semantic search and RAG, on-device:
  - mlxk embed <model> "text": embed a string, stdin (-), or a --batch JSONL stream; emitted as JSONL (or the standard envelope with --json).
  - mlxk embed-serve <model>: a single-model backend exposing an OpenAI-compatible POST /v1/embeddings, in its own process so the main server's memory gates stay intact.
  - mlxk serve --embed-backend URL: the main server proxies /v1/embeddings to that backend, so a client uses one base URL for both chat and embeddings.
  - Gated by MLXK2_ENABLE_ALPHA_FEATURES=1 while the surface settles. See Embeddings; the server side is in the Server Handbook.
- Whisper translation: translate multilingual speech to English with a multilingual (non-turbo) Whisper model, on the CLI (mlxk run … --audio FILE --translate) or the server (POST /v1/audio/translations, OpenAI-compatible). Models that can't translate (whisper-turbo, .en, non-Whisper STT) are rejected up front with a hint, never silently transcribed.
Along for the ride:
- MLX-stack refresh — mlx-vlm 0.6.2 / mlx-audio 0.4.4 (re-verified per ADR-023).
- mlxk embed … --json renders embeddings in the standard JSON envelope.
Chain models with standard Unix pipes - no temp files needed:
export MLXK2_ENABLE_PIPES=1
# Model chaining
cat article.txt | mlx-run translator_model - | mlx-run summarizer_model - "3 bullets"
# Works with Unix tools
mlx-run chat_model "explain quicksort" | tee explanation.txt | head -20
Robust handling of SIGPIPE and early pipe termination (| head, | grep -m1).
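If you write your own Python stage for such a pipeline, the usual way to survive an early-closing reader such as `| head` looks roughly like the sketch below. This is the generic Python pattern, not mlx-knife's implementation; `emit_lines` is a hypothetical helper name.

```python
import os
import sys


def emit_lines(lines, out=sys.stdout):
    """Write lines until the reader goes away; return how many were written.

    Generic pattern, not mlx-knife's own code: a closed pipe surfaces in
    Python as BrokenPipeError, which is treated here as a normal early exit.
    """
    written = 0
    try:
        for line in lines:
            out.write(line + "\n")
            out.flush()  # surface a closed pipe immediately
            written += 1
    except BrokenPipeError:
        # The reader (e.g. `head -20`) exited; stop quietly, no traceback.
        if out is sys.stdout:
            # Point stdout at /dev/null so the interpreter's exit-time flush
            # does not raise a second BrokenPipeError.
            devnull = os.open(os.devnull, os.O_WRONLY)
            os.dup2(devnull, sys.stdout.fileno())
    return written
```

Saved in a script that calls `emit_lines(...)` on a long generator, `python producer.py | head -20` should end cleanly after twenty lines instead of printing a traceback.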
- macOS with Apple Silicon
- Python 3.10-3.12 (see Python Compatibility below)
- 8GB+ RAM recommended, plus enough RAM to hold the model you run
mlx-knife is a tooling layer for running ML models (e.g. from Hugging Face) locally.
The project does not distribute any model weights and does not decide which models you use or how you use them.
Please note:
- Each model (weights, tokenizer, configuration, etc.) is governed by its own license.
- When mlx-knife downloads a model from a third-party service (e.g. Hugging Face), it does so on your behalf.
- You are responsible for:
  - reading and understanding the license of each model you use,
  - complying with any restrictions (e.g. Non-Commercial, Research Only, RAIL, etc.),
  - ensuring that your use of a given model (private, research, commercial, on-prem services, etc.) is legally permitted.
The mlx-knife source code itself is provided under the open-source license specified in this repository.
This license applies only to the mlx-knife code and does not extend to any external models.
This is not legal advice. Always refer to the original model license text and, if necessary, seek professional legal counsel.
- ✅ Python 3.10 - 3.12 - Full support (Text + Vision + Audio)
- ❌ Python 3.9 - Use version 2.0.3 (text + cache management only)
- ❌ Python 3.13+ - Not supported (miniaudio lacks pre-built wheels)
Recommended: Python 3.10 or 3.11 for best compatibility.
pip install mlx-knife
mlxk --version # → mlxk 2.0.6
Requirements: macOS Apple Silicon, Python 3.10-3.12
Includes: Text, Vision, Audio (Whisper STT), EXIF metadata, Unix pipes
git clone https://github.com/mzau/mlx-knife.git
cd mlx-knife
pip install -e ".[dev,test]"
mlxk --version # → mlxk 2.0.6
pytest -v
Requirements: macOS Apple Silicon, Python 3.10-3.12
If you're upgrading from MLX Knife 1.x, see MIGRATION.md for important information about the license change (MIT → Apache 2.0) and behavior changes.
# List models (human-readable)
mlxk list
mlxk list --health
mlxk list --verbose --health
# Check cache health
mlxk health
# Show model details
mlxk show "mlx-community/Phi-3-mini-4k-instruct-4bit"
# Pull a model
mlxk pull "mlx-community/Llama-3.2-3B-Instruct-4bit"
# Resume interrupted download (skip prompt)
mlxk pull "model-name" --force-resume
# Run interactive chat
mlxk run "Phi-3-mini" -c
# Start OpenAI-compatible server
mlxk serve --port 8080
| Command | Description |
|---|---|
| list | Model discovery with JSON output; supports cache and workspace paths |
| show | Detailed model information with --files, --config |
| health | Corruption detection and cache analysis |
| pull | HuggingFace model downloads with corruption detection |
| rm | Model deletion with lock cleanup and fuzzy matching |
| run | Interactive and single-shot model execution with streaming/batch modes |
| server/serve | OpenAI-compatible API server; SIGINT-robust (Supervisor); SSE streaming |
| clone | Model workspace cloning - create local editable copy from cache |
| push | Upload to HuggingFace Hub (requires --private flag for safety) |
| convert | Workspace transformations: --repair-index, --quantize <bits> |
| 🔬 embed | Experimental - text embeddings (OpenAI-style vectors) for search/RAG; JSONL or --json; requires MLXK2_ENABLE_ALPHA_FEATURES=1 |
| 🔬 embed-serve | Experimental - single-model embeddings HTTP backend (/v1/embeddings); pairs with serve --embed-backend; requires MLXK2_ENABLE_ALPHA_FEATURES=1 |
| 🔒 pipe mode | Beta feature - Unix pipes with mlxk run <model> - ...; requires MLXK2_ENABLE_PIPES=1 |
MLX-Knife supports multiple ways to reference models:
| Format | Example | Description |
|---|---|---|
| Full name | mlx-community/Phi-4-4bit | Exact HuggingFace repo ID |
| Short name | Phi-4 | Fuzzy match against cache |
| With hash | Phi-4@e96f3b2 | Specific commit/version |
mlxk run "mlx-community/Phi-4-4bit" "Hello"
mlxk run "Phi-4" "Hello" # Fuzzy match
mlxk show "Qwen3@e96" --json # Specific version
| Format | Example |
|---|---|
| Relative | ./my-workspace |
| Absolute | /Volumes/External/model |
| Prefix match | ./gemma- (all workspaces starting with "gemma-") |
| Directory | . (all workspaces in current directory) |
# List workspaces
mlxk list . # All workspaces in current directory
mlxk list ./gemma- # Prefix match: gemma-3n-4bit, gemma-3n-FIXED-4bit, ...
mlxk list $PWD/models # Absolute path → absolute output
# Clone → Run
mlxk clone org/model ./workspace
mlxk run ./workspace "Hello"
# Convert → Run
mlxk convert ./broken ./fixed --repair-index
mlxk run ./fixed "Test"
Output format: List output mirrors input format - relative patterns produce relative names (like ls), absolute paths produce absolute names.
Disambiguating paths vs cache names: When a local directory exists with the same name as a cached model, use ./ prefix to force workspace resolution. Otherwise, cache lookup is attempted first.
Machine-readable output stamps a portable identity, never a local filesystem path:
- model: the org/name form. Cache models use the HuggingFace repo ID; workspace models use the sentinel's source_repo (recorded at clone time), so a workspace and its cache origin read identically and nothing leaks your directory layout.
- content_hash: the model's content fingerprint. Workspace models carry the ADR-025 sha256:… hash; cache models carry the snapshot revision (git SHA). Its shape also tells a consumer which source a record came from.
Same-model rule. Embeddings are only comparable within one model's vector space. A consumer
compares (model, content_hash) across records to detect a mismatch before mixing them.
Determinism caveat (embeddings). Embeddings are not bit-reproducible across devices or
library builds: CPU and GPU vectors of the same model and text diverge (≈0.98 cosine on a 4-bit
model) — small for ranking, but enough to break dedup/threshold logic. Build a vector store with
one model and one device; the embedding metadata stamps device (cpu/gpu) so a mixed
store is detectable. Less aggressively quantized models are more device-stable.
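A consumer-side guard for the two rules above might look like the following sketch. The flat record layout and the `embedding` key are assumptions made for illustration (check the actual field names against `mlxk embed` output); only `model`, `content_hash`, and `device` are taken from the text.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def comparable(rec_a, rec_b):
    """True when two records share model, content_hash, and device."""
    keys = ("model", "content_hash", "device")
    return all(rec_a.get(k) == rec_b.get(k) for k in keys)


# Hypothetical records: the "embedding" key and all values are illustrative.
doc = {"model": "org/embedder", "content_hash": "sha256:abc", "device": "gpu",
       "embedding": [1.0, 0.0]}
query = {"model": "org/embedder", "content_hash": "sha256:abc", "device": "cpu",
         "embedding": [0.0, 1.0]}

if comparable(doc, query):
    print(cosine(doc["embedding"], query["embedding"]))
else:
    print("mixed store: re-embed with one model and one device")
```

Refusing to compare on a device mismatch is the strict reading of the caveat; a ranking-only consumer could reasonably relax it to a warning.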
The HuggingFace cache (~/.cache/huggingface) is a shared namespace. Multiple libraries write to it during inference, often without notice. This makes it hard to know what a model actually needs, whether it works offline, or if something changed since you last tested it.
Workspaces solve this by giving each model its own directory with full isolation:
- Self-contained: All files in one place, including model weights, config, tokenizer, and any runtime downloads (captured in .hf_cache/)
- Reproducible: After one successful run, everything is local. Archive it, move it to another machine, run it offline
- Platform-independent: Workspaces are plain directories with standard formats (safetensors + config.json). They work regardless of where the model originally came from
- Transparent: mlxk show and mlxk health tell you exactly what's inside, whether anything changed, and if the model is ready to run
For quick testing, mlxk pull + mlxk run still works. Workspaces are for when you want control.
# Set up a workspace directory (add to your shell profile)
export MLXK_WORKSPACE_HOME=~/mlx-models
# Clone a model (target auto-derived from model name, org prefix stripped)
mlxk clone mlx-community/Llama-3.2-1B-Instruct-4bit
# → ~/mlx-models/Llama-3.2-1B-Instruct-4bit
# Run it (fuzzy match finds the workspace)
mlxk run Llama-3.2 "Hello"
# See your portfolio (workspaces + cache)
mlxk list
When MLXK_WORKSPACE_HOME is set, clone derives the target directory automatically (strips org prefix, stays flat). Explicit targets still work: mlxk clone model ./local-dir or mlxk clone model custom-name.
Full local cycle for model experimentation, repair, quantization, and testing:
mlxk clone mlx-community/model # Clone to workspace home
mlxk convert ./model ./fixed --repair-index # Fix broken index
mlxk convert ./model ./quantized --quantize 4 # Or quantize to 4-bit
mlxk list . # See all local workspaces
mlxk run ./fixed "test prompt" # Local inference
mlxk server --model ./fixed # Dev server
mlxk push ./fixed "your-org/model" # Optional publish
Key capabilities:
- Model repair: Fix index/shard mismatches from mlx-vlm conversions
- Quantization: Convert bf16/fp16 models to 2, 3, 4, 6, or 8-bit
- Cross-volume: Clone/convert works across APFS volumes, SMB, NFS
- Local testing: Run/server/show without pushing to HuggingFace
- Rapid iteration: Clone → Modify → Test loop
| Command | Workspace Support | Example |
|---|---|---|
| run | ✅ Yes | mlxk run ./workspace "prompt" |
| show | ✅ Yes | mlxk show ./workspace --files |
| health | ✅ Yes | mlxk health ./workspace |
| server/serve | ✅ Yes | mlxk serve --model ./workspace |
| embed | ✅ Yes 🔬 | mlxk embed ./workspace "text" (experimental; MLXK2_ENABLE_ALPHA_FEATURES=1) |
| embed-serve | ✅ Yes 🔬 | mlxk embed-serve ./workspace (experimental; MLXK2_ENABLE_ALPHA_FEATURES=1) |
| clone | ✅ Creates | mlxk clone org/model (shorthand) or mlxk clone org/model ./workspace |
| convert | ✅ Repair/Quantize | mlxk convert ./in ./out --quantize 4 |
| push | ✅ Yes | mlxk push ./workspace "org/name" |
| list | ✅ Yes | mlxk list . or mlxk list ./gemma- |
| pull | ❌ Cache only | Downloads to HuggingFace cache |
| rm | ❌ Cache only | Use rm -rf ./workspace for local directories |
For a web-based chat UI, use nChat - a lightweight web interface for the BROKE ecosystem:
# Clone once (local setup):
git clone https://github.com/mzau/broke-nchat.git
cd broke-nchat
# Start mlx-knife server:
mlxk serve
# Open web UI:
open webui/index.html
On-Prem: Pure HTML/CSS/JS - runs entirely locally, zero dependencies.
Note: nChat is a separate project designed for the entire BROKE ecosystem (MLX Knife + BROKE Cluster). See nChat README for CORS configuration.
MLX Knife supports multiple input modalities beyond text. All multi-modal features share a common output pattern: model responses are followed by collapsible metadata tables for transparency and traceability.
Image analysis via the --image flag (CLI and server). Requires Python 3.10+. Stable since 2.0.4.
- Python 3.10+ (mlx-vlm dependency)
- Backend: mlx-vlm 0.4+ (included in base install)
# Image analysis with custom prompt
mlxk run "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit" \
--image photo.jpg "Describe what you see in detail"
# Multiple images (space-separated or glob)
mlxk run vision-model --image img1.jpg img2.jpg img3.jpg "Compare these images"
mlxk run vision-model --image photos/*.jpg "Which images show outdoor scenes?"
# Auto-prompt (default: "Describe the image.")
mlxk run vision-model --image cat.jpg
# Text-only on vision model (no --image flag)
mlxk run "mlx-community/Llama-3.2-11B-Vision-Instruct-4bit" "What is 2+2?"
Terminology Note: mlx-knife uses "batch" in the traditional computing sense (sequential job processing in groups), not ML inference batching (parallel batch_size > 1 in a single forward pass). Images are processed sequentially in groups for memory safety, not performance parallelization.
Default behavior: Vision processing defaults to one image at a time for maximum stability on all systems. Use --chunk N to process multiple images per batch when your system can handle it.
# Default: one image at a time (most robust, automatic chunking)
mlxk run pixtral "Describe image" --image photos/*.jpg
# Faster: 5 images per batch (requires more RAM, may trigger model-specific issues)
mlxk run pixtral "Describe images" --chunk 5 --image photos/*.jpg
# Alternative: Use --prompt flag (useful when experimenting with different prompts)
mlxk run pixtral --chunk 5 --image photos/*.jpg --prompt "Describe images"
# Set default chunk size via environment variable
export MLXK2_VISION_CHUNK_SIZE=3
mlxk run pixtral "Describe images" --image photos/*.jpg
Why chunking?
- Safety: Prevents Metal OOM crashes by limiting images per processing group (--chunk N)
- Isolation: Fresh inference session per chunk (KV cache cleared, conversation context reset)
- Trade-off: ~2-3s model load overhead per chunk vs guaranteed isolation
Reliability: Vision models can sometimes describe details they didn't actually see. MLX Knife prevents this automatically:
- Default (chunk=1): Most reliable - each image processed independently
- Larger chunks: Still safe, but models may occasionally confuse details between images in the same batch
For maximum accuracy, use the default chunk=1 (no configuration needed).
Server API:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "pixtral", "chunk": 3, "messages": [...]}'
Note: chunk is an mlx-knife extension parameter. See SERVER-HANDBOOK.md for details.
When processing images, MLX Knife automatically prepends metadata in a collapsible table (collapsed by default) before the model output:
<details>
<summary>📸 Chunk 1/3: Images 1-4</summary>
| Image | Filename | Original | Location | Date | Camera |
|-------|----------|----------|----------|------|--------|
| 1 | image_abc123.jpeg | beach.jpg | 📍 32.7900°N, 16.9200°W | 📅 2023-12-06 12:19 | 📷 Apple iPhone SE |
| 2 | image_def456.jpeg | mountain.jpg | 📍 32.8700°N, 17.1700°W | 📅 2023-12-10 15:42 | 📷 Apple iPhone SE |
| 3 | image_xyz789.jpeg | sunset.jpg | 📍 32.8200°N, 17.0500°W | 📅 2023-12-08 18:30 | 📷 Apple iPhone SE |
| 4 | image_uvw456.jpeg | forest.jpg | 📍 32.8800°N, 17.1200°W | 📅 2023-12-09 10:15 | 📷 Apple iPhone SE |
</details>
A beach with palm trees and clear blue water. A mountain landscape with snow-capped peaks...
Chunk information in summary:
- Shows current chunk and total chunks (e.g., "Chunk 1/3")
- Shows image range in current chunk (e.g., "Images 1-4")
- Helps track progress in WebUI and prevents confusion about which images are being described
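The grouping behind those summaries can be pictured with a short sketch. It assumes plain sequential slicing; `chunk_images` is a hypothetical helper, and the singular "Image N" label for one-image chunks is a guess, since only the range form appears in the example output.

```python
import math


def chunk_images(paths, chunk_size=1):
    """Yield (summary, group) pairs, one per sequential processing group."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    total = math.ceil(len(paths) / chunk_size)
    for index in range(total):
        start = index * chunk_size
        group = paths[start:start + chunk_size]
        first, last = start + 1, start + len(group)
        # Assumption: a one-image chunk is labelled "Image N", not "Images N-N".
        images = f"Image {first}" if first == last else f"Images {first}-{last}"
        yield f"Chunk {index + 1}/{total}: {images}", group


photos = [f"photo_{n}.jpg" for n in range(1, 11)]  # 10 hypothetical files
for summary, group in chunk_images(photos, chunk_size=4):
    print(summary, group)
```

With ten files and a chunk size of 4, the first summary reads "Chunk 1/3: Images 1-4", matching the shape of the example table header.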
Why metadata comes first:
- The model sees GPS, date, and camera info when analyzing images (enables location/time-aware descriptions)
- The markdown table shows you exactly what the model knows about each image
- Helps verify which description belongs to which file
Metadata includes:
- Image ID → Filename mapping (identify which description belongs to which file)
- GPS coordinates (latitude/longitude, if available in EXIF)
- Precision: 4 decimal places (~11m accuracy) for street-level context
- Capture date/time (ISO 8601 format)
- Camera model (device info)
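The 4-decimal precision figure can be checked with a little arithmetic. The sketch below is illustrative only: `format_gps` is a hypothetical helper that mimics the coordinate style in the example table, and 111.32 km per degree is the usual approximation for latitude.

```python
def format_gps(lat, lon, places=4):
    """Render signed decimal degrees as hemisphere-tagged text."""
    ns = "N" if lat >= 0 else "S"
    ew = "E" if lon >= 0 else "W"
    return f"{abs(lat):.{places}f}°{ns}, {abs(lon):.{places}f}°{ew}"


# One degree of latitude spans roughly 111.32 km, so the last kept digit
# (0.0001 degrees) corresponds to about 11 m on the ground. A degree of
# longitude is shorter away from the equator, so east-west precision is finer.
METERS_PER_DEGREE_LAT = 111_320
step_m = 10 ** -4 * METERS_PER_DEGREE_LAT

print(format_gps(32.79, -16.92))
print(round(step_m, 1))
```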
Privacy control:
EXIF extraction is enabled by default. To disable (e.g., for privacy-sensitive images):
export MLXK2_EXIF_METADATA=0
mlxk run vision-model --image photo.jpg "describe"
Output is the same for CLI and server - metadata tables work in terminals, web UIs (nChat), and can be parsed programmatically.
- Image limits: Model-dependent due to Metal / unified-memory constraints and peak activation usage
- pixtral-12b-8bit: Up to 5 images tested on M2 Max 64GB (multi-image capable)
- Llama-3.2-11B / Other models: Single-image only
- Larger models (24B+): Limited to 1-2 images on 64GB RAM
- Default server guardrails: 20 MB per image, 50 MB total (configurable). Base64 encoding adds ~33% overhead.
Vision models work with OpenAI-compatible /v1/chat/completions endpoint using base64-encoded images:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llama-vision",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}]
}'
Vision support routes through mlx-vlm upstream. File integrity is verified by mlxk health, but runtime behavior depends on the per-model upstream state.
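For clients that build the request in code, the image part of the curl body can be assembled as in the sketch below, which also shows where the roughly 33% base64 overhead comes from. `image_part` and `encoded_size` are hypothetical helper names, and the zero-filled bytes stand in for a real JPEG; whether the server's size limits count raw or encoded bytes is not stated here.

```python
import base64
import json


def image_part(data: bytes, mime: str = "image/jpeg") -> dict:
    """Build one image_url content part carrying a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def encoded_size(raw_bytes: int) -> int:
    """Length of standard padded base64 for raw_bytes of input."""
    return 4 * ((raw_bytes + 2) // 3)


raw = bytes(300_000)  # placeholder bytes standing in for a real JPEG
body = {
    "model": "llama-vision",
    "messages": [{
        "role": "user",
        "content": [{"type": "text", "text": "What is in this image?"}, image_part(raw)],
    }],
}
# Every 3 raw bytes become 4 characters, hence the ~1.33x growth.
print(len(json.dumps(body)), encoded_size(len(raw)) / len(raw))
```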
For the authoritative per-release status of every supported model_type, see docs/MODEL-COVERAGE.md.
A reasonable starting point: Pixtral 12B 8-bit (mlx-community/pixtral-12b-8bit, ~13.5 GB) — multi-image capable, strong text recognition, verified across recent releases. Other verified model families, repair workflows for legacy conversions, and known runtime issues are tracked release-by-release in the coverage matrix.
Some legacy mlx-vlm conversions need a one-time index repair before they load — use mlxk convert <src> <dst> --repair-index. The coverage matrix lists which families still need this and which are clean since mlx-vlm 0.4+.
🎙️ Audio Transcription: Speech-to-text via Whisper models (mlx-audio backend). Works out-of-the-box with PyPI install. Backward compatible with Gemma-3n multimodal audio (mlx-vlm).
Requirements:
- Python 3.10+ (mlx-audio dependency, included in base install)
- No system dependencies: MP3/WAV decoding via embedded libsndfile (no ffmpeg or Homebrew required)
Reference model: mlx-community/whisper-large-v3-turbo-4bit (~464 MB, supports >10 min audio, 4bit + 8bit variants both verified). See docs/MODEL-COVERAGE.md for the full per-release verified list.
🔧 Backend Architecture:
mlx-knife automatically routes audio models to the optimal backend:
- Dedicated STT (Whisper / Voxtral family) → mlx-audio (long-form, best accuracy)
- Multimodal LLMs with audio input → mlx-vlm (token-limited duration, secondary path)
⚙️ Audio Defaults:
| Setting | Audio | Text/Vision | Reason |
|---|---|---|---|
| Temperature | 0.0 | 0.7 | Greedy decoding (STT best practice) |
| Default Prompt | "Transcribe this audio." | - | Minimal prompt for pure transcription |
💡 Quick Start:
# Pull a Whisper model (one-time setup)
mlxk pull mlx-community/whisper-large-v3-turbo-4bit
# Transcribe audio (WAV, MP3, M4A - native on macOS)
mlxk run whisper-large --audio speech.mp3
# → Automatic greedy decoding (temp=0.0)
# With language hint for better accuracy
mlxk run whisper-large --audio speech.mp3 --language en
# Longer audio (>10 minutes supported)
mlxk run whisper-large --audio podcast.wav
- Duration: depends on model architecture. Dedicated STT models (Whisper) support >10 min; multimodal LLMs with audio are typically token-limited to ~30 s
- File size: 50 MB max per request (configurable)
- Formats: WAV, MP3, M4A on macOS (M4A via Core Audio); Linux needs ffmpeg for non-WAV
- Legacy weights: .npz-only models are not supported; use .safetensors variants
🎯 Advanced Usage:
# Explicit temperature control (0.0 = greedy, deterministic)
mlxk run whisper-large --audio speech.wav --temperature 0.0
# Force specific language (improves accuracy)
mlxk run whisper-large --audio german.mp3 --language de
# Segment metadata (MLXK2_AUDIO_SEGMENTS=1 for timestamps)
MLXK2_AUDIO_SEGMENTS=1 mlxk run whisper-large --audio meeting.wav
🌍 Speech Translation (→ English):
Whisper can translate non-English speech directly to English text. This is a
fixed-target feature of the architecture — output is always English, regardless
of source language. Only multilingual, non-turbo Whisper variants support it
(e.g. whisper-large-v3-4bit); .en variants lack the translate token and turbo
variants have a reduced decoder that cannot translate reliably. Such models are
rejected up front with a hint — mlxk never silently falls back to transcription.
# CLI: translate German (or any language) speech to English
mlxk run whisper-large-v3-4bit --audio german-news.mp3 --translate
# → English text (--translate and --translate en are equivalent)
# Optional explicit source-language hint
mlxk run whisper-large-v3-4bit --audio news.mp3 --translate --language de
# Server: OpenAI-compatible POST /v1/audio/translations
curl http://localhost:8000/v1/audio/translations \
-F file=@german-news.mp3 \
-F model=mlx-community/whisper-large-v3-4bit
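# → {"text": "..."} (English)
# Server: OpenAI SDK (drop-in)
from openai import OpenAI
# Assumption: server address as in the curl example; the api_key is a placeholder the SDK requires.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
result = client.audio.translations.create(
    model="mlx-community/whisper-large-v3-4bit",
    file=open("german-news.mp3", "rb"),
)
A non-audio model returns HTTP 400; an audio model that can't translate returns HTTP 422. See the Server Handbook for the full endpoint contract.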
Gated by MLXK2_ENABLE_ALPHA_FEATURES=1 while the surface settles (ADR-015).
Turn text into vectors for semantic search, clustering, or RAG — on-device, no cloud
round-trip. mlxk supports verified encoder models (BERT-family, e.g.
mlx-community/bge-small-en-v1.5-4bit) and decoder embedders (e.g. Qwen3-Embedding).
export MLXK2_ENABLE_ALPHA_FEATURES=1
# One string -> one JSONL record (vector + portable model/content_hash/device stamp)
mlxk embed mlx-community/bge-small-en-v1.5-4bit "machine learning on Apple Silicon"
mlxk embed also reads stdin (-), a --batch JSONL stream, or a retrieval --query, and emits the standard envelope with --json; see mlxk embed --help. A worked end-to-end RAG loop
(index → search → retrieve) lives in examples/rag-server.
Two consumer rules matter: embeddings are only comparable within one model's vector space, and they are not bit-reproducible across CPU/GPU — build a store with one model and one device (see Model Identity in Output).
Serving over HTTP: mlxk embed-serve <model> + mlxk serve --embed-backend URL give a
client one OpenAI-compatible base URL for both chat and embeddings — see the
Server Handbook.
📋 Complete API Specification: See JSON API Specification for comprehensive schema, error codes, and examples.
All commands support both human-readable and JSON output (--json flag) for automation and scripting, enabling seamless integration with CI/CD pipelines and cluster management systems.
Experimental: mlxk embed (gated by MLXK2_ENABLE_ALPHA_FEATURES=1) renders embedding records as JSONL by default; --json wraps them in the standard envelope.
All commands support JSON output via --json flag:
mlxk list --json | jq '.data.models[].name'
mlxk health --json | jq '.data.summary'
mlxk show "Phi-3-mini" --json | jq '.data.model'
Response Format:
{
"status": "success|error",
"command": "list|health|show|pull|rm|clone|convert|version|push|run|server",
"data": { /* command-specific data */ },
"error": null | { "type": "...", "message": "..." }
}
mlxk list --json
# Output:
{
"status": "success",
"command": "list",
"data": {
"models": [
{
"name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
"hash": "a5339a41b2e3abcdef1234567890ab12345678ef",
"size_bytes": 4613734656,
"last_modified": "2024-10-15T08:23:41Z",
"framework": "MLX",
"model_type": "chat",
"capabilities": ["text-generation", "chat"],
"health": "healthy",
"runtime_compatible": true,
"reason": null,
"cached": true
}
],
"count": 1
},
"error": null
}
mlxk health --json
# Output:
{
"status": "success",
"command": "health",
"data": {
"healthy": [
{
"name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
"status": "healthy",
"reason": "Model is healthy"
}
],
"unhealthy": [],
"summary": { "total": 1, "healthy_count": 1, "unhealthy_count": 0 }
},
"error": null
}
mlxk show "Phi-3-mini" --json --files
# Output (simplified):
{
"status": "success",
"command": "show",
"data": {
"model": {
"name": "mlx-community/Phi-3-mini-4k-instruct-4bit",
"hash": "a5339a41b2e3abcdef1234567890ab12345678ef",
"size_bytes": 4613734656,
"framework": "MLX",
"model_type": "chat",
"capabilities": ["text-generation", "chat"],
"last_modified": "2024-10-15T08:23:41Z",
"health": "healthy",
"runtime_compatible": true,
"reason": null,
"cached": true
},
"files": [
{"name": "config.json", "size": "1.2KB", "type": "config"},
{"name": "model.safetensors", "size": "2.3GB", "type": "weights"}
],
"metadata": null
},
"error": null
}
# Get available model names for scheduling
MODELS=$(mlxk list --json | jq -r '.data.models[].name')
# Check cache health before deployment
HEALTH=$(mlxk health --json | jq '.data.summary.healthy_count')
if [ "$HEALTH" -eq 0 ]; then
echo "No healthy models available"
exit 1
fi
# Download required models
mlxk pull "mlx-community/Phi-3-mini-4k-instruct-4bit" --json
# Verify model integrity in CI
mlxk health --json | jq -e '.data.summary.unhealthy_count == 0'
# Clean up CI artifacts
mlxk rm "test-model-*" --json --force
# Pre-warm cache for deployment
mlxk pull "production-model" --json
# Find models by pattern
LARGE_MODELS=$(mlxk list --json | jq -r '.data.models[] | select(.name | contains("30B")) | .name')
# Show detailed info for analysis
for model in $LARGE_MODELS; do
mlxk show "$model" --json --config | jq '.data.model_config'
done
MLX Knife provides rich human-readable output by default (without --json flag).
Error Handling (2.0.3+): Errors print to stderr for clean pipe workflows:
mlxk show badmodel | grep ... # Errors don't contaminate stdout
mlxk pull badmodel > log 2> err # Capture errors separately
mlxk list
mlxk list --health
mlxk health
mlxk show "mlx-community/Phi-3-mini-4k-instruct-4bit"
mlxk pull "mlx-community/Llama-3.2-3B-Instruct-4bit"
Download models from HuggingFace:
mlxk pull "mlx-community/Phi-3-mini-4k-instruct-4bit"
Interrupted downloads (2.0.4-beta.5+): If a download fails (network issue, Ctrl-C), mlxk pull will detect this and prompt to resume:
$ mlxk pull "model-name"
Model 'model-name' has partial download:
No model weights found. Use --force-resume to attempt resume or 'mlxk rm' to delete.
Resume download? [Y/n]: y
Automation/scripting: Use --force-resume to skip the prompt:
mlxk pull "model-name" --force-resume
- list: Shows MLX chat models only (compact names, safe default)
- list --verbose: Shows all MLX models (chat + base) with full org/names and Framework column
- list --all: Shows all frameworks (MLX, GGUF, PyTorch)
- Flags are combinable: --all --verbose, --all --health, --verbose --health
mlxk list shows workspace and cache models in one portfolio view. Output expands by mode; --health is a modular extension on top of the base columns.
Default columns (compact mode):
| Column | Meaning |
|---|---|
| Name | Model identifier (compact form; workspaces use display_name) |
| Hash | First 7 chars of content hash |
| Size | Combined weights + config size |
| Modified | Last filesystem modification |
| Src | cache · ws (clean workspace) · ws* (modified) · ws? (v1 legacy; run mlxk show <name> --recalc-hash to migrate) |
| Type | Capability label (e.g. chat, chat+vision, audio) |
--verbose (or --all) adds the Clean column (workspace integrity: ✓ clean · ✗ modified · — cache or migration-pending) and the Framework column (MLX / PyTorch / GGUF).
--health adds health diagnostics. In compact mode this is a single Health column; in verbose mode it splits into Integrity, Runtime, and Reason.
| Health value | Meaning |
|---|---|
| healthy | file integrity OK and MLX runtime compatible |
| healthy* | files intact but MLX runtime can't execute (wrong framework, incompatible model_type, or mlx-lm too old) |
| unhealthy | file integrity failed or unknown format |
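Read as a decision rule, the table collapses two independent checks into one label. The function below is a sketch of that mapping as described here, not mlx-knife's source; `health_label` is a hypothetical name.

```python
def health_label(integrity_ok: bool, runtime_compatible: bool) -> str:
    """Collapse the two checks into the single compact Health value."""
    if not integrity_ok:
        return "unhealthy"   # failed integrity or unknown format
    if not runtime_compatible:
        return "healthy*"    # files intact, but MLX cannot execute the model
    return "healthy"


for integrity, runtime in [(True, True), (True, False), (False, True)]:
    print(integrity, runtime, health_label(integrity, runtime))
```

Integrity is checked first, so a model with broken files is reported as unhealthy regardless of what the runtime check would say.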
Default — workspace + cache models, MLX-only, healthy + runnable:
mlxk list
Name | Hash | Size | Modified | Src | Type
----------------------+---------+-------+----------+-------+-----
Llama-3.2-3B-Instruct | a1b2c3d | 2.1GB | 2d ago | cache | chat
my-llama-workspace | 9f8e7d6 | 2.1GB | 1h ago | ws | chat
legacy-workspace | 7a6b5c4 | 4.8GB | 5d ago | ws? | chat
--verbose — adds Clean and Framework columns:
mlxk list --verbose
Name | Hash | Size | Modified | Src | Clean | Framework | Type
----------------------+---------+-------+----------+-------+-------+-----------+-----
Llama-3.2-3B-Instruct | a1b2c3d | 2.1GB | 2d ago | cache | — | MLX | chat
my-llama-workspace | 9f8e7d6 | 2.1GB | 1h ago | ws | ✓ | MLX | chat
edited-workspace | 9f8e7d6 | 2.1GB | 5m ago | ws* | ✗ | MLX | chat
--verbose --health — verbose mode splits health into three columns:
mlxk list --verbose --health
Name | Hash | Size | Modified | Src | Clean | Framework | Type | Integrity | Runtime | Reason
----------------------+---------+-------+----------+-------+-------+-----------+------+-----------+---------+-------
Llama-3.2-3B-Instruct | a1b2c3d | 2.1GB | 2d ago | cache | — | MLX | chat | healthy | yes | -
my-llama-workspace | 9f8e7d6 | 2.1GB | 1h ago | ws | ✓ | MLX | chat | healthy | yes | -
--json is unaffected by the human-mode healthy + runtime_compatible filter and always returns the full model list with health/runtime fields.
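Because the JSON list is unfiltered, a script that wants the same "healthy and runnable" view as human mode has to apply the filter itself. A minimal sketch, assuming the envelope and the `health` / `runtime_compatible` fields shown in the JSON examples; the two model entries are made up.

```python
import json

# Trimmed, hypothetical `mlxk list --json` output using the documented fields.
raw = """
{"status": "success", "command": "list", "error": null,
 "data": {"count": 2, "models": [
   {"name": "org/model-a", "health": "healthy", "runtime_compatible": true},
   {"name": "org/model-b", "health": "healthy", "runtime_compatible": false}
 ]}}
"""


def runnable_models(envelope: dict) -> list:
    """Names of models that are healthy and runtime compatible."""
    if envelope["status"] != "success":
        raise RuntimeError(envelope["error"]["message"])
    return [m["name"] for m in envelope["data"]["models"]
            if m["health"] == "healthy" and m["runtime_compatible"]]


print(runnable_models(json.loads(raw)))
```

In a real script the `raw` string would come from the command's stdout, for example via `subprocess.run(["mlxk", "list", "--json"], capture_output=True, text=True).stdout`.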
MLX Knife 2.0 provides structured logging with configurable output formats and levels.
Control verbosity with --log-level (server mode):
# Default: Show startup, model loading, and errors
mlxk serve --log-level info
# Quiet: Only warnings and errors
mlxk serve --log-level warning
# Silent: Only errors
mlxk serve --log-level error
# Verbose: All logs including HTTP requests
mlxk serve --log-level debug
Log Level Behavior:
- debug: All logs + Uvicorn HTTP access logs (GET /v1/models, etc.)
- info: Application logs (startup, model switching, errors) + HTTP access logs
- warning: Only warnings and errors (no startup messages, no HTTP access logs)
- error: Only error messages
Enable structured JSON output for log aggregation tools:
# JSON logs (recommended - CLI flag)
mlxk serve --log-json
# JSON logs (alternative - environment variable)
MLXK2_LOG_JSON=1 mlxk serve
Note: --log-json also formats Uvicorn access logs as JSON for consistent output.
JSON Format:
{"ts": 1760830072.96, "level": "INFO", "msg": "MLX Knife Server 2.0 starting up..."}
{"ts": 1760830073.14, "level": "INFO", "msg": "Switching to model: mlx-community/...", "model": "..."}
{"ts": 1760830074.52, "level": "ERROR", "msg": "Model type bert not supported.", "logger": "root"}
Fields:
- ts: Unix timestamp
- level: Log level (INFO, WARN, ERROR, DEBUG)
- msg: Log message (HF tokens and user paths automatically redacted)
- logger: Source logger (mlxk2 = application, root = external libraries like mlx-lm)
- Additional fields: model, request_id, detail, duration_ms (context-dependent)
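Because each log record is a single JSON object per line, the output can be filtered with a few lines of Python. The following is a minimal sketch, assuming the stream contains only the fields listed above; the helper names `parse_log_lines` and `errors_only` are illustrative and not part of mlx-knife.

```python
import json

# Sample lines copied from the JSON format shown above.
SAMPLE_LOG = """\
{"ts": 1760830072.96, "level": "INFO", "msg": "MLX Knife Server 2.0 starting up..."}
{"ts": 1760830073.14, "level": "INFO", "msg": "Switching to model: mlx-community/...", "model": "..."}
{"ts": 1760830074.52, "level": "ERROR", "msg": "Model type bert not supported.", "logger": "root"}
"""


def parse_log_lines(text):
    """Parse one JSON object per non-empty line; skip anything that is not a JSON object."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a stray plain-text line mixed into the stream
        if isinstance(record, dict):
            records.append(record)
    return records


def errors_only(records):
    """Keep records whose level is ERROR."""
    return [r for r in records if r.get("level") == "ERROR"]


records = parse_log_lines(SAMPLE_LOG)
for rec in errors_only(records):
    # "logger" is optional in the sample lines, so fall back to a placeholder.
    print(rec["ts"], rec.get("logger", "-"), rec["msg"])
```

Skipping undecodable lines keeps the filter usable when a library writes plain text into the same stream.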
Sensitive data is automatically removed from logs:
- HuggingFace tokens (hf_...) → [REDACTED_TOKEN]
- User home paths (/Users/john/...) → ~/...
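The two redaction rules can be approximated with regular expressions. This is a sketch under assumptions, not the redaction code mlx-knife ships: the exact token and path patterns are guesses based only on the `hf_...` and `/Users/john/...` forms shown above, and `redact` is a hypothetical helper name.

```python
import re

# Assumed patterns: only the "hf_..." and "/Users/<name>/..." shapes are documented.
TOKEN_RE = re.compile(r"hf_[A-Za-z0-9]+")
HOME_RE = re.compile(r"/Users/[^/\s]+")


def redact(message):
    """Replace HuggingFace-style tokens and macOS home prefixes in a log message."""
    message = TOKEN_RE.sub("[REDACTED_TOKEN]", message)
    return HOME_RE.sub("~", message)


print(redact("Using token hf_AbCdEfGhIjKlMnOpQrStUvWxYz123456 from /Users/john/models"))
```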
Example:
# Original (unsafe):
Using token hf_AbCdEfGhIjKlMnOpQrStUvWxYz123456 from /Users/john/models
# Logged (safe):
Using token [REDACTED_TOKEN] from ~/models
MLX Knife supports comprehensive runtime configuration via environment variables. All settings can be controlled without code changes.
| Variable | Description | Default | Since |
|---|---|---|---|
| MLXK_WORKSPACE_HOME | Directory for workspace model portfolio. Enables clone shorthand (no target needed), portfolio discovery for list, health, run, serve. | (none) | 2.0.5 |
# Recommended: add to shell profile
export MLXK_WORKSPACE_HOME=~/mlx-models
mlxk clone mlx-community/model # → ~/mlx-models/model (org stripped)
mlxk list # Shows workspaces + cache models
mlxk run whisper "transcribe" # Finds whisper in workspace home
mlxk health # Checks workspaces + cache
Enable experimental features:
| Variable | Description | Default | Since |
|---|---|---|---|
| MLXK2_ENABLE_PIPES | Enable Unix pipe integration (mlxk run <model> -) | 0 (disabled) | 2.0.4 |
| MLXK2_EXIF_METADATA | Extract EXIF metadata from images (Vision models) | 1 (enabled) | 2.0.4 |
Examples:
# Enable pipe mode for stdin processing
export MLXK2_ENABLE_PIPES=1
echo "Hello" | mlxk run model - "translate to Spanish"
# Disable EXIF extraction for privacy (enabled by default)
export MLXK2_EXIF_METADATA=0
mlxk run vision-model --image photo.jpg "describe this"
Control server behavior without command-line flags:
| Variable | Description | Default | Since |
|---|---|---|---|
| MLXK2_HOST | Server bind address | 127.0.0.1 | 2.0.0 |
| MLXK2_PORT | Server port | 8000 | 2.0.0 |
| MLXK2_PRELOAD_MODEL | Model to load at startup (set by --model flag) | (none) | 2.0.0-beta |
| MLXK2_MAX_TOKENS | Override default max_tokens for all requests | (auto) | 2.0.4 |
| MLXK2_RELOAD | Enable Uvicorn auto-reload (development only) | 0 (disabled) | 2.0.0 |
Control vision model behavior (Python 3.10+, beta):
| Variable | Description | Default | Since |
|---|---|---|---|
| MLXK2_VISION_CHUNK_SIZE | Default chunk size for vision image processing | 1 | 2.0.4-beta.7 |
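The chunk size controls how many images are handed to the model at once. The sketch below only illustrates the batching idea; `chunked` is a hypothetical helper, and how mlx-knife schedules chunks internally is not described here.

```python
def chunked(items, chunk_size):
    """Split a list into consecutive batches of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


images = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg", "f.jpg", "g.jpg"]
for batch in chunked(images, 3):
    # Each batch would be processed in one pass; larger batches need more memory.
    print(batch)
```

With a chunk size of 1, every image gets its own pass, which is the slowest setting but keeps peak memory lowest.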
Examples:
# Process 3 images per chunk instead of 1 (faster but requires more RAM)
export MLXK2_VISION_CHUNK_SIZE=3
mlxk run pixtral --image photos/*.jpg "Describe images"
# CLI flag overrides environment variable
mlxk run pixtral --chunk 5 --image photos/*.jpg "Describe images" # Uses 5, not 3
# Custom host/port binding
MLXK2_HOST=0.0.0.0 MLXK2_PORT=9000 mlxk serve
# Preload model for faster first request
MLXK2_PRELOAD_MODEL="mlx-community/Qwen2.5-3B-Instruct-4bit" mlxk serve
# Override max_tokens for all requests
MLXK2_MAX_TOKENS=4096 mlxk serve
# Development mode with auto-reload
MLXK2_RELOAD=1 mlxk serve
Control log output format and verbosity:
| Variable | Description | Default | Since |
|---|---|---|---|
| MLXK2_LOG_JSON | Enable JSON log format | 0 (text) | 2.0.0 |
| MLXK2_LOG_LEVEL | Log level (debug, info, warning, error) | info | 2.0.0 |
Examples:
# JSON logs for log aggregation tools
MLXK2_LOG_JSON=1 mlxk serve
# Quiet mode (warnings and errors only)
MLXK2_LOG_LEVEL=warning mlxk serve
# Verbose debug output
MLXK2_LOG_LEVEL=debug mlxk serve
Note: CLI flags (--log-json, --log-level) take precedence over environment variables.
Control HuggingFace Hub authentication and cache:
| Variable | Description | Default | Since |
|---|---|---|---|
| HF_HOME | HuggingFace cache directory | ~/.cache/huggingface | N/A |
| HF_TOKEN | HuggingFace API token (for private models, push) | (none) | N/A |
| HUGGINGFACE_HUB_TOKEN | Alternative token variable (fallback) | (none) | N/A |
Examples:
# Custom cache location
HF_HOME=/data/models mlxk list
# Authentication for private models
HF_TOKEN=hf_... mlxk pull org/private-model
# Upload to HuggingFace Hub
HF_TOKEN=hf_... mlxk push ./workspace org/model --private
When multiple sources define the same setting, precedence order is:
- CLI flags (highest priority) - e.g., --log-json, --port
- Environment variables - e.g., MLXK2_LOG_JSON=1
- Defaults (lowest priority) - documented above
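The precedence order can be expressed as a small lookup function. This is a sketch of the rule only, assuming an unset CLI flag is represented as `None` and an empty environment value counts as unset; `resolve_setting` is a hypothetical name, not an mlx-knife function.

```python
import os


def resolve_setting(cli_value, env_name, default, environ=None):
    """Return the CLI value if given, else the environment value, else the default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    env_value = environ.get(env_name)
    if env_value not in (None, ""):
        return env_value
    return default


env = {"MLXK2_PORT": "9000"}
print(resolve_setting("8080", "MLXK2_PORT", "8000", env))  # CLI flag wins
print(resolve_setting(None, "MLXK2_PORT", "8000", env))    # environment variable
print(resolve_setting(None, "MLXK2_PORT", "8000", {}))     # documented default
```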
Example:
# CLI flag wins over environment variable
MLXK2_PORT=9000 mlxk serve --port 8080 # Uses port 8080, not 9000
MLX-Knife 2.0 respects standard HuggingFace cache structure and practices:
- Read operations (list, health, show) are always safe with concurrent processes
- Write operations (pull, rm) should be coordinated during maintenance windows
- Lock cleanup is automatic, but avoid it during active downloads
- Your responsibility: Coordinate with team, use good timing
# Check what's in cache (always safe)
mlxk list --json | jq '.data.count'
# Maintenance window - coordinate with team
mlxk rm "corrupted-model" --json --force
mlxk pull "replacement-model" --json
# Back to normal operations
mlxk health --json | jq '.data.summary'
A workspace is a self-contained directory containing model files in a flat structure (not the HuggingFace cache format). Workspaces are portable, editable, and can be health-checked standalone.
Structure:
workspace/
├── config.json # Model configuration
├── tokenizer.json # Tokenizer definition
├── tokenizer_config.json # Tokenizer settings
├── model.safetensors # Weights (single file)
├── (or model-*.safetensors) # Weights (multi-shard)
└── README.md # Optional documentation
Key characteristics:
| Aspect | Workspace | HuggingFace Cache |
|---|---|---|
| Structure | Flat, self-contained | Nested (hub/models--org--repo/snapshots/...) |
| Models | Exactly one model per workspace | Many models (models--org--repo1, models--org--repo2, ...) |
| Purpose | Portable working directory | Download cache (managed) |
| Health Check | Standalone (no cache needed) | Requires cache structure |
| Portability | Goal: USB stick, SMB share, any volume | Fixed location (HF_HOME) |
| Ownership | User owns files | Managed by HuggingFace Hub |
| Operations | clone (creates), push (uploads from) | pull (downloads to) |
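The flat layout makes a workspace easy to recognize from a script. The check below is a rough sketch based only on the structure listed above (a `config.json` plus at least one `*.safetensors` file at the top level); `looks_like_workspace` is a hypothetical helper and is much weaker than `mlxk health`.

```python
from pathlib import Path
import tempfile


def looks_like_workspace(path):
    """Hypothetical check: flat directory with config.json and at least one *.safetensors file."""
    root = Path(path)
    if not (root / "config.json").is_file():
        return False
    return any(root.glob("*.safetensors"))


with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    empty_result = looks_like_workspace(ws)       # nothing in the directory yet
    (ws / "config.json").write_text("{}")
    (ws / "model.safetensors").write_bytes(b"")   # placeholder file, not real weights
    full_result = looks_like_workspace(ws)

print(empty_result, full_result)
```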
Portability (Phase 1 limitation):
- Current: Same APFS volume as cache (CoW optimization)
- Community Goal: Any location (USB stick, SMB share, different volumes)
- Future: Cross-volume support planned
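Whether two paths share a volume can be checked by comparing device IDs. This is one common approach, shown as a sketch; it is an assumption that a check like this is useful here, and the text above does not say how mlx-knife detects the volume. `same_volume` is a hypothetical helper.

```python
import os
import tempfile


def same_volume(path_a, path_b):
    """True when both existing paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


with tempfile.TemporaryDirectory() as tmp:
    sub = os.path.join(tmp, "workspace")
    os.mkdir(sub)
    result = same_volume(tmp, sub)  # a directory and its child share a device

print(result)
```

In practice the two arguments would be the cache directory and the intended workspace parent; both must exist for `os.stat` to succeed.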
Typical workflow:
- mlxk pull org/model → Downloads to cache
- mlxk clone org/model workspace/ → Creates editable workspace copy
- Edit files in workspace/ (modify config, quantize, etc.)
- mlxk push workspace/ org/new-model → Upload modified version
- (Optional) Copy workspace to USB stick for sharing
mlxk clone creates a local workspace from a cached model for modification and development.
- Creates isolated workspace from cached models
- Shorthand: When MLXK_WORKSPACE_HOME is set, target is optional (org prefix stripped automatically)
- Supports APFS copy-on-write optimization on same-volume scenarios; falls back to regular copy cross-volume
- Includes health check integration for workspace validation
- Resumable: Interrupted pulls resume automatically
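The shorthand rule (strip the org prefix, place the result under the workspace home) can be sketched as follows. `shorthand_target` is a hypothetical helper that mirrors the documented mapping; it is not the function mlx-knife uses, and `/models` stands in for `$MLXK_WORKSPACE_HOME`.

```python
from pathlib import PurePosixPath


def shorthand_target(repo_id, workspace_home, name=None):
    """Hypothetical helper: workspace_home/<name>, defaulting to the repo name without its org prefix."""
    leaf = name if name is not None else repo_id.rsplit("/", 1)[-1]
    return PurePosixPath(workspace_home) / leaf


print(shorthand_target("mlx-community/pixtral-12b-bf16", "/models"))
print(shorthand_target("mlx-community/pixtral-12b-bf16", "/models", "my-name"))
```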
Examples:
# Shorthand (MLXK_WORKSPACE_HOME set):
mlxk clone mlx-community/pixtral-12b-bf16 # → $MLXK_WORKSPACE_HOME/pixtral-12b-bf16
mlxk clone mlx-community/pixtral-12b-bf16 my-name # → $MLXK_WORKSPACE_HOME/my-name
# Explicit path:
mlxk clone org/model ./workspace # → ./workspace
mlxk push uploads a local folder to a Hugging Face model repository using huggingface_hub/upload_folder.
- Requires HF_TOKEN (write-enabled).
- Default branch: main (explicitly override with --branch).
- Safety: --private is required to avoid accidental public uploads.
- No validation or manifests. Basic hard excludes are applied by default: .git/**, .DS_Store, __pycache__/, common virtualenv folders (.venv/, venv/), and *.pyc.
- .hfignore (gitignore-like) in the workspace is supported and merged with the defaults.
- Repo creation: use --create if the target repo does not exist; harmless on existing repos. Missing branches are created during upload.
- JSON output: includes commit_sha, commit_url, no_changes, uploaded_files_count (when available), local_files_count (approx), change_summary and a short message.
- Quiet JSON by default: with --json (without --verbose) progress bars/console logs are suppressed; hub logs are still captured in data.hf_logs.
- Human output: derived from JSON; add --verbose to include extras such as the commit URL or a short message variant. JSON schema is unchanged.
- Local workspace check: use --check-only to validate a workspace without uploading. Produces workspace_health in JSON (no token/network required).
- Dry-run planning: use --dry-run to compute a plan vs remote without uploading. Returns dry_run: true, dry_run_summary {added, modified:null, deleted}, and sample added_files/deleted_files.
- Testing: see TESTING.md ("Push Testing (2.0)") for offline tests and opt-in live checks with markers/env.
- Carefully review the result on the Hub after pushing.
- Responsibility: You are responsible for complying with Hugging Face Hub policies and applicable laws (e.g., copyright/licensing) for any uploaded content.
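To preview which files the default excludes would drop before pushing, the documented patterns can be approximated locally. This sketch is an approximation: it covers only the default excludes listed above, ignores any `.hfignore` rules, and `is_excluded` is a hypothetical helper rather than the matching logic used by mlx-knife or huggingface_hub.

```python
from pathlib import PurePosixPath
import fnmatch

# Directory names and file patterns taken from the default excludes listed above.
EXCLUDED_DIRS = {".git", "__pycache__", ".venv", "venv"}
EXCLUDED_FILE_PATTERNS = [".DS_Store", "*.pyc"]


def is_excluded(rel_path):
    """True if a workspace-relative path falls under one of the default excludes."""
    parts = PurePosixPath(rel_path).parts
    if not parts:
        return False
    # Any excluded directory anywhere above the file excludes it.
    if any(part in EXCLUDED_DIRS for part in parts[:-1]):
        return True
    return any(fnmatch.fnmatch(parts[-1], pat) for pat in EXCLUDED_FILE_PATTERNS)


files = [
    "config.json",
    "model.safetensors",
    ".git/HEAD",
    ".DS_Store",
    "tools/__pycache__/x.cpython-312.pyc",
    "venv/bin/python",
    "README.md",
]
kept = [f for f in files if not is_excluded(f)]
print(kept)
```

For an authoritative answer, `--dry-run` reports what would actually be uploaded.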
Example:
# Upload to private repo
mlxk push --private ./workspace org/model --create --commit "init"
Transform models with --repair-index (fix broken mlx-vlm conversions) or --quantize (reduce model size).
Repair workflow (mlx-vlm #624 affected models):
# With MLXK_WORKSPACE_HOME (bare names):
mlxk clone mlx-community/Qwen2.5-VL-7B-Instruct-4bit
mlxk convert Qwen2.5-VL-7B-Instruct-4bit ws-qwen-fixed --repair-index
mlxk health ws-qwen-fixed
# Or with explicit paths:
mlxk clone mlx-community/Qwen2.5-VL-7B-Instruct-4bit ./ws-qwen
mlxk convert ./ws-qwen ./ws-qwen-fixed --repair-index
Quantize workflow:
# With MLXK_WORKSPACE_HOME (bare names):
mlxk clone mlx-community/Llama-3.2-1B-Instruct
mlxk convert Llama-3.2-1B-Instruct llama-4bit --quantize 4
# Or with custom group size (32 = better quality, larger file)
mlxk convert Llama-3.2-1B-Instruct llama-4bit-g32 --quantize 4 --q-group-size 32
Supported bits: 2, 3, 4, 6, 8
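The "smaller group size means larger file" trade-off follows from the per-group metadata in group-wise quantization. The estimate below is a rough sketch under stated assumptions: each group stores one 16-bit scale and one 16-bit bias, every parameter is quantized (real models usually keep some layers in higher precision), and a group size of 64 is used only as a comparison point. The function names are illustrative.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Bits per weight including the per-group scale and bias overhead (assumed 16 bits each)."""
    return bits + (scale_bits + bias_bits) / group_size


def estimated_size_gb(num_params, bits, group_size):
    """Rough weight size in GB if every parameter were quantized."""
    return num_params * effective_bits_per_weight(bits, group_size) / 8 / 1e9


for group_size in (64, 32):
    bpw = effective_bits_per_weight(4, group_size)
    size = estimated_size_gb(1.0e9, 4, group_size)
    print(f"group_size={group_size}: {bpw} bits/weight, ~{size:.2f} GB per 1B params")
```

Under these assumptions, halving the group size from 64 to 32 at 4 bits raises the cost from 4.5 to 5.0 bits per weight.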
Key features:
- Cache sanctity: Hard blocks writes to HF cache (workspaces only)
- Cross-volume: Works across APFS volumes, SMB, NFS (with fallback copy)
- Health check integration: Automatic validation (skip with --skip-health)
- APFS CoW: Instant, space-efficient cloning when on same volume
Future modes: --dequantize, vision model quantization (planned).
Pipe mode is beta (feature complete) and requires MLXK2_ENABLE_PIPES=1. It lets mlxk run (and mlx-run) read stdin when you pass - as the prompt.
- Status: Beta (feature complete), API stable (syntax will not change)
- Gate: MLXK2_ENABLE_PIPES=1 (will become default in a future stable release)
- Auto-batch: When stdout is a pipe (non-TTY), streaming is disabled automatically for clean output
- Robust: Handles SIGPIPE and BrokenPipeError gracefully (| head, | grep -m1 work correctly)
- Scope: Applies to mlxk run and mlx-run; other commands unchanged
- Usage examples (replace <model> with a cached MLX chat model):
# stdin + trailing text (batch when piped)
# (the variable must prefix mlxk run, not echo, to reach the right process)
echo "from stdin" | MLXK2_ENABLE_PIPES=1 mlxk run "<model>" - "append extra context"
# list → run summarization
MLXK2_ENABLE_PIPES=1 mlxk list --json \
| MLXK2_ENABLE_PIPES=1 mlxk run "<model>" - "Summarize the model list as a concise table." >my-hf-table.md
# Wrapper shorthand
MLXK2_ENABLE_PIPES=1 mlx-run "<model>" - "translate into german" < README.md
# Vision → Text chain: Photo tour review
MLXK2_ENABLE_PIPES=1 mlxk run pixtral --image photos/*.jpg "Describe each picture" \
| MLXK2_ENABLE_PIPES=1 mlxk run qwen3 - \
"Write a tour review. Create a table with picture names, metadata, and descriptions." \
> tour-review.md
The 2.0 test suite runs by default (pytest discovery points to tests_2.0/):
# Run 2.0 tests (default)
pytest -v
# Explicitly run legacy 1.x tests (not maintained on this branch)
pytest tests/ -v
# Test categories (2.0 example):
# - ADR-002 edge cases
# - Integration scenarios
# - Model naming logic
# - Robustness testing
# Current status: all current 2.0 tests pass (some optional schema tests may be skipped without extras)
Test Architecture:
- Isolated Cache System - Zero risk to user data
- Atomic Context Switching - Production/test cache separation
- Mock Models - Realistic test scenarios
- Edge Case Coverage - All documented failure modes tested
- Streaming note: Some UIs buffer SSE; verify real-time with curl -N. Server sends clear interrupt markers on abort.
This branch follows the established MLX-Knife development patterns:
# Run quality checks
./test-multi-python.sh # Tests across Python 3.9-3.14
./run_linting.sh # Code quality validation
# Key files:
mlxk2/ # 2.0.0 implementation
tests_2.0/ # 2.0 test suite
docs/ADR/ # Architecture decision records
See CONTRIBUTING.md for detailed guidelines.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- API Specification: JSON API Specification
- Documentation: See docs/ directory for technical details
- Security Policy: See SECURITY.md
Apache License 2.0 — see LICENSE (root) and mlxk2/NOTICE.
- Built for Apple Silicon using the MLX framework
- Models hosted by the MLX Community on HuggingFace
- Inspired by ollama's user experience
Made with ❤️ by The BROKE team
Version 2.0.6 | May 2026
Supported by Anthropic Claude Code
💬 Web UI: nChat - lightweight chat interface •
🔮 Multi-node: BROKE Cluster

