Qwenvert

Run Claude Code with a local LLM on your Mac. Keep your code private.

Qwenvert lets you use Claude Code CLI with a completely local LLM (Qwen2.5-Coder) instead of Anthropic's API. Your code never leaves your machine.

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│ Claude Code │ --> │   Qwenvert   │ --> │ Local Qwen  │
│     CLI     │     │   (adapter)  │     │    Model    │
└─────────────┘     └──────────────┘     └─────────────┘
                         :8088              (via Ollama)

Why? Privacy. Security. Compliance. Zero inference costs. No internet required.

⚡ 5-Minute Quick Start

1. Install

Requirements:

Mac with M1/M2/M3 chip (8GB RAM minimum)
Python 3.9-3.12 (check: python3 --version)
Ollama or llama.cpp

Install from PyPI:

pip install qwenvert

Or install from source:

git clone https://github.com/kmesiab/qwenvert.git
cd qwenvert
pip install -e .

macOS Users (Python 3.11+): If you see an "externally-managed-environment" error, you have two options:

Option 1 (Recommended for development):
git clone https://github.com/kmesiab/qwenvert.git
cd qwenvert
make venv           # Creates .venv virtual environment
source .venv/bin/activate
make install-dev    # Installs qwenvert + dev dependencies
Option 2 (Recommended for end users):
pipx install qwenvert  # Installs in isolated environment
# Install pipx first if needed: brew install pipx
This is due to PEP 668 which protects system Python on modern macOS.

2. Setup (One Command - Zero Friction!)

qwenvert init

This will automatically (no prompts!):

✅ Detect your hardware (chip, RAM, cooling)
✅ Install llama-server binary if needed (~50MB)
✅ Pick the best model for your Mac
✅ Download the model from HuggingFace (~4GB)
✅ Configure everything automatically

First run takes 2-5 minutes (downloads binaries & models). Subsequent runs are instant.

Example output:

Qwenvert Initialization

✓ Detected: M1 Pro, 16GB RAM, 16 GPU cores, Active cooling
✓ Selected: Qwen2.5 Coder 7B Q5
✓ Downloading from HuggingFace...
✓ Model downloaded: ~/.qwenvert/models/qwen25-coder-7b-q5.gguf (4.2GB)
✓ Configuration saved: ~/.config/qwenvert/config.yaml

Next step: qwenvert start

3. Start Qwenvert

qwenvert start

You'll see:

Starting Qwenvert

✓ Backend: Ollama with qwen2.5-coder:7b
✓ Backend server: http://localhost:11434 (healthy)
✓ Qwenvert adapter: http://localhost:8088
✓ Ready for Claude Code!

Configure Claude Code:
  export ANTHROPIC_BASE_URL=http://localhost:8088
  export ANTHROPIC_API_KEY=local-qwen
  export ANTHROPIC_MODEL=qwenvert-default

Leave this terminal running.

Missing Dependencies? If Ollama isn't installed, qwenvert will offer to install it automatically:

qwenvert start

You'll see:

======================================================================
  Missing Dependency: Ollama
======================================================================

Ollama is not installed (required for running local models)

To install Ollama using Homebrew:
  1. Run: brew install ollama
  2. Wait for installation to complete
  3. Run: qwenvert init

Learn more: https://ollama.ai

======================================================================

Would you like to install Ollama automatically using Homebrew? [Y/n]:

Non-interactive mode:

qwenvert start --auto-install

Automatically installs missing dependencies via Homebrew without prompting.

Note: Auto-installation only works for supported dependencies (Ollama, llama.cpp) when Homebrew is available.

4. Configure Claude Code (New Terminal)

export ANTHROPIC_BASE_URL=http://localhost:8088
export ANTHROPIC_API_KEY=local-qwen
export ANTHROPIC_MODEL=qwenvert-default

claude

That's it! Claude Code now uses your local model. Your code stays on your machine.

What Just Happened?

Without qwenvert (default):

Claude Code → api.anthropic.com → Claude Sonnet/Opus
              (internet)           (cloud)
              💰 Costs money      ☁️ Code leaves machine

With qwenvert (configured):

Claude Code → localhost:8088 → Ollama → Qwen Model
              (no internet)     (local)  (your Mac)
              💰 Free            🔒 Code stays local

Claude Code doesn't know the difference - it just uses whatever ANTHROPIC_BASE_URL points to!

📖 How to Use

Basic Workflow

# Start qwenvert (terminal 1)
qwenvert start

# Use Claude Code (terminal 2)
export ANTHROPIC_BASE_URL=http://localhost:8088
export ANTHROPIC_API_KEY=local-qwen
export ANTHROPIC_MODEL=qwenvert-default
claude

# When done, stop qwenvert
qwenvert stop

Make Environment Variables Permanent

Add to your ~/.zshrc or ~/.bashrc:

# Qwenvert - Local Claude Code
export ANTHROPIC_BASE_URL=http://localhost:8088
export ANTHROPIC_API_KEY=local-qwen
export ANTHROPIC_MODEL=qwenvert-default

Then reload: source ~/.zshrc

Now claude will automatically use qwenvert!

Verify Claude Code is Using Qwenvert

After setting environment variables, verify the setup:

# Check environment variables are set
echo $ANTHROPIC_BASE_URL
# Should show: http://localhost:8088

echo $ANTHROPIC_API_KEY
# Should show: local-qwen

echo $ANTHROPIC_MODEL
# Should show: qwenvert-default

# Make sure qwenvert is running
curl http://localhost:8088/health
# Should return: {"status":"healthy","backend":"connected"}

# Test with Claude Code
claude
# In Claude Code, ask: "What model are you?"
# It should respond as Qwen2.5-Coder (though it might say Claude)

How to tell it's working:

✅ Claude Code starts without asking for an API key
✅ Responses come quickly (no network delay)
✅ qwenvert monitor shows requests appearing
✅ Works offline (disconnect wifi and try)

If it's NOT working:

❌ "Invalid API key" error → Check ANTHROPIC_API_KEY=local-qwen
❌ "Connection refused" → Check ANTHROPIC_BASE_URL and qwenvert is running
❌ "Model not found" → Check ANTHROPIC_MODEL=qwenvert-default

🎯 Common Commands

Check Status

qwenvert status

Output:

Qwenvert Status

Configuration
  Model:              qwen2.5-coder-7b-q5
  Backend:            ollama
  Backend URL:        http://localhost:11434
  Adapter:            http://localhost:8088
  Context Length:     32,768 tokens

Server Health:
  Backend:  ✓ Running
  Adapter:  ✓ Running

Monitor Performance (Optional)

qwenvert monitor

Shows a live dashboard with:

Requests per second
Token generation speed
System resources (CPU, memory, temp)
Recent request history

OpenTelemetry Support: The monitor now uses OpenTelemetry-compliant metrics. Enable OTLP export for integration with observability platforms:

# Enable with local OTLP collector (secure)
export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
qwenvert monitor --enable-otel

See TELEMETRY_SECURITY.md for complete security details.

Press Ctrl+C to exit.

Binary Management Commands

Check llama-server installation:

qwenvert binary info

Output:

┌──────────────┬────────────────────────────────────────┐
│ Property     │ Value                                  │
├──────────────┼────────────────────────────────────────┤
│ Path         │ ~/.cache/qwenvert/bin/llama-server     │
│ Version      │ b3600                                  │
│ Source       │ downloaded                             │
│ Architecture │ arm64                                  │
│ Valid        │ ✓ Yes                                  │
└──────────────┴────────────────────────────────────────┘

List available versions:

qwenvert binary list

Install specific version:

qwenvert binary install --version b3600

Update to latest:

qwenvert binary update

Verify integrity:

qwenvert binary verify

Rollback to backup:

qwenvert binary rollback

Detect Available Backends

qwenvert backends

Shows which backends (MLX, llama.cpp, Ollama) are available on your system and recommends the fastest option.

Example output on Apple Silicon:

Available Backends:
✓ MLX v0.10.0 (recommended - fastest on Apple Silicon)
✓ llama.cpp b3600 (available)
✗ Ollama (not installed)

List Available Models

qwenvert models list

Output:

Available Models

ID                           Size    RAM    Context
qwen2.5-coder-7b-q4          4.1GB   8GB    32K
qwen2.5-coder-7b-q5          4.8GB   16GB   32K
qwen2.5-coder-14b-q4         8.5GB   16GB   32K
qwen2.5-coder-14b-q5         10GB    32GB   32K

Clean Up Downloaded Models

Remove downloaded model files to free disk space:

# Interactive selection
qwenvert models clean

# Remove specific model
qwenvert models clean --model-id qwen2.5-coder-7b-instruct-q4_k_m.gguf

# Remove all models (with confirmation)
qwenvert models clean --all

# Preview what would be deleted (dry run)
qwenvert models clean --dry-run

Example output:

Model Cleanup

Models disk usage: 12.3 GB
Available disk space: 45.2 GB

Downloaded models:

  1. qwen2.5-coder-7b-instruct-q4_k_m.gguf (4.2 GB)
  2. qwen2.5-coder-14b-instruct-q5_k_m.gguf (8.1 GB)
  3. All models
  4. Cancel

Enter number(s) separated by commas: 1

Models to be deleted:

Filename                                   Size
qwen2.5-coder-7b-instruct-q4_k_m.gguf     4.2 GB

Total space to free: 4.2 GB

Delete these models? [y/N]: y

✓ Cleanup complete! Deleted 1 model(s), freed 4.2 GB

Check Your Hardware

qwenvert hardware

Output:

Hardware Information

Chip:               M1 Pro
Total Memory:       16GB
GPU Cores:          16
Performance Cores:  8
Cooling:            Active (fan)
Recommended:        32K tokens context

📦 Dependencies & Auto-Installation

Required Dependencies

Qwenvert requires one of these backends to run:

Ollama (recommended) - Easy to install via Homebrew: brew install ollama
llama.cpp - Manual build required, see llama.cpp docs

Supported Auto-Install Dependencies

When you run qwenvert start, it automatically detects missing dependencies and offers to install them via Homebrew. The following dependencies support auto-installation:

Dependency	Package Name	Installation Command
Ollama	`ollama`	`brew install ollama`
llama.cpp	`llama.cpp`	(Not yet supported for auto-install)

Security Note: Auto-installation only works for whitelisted dependencies defined in ALLOWED_AUTO_INSTALL_DEPENDENCIES. This prevents accidental installation of arbitrary packages.

Auto-Install Modes

Interactive (default):

qwenvert start
# Prompts: "Would you like to install Ollama automatically using Homebrew? [Y/n]:"

Non-interactive (CI/automation):

qwenvert start --auto-install
# Automatically installs without prompting

Manual installation (traditional):

# Install Ollama manually
brew install ollama

# Then start qwenvert
qwenvert start

Checking Dependencies

To check if dependencies are installed, qwenvert automatically detects them when you run commands. You can also manually check:

which ollama        # Check if Ollama is in PATH
ollama --version    # Verify Ollama version

Adding More Dependencies

Currently, only Ollama and llama.cpp are supported as backends. Other dependencies (like Homebrew itself) require manual installation.

If you need support for additional backends, please open an issue.

🔧 Advanced Usage

Use a Specific Model

# List models
qwenvert models list

# Re-initialize with different model
qwenvert init --model qwen2.5-coder-14b-q5

# Restart
qwenvert stop
qwenvert start

Use llama.cpp Instead of Ollama

# Initialize with llama.cpp backend
qwenvert init --backend llamacpp

# Start (same command)
qwenvert start

Why llama.cpp?

More control over inference parameters
Slightly faster on some Macs
Lower memory overhead

Why Ollama? (default)

Easier to install
Better model management
More beginner-friendly

Custom Context Length

# Longer context = more memory
qwenvert init --context-length 65536  # 64K tokens

# Shorter context = less memory
qwenvert init --context-length 16384  # 16K tokens

Rule of thumb:

8GB Mac: 16K max
16GB Mac: 32K safe
32GB+ Mac: 64K works

❓ Troubleshooting

"Connection refused" when starting Claude Code

Check if qwenvert is running:

curl http://localhost:8088/health

Should return:

{"status": "healthy", "backend": "connected"}

If not running:

qwenvert start

Model download fails

Problem: HuggingFace download interrupted

Solution:

# Try again (downloads resume automatically)
qwenvert init

# Or download manually and place in ~/.qwenvert/models/

Slow response times

Check memory usage:

qwenvert status

Solutions:

Use smaller model:

qwenvert init --model qwen2.5-coder-7b-q4

Reduce context length:
```
qwenvert init --context-length 16384
```
Close other apps to free RAM

Expected speeds:

8GB Mac: 15-20 tokens/sec
16GB Mac: 25-35 tokens/sec
32GB+ Mac: 30-40 tokens/sec

MacBook Air overheating

Enable thermal pacing:

Edit ~/.config/qwenvert/config.yaml:

thermal_pacing: true
thermal_threshold: 70  # Celsius

Or re-run init with thermal protection:

qwenvert init --thermal-pacing

Can't install - Python version error

Problem: Python 3.13 not supported yet

Solution: Use Python 3.12 or earlier

# Check version
python3 --version

# Install Python 3.12 via Homebrew
brew install python@3.12

# Use it
pip3.12 install -e .

Environment variables not persisting

Problem: Variables reset when you close terminal

Solution: Add to shell config

# Open your shell config
nano ~/.zshrc  # or ~/.bashrc for bash

# Add these lines
export ANTHROPIC_BASE_URL=http://localhost:8088
export ANTHROPIC_API_KEY=local-qwen
export ANTHROPIC_MODEL=qwenvert-default

# Save and reload
source ~/.zshrc

"externally-managed-environment" error on install

Problem: pip install fails with error about externally managed environment

macOS Python 3.11+ Context: Apple now protects system Python to prevent breaking macOS tools. This is PEP 668 in action.

Solution 1 - Virtual Environment (Recommended for development):

# Clone the repository
git clone https://github.com/kmesiab/qwenvert.git
cd qwenvert

# Create and activate virtual environment
make venv
source .venv/bin/activate

# Install
make install-dev

Solution 2 - pipx (Recommended for end users):

# Install pipx if needed
brew install pipx

# Install qwenvert in isolated environment
pipx install qwenvert

Solution 3 - Disable protection (NOT recommended):

# This breaks the system protection - avoid unless you know what you're doing
pip install qwenvert --break-system-packages

Why virtual environments?

Isolated dependencies (won't conflict with other projects)
Easy to delete and recreate if something breaks
Standard Python best practice
Doesn't require disabling system protections

🔒 Privacy & Security

What Data Stays Local?

Everything. Qwenvert is designed for security-conscious developers.

✅ Your code - Never sent to any server ✅ Prompts - Processed only on your Mac ✅ Responses - Generated locally ✅ Model weights - Stored in ~/.qwenvert/models/

How We Guarantee This

Localhost-only binding - Adapter listens on 127.0.0.1 only (not accessible from network)
No external calls - Code explicitly blocks external connections
Telemetry security - All telemetry exporters disabled by default; OTLP endpoints validated to be localhost-only (see TELEMETRY_SECURITY.md)
Test-proven - 23 dedicated security tests verify isolation and telemetry safety
Transparent code - Full source available for audit

Perfect for:

HIPAA/SOC2 compliance
Proprietary code bases
Air-gapped development
Security research
Offline work

📊 Performance Expectations

What to Expect

Mac Type	Model	Speed	Memory	Context
8GB M1 (Air)	7B Q4	15-20 t/s	~4GB	16K tokens
16GB M1 Pro	7B Q5	25-35 t/s	~6GB	32K tokens
32GB M1 Max	14B Q5	20-30 t/s	~12GB	64K tokens

t/s = tokens per second

Compared to Cloud APIs

Feature	Qwenvert	Claude API
Speed	20-35 t/s	40-60 t/s
Latency	~0ms (local)	100-300ms (network)
Cost	$0/month	$15-300/month
Privacy	100% local	Cloud
Offline	✅ Yes	❌ No
Code quality	Good	Excellent

Best for: Security/privacy-critical work, cost-sensitive projects, offline development

Not ideal for: Highest code quality, fastest possible responses

🎓 Understanding Qwenvert

What Is It?

Qwenvert is an HTTP adapter that sits between Claude Code CLI and your local LLM:

Claude Code → Qwenvert → Ollama/llama.cpp → Qwen Model

Not just config - It's a full translation layer:

Translates Anthropic API → Ollama/llama.cpp format
Converts responses back to Anthropic format
Handles streaming (Server-Sent Events)
Manages backend processes
Monitors performance

Why Not Use Ollama Directly?

Ollama has basic Anthropic API support, but:

❌ Limited streaming support
❌ Missing some API features
❌ No thermal management
❌ No hardware optimization
❌ Can't switch backends easily

Qwenvert provides:

✅ Full Anthropic Messages API
✅ Works with Ollama or llama.cpp
✅ Thermal monitoring for MacBook Air
✅ Hardware-aware model selection
✅ Easy to extend with new backends

Performance & Backend Comparison

Qwenvert supports three backends for running local LLMs: MLX (fastest on Apple Silicon), llama.cpp (fast and cross-platform), and Ollama (easiest setup).

Benchmark Results

Backend	Throughput	Performance vs Ollama	Best For
MLX	~230 tok/s	1.5-2x faster than llama.cpp	Apple Silicon (M1-M5), Python integration
llama.cpp	~150 tok/s	3-7x faster than Ollama	Production, cross-platform
Ollama	20-40 tok/s	Baseline	Quick testing, simple setup

Benchmarks from vLLM-MLX (2026) and Comparative Study (2025)

Why MLX is Fastest on Apple Silicon

MLX (Apple's ML framework) is purpose-built for Apple Silicon and provides:

Native Metal GPU Acceleration: Direct access to M-series GPU/Neural Engine
Unified Memory Optimization: Efficient use of Apple's unified memory architecture
M5 Neural Accelerators: Only framework that leverages M5's new GPU Neural Accelerators (3.5-4x faster prefill)
Content-Based Prefix Caching: 28x speedup on repeated image queries (multimodal models)
Lower Latency: ~1.5-2x faster than llama.cpp on same hardware

MLX is automatically recommended on Apple Silicon if available.

Why llama.cpp is Faster than Ollama

llama.cpp provides direct Metal GPU acceleration for Apple Silicon, while Ollama adds a Go wrapper layer that introduces overhead:

Metal Acceleration: 2.4x speedup over CPU-only inference (source)
Optimized for Apple Silicon: Full GPU layer offload (-ngl 99)
Continuous Batching: Better throughput for multiple requests
Lower Memory Overhead: Direct model access without wrapper

Apple Silicon Performance by Model

Mac Model	RAM	Model Size	MLX Throughput	llama.cpp Throughput	Expected Response Time
M1 Air	8GB	1.5B Q4	45-60 tok/s	30-40 tok/s	<1 second
M1 Pro	16GB	7B Q4	63-70 tok/s	28-35 tok/s	1-2 seconds
M2 Max	32GB	14B Q4	48-55 tok/s	22-30 tok/s	2-3 seconds
M3 Pro/Max	18GB+	7B Q4	65-75 tok/s	28-35 tok/s	1-2 seconds
M4 Max	48GB+	7B Q4	525 tok/s	150 tok/s	<1 second
M5 Pro/Max	24GB+	7B Q4	800+ tok/s*	150 tok/s	<1 second

*M5 performance based on Apple's official benchmarks with MLX as canonical runtime Performance data from vLLM-MLX research, llama.cpp benchmarks, and Apple ML Research

Choosing a Backend

Use MLX (fastest) if:

✅ You're on Apple Silicon (M1-M5)
✅ You want maximum performance (1.5-2x faster than llama.cpp)
✅ You want native Metal GPU acceleration
✅ You need multimodal support (vision models)

Use llama.cpp (cross-platform) if:

✅ You want great performance (3-7x faster than Ollama)
✅ You need cross-platform support
✅ You're comfortable with command-line tools

Use Ollama (easiest) if:

✅ You prefer simpler setup (one-line install)
✅ You already have Ollama installed
✅ Performance is not critical

To switch backends:

qwenvert init --backend llamacpp  # Use llama.cpp (default, fastest production backend)
qwenvert init --backend ollama    # Use Ollama

MLX Backend (Experimental - Not Yet User-Selectable)

The MLX backend infrastructure is implemented but not yet available via CLI. MLX requires router/launcher integration for in-process execution. Once complete, it will provide 1.5-2x faster inference on Apple Silicon (M1-M5) compared to llama.cpp.

Current status:

✅ Backend detection and installation
✅ Model registry (5 MLX models)
❌ Router integration (blocked by in-process execution model)
❌ CLI selection (disabled until router complete)

For production use, stick with llama.cpp or Ollama backends.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Claude Code CLI                         │
└────────────────────────┬────────────────────────────────────┘
                         │
                    POST /v1/messages
                         │
┌────────────────────────▼────────────────────────────────────┐
│                 Qwenvert HTTP Adapter                        │
│                     (localhost:8088)                         │
│  • Validates requests                                        │
│  • Translates Anthropic → Backend format                    │
│  • Handles streaming (SSE)                                   │
│  • Monitors performance                                      │
└────────────────────────┬────────────────────────────────────┘
                         │
                Backend-specific API
                         │
┌────────────────────────▼────────────────────────────────────┐
│              Ollama or llama.cpp Server                      │
│                  (localhost:11434 or :8080)                  │
└────────────────────────┬────────────────────────────────────┘
                         │
                  ┌──────▼───────┐
                  │  Qwen Model  │
                  │    (GGUF)    │
                  └──────────────┘

🚀 Next Steps

After Installation

Optimize for your use case:
- Heavy coding? Use Q5 quantization for better quality
- Low RAM? Use Q4 quantization to save memory
- Need speed? Use llama.cpp backend

Set up convenience aliases:

# Add to ~/.zshrc
alias qw-start='qwenvert start'
alias qw-stop='qwenvert stop'
alias qw-status='qwenvert status'

Monitor performance:
```
qwenvert monitor
```
Read advanced docs:
- ARCHITECTURE.md - How it works
- SIMPLIFIED_ARCHITECTURE.md - Beginner-friendly overview

💡 Tips & Best Practices

For Best Performance

Close other apps when running inference
Use appropriate model size for your RAM
Monitor temperature on MacBook Air (use qwenvert monitor)
Don't use Rosetta - qwenvert is native Apple Silicon

For Best Code Quality

Use Q5 quantization if you have 16GB+ RAM
Give it more context - longer prompts = better results
Be specific in your prompts (same as with Claude)
Iterate - local models benefit from refinement

For Development

Keep qwenvert running in a dedicated terminal
Check logs if something seems wrong: qwenvert status
Update models periodically - new versions improve quality
Share feedback - open issues for bugs/improvements

📊 Performance Benchmarks

Measure qwenvert performance on your Mac:

# Start qwenvert
qwenvert start

# Run benchmarks (separate terminal)
make benchmark

What it tests:

Different prompt lengths (short, medium, long)
Streaming vs non-streaming
Different token limits (50, 100, 200)
Code generation tasks

Metrics:

Latency (ms)
Throughput (tokens/sec)
Time to first token (TTFT)
Success rate

Example output:

┌────────────────┬─────────┬──────┬─────────┬────────┬─────────┬────────┐
│ Benchmark      │ Backend │ Quant│ Latency │ Tokens │ Speed   │ Status │
├────────────────┼─────────┼──────┼─────────┼────────┼─────────┼────────┤
│ prompt_short   │ ollama  │ Q4_K │ 1234ms  │ 5      │ 4.1 t/s │   ✓    │
│ prompt_medium  │ ollama  │ Q4_K │ 2456ms  │ 89     │ 36.2t/s │   ✓    │
└────────────────┴─────────┴──────┴─────────┴────────┴─────────┴────────┘

Summary:
  Average latency: 1845ms
  Average throughput: 32.4 tokens/sec

Results saved to benchmarks/results/ for tracking over time.

See benchmarks/README.md for details.

🤝 Contributing

We welcome contributions! Areas where help is needed:

Model support - Add Qwen3-Coder, other model families
Backend support - vLLM, TensorRT-LLM integration
MLX enhancements - Continuous batching, multimodal support
Performance - Optimization for specific Mac models
Testing - More edge cases, hardware configurations
Documentation - Tutorials, examples, translations

See CONTRIBUTING.md for guidelines.

📚 More Documentation

ARCHITECTURE.md - System design and component details
TELEMETRY_SECURITY.md - OpenTelemetry security guarantees and configuration
SIMPLIFIED_ARCHITECTURE.md - Beginner-friendly architecture overview
AGENTS.md - AI agents for development and security auditing
TASKS.md - Development roadmap and task tracking
tests/ - Test suite with 23 dedicated security tests

🙏 Acknowledgments

Qwen Team (Alibaba) - Excellent Qwen2.5-Coder models
Apple ML Team - Metal acceleration, unified memory
llama.cpp community - High-performance inference engine
Ollama team - Making local LLMs accessible
Anthropic - Claude Code CLI and Messages API

📝 License

Apache 2.0 License - see LICENSE

⚠️ Limitations & Disclaimers

Known Limitations

Mac only - Designed for M1/M2/M3 Macs (Intel/Windows not supported)
Python 3.9-3.12 - Python 3.13 not yet compatible
Large downloads - Models are 4-10GB (one-time download)
Code quality - Good, but not as good as Claude Opus/Sonnet
First run slow - Model loading takes 10-30 seconds

Not Affiliated

Qwenvert is an independent project and is not affiliated with, endorsed by, or supported by Anthropic. Claude Code is a trademark of Anthropic.

📖 Research & Methodology

This project implements research-backed development practices for AI agent collaboration:

Repository-Level Instructions

Our AGENTS.md file follows findings from:

"Repository-Level Instructions Enhance AI Assistant Completion and Efficiency" Li et al., 2025. arXiv:2601.20404 https://arxiv.org/abs/2601.20404

Key findings from the research:

28.64% reduction in AI agent task completion time
16.58% reduction in token usage
Repository-level instructions significantly improve code generation accuracy

How we apply it:

Structured project conventions in AGENTS.md
Security-critical rules documented upfront
File modification requirements clearly specified
Specialized agent catalog with use cases

This approach makes qwenvert development more efficient and maintainable when working with AI coding assistants like Claude Code.

Questions? Issues? Feedback?

Open an issue: https://github.com/kmesiab/qwenvert/issues

Built with care for the Mac M1 community 🚀

Name		Name	Last commit message	Last commit date
Latest commit History 92 Commits
.claude/agents		.claude/agents
.github/workflows		.github/workflows
benchmarks		benchmarks
configs		configs
docs		docs
homebrew		homebrew
qwenvert		qwenvert
tests		tests
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
TELEMETRY_SECURITY.md		TELEMETRY_SECURITY.md
UPSTREAM_CHECKSUM_REQUEST.md		UPSTREAM_CHECKSUM_REQUEST.md
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Qwenvert

⚡ 5-Minute Quick Start

1. Install

2. Setup (One Command - Zero Friction!)

3. Start Qwenvert

4. Configure Claude Code (New Terminal)

What Just Happened?

📖 How to Use

Basic Workflow

Make Environment Variables Permanent

Verify Claude Code is Using Qwenvert

🎯 Common Commands

Check Status

Monitor Performance (Optional)

Binary Management Commands

Detect Available Backends

List Available Models

Clean Up Downloaded Models

Check Your Hardware

📦 Dependencies & Auto-Installation

Required Dependencies

Supported Auto-Install Dependencies

Auto-Install Modes

Checking Dependencies

Adding More Dependencies

🔧 Advanced Usage

Use a Specific Model

Use llama.cpp Instead of Ollama

Custom Context Length

❓ Troubleshooting

"Connection refused" when starting Claude Code

Model download fails

Slow response times

MacBook Air overheating

Can't install - Python version error

Environment variables not persisting

"externally-managed-environment" error on install

🔒 Privacy & Security

What Data Stays Local?

How We Guarantee This

📊 Performance Expectations

What to Expect

Compared to Cloud APIs

🎓 Understanding Qwenvert

What Is It?

Why Not Use Ollama Directly?

Performance & Backend Comparison

Benchmark Results

Why MLX is Fastest on Apple Silicon

Why llama.cpp is Faster than Ollama

Apple Silicon Performance by Model

Choosing a Backend

Architecture

🚀 Next Steps

After Installation

💡 Tips & Best Practices

For Best Performance

For Best Code Quality

For Development

📊 Performance Benchmarks

🤝 Contributing

📚 More Documentation

🙏 Acknowledgments

📝 License

⚠️ Limitations & Disclaimers

Known Limitations

Not Affiliated

📖 Research & Methodology

Repository-Level Instructions

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Packages