claude-local

Local Claude Code backup setups for AMD Strix Halo and Intel Core Ultra laptops running Bazzite (Fedora Atomic). Built around llama.cpp's native Anthropic Messages API — no proxies, no router middleware.

When the Anthropic API is down, rate-limited, or you're offline, claude-smart --local hands off to a local model with the same Claude Code experience.

Tested hardware

| Machine | CPU / GPU | RAM | Model | Tokens/sec |
|---|---|---|---|---|
| ASUS ProArt PX13 | Ryzen AI Max+ 395 / Radeon 8060S (gfx1151) | 128 GB unified | Qwen3-Coder-30B-A3B Q4_K_M | 319 prefill / 26 gen |
| ASUS Zenbook Duo | Core Ultra 9 185H (Meteor Lake) / Arc iGPU (Vulkan) | 32 GB | Qwen2.5-Coder-3B Q4_K_M (via Aider) | 250 prefill / 12 gen |

Both machines run Bazzite. Setup details per machine in docs/.

Two-tool architecture: The PX13 runs Claude Code against the local llama-server (full agentic, ROCm GPU). The Duo runs Aider against a local Qwen2.5-Coder-3B on Vulkan (lighter agentic, Arc iGPU). Claude Code's internal request timeout (~10-12 min) is incompatible with iGPU prefill for its 30-55K context injection — Aider's ~5-10K context sidesteps this entirely. See docs/ZENBOOK-DUO.md for the full explanation.

Quick start

1. Pick your hardware guide:
   - PX13 (Strix Halo): primary, full-power setup (Claude Code + ROCm GPU). See docs/PX13-BAZZITE.md.
   - Zenbook Duo (Vulkan iGPU): Aider + Qwen2.5-Coder-3B + Arc iGPU, offline backup. See docs/ZENBOOK-DUO.md.
   - Aider setup: Aider installation and configuration for the Duo. See docs/AIDER-SETUP.md.
   - Vulkan notes: Intel Arc iGPU gotchas and verification steps. See docs/VULKAN-NOTES.md.
   - Tailscale bridge: point the Duo at the PX13 over Tailscale and get full 30B performance anywhere. See docs/TAILSCALE-BRIDGE.md.
2. Walk through the kernel args, distrobox container, and model download steps in your hardware guide.
3. Run the installer:
```
./install.sh
```

4. Start the service and use it:

systemctl --user enable --now llama-server.service
claude-smart --local
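
Right after the unit starts, llama-server may still be loading the model. The following is a minimal sketch of a readiness check, assuming the server listens on localhost:8080 and exposes llama.cpp's /health endpoint, which is expected to return an error status until the model is ready. `wait_for_llama` is a hypothetical helper and is not part of this repo.

```bash
# Hypothetical helper: poll /health until llama-server answers with success.
# Usage: wait_for_llama [URL] [TRIES]
wait_for_llama() (
    url=${1:-http://localhost:8080/health}
    tries=${2:-30}
    i=1
    while [ "$i" -le "$tries" ]; do
        # -f makes curl fail on HTTP error statuses (e.g. while loading).
        if curl -sf -o /dev/null --max-time 2 "$url"; then
            exit 0
        fi
        # Pause between attempts, but not after the last one.
        [ "$i" -lt "$tries" ] && sleep 1
        i=$((i + 1))
    done
    exit 1
)
```

A typical use would be `wait_for_llama && claude-smart --local`, so the wrapper is only launched once the server responds.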

Selecting a model

claude-smart keeps a small registry of local models (MODELS near the top of bin/claude-smart). Each entry maps a short key to an ANTHROPIC_MODEL id and the systemd unit that serves it. Because every model binds :8080, selecting one swaps the active unit — the previously running model is stopped and the chosen one is started.

claude-smart --list-models              # show registered models + which unit is active
claude-smart --local                    # interactive menu (default highlighted)
claude-smart --local --model qwen3-coder-30b   # pick directly, no prompt
CLAUDE_LOCAL_MODEL=qwen2.5-coder-7b claude-smart --local   # same, via env var

Notes:

- Each model needs its own unit installed (llama-server-<key>.service). install.sh installs the model-named unit for your machine's mode alongside the generic llama-server.service. To make a second model selectable, install its unit too (download the model + install/copy a matching unit file).
- When stdin/stdout isn't a terminal (scripts, cron), selection is skipped and DEFAULT_MODEL_KEY is used.
- Add models or change the default without editing the script by creating ~/.config/claude-smart/models.conf. The file is sourced as bash and may redefine the MODELS array and DEFAULT_MODEL_KEY. For example:
```
# ~/.config/claude-smart/models.conf
MODELS[gpt-oss-20b]="gpt-oss-20b|llama-server-gpt-oss-20b.service"
DEFAULT_MODEL_KEY="qwen3-coder-30b"
```
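
To make the registry format and the unit swap concrete, here is a rough sketch of how one entry could be split and which commands a model switch implies. It assumes the `model_id|unit` layout shown in the models.conf example; `entry_model`, `entry_unit`, and `plan_swap` are hypothetical names, and the actual logic in bin/claude-smart may differ. The sketch only prints commands and does not call systemctl.

```bash
# Split one registry entry of the form "model_id|unit".
entry_model() { printf '%s\n' "${1%%|*}"; }   # text before the first "|"
entry_unit()  { printf '%s\n' "${1#*|}"; }    # text after the first "|"

# Print the commands a swap would need: stop the unit currently holding
# :8080 (if any, and if different), then start the chosen one.
plan_swap() (
    current_unit=$1
    entry=$2
    next_unit=$(entry_unit "$entry")
    if [ -n "$current_unit" ] && [ "$current_unit" != "$next_unit" ]; then
        echo "systemctl --user stop $current_unit"
    fi
    echo "systemctl --user start $next_unit"
    echo "export ANTHROPIC_MODEL=$(entry_model "$entry")"
)

# Example: switch from the qwen3-coder-30b unit to gpt-oss-20b.
plan_swap llama-server-qwen3-coder-30b.service \
    "gpt-oss-20b|llama-server-gpt-oss-20b.service"
```

Printing the plan rather than executing it keeps the sketch safe to run on a machine without the units installed.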

Local image & video generation (`local-gen`)

bin/local-gen generates images and short video clips entirely on the local GPU via ComfyUI running in the same llama-rocm distrobox container. It talks to ComfyUI's HTTP API (auto-starting the comfyui.service systemd unit if needed) and shows a live progress bar over ComfyUI's WebSocket.

local-gen --image "a neon-lit alley in the rain, cyberpunk, cinematic"
local-gen --video "a kite surfing at sunset, slow motion"
local-gen --status                 # is ComfyUI up? how much VRAM is free?
local-gen --down                   # unload all models, free VRAM/RAM
local-gen --help-gen               # full option list

| Mode | Model | Notes |
|---|---|---|
| `--image` | FLUX.1-schnell | ~20s at 512², ~2min at 1024² |
| `--video` | Wan2.1-T2V-1.3B | 832×480, ~5min for a 3s clip at 12 steps |

Common options (--width, --height, --steps, --seed, and --duration for video) override the defaults; see --help-gen. Output is written to ~/Pictures/local-gen/ (images as .png, video as animated .webp) and opened with the system viewer.

gfx1151 gotchas (Radeon 8060S / Strix Halo, the hard-won bits):

- FLUX and ComfyUI segfault (exit 139) during inference unless the service sets HSA_OVERRIDE_GFX_VERSION=11.0.0 so ROCm 6.2 treats the card as gfx1100. This is baked into systemd/comfyui.service.
- The Wan2.1 14B model SIGABRTs (exit 134) on the first compute step even with the override; the transformer's compute path overwhelms the gfx1100 spoof. The 1.3B model runs fine. Avoid the 14B on this hardware.
- The VAE must match the model version. Wan2.1 produces 16-channel latents; the Wan2.2 VAE expects 48. Pairing them fails at decode with a tensor-size mismatch. Use Wan2_1_VAE_bf16.safetensors with the 2.1 model.

Log Viewer Application

A browser-based log viewer is included for monitoring llama-server.

Features:

- Real-time logs from journalctl --user -u llama-server -f
- Tabbed interface for easy navigation:
  - Logs: System logs from llama-server
  - Slots: Detailed, formatted slot information with status and parameters
  - Health: Organized health status information including memory, GPU, model, and system details

Usage:

1. Start the log viewer:
```
npm start
```
2. Open your browser and navigate to http://localhost:4000
3. Use the tabs to view different information sources

The application automatically formats JSON responses from the /slots and /health endpoints for better readability, showing:

- Slot ID, context size, and processing status
- Key parameters and next token information
- Memory, GPU, model, and system health details

What's in this repo

claude-local/
├── README.md                       ← this file
├── install.sh                      ← idempotent installer
├── bin/
│   ├── claude-smart                ← the wrapper script
│   ├── local-gen                   ← local image/video generation (ComfyUI)
│   └── rebuild-llama               ← rebuild llama.cpp in the ROCm container
├── docs/
│   ├── PX13-BAZZITE.md             ← Strix Halo / Radeon 8060S setup
│   ├── ZENBOOK-DUO.md              ← Core Ultra 9 / Vulkan setup (Aider + Qwen2.5-Coder-3B)
│   ├── AIDER-SETUP.md              ← Aider installation and config for Duo
│   ├── VULKAN-NOTES.md             ← Intel Arc iGPU Vulkan gotchas
│   ├── TAILSCALE-BRIDGE.md         ← remote-access pattern
│   └── JOURNAL.md                  ← Day 1 war stories (testing history)
├── systemd/
│   ├── llama-server-px13.service   ← reference systemd unit for PX13
│   ├── llama-server-duo.service    ← reference systemd unit for Duo
│   ├── llama-server-gemma4-12b.service ← Gemma 4 12B unit (claude-smart --model)
│   └── comfyui.service             ← ComfyUI server for local-gen
├── log-viewer.js                   ← Node.js log viewer application
├── index.html                      ← Web interface for the log viewer
├── package.json                    ← Application dependencies
├── LICENSE                         ← MIT
└── CONTRIBUTING.md                 ← if you want to upstream changes

How it works

┌─────────────┐     claude              ┌─────────────────┐
│ your shell  │ ──────────────────────► │ Anthropic API   │
│             │                          └─────────────────┘
│             │     claude-smart
│             │ ──────────┬──── (auto) probe Anthropic, fall back to local
│             │           ├──── --local force local
│             │           └──── --remote force Anthropic
│             │
│             │     claude-smart --local
│             │ ──────────────────────► localhost:8080 (llama-server)
└─────────────┘                          │
                                          ├─► distrobox container
                                          │     └─► llama.cpp + ROCm 7.2.3
                                          │           └─► gfx1151 GPU
                                          │
                                          └─► serves Anthropic Messages API
                                               natively (no proxy)

The claude-smart wrapper sets three env vars (ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, ANTHROPIC_MODEL) for one invocation and execs into claude. The Claude Code CLI doesn't know it's talking to a local model — from its perspective, it's just hitting an Anthropic API endpoint.
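
The env-var handoff described above can be sketched roughly as follows. This is an illustration, not the contents of bin/claude-smart: `run_local` and the `CLAUDE_BIN` override are hypothetical, the `http://` scheme is assumed, and the token value is a placeholder.

```bash
# Hypothetical sketch of the "set three env vars, then exec claude" step.
# Usage: run_local MODEL_ID [claude args...]
run_local() (
    model_id=${1:?usage: run_local MODEL_ID [claude args...]}
    shift
    # The function body is a subshell, so these exports last for this one
    # invocation and never leak into the calling shell.
    export ANTHROPIC_BASE_URL="http://localhost:8080"   # local llama-server
    export ANTHROPIC_AUTH_TOKEN="local"                 # placeholder (assumption)
    export ANTHROPIC_MODEL="$model_id"
    # CLAUDE_BIN is an assumed hook for dry runs; it defaults to claude.
    exec "${CLAUDE_BIN:-claude}" "$@"
)

# Dry run: substitute `env` for claude to see what the CLI would inherit.
CLAUDE_BIN=env run_local qwen3-coder-30b | grep '^ANTHROPIC_'
```

Scoping the variables to a subshell is one way to get the "for one invocation" behaviour: a later plain `claude` call in the same terminal still talks to the Anthropic API.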

Lessons learned the hard way

This setup looks simple now, but the path here was littered with dead ends. The full debug story is preserved in docs/PX13-BAZZITE.md under "What we ruled out (so you don't waste time)". Highlights:

- The rocm-7rc-rocwmma kyuz0 toolbox image ships HSA runtime 1.18.0, which segfaults during tensor upload on gfx1151. Use rocm-7.2.3 instead.
- -c N --parallel P divides the context across slots, so each slot gets N / P tokens. Claude Code + claude-mem injects ~30-40K tokens of system prompt + tools on the first request. With --parallel 2, you need -c 131072 minimum.
- Qwen3.6-35B-A3B is hybrid Transformer+Mamba. ROCm doesn't support Mamba SSM kernels (as of mid-2026). Stick with Qwen3-Coder for now.
- Unsloth Dynamic 2.0 quants of Qwen3-Coder-30B have a Llama-3-style tokenizer artifact that crashes loading. Use the LM Studio Community quant.
- The <tool_call> "control token" warning during model load is harmless cosmetic noise. It is not the cause of any crash.
- Always check dmesg first when GPU stuff silently dies. The HSA runtime segfault is invisible in llama.cpp's stdout but obvious in kernel logs.
- On Bazzite/Fedora, drop --group-add sudo from kyuz0's example. That's Ubuntu-only; Fedora uses wheel, and distrobox doesn't need either.
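
The -c / --parallel arithmetic from the list above can be checked with a few lines of shell. This is a worked example under the assumption stated there (context split evenly across slots); `per_slot_ctx` and `fits_in_slot` are hypothetical helper names, and 40000 stands in for the upper end of the ~30-40K first-request estimate.

```bash
# Tokens available to each slot when -c is split across --parallel slots.
per_slot_ctx() { echo $(( $1 / $2 )); }

# Succeeds when each slot can hold a prompt of the given size.
# Usage: fits_in_slot CTX PARALLEL PROMPT_TOKENS
fits_in_slot() { [ "$(per_slot_ctx "$1" "$2")" -ge "$3" ]; }

per_slot_ctx 131072 2    # prints 65536
per_slot_ctx 65536 2     # prints 32768

fits_in_slot 131072 2 40000 && echo "-c 131072 --parallel 2: fits"
fits_in_slot 65536 2 40000  || echo "-c 65536 --parallel 2: too small"
```

With two slots, the next smaller power-of-two context (65536) leaves only 32768 tokens per slot, which is below the upper end of the estimate; that is why 131072 is the practical minimum here.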

Credits

kyuz0/amd-strix-halo-toolboxes — the toolbox images that make this work
pablo-ross/strix-halo-gmktec-evo-x2 — the original benchmark + setup that started this project
ggml-org/llama.cpp — the inference engine, plus PR #17570 for native Anthropic Messages API support
LM Studio Community — stable Qwen3-Coder GGUF quants

Day 1 War Stories

The Duo inference path took a full day of testing to land on Aider + Qwen2.5-Coder-3B + Vulkan.

Models tried and rejected:
- Qwen2.5-Coder-7B (context too small)
- DeepSeek-Coder-V2-Lite (hallucinates tool calls)
- Qwen3.5-9B/4B/0.8B (CPU timeout or too small)
- Qwen3.5-4B on Vulkan (hybrid attention breaks KV cache reuse)

Key findings:
- Claude Code's internal request timeout (~10-12 min) is incompatible with iGPU prefill for its context size.
- Use :server-vulkan, not :server.
- Qwen2.5 standard attention is required for cache reuse to work.

Full investigative journal with the model-by-model failure analysis: docs/JOURNAL.md

License

MIT. See LICENSE.
