Local Claude Code backup setups for AMD Strix Halo and Intel Core Ultra laptops running Bazzite (Fedora Atomic). Built around llama.cpp's native Anthropic Messages API — no proxies, no router middleware.
When the Anthropic API is down, rate-limited, or you're offline, claude-smart --local hands off to a local model with the same Claude Code experience.
| Machine | CPU/GPU | RAM | Model | Tokens/sec |
|---|---|---|---|---|
| ASUS ProArt PX13 | Ryzen AI Max+ 395 / Radeon 8060S (gfx1151) | 128 GB unified | Qwen3-Coder-30B-A3B Q4_K_M | 319 prefill / 26 gen |
| ASUS Zenbook Duo | Core Ultra 9 185H (Meteor Lake) / Arc iGPU (Vulkan) | 32 GB | Qwen2.5-Coder-3B Q4_K_M (via Aider) | 250 prefill / 12 gen |
Both machines run Bazzite. Setup details per machine in docs/.
Two-tool architecture: The PX13 runs Claude Code against the local
llama-server (full agentic, ROCm GPU). The Duo runs Aider against a local
Qwen2.5-Coder-3B on Vulkan (lighter agentic, Arc iGPU). Claude Code's
internal request timeout (~10-12 min) is incompatible with iGPU prefill for
its 30-55K context injection — Aider's ~5-10K context sidesteps this entirely.
See docs/ZENBOOK-DUO.md for the full explanation.
- Pick your hardware guide:
  - PX13 (Strix Halo): primary, full-power setup (Claude Code + ROCm GPU)
  - Zenbook Duo (Vulkan iGPU): Aider + Qwen2.5-Coder-3B + Arc iGPU, offline backup
  - Aider setup: Aider installation and configuration for the Duo
  - Vulkan notes: Intel Arc iGPU gotchas and verification steps
  - Tailscale bridge: point the Duo at the PX13 over Tailscale, get full 30B perf anywhere
- Walk through the kernel args, distrobox container, and model download steps in your hardware doc.
- Run the installer:

      ./install.sh

- Start the service and use it:

      systemctl --user enable --now llama-server.service
      claude-smart --local
claude-smart keeps a small registry of local models (MODELS near the top of
bin/claude-smart). Each entry maps a short key to an ANTHROPIC_MODEL id and
the systemd unit that serves it. Because every model binds :8080, selecting one
swaps the active unit — the previously running model is stopped and the chosen
one is started.
claude-smart --list-models # show registered models + which unit is active
claude-smart --local # interactive menu (default highlighted)
claude-smart --local --model qwen3-coder-30b # pick directly, no prompt
CLAUDE_LOCAL_MODEL=qwen2.5-coder-7b claude-smart --local # same, via env var

Notes:
- Each model needs its own unit installed (`llama-server-<key>.service`). `install.sh` installs the model-named unit for your machine's mode alongside the generic `llama-server.service`. To make a second model selectable, install its unit too (download the model + install/copy a matching unit file).
- When `stdin/stdout` isn't a terminal (scripts, cron), selection is skipped and `DEFAULT_MODEL_KEY` is used.
- Add models or change the default without editing the script by creating `~/.config/claude-smart/models.conf`. It's sourced as bash and may redefine the `MODELS` array and `DEFAULT_MODEL_KEY`. For example:

      # ~/.config/claude-smart/models.conf
      MODELS[gpt-oss-20b]="gpt-oss-20b|llama-server-gpt-oss-20b.service"
      DEFAULT_MODEL_KEY="qwen3-coder-30b"
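The registry and selection rules above can be sketched roughly as follows. This is an illustration, not the code in `bin/claude-smart`: the `id|unit` entry format comes from the `models.conf` example, but the function names, the precedence order (flag, then `CLAUDE_LOCAL_MODEL`, then default), and the `qwen3-coder-30b` unit name (inferred from the `llama-server-<key>.service` pattern) are assumptions. It prints the `systemctl` commands rather than running them.

```bash
#!/usr/bin/env bash
# Sketch of a model registry: short key -> "ANTHROPIC_MODEL id|systemd unit".
declare -A MODELS=(
  [qwen3-coder-30b]="qwen3-coder-30b|llama-server-qwen3-coder-30b.service"
  [gpt-oss-20b]="gpt-oss-20b|llama-server-gpt-oss-20b.service"
)
DEFAULT_MODEL_KEY="qwen3-coder-30b"

# Menus only make sense on a terminal; scripts and cron fall through to the default.
is_interactive() { [[ -t 0 && -t 1 ]]; }

# Assumed precedence: explicit --model value, then CLAUDE_LOCAL_MODEL, then the default.
resolve_model_key() {
  local flag_value="${1:-}"
  printf '%s\n' "${flag_value:-${CLAUDE_LOCAL_MODEL:-$DEFAULT_MODEL_KEY}}"
}

# Split a registry entry on the first "|".
model_id_for() { local entry="${MODELS[$1]}"; printf '%s\n' "${entry%%|*}"; }
unit_for()     { local entry="${MODELS[$1]}"; printf '%s\n' "${entry#*|}"; }

# Every unit binds :8080, so stop the other registered units before starting the chosen one.
swap_plan() {
  local chosen="$1" key
  for key in "${!MODELS[@]}"; do
    [[ "$key" == "$chosen" ]] || echo "systemctl --user stop $(unit_for "$key")"
  done
  echo "systemctl --user start $(unit_for "$chosen")"
}
```

Calling `swap_plan gpt-oss-20b` lists one `stop` line per other registered unit followed by a single `start` line, which is the swap behaviour described above in dry-run form.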
bin/local-gen generates images and short video clips entirely on the local
GPU via ComfyUI running in the same
llama-rocm distrobox container. It talks to ComfyUI's HTTP API (auto-starting
the comfyui.service systemd unit if needed) and shows a live progress bar over
ComfyUI's WebSocket.
local-gen --image "a neon-lit alley in the rain, cyberpunk, cinematic"
local-gen --video "a kite surfing at sunset, slow motion"
local-gen --status # is ComfyUI up? how much VRAM is free?
local-gen --down # unload all models, free VRAM/RAM
local-gen --help-gen # full option list

| Mode | Model | Notes |
|---|---|---|
| `--image` | FLUX.1-schnell | ~20s at 512², ~2min at 1024² |
| `--video` | Wan2.1-T2V-1.3B | 832×480, ~5min for a 3s clip at 12 steps |
Common options (--width, --height, --steps, --seed, and --duration
for video) override the defaults; see --help-gen. Output is written to
~/Pictures/local-gen/ (images as .png, video as animated .webp) and opened
with the system viewer.
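For orientation, `--status` and `--down` could map onto ComfyUI's HTTP API along these lines. This is a sketch, not the code in `bin/local-gen`: it assumes ComfyUI listens on its default port 8188 and uses the `/system_stats` and `/free` routes. `COMFY_URL` and the function names are made up for the example.

```bash
#!/usr/bin/env bash
# Assumed base URL; ComfyUI's default port is 8188.
COMFY_URL="${COMFY_URL:-http://127.0.0.1:8188}"

# JSON body asking ComfyUI to unload models and release cached memory.
free_payload() { printf '%s' '{"unload_models": true, "free_memory": true}'; }

# Print the raw stats JSON (per-device entries include vram_total and vram_free).
comfy_status() { curl -fsS --max-time 5 "$COMFY_URL/system_stats"; }

# Ask the server to drop loaded models.
comfy_down() {
  curl -fsS --max-time 5 -X POST -H 'Content-Type: application/json' \
    -d "$(free_payload)" "$COMFY_URL/free"
}

# Convert a byte count (as reported in the stats) to whole MiB.
bytes_to_mib() { echo $(( $1 / 1024 / 1024 )); }
```

Both network helpers fail fast with a non-zero exit if ComfyUI is not running, which a wrapper could use as the cue to start `comfyui.service`.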
gfx1151 gotchas (Radeon 8060S / Strix Halo, the hard-won bits):
- FLUX and ComfyUI segfault (exit 139) during inference unless the service sets `HSA_OVERRIDE_GFX_VERSION=11.0.0` so ROCm 6.2 treats the card as gfx1100. This is baked into `systemd/comfyui.service`.
- The Wan2.1 14B model SIGABRTs (exit 134) on the first compute step even with the override: the transformer's compute path overwhelms the gfx1100 spoof. The 1.3B model runs fine. Avoid the 14B on this hardware.
- The VAE must match the model version. Wan2.1 produces 16-channel latents; the Wan2.2 VAE expects 48. Pairing them fails at decode with a tensor-size mismatch. Use `Wan2_1_VAE_bf16.safetensors` with the 2.1 model.
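The exit codes quoted above follow the shell convention of 128 plus the signal number, which is how 139 maps to SIGSEGV (11) and 134 to SIGABRT (6). A small helper makes that decoding explicit. It is hypothetical and not part of this repo.

```bash
#!/usr/bin/env bash
# Decode a "128 + N" exit status into a signal name using the shell's kill builtin.
exit_signal() {
  local code="$1"
  if (( code > 128 )); then
    kill -l "$(( code - 128 ))"
  else
    echo "none"   # ordinary exit status, not a signal death
  fi
}
```

In bash, `exit_signal 139` prints `SEGV` and `exit_signal 134` prints `ABRT`, matching the two failure modes in the list.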
A log viewer application provides better monitoring of your llama-server:
- Real-time logs from `journalctl --user -u llama-server -f`
- Tabbed interface for easy navigation:
  - Logs: System logs from llama-server
  - Slots: Detailed, formatted slot information with status and parameters
  - Health: Organized health status information including memory, GPU, model, and system details

To use it:
- Start the log viewer: `npm start`
- Open your browser and navigate to `http://localhost:4000`
- Use the tabs to view different information sources
The application automatically formats JSON responses from the /slots and /health endpoints for better readability, showing:
- Slot ID, context size, and processing status
- Key parameters and next token information
- Memory, GPU, model, and system health details
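The health endpoint can also be checked from a shell when the viewer is not running. The sketch below assumes llama-server on `localhost:8080`, as elsewhere in this README, and that `/health` answers 200 once the model is loaded and 503 while it is still loading. `LLAMA_URL` and the function names are illustrative.

```bash
#!/usr/bin/env bash
LLAMA_URL="${LLAMA_URL:-http://localhost:8080}"

# Map an HTTP status from /health to a short label.
describe_health() {
  case "$1" in
    200) echo "ready" ;;
    503) echo "loading" ;;
    000) echo "unreachable" ;;   # curl reports 000 when no response arrived
    *)   echo "unexpected ($1)" ;;
  esac
}

# Fetch only the status code; curl prints 000 if the connection fails.
health_code() {
  curl -s -o /dev/null --max-time 3 -w '%{http_code}' "$LLAMA_URL/health"
}

# One-word summary suitable for a prompt or a script guard.
llama_status() { describe_health "$(health_code)"; }
```

Running `llama_status` right after `systemctl --user start` should report `loading` until the model is in memory, then `ready`.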
claude-local/
├── README.md ← this file
├── install.sh ← idempotent installer
├── bin/
│ ├── claude-smart ← the wrapper script
│ ├── local-gen ← local image/video generation (ComfyUI)
│ └── rebuild-llama ← rebuild llama.cpp in the ROCm container
├── docs/
│ ├── PX13-BAZZITE.md ← Strix Halo / Radeon 8060S setup
│ ├── ZENBOOK-DUO.md ← Core Ultra 9 / Vulkan setup (Aider + Qwen2.5-Coder-3B)
│ ├── AIDER-SETUP.md ← Aider installation and config for Duo
│ ├── VULKAN-NOTES.md ← Intel Arc iGPU Vulkan gotchas
│ ├── TAILSCALE-BRIDGE.md ← remote-access pattern
│ └── JOURNAL.md ← Day 1 war stories (testing history)
├── systemd/
│ ├── llama-server-px13.service ← reference systemd unit for PX13
│ ├── llama-server-duo.service ← reference systemd unit for Duo
│ ├── llama-server-gemma4-12b.service ← Gemma 4 12B unit (claude-smart --model)
│ └── comfyui.service ← ComfyUI server for local-gen
├── log-viewer.js ← Node.js log viewer application
├── index.html ← Web interface for the log viewer
├── package.json ← Application dependencies
├── LICENSE ← MIT
└── CONTRIBUTING.md ← if you want to upstream changes
┌─────────────┐ claude ┌─────────────────┐
│ your shell │ ──────────────────────► │ Anthropic API │
│ │ └─────────────────┘
│ │ claude-smart
│ │ ──────────┬──── (auto) probe Anthropic, fall back to local
│ │ ├──── --local force local
│ │ └──── --remote force Anthropic
│ │
│ │ claude-smart --local
│ │ ──────────────────────► localhost:8080 (llama-server)
└─────────────┘ │
├─► distrobox container
│ └─► llama.cpp + ROCm 7.2.3
│ └─► gfx1151 GPU
│
└─► serves Anthropic Messages API
natively (no proxy)
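The auto branch in the diagram amounts to a reachability check followed by a choice of backend. One way to sketch it, assuming a plain HTTPS probe of `api.anthropic.com` with a short timeout (the actual probe in `bin/claude-smart` may differ, and `probe_anthropic` and `choose_backend` are hypothetical names):

```bash
#!/usr/bin/env bash
# Succeeds if the API host answers at all within 5 seconds; any HTTP status counts as reachable.
probe_anthropic() {
  curl -s -o /dev/null --max-time 5 https://api.anthropic.com/
}

# choose_backend [--local|--remote|auto] [probe-command]
# The probe command is injectable so the decision logic can be exercised offline.
choose_backend() {
  local mode="${1:-auto}" probe="${2:-probe_anthropic}"
  case "$mode" in
    --local)  echo "local" ;;
    --remote) echo "remote" ;;
    *)        if "$probe"; then echo "remote"; else echo "local"; fi ;;
  esac
}
```

A reachability probe like this covers outages and offline use. It would not notice rate limiting, which would need a check on the response to an authenticated request.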
The claude-smart wrapper sets three env vars (ANTHROPIC_BASE_URL,
ANTHROPIC_AUTH_TOKEN, ANTHROPIC_MODEL) for one invocation and execs
into claude. The Claude Code CLI doesn't know it's talking to a local model
— from its perspective, it's just hitting an Anthropic API endpoint.
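The handoff described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the wrapper's actual code: the token value `local` is a placeholder (assumed to be ignored by a llama-server started without an API key), and `local_env` and `run_local_claude` are hypothetical names.

```bash
#!/usr/bin/env bash
# Print the three overrides a local invocation would use.
local_env() {
  local model_id="$1" base_url="${2:-http://localhost:8080}"
  printf 'ANTHROPIC_BASE_URL=%s\n' "$base_url"
  printf 'ANTHROPIC_AUTH_TOKEN=%s\n' "local"   # placeholder token (assumption)
  printf 'ANTHROPIC_MODEL=%s\n' "$model_id"
}

# Replace the current process with claude, applying the overrides to that one invocation only.
run_local_claude() {
  local model_id="$1"; shift
  local -a assignments
  mapfile -t assignments < <(local_env "$model_id")
  exec env "${assignments[@]}" claude "$@"
}
```

Because the variables are passed through `env` rather than exported, the calling shell's environment stays untouched and a plain `claude` afterwards still talks to Anthropic.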
This setup looks simple now, but the path here was littered with dead ends.
The full debug story is preserved in docs/PX13-BAZZITE.md
under "What we ruled out (so you don't waste time)". Highlights:
- The `rocm-7rc-rocwmma` kyuz0 toolbox image ships HSA runtime 1.18.0, which segfaults during tensor upload on gfx1151. Use `rocm-7.2.3` instead.
- `-c N --parallel P` divides context across slots, so each slot gets N/P tokens. Claude Code + claude-mem injects ~30-40K tokens of system prompt + tools on first request. With `--parallel 2`, you need `-c 131072` minimum.
- Qwen3.6-35B-A3B is hybrid Transformer+Mamba. ROCm doesn't support Mamba SSM kernels (as of mid-2026). Stick with Qwen3-Coder for now.
- Unsloth Dynamic 2.0 quants of Qwen3-Coder-30B have a Llama-3-style tokenizer artifact that crashes loading. Use the LM Studio Community quant.
- The `<tool_call>` "control token" warning during model load is harmless cosmetic noise. Not the cause of any crash.
- Always check `dmesg` first when GPU stuff silently dies. The HSA runtime segfault is invisible in llama.cpp's stdout but obvious in kernel logs.
- On Bazzite/Fedora, drop `--group-add sudo` from kyuz0's example. That's Ubuntu-only; Fedora uses `wheel`, and distrobox doesn't need either.
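The context-splitting point in the list above is simple arithmetic, worked through here as a sketch. The helper names are hypothetical, and restricting `-c` to powers of two is a convenience assumed for the example rather than a llama.cpp requirement.

```bash
#!/usr/bin/env bash
# Context available to each slot when -c N is split across --parallel P.
per_slot_ctx() { echo $(( $1 / $2 )); }

# Smallest power-of-two -c that gives every slot at least `need` tokens.
min_ctx() {
  local need="$1" parallel="$2" ctx=1024
  while (( ctx / parallel < need )); do
    ctx=$(( ctx * 2 ))
  done
  echo "$ctx"
}
```

With the numbers from the list, `per_slot_ctx 131072 2` gives 65536 tokens per slot, and `min_ctx 40000 2` returns 131072: halving to `-c 65536` would leave each slot 32768 tokens, below the upper end of the ~30-40K first-request injection.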
- kyuz0/amd-strix-halo-toolboxes — the toolbox images that make this work
- pablo-ross/strix-halo-gmktec-evo-x2 — the original benchmark + setup that started this project
- ggml-org/llama.cpp — the inference engine, plus PR #17570 for native Anthropic Messages API support
- LM Studio Community — stable Qwen3-Coder GGUF quants
The Duo inference path took a full day of testing to land on Aider +
Qwen2.5-Coder-3B + Vulkan. Models tried and rejected: Qwen2.5-Coder-7B
(context too small), DeepSeek-Coder-V2-Lite (hallucinates tool calls),
Qwen3.5-9B/4B/0.8B (CPU timeout or too small), Qwen3.5-4B on Vulkan (hybrid
attention breaks KV cache reuse). Key findings: Claude Code's internal request
timeout (~10-12 min) is incompatible with iGPU prefill for its context size;
use :server-vulkan not :server; Qwen2.5 standard attention is required for
cache reuse to work. Aider + Qwen2.5-Coder-3B + Vulkan is the result.
Full investigative journal with the model-by-model failure analysis:
docs/JOURNAL.md
MIT. See LICENSE.