lrsanchez/claude-local

claude-local

Local Claude Code backup setups for AMD Strix Halo and Intel Core Ultra laptops running Bazzite (Fedora Atomic). Built around llama.cpp's native Anthropic Messages API — no proxies, no router middleware.

When the Anthropic API is down, rate-limited, or you're offline, claude-smart --local hands off to a local model with the same Claude Code experience.

Tested hardware

| Machine | CPU/GPU | RAM | Model | Tokens/sec |
|---|---|---|---|---|
| ASUS ProArt PX13 | Ryzen AI Max+ 395 / Radeon 8060S (gfx1151) | 128 GB unified | Qwen3-Coder-30B-A3B Q4_K_M | 319 prefill / 26 gen |
| ASUS Zenbook Duo | Core Ultra 9 185H (Meteor Lake) / Arc iGPU (Vulkan) | 32 GB | Qwen2.5-Coder-3B Q4_K_M (via Aider) | 250 prefill / 12 gen |

Both machines run Bazzite. Per-machine setup details are in docs/.

Two-tool architecture: The PX13 runs Claude Code against the local llama-server (full agentic, ROCm GPU). The Duo runs Aider against a local Qwen2.5-Coder-3B on Vulkan (lighter agentic, Arc iGPU). Claude Code's internal request timeout (~10-12 min) is incompatible with iGPU prefill for its 30-55K context injection — Aider's ~5-10K context sidesteps this entirely. See docs/ZENBOOK-DUO.md for the full explanation.

Quick start

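  1. Pick your hardware guide: docs/PX13-BAZZITE.md (Strix Halo / Radeon 8060S) or docs/ZENBOOK-DUO.md (Core Ultra 9 / Arc iGPU).

  2. Follow that guide to set the kernel args, create the distrobox container, and download the model.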

  3. Run the installer:

    ./install.sh

  4. Start the service and use it:

    systemctl --user enable --now llama-server.service
    claude-smart --local
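
Model loading can take a while after the unit starts, so it can help to wait for the server before the first claude-smart --local run. A minimal sketch, assuming llama-server listens on localhost:8080 and that its /health endpoint returns a success status once the model is loaded. wait_for_llama and its arguments are hypothetical names, not something this repo ships:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): poll llama-server's /health
# endpoint until it answers with a success status or the attempts run out.
wait_for_llama() {
  local url="${1:-http://localhost:8080/health}"
  local attempts="${2:-30}"   # number of polls
  local delay="${3:-2}"       # seconds between polls
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (e.g. while the model loads)
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      echo "llama-server is ready"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "llama-server did not become healthy after $attempts attempts" >&2
  return 1
}

# Usage: wait_for_llama && claude-smart --local
```

If the wait times out, journalctl --user -u llama-server shows what the unit is doing.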

Selecting a model

claude-smart keeps a small registry of local models (MODELS near the top of bin/claude-smart). Each entry maps a short key to an ANTHROPIC_MODEL id and the systemd unit that serves it. Because every model binds :8080, selecting one swaps the active unit — the previously running model is stopped and the chosen one is started.

claude-smart --list-models              # show registered models + which unit is active
claude-smart --local                    # interactive menu (default highlighted)
claude-smart --local --model qwen3-coder-30b   # pick directly, no prompt
CLAUDE_LOCAL_MODEL=qwen2.5-coder-7b claude-smart --local   # same, via env var
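
The unit swap described above can be sketched as follows. This is an illustration only: swap_unit and the SYSTEMCTL override are hypothetical names, and the real logic lives in bin/claude-smart and may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the unit swap; not the code in bin/claude-smart.
# SYSTEMCTL can be overridden (e.g. SYSTEMCTL=echo) for a dry run.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# swap_unit TARGET_UNIT ALL_UNITS...
swap_unit() {
  local target="$1"; shift
  local unit
  # Stop every other registered unit so that port 8080 is free...
  for unit in "$@"; do
    if [ "$unit" != "$target" ]; then
      "$SYSTEMCTL" --user stop "$unit" || true   # ignore units that are not loaded
    fi
  done
  # ...then start the requested one.
  "$SYSTEMCTL" --user start "$target"
}

# Usage (dry run):
#   SYSTEMCTL=echo
#   swap_unit llama-server-gpt-oss-20b.service \
#     llama-server-qwen3-coder-30b.service llama-server-gpt-oss-20b.service
```

Stopping before starting matters here because two units cannot both bind :8080.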

Notes:

  • Each model needs its own unit installed (llama-server-<key>.service). install.sh installs the model-named unit for your machine's mode alongside the generic llama-server.service. To make a second model selectable, install its unit too (download the model + install/copy a matching unit file).
  • When stdin/stdout isn't a terminal (scripts, cron), selection is skipped and DEFAULT_MODEL_KEY is used.
  • Add models or change the default without editing the script by creating ~/.config/claude-smart/models.conf — it's sourced as bash and may redefine the MODELS array and DEFAULT_MODEL_KEY. For example:
    # ~/.config/claude-smart/models.conf
    MODELS[gpt-oss-20b]="gpt-oss-20b|llama-server-gpt-oss-20b.service"
    DEFAULT_MODEL_KEY="qwen3-coder-30b"
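
To illustrate how such a registry could be consumed, here is a minimal sketch. It assumes the entry format shown in the models.conf example ("<ANTHROPIC_MODEL id>|<unit>") and requires bash 4+ for associative arrays. resolve_model, MODEL_ID, and MODEL_UNIT are hypothetical names, and the built-in entry is illustrative rather than copied from bin/claude-smart:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the registry lookup; not the code in bin/claude-smart.
# Entry format assumed from the models.conf example: "<ANTHROPIC_MODEL id>|<unit>".
declare -A MODELS=(
  [qwen3-coder-30b]="qwen3-coder-30b|llama-server-qwen3-coder-30b.service"
)
DEFAULT_MODEL_KEY="qwen3-coder-30b"

# User overrides are plain bash, so sourcing them can add keys or change the default.
CONF="${HOME:-}/.config/claude-smart/models.conf"
if [[ -f "$CONF" ]]; then
  source "$CONF"
fi

# resolve_model [KEY]: sets MODEL_ID and MODEL_UNIT, or fails on an unknown key.
resolve_model() {
  local key="${1:-$DEFAULT_MODEL_KEY}"
  local entry="${MODELS[$key]:-}"
  if [[ -z "$entry" ]]; then
    echo "unknown model key: $key" >&2
    return 1
  fi
  MODEL_ID="${entry%%|*}"    # text before the first '|'
  MODEL_UNIT="${entry#*|}"   # text after the first '|'
}

resolve_model && echo "$MODEL_ID is served by $MODEL_UNIT"
```

Because models.conf is sourced after the defaults are declared, an assignment such as MODELS[gpt-oss-20b]=... adds an entry without replacing the built-in ones.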

Local image & video generation (local-gen)

bin/local-gen generates images and short video clips entirely on the local GPU via ComfyUI running in the same llama-rocm distrobox container. It talks to ComfyUI's HTTP API (auto-starting the comfyui.service systemd unit if needed) and shows a live progress bar over ComfyUI's WebSocket.

local-gen --image "a neon-lit alley in the rain, cyberpunk, cinematic"
local-gen --video "a kite surfing at sunset, slow motion"
local-gen --status                 # is ComfyUI up? how much VRAM is free?
local-gen --down                   # unload all models, free VRAM/RAM
local-gen --help-gen               # full option list

| Mode | Model | Notes |
|---|---|---|
| --image | FLUX.1-schnell | ~20s at 512², ~2min at 1024² |
| --video | Wan2.1-T2V-1.3B | 832×480, ~5min for a 3s clip at 12 steps |

Common options (--width, --height, --steps, --seed, and --duration for video) override the defaults; see --help-gen. Output is written to ~/Pictures/local-gen/ (images as .png, video as animated .webp) and opened with the system viewer.

gfx1151 gotchas (Radeon 8060S / Strix Halo, the hard-won bits):

  • FLUX and ComfyUI segfault (exit 139) during inference unless the service sets HSA_OVERRIDE_GFX_VERSION=11.0.0 so ROCm 6.2 treats the card as gfx1100. This is baked into systemd/comfyui.service.
  • The Wan2.1 14B model SIGABRTs (exit 134) on the first compute step even with the override — the transformer's compute path overwhelms the gfx1100 spoof. The 1.3B model runs fine. Avoid the 14B on this hardware.
  • The VAE must match the model version. Wan2.1 produces 16-channel latents; the Wan2.2 VAE expects 48. Pairing them fails at decode with a tensor-size mismatch. Use Wan2_1_VAE_bf16.safetensors with the 2.1 model.

Log Viewer Application

The log viewer is a small Node.js web application for monitoring llama-server:

Features:

  • Real-time logs from journalctl --user -u llama-server -f
  • Tabbed interface for easy navigation:
    • Logs: journal output from llama-server
    • Slots: Detailed, formatted slot information with status and parameters
    • Health: Organized health status information including memory, GPU, model, and system details

Usage:

  1. Start the log viewer:
    npm start
  2. Open your browser and navigate to http://localhost:4000
  3. Use the tabs to view different information sources

The application automatically formats JSON responses from the /slots and /health endpoints for better readability, showing:

  • Slot ID, context size, and processing status
  • Key parameters and next token information
  • Memory, GPU, model, and system health details

What's in this repo

claude-local/
├── README.md                       ← this file
├── install.sh                      ← idempotent installer
├── bin/
│   ├── claude-smart                ← the wrapper script
│   ├── local-gen                   ← local image/video generation (ComfyUI)
│   └── rebuild-llama               ← rebuild llama.cpp in the ROCm container
├── docs/
│   ├── PX13-BAZZITE.md             ← Strix Halo / Radeon 8060S setup
│   ├── ZENBOOK-DUO.md              ← Core Ultra 9 / Vulkan setup (Aider + Qwen2.5-Coder-3B)
│   ├── AIDER-SETUP.md              ← Aider installation and config for Duo
│   ├── VULKAN-NOTES.md             ← Intel Arc iGPU Vulkan gotchas
│   ├── TAILSCALE-BRIDGE.md         ← remote-access pattern
│   └── JOURNAL.md                  ← Day 1 war stories (testing history)
├── systemd/
│   ├── llama-server-px13.service   ← reference systemd unit for PX13
│   ├── llama-server-duo.service    ← reference systemd unit for Duo
│   ├── llama-server-gemma4-12b.service ← Gemma 4 12B unit (claude-smart --model)
│   └── comfyui.service             ← ComfyUI server for local-gen
├── log-viewer.js                   ← Node.js log viewer application
├── index.html                      ← Web interface for the log viewer
├── package.json                    ← Application dependencies
├── LICENSE                         ← MIT
└── CONTRIBUTING.md                 ← if you want to upstream changes

How it works

┌─────────────┐     claude              ┌─────────────────┐
│ your shell  │ ──────────────────────► │ Anthropic API   │
│             │                          └─────────────────┘
│             │     claude-smart
│             │ ──────────┬──── (auto) probe Anthropic, fall back to local
│             │           ├──── --local force local
│             │           └──── --remote force Anthropic
│             │
│             │     claude-smart --local
│             │ ──────────────────────► localhost:8080 (llama-server)
└─────────────┘                          │
                                          ├─► distrobox container
                                          │     └─► llama.cpp + ROCm 7.2.3
                                          │           └─► gfx1151 GPU
                                          │
                                          └─► serves Anthropic Messages API
                                               natively (no proxy)

The claude-smart wrapper sets three env vars (ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, ANTHROPIC_MODEL) for one invocation and execs into claude. The Claude Code CLI doesn't know it's talking to a local model — from its perspective, it's just hitting an Anthropic API endpoint.
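
A minimal sketch of that hand-off, under stated assumptions: launch_local and CLAUDE_BIN are hypothetical names, the token value is a placeholder (a llama-server started without --api-key does not check it), and the actual code in bin/claude-smart may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the hand-off; the real wrapper is bin/claude-smart.
# launch_local MODEL_ID [claude args...]
launch_local() {
  local model_id="$1"; shift
  # env scopes the three variables to this one invocation; exec replaces the
  # wrapper process with claude, so nothing leaks into the calling shell.
  exec env \
    ANTHROPIC_BASE_URL="http://localhost:8080" \
    ANTHROPIC_AUTH_TOKEN="local-placeholder" \
    ANTHROPIC_MODEL="$model_id" \
    "${CLAUDE_BIN:-claude}" "$@"
}

# Usage: launch_local qwen3-coder-30b
```

Setting the variables on the exec line rather than exporting them keeps a plain claude invocation in the same shell pointed at the Anthropic API.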

Lessons learned the hard way

This setup looks simple now, but the path here was littered with dead ends. The full debug story is preserved in docs/PX13-BAZZITE.md under "What we ruled out (so you don't waste time)". Highlights:

  • The rocm-7rc-rocwmma kyuz0 toolbox image ships HSA runtime 1.18.0, which segfaults during tensor upload on gfx1151. Use rocm-7.2.3 instead.
  • -c N --parallel P divides the context evenly across slots, so each slot gets N/P tokens. Claude Code plus claude-mem inject ~30-40K tokens of system prompt and tool definitions on the first request, so with --parallel 2 you need at least -c 131072.
  • Qwen3.6-35B-A3B is hybrid Transformer+Mamba. ROCm doesn't support Mamba SSM kernels (as of mid-2026). Stick with Qwen3-Coder for now.
  • Unsloth Dynamic 2.0 quants of Qwen3-Coder-30B have a Llama-3-style tokenizer artifact that crashes loading. Use the LM Studio Community quant.
  • The <tool_call> "control token" warning during model load is harmless cosmetic noise. Not the cause of any crash.
  • Always check dmesg first when GPU stuff silently dies. The HSA runtime segfault is invisible in llama.cpp's stdout but obvious in kernel logs.
  • On Bazzite/Fedora, drop --group-add sudo from kyuz0's example. That's Ubuntu-only; Fedora uses wheel, and distrobox doesn't need either.
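
The context-size bullet above comes down to simple arithmetic. As a sketch, assuming the context is split evenly across slots as described (per_slot_ctx and fits_first_request are hypothetical helper names):

```shell
#!/usr/bin/env bash
# Per-slot context = total context (-c) divided by slot count (--parallel).
per_slot_ctx() {
  echo $(( $1 / $2 ))
}

# fits_first_request TOTAL_CTX PARALLEL NEEDED_TOKENS
# Succeeds when a single slot can hold NEEDED_TOKENS.
fits_first_request() {
  [ "$(per_slot_ctx "$1" "$2")" -ge "$3" ]
}

per_slot_ctx 131072 2    # prints 65536
per_slot_ctx 65536 2     # prints 32768, below the ~40K upper estimate

if fits_first_request 65536 2 40000; then
  echo "-c 65536 --parallel 2 is enough"
else
  echo "-c 65536 --parallel 2 is too small for a 40K first request"
fi
```

With -c 65536 each of the two slots gets 32768 tokens, which cannot hold a 40K-token first request. Doubling to -c 131072 gives 65536 per slot.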

Credits

Day 1 War Stories

The Duo inference path took a full day of testing to land on Aider + Qwen2.5-Coder-3B + Vulkan. Models tried and rejected: Qwen2.5-Coder-7B (context too small), DeepSeek-Coder-V2-Lite (hallucinates tool calls), Qwen3.5-9B/4B/0.8B (CPU timeout or too small), Qwen3.5-4B on Vulkan (hybrid attention breaks KV cache reuse). Key findings: Claude Code's internal request timeout (~10-12 min) is incompatible with iGPU prefill for its context size; use :server-vulkan not :server; Qwen2.5 standard attention is required for cache reuse to work. Aider + Qwen2.5-Coder-3B + Vulkan is the result.

Full investigative journal with the model-by-model failure analysis: docs/JOURNAL.md

License

MIT. See LICENSE.
