Run local inference via llama.cpp (request-scoped llama-server), drop the daemon #36

Draft
CoreyRDean wants to merge 3 commits into main from claude/llama-cli-local-inference-NaJIq


CoreyRDean (Owner) commented Jun 5, 2026

What

Replaces the server-daemon local-inference path with local llama.cpp, installed on demand via the system package manager. Local inference is now backed by a request-scoped llama-server child process — warm across the engine's tool-call loop, killed when the invocation exits, with a one-shot llama-cli fallback.

Previously local models ran as a long-lived llamafile --server supervised by the intentd daemon (a persistent background process + control socket). That whole layer is gone.

Architecture

For one i invocation:

  • On the first inference, a llama-server child is started on a private loopback port and held warm; subsequent tool-call steps reuse the loaded weights + KV cache (no per-step model reload).
  • It speaks the OpenAI-compatible /v1/chat/completions API, so the native messages array (system/user/assistant/tool) is sent as-is — no flattening of multi-turn history.
  • It's not a daemon: it is bound to loopback, owned by the single invocation, killed on Close() (deferred at the call sites), and, on Linux, SIGKILL'd by the kernel via Pdeathsig if intent dies. Nothing persists between commands, and there is no control socket to manage.
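The lazy-start behaviour described in the bullets above could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: `lazyServer`, `ensure`, and the `launch` hook are hypothetical names, and the real backend would exec `llama-server` where the hook is called. Asking the kernel for port 0 and then releasing it leaves a small window in which another process could take the port, so a real implementation would need to tolerate a failed bind.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// freeLoopbackPort asks the kernel for an unused TCP port on 127.0.0.1
// and releases it immediately so a child process can bind it.
func freeLoopbackPort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

// lazyServer is a hypothetical stand-in for the co-process backend.
// It launches its child at most once, on the first inference.
type lazyServer struct {
	mu      sync.Mutex
	baseURL string
	starts  int
	launch  func(port int) error // hypothetical hook; would start llama-server
}

// ensure returns the base URL of the warm server, starting it if needed.
func (s *lazyServer) ensure() (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.baseURL != "" {
		return s.baseURL, nil // already warm: reuse loaded weights
	}
	port, err := freeLoopbackPort()
	if err != nil {
		return "", err
	}
	if err := s.launch(port); err != nil {
		return "", err
	}
	s.starts++
	s.baseURL = fmt.Sprintf("http://127.0.0.1:%d", port)
	return s.baseURL, nil
}
```

Every tool-call step would call `ensure()` before issuing its HTTP request, so only the first step pays the model-load cost.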

Resolution ladder in the backend builder: llama-server → llama-cli (one-shot fallback) → mock (used when nothing is installed yet, so a fresh install doesn't hard-fail).

Why this shape

The requirement "efficient + no flattening + no server" splits by scope. Keeping weights warm within an invocation (across tool steps) is fully solved here without a daemon. Keeping them warm across separate invocations fundamentally requires a resident process, i.e. a daemon, which is exactly what this PR avoids. The request-scoped child is the sweet spot: robust clean-JSON transport, a warm tool loop, and no background state.

Components

  • internal/model/llamaserver — lazy-start co-process backend (Backend/StructuredBackend/io.Closer); delegates to the existing OpenAI-compatible HTTP client; Close() kills the process group. Platform procAttr (Pdeathsig on Linux).
  • internal/model/llamacli — one-shot fallback backend (kept; flattens, since llama-cli takes a single turn).
  • internal/runtime — manages both llama-server and llama-cli (same brew install llama.cpp); EnsureLlamaRuntime / HaveLlamaRuntime / resolvers. Dropped the llamafile-binary download; GGUF model management unchanged.
  • internal/daemon + i daemon subcommand — removed. ensure/doctor/init/model self-heal now revolves around runtime + model install.
  • Network backends (llamafile-network, ollama, openai) untouched.
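The "Close() kills the process group" behaviour from the llamaserver bullet can be sketched as follows. `startInOwnGroup` and `killGroup` are hypothetical helpers, not names from the PR. The sketch sets only `Setpgid` so it builds on both Linux and macOS; per the description, the Linux build would additionally set `Pdeathsig: syscall.SIGKILL` in the same struct. It does not build on Windows.

```go
package main

import (
	"os/exec"
	"syscall"
)

// startInOwnGroup starts a child in a new process group whose ID equals
// the child's PID, so the whole group can be signalled later.
func startInOwnGroup(name string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(name, args...)
	// On Linux, Pdeathsig: syscall.SIGKILL could be added here as well.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

// killGroup sends SIGKILL to the child's process group and reaps the child.
// It is a no-op when the child was never started.
func killGroup(cmd *exec.Cmd) error {
	if cmd == nil || cmd.Process == nil {
		return nil
	}
	// A negative PID addresses the process group with that ID.
	err := syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
	_ = cmd.Wait() // reap the child; the exit error is expected after SIGKILL
	return err
}
```

Killing the group rather than the single PID also takes down any helper processes the child may have spawned.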

Tool calls

Preserved. The engine's tool-call loop is backend-agnostic and the JSON-schema grammar still includes the tool_call branch (the full tool enum). With the co-process, the multi-turn history reaches the model as a native role-tagged messages array, exactly as it did with the old llamafile server.
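For illustration, a role-tagged messages array for /v1/chat/completions might be encoded as below. The struct and function names are hypothetical, and the shape is deliberately simplified: real requests carry more fields (model name, and tool-call identifiers on tool messages) that are omitted here.

```go
package main

import "encoding/json"

// message is a simplified chat turn; the role is one of
// system, user, assistant, or tool.
type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// chatRequest is a pared-down chat-completions body.
type chatRequest struct {
	Messages []message `json:"messages"`
}

// encodeChat serialises the history turn by turn, without flattening it
// into a single prompt string.
func encodeChat(msgs []message) (string, error) {
	b, err := json.Marshal(chatRequest{Messages: msgs})
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```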

Tests

  • Unit tests for the co-process helpers (port allocation, capped log buffer, cache identity, safe Close before start) and the one-shot JSON extraction / flattening.
  • Updated CLI tests (backend fallback, doctor, config) for the daemon-free world.
  • gofmt/go vet/go test ./... pass; cross-compiles on linux+darwin / amd64+arm64; binary smoke-tested (i doctor reports the llama.cpp runtime; i daemon is no longer a subcommand). Prior commit's CI run was fully green.
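One of the helpers under test, the capped log buffer, could be as small as the sketch below. `cappedBuffer` is a hypothetical name and the PR's implementation may differ; the idea is to keep only the tail of the child's output so a chatty server cannot grow memory without bound.

```go
package main

import "sync"

// cappedBuffer is an io.Writer that retains only the most recent max bytes.
type cappedBuffer struct {
	mu  sync.Mutex
	max int
	buf []byte
}

// Write appends p and then discards the oldest bytes beyond the cap.
// It always reports the full length so the writing process never blocks.
func (c *cappedBuffer) Write(p []byte) (int, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.buf = append(c.buf, p...)
	if over := len(c.buf) - c.max; over > 0 {
		// Copy the tail so the discarded prefix can be garbage collected.
		c.buf = append([]byte(nil), c.buf[over:]...)
	}
	return len(p), nil
}

// String returns the retained tail of the log.
func (c *cappedBuffer) String() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return string(c.buf)
}
```

Assigning one such buffer to both `Stdout` and `Stderr` of the child would capture the last few kilobytes for error reports.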

Verification note

I could not run a true end-to-end model call in this sandbox (no llama-server/GGUF installed). One compatibility item to confirm on a real machine: the HTTP client sends the grammar as response_format.schema (a llama.cpp extension that llamafile honored). Upstream llama-server shares that codebase and should accept it, but this is worth a live check.
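The field in question would sit in the request body roughly as sketched here. Only `response_format.schema` comes from this description; the `"type": "json_object"` value is an assumption based on the llama.cpp server documentation, and the Go names are hypothetical.

```go
package main

import "encoding/json"

// responseFormat carries the JSON schema used to constrain output.
type responseFormat struct {
	Type   string          `json:"type"` // assumed value: "json_object"
	Schema json.RawMessage `json:"schema"`
}

// constrainedRequest shows only the field relevant to the live check.
type constrainedRequest struct {
	ResponseFormat responseFormat `json:"response_format"`
}

// encodeConstrained embeds a schema document under response_format.schema.
// It returns an error when the schema is not valid JSON.
func encodeConstrained(schema string) (string, error) {
	b, err := json.Marshal(constrainedRequest{
		ResponseFormat: responseFormat{
			Type:   "json_object",
			Schema: json.RawMessage(schema),
		},
	})
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```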

https://claude.ai/code/session_01Kg6sPc8a48ivHXjpxmwDAe

claude added 3 commits June 5, 2026 20:17
Local models previously ran as a long-lived `llamafile --server`
subprocess supervised by an `intentd` daemon, with the CLI talking to
it over a loopback OpenAI-compatible HTTP endpoint. This replaces that
whole path with one-shot `llama.cpp` `llama-cli` invocations: each
request spawns the binary, constrains output with the same JSON-schema
grammar, parses the JSON it prints, and exits. No daemon, no socket, no
HTTP, nothing to supervise or leave running.
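The "parses the JSON it prints" step has to cope with whatever else the binary writes around the object. A minimal sketch, assuming the payload is the span from the first `{` to the last `}` in the output (`extractJSON` is a hypothetical helper; the commit's actual extraction logic is not shown here):

```go
package main

import (
	"encoding/json"
	"errors"
	"strings"
)

// extractJSON pulls a single JSON object out of mixed CLI output by taking
// the span from the first '{' to the last '}' and validating it.
func extractJSON(out string) (json.RawMessage, error) {
	start := strings.Index(out, "{")
	end := strings.LastIndex(out, "}")
	if start < 0 || end < start {
		return nil, errors.New("no JSON object in output")
	}
	candidate := out[start : end+1]
	if !json.Valid([]byte(candidate)) {
		return nil, errors.New("text between braces is not valid JSON")
	}
	return json.RawMessage(candidate), nil
}
```

This heuristic fails if stray braces appear in surrounding log text, which is one reason an HTTP transport is the more robust channel.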

The runtime is installed on demand through the system package manager
(Homebrew first — it ships an up-to-date `llama.cpp` formula on macOS
and Linux — then apt/dnf/pacman/zypper as best-effort fallbacks) when
`llama-cli` isn't already on PATH.

Changes:
- new internal/model/llamacli backend (Backend + StructuredBackend)
- new internal/runtime llama-cli resolution + package-manager install;
  drop the llamafile binary download (GGUF model management stays)
- backend resolver builds the llama-cli backend for "llama-cli" and the
  back-compat alias "llamafile-local"; default backend is now llama-cli
- remove internal/daemon and the `i daemon` subcommand; rework
  ensure/doctor/init/model self-heal around runtime+model install
- network backends (llamafile-network, ollama, openai) are unchanged

The local path no longer binds any network socket, so the loopback host
validation is gone (moot). Tests updated; new unit tests cover the
llama-cli JSON extraction and message flattening.
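Message flattening, needed because llama-cli takes a single prompt, might be sketched like this. The `role: content` line format and the names `turn` and `flatten` are assumptions for illustration; the commit does not show its actual template.

```go
package main

import "strings"

// turn is one entry of the multi-turn history.
type turn struct {
	Role    string
	Content string
}

// flatten collapses the history into one prompt string, one line per turn.
// Role boundaries survive only as text prefixes, which is the information
// loss a native messages array avoids.
func flatten(turns []turn) string {
	var b strings.Builder
	for _, t := range turns {
		b.WriteString(t.Role)
		b.WriteString(": ")
		b.WriteString(t.Content)
		b.WriteString("\n")
	}
	return b.String()
}
```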
…loop

Refines the local backend per review discussion: instead of spawning
`llama-cli` one-shot per engine step (which reloads the model every step
and flattens the multi-turn history into a single prompt), prefer a
request-scoped `llama-server` child.

The server starts lazily on the first inference of an `intent`
invocation and is held warm for the rest of it, so the engine's
tool-call loop reuses the loaded weights and KV cache. Because it speaks
the OpenAI-compatible /v1/chat/completions API, the native messages
array (system/user/assistant/tool) is sent as-is — no flattening. It is
not a daemon: bound to a private loopback port, owned by the single
invocation, killed on Close (and SIGKILL'd by the kernel via Pdeathsig
if intent dies on Linux). The one-shot `llama-cli` path remains as a
fallback when llama-server isn't installed.

- new internal/model/llamaserver: lazy-start co-process backend that
  delegates to the existing OpenAI-compatible HTTP client; Close() kills
  the process group
- runtime: manage both llama-server and llama-cli (same brew package);
  EnsureLlamaRuntime / HaveLlamaRuntime / Have*Server*/*CLI* resolvers
- backend resolver ladder: llama-server -> llama-cli -> mock; defer
  closeBackend at the intent/explain/report call sites; verbose wrapper
  forwards Close
- doctor/ensure/init/model report the unified llama.cpp runtime

Trade-off unchanged across separate invocations (a fresh `i` reloads the
model) — keeping weights warm between commands would require a resident
daemon, which is exactly what this design avoids.
CoreyRDean changed the title from "Run local inference via llama-cli, drop the server daemon" to "Run local inference via llama.cpp (request-scoped llama-server), drop the daemon" Jun 5, 2026