Run local inference via llama.cpp (request-scoped llama-server), drop the daemon by CoreyRDean · Pull Request #36 · CoreyRDean/intent

CoreyRDean · 2026-06-05T20:17:43Z

What

Replaces the server-daemon local-inference path with local llama.cpp, installed on demand via the system package manager. Local inference is now backed by a request-scoped llama-server child process — warm across the engine's tool-call loop, killed when the invocation exits, with a one-shot llama-cli fallback.

Previously local models ran as a long-lived llamafile --server supervised by the intentd daemon (a persistent background process + control socket). That whole layer is gone.

Architecture

For one i invocation:

- On the first inference, a llama-server child is started on a private loopback port and held warm; subsequent tool-call steps reuse the loaded weights + KV cache (no per-step model reload).
- It speaks the OpenAI-compatible /v1/chat/completions API, so the native messages array (system/user/assistant/tool) is sent as-is, with no flattening of multi-turn history.
- It's not a daemon: bound to loopback, owned by the single invocation, killed on Close() (deferred at the call sites), and Pdeathsig=SIGKILL'd by the kernel on Linux if intent dies. Nothing persists between commands; no socket to manage.
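The lazy-start lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `coProcess`, `ensure`, `freeLoopbackPort`, and the injected `start`/`stop` hooks are hypothetical names, and the real backend would launch llama-server where the sketch calls `start`.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// freeLoopbackPort asks the kernel for an unused TCP port on 127.0.0.1.
// The listener is closed right away, so a small race with other processes
// remains before the child binds the port.
func freeLoopbackPort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

// coProcess is a hypothetical stand-in for the llama-server backend.
// start and stop are injected so the sketch runs without llama.cpp installed.
type coProcess struct {
	once    sync.Once
	port    int
	err     error
	started bool
	start   func(port int) error
	stop    func() error
}

// ensure starts the child on first use; later calls reuse it.
func (c *coProcess) ensure() (int, error) {
	c.once.Do(func() {
		c.port, c.err = freeLoopbackPort()
		if c.err != nil {
			return
		}
		if c.err = c.start(c.port); c.err == nil {
			c.started = true
		}
	})
	return c.port, c.err
}

// Close is safe to call even if no inference ever ran.
func (c *coProcess) Close() error {
	if !c.started {
		return nil
	}
	c.started = false
	return c.stop()
}

func main() {
	starts := 0
	c := &coProcess{
		start: func(int) error { starts++; return nil },
		stop:  func() error { return nil },
	}
	defer c.Close()
	for step := 0; step < 3; step++ { // three tool-call steps, one start
		if _, err := c.ensure(); err != nil {
			fmt.Println("start failed:", err)
			return
		}
	}
	fmt.Println("child starts:", starts)
}
```

The sketch assumes a single goroutine calls `Close`; a real backend would likely guard `started` with a mutex.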

Resolution ladder in the backend builder: llama-server → llama-cli (one-shot fallback) → mock (when nothing's installed yet, so a fresh install doesn't hard-fail).

Why this shape

"Efficient + no flattening + no server" splits by scope. Keeping weights warm within an invocation (across tool steps) is fully solved here without a daemon. Keeping them warm across separate invocations fundamentally requires a resident process — i.e. a daemon — which is exactly what this avoids. The request-scoped child is the sweet spot: robust clean-JSON transport, warm tool loop, no background state.

Components

- internal/model/llamaserver: lazy-start co-process backend (Backend/StructuredBackend/io.Closer); delegates to the existing OpenAI-compatible HTTP client; Close() kills the process group. Platform procAttr (Pdeathsig on Linux).
- internal/model/llamacli: one-shot fallback backend (kept; flattens, since llama-cli takes a single turn).
- internal/runtime: manages both llama-server and llama-cli (same brew install llama.cpp); EnsureLlamaRuntime / HaveLlamaRuntime / resolvers. Dropped the llamafile-binary download; GGUF model management unchanged.
- internal/daemon + i daemon subcommand: removed. ensure/doctor/init/model self-heal now revolves around runtime + model install.
- Network backends (llamafile-network, ollama, openai) untouched.

Tool calls

Preserved. The engine's tool-call loop is backend-agnostic and the JSON-schema grammar still includes the tool_call branch (the full tool enum). With the co-process, the multi-turn history reaches the model as a native role-tagged messages array, just as it did with the old llamafile server.

Tests

- Unit tests for the co-process helpers (port allocation, capped log buffer, cache identity, safe Close before start) and the one-shot JSON extraction / flattening.
- Updated CLI tests (backend fallback, doctor, config) for the daemon-free world.
- gofmt/go vet/go test ./... pass; cross-compiles on linux+darwin / amd64+arm64; binary smoke-tested (i doctor reports the llama.cpp runtime; i daemon is no longer a subcommand). Prior commit's CI run was fully green.
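One of the helpers listed above, the capped log buffer, could plausibly be an io.Writer that keeps only the tail of the child's output. The sketch below is an assumption about its shape; `cappedBuffer` and its fields are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// cappedBuffer keeps at most max bytes, discarding the oldest output first.
type cappedBuffer struct {
	mu  sync.Mutex
	max int
	buf []byte
}

// Write appends p and trims from the front; it never reports a short write.
func (c *cappedBuffer) Write(p []byte) (int, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.buf = append(c.buf, p...)
	if over := len(c.buf) - c.max; over > 0 {
		copy(c.buf, c.buf[over:]) // shift the newest bytes to the front
		c.buf = c.buf[:c.max]
	}
	return len(p), nil
}

// String returns the retained tail of everything written so far.
func (c *cappedBuffer) String() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return string(c.buf)
}

func main() {
	logs := &cappedBuffer{max: 16}
	fmt.Fprint(logs, "loading model weights... ")
	fmt.Fprint(logs, "server listening")
	fmt.Println(logs.String())
}
```

Such a buffer can be assigned to the child's Stderr so that a startup failure can quote the last lines of output without unbounded memory growth.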

Verification note

I could not run a true end-to-end model call in this sandbox (no llama-server/GGUF installed). One compatibility item to confirm on a real machine: the HTTP client sends the grammar as response_format.schema (a llama.cpp extension llamafile honored). Upstream llama-server shares that codebase and should accept it; worth a live check.

https://claude.ai/code/session_01Kg6sPc8a48ivHXjpxmwDAe

Local models previously ran as a long-lived `llamafile --server` subprocess supervised by an `intentd` daemon, with the CLI talking to it over a loopback OpenAI-compatible HTTP endpoint. This replaces that whole path with one-shot `llama.cpp` `llama-cli` invocations: each request spawns the binary, constrains output with the same JSON-schema grammar, parses the JSON it prints, and exits. No daemon, no socket, no HTTP, nothing to supervise or leave running.

The runtime is installed on demand through the system package manager (Homebrew first, since it ships an up-to-date `llama.cpp` formula on macOS and Linux, then apt/dnf/pacman/zypper as best-effort fallbacks) when `llama-cli` isn't already on PATH.

Changes:
- new internal/model/llamacli backend (Backend + StructuredBackend)
- new internal/runtime llama-cli resolution + package-manager install; drop the llamafile binary download (GGUF model management stays)
- backend resolver builds the llama-cli backend for "llama-cli" and the back-compat alias "llamafile-local"; default backend is now llama-cli
- remove internal/daemon and the `i daemon` subcommand; rework ensure/doctor/init/model self-heal around runtime+model install
- network backends (llamafile-network, ollama, openai) are unchanged

The local path no longer binds any network socket, so the loopback host validation is gone (moot). Tests updated; new unit tests cover the llama-cli JSON extraction and message flattening.
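The flattening and JSON-extraction steps of the one-shot path can be sketched as follows. This is an illustration only: `flatten`, `extractJSON`, the `role: content` prompt layout, and the sample output are assumptions, not the llamacli package's actual format.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

type message struct {
	Role    string
	Content string
}

// flatten collapses a multi-turn history into one prompt string,
// since a one-shot CLI call takes a single turn.
func flatten(msgs []message) string {
	var b strings.Builder
	for _, m := range msgs {
		fmt.Fprintf(&b, "%s: %s\n", m.Role, m.Content)
	}
	return b.String()
}

// extractJSON returns the first complete JSON object found in out,
// skipping any banner or log text printed around it.
func extractJSON(out string) (json.RawMessage, error) {
	for i := strings.IndexByte(out, '{'); i >= 0; {
		var raw json.RawMessage
		if err := json.NewDecoder(strings.NewReader(out[i:])).Decode(&raw); err == nil {
			return raw, nil
		}
		next := strings.IndexByte(out[i+1:], '{')
		if next < 0 {
			break
		}
		i += next + 1 // retry from the next opening brace
	}
	return nil, errors.New("no JSON object found")
}

func main() {
	fmt.Print(flatten([]message{{"system", "be brief"}, {"user", "list files"}}))
	raw, err := extractJSON("loading model {build info}\n{\"command\":\"ls\"}\n> ")
	fmt.Println(string(raw), err)
}
```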

…loop

Refines the local backend per review discussion: instead of spawning `llama-cli` one-shot per engine step (which reloads the model every step and flattens the multi-turn history into a single prompt), prefer a request-scoped `llama-server` child. The server starts lazily on the first inference of an `intent` invocation and is held warm for the rest of it, so the engine's tool-call loop reuses the loaded weights and KV cache. Because it speaks the OpenAI-compatible /v1/chat/completions API, the native messages array (system/user/assistant/tool) is sent as-is, with no flattening.

It is not a daemon: bound to a private loopback port, owned by the single invocation, killed on Close (and SIGKILL'd by the kernel via Pdeathsig if intent dies on Linux). The one-shot `llama-cli` path remains as a fallback when llama-server isn't installed.

- new internal/model/llamaserver: lazy-start co-process backend that delegates to the existing OpenAI-compatible HTTP client; Close() kills the process group
- runtime: manage both llama-server and llama-cli (same brew package); EnsureLlamaRuntime / HaveLlamaRuntime / Have*Server*/*CLI* resolvers
- backend resolver ladder: llama-server -> llama-cli -> mock; defer closeBackend at the intent/explain/report call sites; verbose wrapper forwards Close
- doctor/ensure/init/model report the unified llama.cpp runtime

Trade-off unchanged across separate invocations (a fresh `i` reloads the model): keeping weights warm between commands would require a resident daemon, which is exactly what this design avoids.

claude added 3 commits June 5, 2026 20:17

docs(readme): describe the request-scoped llama-server co-process

9aadab5

CoreyRDean changed the title ~~Run local inference via llama-cli, drop the server daemon~~ Run local inference via llama.cpp (request-scoped llama-server), drop the daemon Jun 5, 2026

CoreyRDean wants to merge 3 commits into main from claude/llama-cli-local-inference-NaJIq
