Run local inference via llama.cpp (request-scoped llama-server), drop the daemon #36
Draft
CoreyRDean wants to merge 3 commits into
Local models previously ran as a long-lived `llamafile --server` subprocess supervised by an `intentd` daemon, with the CLI talking to it over a loopback OpenAI-compatible HTTP endpoint. This replaces that whole path with one-shot `llama.cpp` `llama-cli` invocations: each request spawns the binary, constrains output with the same JSON-schema grammar, parses the JSON it prints, and exits. No daemon, no socket, no HTTP, nothing to supervise or leave running.

The runtime is installed on demand through the system package manager when `llama-cli` isn't already on PATH: Homebrew first (it ships an up-to-date `llama.cpp` formula on macOS and Linux), then apt/dnf/pacman/zypper as best-effort fallbacks.

Changes:
- new internal/model/llamacli backend (Backend + StructuredBackend)
- new internal/runtime llama-cli resolution + package-manager install; drop the llamafile binary download (GGUF model management stays)
- backend resolver builds the llama-cli backend for "llama-cli" and the back-compat alias "llamafile-local"; default backend is now llama-cli
- remove internal/daemon and the `i daemon` subcommand; rework ensure/doctor/init/model self-heal around runtime+model install
- network backends (llamafile-network, ollama, openai) are unchanged

The local path no longer binds any network socket, so the loopback host validation is gone (moot). Tests updated; new unit tests cover the llama-cli JSON extraction and message flattening.
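To illustrate the "parses the JSON it prints" step of the one-shot path, here is a minimal sketch of pulling the first JSON object out of mixed `llama-cli` output. The helper name `extractJSON` and the sample output in `main` are hypothetical and are not taken from the PR; the actual extraction logic in `internal/model/llamacli` may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// extractJSON returns the first complete JSON object found in out.
// Hypothetical helper: it tries each '{' as a candidate start and lets
// encoding/json decide where the value ends, so log lines before the
// object and trailing text after it are ignored.
func extractJSON(out string) (json.RawMessage, error) {
	for i := 0; i < len(out); i++ {
		if out[i] != '{' {
			continue
		}
		dec := json.NewDecoder(strings.NewReader(out[i:]))
		var raw json.RawMessage
		if err := dec.Decode(&raw); err == nil {
			return raw, nil
		}
	}
	return nil, errors.New("no JSON object found in output")
}

func main() {
	// Assumed shape of noisy CLI output; the field names are made up.
	out := "loading model...\n{\"kind\":\"command\",\"args\":[\"ls\",\"-la\"]}\n[end of text]\n"
	raw, err := extractJSON(out)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(raw))
}
```

Scanning for a decodable object rather than assuming the output is pure JSON keeps the parser tolerant of banner or log lines the binary may print around the grammar-constrained result.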
…loop

Refines the local backend per review discussion: instead of spawning `llama-cli` one-shot per engine step (which reloads the model every step and flattens the multi-turn history into a single prompt), prefer a request-scoped `llama-server` child. The server starts lazily on the first inference of an `intent` invocation and is held warm for the rest of it, so the engine's tool-call loop reuses the loaded weights and KV cache. Because it speaks the OpenAI-compatible /v1/chat/completions API, the native messages array (system/user/assistant/tool) is sent as-is, with no flattening.

It is not a daemon: bound to a private loopback port, owned by the single invocation, killed on Close (and SIGKILL'd by the kernel via Pdeathsig if intent dies on Linux). The one-shot `llama-cli` path remains as a fallback when llama-server isn't installed.

- new internal/model/llamaserver: lazy-start co-process backend that delegates to the existing OpenAI-compatible HTTP client; Close() kills the process group
- runtime: manage both llama-server and llama-cli (same brew package); EnsureLlamaRuntime / HaveLlamaRuntime / Have*Server*/*CLI* resolvers
- backend resolver ladder: llama-server -> llama-cli -> mock; defer closeBackend at the intent/explain/report call sites; verbose wrapper forwards Close
- doctor/ensure/init/model report the unified llama.cpp runtime

Trade-off unchanged across separate invocations (a fresh `i` reloads the model): keeping weights warm between commands would require a resident daemon, which is exactly what this design avoids.
What
Replaces the server-daemon local-inference path with local llama.cpp, installed on demand via the system package manager. Local inference is now backed by a request-scoped `llama-server` child process: warm across the engine's tool-call loop, killed when the invocation exits, with a one-shot `llama-cli` fallback.

Previously local models ran as a long-lived `llamafile --server` supervised by the `intentd` daemon (a persistent background process + control socket). That whole layer is gone.

Architecture
For one `i` invocation:
- On the first inference, a `llama-server` child is started on a private loopback port and held warm; subsequent tool-call steps reuse the loaded weights + KV cache (no per-step model reload).
- It speaks the OpenAI-compatible `/v1/chat/completions` API, so the native messages array (system/user/assistant/tool) is sent as-is, with no flattening of multi-turn history.
- It is killed on `Close()` (deferred at the call sites), and `Pdeathsig=SIGKILL`'d by the kernel on Linux if `intent` dies. Nothing persists between commands; no socket to manage.

Resolution ladder in the backend builder: `llama-server` → `llama-cli` (one-shot fallback) → `mock` (when nothing's installed yet, so a fresh install doesn't hard-fail).

Why this shape
"Efficient + no flattening + no server" splits by scope. Keeping weights warm within an invocation (across tool steps) is fully solved here without a daemon. Keeping them warm across separate invocations fundamentally requires a resident process — i.e. a daemon — which is exactly what this avoids. The request-scoped child is the sweet spot: robust clean-JSON transport, warm tool loop, no background state.
Components
- `internal/model/llamaserver`: lazy-start co-process backend (`Backend` / `StructuredBackend` / `io.Closer`); delegates to the existing OpenAI-compatible HTTP client; `Close()` kills the process group. Platform `procAttr` (Pdeathsig on Linux).
- `internal/model/llamacli`: one-shot fallback backend (kept; flattens, since `llama-cli` takes a single turn).
- `internal/runtime`: manages both `llama-server` and `llama-cli` (same `brew install llama.cpp`); `EnsureLlamaRuntime` / `HaveLlamaRuntime` / resolvers. Dropped the llamafile-binary download; GGUF model management unchanged.
- `internal/daemon` + `i daemon` subcommand: removed. `ensure`/`doctor`/`init`/`model` self-heal now revolves around runtime + model install.
- Network backends (`llamafile-network`, `ollama`, `openai`) untouched.

Tool calls
Preserved. The engine's tool-call loop is backend-agnostic and the JSON-schema grammar still includes the `tool_call` branch (the full tool enum). With the co-process the multi-turn history reaches the model as a native role-tagged messages array, identical to the old server.

Tests
- Unit tests cover the `llama-server` co-process backend (including `Close` before start) and the one-shot JSON extraction / flattening.
- `gofmt` / `go vet` / `go test ./...` pass; cross-compiles on linux+darwin / amd64+arm64; binary smoke-tested (`i doctor` reports the llama.cpp runtime; `i daemon` is no longer a subcommand). Prior commit's CI run was fully green.

Verification note
I could not run a true end-to-end model call in this sandbox (no `llama-server`/GGUF installed). One compatibility item to confirm on a real machine: the HTTP client sends the grammar as `response_format.schema` (a llama.cpp extension llamafile honored). Upstream `llama-server` shares that codebase and should accept it; worth a live check.

https://claude.ai/code/session_01Kg6sPc8a48ivHXjpxmwDAe