A business-agnostic, multi-tier local LLM agent core that teams can adopt
across domains. The valuable logic — grammar-constrained routing, an adaptive
reasoning loop, objective escalation and a deterministic policy gate — lives in
a reusable core/. A new domain is a thin usecases/<name>/ folder, never a
fork of the core (see ADR-001).
The shipped example use-case, tienda, is a WhatsApp store assistant.
This repository is the LLM plane of a deliberately connected ecosystem, and the third step of a single evolution:
- ML-MLOps Portfolio — three production ML services; the lessons were paid for here.
- ML-MLOps Production Template — those lessons encoded as a reusable, governed scaffold for tabular ML on Kubernetes.
- agent-local (this repo): the same governance philosophy (AUTO/CONSULT/STOP, eval-gated autonomy, policy-as-data, no fine-tuning yet), generalized to a new domain, local LLM agents.
The two repos are siblings with an explicit, bidirectional contract, not copies:
- agent-local reuses the template's Terraform/Kustomize when it needs cloud, and runs the template's ADR-028 day-2 maintenance lanes on its local tiers.
The shared plan ACTION_PLAN_LLM_AGENT.md governs both planes. See the template's "Local model plane" section and this repo's ADR-001.
Status: Phase 1 (read-only, fixtures). Routing quality gate PASSED (19/20) on the Tier-0 router. Code is structured for the full multi-tier stack.
Most "LLM agent" code couples the loop, prompts and business rules into one app. That doesn't scale to multiple use-cases: the safety-critical logic diverges across copies. Here, that logic is centralized and consumed by configuration:
core/                     # business-agnostic engine, single source of truth
  config.py               # UsecaseConfig loader
  schemas.py              # typed Pydantic contracts
  router.py               # Tier-0 router (GBNF-constrained JSON)
  tiers.py                # tier clients (endpoints injected from config)
  tools.py                # ToolRegistry (per-use-case namespaces)
  retrieval.py            # BM25 index + semantic_retrieval factory
  policy.py               # deterministic policy gate (rules are data)
  agent.py                # the 7-station loop
  __init__.py             # load_agent(name)
usecases/
  tienda/                 # example use-case (config + tools + data + prompts + evals)
    config.yaml           # endpoints, allowed_intents, policy rules, prompt templates
    tools.py              # build_registry(config) -> ToolRegistry
    prompts/ grammars/ data/ policies/ budgets.yaml evals/sets/
app/
  main.py                 # FastAPI surface; loads a use-case via AGENT_USECASE
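The tree lists router.py as a GBNF-constrained Tier-0 router. As a rough sketch of what that constraint can look like, the snippet below builds a small grammar from a list of intents and wraps it in a request body for llama-server's /completion endpoint, which accepts a grammar field. The helper names, the prompt wording, and the reduced output shape (intent and confidence only) are assumptions for illustration; they are not the contents of core/router.py or grammars/route.gbnf.

```python
import json


def build_route_grammar(intents: list[str]) -> str:
    """GBNF that only admits {"intent": <allowed intent>, "confidence": <0.00-1.00>}."""
    # Each alternative is a GBNF literal that matches the intent wrapped in JSON quotes.
    intent_alts = " | ".join(f'"\\"{name}\\""' for name in intents)
    return "\n".join([
        r'root ::= "{\"intent\": " intent ", \"confidence\": " conf "}"',
        f"intent ::= {intent_alts}",
        'conf ::= "0." [0-9] [0-9] | "1.00"',
    ])


def build_route_request(text: str, intents: list[str]) -> dict:
    """Request body for POST /completion (hypothetical prompt wording)."""
    return {
        "prompt": f"Classify the customer message.\nMessage: {text}\nJSON: ",
        "grammar": build_route_grammar(intents),
        "n_predict": 64,
        "temperature": 0.0,
    }


def parse_route(content: str, intents: list[str]) -> dict:
    """Parse the model output and re-check the intent in code, not only in the grammar."""
    route = json.loads(content)
    if route["intent"] not in intents:
        raise ValueError(f"unexpected intent: {route['intent']!r}")
    return route
```

With a router running as in the quickstart, the payload would be posted to http://127.0.0.1:8091/completion and the content field of the response passed to parse_route.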
Customer ─▶ FastAPI ─▶ Agent.handle()
│
1. route (Tier 0, GBNF) → intent / tier / risk / confidence
2. plan (Tier N) → list of tool calls
3. tools (APP executes) → observations
4. reflect (conditional) → only on tool-failure or risk ≥ medium
5. generate (Tier N) → draft answer
6. critic (Tier N/N+1) → verify against observations (risk ≥ medium)
7. policy (deterministic) → MANDATORY gate; no response bypasses it
8. finalize → answer + metrics
Adaptive depth: simple smalltalk goes plan → tools → policy → final
without paying for reflection/critique.
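To make the adaptive-depth rule concrete, here is a minimal sketch of station selection under the conditions listed in the diagram: reflect on tool failure or risk ≥ medium, critic on risk ≥ medium, policy always. The risk labels and the function name are assumptions; the real loop in core/agent.py may order or gate stations differently.

```python
# Assumed risk labels; the actual schema may use different values.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


def stations_for(risk: str, tool_failed: bool) -> list[str]:
    """Return the stations that run for one request, in order."""
    elevated = RISK_ORDER[risk] >= RISK_ORDER["medium"]
    steps = ["route", "plan", "tools"]
    if tool_failed or elevated:
        steps.append("reflect")  # conditional: tool failure or risk >= medium
    steps.append("generate")
    if elevated:
        steps.append("critic")  # conditional: risk >= medium
    steps += ["policy", "finalize"]  # the policy gate is never skipped
    return steps


print(stations_for("low", tool_failed=False))
print(stations_for("high", tool_failed=False))
```

In a real loop the tool_failed flag is only known after the tools station has run, so the decision would be made incrementally; the list form here is only meant to show which stations a request pays for.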
Objective escalation (in code, never in the prompt): confidence < 0.70
bumps a tier; a critic rejection bumps once; Tier-3 requires explicit budget
permission.
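The escalation rules above are simple enough to express as a pure function. The sketch below is one possible reading: the 0.70 threshold comes from the text, while the function name, the argument names, and the choice to let the two bumps add up are assumptions.

```python
CONFIDENCE_FLOOR = 0.70  # threshold quoted in the text


def next_tier(
    tier: int,
    confidence: float,
    critic_rejected: bool = False,
    critic_bump_used: bool = False,
    tier3_allowed: bool = False,
) -> int:
    """Decide the tier for the next attempt from objective signals only."""
    target = tier
    if confidence < CONFIDENCE_FLOOR:
        target += 1  # low router confidence bumps a tier
    if critic_rejected and not critic_bump_used:
        target += 1  # a critic rejection bumps at most once
    ceiling = 3 if tier3_allowed else 2  # Tier-3 needs explicit budget permission
    # Never de-escalate, and never exceed the permitted ceiling.
    return max(tier, min(target, ceiling))


print(next_tier(0, confidence=0.55))                      # low confidence
print(next_tier(2, confidence=0.55))                      # capped without Tier-3 permission
print(next_tier(2, confidence=0.55, tier3_allowed=True))  # permitted overflow
```

Keeping this in code rather than in the prompt means the decision can be unit-tested and does not depend on the model following an instruction.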
- Python 3.11+
- A llama.cpp llama-server build and a GGUF router model (Tier 0).
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]" # or: pip install -r requirements-dev.txt
llama-server -m /path/to/router-model.gguf --port 8091 -ngl 99 -c 8192 --host 127.0.0.1
# Tests (no model required)
pytest
# Routing eval (gate: >= 18/20 intent accuracy)
python evals/run.py 01_intent.jsonl --usecase tienda
# Dev API
AGENT_USECASE=tienda python -m app.main
curl -X POST http://localhost:8000/dev/message \
-H "Content-Type: application/json" \
-d '{"text": "tienen coca de 600 fria?"}'
cp .env.example .env # set MODELS_DIR to your host model directory
docker compose up --build
Models are mounted as a read-only volume, never baked into the image.
usecases/<name>/
├── __init__.py # from .tools import build_registry
├── config.yaml # endpoints, allowed_intents, policy rules, prompts
├── prompts/router.md
├── grammars/route.gbnf
├── tools.py # build_registry(config) -> ToolRegistry
├── data/ # fixtures (Phase 1) or API clients (Phase 2)
├── policies/*.md # BM25-indexed docs
├── budgets.yaml
└── evals/sets/*.jsonl
Then: AGENT_USECASE=<name> python -m app.main. See the full authoring guide, docs/usecases.md (contract, consumption modes, bring-your-own-models), and CONTRIBUTING.md.
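For orientation, a use-case's tools.py might look roughly like the sketch below. The ToolRegistry class here is a stand-in written for this example, since the real interface lives in core/tools.py and may differ; the config keys (name, fixtures) and the stock_lookup tool are likewise hypothetical.

```python
from typing import Any, Callable


class ToolRegistry:
    """Stand-in for core.tools.ToolRegistry; the real interface may differ."""

    def __init__(self, namespace: str) -> None:
        self.namespace = namespace
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        # Tools are stored under a per-use-case namespace to avoid collisions.
        self._tools[f"{self.namespace}.{name}"] = fn

    def call(self, qualified_name: str, **kwargs: Any) -> Any:
        return self._tools[qualified_name](**kwargs)


def build_registry(config: dict) -> ToolRegistry:
    """Entry point the core would import from usecases/<name>/tools.py."""
    registry = ToolRegistry(namespace=config["name"])
    catalog = config.get("fixtures", {})  # Phase 1: read-only fixture data

    def stock_lookup(sku: str) -> dict:
        # Read-only: reports availability, never changes the catalog.
        return {"sku": sku, "in_stock": catalog.get(sku, 0) > 0}

    registry.register("stock_lookup", stock_lookup)
    return registry


registry = build_registry({"name": "bakery", "fixtures": {"bread": 3}})
print(registry.call("bakery.stock_lookup", sku="bread"))
```

The point of the shape is that a new domain only supplies data and tool functions; routing, escalation and the policy gate stay in the core.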
| Phase | Gate | Status |
|---|---|---|
| F0 | Tier-0 router speed ≥ 25 tok/s | ✅ (see bench/RESULTS.md) |
| F1 | Routing intent accuracy ≥ 18/20 | ✅ 20/20 |
| F1 | All tools read-only (order_create dry-run) | ✅ |
| F1 | Deterministic policy gate enforced | ✅ |
| F2.0 | ExecutiveController + per-tier circuit breaker | ✅ |
| F2.0 | Tier-client retry/backoff (transient blips ≠ tier failure) | ✅ |
| F1.6 | Latency-budget enforced (safe degrade past deadline) | ✅ |
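The two F2.0 rows interact: retries absorb transient blips, and only an exhausted retry budget counts as a tier failure toward the circuit breaker. A minimal sketch of that interaction follows, with assumed names, thresholds and delays; it leaves out the half-open/reset behaviour a production breaker would need and is not the ExecutiveController itself.

```python
import time
from typing import Any, Callable


class TierUnavailable(Exception):
    """Raised when the breaker is open or the retry budget is exhausted."""


class TierBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        retries: int = 2,
        base_delay: float = 0.1,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.retries = retries
        self.base_delay = base_delay
        self.sleep = sleep  # injectable so tests do not actually wait
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(self, fn: Callable[[], Any]) -> Any:
        if self.open:
            raise TierUnavailable("circuit open")
        for attempt in range(self.retries + 1):
            try:
                result = fn()
            except ConnectionError:
                if attempt < self.retries:
                    # Exponential backoff between attempts.
                    self.sleep(self.base_delay * 2 ** attempt)
            else:
                self.failures = 0  # a success clears the failure count
                return result
        self.failures += 1  # only exhausted retries count as a tier failure
        raise TierUnavailable("retries exhausted")
```

A caller would keep one TierBreaker per tier and treat TierUnavailable as the signal to degrade or escalate.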
- No fine-tuning at this stage — routing + prompts + retrieval.
- The model never mutates critical state without the policy gate — enforced structurally by the fail-closed tool capability contract (ADR-006).
- Every lane needs an eval harness before increasing autonomy.
- The simplest loop that works.
- Inventory/price/stock are never held in model memory — always live tools.
- Local-first; cloud only as explicit, budgeted overflow.
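Two of these principles, rules as data and fail-closed behaviour, can be illustrated together. The sketch below assumes a very small rule shape (a tool name and an allow flag) and hypothetical names throughout; the actual rule schema is defined by the use-case config and ADR-003/ADR-006, not by this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


# Hypothetical rule data; in the repo, rules would come from config rather than code.
RULES = [
    {"tool": "stock_lookup", "allow": True},
    {"tool": "order_create", "allow": False},  # mutating tool stays blocked
]


def policy_gate(tool_calls: list[str], rules: list[dict]) -> Decision:
    """Deterministic check: every requested tool must match an allowing rule."""
    by_tool = {rule["tool"]: rule for rule in rules}
    for name in tool_calls:
        rule = by_tool.get(name)
        if rule is None:
            # Fail closed: a tool nobody wrote a rule for is denied, not allowed.
            return Decision(False, f"no rule for {name}")
        if not rule["allow"]:
            return Decision(False, f"{name} denied by policy")
    return Decision(True, "all tool calls permitted")


print(policy_gate(["stock_lookup"], RULES))
print(policy_gate(["stock_lookup", "delete_catalog"], RULES))
```

Because the gate is plain code over plain data, the same inputs always produce the same decision, and changing a rule is a reviewed data change instead of a prompt edit.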
- Phase 1 — Skeleton ✅ (this): core + use-case, routing gate, policy gate, Docker.
- Phase 2 — executive controller, versioned YAML policies, verifier pass, 10 eval sets, SQLite queue + sagas for multi-day flows.
- Phase 3 — telemetry (PII-redacted), shadow mode, retrieval growth loop.
- Phase 4 — QLoRA (strategic gate; requires ≥4 weeks of logs + a new ADR).
- ADR-001 — reusable platform, not a copy template
- ADR-002 — calibrated infrastructure
- ADR-003 — policy rules as versioned data
- ADR-004 — cross-tier verification
- ADR-005 — decision telemetry as a contract
- ADR-006 — fail-closed tool capability contract
- ADR-007 — structured tool-calling contract
- CHANGELOG.md — version history
- CONTRIBUTING.md — dev setup, adding use-cases, quality gates
- SECURITY.md — security model and reporting
- bench/RESULTS.md: benchmark + routing gate evidence