From b5e65ca170f3f1351a5119a347452044c39748c8 Mon Sep 17 00:00:00 2001 From: Daniel Klevebring Date: Thu, 25 Jun 2026 09:11:08 +0200 Subject: [PATCH] feat(eval): LLM-in-the-loop store-behavior eval (ADR 0025) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Recall has a deterministic offline gate; storing — the half that was actually failing — had nothing. Add a model-in-the-loop harness that measures store precision/recall: for each labeled session fixture it runs a model with an engram_store tool + the production store nudge as the policy, and checks whether the model decides to call the tool vs the should_store label. - scripts/store_eval.py: `validate` (dependency-free fixture check) and `run` (needs anthropic + key). --model parametrizes and is recorded per run under eval/store-runs/; --policy-file A/Bs nudge wordings. - scripts/store_eval_fixtures.json: 13 starter fixtures (7 store / 6 skip), incl. explicit-remember, fact-buried-in-chatter, and near-miss negatives. - ADR 0025 documents the methodology, the sync requirement with production, and why it's on-demand (token cost, non-deterministic), not a CI gate. Co-Authored-By: Claude Opus 4.8 --- CLAUDE.md | 1 + docs/adr/0025-store-behavior-eval.md | 77 ++++++++ docs/adr/README.md | 1 + scripts/store_eval.py | 282 +++++++++++++++++++++++++++ scripts/store_eval_fixtures.json | 125 ++++++++++++ 5 files changed, 486 insertions(+) create mode 100644 docs/adr/0025-store-behavior-eval.md create mode 100644 scripts/store_eval.py create mode 100644 scripts/store_eval_fixtures.json diff --git a/CLAUDE.md b/CLAUDE.md index 6d7fdfd..cb6d655 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -27,6 +27,7 @@ and recall content. See `README.md` for architecture and build instructions. - `Sources/EngramCore/Facets.swift` — pure parser splitting tags into `key:value` facets vs freeform; folds `source` into `project` (ADR 0013) - `Engram/Info.plist` — partial plist merged into the generated one; carries the Sparkle `SU*` keys (custom keys can't go through `INFOPLIST_KEY_*`) - `Engram/scripts/bundle-cli.sh` — build phase that bundles the CLI into the app +- `scripts/store_eval.py` + `scripts/store_eval_fixtures.json` — LLM-in-the-loop **store-behavior** eval (ADR 0025): runs a model with an `engram_store` tool over labeled session fixtures and reports store precision/recall; `--model` parametrizes and records the model id per run under `eval/store-runs/`. `validate` checks fixtures with no API key; `run` needs `anthropic` + `ANTHROPIC_API_KEY`. Separate from the deterministic Swift `engram-eval`; not a CI gate (costs tokens, model-dependent). - `scripts/release.sh` + `scripts/bump_version.py` — local `make release-*` flow: gate, bump, tag, push (ADR 0010) - `scripts/update_appcast.py` — prepends a release entry to `docs/appcast.xml` (run by CI; stdlib-only so the runner needs no uv) - `.github/workflows/release.yml` + `.github/ExportOptions.plist` — CI that signs, notarizes, and publishes a release (ADR 0010) diff --git a/docs/adr/0025-store-behavior-eval.md b/docs/adr/0025-store-behavior-eval.md new file mode 100644 index 0000000..664fcf7 --- /dev/null +++ b/docs/adr/0025-store-behavior-eval.md @@ -0,0 +1,77 @@ +# 25. LLM-in-the-loop eval for store behavior (does the agent save the right memories?) + +- **Status:** Accepted +- **Date:** 2026-06-25 +- **Deciders:** Daniel Klevebring, Claude +- **Relates to:** ADR 0001 (model-driven store), ADR 0021 (retrieval eval), ADR 0023 (recall metrics) + +## Context + +Recall quality has an offline gate (ADR 0021's `engram-eval`): deterministic, +free, runs in CI. **Store** quality has had nothing. And storing is the half +that was actually failing in practice — investigation of ~1800 real sessions +found stores effectively stopped after an initial burst while recall kept +firing: a mix of the model not *deciding* to save and, when it did try, fumbling +the CLI invocation (the latter fixed separately by the `engram store` +arg-robustness change). The decision half — "given a session, does the agent +correctly choose to save the durable things and skip the noise?" — is the core +risk for whether Engram accrues anything to recall. + +That question can't be answered by a deterministic harness. Storing is +model-driven (ADR 0001): it depends on an LLM reading the reflection nudge / the +`/remember` guidance and choosing to act. Measuring it requires running an +actual model. That makes the eval fundamentally different from `engram-eval`: +non-deterministic, model-dependent, and it costs API tokens. + +## Decision + +Add a separate **LLM-in-the-loop** harness, `scripts/store_eval.py` (Python + +the `anthropic` SDK, matching `scripts/embeddings_eval.py`), that measures store +**precision/recall** over labeled session fixtures. + +- **Fixtures** (`scripts/store_eval_fixtures.json`): short coding-session + transcripts, each labeled `should_store` (a durable preference / decision / + project fact / gotcha) or not (routine chatter, transient state, general + knowledge, repo-derivable facts). Starter set covers both classes incl. an + explicit "remember this", a fact buried in chatter, and near-miss negatives. +- **Mechanism**: for each fixture, call the model with an `engram_store` tool + available and the production store signal as the policy (system prompt + the + recall hook's reflection nudge appended as the final user turn), `tool_choice` + auto. Whether the model emits an `engram_store` call is its decision; compare + to the label. We don't execute the tool — we only observe the call. +- **Metrics**: store precision (of what it saved, how much was warranted), + recall (of what it should have saved, how much it did), plus F1/accuracy and + the per-fixture stored content for eyeballing quality. +- **Model is a parameter and is recorded.** `--model` (default + `claude-opus-4-8`); every `--record` run writes + `eval/store-runs/-.json` with the git sha, model, a hash of the + policy, and per-fixture results — so runs are comparable across models and + across prompt revisions. +- **The policy is the thing under test, and is A/B-able.** The default `POLICY` + in the harness mirrors production (the nudge + the `/remember` criteria); + `--policy-file` swaps in a candidate wording. This is how we measure "how good + are we at *triggering* the agent to store" as we iterate the nudge. +- A dependency-free `validate` subcommand checks the fixtures and prints the + plan without an API key, so CI / contributors can sanity-check fixtures + without spending tokens. + +## Consequences + +- We can finally measure, and regression-test, the store decision — and A/B nudge + wordings against real model behavior across Opus/Sonnet/Haiku, tracking which + model each number came from. +- **Not a CI gate.** It costs tokens and is non-deterministic, so it runs + on-demand (like the embedding-model exploration harness), not on every PR. + Numbers are a relative A/B, not an absolute benchmark; the small fixture set + means per-fixture misses matter more than the aggregate. +- **The policy must be kept in sync with production.** The harness embeds a copy + of the nudge/guidance; if the production reflection nudge (`main.swift`) or the + `/remember` skill (`Setup.swift`) changes and the harness doesn't, the eval + stops predicting production. This duplication is the accepted cost of testing + the prompt as a unit. +- **Fidelity gap.** Production stores via the `/remember` skill through Claude + Code; the eval uses a single `engram_store` tool call via the API. It measures + the *decision* faithfully but not the full skill/CLI path (the arg-form + failures are covered by the deterministic CLI tests instead). +- Token cost scales with fixtures × models × policies; keep the fixture set + small and curated, and use `--limit` for smoke runs. diff --git a/docs/adr/README.md b/docs/adr/README.md index c5c4572..f53b3a6 100644 --- a/docs/adr/README.md +++ b/docs/adr/README.md @@ -33,6 +33,7 @@ supersedes the old one (and update the old one's status to `Superseded by NNNN`) | [0021](0021-embedder-relative-recall-gate.md) | Embedder-relative recall gate, calibrated by offline eval (refines 0005's gate) | Accepted | | [0022](0022-privileged-helper-for-cli-install.md) | Privileged CLI install via a one-shot authenticated `osascript` | Accepted | | [0023](0023-session-scoped-recall-cooldown.md) | Session-scoped recall re-injection cooldown (stop re-injecting the same memory every prompt) | Accepted | +| [0025](0025-store-behavior-eval.md) | LLM-in-the-loop eval for store behavior (store precision/recall, model-parametrized) | Accepted | ## Writing a new ADR diff --git a/scripts/store_eval.py b/scripts/store_eval.py new file mode 100644 index 0000000..ecd3b5e --- /dev/null +++ b/scripts/store_eval.py @@ -0,0 +1,282 @@ +#!/usr/bin/env python3 +"""Store-behavior eval — does the agent decide to save the right memories? + +LLM-in-the-loop (ADR 0025). For each labeled session fixture, this runs a model +with an `engram_store` tool available and Engram's production store-reflection +guidance as the policy, then checks whether the model *called* the tool — +comparing its decision against the fixture's `should_store` label. Reports store +**precision / recall** and per-fixture outcomes. + +Unlike the Swift `engram-eval` (a deterministic retrieval gate), this calls a +real model: results are model- and prompt-dependent and cost API tokens. Treat +it as a *relative* A/B across models and policy wordings — not an absolute +benchmark. The model id and a hash of the policy are recorded on every run, so +runs are comparable. + +The `POLICY` below mirrors the production store signal (the recall hook's +reflection nudge + the /remember guidance). **Keep it in sync** — when the +production nudge wording changes, update it here, or the eval stops predicting +production behavior. To A/B a candidate wording, pass `--policy-file`. + +Usage: + # validate fixtures + print the plan, no API key needed: + uv run scripts/store_eval.py validate + + # run against a model (needs ANTHROPIC_API_KEY); opus-4-8 is the default: + uv run --with anthropic scripts/store_eval.py run + uv run --with anthropic scripts/store_eval.py run --model claude-sonnet-4-6 + uv run --with anthropic scripts/store_eval.py run --record # → eval/store-runs/-.json +""" +from __future__ import annotations + +import hashlib +import json +import subprocess +import sys +import time +from pathlib import Path + +FIXTURES_PATH = Path(__file__).parent / "store_eval_fixtures.json" +RUNS_DIR = Path(__file__).parent.parent / "eval" / "store-runs" +DEFAULT_MODEL = "claude-opus-4-8" + +# The policy under test: the system prompt + the final-turn reflection nudge the +# agent sees. This mirrors the production store signal (the recall hook's nudge +# in Sources/engram/main.swift + the /remember guidance in Setup.swift). The +# only deliberate divergence: production saves via the `/remember` skill, here +# the model saves via the `engram_store` tool so the harness can observe the +# decision. Keep the wording in sync with production. +POLICY = { + "system": ( + "You are an AI assistant pair-working with a developer in a coding session. " + "You have access to Engram, a long-term memory store, via the `engram_store` " + "tool. Save durable, reusable knowledge — preferences, decisions, project " + "facts, and gotchas — that would help in future sessions. Do NOT save routine " + "task chatter, transient state, general knowledge, or anything already captured " + "in the repo or git history. If nothing durable surfaced, simply don't call the " + "tool." + ), + "nudge": ( + "Engram reflection check: glance back over the recent turns. If something " + "durable surfaced — a preference, a decision, a project fact, or a gotcha you'd " + "want recalled weeks from now — save it with the engram_store tool. Only save " + "what's genuinely reusable; skip routine task chatter and anything already in " + "the repo or git history. Nothing notable? Don't call the tool." + ), +} + +ENGRAM_STORE_TOOL = { + "name": "engram_store", + "description": ( + "Save a durable memory to Engram for recall in future sessions. Call this only " + "when something genuinely reusable surfaced — a preference, decision, project " + "fact, or gotcha worth recalling weeks later." + ), + "input_schema": { + "type": "object", + "properties": { + "content": { + "type": "string", + "description": "The memory as a short Markdown note (a one-line title then the fact).", + }, + "tags": {"type": "array", "items": {"type": "string"}}, + }, + "required": ["content"], + "additionalProperties": False, + }, +} + + +def _load_fixtures() -> list[dict]: + data = json.loads(FIXTURES_PATH.read_text()) + return data["fixtures"] + + +def _policy(policy_file: str | None) -> dict: + if policy_file is None: + return POLICY + loaded = json.loads(Path(policy_file).read_text()) + if "system" not in loaded or "nudge" not in loaded: + raise SystemExit(f"policy file {policy_file} must define 'system' and 'nudge'") + return loaded + + +def _policy_hash(policy: dict) -> str: + blob = json.dumps([policy["system"], policy["nudge"], ENGRAM_STORE_TOOL], sort_keys=True) + return hashlib.sha256(blob.encode()).hexdigest()[:12] + + +def _git_sha() -> str: + try: + return subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip() + except Exception: + return "unknown" + + +def _build_messages(fixture: dict, policy: dict) -> list[dict]: + # The fixture transcript, then the reflection nudge as the final user turn — + # consecutive user turns are fine (the API merges them). + return [*fixture["messages"], {"role": "user", "content": policy["nudge"]}] + + +def validate() -> None: + """Check the fixtures are well-formed and print the run plan. No API calls.""" + fixtures = _load_fixtures() + ids = [f["id"] for f in fixtures] + if len(ids) != len(set(ids)): + raise SystemExit("duplicate fixture ids") + for f in fixtures: + for key in ("id", "should_store", "messages", "rationale"): + if key not in f: + raise SystemExit(f"fixture {f.get('id', '?')} missing '{key}'") + if not f["messages"] or f["messages"][0]["role"] != "user": + raise SystemExit(f"fixture {f['id']}: messages must be non-empty and start with a user turn") + for m in f["messages"]: + if m["role"] not in ("user", "assistant"): + raise SystemExit(f"fixture {f['id']}: bad role {m['role']!r}") + + positives = sum(1 for f in fixtures if f["should_store"]) + print(f"{len(fixtures)} fixtures — {positives} should-store, {len(fixtures) - positives} should-not") + print(f"policy hash: {_policy_hash(POLICY)} default model: {DEFAULT_MODEL}") + print("\nfixtures:") + for f in fixtures: + mark = "STORE " if f["should_store"] else "skip " + print(f" [{mark}] {f['id']}: {f['rationale']}") + print("\nfixtures valid. Run with: uv run --with anthropic scripts/store_eval.py run") + + +def run( + model: str = DEFAULT_MODEL, + limit: int | None = None, + record: bool = False, + policy_file: str | None = None, + thinking: bool = True, +) -> None: + """Run each fixture through `model` and report store precision/recall. + + model: Claude model id (default claude-opus-4-8). + limit: only run the first N fixtures (for a cheap smoke run). + record: append a run JSON under eval/store-runs/. + policy_file: JSON with {"system","nudge"} to A/B an alternative policy. + thinking: run with adaptive thinking (mirrors production agents); --no-thinking is cheaper. + """ + import anthropic # imported lazily so `validate` needs no SDK / network + + fixtures = _load_fixtures() + if limit is not None: + fixtures = fixtures[:limit] + policy = _policy(policy_file) + client = anthropic.Anthropic() + + results: list[dict] = [] + for fixture in fixtures: + kwargs: dict = { + "model": model, + "max_tokens": 4096, + "system": policy["system"], + "tools": [ENGRAM_STORE_TOOL], + "messages": _build_messages(fixture, policy), + } + if thinking: + kwargs["thinking"] = {"type": "adaptive"} + response = client.messages.create(**kwargs) + + store_calls = [ + b.input for b in response.content + if b.type == "tool_use" and b.name == "engram_store" + ] + stored = len(store_calls) > 0 + correct = stored == fixture["should_store"] + results.append({ + "id": fixture["id"], + "should_store": fixture["should_store"], + "stored": stored, + "correct": correct, + "stored_content": [c.get("content", "") for c in store_calls], + }) + + metrics = _metrics(results) + _print_report(results, metrics, model) + + if record: + RUNS_DIR.mkdir(parents=True, exist_ok=True) + stamp = time.strftime("%Y-%m-%dT%H-%M-%SZ", time.gmtime()) + out = RUNS_DIR / f"{stamp}-{model}.json" + out.write_text(json.dumps({ + "git_sha": _git_sha(), + "model": model, + "thinking": thinking, + "policy_hash": _policy_hash(policy), + "timestamp": stamp, + "metrics": metrics, + "results": results, + }, indent=2)) + print(f"\nrecorded → {out}") + + +def _metrics(results: list[dict]) -> dict: + tp = sum(1 for r in results if r["should_store"] and r["stored"]) + fp = sum(1 for r in results if not r["should_store"] and r["stored"]) + fn = sum(1 for r in results if r["should_store"] and not r["stored"]) + tn = sum(1 for r in results if not r["should_store"] and not r["stored"]) + precision = tp / (tp + fp) if (tp + fp) else 0.0 + recall = tp / (tp + fn) if (tp + fn) else 0.0 + f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0 + return { + "tp": tp, "fp": fp, "fn": fn, "tn": tn, + "precision": round(precision, 3), + "recall": round(recall, 3), + "f1": round(f1, 3), + "accuracy": round((tp + tn) / len(results), 3) if results else 0.0, + } + + +def _print_report(results: list[dict], metrics: dict, model: str) -> None: + print(f"\n=== store-behavior eval — {model} ===") + for r in results: + verdict = "ok " if r["correct"] else "MISS" + want = "store" if r["should_store"] else "skip" + got = "stored" if r["stored"] else "skipped" + print(f" [{verdict}] {r['id']:28} want={want:5} got={got}") + if r["stored"] and r["stored_content"]: + print(f" ↳ {r['stored_content'][0][:80]}") + m = metrics + print( + f"\nprecision {m['precision']} recall {m['recall']} f1 {m['f1']} " + f"accuracy {m['accuracy']} (tp={m['tp']} fp={m['fp']} fn={m['fn']} tn={m['tn']})" + ) + print("note: model/prompt-dependent — a relative A/B, not a benchmark.") + + +if __name__ == "__main__": + args = sys.argv[1:] + if not args or args[0] in ("-h", "--help"): + print(__doc__) + sys.exit(0) + command, rest = args[0], args[1:] + if command == "validate": + validate() + elif command == "run": + # Minimal flag parsing (kept dependency-free): --key value / --flag / --no-flag. + opts: dict = {} + i = 0 + while i < len(rest): + tok = rest[i] + if tok.startswith("--no-"): + opts[tok[5:].replace("-", "_")] = False + i += 1 + elif tok.startswith("--"): + key = tok[2:].replace("-", "_") + if i + 1 < len(rest) and not rest[i + 1].startswith("--"): + opts[key] = rest[i + 1] + i += 2 + else: + opts[key] = True + i += 1 + else: + i += 1 + if "limit" in opts and opts["limit"] is not True: + opts["limit"] = int(opts["limit"]) + run(**opts) + else: + raise SystemExit(f"unknown command {command!r} — use 'validate' or 'run'") diff --git a/scripts/store_eval_fixtures.json b/scripts/store_eval_fixtures.json new file mode 100644 index 0000000..87a3b76 --- /dev/null +++ b/scripts/store_eval_fixtures.json @@ -0,0 +1,125 @@ +{ + "_comment": "Labeled session fixtures for the store-behavior eval (ADR 0025). Each fixture is a short coding-session transcript plus a `should_store` label: does a durable, reusable memory genuinely surface here? Positives = preferences / decisions / project facts / gotchas worth recalling weeks later. Negatives = routine task chatter, transient state, general knowledge, or anything already in the repo/git. Keep messages realistic and concise; add fixtures as new failure modes show up in production.", + "fixtures": [ + { + "id": "pref-uv-over-pip", + "should_store": true, + "rationale": "A stated, durable personal preference — exactly what recall should surface in future sessions.", + "messages": [ + {"role": "user", "content": "Set up the project deps. And just so you know, I always use uv for Python — never pip or plain venv, in any of my repos."}, + {"role": "assistant", "content": "Got it — I'll use uv (`uv add`, `uv sync`) throughout. Running `uv sync` now."} + ] + }, + { + "id": "decision-rrf-fusion", + "should_store": true, + "rationale": "An architectural decision with a rationale — durable and reusable.", + "messages": [ + {"role": "user", "content": "Should we blend the lexical and semantic results by score or by rank?"}, + {"role": "assistant", "content": "By rank — Reciprocal Rank Fusion. Score blending is brittle because BM25 and cosine live on different scales; RRF only needs the ordering, so it's robust to that. Let's go with RRF."}, + {"role": "user", "content": "Agreed, do it that way."} + ] + }, + { + "id": "gotcha-notarization-403", + "should_store": true, + "rationale": "A hard-won, non-obvious gotcha that would waste time again if forgotten.", + "messages": [ + {"role": "user", "content": "The notarization step keeps failing with a 403."}, + {"role": "assistant", "content": "Found it: notarytool returns 403 'required agreement' until you re-accept the updated Apple Developer Program License in the portal. Once I accepted it, the same submission went through. Worth remembering — the error message doesn't hint at the license at all."} + ] + }, + { + "id": "explicit-remember", + "should_store": true, + "rationale": "An explicit user instruction to remember — must always store.", + "messages": [ + {"role": "user", "content": "Remember that our staging environment deploys from the `release` branch, not `main`."}, + {"role": "assistant", "content": "Noted — staging deploys from `release`."} + ] + }, + { + "id": "project-fact-bq-default", + "should_store": true, + "rationale": "A reusable project/infra fact not derivable from the repo in front of us.", + "messages": [ + {"role": "user", "content": "Where do BigQuery queries run by default for us?"}, + {"role": "assistant", "content": "Against the `es-dwh` project — that's the production data warehouse and our default for BQ. `es-dwh-test` is only for throwaway dev queries."}, + {"role": "user", "content": "Right, that's the convention."} + ] + }, + { + "id": "chatter-run-tests", + "should_store": false, + "rationale": "Routine task execution with no durable lesson.", + "messages": [ + {"role": "user", "content": "Run the tests."}, + {"role": "assistant", "content": "Done — `make test` passed, 97 tests green."} + ] + }, + { + "id": "repo-explain-function", + "should_store": false, + "rationale": "An explanation of code that's already in the repo — recall would be redundant with reading the file.", + "messages": [ + {"role": "user", "content": "What does `addMissingRetrievalColumns` do?"}, + {"role": "assistant", "content": "It runs during `migrate()` and adds the `session_id` column to the `retrievals` table if it's missing, then creates the supporting index. You can see it in MemoryStore.swift."} + ] + }, + { + "id": "transient-debug-step", + "should_store": false, + "rationale": "A transient debugging action; the state is gone by next session and there's no reusable takeaway.", + "messages": [ + {"role": "user", "content": "Why is the build red?"}, + {"role": "assistant", "content": "A stale `.build` cache — I ran `swift package clean` and it's green again."} + ] + }, + { + "id": "smalltalk-greeting", + "should_store": false, + "rationale": "Small talk; nothing to remember.", + "messages": [ + {"role": "user", "content": "morning! ready to keep going on the installer thing?"}, + {"role": "assistant", "content": "Morning! Yep — picking up where we left off on the install flow."} + ] + }, + { + "id": "general-knowledge", + "should_store": false, + "rationale": "General knowledge unrelated to the user/project — not a personal or project memory.", + "messages": [ + {"role": "user", "content": "Quick aside — what's the capital of Australia?"}, + {"role": "assistant", "content": "Canberra (not Sydney, which people often guess)."} + ] + }, + { + "id": "buried-fact-in-chatter", + "should_store": true, + "rationale": "A durable preference is buried in otherwise-routine chatter; the agent should still catch and store it.", + "messages": [ + {"role": "user", "content": "ok let's wrap up. push the branch and open a PR. oh and from now on default all my new services to Go 1.23, that's our standard now."}, + {"role": "assistant", "content": "Pushed and PR opened. And noted on the Go 1.23 standard for new services."} + ] + }, + { + "id": "one-off-lookup", + "should_store": false, + "rationale": "A one-off value the user asked about in passing; not a durable convention.", + "messages": [ + {"role": "user", "content": "how many memories are in the store right now?"}, + {"role": "assistant", "content": "98 active, 35 deleted."} + ] + }, + { + "id": "decision-no-helper-now", + "should_store": true, + "rationale": "A deliberate decision to defer work, with reasoning — useful context for why the code looks the way it does.", + "messages": [ + {"role": "user", "content": "Should we add the SMAppService privileged helper for one-click install?"}, + {"role": "assistant", "content": "Not now — for a one-shot symlink it's overkill (persistent root daemon + Login Items prompt). We decided to use a one-shot authenticated osascript instead and revisit the helper only if we ever need repeated silent privileged ops."}, + {"role": "user", "content": "Yeah, let's go with that decision."} + ] + } + ] +}