dakl · dakl · Jun 25, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -27,6 +27,7 @@ and recall content. See `README.md` for architecture and build instructions.
 - `Sources/EngramCore/Facets.swift` — pure parser splitting tags into `key:value` facets vs freeform; folds `source` into `project` (ADR 0013)
 - `Engram/Info.plist` — partial plist merged into the generated one; carries the Sparkle `SU*` keys (custom keys can't go through `INFOPLIST_KEY_*`)
 - `Engram/scripts/bundle-cli.sh` — build phase that bundles the CLI into the app
+- `scripts/store_eval.py` + `scripts/store_eval_fixtures.json` — LLM-in-the-loop **store-behavior** eval (ADR 0025): runs a model with an `engram_store` tool over labeled session fixtures and reports store precision/recall; `--model` parametrizes and records the model id per run under `eval/store-runs/`. `validate` checks fixtures with no API key; `run` needs `anthropic` + `ANTHROPIC_API_KEY`. Separate from the deterministic Swift `engram-eval`; not a CI gate (costs tokens, model-dependent).
 - `scripts/release.sh` + `scripts/bump_version.py` — local `make release-*` flow: gate, bump, tag, push (ADR 0010)
 - `scripts/update_appcast.py` — prepends a release entry to `docs/appcast.xml` (run by CI; stdlib-only so the runner needs no uv)
 - `.github/workflows/release.yml` + `.github/ExportOptions.plist` — CI that signs, notarizes, and publishes a release (ADR 0010)

diff --git a/docs/adr/0025-store-behavior-eval.md b/docs/adr/0025-store-behavior-eval.md
@@ -0,0 +1,77 @@
+# 25. LLM-in-the-loop eval for store behavior (does the agent save the right memories?)
+
+- **Status:** Accepted
+- **Date:** 2026-06-25
+- **Deciders:** Daniel Klevebring, Claude
+- **Relates to:** ADR 0001 (model-driven store), ADR 0021 (retrieval eval), ADR 0023 (recall metrics)
+
+## Context
+
+Recall quality has an offline gate (ADR 0021's `engram-eval`): deterministic,
+free, runs in CI. **Store** quality has had nothing. And storing is the half
+that was actually failing in practice — investigation of ~1800 real sessions
+found stores effectively stopped after an initial burst while recall kept
+firing: a mix of the model not *deciding* to save and, when it did try, fumbling
+the CLI invocation (the latter fixed separately by the `engram store`
+arg-robustness change). The decision half — "given a session, does the agent
+correctly choose to save the durable things and skip the noise?" — is the core
+risk for whether Engram accrues anything to recall.
+
+That question can't be answered by a deterministic harness. Storing is
+model-driven (ADR 0001): it depends on an LLM reading the reflection nudge / the
+`/remember` guidance and choosing to act. Measuring it requires running an
+actual model. That makes the eval fundamentally different from `engram-eval`:
+non-deterministic, model-dependent, and it costs API tokens.
+
+## Decision
+
+Add a separate **LLM-in-the-loop** harness, `scripts/store_eval.py` (Python +
+the `anthropic` SDK, matching `scripts/embeddings_eval.py`), that measures store
+**precision/recall** over labeled session fixtures.
+
+- **Fixtures** (`scripts/store_eval_fixtures.json`): short coding-session
+  transcripts, each labeled `should_store` (a durable preference / decision /
+  project fact / gotcha) or not (routine chatter, transient state, general
+  knowledge, repo-derivable facts). Starter set covers both classes incl. an
+  explicit "remember this", a fact buried in chatter, and near-miss negatives.
+- **Mechanism**: for each fixture, call the model with an `engram_store` tool
+  available and the production store signal as the policy (system prompt + the
+  recall hook's reflection nudge appended as the final user turn), `tool_choice`
+  auto. Whether the model emits an `engram_store` call is its decision; compare
+  to the label. We don't execute the tool — we only observe the call.
+- **Metrics**: store precision (of what it saved, how much was warranted),
+  recall (of what it should have saved, how much it did), plus F1/accuracy and
+  the per-fixture stored content for eyeballing quality.
+- **Model is a parameter and is recorded.** `--model` (default
+  `claude-opus-4-8`); every `--record` run writes
+  `eval/store-runs/<ts>-<model>.json` with the git sha, model, a hash of the
+  policy, and per-fixture results — so runs are comparable across models and
+  across prompt revisions.
+- **The policy is the thing under test, and is A/B-able.** The default `POLICY`
+  in the harness mirrors production (the nudge + the `/remember` criteria);
+  `--policy-file` swaps in a candidate wording. This is how we measure "how good
+  are we at *triggering* the agent to store" as we iterate the nudge.
+- A dependency-free `validate` subcommand checks the fixtures and prints the
+  plan without an API key, so CI / contributors can sanity-check fixtures
+  without spending tokens.
+
+## Consequences
+
+- We can finally measure, and regression-test, the store decision — and A/B nudge
+  wordings against real model behavior across Opus/Sonnet/Haiku, tracking which
+  model each number came from.
+- **Not a CI gate.** It costs tokens and is non-deterministic, so it runs
+  on-demand (like the embedding-model exploration harness), not on every PR.
+  Numbers are a relative A/B, not an absolute benchmark; the small fixture set
+  means per-fixture misses matter more than the aggregate.
+- **The policy must be kept in sync with production.** The harness embeds a copy
+  of the nudge/guidance; if the production reflection nudge (`main.swift`) or the
+  `/remember` skill (`Setup.swift`) changes and the harness doesn't, the eval
+  stops predicting production. This duplication is the accepted cost of testing
+  the prompt as a unit.
+- **Fidelity gap.** Production stores via the `/remember` skill through Claude
+  Code; the eval uses a single `engram_store` tool call via the API. It measures
+  the *decision* faithfully but not the full skill/CLI path (the arg-form
+  failures are covered by the deterministic CLI tests instead).
+- Token cost scales with fixtures × models × policies; keep the fixture set
+  small and curated, and use `--limit` for smoke runs.
diff --git a/docs/adr/README.md b/docs/adr/README.md
@@ -33,6 +33,7 @@ supersedes the old one (and update the old one's status to `Superseded by NNNN`)
 | [0021](0021-embedder-relative-recall-gate.md) | Embedder-relative recall gate, calibrated by offline eval (refines 0005's gate) | Accepted |
 | [0022](0022-privileged-helper-for-cli-install.md) | Privileged CLI install via a one-shot authenticated `osascript` | Accepted |
 | [0023](0023-session-scoped-recall-cooldown.md) | Session-scoped recall re-injection cooldown (stop re-injecting the same memory every prompt) | Accepted |
+| [0025](0025-store-behavior-eval.md) | LLM-in-the-loop eval for store behavior (store precision/recall, model-parametrized) | Accepted |
 
 ## Writing a new ADR
 

diff --git a/scripts/store_eval.py b/scripts/store_eval.py
@@ -0,0 +1,282 @@
+#!/usr/bin/env python3
+"""Store-behavior eval — does the agent decide to save the right memories?
+
+LLM-in-the-loop (ADR 0025). For each labeled session fixture, this runs a model
+with an `engram_store` tool available and Engram's production store-reflection
+guidance as the policy, then checks whether the model *called* the tool —
+comparing its decision against the fixture's `should_store` label. Reports store
+**precision / recall** and per-fixture outcomes.
+
+Unlike the Swift `engram-eval` (a deterministic retrieval gate), this calls a
+real model: results are model- and prompt-dependent and cost API tokens. Treat
+it as a *relative* A/B across models and policy wordings — not an absolute
+benchmark. The model id and a hash of the policy are recorded on every run, so
+runs are comparable.
+
+The `POLICY` below mirrors the production store signal (the recall hook's
+reflection nudge + the /remember guidance). **Keep it in sync** — when the
+production nudge wording changes, update it here, or the eval stops predicting
+production behavior. To A/B a candidate wording, pass `--policy-file`.
+
+Usage:
+  # validate fixtures + print the plan, no API key needed:
+  uv run scripts/store_eval.py validate
+
+  # run against a model (needs ANTHROPIC_API_KEY); opus-4-8 is the default:
+  uv run --with anthropic scripts/store_eval.py run
+  uv run --with anthropic scripts/store_eval.py run --model claude-sonnet-4-6
+  uv run --with anthropic scripts/store_eval.py run --record   # → eval/store-runs/<ts>-<model>.json
+"""
+from __future__ import annotations
+
+import hashlib
+import json
+import subprocess
+import sys
+import time
+from pathlib import Path
+
+FIXTURES_PATH = Path(__file__).parent / "store_eval_fixtures.json"
+RUNS_DIR = Path(__file__).parent.parent / "eval" / "store-runs"
+DEFAULT_MODEL = "claude-opus-4-8"
+
+# The policy under test: the system prompt + the final-turn reflection nudge the
+# agent sees. This mirrors the production store signal (the recall hook's nudge
+# in Sources/engram/main.swift + the /remember guidance in Setup.swift). The
+# only deliberate divergence: production saves via the `/remember` skill, here
+# the model saves via the `engram_store` tool so the harness can observe the
+# decision. Keep the wording in sync with production.
+POLICY = {
+    "system": (
+        "You are an AI assistant pair-working with a developer in a coding session. "
+        "You have access to Engram, a long-term memory store, via the `engram_store` "
+        "tool. Save durable, reusable knowledge — preferences, decisions, project "
+        "facts, and gotchas — that would help in future sessions. Do NOT save routine "
+        "task chatter, transient state, general knowledge, or anything already captured "
+        "in the repo or git history. If nothing durable surfaced, simply don't call the "
+        "tool."
+    ),
+    "nudge": (
+        "Engram reflection check: glance back over the recent turns. If something "
+        "durable surfaced — a preference, a decision, a project fact, or a gotcha you'd "
+        "want recalled weeks from now — save it with the engram_store tool. Only save "
+        "what's genuinely reusable; skip routine task chatter and anything already in "
+        "the repo or git history. Nothing notable? Don't call the tool."
+    ),
+}
+
+ENGRAM_STORE_TOOL = {
+    "name": "engram_store",
+    "description": (
+        "Save a durable memory to Engram for recall in future sessions. Call this only "
+        "when something genuinely reusable surfaced — a preference, decision, project "
+        "fact, or gotcha worth recalling weeks later."
+    ),
+    "input_schema": {
+        "type": "object",
+        "properties": {
+            "content": {
+                "type": "string",
+                "description": "The memory as a short Markdown note (a one-line title then the fact).",
+            },
+            "tags": {"type": "array", "items": {"type": "string"}},
+        },
+        "required": ["content"],
+        "additionalProperties": False,
+    },
+}
+
+
+def _load_fixtures() -> list[dict]:
+    data = json.loads(FIXTURES_PATH.read_text())
+    return data["fixtures"]
+
+
+def _policy(policy_file: str | None) -> dict:
+    if policy_file is None:
+        return POLICY
+    loaded = json.loads(Path(policy_file).read_text())
+    if "system" not in loaded or "nudge" not in loaded:
+        raise SystemExit(f"policy file {policy_file} must define 'system' and 'nudge'")
+    return loaded
+
+
+def _policy_hash(policy: dict) -> str:
+    blob = json.dumps([policy["system"], policy["nudge"], ENGRAM_STORE_TOOL], sort_keys=True)
+    return hashlib.sha256(blob.encode()).hexdigest()[:12]
+
+
+def _git_sha() -> str:
+    try:
+        return subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()
+    except Exception:
+        return "unknown"
+
+
+def _build_messages(fixture: dict, policy: dict) -> list[dict]:
+    # The fixture transcript, then the reflection nudge as the final user turn —
+    # consecutive user turns are fine (the API merges them).
+    return [*fixture["messages"], {"role": "user", "content": policy["nudge"]}]
+
+
+def validate() -> None:
+    """Check the fixtures are well-formed and print the run plan. No API calls."""
+    fixtures = _load_fixtures()
+    ids = [f["id"] for f in fixtures]
+    if len(ids) != len(set(ids)):
+        raise SystemExit("duplicate fixture ids")
+    for f in fixtures:
+        for key in ("id", "should_store", "messages", "rationale"):
+            if key not in f:
+                raise SystemExit(f"fixture {f.get('id', '?')} missing '{key}'")
+        if not f["messages"] or f["messages"][0]["role"] != "user":
+            raise SystemExit(f"fixture {f['id']}: messages must be non-empty and start with a user turn")
+        for m in f["messages"]:
+            if m["role"] not in ("user", "assistant"):
+                raise SystemExit(f"fixture {f['id']}: bad role {m['role']!r}")
+
+    positives = sum(1 for f in fixtures if f["should_store"])
+    print(f"{len(fixtures)} fixtures — {positives} should-store, {len(fixtures) - positives} should-not")
+    print(f"policy hash: {_policy_hash(POLICY)}   default model: {DEFAULT_MODEL}")
+    print("\nfixtures:")
+    for f in fixtures:
+        mark = "STORE " if f["should_store"] else "skip  "
+        print(f"  [{mark}] {f['id']}: {f['rationale']}")
+    print("\nfixtures valid. Run with: uv run --with anthropic scripts/store_eval.py run")
+
+
+def run(
+    model: str = DEFAULT_MODEL,
+    limit: int | None = None,
+    record: bool = False,
+    policy_file: str | None = None,
+    thinking: bool = True,
+) -> None:
+    """Run each fixture through `model` and report store precision/recall.
+
+    model: Claude model id (default claude-opus-4-8).
+    limit: only run the first N fixtures (for a cheap smoke run).
+    record: append a run JSON under eval/store-runs/.
+    policy_file: JSON with {"system","nudge"} to A/B an alternative policy.
+    thinking: run with adaptive thinking (mirrors production agents); --no-thinking is cheaper.
+    """
+    import anthropic  # imported lazily so `validate` needs no SDK / network
+
+    fixtures = _load_fixtures()
+    if limit is not None:
+        fixtures = fixtures[:limit]
+    policy = _policy(policy_file)
+    client = anthropic.Anthropic()
+
+    results: list[dict] = []
+    for fixture in fixtures:
+        kwargs: dict = {
+            "model": model,
+            "max_tokens": 4096,
+            "system": policy["system"],
+            "tools": [ENGRAM_STORE_TOOL],
+            "messages": _build_messages(fixture, policy),
+        }
+        if thinking:
+            kwargs["thinking"] = {"type": "adaptive"}
+        response = client.messages.create(**kwargs)
+
+        store_calls = [
+            b.input for b in response.content
+            if b.type == "tool_use" and b.name == "engram_store"
+        ]
+        stored = len(store_calls) > 0
+        correct = stored == fixture["should_store"]
+        results.append({
+            "id": fixture["id"],
+            "should_store": fixture["should_store"],
+            "stored": stored,
+            "correct": correct,
+            "stored_content": [c.get("content", "") for c in store_calls],
+        })
+
+    metrics = _metrics(results)
+    _print_report(results, metrics, model)
+
+    if record:
+        RUNS_DIR.mkdir(parents=True, exist_ok=True)
+        stamp = time.strftime("%Y-%m-%dT%H-%M-%SZ", time.gmtime())
+        out = RUNS_DIR / f"{stamp}-{model}.json"
+        out.write_text(json.dumps({
+            "git_sha": _git_sha(),
+            "model": model,
+            "thinking": thinking,
+            "policy_hash": _policy_hash(policy),
+            "timestamp": stamp,
+            "metrics": metrics,
+            "results": results,
+        }, indent=2))
+        print(f"\nrecorded → {out}")
+
+
+def _metrics(results: list[dict]) -> dict:
+    tp = sum(1 for r in results if r["should_store"] and r["stored"])
+    fp = sum(1 for r in results if not r["should_store"] and r["stored"])
+    fn = sum(1 for r in results if r["should_store"] and not r["stored"])
+    tn = sum(1 for r in results if not r["should_store"] and not r["stored"])
+    precision = tp / (tp + fp) if (tp + fp) else 0.0
+    recall = tp / (tp + fn) if (tp + fn) else 0.0
+    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
+    return {
+        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
+        "precision": round(precision, 3),
+        "recall": round(recall, 3),
+        "f1": round(f1, 3),
+        "accuracy": round((tp + tn) / len(results), 3) if results else 0.0,
+    }
+
+
+def _print_report(results: list[dict], metrics: dict, model: str) -> None:
+    print(f"\n=== store-behavior eval — {model} ===")
+    for r in results:
+        verdict = "ok " if r["correct"] else "MISS"
+        want = "store" if r["should_store"] else "skip"
+        got = "stored" if r["stored"] else "skipped"
+        print(f"  [{verdict}] {r['id']:28} want={want:5} got={got}")
+        if r["stored"] and r["stored_content"]:
+            print(f"          ↳ {r['stored_content'][0][:80]}")
+    m = metrics
+    print(
+        f"\nprecision {m['precision']}  recall {m['recall']}  f1 {m['f1']}  "
+        f"accuracy {m['accuracy']}  (tp={m['tp']} fp={m['fp']} fn={m['fn']} tn={m['tn']})"
+    )
+    print("note: model/prompt-dependent — a relative A/B, not a benchmark.")
+
+
+if __name__ == "__main__":
+    args = sys.argv[1:]
+    if not args or args[0] in ("-h", "--help"):
+        print(__doc__)
+        sys.exit(0)
+    command, rest = args[0], args[1:]
+    if command == "validate":
+        validate()
+    elif command == "run":
+        # Minimal flag parsing (kept dependency-free): --key value / --flag / --no-flag.
+        opts: dict = {}
+        i = 0
+        while i < len(rest):
+            tok = rest[i]
+            if tok.startswith("--no-"):
+                opts[tok[5:].replace("-", "_")] = False
+                i += 1
+            elif tok.startswith("--"):
+                key = tok[2:].replace("-", "_")
+                if i + 1 < len(rest) and not rest[i + 1].startswith("--"):
+                    opts[key] = rest[i + 1]
+                    i += 2
+                else:
+                    opts[key] = True
+                    i += 1
+            else:
+                i += 1
+        if "limit" in opts and opts["limit"] is not True:
+            opts["limit"] = int(opts["limit"])
+        run(**opts)
+    else:
+        raise SystemExit(f"unknown command {command!r} — use 'validate' or 'run'")