feat(eval): LLM-in-the-loop store-behavior eval (ADR 0025) #8

Open
dakl wants to merge 1 commit into main from claude/store-eval

dakl (Owner) commented Jun 25, 2026

What

Recall quality has a deterministic offline gate (engram-eval, ADR 0021).
Store quality — the half that was actually failing in the
"nothing-gets-saved" investigation — had nothing. This adds a model-in-the-loop
harness that measures whether the agent decides to save the right memories.

For each labeled session fixture, it runs a model with an engram_store tool
available and Engram's production store signal as the policy (system prompt +
the recall hook's reflection nudge), then compares whether the model called the
tool against the fixture's should_store label. It reports store precision / recall
(+ F1/accuracy) and the per-fixture stored content.
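
As a rough illustration of the scoring step described above, the sketch below treats "store" as the positive class and derives the four reported numbers from per-fixture outcomes. The Outcome type, its did_store field, and the store_metrics function are hypothetical names chosen for this example; only should_store comes from the fixtures. Returning 0.0 for an empty denominator is also an assumption, not necessarily what scripts/store_eval.py does.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """One fixture result. Field names are illustrative, not the script's."""
    should_store: bool  # label from the fixture
    did_store: bool     # whether the model called engram_store (hypothetical field)


def store_metrics(outcomes):
    """Compute precision, recall, F1, and accuracy with 'store' as the positive class."""
    tp = sum(o.should_store and o.did_store for o in outcomes)
    fp = sum((not o.should_store) and o.did_store for o in outcomes)
    fn = sum(o.should_store and (not o.did_store) for o in outcomes)
    tn = sum((not o.should_store) and (not o.did_store) for o in outcomes)

    # Assumption: report 0.0 when a denominator is empty
    # (for example, when the model never stores anything).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(outcomes) if outcomes else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

A model that never calls the tool scores 0.0 on precision, recall, and F1 while still getting every skip fixture right, which is why accuracy alone would hide the "nothing-gets-saved" failure.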

Why it's a separate kind of eval

Storing is model-driven (ADR 0001), so measuring it requires running a real
model, which is non-deterministic and costs tokens, unlike the Swift gate. Per your
ask: the model is a parameter and is recorded (--model, default
claude-opus-4-8; every --record run writes git sha + model + policy hash +
results under eval/store-runs/), so we can compare across model versions and
across nudge wordings (--policy-file).
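
To make the recorded-run idea concrete, here is a minimal sketch of what a record could look like. The JSON key names, the truncated SHA-256 policy hash, and the timestamped file name are assumptions for illustration; the PR only states that git sha, model, policy hash, and results are written under eval/store-runs/.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def policy_hash(policy_text):
    """Short, stable fingerprint of the policy text (system prompt + nudge)."""
    return hashlib.sha256(policy_text.encode("utf-8")).hexdigest()[:12]


def git_sha():
    """Current commit hash, or 'unknown' when git or a checkout is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_record(model, policy_text, results, sha):
    """Bundle everything needed to compare runs later. Key names are hypothetical."""
    return {
        "git_sha": sha,
        "model": model,
        "policy_hash": policy_hash(policy_text),
        "results": results,
    }


def write_record(record, out_dir="eval/store-runs"):
    """Write one JSON file per run and return its path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out / f"{stamp}-{record['model']}.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path
```

Hashing the policy text rather than storing a file name means two runs with the same wording compare as equal even if the --policy-file path differs, and any edit to the nudge shows up as a new hash.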

Files

  • scripts/store_eval.py: two subcommands, validate (dependency-free fixture
    check, no key) and run (needs anthropic + ANTHROPIC_API_KEY).
  • scripts/store_eval_fixtures.json — 13 starter fixtures (7 store / 6 skip):
    preference, decision, gotcha, explicit-remember, fact-buried-in-chatter, plus
    near-miss negatives (chatter, repo-derivable, general knowledge, one-off).
  • ADR 0025 — methodology, the keep-in-sync-with-production requirement, the
    fidelity gap (tool call vs /remember skill), and why it's on-demand, not CI.
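
For readers who want a feel for the dependency-free fixture check, the sketch below shows one plausible shape using only the standard language features. The fixture schema here is an assumption: should_store is the only field named in this PR, and the id field, the problem messages, and both function names are hypothetical.

```python
# Assumed schema: "should_store" comes from the PR text, "id" is hypothetical.
REQUIRED_FIELDS = {"id": str, "should_store": bool}


def validate_fixtures(fixtures):
    """Return a list of human-readable problems; an empty list means well-formed."""
    if not isinstance(fixtures, list):
        return ["top-level value must be a list"]

    problems = []
    seen_ids = set()
    for index, fixture in enumerate(fixtures):
        if not isinstance(fixture, dict):
            problems.append(f"fixture {index}: not an object")
            continue
        # Each required field must be present and have the expected type.
        for field, expected in REQUIRED_FIELDS.items():
            if not isinstance(fixture.get(field), expected):
                problems.append(f"fixture {index}: '{field}' must be {expected.__name__}")
        # Duplicate ids would make per-fixture results ambiguous.
        fixture_id = fixture.get("id")
        if isinstance(fixture_id, str):
            if fixture_id in seen_ids:
                problems.append(f"fixture {index}: duplicate id '{fixture_id}'")
            seen_ids.add(fixture_id)
    return problems


def label_counts(fixtures):
    """Return (store, skip) counts, e.g. to confirm a 7/6 split."""
    store = sum(1 for f in fixtures if f.get("should_store") is True)
    return store, len(fixtures) - store
```

In use, the fixture file would be loaded with json.load and passed to validate_fixtures, with a non-empty result treated as a failure. Keeping this step free of third-party imports is what lets it run without the anthropic package or an API key.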

Verification

  • python3 scripts/store_eval.py validate passes (fixtures well-formed, 7/6).
  • py_compile clean. I did not run live model calls (no key in my env, and
    it spends tokens) — run is ready for you to exercise with a key.

Notes

  • Not a CI gate (token cost + non-determinism); runs on demand like the
    embedding-model exploration harness.
  • ADR index skips 0024 here — that number is taken by the concurrent
    bundled-embedder branch; its row lands when that PR merges.

🤖 Generated with Claude Code

Recall has a deterministic offline gate; storing — the half that was actually
failing — had nothing. Add a model-in-the-loop harness that measures store
precision/recall: for each labeled session fixture it runs a model with an
engram_store tool + the production store nudge as the policy, and checks whether
the model decides to call the tool vs the should_store label.

- scripts/store_eval.py: `validate` (dependency-free fixture check) and `run`
  (needs anthropic + key). --model selects the model and is recorded per run under
  eval/store-runs/; --policy-file A/Bs nudge wordings.
- scripts/store_eval_fixtures.json: 13 starter fixtures (7 store / 6 skip),
  incl. explicit-remember, fact-buried-in-chatter, and near-miss negatives.
- ADR 0025 documents the methodology, the sync requirement with production, and
  why it's on-demand (token cost, non-deterministic), not a CI gate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>