feat(eval): LLM-in-the-loop store-behavior eval (ADR 0025) by dakl · Pull Request #8 · dakl/engram

dakl · 2026-06-25T07:11:27Z

What

Recall quality has a deterministic offline gate (engram-eval, ADR 0021).
Store quality — the half that was actually failing in the
"nothing-gets-saved" investigation — had nothing. This adds a model-in-the-loop
harness that measures whether the agent decides to save the right memories.

For each labeled session fixture, it runs a model with an engram_store tool
available and Engram's production store signal as the policy (system prompt +
the recall hook's reflection nudge), then compares whether the model called
the tool against the fixture's should_store label. It reports store precision,
recall, F1, and accuracy, plus the content stored for each fixture.
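
As a rough illustration of the scoring described above (not the code in scripts/store_eval.py), the sketch below detects a store call and aggregates the metrics. It assumes the model's response content has already been converted to plain dicts shaped like Anthropic Messages API content blocks; the names called_store_tool, StoreMetrics, and score are hypothetical.

```python
from dataclasses import dataclass


def called_store_tool(content_blocks, tool_name="engram_store"):
    """Return True if any content block is a tool_use call to the store tool.

    Assumes blocks are plain dicts with "type" and "name" keys.
    """
    return any(
        block.get("type") == "tool_use" and block.get("name") == tool_name
        for block in content_blocks
    )


@dataclass
class StoreMetrics:
    tp: int  # should store, did store
    fp: int  # should skip, did store
    fn: int  # should store, did skip
    tn: int  # should skip, did skip

    @property
    def precision(self):
        stored = self.tp + self.fp
        return self.tp / stored if stored else 0.0

    @property
    def recall(self):
        expected = self.tp + self.fn
        return self.tp / expected if expected else 0.0

    @property
    def f1(self):
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0

    @property
    def accuracy(self):
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0


def score(outcomes):
    """Aggregate (should_store, did_store) boolean pairs into a confusion matrix."""
    tp = fp = fn = tn = 0
    for should_store, did_store in outcomes:
        if should_store and did_store:
            tp += 1
        elif did_store:
            fp += 1
        elif should_store:
            fn += 1
        else:
            tn += 1
    return StoreMetrics(tp, fp, fn, tn)
```

Returning 0.0 when a denominator is empty is one possible convention; a harness could equally report such cases as undefined.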

Why it's a separate kind of eval

Storing is model-driven (ADR 0001), so measuring it requires running a real
model — non-deterministic and token-costing, unlike the Swift gate. Per your
ask: the model is a parameter and is recorded (--model, default
claude-opus-4-8; every --record run writes git sha + model + policy hash +
results under eval/store-runs/), so we can compare across model versions and
across nudge wordings (--policy-file).
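
To make the recorded fields concrete, here is a minimal sketch of how a run record with git sha, model, policy hash, and results could be assembled. The function names, the truncated SHA-256 digest, the recorded_at field, and the JSON layout are all assumptions for illustration; the actual format written under eval/store-runs/ may differ.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone


def current_git_sha():
    """Read the checked-out commit via `git rev-parse HEAD` (needs a git repo)."""
    out = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def policy_hash(policy_text):
    """Short, stable fingerprint of the policy text (hypothetical: SHA-256, 12 hex chars)."""
    return hashlib.sha256(policy_text.encode("utf-8")).hexdigest()[:12]


def build_run_record(git_sha, model, policy_text, results):
    """Bundle everything needed to compare runs across models and nudge wordings."""
    return {
        "git_sha": git_sha,
        "model": model,
        "policy_hash": policy_hash(policy_text),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }


def serialize_run_record(record):
    """Render the record as deterministic, human-diffable JSON."""
    return json.dumps(record, indent=2, sort_keys=True)
```

Hashing the policy text rather than storing a file name means two runs with identical nudge wording compare as equal even if the policy file was renamed.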

Files

- scripts/store_eval.py: validate (dependency-free fixture check, no key)
  and run (needs anthropic + ANTHROPIC_API_KEY).
- scripts/store_eval_fixtures.json: 13 starter fixtures (7 store / 6 skip)
  covering preference, decision, gotcha, explicit-remember,
  fact-buried-in-chatter, plus near-miss negatives (chatter, repo-derivable,
  general knowledge, one-off).
- ADR 0025: methodology, the keep-in-sync-with-production requirement, the
  fidelity gap (tool call vs /remember skill), and why it's on-demand, not CI.
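
For a sense of what a dependency-free fixture check can look like, the sketch below validates a list of fixtures using only built-in Python. Only the should_store label comes from this PR; the id field and the function name validate_fixtures are hypothetical, and the real fixture schema may have more fields.

```python
def validate_fixtures(fixtures):
    """Check fixture shape and count store/skip labels.

    Returns (errors, n_store, n_skip). Assumes each fixture is a dict with a
    unique string "id" (hypothetical field) and a boolean "should_store".
    """
    errors = []
    seen_ids = set()
    n_store = n_skip = 0

    for index, fixture in enumerate(fixtures):
        if not isinstance(fixture, dict):
            errors.append(f"fixture {index}: expected an object")
            continue

        fixture_id = fixture.get("id")
        if not isinstance(fixture_id, str) or not fixture_id:
            errors.append(f"fixture {index}: missing or empty id")
        elif fixture_id in seen_ids:
            errors.append(f"fixture {index}: duplicate id {fixture_id!r}")
        else:
            seen_ids.add(fixture_id)

        label = fixture.get("should_store")
        if isinstance(label, bool):
            if label:
                n_store += 1
            else:
                n_skip += 1
        else:
            errors.append(f"fixture {index}: should_store must be true or false")

    return errors, n_store, n_skip
```

A caller would typically load the JSON file with the standard json module, pass the parsed list to this function, and exit non-zero if errors is non-empty.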

Verification

python3 scripts/store_eval.py validate passes (fixtures well-formed, 7 store /
6 skip). py_compile is clean. I did not run live model calls (no key in my
env, and it spends tokens); the run subcommand is ready for you to exercise
with a key.

Notes

- Not a CI gate (token cost + non-determinism); runs on demand like the
  embedding-model exploration harness.
- ADR index skips 0024 here; that number is taken by the concurrent
  bundled-embedder branch, and its row lands when that PR merges.

🤖 Generated with Claude Code

Recall has a deterministic offline gate; storing, the half that was actually failing, had nothing. Add a model-in-the-loop harness that measures store precision/recall: for each labeled session fixture it runs a model with an engram_store tool + the production store nudge as the policy, and checks whether the model decides to call the tool vs the should_store label.

- scripts/store_eval.py: `validate` (dependency-free fixture check) and `run` (needs anthropic + key). --model parametrizes and is recorded per run under eval/store-runs/; --policy-file A/Bs nudge wordings.
- scripts/store_eval_fixtures.json: 13 starter fixtures (7 store / 6 skip), incl. explicit-remember, fact-buried-in-chatter, and near-miss negatives.
- ADR 0025 documents the methodology, the sync requirement with production, and why it's on-demand (token cost, non-deterministic), not a CI gate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

dakl wants to merge 1 commit into main from claude/store-eval
