feat(eval): LLM-in-the-loop store-behavior eval (ADR 0025) #8
Open
dakl wants to merge 1 commit into
Recall has a deterministic offline gate; storing, the half that was actually failing, had nothing. Add a model-in-the-loop harness that measures store precision/recall: for each labeled session fixture it runs a model with an engram_store tool + the production store nudge as the policy, and checks whether the model decides to call the tool vs the should_store label.

- scripts/store_eval.py: `validate` (dependency-free fixture check) and `run` (needs anthropic + key). --model parametrizes and is recorded per run under eval/store-runs/; --policy-file A/Bs nudge wordings.
- scripts/store_eval_fixtures.json: 13 starter fixtures (7 store / 6 skip), incl. explicit-remember, fact-buried-in-chatter, and near-miss negatives.
- ADR 0025 documents the methodology, the sync requirement with production, and why it's on-demand (token cost, non-deterministic), not a CI gate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What
Recall quality has a deterministic offline gate (`engram-eval`, ADR 0021). Store quality, the half that was actually failing in the "nothing-gets-saved" investigation, had nothing. This adds a model-in-the-loop harness that measures whether the agent decides to save the right memories.
For each labeled session fixture, it runs a model with an `engram_store` tool available and Engram's production store signal as the policy (system prompt + the recall hook's reflection nudge), then checks whether the model calls the tool vs the fixture's `should_store` label. Reports store precision / recall (+ F1/accuracy) and the per-fixture stored content.
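As a rough illustration of the scoring step described above, here is a minimal sketch of how store precision / recall / F1 / accuracy could be derived from per-fixture outcomes. The `Outcome` shape and the `score` function are hypothetical names chosen for this example, not the actual schema or code in `scripts/store_eval.py`; "positive" here means "the model called `engram_store`".

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """One fixture result (hypothetical shape, not the harness's real schema)."""
    fixture_id: str
    should_store: bool  # the fixture's label
    did_store: bool     # whether the model called the engram_store tool


def score(outcomes):
    """Compute store precision/recall/F1/accuracy over a list of Outcome."""
    tp = sum(1 for o in outcomes if o.should_store and o.did_store)
    fp = sum(1 for o in outcomes if not o.should_store and o.did_store)
    fn = sum(1 for o in outcomes if o.should_store and not o.did_store)
    tn = sum(1 for o in outcomes if not o.should_store and not o.did_store)

    # Guard every division so an empty or one-sided run yields 0.0, not an error.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(outcomes) if outcomes else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    demo = [
        Outcome("explicit-remember", should_store=True, did_store=True),
        Outcome("chatter", should_store=False, did_store=True),
        Outcome("gotcha", should_store=True, did_store=False),
    ]
    print(score(demo))
```

Low precision in this framing would mean the nudge makes the model over-store (near-miss negatives get saved); low recall would correspond to the "nothing-gets-saved" failure.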
Why it's a separate kind of eval
Storing is model-driven (ADR 0001), so measuring it requires running a real model (non-deterministic and token-costing), unlike the Swift gate. Per your ask: the model is a parameter and is recorded (`--model`, default `claude-opus-4-8`; every `--record` run writes git sha + model + policy hash + results under `eval/store-runs/`), so we can compare across model versions and across nudge wordings (`--policy-file`).
Files
`scripts/store_eval.py`: `validate` (dependency-free fixture check, no key) and `run` (needs `anthropic` + `ANTHROPIC_API_KEY`).
`scripts/store_eval_fixtures.json`: 13 starter fixtures (7 store / 6 skip) covering preference, decision, gotcha, explicit-remember, fact-buried-in-chatter, plus near-miss negatives (chatter, repo-derivable, general knowledge, one-off).
fidelity gap (tool call vs `/remember` skill), and why it's on-demand, not CI.
Verification
`python3 scripts/store_eval.py validate` passes (fixtures well-formed, 7/6). `py_compile` clean. I did not run live model calls (no key in my env, and it spends tokens); `run` is ready for you to exercise with a key.
Notes
embedding-model exploration harness.
bundled-embedder branch; its row lands when that PR merges.
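For reference, the per-run record that `--record` is described as writing (git sha + model + policy hash + results) could be assembled along these lines. This is only a sketch under assumptions: `build_run_record`, the field names, and the choice of SHA-256 for the policy hash are all hypothetical, and the real layout under `eval/store-runs/` may differ.

```python
import hashlib
import json


def build_run_record(git_sha, model, policy_text, results):
    """Bundle the facts needed to compare runs across models and nudge wordings.

    Hashing the policy text (SHA-256 here, an assumption) lets two runs be
    matched or distinguished by policy without storing the whole prompt.
    """
    policy_hash = hashlib.sha256(policy_text.encode("utf-8")).hexdigest()
    return {
        "git_sha": git_sha,
        "model": model,
        "policy_hash": policy_hash,
        "results": results,
    }


if __name__ == "__main__":
    record = build_run_record(
        git_sha="abc1234",                    # placeholder, not a real commit
        model="claude-opus-4-8",              # the default named in this PR
        policy_text="example nudge wording",  # placeholder policy text
        results={"precision": 1.0, "recall": 0.5},  # made-up numbers
    )
    print(json.dumps(record, indent=2, sort_keys=True))
```

Keying the record on both model and policy hash is what makes the two comparisons independent: hold the policy fixed to compare model versions, or hold the model fixed to A/B nudge wordings.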
🤖 Generated with Claude Code