feat: end-to-end QA scoring stage with pluggable LLM runners by groksrc · Pull Request #15 · basicmachines-co/basic-memory-benchmarks

groksrc · 2026-06-12T18:02:23Z

Summary

Adds the missing stage between retrieval metrics and publishable benchmark numbers: per-query answer generation over retrieved context, followed by LLM-judged grading against the expected answer. Retrieval metrics (recall/MRR/content-hit) measure the search layer; QA accuracy is what memory-system comparisons are actually made on.

What's new

llm/runners.py — LLMRunner abstraction with two transports:

claude:<model> — Claude Code CLI print mode (claude -p --output-format json --max-turns 1). Bills the operator's Claude subscription plan; no API key required.
openai-compat:<model>@<base_url> — any OpenAI-compatible chat-completions endpoint (Ollama, LM Studio, vLLM, OpenAI).

Both retry transient failures and report token usage + latency per call.
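
A minimal sketch of how the two spec strings above could be parsed, for readers tracing a spec from a CLI flag to a runner. `RunnerSpec` and `parse_runner_spec` are hypothetical names, not the actual contents of llm/runners.py; splitting on the first `@` is also an assumption, chosen so that model names containing a colon (as Ollama tags do) survive intact.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunnerSpec:
    """Hypothetical parsed form of a runner spec string."""
    transport: str                  # "claude" or "openai-compat"
    model: str
    base_url: Optional[str] = None  # only set for openai-compat


def parse_runner_spec(spec: str) -> RunnerSpec:
    # The transport is everything before the first colon; the rest may
    # itself contain colons (model tags, URL scheme and port).
    transport, sep, rest = spec.partition(":")
    if not sep or not rest:
        raise ValueError(f"expected '<transport>:<model>', got {spec!r}")

    if transport == "claude":
        return RunnerSpec("claude", rest)

    if transport == "openai-compat":
        # Assumption: the first '@' separates the model from the base URL.
        model, at, base_url = rest.partition("@")
        if not at or not model or not base_url:
            raise ValueError(f"expected 'openai-compat:<model>@<base_url>', got {spec!r}")
        return RunnerSpec("openai-compat", model, base_url)

    raise ValueError(f"unknown transport {transport!r} in {spec!r}")
```

Keeping the spec a plain string means the same value can be passed on the command line and written unchanged into each artifact row.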

scoring/qa.py — fixed, auditable prompts (they ship in the repo so any number can be traced to its exact rubric):

Answer prompt: answer from retrieved memories only; abstain with exactly "I don't know" when the context lacks the answer.
Judge prompt: strict binary verdict as JSON. Abstention is correct only when the gold answer marks the question unanswerable (LoCoMo adversarial); abstaining on an answerable question is incorrect.
Malformed judge output or runner failure is recorded as an explicit per-case error (scored incorrect), never silently passed.
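
The verdict-handling rule above can be sketched as follows. This is an illustration under stated assumptions, not the code in scoring/qa.py: the function name `parse_verdict` and the JSON field `correct` are hypothetical, since the PR only says the judge returns a strict binary verdict as JSON. The sketch accepts code-fenced JSON and turns anything malformed into an explicit error that is scored incorrect.

```python
import json
import re
from typing import Optional, Tuple

# Build the fence marker programmatically so this example can itself
# live inside a fenced block.
FENCE = "`" * 3
_FENCED = re.compile(
    r"^" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"$",
    re.DOTALL,
)


def parse_verdict(raw: str) -> Tuple[bool, Optional[str]]:
    """Return (correct, error). A non-None error always pairs with False."""
    text = raw.strip()
    fenced = _FENCED.match(text)
    if fenced:
        # Strip a surrounding code fence, with or without a "json" tag.
        text = fenced.group(1)

    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"malformed judge output: {exc.msg}"

    # Assumed schema: {"correct": true | false}. Anything else is an error,
    # so a vague or truncated verdict can never count as a pass.
    if not isinstance(data, dict) or not isinstance(data.get("correct"), bool):
        return False, "judge output missing boolean 'correct' field"

    return data["correct"], None
```

Returning the error alongside a `False` verdict keeps the failure visible in the per-case record instead of folding it into an ordinary incorrect answer.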

run qa CLI command — reads per-query-retrieval.jsonl, writes per-query-qa.jsonl + qa-summary.json with per-category accuracy, abstain/error counts, mean answer latency, and token totals. The same answerer and judge are used for every provider in a run, holding the model constant across systems.
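
As a rough illustration of the summary step, the sketch below aggregates per-query rows into overall and per-category accuracy plus abstain and error counts. The row fields (`category`, `correct`, `abstained`, `error`) and the function name `summarize_qa` are assumptions made for the example; the actual schema of per-query-qa.jsonl and qa-summary.json is not shown in this PR description.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def summarize_qa(rows: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate hypothetical per-query QA rows into a summary dict."""
    totals = {"cases": 0, "correct": 0, "abstained": 0, "errors": 0}
    per_category: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"cases": 0, "correct": 0}
    )

    for row in rows:
        failed = row.get("error") is not None
        # An errored case is scored incorrect regardless of any verdict.
        is_correct = (not failed) and bool(row.get("correct"))

        totals["cases"] += 1
        totals["correct"] += int(is_correct)
        totals["abstained"] += int(bool(row.get("abstained")))
        totals["errors"] += int(failed)

        bucket = per_category[row.get("category", "uncategorized")]
        bucket["cases"] += 1
        bucket["correct"] += int(is_correct)

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        **totals,
        "accuracy": ratio(totals["correct"], totals["cases"]),
        "per_category_accuracy": {
            name: ratio(b["correct"], b["cases"])
            for name, b in sorted(per_category.items())
        },
    }
```

Errors stay in the denominator, so a provider cannot improve its accuracy by causing judge or runner failures.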

Fairness properties

Identical answerer + judge models for all providers in a run.
Prompts are fixed in source, not configurable per provider.
Judge model recorded in every artifact row for auditability.

Testing

tests/llm/test_runners.py — spec parsing, claude transport parsing/retry/error paths (subprocess mocked).
tests/test_qa_scoring.py — prompt construction, verdict parsing (incl. code-fenced JSON), correct/incorrect/abstain scoring, category breakdown, error recording, token/latency accounting, artifact writing.
Live smoke test against the real claude CLI (haiku answerer, sonnet judge): answerable case answered and judged correct; adversarial case abstained and judged correct.

🤖 Generated with Claude Code

…compatible LLM runners

Adds the stage that produces benchmark-comparable accuracy numbers: generate an answer per query from each provider's retrieved context, then grade it against the expected answer with an LLM judge.

- llm/runners.py: LLMRunner abstraction with two transports: claude (Claude Code CLI print mode, bills the subscription plan, no API key) and openai-compat (Ollama/LM Studio/vLLM/OpenAI). Specs are strings (claude:<model>, openai-compat:<model>@<base_url>) so they flow through CLI flags and artifacts.
- scoring/qa.py: fixed, auditable answer + judge prompts. Answerer must abstain when context lacks the answer; abstention is graded correct only when the gold answer marks the question unanswerable (LoCoMo adversarial). Per-case errors recorded, never silently scored.
- run qa CLI command writing per-query-qa.jsonl + qa-summary.json with per-category accuracy, abstain/error counts, latency and token usage.
- Same answerer/judge for every provider in a run, holding the model constant across systems.

Smoke-tested live against claude CLI (haiku answerer, sonnet judge): answerable case answered and judged correct; adversarial case abstained and judged correct.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>

groksrc merged commit 7501cdc into main Jun 12, 2026
1 check passed

groksrc deleted the feat/qa-stage-llm-runners branch June 12, 2026 18:19

groksrc merged 1 commit into main from feat/qa-stage-llm-runners
