
feat: end-to-end QA scoring stage with pluggable LLM runners #15

Merged: groksrc merged 1 commit into main from feat/qa-stage-llm-runners on Jun 12, 2026


@groksrc groksrc commented Jun 12, 2026


Summary

Adds the missing stage between retrieval metrics and publishable benchmark numbers: per-query answer generation over retrieved context, followed by LLM-judged grading against the expected answer. Retrieval metrics (recall/MRR/content-hit) measure the search layer; QA accuracy is what memory-system comparisons are actually made on.

What's new

llm/runners.py: LLMRunner abstraction with two transports:

  • claude:<model> — Claude Code CLI print mode (claude -p --output-format json --max-turns 1). Bills the operator's Claude subscription plan; no API key required.
  • openai-compat:<model>@<base_url> — any OpenAI-compatible chat-completions endpoint (Ollama, LM Studio, vLLM, OpenAI).

Both retry transient failures and report token usage + latency per call.
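
To make the two spec formats concrete, here is a minimal sketch of how such strings could be parsed. The names `RunnerSpec` and `parse_runner_spec` are hypothetical and not taken from the PR; the sketch also assumes the base URL contains no `@`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunnerSpec:
    """Parsed form of a runner spec string (hypothetical type, not from the PR)."""
    transport: str                  # "claude" or "openai-compat"
    model: str
    base_url: Optional[str] = None  # only set for openai-compat


def parse_runner_spec(spec: str) -> RunnerSpec:
    """Parse 'claude:<model>' or 'openai-compat:<model>@<base_url>'."""
    # Split on the first ':' only, so model tags such as 'llama3:8b' survive.
    transport, sep, rest = spec.partition(":")
    if not sep or not rest:
        raise ValueError(f"expected '<transport>:<model>', got {spec!r}")
    if transport == "claude":
        return RunnerSpec(transport, rest)
    if transport == "openai-compat":
        # Split on the last '@' (assumption: the base URL itself has no '@').
        model, at, base_url = rest.rpartition("@")
        if not at or not model or not base_url:
            raise ValueError(
                f"expected 'openai-compat:<model>@<base_url>', got {spec!r}"
            )
        return RunnerSpec(transport, model, base_url)
    raise ValueError(f"unknown transport {transport!r}")
```

Keeping the spec as a single string is what lets it pass unchanged through a CLI flag and into an artifact row; parsing happens only at the point where a runner is constructed.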

scoring/qa.py — fixed, auditable prompts (they ship in the repo so any number can be traced to its exact rubric):

  • Answer prompt: answer from retrieved memories only; abstain with exactly "I don't know" when the context lacks the answer.
  • Judge prompt: strict binary verdict as JSON. Abstention is correct only when the gold answer marks the question unanswerable (LoCoMo adversarial); abstaining on an answerable question is incorrect.
  • Malformed judge output or runner failure is recorded as an explicit per-case error (scored incorrect), never silently passed.
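
The error-handling rule above can be sketched as follows. This is an illustration under assumptions, not the code in scoring/qa.py: the verdict schema `{"correct": true|false}`, the row field names, and the helpers `parse_verdict` and `score_case` are all hypothetical. Only the abstention string comes from the PR description.

```python
import json
from typing import Optional, Tuple

ABSTAIN_TEXT = "I don't know"  # abstention string quoted in the answer prompt
FENCE = "`" * 3                # code-fence marker, built to avoid a literal fence here


def _strip_code_fence(text: str) -> str:
    """Drop a surrounding code fence (with optional language tag) if present."""
    lines = text.strip().splitlines()
    if len(lines) >= 2 and lines[0].startswith(FENCE) and lines[-1].strip() == FENCE:
        lines = lines[1:-1]
    return "\n".join(lines).strip()


def parse_verdict(raw: str) -> Tuple[Optional[bool], Optional[str]]:
    """Return (correct, error); exactly one of the two is not None."""
    try:
        data = json.loads(_strip_code_fence(raw))
    except json.JSONDecodeError as exc:
        return None, f"malformed judge output: {exc.msg}"
    # Assumed schema: {"correct": true|false}. The real field name may differ.
    if not isinstance(data, dict) or not isinstance(data.get("correct"), bool):
        return None, "judge output missing boolean 'correct'"
    return data["correct"], None


def score_case(answer: str, judge_raw: str) -> dict:
    """Build a per-case result; a judge error is recorded and scored incorrect."""
    correct, error = parse_verdict(judge_raw)
    return {
        "abstained": answer.strip() == ABSTAIN_TEXT,
        "correct": bool(correct),  # None (error) becomes False, never a silent pass
        "error": error,
    }
```

Note that the sketch only flags abstention; whether an abstention counts as correct is left to the judge, which sees the gold answer.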

run qa CLI command — reads per-query-retrieval.jsonl, writes per-query-qa.jsonl + qa-summary.json with per-category accuracy, abstain/error counts, mean answer latency, and token totals. The same answerer and judge are used for every provider in a run, holding the model constant across systems.
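
As a rough illustration of the summary step, the sketch below aggregates per-query rows into the kinds of figures listed above. The row keys (`category`, `correct`, `abstained`, `error`, `latency_s`, `tokens`) and the output key names are assumptions for the example; the actual qa-summary.json schema may differ.

```python
from collections import defaultdict
from statistics import mean


def summarize(rows: list[dict]) -> dict:
    """Aggregate per-query QA rows (hypothetical schema) into a run summary."""
    by_category = defaultdict(lambda: {"total": 0, "correct": 0})
    for row in rows:
        bucket = by_category[row["category"]]
        bucket["total"] += 1
        bucket["correct"] += int(row["correct"])

    return {
        "total": len(rows),
        "accuracy": sum(int(r["correct"]) for r in rows) / len(rows) if rows else 0.0,
        "per_category_accuracy": {
            cat: b["correct"] / b["total"] for cat, b in by_category.items()
        },
        "abstain_count": sum(1 for r in rows if r["abstained"]),
        # Error cases stay in the denominator, so they lower accuracy.
        "error_count": sum(1 for r in rows if r["error"] is not None),
        "mean_answer_latency_s": mean(r["latency_s"] for r in rows) if rows else 0.0,
        "total_tokens": sum(r["tokens"] for r in rows),
    }
```

Counting errors in the denominator matches the rule that a failed case is scored incorrect instead of being dropped from the run.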

Fairness properties

  • Identical answerer + judge models for all providers in a run.
  • Prompts are fixed in source, not configurable per provider.
  • Judge model recorded in every artifact row for auditability.

Testing

  • tests/llm/test_runners.py — spec parsing, claude transport parsing/retry/error paths (subprocess mocked).
  • tests/test_qa_scoring.py — prompt construction, verdict parsing (incl. code-fenced JSON), correct/incorrect/abstain scoring, category breakdown, error recording, token/latency accounting, artifact writing.
  • Live smoke test against the real claude CLI (haiku answerer, sonnet judge): answerable case answered and judged correct; adversarial case abstained and judged correct.
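
The retry and error paths listed above can be exercised without a real CLI or network. The sketch below is one way to do it, under assumptions: `TransientError` and `call_with_retry` are hypothetical names, the exponential backoff is a choice made for the example, and the test mocks a plain callable where the PR's tests mock `subprocess`.

```python
import time
from typing import Callable, TypeVar
from unittest.mock import Mock

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def call_with_retry(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on TransientError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


def test_retries_then_succeeds() -> None:
    # First call fails transiently, second call returns a result.
    transport = Mock(side_effect=[TransientError("timeout"), "ok"])
    delays: list[float] = []
    assert call_with_retry(transport, sleep=delays.append) == "ok"
    assert transport.call_count == 2
    assert delays == [0.5]  # one backoff before the successful retry
```

Injecting `sleep` keeps the test instant and lets it assert on the backoff schedule directly.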

🤖 Generated with Claude Code

…compatible LLM runners

Adds the stage that produces benchmark-comparable accuracy numbers:
generate an answer per query from each provider's retrieved context,
then grade it against the expected answer with an LLM judge.

- llm/runners.py: LLMRunner abstraction with two transports — claude
  (Claude Code CLI print mode, bills the subscription plan, no API key)
  and openai-compat (Ollama/LM Studio/vLLM/OpenAI). Specs are strings
  (claude:<model>, openai-compat:<model>@<base_url>) so they flow
  through CLI flags and artifacts.
- scoring/qa.py: fixed, auditable answer + judge prompts. Answerer must
  abstain when context lacks the answer; abstention is graded correct
  only when the gold answer marks the question unanswerable (LoCoMo
  adversarial). Per-case errors recorded, never silently scored.
- run qa CLI command writing per-query-qa.jsonl + qa-summary.json with
  per-category accuracy, abstain/error counts, latency and token usage.
- Same answerer/judge for every provider in a run, holding the model
  constant across systems.

Smoke-tested live against claude CLI (haiku answerer, sonnet judge):
answerable case answered and judged correct; adversarial case abstained
and judged correct.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
@groksrc groksrc merged commit 7501cdc into main Jun 12, 2026
1 check passed
@groksrc groksrc deleted the feat/qa-stage-llm-runners branch June 12, 2026 18:19