feat: end-to-end QA scoring stage with pluggable LLM runners #15
Merged
…compatible LLM runners

Adds the stage that produces benchmark-comparable accuracy numbers: generate an answer per query from each provider's retrieved context, then grade it against the expected answer with an LLM judge.

- llm/runners.py: LLMRunner abstraction with two transports, claude (Claude Code CLI print mode, bills the subscription plan, no API key) and openai-compat (Ollama/LM Studio/vLLM/OpenAI). Specs are strings (claude:<model>, openai-compat:<model>@<base_url>) so they flow through CLI flags and artifacts.
- scoring/qa.py: fixed, auditable answer + judge prompts. Answerer must abstain when context lacks the answer; abstention is graded correct only when the gold answer marks the question unanswerable (LoCoMo adversarial). Per-case errors recorded, never silently scored.
- run qa CLI command writing per-query-qa.jsonl + qa-summary.json with per-category accuracy, abstain/error counts, latency and token usage.
- Same answerer/judge for every provider in a run, holding the model constant across systems.

Smoke-tested live against claude CLI (haiku answerer, sonnet judge): answerable case answered and judged correct; adversarial case abstained and judged correct.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
Summary
Adds the missing stage between retrieval metrics and publishable benchmark numbers: per-query answer generation over retrieved context, followed by LLM-judged grading against the expected answer. Retrieval metrics (recall/MRR/content-hit) measure the search layer; QA accuracy is what memory-system comparisons are actually made on.
What's new
`llm/runners.py`: an `LLMRunner` abstraction with two transports:
- `claude:<model>`: Claude Code CLI print mode (`claude -p --output-format json --max-turns 1`). Bills the operator's Claude subscription plan; no API key required.
- `openai-compat:<model>@<base_url>`: any OpenAI-compatible chat-completions endpoint (Ollama, LM Studio, vLLM, OpenAI).

Both transports retry transient failures and report token usage + latency per call.
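To make the two spec formats concrete, here is a minimal sketch of how such strings could be split into transport, model, and base URL. It is not the code in `llm/runners.py`: `RunnerSpec` and `parse_runner_spec` are hypothetical names introduced only for illustration, and the error handling is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunnerSpec:
    """Hypothetical parsed form of a runner spec string (not the PR's type)."""
    transport: str
    model: str
    base_url: Optional[str] = None


def parse_runner_spec(spec: str) -> RunnerSpec:
    """Parse `claude:<model>` or `openai-compat:<model>@<base_url>`."""
    # Split on the first colon only, so model names and URLs may contain colons.
    transport, sep, rest = spec.partition(":")
    if not sep or not rest:
        raise ValueError(f"expected '<transport>:<model>', got {spec!r}")

    if transport == "claude":
        return RunnerSpec(transport="claude", model=rest)

    if transport == "openai-compat":
        # Split on the first '@': everything after it is treated as the base URL.
        model, at, base_url = rest.partition("@")
        if not at or not model or not base_url:
            raise ValueError(f"expected '<model>@<base_url>', got {rest!r}")
        return RunnerSpec(transport="openai-compat", model=model, base_url=base_url)

    raise ValueError(f"unknown transport {transport!r}")
```

Under these assumptions, `parse_runner_spec("claude:haiku")` yields a spec with no base URL, while a spec such as `openai-compat:llama3:8b@http://localhost:11434/v1` (an illustrative local Ollama endpoint) keeps the colon inside the model name. Keeping the spec a plain string is what lets it pass unchanged through CLI flags and into artifacts.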
`scoring/qa.py`: fixed, auditable prompts (they ship in the repo so any number can be traced to its exact rubric). The answerer must abstain when the context lacks the answer; abstention is graded correct only when the gold answer marks the question unanswerable (LoCoMo adversarial). Per-case failures are recorded as `error` (scored incorrect), never silently passed.

`run qa` CLI command: reads `per-query-retrieval.jsonl`, writes `per-query-qa.jsonl` + `qa-summary.json` with per-category accuracy, abstain/error counts, mean answer latency, and token totals. The same answerer and judge are used for every provider in a run, holding the model constant across systems.

Fairness properties
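The grading rules stated in this PR (an abstention counts as correct only on unanswerable questions, and an errored case counts as incorrect) can be sketched as below. This is an illustration under assumptions, not the logic in `scoring/qa.py`: the `QACase` fields, the summary keys, and the choice to apply the abstention rule deterministically rather than through the judge are all hypothetical, and the real `qa-summary.json` schema may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class QACase:
    """Hypothetical per-query record; field names are illustrative only."""
    category: str
    gold_unanswerable: bool          # gold answer marks the question unanswerable
    abstained: bool                  # answerer declined to answer
    judged_correct: Optional[bool]   # judge verdict; None if the judge never ran
    error: Optional[str] = None      # set when the answerer or judge call failed


def is_correct(case: QACase) -> bool:
    if case.error is not None:
        return False                 # errors are recorded and scored incorrect
    if case.abstained:
        return case.gold_unanswerable  # abstaining is right only if unanswerable
    return bool(case.judged_correct)


def summarize(cases: List[QACase]) -> Dict[str, object]:
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for case in cases:
        per_cat[case.category][0] += int(is_correct(case))
        per_cat[case.category][1] += 1
    total = len(cases)
    correct = sum(c for c, _ in per_cat.values())
    return {
        "accuracy": correct / total if total else 0.0,
        "per_category": {k: c / n for k, (c, n) in per_cat.items()},
        "abstain_count": sum(case.abstained for case in cases),
        "error_count": sum(case.error is not None for case in cases),
    }
```

Counting errors in the denominator rather than dropping them means a provider cannot gain accuracy from cases where the answerer or judge call failed.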
Testing
- `tests/llm/test_runners.py`: spec parsing, claude transport parsing/retry/error paths (subprocess mocked).
- `tests/test_qa_scoring.py`: prompt construction, verdict parsing (incl. code-fenced JSON), correct/incorrect/abstain scoring, category breakdown, error recording, token/latency accounting, artifact writing.
- Smoke-tested live against the `claude` CLI (haiku answerer, sonnet judge): answerable case answered and judged correct; adversarial case abstained and judged correct.

🤖 Generated with Claude Code