fix: QA context assembly from ranked hits + LoCoMo session dates in docs #23
Merged
Two QA-pipeline integrity fixes found by failure analysis of the first
baseline matrix (LoCoMo q300: multi_hop scored ~0 for EVERY provider
despite ~0.8 retrieval recall on those queries).
1. QA context assembly: the answer stage previously used the top-5
pre-joined matched chunks baked into retrieval rows. It now
assembles context from the stored ranked hits under explicit
budgets (10 hits, 2,500 chars/hit, 12,000 total), with numbered
sections naming the source doc. Identical assembly for every
provider; legacy artifacts without hits fall back to the old field.
2. LoCoMo session dates: the converter never wrote session_N_date_time
into docs, so every relative time expression in the dialogues
('yesterday', 'last Saturday') was unanchored — date questions
(most of multi_hop, much of temporal) were unanswerable BY ANY
PROVIDER, collapsing to abstention or hallucinated anchors. Session
timestamps now land in frontmatter, title, and the body heading,
same as the LongMemEval converter.
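A minimal sketch of fix (2), assuming a markdown doc with YAML-style frontmatter. The function name render_session_doc, the frontmatter key session_date_time, and the sample timestamp are hypothetical and for illustration only; the real converter may lay the doc out differently.

```python
import json
from typing import Optional


def render_session_doc(session_index: int, body: str,
                       date_time: Optional[str] = None) -> str:
    """Render one dialogue session as a doc (hypothetical layout).

    When a session timestamp is available it is written in three places:
    frontmatter, title, and the body heading. Without one, the doc falls
    back to the undated form.
    """
    title = f"Session {session_index}"
    if date_time:
        title = f"{title} ({date_time})"

    # json.dumps yields a double-quoted string, which keeps values with
    # colons or commas safe inside the frontmatter.
    frontmatter = ["---", f"title: {json.dumps(title)}"]
    if date_time:
        frontmatter.append(f"session_date_time: {json.dumps(date_time)}")
    frontmatter.append("---")

    return "\n".join(frontmatter + ["", f"# {title}", "", body])


# Illustrative timestamp string, not taken from the dataset.
doc = render_session_doc(3, "Caroline: I went there yesterday.",
                         "1:56 pm on 8 May, 2023")
print(doc)
```

With the timestamp in the heading, a relative expression such as 'yesterday' in the body has an anchor sitting right next to it in any retrieved chunk.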
Validated end-to-end on a previously-failed case ('When did Caroline
go to the LGBTQ conference?' gold 10 July 2023): with dated docs the
answerer derives 'two days before July 12, 2023' -> July 10 and the
judge accepts. The context-assembly change alone did NOT fix
multi_hop (re-judged 189 cases: still ~0); the date anchor was the
binding constraint. Both fixes are kept because budget assembly
supplies strictly more material per query and is now auditable.
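The date arithmetic in the validated case can be checked by hand. The sketch below assumes that July 12, 2023 is the session timestamp and that the dialogue says the event happened two days earlier; the helper resolve_days_before is hypothetical and only illustrates why a missing anchor forces abstention.

```python
from datetime import date, timedelta
from typing import Optional


def resolve_days_before(anchor: Optional[date], days: int) -> Optional[date]:
    """Resolve 'N days before <anchor>'; None when no anchor is known."""
    if anchor is None:
        # Undated doc: the relative expression cannot be resolved.
        return None
    return anchor - timedelta(days=days)


session_date = date(2023, 7, 12)  # assumed session timestamp
print(resolve_days_before(session_date, 2))  # 2023-07-10, matching gold
print(resolve_days_before(None, 2))          # None
```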
Consequence: all pre-fix LoCoMo QA numbers are invalidated; the q300
matrix run must be regenerated after this lands. Retrieval metrics are
unaffected by (1) and only mildly affected by (2) (date tokens in
docs).
8 new tests (assembly budgets/fallback, dated docs, undated fallback).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
Summary
Two QA-pipeline integrity fixes from failure analysis of the first baseline matrix, where multi_hop scored ~0 for every provider despite ~0.8 retrieval recall.
1. Budget-based context assembly
The answer stage used top-5 pre-joined matched chunks (~1K chars). It now assembles from stored ranked hits under explicit budgets (10 hits / 2,500 chars each / 12,000 total) with numbered, source-attributed sections. Identical for every provider; legacy artifacts fall back.
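A minimal sketch of budget-based assembly under the limits stated above. The row keys hits and matched_chunks, the per-hit keys doc and text, and the choice to count only hit text (not section headers) against the total budget are assumptions made for illustration, not the pipeline's actual schema.

```python
from typing import Any

MAX_HITS = 10
MAX_CHARS_PER_HIT = 2_500
MAX_TOTAL_CHARS = 12_000


def assemble_context(row: dict[str, Any]) -> str:
    """Build answer-stage context from ranked hits under explicit budgets."""
    hits = row.get("hits") or []
    if not hits:
        # Legacy artifact without stored hits: fall back to the old
        # pre-joined field (hypothetical key name).
        return row.get("matched_chunks", "")

    sections = []
    remaining = MAX_TOTAL_CHARS
    for rank, hit in enumerate(hits[:MAX_HITS], start=1):
        if remaining <= 0:
            break
        # Truncate to the per-hit budget, and never exceed what is left
        # of the total budget.
        text = hit["text"][:min(MAX_CHARS_PER_HIT, remaining)]
        # Numbered section that names the source doc.
        sections.append(f"[{rank}] {hit['doc']}\n{text}")
        remaining -= len(text)
    return "\n\n".join(sections)


row = {"hits": [{"doc": f"doc{i}", "text": "x" * 3000} for i in range(12)]}
context = assemble_context(row)
print(context.count("x"))  # 12000: four full hits plus 2,000 chars of a fifth
```

Because the function depends only on the stored hits, the same assembly applies to every provider, and the numbered sections make it possible to audit which docs reached the answerer.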
2. LoCoMo session dates (the actual root cause)
The converter never wrote session_N_date_time into docs. Every relative time expression ('yesterday', 'last Saturday') was unanchored, making date questions (most of multi_hop, much of temporal) unanswerable by any provider. Models abstained honestly or hallucinated anchors (one used today's date). Session timestamps now land in frontmatter, title, and body heading, matching the LongMemEval converter.
Evidence discipline
Testing
8 new tests; full suite green (99), lint clean.
🤖 Generated with Claude Code