fix: QA context assembly from ranked hits + LoCoMo session dates in docs #23

Merged: groksrc merged 1 commit into main from feat/qa-context-assembly on Jun 12, 2026
@groksrc groksrc commented Jun 12, 2026

Summary

Two QA-pipeline integrity fixes from failure analysis of the first baseline matrix, where multi_hop scored ~0 for every provider despite ~0.8 retrieval recall.

1. Budget-based context assembly

The answer stage previously used the top-5 pre-joined matched chunks baked into retrieval rows (~1K chars). It now assembles context from the stored ranked hits under explicit budgets (10 hits / 2,500 chars each / 12,000 total), with numbered sections that name the source doc. Assembly is identical for every provider; legacy artifacts without stored hits fall back to the old pre-joined field.
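
A minimal sketch of budget-based assembly, not the merged implementation. The `Hit` type, the `assemble_context` name, the section header format, and the choice to count only hit text (not headers or separators) against the budgets are all assumptions made for illustration; only the three budget values come from this PR.

```python
from dataclasses import dataclass

# Budgets stated in this PR.
MAX_HITS = 10
MAX_CHARS_PER_HIT = 2_500
MAX_TOTAL_CHARS = 12_000


@dataclass
class Hit:
    """Hypothetical ranked-hit record: source doc id plus matched text."""
    doc_id: str
    text: str


def assemble_context(hits: list[Hit], legacy_context: str = "") -> str:
    """Build numbered, source-attributed sections from ranked hits.

    Assumption: budgets are counted over hit text only.
    """
    if not hits:
        # Legacy artifacts without stored hits fall back to the old field.
        return legacy_context

    sections: list[str] = []
    used = 0
    for rank, hit in enumerate(hits[:MAX_HITS], start=1):
        remaining = MAX_TOTAL_CHARS - used
        if remaining <= 0:
            break  # total budget exhausted
        body = hit.text[: min(MAX_CHARS_PER_HIT, remaining)]
        sections.append(f"[{rank}] source: {hit.doc_id}\n{body}")
        used += len(body)
    return "\n\n".join(sections)
```

Because the function takes only the ranked hits and fixed constants, every provider would get the same assembly for the same hits, and the numbered headers make it possible to audit which doc each section came from.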

2. LoCoMo session dates (the actual root cause)

The converter never wrote session_N_date_time into docs. Every relative time expression ('yesterday', 'last Saturday') was therefore unanchored, making date questions (most of multi_hop, much of temporal) unanswerable by any provider. Models abstained honestly or hallucinated anchors (one used today's date). Session timestamps now land in frontmatter, title, and body heading, matching the LongMemEval converter.
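
A rough sketch of what writing the session timestamp into all three places could look like. The `render_session_doc` function, the frontmatter field names, the doc layout, and the assumed input shape (a dict with `session_N` turn lists and `session_N_date_time` strings) are illustrative assumptions, not the converter's actual code; the timestamp string in the usage note is a made-up example value.

```python
def render_session_doc(conversation: dict, n: int) -> str:
    """Render one session as a doc, anchoring it with its date when present."""
    # Assumed keys: "session_{n}" (list of turns), "session_{n}_date_time" (str).
    date_time = conversation.get(f"session_{n}_date_time")
    turns = conversation.get(f"session_{n}", [])

    label = f"Session {n}"
    if date_time:
        label += f" ({date_time})"  # title and heading both carry the date

    lines = ["---", f'title: "{label}"']
    if date_time:
        lines.append(f'date: "{date_time}"')  # frontmatter date field
    lines += ["---", "", f"# {label}", ""]
    lines += [f'{turn["speaker"]}: {turn["text"]}' for turn in turns]
    return "\n".join(lines)
```

Called on a conversation whose `session_1_date_time` is, say, `"12 July, 2023"`, the rendered doc carries that string in the frontmatter title, the frontmatter date, and the body heading; a session without a timestamp still renders, just undated.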

Evidence discipline

  • Context assembly alone was re-judged on all 189 multi_hop cases: still ~0 → rejected as the root cause; kept because it supplies strictly more material per query and is now auditable.
  • Date fix validated end-to-end on a previously-failed case: answerer derives 'two days before July 12, 2023' → July 10, judge accepts.
  • All pre-fix LoCoMo QA numbers are invalidated; q300 will be re-run after this merges.
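
The derivation in the validated case is plain date arithmetic once the session anchor is known. A small illustration, with the helper name `resolve_days_before` invented for this sketch:

```python
from datetime import date, timedelta


def resolve_days_before(anchor: date, days: int) -> date:
    """Resolve 'N days before <anchor>' to an absolute date."""
    return anchor - timedelta(days=days)


# 'two days before July 12, 2023' resolves to July 10, 2023.
resolved = resolve_days_before(date(2023, 7, 12), 2)
print(resolved.isoformat())  # 2023-07-10
```

Without the session date in the doc there is no `anchor` to pass in, which is why the pre-fix answerer could only abstain or guess.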

Testing

8 new tests; full suite green (99), lint clean.

🤖 Generated with Claude Code

Two QA-pipeline integrity fixes found by failure analysis of the first
baseline matrix (LoCoMo q300: multi_hop scored ~0 for EVERY provider
despite ~0.8 retrieval recall on those queries).

1. QA context assembly: the answer stage previously used the top-5
   pre-joined matched chunks baked into retrieval rows. It now
   assembles context from the stored ranked hits under explicit
   budgets (10 hits, 2,500 chars/hit, 12,000 total), with numbered
   sections naming the source doc. Identical assembly for every
   provider; legacy artifacts without hits fall back to the old field.

2. LoCoMo session dates: the converter never wrote session_N_date_time
   into docs, so every relative time expression in the dialogues
   ('yesterday', 'last Saturday') was unanchored — date questions
   (most of multi_hop, much of temporal) were unanswerable BY ANY
   PROVIDER, collapsing to abstention or hallucinated anchors. Session
   timestamps now land in frontmatter, title, and the body heading,
   same as the LongMemEval converter.

Validated end-to-end on a previously-failed case ('When did Caroline
go to the LGBTQ conference?' gold 10 July 2023): with dated docs the
answerer derives 'two days before July 12, 2023' -> July 10 and the
judge accepts. The context-assembly change alone did NOT fix
multi_hop (re-judged 189 cases: still ~0); the date anchor was the
binding constraint. Both fixes are kept because budget assembly
supplies strictly more material per query and is now auditable.

Consequence: all pre-fix LoCoMo QA numbers are invalidated; the q300
matrix run must be regenerated after this lands. Retrieval metrics are
unaffected by (1) and only mildly affected by (2) (date tokens in
docs).

8 new tests (assembly budgets/fallback, dated docs, undated fallback).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
@groksrc groksrc merged commit 498916d into main Jun 12, 2026
1 check passed