fix: QA context assembly from ranked hits + LoCoMo session dates in docs by groksrc · Pull Request #23 · basicmachines-co/basic-memory-benchmarks

groksrc · 2026-06-12T21:53:01Z

Summary

Two QA-pipeline integrity fixes from failure analysis of the first baseline matrix, where multi_hop scored ~0 for every provider despite ~0.8 retrieval recall.

1. Budget-based context assembly

The answer stage used the top-5 pre-joined matched chunks baked into retrieval rows (~1K chars). It now assembles context from the stored ranked hits under explicit budgets (10 hits / 2,500 chars each / 12,000 total) with numbered, source-attributed sections. Assembly is identical for every provider; legacy artifacts without stored hits fall back to the old pre-joined field.
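The budget rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Hit` type, its `doc_id` and `text` fields, the `assemble_context` name, and the `legacy_context` parameter are all hypothetical, and the sketch assumes the 12,000-character total counts chunk text only, not the section headers.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical shape of one stored ranked hit."""
    doc_id: str  # assumed: identifier of the source document
    text: str    # assumed: matched chunk text


def assemble_context(
    hits: list[Hit],
    max_hits: int = 10,
    max_chars_per_hit: int = 2500,
    max_total_chars: int = 12000,
    legacy_context: str = "",
) -> str:
    """Build numbered, source-attributed sections under explicit budgets."""
    if not hits:
        # Legacy artifacts without stored hits: reuse the old pre-joined field.
        return legacy_context

    sections = []
    used = 0
    for rank, hit in enumerate(hits[:max_hits], start=1):
        remaining = max_total_chars - used
        if remaining <= 0:
            break
        # Truncate to the per-hit budget or to what is left of the total.
        body = hit.text[: min(max_chars_per_hit, remaining)]
        sections.append(f"[{rank}] source: {hit.doc_id}\n{body}")
        used += len(body)
    return "\n\n".join(sections)
```

Because the function depends only on the ranked hits and fixed budgets, every provider's answer stage would see context built the same way, and the numbered headers make it possible to audit which document each passage came from.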

2. LoCoMo session dates (the actual root cause)

The converter never wrote session_N_date_time into docs. Every relative time expression ('yesterday', 'last Saturday') was unanchored, making date questions — most of multi_hop, much of temporal — unanswerable by any provider. Models abstained honestly or hallucinated anchors (one used today's date). Session timestamps now land in frontmatter, title, and body heading, matching the LongMemEval converter.
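To make the three placements concrete, here is a hedged sketch of a converter step that writes the session timestamp into frontmatter, title, and body heading. Only the `session_N_date_time` key comes from this PR; the `render_session_doc` name, the `session_date` frontmatter key, the `speaker` and `text` turn fields, and the sample timestamp are assumptions for illustration.

```python
import json


def render_session_doc(conversation: dict, n: int) -> str:
    """Render session n as a markdown doc, dated when a timestamp exists."""
    date_time = conversation.get(f"session_{n}_date_time")
    turns = conversation.get(f"session_{n}", [])  # assumed: list of turn dicts

    # The date goes into the title so it also reaches the body heading.
    title = f"Session {n} ({date_time})" if date_time else f"Session {n}"

    # json.dumps yields a double-quoted string, which is also valid YAML,
    # so colons and commas inside the timestamp stay safe in frontmatter.
    front = ["---", f"title: {json.dumps(title)}"]
    if date_time:
        front.append(f"session_date: {json.dumps(date_time)}")
    front.append("---")

    body = [f"# {title}", ""]
    body += [f"{turn['speaker']}: {turn['text']}" for turn in turns]
    return "\n".join(front + body) + "\n"


# Made-up sample data, not taken from the dataset.
sample = {
    "session_1_date_time": "1:56 pm on 8 May, 2023",
    "session_1": [{"speaker": "Caroline", "text": "I went there yesterday."}],
}
print(render_session_doc(sample, 1))
```

With the timestamp present in the doc, a phrase like "yesterday" has an anchor the answerer can resolve; sessions without a timestamp still render, just undated.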

Evidence discipline

Context assembly alone was re-judged on all 189 multi_hop cases: scores stayed at ~0, so it was rejected as the root cause. It is kept because it supplies strictly more material per query and is auditable.
Date fix validated end-to-end on a previously-failed case: answerer derives 'two days before July 12, 2023' → July 10, judge accepts.
All pre-fix LoCoMo QA numbers are invalidated; q300 will be re-run after this merges.
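For reference, the date arithmetic in the validated case is the following. The answerer does this in natural language; the sketch below only illustrates why the anchor matters, and the `resolve_days_before` helper is hypothetical.

```python
from datetime import date, timedelta


def resolve_days_before(anchor: date, days: int) -> date:
    """Resolve 'N days before <session date>' to an absolute date."""
    return anchor - timedelta(days=days)


# Anchor taken from the validated case: the session is dated July 12, 2023.
session_date = date(2023, 7, 12)
print(resolve_days_before(session_date, 2).isoformat())  # 2023-07-10
```

Without `session_date` in the doc there is no value to subtract from, which is why undated docs left these questions unanswerable regardless of retrieval quality.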

Testing

8 new tests; full suite green (99), lint clean.

🤖 Generated with Claude Code

Two QA-pipeline integrity fixes found by failure analysis of the first baseline matrix (LoCoMo q300: multi_hop scored ~0 for EVERY provider despite ~0.8 retrieval recall on those queries).

1. QA context assembly: the answer stage previously used the top-5 pre-joined matched chunks baked into retrieval rows. It now assembles context from the stored ranked hits under explicit budgets (10 hits, 2,500 chars/hit, 12,000 total), with numbered sections naming the source doc. Identical assembly for every provider; legacy artifacts without hits fall back to the old field.

2. LoCoMo session dates: the converter never wrote session_N_date_time into docs, so every relative time expression in the dialogues ('yesterday', 'last Saturday') was unanchored. Date questions (most of multi_hop, much of temporal) were unanswerable BY ANY PROVIDER, collapsing to abstention or hallucinated anchors. Session timestamps now land in frontmatter, title, and the body heading, same as the LongMemEval converter.

Validated end-to-end on a previously-failed case ('When did Caroline go to the LGBTQ conference?' gold 10 July 2023): with dated docs the answerer derives 'two days before July 12, 2023' -> July 10 and the judge accepts.

The context-assembly change alone did NOT fix multi_hop (re-judged 189 cases: still ~0); the date anchor was the binding constraint. Both fixes are kept because budget assembly is strictly more material per query and now auditable.

Consequence: all pre-fix LoCoMo QA numbers are invalidated; the q300 matrix run must be regenerated after this lands. Retrieval metrics are unaffected by (1) and only mildly affected by (2) (date tokens in docs).

8 new tests (assembly budgets/fallback, dated docs, undated fallback).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>

groksrc merged commit 498916d into main Jun 12, 2026
1 check passed

groksrc merged 1 commit into main from feat/qa-context-assembly
