feat: Penfield audit corrections for the LoCoMo answer key by groksrc · Pull Request #17 · basicmachines-co/basic-memory-benchmarks

groksrc · 2026-06-12T18:45:41Z

Summary

Wires the Penfield Labs LoCoMo audit (April 2026) into the harness: 156 answer-key errors across LoCoMo's 1,540 usable questions get corrected answers and corrected evidence citations at conversion time. This is a prerequisite for publishing LoCoMo numbers that withstand scrutiny — the original key is publicly documented as ~6.4% wrong.

Design

Pinned provenance: corrections fetched at audit commit 9493fb4b and merged into one corrections.json with checksum provenance — runs are reproducible and the upstream can't drift silently.
Fail-loud application: every correction is cross-checked against the dataset's question text during conversion; any mismatch raises instead of silently mis-correcting. (Verified: 0 mismatches across all 156 on the real dataset.)
Both scoring surfaces: corrected answers feed QA scoring; corrected evidence citations feed retrieval ground truth (30 queries' ground-truth sets change).
Auditability: corrected queries carry audit_corrected: true and the audit's error_type in metadata, so per-correction impact can be analyzed in any run's artifacts.
Opt-in via convert locomo --audit-corrections <path>; without the flag, conversion is byte-identical to before (covered by test).
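The fail-loud application and metadata tagging described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Query` dataclass, `AuditMismatchError`, `apply_corrections`, and the correction field names (`question`, `corrected_answer`, `corrected_evidence`, `error_type`) are all hypothetical. Only the `audit_corrected` and `audit_error_type` metadata keys come from the description.

```python
from dataclasses import dataclass, field


class AuditMismatchError(ValueError):
    """Raised when a correction does not match the dataset question it targets."""


@dataclass
class Query:
    # Hypothetical shape of a converted benchmark query.
    query_id: str
    question: str
    expected_answer: str
    evidence: list[str]
    metadata: dict = field(default_factory=dict)


def apply_corrections(queries: list[Query], corrections: dict[str, dict]) -> int:
    """Apply corrections in place and return how many were applied."""
    by_id = {q.query_id: q for q in queries}
    applied = 0
    for query_id, fix in corrections.items():
        query = by_id.get(query_id)
        if query is None:
            raise AuditMismatchError(f"no query with id {query_id!r}")
        # Cross-check the question text so audit/dataset drift raises
        # instead of correcting the wrong question.
        if query.question.strip() != fix["question"].strip():
            raise AuditMismatchError(f"question text differs for {query_id!r}")
        if "corrected_answer" in fix:
            query.expected_answer = fix["corrected_answer"]  # feeds QA scoring
        if "corrected_evidence" in fix:
            query.evidence = list(fix["corrected_evidence"])  # feeds retrieval ground truth
        query.metadata["audit_corrected"] = True
        query.metadata["audit_error_type"] = fix["error_type"]
        applied += 1
    return applied


# Illustrative data only; not taken from LoCoMo or the audit.
queries = [
    Query("conv-1/q0", "When did A visit B?", "May 2023", ["D1:3"]),
    Query("conv-1/q1", "Who adopted a dog?", "A", ["D2:5"]),
]
corrections = {
    "conv-1/q0": {
        "question": "When did A visit B?",
        "error_type": "TEMPORAL_ERROR",
        "corrected_answer": "June 2023",
    }
}
applied = apply_corrections(queries, corrections)
```

Queries without a correction are never touched, which is what makes a byte-identical comparison against an uncorrected conversion possible.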

Verification

8 new unit tests (application, evidence remap, question-mismatch fail-loud, loader validation, unchanged-without-flag).
Live run against the real dataset + real audit: 156 applied, type distribution matches the audit's published counts (57 WRONG_CITATION, 33 HALLUCINATION, 26 TEMPORAL_ERROR, 24 ATTRIBUTION_ERROR, 13 AMBIGUOUS, 3 INCOMPLETE), 120 answers changed, all 1,830 untouched queries byte-identical.
Full suite green.
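The figures quoted in this description can be sanity-checked with a few lines of arithmetic. One reading, which is an inference from the numbers rather than something the PR states, is that the ~6.4% figure counts every error type except WRONG_CITATION. Likewise, the total of 1,986 queries is inferred by adding the 156 corrected and 1,830 untouched queries.

```python
# Error-type counts as listed in the Verification section.
counts = {
    "WRONG_CITATION": 57,
    "HALLUCINATION": 33,
    "TEMPORAL_ERROR": 26,
    "ATTRIBUTION_ERROR": 24,
    "AMBIGUOUS": 13,
    "INCOMPLETE": 3,
}
usable_questions = 1540
untouched_queries = 1830

total_errors = sum(counts.values())                       # 156
non_citation = total_errors - counts["WRONG_CITATION"]    # 99 (assumed to be the answer-affecting subset)

pct_all = round(100 * total_errors / usable_questions, 1)       # 10.1
pct_non_citation = round(100 * non_citation / usable_questions, 1)  # 6.4
ceiling = round(100 - 100 * non_citation / usable_questions, 1)     # 93.6

total_queries = total_errors + untouched_queries          # 1986

print(total_errors, non_citation, pct_all, pct_non_citation, ceiling, total_queries)
```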

🤖 Generated with Claude Code

The April 2026 Penfield Labs audit (github.com/dial481/locomo-audit) found 156 answer-key errors in LoCoMo's 1,540 usable questions: hallucinated facts, temporal arithmetic mistakes, speaker-attribution errors, and wrong evidence citations. Scoring against the original key caps a perfect system at ~93.6% and penalizes correct systems on the broken 6.4%.

- datasets/locomo_audit.py: fetch the 10 per-conversation error files at a pinned audit commit (9493fb4b), merge into corrections.json with checksum provenance; loader validates shape and rejects duplicates.
- convert locomo --audit-corrections: corrected answers replace expected_answer (QA scoring) and corrected citations replace evidence (retrieval ground truth). Corrected queries carry audit_corrected + audit_error_type metadata. Every correction is cross-checked against the dataset's question text at conversion time so audit/dataset drift fails loudly instead of silently mis-correcting.
- justfile recipes (bench-convert-locomo-corrected) and README section.

Verified against the real dataset + real audit: all 156 corrections applied (type distribution matches the audit's published counts: 57 WRONG_CITATION, 33 HALLUCINATION, 26 TEMPORAL_ERROR, 24 ATTRIBUTION_ERROR, 13 AMBIGUOUS, 3 INCOMPLETE), 120 answers and 30 ground-truth sets changed, all 1,830 uncorrected queries byte-identical to an uncorrected conversion, zero question-text mismatches.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
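The merge-with-provenance and loader-validation steps named in the commit message could look something like the sketch below. It is an illustration under stated assumptions, not the contents of datasets/locomo_audit.py: the function names, the file names, the required keys, and the layout of the merged document are hypothetical, SHA-256 is assumed as the checksum, and the network fetch at the pinned commit is left out so the example runs on in-memory bytes.

```python
import hashlib
import json

# Hypothetical minimum schema for one correction entry.
REQUIRED_KEYS = {"query_id", "question", "error_type"}


def merge_corrections(files: dict[str, bytes], audit_commit: str) -> dict:
    """Merge per-conversation error files into one document with checksums."""
    entries = []
    sources = []
    for name in sorted(files):  # sorted for a deterministic merge order
        raw = files[name]
        sources.append({"file": name, "sha256": hashlib.sha256(raw).hexdigest()})
        entries.extend(json.loads(raw))
    return {"audit_commit": audit_commit, "sources": sources, "corrections": entries}


def validate_corrections(doc: dict) -> dict[str, dict]:
    """Check shape, reject duplicates, and return corrections keyed by query id."""
    for key in ("audit_commit", "sources", "corrections"):
        if key not in doc:
            raise ValueError(f"corrections file is missing {key!r}")
    by_id: dict[str, dict] = {}
    for entry in doc["corrections"]:
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"correction is missing fields: {sorted(missing)}")
        if entry["query_id"] in by_id:
            raise ValueError(f"duplicate correction for {entry['query_id']!r}")
        by_id[entry["query_id"]] = entry
    return by_id


# Illustrative inputs only; real error files would come from the audit repository.
files = {
    "conv-1.json": json.dumps(
        [{"query_id": "conv-1/q0", "question": "Q?", "error_type": "HALLUCINATION"}]
    ).encode(),
    "conv-2.json": json.dumps(
        [{"query_id": "conv-2/q4", "question": "R?", "error_type": "WRONG_CITATION"}]
    ).encode(),
}
doc = merge_corrections(files, audit_commit="9493fb4b")
loaded = validate_corrections(doc)
```

Recording a checksum per source file alongside the pinned commit means a later re-fetch can be compared against the stored digests, so any upstream change shows up as a mismatch.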

groksrc merged commit b8ccfa8 into main Jun 12, 2026
1 check passed

groksrc merged 1 commit into main from feat/locomo-corrected-key
