
feat: Penfield audit corrections for the LoCoMo answer key #17

Merged: groksrc merged 1 commit into main from feat/locomo-corrected-key on Jun 12, 2026

groksrc commented Jun 12, 2026

Member

Summary

Wires the Penfield Labs LoCoMo audit (April 2026) into the harness: 156 answer-key errors across LoCoMo's 1,540 usable questions get corrected answers and corrected evidence citations at conversion time. This is a prerequisite for publishing LoCoMo numbers that withstand scrutiny: the original key is publicly documented as ~6.4% wrong. (That figure matches the 99 errors not typed WRONG_CITATION, 99/1,540; all 156 errors together touch about 10.1% of the usable questions.)

Design

  • Pinned provenance: corrections fetched at audit commit 9493fb4b and merged into one corrections.json with checksum provenance — runs are reproducible and the upstream can't drift silently.
  • Fail-loud application: every correction is cross-checked against the dataset's question text during conversion; any mismatch raises instead of silently mis-correcting. (Verified: 0 mismatches across all 156 on the real dataset.)
  • Both scoring surfaces: corrected answers feed QA scoring; corrected evidence citations feed retrieval ground truth (30 queries' ground-truth sets change).
  • Auditability: corrected queries carry audit_corrected: true and audit_error_type (the audit's error_type) in metadata, so per-correction impact can be analyzed in any run's artifacts.
  • Opt-in via convert locomo --audit-corrections <path>; without the flag, conversion is byte-identical to before (covered by test).
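
The fail-loud application and metadata tagging described in the list above could look roughly like the sketch below. It is illustrative only: the function name `apply_corrections`, the exception `CorrectionMismatchError`, and the field names (`question`, `expected_answer`, `evidence`, `corrected_answer`, `corrected_evidence`, `error_type`) are assumptions made for this example, not the harness's actual API.

```python
from copy import deepcopy


class CorrectionMismatchError(ValueError):
    """Raised when a correction does not line up with the dataset."""


def apply_corrections(queries, corrections):
    """Return a corrected copy of `queries` (query id -> query dict).

    `corrections` maps query id -> correction dict. All field names are
    hypothetical; the real harness schema may differ.
    """
    out = deepcopy(queries)  # never mutate the caller's data
    for qid, corr in corrections.items():
        if qid not in out:
            raise CorrectionMismatchError(f"correction targets unknown query {qid!r}")
        query = out[qid]
        # Fail loud: refuse to correct a question whose text has drifted.
        if query["question"].strip() != corr["question"].strip():
            raise CorrectionMismatchError(f"question text mismatch for {qid!r}")
        if "corrected_answer" in corr:
            query["expected_answer"] = corr["corrected_answer"]  # QA scoring
        if "corrected_evidence" in corr:
            query["evidence"] = list(corr["corrected_evidence"])  # retrieval ground truth
        meta = query.setdefault("metadata", {})
        meta["audit_corrected"] = True
        meta["audit_error_type"] = corr["error_type"]
    return out
```

Raising on any mismatch, rather than skipping the correction, means a changed dataset or a changed audit stops the conversion instead of producing a partially corrected key.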

Verification

  • 8 new unit tests (application, evidence remap, question-mismatch fail-loud, loader validation, unchanged-without-flag).
  • Live run against the real dataset + real audit: 156 applied, type distribution matches the audit's published counts (57 WRONG_CITATION, 33 HALLUCINATION, 26 TEMPORAL_ERROR, 24 ATTRIBUTION_ERROR, 13 AMBIGUOUS, 3 INCOMPLETE), 120 answers changed, all 1,830 untouched queries byte-identical.
  • Full suite green.
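
A check like the type-distribution comparison in the live run could be sketched as follows. The expected counts are the ones quoted in this PR; the function name `check_type_distribution` and the query layout (a `metadata` dict holding `audit_corrected` and `audit_error_type`) are assumptions for illustration.

```python
from collections import Counter

# Counts as published by the audit and quoted in this PR.
EXPECTED_COUNTS = {
    "WRONG_CITATION": 57,
    "HALLUCINATION": 33,
    "TEMPORAL_ERROR": 26,
    "ATTRIBUTION_ERROR": 24,
    "AMBIGUOUS": 13,
    "INCOMPLETE": 3,
}


def check_type_distribution(queries, expected=EXPECTED_COUNTS):
    """Count error types over corrected queries and compare to `expected`.

    Returns the number of corrected queries; raises AssertionError on mismatch.
    """
    observed = Counter(
        q["metadata"]["audit_error_type"]
        for q in queries
        if q.get("metadata", {}).get("audit_corrected")
    )
    if dict(observed) != dict(expected):
        raise AssertionError(f"type distribution mismatch: {dict(observed)}")
    return sum(observed.values())
```

The six published counts sum to 156, so a passing check also confirms the total number of applied corrections.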

🤖 Generated with Claude Code

The April 2026 Penfield Labs audit (github.com/dial481/locomo-audit)
found 156 answer-key errors in LoCoMo's 1,540 usable questions:
hallucinated facts, temporal arithmetic mistakes, speaker-attribution
errors, and wrong evidence citations. Scoring against the original key
caps a perfect system at ~93.6% and penalizes correct systems on the
broken 6.4%.
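
The percentages above can be reproduced from the counts quoted in this PR. The sketch assumes the 6.4% figure counts only the errors not typed WRONG_CITATION; that is an inference from the numbers, not something the PR states.

```python
TOTAL_ERRORS = 156
WRONG_CITATION = 57
USABLE_QUESTIONS = 1540

# Assumption: the "broken" share excludes WRONG_CITATION errors.
non_citation = TOTAL_ERRORS - WRONG_CITATION        # 99
broken_share = non_citation / USABLE_QUESTIONS      # share of the key assumed broken
ceiling = 1 - broken_share                          # best score a perfect system can reach
all_errors_share = TOTAL_ERRORS / USABLE_QUESTIONS  # share touched by any correction

print(f"{broken_share:.1%} {ceiling:.1%} {all_errors_share:.1%}")
# prints: 6.4% 93.6% 10.1%
```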

- datasets/locomo_audit.py: fetch the 10 per-conversation error files
  at a pinned audit commit (9493fb4b), merge into corrections.json with
  checksum provenance; loader validates shape and rejects duplicates.
- convert locomo --audit-corrections: corrected answers replace
  expected_answer (QA scoring) and corrected citations replace evidence
  (retrieval ground truth). Corrected queries carry audit_corrected +
  audit_error_type metadata. Every correction is cross-checked against
  the dataset's question text at conversion time so audit/dataset drift
  fails loudly instead of silently mis-correcting.
- justfile recipes (bench-convert-locomo-corrected) and README section.
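
A minimal sketch of the merge and loader steps described for datasets/locomo_audit.py, under stated assumptions: the function names, the `REQUIRED_FIELDS` schema, and the layout of the merged document are hypothetical, and the network fetch at the pinned commit is left out (the per-conversation files are passed in as raw bytes).

```python
import hashlib
import json

# Hypothetical minimal schema for one correction entry.
REQUIRED_FIELDS = {"question_id", "question", "error_type"}


def merge_error_files(files, commit):
    """Merge per-conversation error files (name -> raw bytes) into one document.

    Records the pinned commit and a SHA-256 checksum per source file so a
    later run can detect upstream drift.
    """
    corrections = []
    sources = {}
    for name in sorted(files):  # sorted for a deterministic merge order
        raw = files[name]
        sources[name] = hashlib.sha256(raw).hexdigest()
        corrections.extend(json.loads(raw))
    return {"audit_commit": commit, "sources": sources, "corrections": corrections}


def validate_corrections(doc):
    """Validate entry shape and reject duplicates; return id -> correction."""
    by_id = {}
    for entry in doc["corrections"]:
        missing = REQUIRED_FIELDS - set(entry)
        if missing:
            raise ValueError(f"correction missing fields: {sorted(missing)}")
        qid = entry["question_id"]
        if qid in by_id:
            raise ValueError(f"duplicate correction for {qid!r}")
        by_id[qid] = entry
    return by_id
```

Storing the checksums next to the merged corrections lets a later run confirm that the inputs are the same bytes that were originally fetched.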

Verified against the real dataset + real audit: all 156 corrections
applied (type distribution matches the audit's published counts:
57 WRONG_CITATION, 33 HALLUCINATION, 26 TEMPORAL_ERROR,
24 ATTRIBUTION_ERROR, 13 AMBIGUOUS, 3 INCOMPLETE), 120 answers and
30 ground-truth sets changed, all 1,830 uncorrected queries
byte-identical to an uncorrected conversion, zero question-text
mismatches.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>

groksrc merged commit b8ccfa8 into main on Jun 12, 2026
1 check passed