feat: LongMemEval-S dataset, grouped-corpus runner mode, and converter#16
Merged
Conversation
LongMemEval-S (Wu et al., ICLR 2025) gives each of its 500 questions an independent ~50-session haystack, so it cannot run as a single shared corpus. This adds: - datasets/longmemeval.py: streaming fetch from the official HF repo (~278MB) with checksum provenance, plus shape validation on load. - converters/longmemeval_to_corpus.py: one corpus per question under groups/<question_id>/docs plus a single queries.json whose entries carry a group field. Session ids are remapped to neutral positional ids and per-turn has_answer flags dropped — the raw dataset marks evidence sessions with an answer_ id prefix, which would leak ground truth into ingested corpora. Duplicate sessions within a haystack (15 questions) are ingested once. - runner grouped mode: when queries carry groups, each group runs as an isolated mini-benchmark — fresh provider instance, group-suffixed run id (namespacing the BM project / mem0 user), per-group corpus. A failed group is recorded in provider status and skipped; ProviderSkippedError on the first group skips the provider. - QA stage: question_date metadata is appended to the question for both answerer and judge — temporal-reasoning questions (133 of 500) are unanswerable without the reference date. - CLI: datasets fetch --dataset longmemeval-s, convert longmemeval (--max-questions for dev slices); justfile recipes incl. a 25-question dev slice; README section. Verified end-to-end against real data: converted 3 real questions (152 docs) and ran grouped retrieval with bm-local (BM 0.22.0) — recall@5 = 1.0, MRR = 1.0, evidence doc ranked first in all 3 groups, grouped metadata recorded in provider status. Known follow-up: per-group overhead is ~2.3 min with bm-local (CLI cold starts, reindex, MCP warm-up per group), ~19h extrapolated for the full 500. A warm-session/shared-config optimization across groups is the next harness task. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Drew Cain <groksrc@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds LongMemEval-S (Wu et al., ICLR 2025) — the de-facto standard memory benchmark — to the harness. Each of its 500 questions carries an independent ~50-session haystack, so this also introduces grouped-corpus execution: each group runs as an isolated mini-benchmark.
What's new
datasets/longmemeval.py— streaming fetch from the official HF repo (~278MB) with checksum provenance; shape validation on load.converters/longmemeval_to_corpus.py— one corpus per question undergroups/<question_id>/docs+ a singlequeries.jsonwith agroupfield per query.ProviderSkippedErroron the first group skips the provider entirely.question_datemetadata is appended to the question for both answerer and judge (133/500 questions are temporal-reasoning and unanswerable without the reference date).datasets fetch --dataset longmemeval-s,convert longmemeval --max-questions), justfile recipes incl. 25-question dev slice, README docs.Anti-leakage (scrutiny-proofing)
The raw dataset marks evidence sessions with an
answer_session-id prefix and per-turnhas_answerflags. The converter remaps all session ids to neutral positional ids (<qid>-s012) and drops turn flags, so nothing a provider ingests distinguishes evidence from filler. Covered by a dedicated test. Duplicate sessions within a haystack (15 questions in the real data) are ingested once.Verification
Known follow-up
Per-group overhead with bm-local is ~2.3 min (CLI cold starts + reindex + MCP warm-up per group) → ~19h extrapolated for the full 500. Warm-session/shared-config reuse across groups is the next harness task; the 25-question dev slice is practical today (~1h).
🤖 Generated with Claude Code