feat: LongMemEval-S dataset, grouped-corpus runner mode, and converter by groksrc · Pull Request #16 · basicmachines-co/basic-memory-benchmarks

groksrc · 2026-06-12T18:34:04Z

Summary

Adds LongMemEval-S (Wu et al., ICLR 2025) — the de-facto standard memory benchmark — to the harness. Each of its 500 questions carries an independent ~50-session haystack, so this also introduces grouped-corpus execution: each group runs as an isolated mini-benchmark.

What's new

datasets/longmemeval.py — streaming fetch from the official HF repo (~278MB) with checksum provenance; shape validation on load.
converters/longmemeval_to_corpus.py — one corpus per question under groups/<question_id>/docs + a single queries.json with a group field per query.
Runner grouped mode — fresh provider instance per group, group-suffixed run id (namespaces the BM project / mem0 user with zero provider changes), per-group ingest/search/cleanup. Failed groups are recorded in provider status and skipped; ProviderSkippedError on the first group skips the provider entirely.
QA stage — question_date metadata is appended to the question for both answerer and judge (133/500 questions are temporal-reasoning and unanswerable without the reference date).
CLI (datasets fetch --dataset longmemeval-s, convert longmemeval --max-questions), justfile recipes incl. 25-question dev slice, README docs.

Anti-leakage (scrutiny-proofing)

The raw dataset marks evidence sessions with an answer_ session-id prefix and per-turn has_answer flags. The converter remaps all session ids to neutral positional ids (<qid>-s012) and drops turn flags, so nothing a provider ingests distinguishes evidence from filler. Covered by a dedicated test. Duplicate sessions within a haystack (15 questions in the real data) are ingested once.

Verification

16 new unit tests (converter incl. leakage scrub, grouped runner isolation/failure/skip semantics, QA date passthrough); full suite + test-int green.
Live end-to-end: converted 3 real questions (152 docs), ran grouped retrieval with bm-local (BM 0.22.0): recall@5 = 1.0, MRR = 1.0, evidence doc ranked Add MCP stdio provider for warm-connection benchmarks #1 in all 3 groups, grouped metadata recorded.

Known follow-up

Per-group overhead with bm-local is ~2.3 min (CLI cold starts + reindex + MCP warm-up per group) → ~19h extrapolated for the full 500. Warm-session/shared-config reuse across groups is the next harness task; the 25-question dev slice is practical today (~1h).

🤖 Generated with Claude Code

LongMemEval-S (Wu et al., ICLR 2025) gives each of its 500 questions an independent ~50-session haystack, so it cannot run as a single shared corpus. This adds: - datasets/longmemeval.py: streaming fetch from the official HF repo (~278MB) with checksum provenance, plus shape validation on load. - converters/longmemeval_to_corpus.py: one corpus per question under groups/<question_id>/docs plus a single queries.json whose entries carry a group field. Session ids are remapped to neutral positional ids and per-turn has_answer flags dropped — the raw dataset marks evidence sessions with an answer_ id prefix, which would leak ground truth into ingested corpora. Duplicate sessions within a haystack (15 questions) are ingested once. - runner grouped mode: when queries carry groups, each group runs as an isolated mini-benchmark — fresh provider instance, group-suffixed run id (namespacing the BM project / mem0 user), per-group corpus. A failed group is recorded in provider status and skipped; ProviderSkippedError on the first group skips the provider. - QA stage: question_date metadata is appended to the question for both answerer and judge — temporal-reasoning questions (133 of 500) are unanswerable without the reference date. - CLI: datasets fetch --dataset longmemeval-s, convert longmemeval (--max-questions for dev slices); justfile recipes incl. a 25-question dev slice; README section. Verified end-to-end against real data: converted 3 real questions (152 docs) and ran grouped retrieval with bm-local (BM 0.22.0) — recall@5 = 1.0, MRR = 1.0, evidence doc ranked first in all 3 groups, grouped metadata recorded in provider status. Known follow-up: per-group overhead is ~2.3 min with bm-local (CLI cold starts, reindex, MCP warm-up per group), ~19h extrapolated for the full 500. A warm-session/shared-config optimization across groups is the next harness task. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Drew Cain <groksrc@gmail.com>

groksrc merged commit a21bb07 into main Jun 12, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: LongMemEval-S dataset, grouped-corpus runner mode, and converter#16

feat: LongMemEval-S dataset, grouped-corpus runner mode, and converter#16
groksrc merged 1 commit into
mainfrom
feat/longmemeval-s

groksrc commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

groksrc commented Jun 12, 2026

Summary

What's new

Anti-leakage (scrutiny-proofing)

Verification

Known follow-up

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant