feat: LongMemEval-S dataset, grouped-corpus runner mode, and converter #16

Merged: groksrc merged 1 commit into main from feat/longmemeval-s on Jun 12, 2026
@groksrc groksrc commented Jun 12, 2026

Summary

Adds LongMemEval-S (Wu et al., ICLR 2025) — the de-facto standard memory benchmark — to the harness. Each of its 500 questions carries an independent ~50-session haystack, so this also introduces grouped-corpus execution: each group runs as an isolated mini-benchmark.

What's new

  • datasets/longmemeval.py — streaming fetch from the official HF repo (~278MB) with checksum provenance; shape validation on load.
  • converters/longmemeval_to_corpus.py — one corpus per question under groups/<question_id>/docs + a single queries.json with a group field per query.
  • Runner grouped mode — fresh provider instance per group, group-suffixed run id (namespaces the BM project / mem0 user with zero provider changes), per-group ingest/search/cleanup. Failed groups are recorded in provider status and skipped; ProviderSkippedError on the first group skips the provider entirely.
  • QA stage: question_date metadata is appended to the question for both answerer and judge (133/500 questions are temporal-reasoning and unanswerable without the reference date).
  • CLI (datasets fetch --dataset longmemeval-s, convert longmemeval --max-questions), justfile recipes incl. 25-question dev slice, README docs.
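The grouped-mode semantics described above can be sketched roughly as follows. This is an illustrative outline, not the harness code: `run_grouped`, `make_provider`, the `ingest`/`search`/`cleanup` method names, the `id` and `query` keys, and the `<run_id>-<group>` suffix format are all assumptions; only the `group` field and the `ProviderSkippedError` name come from the description. Treating a skip on a later group as an ordinary failed group is also an assumption.

```python
from collections import defaultdict
from typing import Any, Callable


class ProviderSkippedError(Exception):
    """Local stand-in for the harness exception of the same name."""


def run_grouped(make_provider: Callable[[], Any], run_id: str, queries: list[dict]) -> dict:
    # Bucket queries by their group field, preserving first-seen group order.
    by_group: dict[str, list[dict]] = defaultdict(list)
    for query in queries:
        by_group[query["group"]].append(query)

    status: dict = {"skipped": False, "failed_groups": {}, "results": {}}
    for index, (group, group_queries) in enumerate(by_group.items()):
        group_run_id = f"{run_id}-{group}"  # suffix format is an assumption
        provider = make_provider()          # fresh instance per group
        try:
            provider.ingest(group_run_id, group)
            status["results"][group] = {
                q["id"]: provider.search(group_run_id, q["query"]) for q in group_queries
            }
        except ProviderSkippedError as exc:
            if index == 0:
                # A skip on the first group skips the provider entirely.
                status["skipped"] = True
                return status
            status["failed_groups"][group] = repr(exc)
        except Exception as exc:
            # A failed group is recorded and the run moves on to the next group.
            status["failed_groups"][group] = repr(exc)
        finally:
            provider.cleanup(group_run_id)  # per-group cleanup, even on failure
    return status
```

Because the group name is folded into the run id, a provider that namespaces its storage by run id gets per-group isolation without any provider-side change, which matches the "zero provider changes" claim above.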

Anti-leakage (scrutiny-proofing)

The raw dataset marks evidence sessions with an answer_ session-id prefix and per-turn has_answer flags. The converter remaps all session ids to neutral positional ids (<qid>-s012) and drops turn flags, so nothing a provider ingests distinguishes evidence from filler. Covered by a dedicated test. Duplicate sessions within a haystack (15 questions in the real data) are ingested once.
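A minimal sketch of the scrub described above, under stated assumptions: `scrub_haystack` is a hypothetical name, sessions are assumed to arrive as `(raw_session_id, turns)` pairs, duplicates are assumed to be detected by raw session id, and the positional index is assumed to be zero-based and counted over unique sessions. The actual converter may differ on each of these points.

```python
def scrub_haystack(
    question_id: str, sessions: list[tuple[str, list[dict]]]
) -> tuple[dict[str, list[dict]], dict[str, str]]:
    """Return (docs keyed by neutral id, raw-id -> neutral-id map)."""
    id_map: dict[str, str] = {}
    docs: dict[str, list[dict]] = {}
    for raw_id, turns in sessions:
        if raw_id in id_map:
            continue  # duplicate session within the haystack: ingest once
        # Neutral positional id, e.g. "<qid>-s012"; carries no answer_ prefix.
        neutral_id = f"{question_id}-s{len(id_map):03d}"
        id_map[raw_id] = neutral_id
        # Drop the per-turn has_answer flag so evidence turns look like filler.
        docs[neutral_id] = [
            {key: value for key, value in turn.items() if key != "has_answer"}
            for turn in turns
        ]
    return docs, id_map
```

The returned `id_map` is what lets the ground-truth evidence ids in the queries be rewritten to the same neutral ids, so scoring still works while the ingested corpus stays free of markers.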

Verification

  • 16 new unit tests (converter incl. leakage scrub, grouped runner isolation/failure/skip semantics, QA date passthrough); full suite + test-int green.
  • Live end-to-end: converted 3 real questions (152 docs), ran grouped retrieval with bm-local (BM 0.22.0): recall@5 = 1.0, MRR = 1.0, evidence doc ranked #1 in all 3 groups, grouped metadata recorded.

Known follow-up

Per-group overhead with bm-local is ~2.3 min (CLI cold starts + reindex + MCP warm-up per group) → ~19h extrapolated for the full 500. Warm-session/shared-config reuse across groups is the next harness task; the 25-question dev slice is practical today (~1h).
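The extrapolation above is a straight multiplication of the measured per-group overhead by the number of groups. As a quick check (the helper name `estimated_hours` is made up for illustration, and the estimate assumes overhead stays constant per group and groups run sequentially):

```python
PER_GROUP_MINUTES = 2.3  # measured bm-local overhead per group, from the text


def estimated_hours(num_questions: int, per_group_minutes: float = PER_GROUP_MINUTES) -> float:
    # One group per question, run one after another.
    return num_questions * per_group_minutes / 60


print(f"full set:  {estimated_hours(500):.1f} h")  # about 19.2 h
print(f"dev slice: {estimated_hours(25):.1f} h")   # about 1.0 h
```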

🤖 Generated with Claude Code

LongMemEval-S (Wu et al., ICLR 2025) gives each of its 500 questions an
independent ~50-session haystack, so it cannot run as a single shared
corpus. This adds:

- datasets/longmemeval.py: streaming fetch from the official HF repo
  (~278MB) with checksum provenance, plus shape validation on load.
- converters/longmemeval_to_corpus.py: one corpus per question under
  groups/<question_id>/docs plus a single queries.json whose entries
  carry a group field. Session ids are remapped to neutral positional
  ids and per-turn has_answer flags dropped — the raw dataset marks
  evidence sessions with an answer_ id prefix, which would leak ground
  truth into ingested corpora. Duplicate sessions within a haystack
  (15 questions) are ingested once.
- runner grouped mode: when queries carry groups, each group runs as an
  isolated mini-benchmark — fresh provider instance, group-suffixed
  run id (namespacing the BM project / mem0 user), per-group corpus.
  A failed group is recorded in provider status and skipped;
  ProviderSkippedError on the first group skips the provider.
- QA stage: question_date metadata is appended to the question for both
  answerer and judge — temporal-reasoning questions (133 of 500) are
  unanswerable without the reference date.
- CLI: datasets fetch --dataset longmemeval-s, convert longmemeval
  (--max-questions for dev slices); justfile recipes incl. a
  25-question dev slice; README section.

Verified end-to-end against real data: converted 3 real questions
(152 docs) and ran grouped retrieval with bm-local (BM 0.22.0) —
recall@5 = 1.0, MRR = 1.0, evidence doc ranked first in all 3 groups,
grouped metadata recorded in provider status.

Known follow-up: per-group overhead is ~2.3 min with bm-local (CLI cold
starts, reindex, MCP warm-up per group), ~19h extrapolated for the full
500. A warm-session/shared-config optimization across groups is the
next harness task.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
@groksrc groksrc merged commit a21bb07 into main Jun 12, 2026
1 check passed