docs: matrix v1.2 internal results + #994 FTS-revival impact#30

Merged: groksrc merged 1 commit into main from results/matrix-v1.2 on Jun 13, 2026

@groksrc groksrc commented Jun 13, 2026

Internal benchmark results record (run 2026-06-12, local, zero API spend).

Headlines

  • BM leads QA accuracy on LongMemEval-S (0.617 vs mem0 0.417) and is a close 2nd on ConvoMem (0.792, behind full-context at 0.825 and well ahead of mem0 at 0.474). mem0 edges retrieval recall but abstains 2-3x as often, which suggests its retrieved chunks are less answer-bearing. Retrieval recall ≠ answer quality.
  • PR #994 (FTS-revival): full corrected-LoCoMo retrieval +7.9 recall@5 / +10.0 MRR (every category up); QA accuracy +3.7 on the non-adversarial q300 subset.
  • multi_hop stays ~0.08 — bottlenecked by BM returning bullet-level chunks that strip document-level context (the session date is in the title), not by FTS. That's the next product fix.
  • Full-context is a poor baseline at small-model scale on LongMemEval-S (0.217) but wins on the smaller ConvoMem cs10 (0.825). This is consistent with full-context beating retrieval only while the corpus fits the model's working window.
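
For readers unfamiliar with the retrieval metrics quoted above, here is a minimal sketch of how recall@5 and MRR are commonly computed. It is not the benchmark harness's code: the function names are hypothetical, and it assumes recall@k means the fraction of gold items found in the top k (some harnesses instead use a hit rate, i.e. whether any gold item appears in the top k).

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int = 5) -> float:
    """Fraction of gold ids present in the top-k retrieved ids (assumed definition)."""
    if not gold_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & gold_ids) / len(gold_ids)


def reciprocal_rank(ranked_ids: Sequence[str], gold_ids: Set[str]) -> float:
    """1 / rank of the first gold id in the ranking, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0


def evaluate(queries: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 5) -> Tuple[float, float]:
    """Return (mean recall@k, MRR) over (ranked_ids, gold_ids) pairs."""
    pairs = list(queries)
    if not pairs:
        return 0.0, 0.0
    mean_recall = sum(recall_at_k(r, g, k) for r, g in pairs) / len(pairs)
    mrr = sum(reciprocal_rank(r, g) for r, g in pairs) / len(pairs)
    return mean_recall, mrr


# Toy data, not benchmark results: one query hits at rank 2, one misses entirely.
toy_queries = [
    (["a", "b", "c"], {"b"}),
    (["x", "y", "z"], {"q"}),
]
mean_recall, mrr = evaluate(toy_queries, k=5)
```

Neither metric looks at whether the retrieved chunk actually contains the answer text, which is one way a system can score well on recall yet still abstain at answer time.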

Fairness note

mem0 ran in raw-add mode (infer=false) to match the June 10 baseline; its published numbers use infer=true (LLM extraction). A future matrix should run both and document the extraction model.
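
One way a future matrix could make the infer setting explicit is to enumerate it as part of each run's spec. The sketch below is illustrative only: `RunSpec`, `build_matrix`, and the `extraction_model` field are hypothetical names, not the benchmark repo's actual schema, and the extraction model string is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class RunSpec:
    """One cell of the benchmark matrix (hypothetical schema)."""
    system: str
    dataset: str
    infer: Optional[bool] = None            # only meaningful for mem0 runs
    extraction_model: Optional[str] = None  # recorded only when infer is True


def build_matrix(datasets: Sequence[str], extraction_model: str) -> List[RunSpec]:
    """Enumerate bm-local once and mem0-local in both infer modes per dataset."""
    specs: List[RunSpec] = []
    for dataset in datasets:
        specs.append(RunSpec("bm-local", dataset))
        for infer in (False, True):
            specs.append(
                RunSpec(
                    "mem0-local",
                    dataset,
                    infer=infer,
                    # Document which model did the extraction when infer=True.
                    extraction_model=extraction_model if infer else None,
                )
            )
    return specs


# "<extraction-model>" is a placeholder; the source does not name a model.
matrix = build_matrix(["LongMemEval-S", "ConvoMem"], "<extraction-model>")
```

Recording the mode and model alongside each result would make it clearer which mem0 numbers are comparable to its published ones.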

Run artifacts under benchmarks/runs/ are gitignored; this is the human-readable record.

🤖 Generated with Claude Code

Records the v1.2 benchmark matrix (LongMemEval-S, ConvoMem; bm-local,
mem0-local, baselines) and PR #994's measured impact.

Headlines:
- BM leads QA accuracy on LongMemEval-S (0.617 vs mem0 0.417) and is a
  close 2nd on ConvoMem (0.792, behind full-context at 0.825; mem0
  0.474). mem0 edges retrieval recall but abstains 2-3x as often,
  suggesting its chunks are less answer-bearing.
- #994 (FTS-revival): retrieval +7.9 recall@5 / +10.0 MRR on the full
  corrected-LoCoMo set (every category up); QA accuracy +3.7 on the
  non-adversarial q300 subset.
- multi_hop stays ~0.08 — bottlenecked by BM returning bullet-level
  chunks that strip document context (next product fix), not by FTS.

Also ignores the local .supermemory/ server data dir.

Run artifacts live under benchmarks/runs/ (gitignored); this is the
human-readable record.

Signed-off-by: Drew Cain <groksrc@gmail.com>
@groksrc groksrc merged commit 259f72c into main Jun 13, 2026
1 check passed
@groksrc groksrc deleted the results/matrix-v1.2 branch June 13, 2026 05:08