Evaluate TeleMem on LoCoMo and LongMemEval and publish reproducible results

TeleMem currently reports results only on ZH-4O (86.33%, deterministic multiple-choice scoring; see the [README](https://github.com/TeleAI-UAGI/telemem#experimental-results)). The international agent-memory discourse runs on **LoCoMo** and **LongMemEval**, but many of the numbers published there are unreliable; see [*The Benchmark Theatre*](https://essays.bloo-mind.ai/posts/2026-05-20-mem-eval/). The goal of this issue is therefore **not a leaderboard number, but a defensible one**, produced under TeleMem's [evaluation charter](https://teleai-uagi.github.io/telemem/evaluation/).

The harness in [`baselines/longmemeval/`](https://github.com/TeleAI-UAGI/telemem/tree/main/baselines/longmemeval) already enforces the charter. What remains is running it and publishing:

**Task checklist**
- [ ] Judge audit first: `--validate-judge` acceptance rates (gold ≥95%, wrong-but-topical ≤5%) for the chosen judge model — published alongside any judged score
- [ ] Baselines: `--system full-context` and `--system grep` under the same answer model/prompt as TeleMem
- [ ] TeleMem: `--system telemem --seeds 5` (10 preferred) → mean ± std, per-type Wilson 95% intervals
- [ ] Cost/latency table: ingestion wall-clock time, search latency, and token usage on both the answer side and the memory side
- [ ] Cross-check: feed hypotheses to LongMemEval's official `evaluate_qa.py`
- [ ] Publish everything (configs, prompts, raw outputs) so third parties can reproduce — including results where TeleMem does *not* win
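
For the `mean ± std` and per-type Wilson 95% figures in the checklist, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names and input shapes (a list of per-seed accuracies, correct/total counts per question type) are assumptions, not the harness's actual API, and the sketch assumes sample standard deviation and `z = 1.96` for the 95% level.

```python
import math
import statistics


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def mean_std(per_seed_accuracy: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across seeds (std is 0.0 for one seed)."""
    mean = statistics.mean(per_seed_accuracy)
    std = statistics.stdev(per_seed_accuracy) if len(per_seed_accuracy) > 1 else 0.0
    return mean, std
```

Each question type gets its own `wilson_interval(correct, total)`, computed from that type's counts, while `mean_std` summarizes overall accuracy across the seeded runs.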

**Reporting rules** (from the charter): no win claims across overlapping Wilson intervals; if TeleMem doesn't beat full-context + grep on accuracy, the published claim must be about cost/latency/scale, stated as such.
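
One possible reading of the reporting rules, sketched in Python: an accuracy win is claimable only if TeleMem's interval lies strictly above every baseline's interval; otherwise the claim falls back to cost/latency/scale. The function names, the claim labels, and the choice to treat touching intervals as overlapping are assumptions made for illustration; the charter remains the authority on the exact rule, and the same check would apply per question type.

```python
Interval = tuple[float, float]  # (lower, upper) bound of a Wilson 95% interval


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def allowed_claim(telemem: Interval, baselines: dict[str, Interval]) -> str:
    """Return which kind of claim the (assumed) reading of the rule permits."""
    for ci in baselines.values():
        # A win needs TeleMem's lower bound strictly above the baseline's
        # upper bound; overlap or being below both rule out an accuracy claim.
        if telemem[0] <= ci[1]:
            return "cost/latency/scale"
    return "accuracy"
```

With made-up intervals, `allowed_claim((0.70, 0.80), {"full-context": (0.55, 0.65), "grep": (0.40, 0.50)})` permits an accuracy claim, whereas a TeleMem interval of `(0.60, 0.72)` against a full-context interval of `(0.65, 0.75)` does not.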

**References**
- LongMemEval: https://github.com/xiaowu0162/LongMemEval
- LoCoMo (and its known ground-truth/judge issues — prefer LoCoMo-Refined if attempted): https://github.com/snap-research/locomo

Partial contributions (single system, single baseline, judge audit only) are very welcome — comment here to coordinate.

