TeleMem currently reports results on ZH-4O (86.33%, deterministic multiple-choice scoring — see the README). The international agent-memory discourse runs on LoCoMo and LongMemEval, but published numbers there are widely unreliable — see The Benchmark Theatre. The goal of this issue is therefore not a leaderboard number, but a defensible one, produced under TeleMem's evaluation charter.
The harness in `baselines/longmemeval/` already enforces the charter. What remains is running it and publishing:

Task checklist

- `--validate-judge` acceptance rates (gold ≥95%, wrong-but-topical ≤5%) for the chosen judge model, published alongside any judged score
- `--system full-context` and `--system grep` under the same answer model/prompt as TeleMem
- `--system telemem --seeds 5` (10 preferred) → mean ± std, per-type Wilson 95% intervals
- `evaluate_qa.py`

Reporting rules (from the charter): no win claims across overlapping Wilson intervals; if TeleMem doesn't beat full-context + grep on accuracy, the published claim must be about cost/latency/scale, stated as such.
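For contributors who want to sanity-check the reporting rule by hand, here is a minimal sketch of the Wilson 95% interval, the overlap test, and the per-seed mean ± std. It is not taken from the harness: the function names (`wilson_interval`, `may_claim_win`, `summarize_seeds`) are hypothetical, and it assumes each question is scored as simply correct or incorrect.

```python
import math
from statistics import mean, stdev


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def may_claim_win(ours: tuple[float, float], theirs: tuple[float, float]) -> bool:
    """A win is claimable only if our interval lies entirely above theirs."""
    return ours[0] > theirs[1]


def summarize_seeds(accuracies: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across seeds (needs at least 2 seeds)."""
    return mean(accuracies), stdev(accuracies)


if __name__ == "__main__":
    # Hypothetical counts for illustration only, not measured results.
    system_a = wilson_interval(correct=60, total=100)
    system_b = wilson_interval(correct=50, total=100)
    print("A:", system_a, "B:", system_b)
    print("A may claim a win over B:", may_claim_win(system_a, system_b))
```

With counts this close the intervals overlap, so under the charter the write-up would have to make its case on cost, latency, or scale rather than on accuracy.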
References
Partial contributions (single system, single baseline, judge audit only) are very welcome — comment here to coordinate.