Evaluate TeleMem on LoCoMo and LongMemEval and publish reproducible results #10

@dell-zhang

TeleMem currently reports results on ZH-4O (86.33%, deterministic multiple-choice scoring; see the README). The international agent-memory discourse runs on LoCoMo and LongMemEval, but published numbers on those benchmarks are widely unreliable (see The Benchmark Theatre). The goal of this issue is therefore not a leaderboard number but a defensible one, produced under TeleMem's evaluation charter.

The harness in baselines/longmemeval/ already enforces the charter. What remains is running it and publishing:

Task checklist

  • Judge audit first: --validate-judge acceptance rates (gold ≥95%, wrong-but-topical ≤5%) for the chosen judge model — published alongside any judged score
  • Baselines: --system full-context and --system grep under the same answer model/prompt as TeleMem
  • TeleMem: --system telemem --seeds 5 (10 preferred) → mean ± std, per-type Wilson 95% intervals
  • Cost/latency table: ingestion wall-clock time, search latency, and token usage on both the answer side and the memory side
  • Cross-check: feed hypotheses to LongMemEval's official evaluate_qa.py
  • Publish everything (configs, prompts, raw outputs) so third parties can reproduce — including results where TeleMem does not win
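The two numeric gates in the checklist (judge acceptance rates and mean ± std across seeds) can be sketched as follows. This is a minimal illustration, not the harness's actual code: the function names, the boolean-verdict input format, and the choice of sample (n-1) standard deviation are all assumptions. Only the 95% and 5% thresholds come from the checklist above.

```python
from statistics import mean, stdev

# Thresholds taken from the checklist; everything else here is hypothetical.
GOLD_MIN_ACCEPT = 0.95      # judge must accept at least 95% of gold answers
TOPICAL_MAX_ACCEPT = 0.05   # judge may accept at most 5% of wrong-but-topical answers


def acceptance_rate(verdicts):
    """Fraction of judge verdicts that are True (answer accepted)."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts) / len(verdicts)


def judge_passes_audit(gold_verdicts, wrong_topical_verdicts):
    """True only if the judge meets both acceptance-rate thresholds."""
    return (
        acceptance_rate(gold_verdicts) >= GOLD_MIN_ACCEPT
        and acceptance_rate(wrong_topical_verdicts) <= TOPICAL_MAX_ACCEPT
    )


def summarize_seeds(accuracies):
    """Mean and sample standard deviation of per-seed accuracies.

    Assumes the sample (n-1) estimator; the charter may specify otherwise.
    """
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds for a standard deviation")
    return mean(accuracies), stdev(accuracies)
```

For example, `summarize_seeds` on five per-seed accuracies yields the `mean ± std` pair to report, and a judge that fails `judge_passes_audit` should not be used for any judged score.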

Reporting rules (from the charter): no win claims across overlapping Wilson intervals; if TeleMem doesn't beat full-context + grep on accuracy, the published claim must be about cost/latency/scale, stated as such.
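The overlap rule can be made mechanical. The sketch below computes a Wilson 95% score interval for a binomial accuracy and permits a win claim only when the two intervals are disjoint. The function names and the `(correct, total)` input format are assumptions for illustration, not part of the harness.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def wilson_interval(correct, total, z=Z_95):
    """Wilson score interval (low, high) for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def intervals_overlap(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def may_claim_win(system_correct, baseline_correct, total):
    """A win claim is allowed only if the system's interval lies strictly
    above the baseline's, i.e. the two intervals do not overlap."""
    system = wilson_interval(system_correct, total)
    baseline = wilson_interval(baseline_correct, total)
    return system[0] > baseline[1]
```

With 100 questions, 52 correct versus 50 correct gives overlapping intervals, so no accuracy claim may be made. In that case the charter directs the claim toward cost, latency, or scale.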

Partial contributions (single system, single baseline, judge audit only) are very welcome — comment here to coordinate.
