perf: warm-session reuse across groups for bm-local (3x grouped speedup) #22

Merged: groksrc merged 1 commit into main from perf/grouped-provider-reuse on Jun 12, 2026

@groksrc groksrc commented Jun 12, 2026

Summary

Grouped runs created a fresh bm-local provider per group — new MCP session, full poll cycle — costing ~2.3 min/group (~19h extrapolated for full LongMemEval-S). This PR introduces opt-in group reuse: one provider instance serves every group.

Changes

  • BenchmarkProvider.supports_group_reuse (default False): when True, the grouped executor runs all groups through one instance, ingesting per group and cleaning up once at the end with the base run config. Non-reuse providers keep exact prior semantics (covered by existing tests, unchanged).
  • bm-local opts in: a warm bm mcp session was empirically verified to serve projects added after it started. Project-per-group namespacing is unchanged; resolved names are cached per run id.
  • Fresh isolated config dir per instance (benchmarks/.bm-homes/, gitignored). This fixes a real bug found during this work: the persistent shared benchmarks/bm-home rotted across BM versions, and a dev build's alembic migrations bricked the brew-installed bm 0.22.0 (Can't locate revision n7i8j9k0l1m2). Also drops BASIC_MEMORY_HOME from the env.
  • Status polling backs off from 0.25s instead of a fixed 2s floor.
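
The reuse behaviour in the first bullet can be sketched roughly as follows. This is an illustration, not the harness code: only `supports_group_reuse` comes from the PR, while `run_grouped`, `provider_factory`, `ingest`, and `cleanup` are hypothetical names, and the per-group cleanup in the non-reuse branch is an assumption about the prior semantics.

```python
class BenchmarkProvider:
    """Hypothetical base class; only supports_group_reuse is named in the PR."""

    supports_group_reuse = False

    def ingest(self, group_id, docs):
        raise NotImplementedError

    def cleanup(self):
        raise NotImplementedError


def run_grouped(provider_factory, groups):
    """Run every group and return (instances_created, failures_by_group).

    groups maps a group id to its list of documents.
    """
    # The capability-probe instance is reused for group 0 in both modes.
    probe = provider_factory()
    created = 1
    failures = {}

    if probe.supports_group_reuse:
        try:
            for group_id, docs in groups.items():
                try:
                    probe.ingest(group_id, docs)
                except Exception as exc:
                    # Failure isolation: one bad group does not abort the run.
                    failures[group_id] = exc
        finally:
            probe.cleanup()  # once, at end of run
        return created, failures

    provider = probe
    for index, (group_id, docs) in enumerate(groups.items()):
        if index > 0:
            provider = provider_factory()  # fresh instance per group
            created += 1
        try:
            provider.ingest(group_id, docs)
        except Exception as exc:
            failures[group_id] = exc
        finally:
            provider.cleanup()  # assumed: cleanup after each group
    return created, failures


class WarmProvider(BenchmarkProvider):
    """Toy provider that opts in to reuse and records what happened."""

    supports_group_reuse = True

    def __init__(self):
        self.ingested = []
        self.cleanups = 0

    def ingest(self, group_id, docs):
        self.ingested.append(group_id)

    def cleanup(self):
        self.cleanups += 1
```

With `WarmProvider`, three groups go through a single instance and `cleanup` runs once; a provider that leaves the flag at `False` gets one instance per group.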

Measured (real BM 0.22.0, LongMemEval dev slice, 3 groups × 54 docs)

| | before | after |
|---|---|---|
| wall clock | ~7 min (~2.3 min/group) | 2:14 (~45s/group) |
| full 500-question extrapolation | ~19 h | ~6.2 h |
| per-query search latency | 1550 ms | 527 ms |
| recall@5 / MRR | 1.0 / 1.0 | 1.0 / 1.0 (identical) |
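
The extrapolated figures can be reproduced with simple arithmetic, assuming one group per question (500 groups) and a constant per-group cost. The helper name `extrapolate_hours` is illustrative only.

```python
def extrapolate_hours(seconds_per_group, groups):
    """Scale a measured per-group cost linearly to a full run, in hours."""
    return seconds_per_group * groups / 3600


# Before: ~2.3 min per group. After: 2:14 (134 s) across 3 groups.
before_h = extrapolate_hours(2.3 * 60, 500)
after_h = extrapolate_hours(134 / 3, 500)

print(round(before_h, 1))  # 19.2
print(round(after_h, 1))   # 6.2
```

The same measurements give a per-group ratio of about 138 s / 44.7 s, roughly 3.1x, which matches the "3x grouped speedup" in the title.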

Remaining per-group cost is embedding compute + reindex CLI startup — basic-memory-side, not harness overhead.

Testing

2 new runner tests (single-instance reuse, per-group namespacing, end-of-run cleanup, failure isolation under reuse); full suite green (91), lint clean.

🤖 Generated with Claude Code

Grouped runs (LongMemEval, ConvoMem) previously created a fresh bm-local
provider per group: new MCP session, full status-poll cycle, and a
shared persistent config dir. Measured ~2.3 min/group on the
LongMemEval dev slice (~19h extrapolated for the full 500 questions).

- BenchmarkProvider.supports_group_reuse (default False): when True the
  grouped executor runs every group through ONE provider instance
  (ingest per group, cleanup once at end of run with the base run
  config). Non-reuse providers keep exact prior semantics; the shared
  capability-probe instance serves group 0 so no extra instances are
  created either way.
- bm-local opts in: empirically verified that a warm bm mcp session
  serves projects added after it started, so one session covers all
  groups (project-per-group namespacing unchanged). Resolved project
  names are now cached per run id rather than once per instance.
- Fresh isolated config dir per provider instance under
  benchmarks/.bm-homes/ (gitignored). The previous persistent shared
  benchmarks/bm-home rotted across basic-memory versions — a dev
  build's alembic migrations bricked the brew-installed binary
  ('Can't locate revision') — and leaked projects between runs.
  BASIC_MEMORY_HOME is also dropped from the env for the same reason.
- Status polling backs off from 0.25s instead of a fixed 2s floor.
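
A minimal sketch of the polling change, assuming a doubling backoff capped at 2s. The commit only states that polling starts at 0.25s instead of a fixed 2s floor, so the factor, ceiling, timeout, and the name `poll_until_ready` are assumptions.

```python
import time


def poll_until_ready(check, initial=0.25, factor=2.0, ceiling=2.0,
                     timeout=60.0, sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or the timeout elapses.

    The delay starts at `initial` and grows by `factor` up to `ceiling`,
    so fast operations are noticed quickly while slow ones are polled at
    the same rate as a fixed 2s interval.
    """
    delay = initial
    deadline = clock() + timeout
    while True:
        if check():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        sleep(min(delay, remaining))  # never sleep past the deadline
        delay = min(delay * factor, ceiling)
```

The `sleep` and `clock` parameters exist so the loop can be tested without waiting in real time.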

Measured: LongMemEval dev slice (3 groups, 54 docs each, real BM
0.22.0 via --bm-local-path) 2:14 total (~45s/group) vs ~2.3 min/group
before; identical retrieval results (recall@5 1.0, MRR 1.0). Remaining
per-group cost is embedding compute + reindex CLI start, not harness
overhead. 2 new runner tests cover single-instance reuse, per-group
namespacing, end-of-run cleanup, and failure isolation.
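
The per-instance config dir described in the change list could be set up along these lines. This is a sketch under stated assumptions: `make_isolated_env` is a hypothetical helper, and `config_var` is a placeholder for whichever environment variable points the CLI at its config dir, which the commit does not name. Only `benchmarks/.bm-homes/` and the removal of `BASIC_MEMORY_HOME` come from the text above.

```python
import tempfile
from pathlib import Path


def make_isolated_env(base_env, homes_root, config_var):
    """Return (env, config_dir) for one provider instance.

    A fresh directory is created under homes_root for each call, so no
    state (migrations, projects) carries over between instances.
    """
    root = Path(homes_root)
    root.mkdir(parents=True, exist_ok=True)
    config_dir = Path(tempfile.mkdtemp(prefix="bm-home-", dir=root))

    env = dict(base_env)                 # do not mutate the caller's env
    env.pop("BASIC_MEMORY_HOME", None)   # dropped, as the commit describes
    env[config_var] = str(config_dir)    # placeholder variable name
    return env, config_dir
```

A caller would pass something like `os.environ` and `"benchmarks/.bm-homes"`, then hand the returned `env` to the subprocess that starts the session.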

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
@groksrc groksrc merged commit 9e868c0 into main Jun 12, 2026
1 check passed