perf: warm-session reuse across groups for bm-local (3x grouped speedup) #22
Merged
Grouped runs (LongMemEval, ConvoMem) previously created a fresh bm-local
provider per group: new MCP session, full status-poll cycle, and a
shared persistent config dir. Measured ~2.3 min/group on the
LongMemEval dev slice (~19h extrapolated for the full 500 questions).
- BenchmarkProvider.supports_group_reuse (default False): when True the
grouped executor runs every group through ONE provider instance
(ingest per group, cleanup once at end of run with the base run
config). Non-reuse providers keep exact prior semantics; the shared
capability-probe instance serves group 0 so no extra instances are
created either way.
- bm-local opts in: empirically verified that a warm bm mcp session
serves projects added after it started, so one session covers all
groups (project-per-group namespacing unchanged). Resolved project
names are now cached per run id rather than once per instance.
- Fresh isolated config dir per provider instance under
benchmarks/.bm-homes/ (gitignored). The previous persistent shared
benchmarks/bm-home rotted across basic-memory versions — a dev
build's alembic migrations bricked the brew-installed binary
('Can't locate revision') — and leaked projects between runs.
BASIC_MEMORY_HOME is also dropped from the env for the same reason.
- Status polling backs off from 0.25s instead of a fixed 2s floor.
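
The polling change in the last bullet can be sketched as below. Only the 0.25s starting delay and the old 2s figure come from the commit message. The doubling factor, the reuse of 2s as a cap, the 60s timeout, and the names backoff_delays and wait_until_ready are assumptions made for illustration, not the harness's actual code.

```python
import time


def backoff_delays(initial=0.25, factor=2.0, cap=2.0):
    """Yield poll delays: start at `initial`, grow by `factor`, stop growing at `cap`.

    Assumed schedule: only the 0.25s start is stated in the commit message.
    """
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)


def wait_until_ready(is_ready, timeout=60.0, sleep=time.sleep, clock=time.monotonic):
    """Poll `is_ready` with backoff; return True once ready, False on timeout."""
    deadline = clock() + timeout
    for delay in backoff_delays():
        if is_ready():
            return True
        if clock() + delay > deadline:
            return False  # the next sleep would overshoot the deadline
        sleep(delay)
```

With a fixed 2s floor, a status that flips after 300ms still costs a full 2s per wait. Starting at 0.25s lets short waits return quickly while long waits still settle at a low poll rate.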
Measured: LongMemEval dev slice (3 groups, 54 docs each, real BM
0.22.0 via --bm-local-path) 2:14 total (~45s/group) vs ~2.3 min/group
before; identical retrieval results (recall@5 1.0, MRR 1.0). Remaining
per-group cost is embedding compute + reindex CLI start, not harness
overhead. 2 new runner tests cover single-instance reuse, per-group
namespacing, end-of-run cleanup, and failure isolation.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
Summary
Grouped runs created a fresh bm-local provider per group — new MCP session, full poll cycle — costing ~2.3 min/group (~19h extrapolated for full LongMemEval-S). This PR introduces opt-in group reuse: one provider instance serves every group.
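
A minimal sketch of the opt-in dispatch, to make the two code paths concrete. Only the supports_group_reuse flag and its False default come from this PR. The ingest and cleanup signatures, RecordingProvider, and run_groups are hypothetical stand-ins; the real cleanup takes the base run config, and the real executor also reuses the capability-probe instance for group 0, both omitted here.

```python
class BenchmarkProvider:
    """Hypothetical stand-in for the harness base class."""

    supports_group_reuse = False  # default per the PR: reuse is opt-in

    def ingest(self, group_id, docs):
        raise NotImplementedError

    def cleanup(self):
        raise NotImplementedError


class RecordingProvider(BenchmarkProvider):
    """Toy provider that opts in and records what the executor did to it."""

    supports_group_reuse = True
    instances = 0

    def __init__(self):
        type(self).instances += 1
        self.calls = []

    def ingest(self, group_id, docs):
        self.calls.append(("ingest", group_id, len(docs)))

    def cleanup(self):
        self.calls.append(("cleanup",))


def run_groups(provider_cls, groups):
    """Run every group; use ONE instance when the provider opts in."""
    if provider_cls.supports_group_reuse:
        provider = provider_cls()
        try:
            for group_id, docs in groups.items():
                provider.ingest(group_id, docs)  # ingest per group
        finally:
            provider.cleanup()  # once, at end of run
        return [provider]

    # Prior semantics: fresh instance and cleanup for each group.
    providers = []
    for group_id, docs in groups.items():
        provider = provider_cls()
        try:
            provider.ingest(group_id, docs)
        finally:
            provider.cleanup()
        providers.append(provider)
    return providers
```

Keeping the flag on the provider class means the executor needs no provider-specific knowledge: providers that cannot safely share state simply leave the default in place.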
Changes
- BenchmarkProvider.supports_group_reuse (default False): grouped executor runs all groups through one instance (ingest per group, cleanup once at end with the base run config). Non-reuse providers keep exact prior semantics (covered by existing tests, unchanged).
- bm-local opts in: a warm bm mcp session serves projects added after it started. Project-per-group namespacing unchanged; resolved names cached per run id.
- Fresh isolated config dir per provider instance (benchmarks/.bm-homes/, gitignored). Fixes a real bug found during this work: the persistent shared benchmarks/bm-home rotted across BM versions; a dev build's alembic migrations bricked the brew bm 0.22.0 (Can't locate revision n7i8j9k0l1m2). Also drops BASIC_MEMORY_HOME from the env.
Measured (real BM 0.22.0, LongMemEval dev slice, 3 groups × 54 docs)
Remaining per-group cost is embedding compute + reindex CLI startup — basic-memory-side, not harness overhead.
Testing
2 new runner tests (single-instance reuse, per-group namespacing, end-of-run cleanup, failure isolation under reuse); full suite green (91), lint clean.
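
As an illustration of what the per-group namespacing check guards, here is a small sketch of a name cache keyed by run id. It assumes each group runs under its own run id. ProjectNameCache, the resolve callback, and the "bench-" naming are hypothetical and not taken from the harness.

```python
class ProjectNameCache:
    """Cache resolved project names per run id (hypothetical helper).

    Caching once per instance would hand every group the first group's
    project once a single provider instance serves all groups.
    """

    def __init__(self, resolve):
        self._resolve = resolve   # callable: run_id -> project name
        self._by_run_id = {}

    def get(self, run_id):
        if run_id not in self._by_run_id:
            self._by_run_id[run_id] = self._resolve(run_id)
        return self._by_run_id[run_id]


def test_names_are_namespaced_per_run_id():
    resolved = []

    def resolve(run_id):
        resolved.append(run_id)
        return f"bench-{run_id}"

    cache = ProjectNameCache(resolve)
    assert cache.get("run-g0") == "bench-run-g0"
    assert cache.get("run-g1") == "bench-run-g1"   # second group gets its own project
    assert cache.get("run-g0") == "bench-run-g0"   # served from cache
    assert resolved == ["run-g0", "run-g1"]        # each run id resolved once
```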
🤖 Generated with Claude Code