Feature/cache benchmark ts harness#226
…ache
Mirrors the Python benchmark harness with TS-native tooling:
- BetterDB adapter wrapping @betterdb/semantic-cache (bare/local/full/autotune modes)
- Upstash adapter wrapping @upstash/semantic-cache for competitive comparison
- HuggingFace dataset loaders (STSb, SICK, PAWS-Wiki, vCache LM Arena) with local JSONL caching
- Local embedding via @huggingface/transformers (bge-small-en-v1.5, all-MiniLM-L6-v2)
- F1/precision/recall/FPR metrics with latency percentiles
- snake_case JSON output compatible with Python harness report tools
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
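For readers unfamiliar with the metrics named in this commit, here is a minimal sketch of how F1/precision/recall/FPR and latency percentiles could be derived from replay outcomes. The `Outcome` and `Metrics` shapes, the snake_case field names, and the nearest-rank percentile method are assumptions for illustration; the harness's actual types are not shown in this PR.

```typescript
// Hypothetical shapes: the real harness types are not visible in this PR.
interface Outcome {
  expectedHit: boolean; // ground truth: the pair is labeled as equivalent
  actualHit: boolean; // the cache reported a hit at the current threshold
  latencyMs: number;
}

interface Metrics {
  precision: number;
  recall: number;
  f1: number;
  fpr: number;
  latency_p50_ms: number;
  latency_p95_ms: number;
}

// Nearest-rank percentile; returns 0 for an empty sample.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index] ?? 0;
}

function computeMetrics(outcomes: Outcome[]): Metrics {
  let tp = 0;
  let fp = 0;
  let tn = 0;
  let fn = 0;
  for (const o of outcomes) {
    if (o.actualHit && o.expectedHit) tp++;
    else if (o.actualHit && !o.expectedHit) fp++;
    else if (!o.actualHit && o.expectedHit) fn++;
    else tn++;
  }
  // Guard against empty denominators so an all-miss run yields 0, not NaN.
  const ratio = (num: number, den: number): number => (den === 0 ? 0 : num / den);
  const precision = ratio(tp, tp + fp);
  const recall = ratio(tp, tp + fn);
  const latencies = outcomes.map((o) => o.latencyMs);
  return {
    precision,
    recall,
    f1: ratio(2 * precision * recall, precision + recall),
    fpr: ratio(fp, fp + tn),
    latency_p50_ms: percentile(latencies, 50),
    latency_p95_ms: percentile(latencies, 95),
  };
}
```

Returning 0 for empty denominators is a design choice in this sketch; a harness may prefer `null` so that reports can distinguish "no positives" from "zero precision".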
…ws than limit
The cache key was (dataset, config, split) with no limit component. A run with limit=500 would cache 500 rows, then a subsequent run with limit=5000 would get a cache hit and silently return only 500 rows. Fix: if the cached file has fewer rows than the requested limit, treat it as stale and re-download.
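The staleness rule described in this commit can be sketched as follows. `CacheIO` and `loadWithLimit` are hypothetical names, and the I/O is synchronous and injected only to keep the example self-contained; the real loader presumably reads JSONL files and downloads asynchronously.

```typescript
type Row = Record<string, unknown>;

// Hypothetical I/O boundary; not the harness's actual interface.
interface CacheIO {
  read(): Row[] | null; // null when no cache file exists
  write(rows: Row[]): void;
  download(limit: number): Row[];
}

function loadWithLimit(io: CacheIO, limit: number): Row[] {
  const cached = io.read();
  // A cache with fewer rows than requested is treated as stale, so a
  // limit=5000 run is never silently served by an earlier limit=500 run.
  if (cached !== null && cached.length >= limit) {
    return cached.slice(0, limit);
  }
  const fresh = io.download(limit);
  io.write(fresh);
  return fresh;
}
```

One consequence of this rule as sketched: a split whose total size is below the requested limit would be re-downloaded on every run, since its cache can never reach the limit.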
…he_lmarena
String(undefined) produces the literal string "undefined", which is truthy and therefore slips past the !prompt check. Check for null/undefined before the String() conversion so rows with missing prompt fields are skipped.
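A minimal sketch of the fixed check, assuming a hypothetical `readPrompt` helper and the `prompt`/`Prompt` field casing mentioned in the PR overview. The actual loader code is not shown here.

```typescript
// Hypothetical helper: returns the prompt text, or null when the row should be skipped.
function readPrompt(row: Record<string, unknown>): string | null {
  const raw = row["prompt"] ?? row["Prompt"];
  // Test for null/undefined BEFORE converting: String(undefined) is the
  // truthy string "undefined", which a later !prompt check would not catch.
  if (raw === null || raw === undefined) return null;
  const prompt = String(raw).trim();
  return prompt.length > 0 ? prompt : null;
}
```

Rows for which the helper returns `null` can then be skipped by the caller.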
…mark-ts-harness # Conflicts: # pnpm-lock.yaml
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 47b3982.
  summary[summaryKey] = metrics;
} finally {
  await adapter.close();
}
Embedding model reloaded for every threshold iteration
Medium Severity
Each threshold iteration builds a new BetterDBAdapter via buildAdapter, which calls initialize() → buildEmbedFn() → pipeline('feature-extraction', ...). With the default 9 thresholds, the local embedding model (e.g., all-MiniLM-L6-v2) is loaded from disk into memory 9 separate times. Model loading typically takes several seconds per invocation, adding 30–90 seconds of unnecessary overhead to a benchmark run. The model and Valkey connection could be shared across threshold iterations since they don't change.
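One possible way to address this is to memoize the loader per model name, so every threshold iteration awaits the same in-flight or completed load. The sketch below is an assumption about how that could look: `memoizeByModel` and `EmbedFn` are hypothetical names, and the real `buildEmbedFn` signature is not shown in this PR.

```typescript
type EmbedFn = (text: string) => Promise<number[]>;

// Wraps an expensive loader so that each model name is loaded at most once.
// The promise itself is cached, so concurrent callers share one load.
// Note: a rejected load stays cached in this sketch; evict on failure if
// retries are needed.
function memoizeByModel(
  load: (model: string) => Promise<EmbedFn>,
): (model: string) => Promise<EmbedFn> {
  const cache = new Map<string, Promise<EmbedFn>>();
  return (model: string): Promise<EmbedFn> => {
    let pending = cache.get(model);
    if (pending === undefined) {
      pending = load(model);
      cache.set(model, pending);
    }
    return pending;
  };
}
```

In the harness, the wrapped function would replace direct calls to the loader inside each adapter's initialization, with the memoized instance created once outside the threshold loop.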
    adapters.set(adapter, f1s);
  }
  f1s.push(entry.metrics.f1);
}
Report adapter grouping splits key incorrectly for compound names
Low Severity
The report groups entries by adapter name using entry.key.split('_')[0]. The summary key format is ${adapter.name}_${threshold.toFixed(2)}. If an adapter name ever contains an underscore (e.g., a future redis_vl adapter), the split would incorrectly extract only the prefix before the first underscore, breaking the F1 trend grouping and best-F1 lookup logic.
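Because the threshold is always the final segment of the key, splitting on the last underscore instead of the first would tolerate underscores in adapter names. A minimal sketch, assuming a hypothetical `parseSummaryKey` helper and the `${adapter.name}_${threshold.toFixed(2)}` key format quoted above:

```typescript
// Hypothetical helper: splits "<adapter>_<threshold>" on the LAST underscore.
function parseSummaryKey(key: string): { adapter: string; threshold: number } {
  const cut = key.lastIndexOf("_");
  if (cut <= 0 || cut === key.length - 1) {
    throw new Error(`Malformed summary key: ${key}`);
  }
  const threshold = Number(key.slice(cut + 1));
  if (Number.isNaN(threshold)) {
    throw new Error(`Non-numeric threshold in summary key: ${key}`);
  }
  return { adapter: key.slice(0, cut), threshold };
}
```

A sturdier alternative would be to store the adapter name and threshold as separate fields on each summary entry, so the report never has to parse the key at all.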


Summary
Adds a TypeScript semantic cache benchmark harness (packages/cache-benchmark-ts) that benchmarks @betterdb/semantic-cache against @upstash/semantic-cache across four public datasets, six thresholds (0.10–0.50), and five BetterDB modes (bare, local, full, autotune, autotune-full). Includes dataset loaders that fetch from HuggingFace with local JSONL caching, an Upstash adapter with vector ID hashing for long prompts, and per-competitor comparison files.
Changes
Benchmark results (peak F1, bge-small-en-v1.5):
┌─────────────────────┬────────────────┬───────────────────────────────┬────────┐
│ Dataset │ Upstash best │ BetterDB best │ Delta │
├─────────────────────┼────────────────┼───────────────────────────────┼────────┤
│ STSb (5K) │ 75.9% (θ=0.10) │ 76.3% (θ=0.20, bare/autotune) │ +0.4pp │
├─────────────────────┼────────────────┼───────────────────────────────┼────────┤
│ SICK (9.9K) │ 77.6% (θ=0.30) │ 77.7% (θ=0.50, bare/autotune) │ +0.1pp │
├─────────────────────┼────────────────┼───────────────────────────────┼────────┤
│ PAWS-Wiki (8K) │ 61.3% │ 61.3% │ tie │
├─────────────────────┼────────────────┼───────────────────────────────┼────────┤
│ vcache_lmarena (5K) │ 70.1% (θ=0.10) │ 71.4% (θ=0.20, autotune) │ +1.3pp │
└─────────────────────┴────────────────┴───────────────────────────────┴────────┘
Latency: BetterDB 0.7ms p50 vs Upstash 90ms p50 (100-150x, local Valkey vs cloud REST)
Note: Same model name but different embedding runtimes (ONNX vs server-side) produce different similarity score distributions — Upstash [0, 0.26] vs BetterDB [0, 0.50]. Thresholds are not directly comparable; peak F1 at each adapter's optimal threshold is the fair comparison.
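The peak-F1 comparison described in the note can be sketched as below: pick each adapter's best entry across its own threshold sweep, then compare those peaks in percentage points. `SweepEntry`, `peakByAdapter`, and `deltaPp` are hypothetical names, not the report code from this PR.

```typescript
// Hypothetical record for one (adapter, threshold) run.
interface SweepEntry {
  adapter: string;
  threshold: number;
  f1: number; // in [0, 1]
}

// Keeps the highest-F1 entry per adapter; on ties the first one seen wins.
function peakByAdapter(entries: SweepEntry[]): Map<string, SweepEntry> {
  const best = new Map<string, SweepEntry>();
  for (const entry of entries) {
    const current = best.get(entry.adapter);
    if (current === undefined || entry.f1 > current.f1) {
      best.set(entry.adapter, entry);
    }
  }
  return best;
}

// Difference between two peak F1 scores, in percentage points.
function deltaPp(a: SweepEntry, b: SweepEntry): number {
  return (a.f1 - b.f1) * 100;
}
```

Comparing peaks this way avoids lining up the two adapters at the same nominal threshold, which the note above argues is not meaningful when the score distributions differ.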
Checklist
roborev review --branch or /roborev-review-branch in Claude Code (internal)
Note
Low Risk
New private benchmarking package and lockfile churn only; no changes to production API paths unless operators run autotune/full modes against live Monitor/OpenAI.
Overview
Adds packages/cache-benchmark-ts, a new TypeScript replay harness aligned with the existing Python cache-benchmark, to compare @betterdb/semantic-cache and @upstash/semantic-cache on STSb, SICK, PAWS-Wiki, and vcache_lmarena.
The CLI sweeps cosine-distance thresholds and writes per-run JSON (snake_case for Python tooling), summaries, and optional markdown reports. BetterDB modes cover bare threshold, keyword rerank, LLM judge, and Monitor-driven autotune (poll + propose + auto-approve). Upstash maps distance to minProximity, upserts with hashed IDs for long prompts, and queries the Vector index directly so similarity scores are recorded.
Dataset loading uses the HuggingFace rows API with JSONL cache, row limits on download (avoids pulling huge splits), low concurrency, and retries on 429/5xx. The vcache loader tolerates prompt/Prompt field casing and builds balanced positive/negative pairs from equivalence classes.
pnpm-lock.yaml picks up the new workspace package and related deps (@huggingface/transformers, Upstash, tsx, etc.).
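The "retries on 429/5xx" behaviour mentioned in the overview could look roughly like the sketch below. Everything here is an assumption for illustration: the helper names, the exponential backoff schedule, the 500 ms base, the 8 s cap, and the attempt count are not taken from the PR. The request is injected as a function so the example does not depend on any particular HTTP client.

```typescript
// Minimal response shape; any HTTP client's response with a status code fits.
interface HttpResponse {
  status: number;
}

// Rate limiting (429) and server errors (5xx) are worth retrying;
// other 4xx responses are not, since repeating them cannot succeed.
function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Exponential backoff with a cap: 500, 1000, 2000, ... up to capMs.
function backoffMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

async function withRetry<T extends HttpResponse>(
  request: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  if (maxAttempts < 1) throw new Error("maxAttempts must be at least 1");
  let last: T | undefined;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await request();
    if (!isRetryable(response.status)) return response;
    last = response;
    // Do not sleep after the final attempt.
    if (attempt < maxAttempts - 1) await sleep(backoffMs(attempt));
  }
  // All attempts were retryable failures; hand the last one to the caller.
  return last as T;
}
```

Combined with the low concurrency noted above, a capped backoff like this keeps a large download from hammering the rows API after a rate-limit response.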