Feature/cache benchmark ts harness by KIvanow · Pull Request #226 · BetterDB-inc/monitor

KIvanow · 2026-05-29T14:03:44Z

Summary

Adds a TypeScript semantic cache benchmark harness (packages/cache-benchmark-ts) that benchmarks @betterdb/semantic-cache against @upstash/semantic-cache across four public datasets, six thresholds (0.10–0.50), and five BetterDB modes (bare, local, full, autotune, autotune-full). Includes dataset loaders fetched from HuggingFace with local JSONL caching, an Upstash adapter with vector ID hashing for long prompts, and per-competitor comparison files.

Changes

- Add packages/cache-benchmark-ts: TypeScript benchmark harness mirroring the Python harness architecture
- Add BetterDB adapter with all 5 modes (bare, local, full, autotune, autotune-full) using @betterdb/semantic-cache workspace package
- Add Upstash adapter using @upstash/semantic-cache with direct Index API for store (hashes long prompt IDs to work around Upstash's 1000-char vector ID limit) and query (captures similarity scores that SemanticCache.get() discards)
- Add HuggingFace dataset loader with local JSONL caching, configurable concurrency, and retry with exponential backoff for rate-limited APIs
- Add dataset loaders for STSb, SICK, PAWS-Wiki, and vcache_lmarena with match-threshold support for continuous-score datasets
- Fix vcache_lmarena loader to handle both prompt and Prompt field casing from different HF API versions
- Fix dataset downloader to respect limit param (was downloading entire dataset even when only 5K rows needed — caused 3GB cache files and HF rate limiting)
- Reduce HF API concurrency from 10 to 1 and increase retry delay to handle rate limiting on large datasets
- Generate per-competitor comparison files: results/competitor_redisvl.txt (3K) and results/competitor_upstash.txt (6K) with peak F1 analysis, latency comparison, and key findings
- Run full benchmark matrix: 4 datasets × 6 thresholds × 6 modes = 108 runs (Upstash vcache_lmarena required ID hash fix mid-run)
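The ID hashing mentioned in the Upstash adapter bullet can be sketched as follows. This is a minimal illustration, not the adapter's actual code: the helper name `toVectorId` and the choice of SHA-256 are assumptions, and the 1000-character limit is taken from the description above.

```typescript
import { createHash } from "node:crypto";

// Limit quoted in the PR description; treated here as an assumption.
const MAX_VECTOR_ID_LENGTH = 1000;

// Hypothetical helper: prompts that fit are used as the vector ID verbatim,
// longer prompts are replaced by a fixed-length SHA-256 hex digest (64 chars).
function toVectorId(prompt: string, maxLength: number = MAX_VECTOR_ID_LENGTH): string {
  if (prompt.length <= maxLength) {
    return prompt;
  }
  return createHash("sha256").update(prompt, "utf8").digest("hex");
}
```

Hashing only the over-long prompts keeps short IDs human-readable, at the cost of two ID shapes in the index. Hashing every prompt would be more uniform if readability does not matter.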

Benchmark results (peak F1, bge-small-en-v1.5):

┌─────────────────────┬────────────────┬───────────────────────────────┬────────┐
│ Dataset │ Upstash best │ BetterDB best │ Delta │
├─────────────────────┼────────────────┼───────────────────────────────┼────────┤
│ STSb (5K) │ 75.9% (θ=0.10) │ 76.3% (θ=0.20, bare/autotune) │ +0.4pp │
├─────────────────────┼────────────────┼───────────────────────────────┼────────┤
│ SICK (9.9K) │ 77.6% (θ=0.30) │ 77.7% (θ=0.50, bare/autotune) │ +0.1pp │
├─────────────────────┼────────────────┼───────────────────────────────┼────────┤
│ PAWS-Wiki (8K) │ 61.3% │ 61.3% │ tie │
├─────────────────────┼────────────────┼───────────────────────────────┼────────┤
│ vcache_lmarena (5K) │ 70.1% (θ=0.10) │ 71.4% (θ=0.20, autotune) │ +1.3pp │
└─────────────────────┴────────────────┴───────────────────────────────┴────────┘

Latency: BetterDB 0.7ms p50 vs Upstash 90ms p50 (100-150x, local Valkey vs cloud REST)

Note: Same model name but different embedding runtimes (ONNX vs server-side) produce different similarity score distributions — Upstash [0, 0.26] vs BetterDB [0, 0.50]. Thresholds are not directly comparable; peak F1 at each adapter's optimal threshold is the fair comparison.
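Since thresholds are not comparable across adapters, the comparison reduces to taking each adapter's best F1 over its own sweep. A minimal sketch of that selection, assuming a hypothetical `RunResult` shape with raw confusion counts (the harness's real output schema is not shown in this PR):

```typescript
// Hypothetical per-run record; field names are illustrative only.
interface RunResult {
  adapter: string;
  threshold: number;
  tp: number; // true positives (correct cache hits)
  fp: number; // false positives (wrong cache hits)
  fn: number; // false negatives (missed cache hits)
}

// F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no positives at all.
function f1Score(tp: number, fp: number, fn: number): number {
  const denominator = 2 * tp + fp + fn;
  return denominator === 0 ? 0 : (2 * tp) / denominator;
}

// For each adapter, keep the threshold with the highest F1 across its sweep.
function peakF1ByAdapter(runs: RunResult[]): Map<string, { threshold: number; f1: number }> {
  const best = new Map<string, { threshold: number; f1: number }>();
  for (const run of runs) {
    const f1 = f1Score(run.tp, run.fp, run.fn);
    const current = best.get(run.adapter);
    if (current === undefined || f1 > current.f1) {
      best.set(run.adapter, { threshold: run.threshold, f1 });
    }
  }
  return best;
}
```

The resulting map holds one `(threshold, f1)` pair per adapter, which is the quantity the table above reports.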

Checklist

Unit / integration tests added
Docs added / updated
Roborev review passed — run roborev review --branch or /roborev-review-branch in Claude Code (internal)
Competitive analysis done / discussed (internal)
Blog post about it discussed (internal)

Note

Low Risk
New private benchmarking package and lockfile churn only; no changes to production API paths unless operators run autotune/full modes against live Monitor/OpenAI.

Overview
Adds packages/cache-benchmark-ts, a new TypeScript replay harness aligned with the existing Python cache-benchmark, to compare @betterdb/semantic-cache and @upstash/semantic-cache on STSb, SICK, PAWS-Wiki, and vcache_lmarena.

The CLI sweeps cosine-distance thresholds and writes per-run JSON (snake_case for Python tooling), summaries, and optional markdown reports. BetterDB modes cover bare threshold, keyword rerank, LLM judge, and Monitor-driven autotune (poll + propose + auto-approve). Upstash maps distance to minProximity, upserts with hashed IDs for long prompts, and queries the Vector index directly so similarity scores are recorded.

Dataset loading uses the HuggingFace rows API with JSONL cache, row limits on download (avoids pulling huge splits), low concurrency, and retries on 429/5xx. The vcache loader tolerates prompt / Prompt field casing and builds balanced positive/negative pairs from equivalence classes.
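The retry behaviour described above (retry on 429/5xx with exponential backoff) could look roughly like this. It is a sketch under assumptions: `withRetry`, `backoffDelayMs`, the `HttpResult` shape, and the default delays are all hypothetical and not taken from the loader's source.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Rate limiting (429) and server errors (5xx) are worth retrying; other statuses are not.
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Hypothetical minimal response shape so the sketch stays transport-agnostic.
interface HttpResult<T> {
  status: number;
  body?: T;
}

// Re-issues `request` until it returns a non-retryable status or retries run out.
// `sleep` is injectable so the schedule can be tested without real waiting.
async function withRetry<T>(
  request: () => Promise<HttpResult<T>>,
  maxRetries = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<HttpResult<T>> {
  for (let attempt = 0; ; attempt++) {
    const result = await request();
    if (!isRetryableStatus(result.status) || attempt >= maxRetries) {
      return result;
    }
    await sleep(backoffDelayMs(attempt));
  }
}
```

With concurrency set to 1, a single backoff schedule like this is usually enough. Adding random jitter becomes more useful when several workers retry at once.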

pnpm-lock.yaml picks up the new workspace package and related deps (@huggingface/transformers, Upstash, tsx, etc.).

Reviewed by Cursor Bugbot for commit 47b3982. Bugbot is set up for automated code reviews on this repo.

…ache

Mirrors the Python benchmark harness with TS-native tooling:
- BetterDB adapter wrapping @betterdb/semantic-cache (bare/local/full/autotune modes)
- Upstash adapter wrapping @upstash/semantic-cache for competitive comparison
- HuggingFace dataset loaders (STSb, SICK, PAWS-Wiki, vCache LM Arena) with local JSONL caching
- Local embedding via @huggingface/transformers (bge-small-en-v1.5, all-MiniLM-L6-v2)
- F1/precision/recall/FPR metrics with latency percentiles
- snake_case JSON output compatible with Python harness report tools

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ws than limit

The cache key was (dataset, config, split) with no limit component. A run with limit=500 would cache 500 rows, then a subsequent run with limit=5000 would get a cache hit and silently return only 500 rows. Fix: if the cached file has fewer rows than the requested limit, treat it as stale and re-download.
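The staleness rule in this commit message amounts to comparing the cached row count with the requested limit. A minimal sketch, assuming the cache is a JSONL string with one row per non-empty line; `countJsonlRows` and `isCacheStale` are hypothetical names, not the loader's actual functions:

```typescript
// Counts rows in a JSONL payload, ignoring blank lines such as a trailing newline.
function countJsonlRows(jsonl: string): number {
  return jsonl.split("\n").filter((line) => line.trim().length > 0).length;
}

// A cache written by a smaller-limit run cannot serve a larger-limit request,
// so it is treated as stale and should be re-downloaded.
function isCacheStale(cachedJsonl: string, requestedLimit: number): boolean {
  return countJsonlRows(cachedJsonl) < requestedLimit;
}
```

One caveat of this rule: if a split genuinely contains fewer rows than the limit, the cache is considered stale on every run and is re-downloaded each time.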

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

cursor · 2026-05-29T15:12:12Z

+      summary[summaryKey] = metrics;
+    } finally {
+      await adapter.close();
+    }


Embedding model reloaded for every threshold iteration

Medium Severity

Each threshold iteration builds a new BetterDBAdapter via buildAdapter, which calls initialize() → buildEmbedFn() → pipeline('feature-extraction', ...). With the default 9 thresholds, the local embedding model (e.g., all-MiniLM-L6-v2) is loaded from disk into memory 9 separate times. Model loading typically takes several seconds per invocation, adding 30–90 seconds of unnecessary overhead to a benchmark run. The model and Valkey connection could be shared across threshold iterations since they don't change.

Additional Locations (1)

packages/cache-benchmark-ts/src/adapters/betterdb.ts#L61-L89
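One way to address this finding is to memoize the model load so every threshold iteration reuses the same embedder. The sketch below is not the harness's code: `createSharedLoader` and `EmbedFn` are hypothetical, and the real `pipeline(...)` call is replaced by an injected `load` function so the example stays self-contained.

```typescript
// Hypothetical embedder signature: text in, vector out.
type EmbedFn = (text: string) => Promise<number[]>;

// Wraps an expensive loader so each model name is loaded at most once.
// The in-flight promise is cached, so concurrent callers share one load.
function createSharedLoader(
  load: (modelName: string) => Promise<EmbedFn>,
): (modelName: string) => Promise<EmbedFn> {
  const cache = new Map<string, Promise<EmbedFn>>();
  return (modelName: string): Promise<EmbedFn> => {
    let pending = cache.get(modelName);
    if (pending === undefined) {
      pending = load(modelName);
      cache.set(modelName, pending);
      // Drop failed loads so a later call can try again.
      pending.catch(() => {
        cache.delete(modelName);
      });
    }
    return pending;
  };
}
```

An adapter built per threshold could then call the shared loader in `initialize()` instead of constructing a new pipeline, leaving the per-threshold state (the threshold itself) as the only thing that changes between iterations.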

cursor · 2026-05-29T15:12:12Z

+      adapters.set(adapter, f1s);
+    }
+    f1s.push(entry.metrics.f1);
+  }


Report adapter grouping splits key incorrectly for compound names

Low Severity

The report groups entries by adapter name using entry.key.split('_')[0]. The summary key format is ${adapter.name}_${threshold.toFixed(2)}. If an adapter name ever contains an underscore (e.g., a future redis_vl adapter), the split would incorrectly extract only the prefix before the first underscore, breaking the F1 trend grouping and best-F1 lookup logic.

Additional Locations (1)

packages/cache-benchmark-ts/src/cli.ts#L103-L104
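A possible fix is to split on the last underscore instead of the first, since the threshold suffix produced by `toFixed(2)` never contains one. The sketch below uses the key format quoted in the finding; the helper names `makeSummaryKey` and `adapterFromKey` are hypothetical.

```typescript
// Builds the summary key in the format quoted above: `${adapter.name}_${threshold.toFixed(2)}`.
function makeSummaryKey(adapterName: string, threshold: number): string {
  return `${adapterName}_${threshold.toFixed(2)}`;
}

// Recovers the adapter name by cutting at the LAST underscore, so names that
// themselves contain underscores (e.g. "redis_vl") survive intact.
function adapterFromKey(key: string): string {
  const separatorIndex = key.lastIndexOf("_");
  return separatorIndex === -1 ? key : key.slice(0, separatorIndex);
}
```

Storing the adapter name as its own field on each summary entry would avoid parsing the key altogether, which may be the more robust option if the key format changes.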

KIvanow and others added 2 commits May 28, 2026 19:48

running issues fixed

e060354

cursor Bot reviewed May 29, 2026

Comment thread packages/cache-benchmark-ts/src/datasets/loader.ts

cursor Bot reviewed May 29, 2026

Comment thread packages/cache-benchmark-ts/src/datasets/vcache-lmarena.ts Outdated

KIvanow added 2 commits May 29, 2026 17:53

fix(cache-benchmark-ts): guard against undefined prompt field in vcac…

30ecf95

…he_lmarena

String(undefined) produces the literal string "undefined", which is truthy and therefore slips past the !prompt check. Check for null/undefined before the String() conversion so rows with missing prompt fields are skipped.
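The guard described in this commit can be sketched as below. It is an illustration rather than the loader's actual code: `extractPrompt` is a hypothetical helper, and the whitespace trim is an extra assumption not mentioned in the commit message. It also covers the `prompt` / `Prompt` casing noted in the PR description.

```typescript
type Row = Record<string, unknown>;

// Returns the prompt text, or null when the row has no usable prompt.
// The null/undefined check happens BEFORE String(), because
// String(undefined) === "undefined" is a truthy string.
function extractPrompt(row: Row): string | null {
  const raw = row["prompt"] ?? row["Prompt"];
  if (raw === null || raw === undefined) {
    return null;
  }
  const prompt = String(raw).trim();
  return prompt.length > 0 ? prompt : null;
}
```

Callers can then skip rows where the result is `null` instead of relying on a truthiness check of the converted string.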

Merge remote-tracking branch 'origin/master' into feature/cache-bench…

47b3982

…mark-ts-harness

# Conflicts:
#	pnpm-lock.yaml

cursor Bot reviewed May 29, 2026

KIvanow merged commit ba9e1a3 into master May 29, 2026
3 checks passed

KIvanow deleted the feature/cache-benchmark-ts-harness branch May 29, 2026 15:17

github-actions Bot locked and limited conversation to collaborators May 29, 2026

Feature/cache benchmark ts harness #226: KIvanow merged 5 commits into master from feature/cache-benchmark-ts-harness
