Feature/cache benchmark ts harness #226

Merged
KIvanow merged 5 commits into master from feature/cache-benchmark-ts-harness
May 29, 2026
@KIvanow commented May 29, 2026

Summary

Adds a TypeScript semantic cache benchmark harness (packages/cache-benchmark-ts) that benchmarks @betterdb/semantic-cache against @upstash/semantic-cache across four public datasets, six thresholds (0.10–0.50), and five BetterDB modes (bare, local, full, autotune, autotune-full). Includes dataset loaders fetched from HuggingFace with local JSONL caching, an Upstash adapter with vector ID hashing for long prompts, and per-competitor comparison files.

Changes

  • Add packages/cache-benchmark-ts — TypeScript benchmark harness mirroring the Python harness architecture
    • Add BetterDB adapter with all 5 modes (bare, local, full, autotune, autotune-full) using @betterdb/semantic-cache workspace package
    • Add Upstash adapter using @upstash/semantic-cache with direct Index API for store (hashes long prompt IDs to work around Upstash's 1000-char vector ID limit) and query (captures similarity scores that SemanticCache.get() discards)
    • Add HuggingFace dataset loader with local JSONL caching, configurable concurrency, and retry with exponential backoff for rate-limited APIs
    • Add dataset loaders for STSb, SICK, PAWS-Wiki, and vcache_lmarena with match-threshold support for continuous-score datasets
    • Fix vcache_lmarena loader to handle both prompt and Prompt field casing from different HF API versions
    • Fix dataset downloader to respect limit param (was downloading entire dataset even when only 5K rows needed — caused 3GB cache files and HF rate limiting)
    • Reduce HF API concurrency from 10 to 1 and increase retry delay to handle rate limiting on large datasets
    • Generate per-competitor comparison files: results/competitor_redisvl.txt (3K) and results/competitor_upstash.txt (6K) with peak F1 analysis, latency comparison, and key findings
    • Run full benchmark matrix: 4 datasets × 6 thresholds × 6 modes = 108 runs (Upstash vcache_lmarena required ID hash fix mid-run)
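
The long-prompt ID workaround in the Upstash adapter bullet can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: the function name `toVectorId`, the `h:` prefix, and the choice of SHA-256 are assumptions; only the 1000-character limit comes from the description above.

```typescript
import { createHash } from "node:crypto";

// Limit taken from the PR description; everything else here is illustrative.
const MAX_ID_LENGTH = 1000;

// Use the prompt itself as the vector ID when it fits, otherwise a fixed-length digest.
export function toVectorId(prompt: string, maxLength: number = MAX_ID_LENGTH): string {
  if (prompt.length <= maxLength) return prompt;
  // SHA-256 hex is always 64 characters, so the result stays well under the limit.
  const digest = createHash("sha256").update(prompt, "utf8").digest("hex");
  return `h:${digest}`; // hypothetical prefix marking hashed IDs
}
```

Hashing only the over-length prompts keeps short IDs human-readable in the index while making long ones deterministic, so a repeated store of the same prompt overwrites rather than duplicates.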

Benchmark results (peak F1, bge-small-en-v1.5):

┌─────────────────────┬────────────────┬───────────────────────────────┬────────┐
│ Dataset │ Upstash best │ BetterDB best │ Delta │
├─────────────────────┼────────────────┼───────────────────────────────┼────────┤
│ STSb (5K) │ 75.9% (θ=0.10) │ 76.3% (θ=0.20, bare/autotune) │ +0.4pp │
├─────────────────────┼────────────────┼───────────────────────────────┼────────┤
│ SICK (9.9K) │ 77.6% (θ=0.30) │ 77.7% (θ=0.50, bare/autotune) │ +0.1pp │
├─────────────────────┼────────────────┼───────────────────────────────┼────────┤
│ PAWS-Wiki (8K) │ 61.3% │ 61.3% │ tie │
├─────────────────────┼────────────────┼───────────────────────────────┼────────┤
│ vcache_lmarena (5K) │ 70.1% (θ=0.10) │ 71.4% (θ=0.20, autotune) │ +1.3pp │
└─────────────────────┴────────────────┴───────────────────────────────┴────────┘

Latency: BetterDB 0.7ms p50 vs Upstash 90ms p50 (100-150x, local Valkey vs cloud REST)

Note: Same model name but different embedding runtimes (ONNX vs server-side) produce different similarity score distributions — Upstash [0, 0.26] vs BetterDB [0, 0.50]. Thresholds are not directly comparable; peak F1 at each adapter's optimal threshold is the fair comparison.
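
To make the "peak F1 at each adapter's optimal threshold" comparison concrete, here is a small sketch of the selection step. The `RunResult` shape and function name are hypothetical, and the sample values in the usage note are synthetic, not benchmark results.

```typescript
// Hypothetical per-run record: one entry per (adapter, threshold) pair.
interface RunResult {
  adapter: string;
  threshold: number;
  f1: number;
}

// For each adapter, keep the run with the highest F1 regardless of which threshold produced it.
function peakF1ByAdapter(runs: RunResult[]): Map<string, RunResult> {
  const best = new Map<string, RunResult>();
  for (const run of runs) {
    const current = best.get(run.adapter);
    if (current === undefined || run.f1 > current.f1) {
      best.set(run.adapter, run);
    }
  }
  return best;
}
```

For example, given synthetic runs `a@0.10 → 0.5`, `a@0.20 → 0.7`, and `b@0.10 → 0.6`, the result maps `a` to its 0.20 run and `b` to its 0.10 run; the two peaks are then compared directly even though they sit at different thresholds.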

Checklist

  • Unit / integration tests added
  • Docs added / updated
  • Roborev review passed — run roborev review --branch or /roborev-review-branch in Claude Code (internal)
  • Competitive analysis done / discussed (internal)
  • Blog post about it discussed (internal)

Note

Low Risk
New private benchmarking package and lockfile churn only; no changes to production API paths unless operators run autotune/full modes against live Monitor/OpenAI.

Overview
Adds packages/cache-benchmark-ts, a new TypeScript replay harness aligned with the existing Python cache-benchmark, to compare @betterdb/semantic-cache and @upstash/semantic-cache on STSb, SICK, PAWS-Wiki, and vcache_lmarena.

The CLI sweeps cosine-distance thresholds and writes per-run JSON (snake_case for Python tooling), summaries, and optional markdown reports. BetterDB modes cover bare threshold, keyword rerank, LLM judge, and Monitor-driven autotune (poll + propose + auto-approve). Upstash maps distance to minProximity, upserts with hashed IDs for long prompts, and queries the Vector index directly so similarity scores are recorded.
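
The snake_case output mentioned above could be produced by a key-renaming pass just before serialization. The sketch below assumes the in-memory metrics use camelCase keys; the helper names and the sample field names are illustrative and not taken from the package.

```typescript
// Convert a single camelCase key to snake_case (e.g. "latencyMs" -> "latency_ms").
function toSnakeCase(key: string): string {
  return key.replace(/[A-Z]/g, (ch) => `_${ch.toLowerCase()}`);
}

// Recursively rename the keys of plain objects and arrays; other values pass through unchanged.
function snakeCaseKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(snakeCaseKeys);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(
        ([k, v]): [string, unknown] => [toSnakeCase(k), snakeCaseKeys(v)],
      ),
    );
  }
  return value;
}
```

A caller would then write `JSON.stringify(snakeCaseKeys(result))` so that the Python report tools see the same field names they already expect.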

Dataset loading uses the HuggingFace rows API with JSONL cache, row limits on download (avoids pulling huge splits), low concurrency, and retries on 429/5xx. The vcache loader tolerates prompt / Prompt field casing and builds balanced positive/negative pairs from equivalence classes.
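
A retry loop for 429/5xx responses with exponential backoff might look like the sketch below. The helper names, the default retry count, and the base delay are assumptions for illustration; the fetch function is injected so the example does not depend on any particular HTTP client.

```typescript
// Minimal response shape the sketch needs; the global fetch Response satisfies it.
interface ResponseLike {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type FetchLike = (url: string) => Promise<ResponseLike>;

// Retry only on rate limiting (429) and server errors (5xx).
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Delay doubles with each attempt: base, 2*base, 4*base, ...
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** attempt;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fetchJsonWithRetry(
  url: string,
  fetchFn: FetchLike,
  maxRetries = 5, // assumed default
  baseDelayMs = 1000, // assumed default
): Promise<unknown> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetchFn(url);
    if (res.ok) return res.json();
    if (!isRetryableStatus(res.status) || attempt >= maxRetries) {
      throw new Error(`HTTP ${res.status} for ${url}`);
    }
    await sleep(backoffDelayMs(attempt, baseDelayMs));
  }
}
```

Non-retryable statuses such as 404 fail immediately, so a mistyped dataset name does not spend the full backoff budget before surfacing.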

pnpm-lock.yaml picks up the new workspace package and related deps (@huggingface/transformers, Upstash, tsx, etc.).

Reviewed by Cursor Bugbot for commit 47b3982.

KIvanow and others added 2 commits May 28, 2026 19:48
…ache

Mirrors the Python benchmark harness with TS-native tooling:
- BetterDB adapter wrapping @betterdb/semantic-cache (bare/local/full/autotune modes)
- Upstash adapter wrapping @upstash/semantic-cache for competitive comparison
- HuggingFace dataset loaders (STSb, SICK, PAWS-Wiki, vCache LM Arena) with local JSONL caching
- Local embedding via @huggingface/transformers (bge-small-en-v1.5, all-MiniLM-L6-v2)
- F1/precision/recall/FPR metrics with latency percentiles
- snake_case JSON output compatible with Python harness report tools
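
For reference, the metrics listed above can be derived from confusion counts as in this sketch. It assumes a cache hit counts as a positive prediction and uses the nearest-rank percentile method; the harness may define either differently, and all names here are illustrative.

```typescript
// Confusion counts, treating "cache hit" as the positive prediction (an assumption).
interface Counts {
  tp: number;
  fp: number;
  tn: number;
  fn: number;
}

// Guard against division by zero when a class is empty.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

function classificationMetrics(c: Counts) {
  const precision = ratio(c.tp, c.tp + c.fp);
  const recall = ratio(c.tp, c.tp + c.fn);
  const f1 = ratio(2 * precision * recall, precision + recall);
  const fpr = ratio(c.fp, c.fp + c.tn);
  return { precision, recall, f1, fpr };
}

// Nearest-rank percentile over latency samples (p in the range 0..100).
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```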

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread packages/cache-benchmark-ts/src/datasets/loader.ts
…ws than limit

The cache key was (dataset, config, split) with no limit component. A run
with limit=500 would cache 500 rows, then a subsequent run with limit=5000
would get a cache hit and silently return only 500 rows.

Fix: if the cached file has fewer rows than the requested limit, treat it
as stale and re-download.
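
A rough sketch of that staleness check, with hypothetical helper names (the actual loader.ts code is not shown in this thread):

```typescript
import { existsSync, readFileSync } from "node:fs";

// Split JSONL text into non-empty lines (one JSON document per line).
function splitJsonl(text: string): string[] {
  return text.split("\n").filter((line) => line.trim().length > 0);
}

// Return the first `limit` rows if the cache holds enough of them, otherwise null (stale).
function rowsIfFresh(cachedLines: string[], limit: number): string[] | null {
  return cachedLines.length >= limit ? cachedLines.slice(0, limit) : null;
}

// Hypothetical wrapper: a null result tells the caller to re-download.
function readCachedRows(cachePath: string, limit: number): string[] | null {
  if (!existsSync(cachePath)) return null;
  return rowsIfFresh(splitJsonl(readFileSync(cachePath, "utf8")), limit);
}
```

One caveat of this rule as sketched: a split that genuinely contains fewer rows than the requested limit would be treated as stale on every run.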
Comment thread packages/cache-benchmark-ts/src/datasets/vcache-lmarena.ts Outdated
KIvanow added 2 commits May 29, 2026 17:53
…he_lmarena

String(undefined) produces the literal string "undefined", which is truthy
and therefore slips past the !prompt check. Check for null/undefined before
the String() conversion so rows with missing prompt fields are skipped.
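
A minimal sketch of the corrected check, combined with the prompt / Prompt casing fallback. The function name and the whitespace trimming are assumptions, not the loader's actual code.

```typescript
// Read the prompt from a row, tolerating both field casings.
// Returns null when the field is missing so the caller can skip the row.
function readPrompt(row: Record<string, unknown>): string | null {
  const raw = row.prompt ?? row.Prompt;
  // Check before String(): String(undefined) would yield the truthy string "undefined".
  if (raw === null || raw === undefined) return null;
  const prompt = String(raw).trim(); // trimming is an illustrative extra
  return prompt.length > 0 ? prompt : null;
}
```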
…mark-ts-harness

# Conflicts:
#	pnpm-lock.yaml

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

summary[summaryKey] = metrics;
} finally {
await adapter.close();
}


Embedding model reloaded for every threshold iteration

Medium Severity

Each threshold iteration builds a new BetterDBAdapter via buildAdapter, which calls initialize() → buildEmbedFn() → pipeline('feature-extraction', ...). With the default 9 thresholds, the local embedding model (e.g., all-MiniLM-L6-v2) is loaded from disk into memory 9 separate times. Model loading typically takes several seconds per invocation, adding 30–90 seconds of unnecessary overhead to a benchmark run. The model and Valkey connection could be shared across threshold iterations since they don't change.
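
One possible way to share the model across iterations is to memoize the loader by model name, as sketched below. The `EmbedFn` type, the cache map, and `getSharedEmbedFn` are hypothetical; the real loader (the `pipeline(...)` call) would be passed in as `load`.

```typescript
type EmbedFn = (text: string) => Promise<number[]>;

// Cache the in-flight promise, so concurrent callers also share a single load.
const embedFnCache = new Map<string, Promise<EmbedFn>>();

function getSharedEmbedFn(
  modelName: string,
  load: (name: string) => Promise<EmbedFn>,
): Promise<EmbedFn> {
  let cached = embedFnCache.get(modelName);
  if (cached === undefined) {
    cached = load(modelName);
    embedFnCache.set(modelName, cached);
  }
  return cached;
}
```

With this in place, each threshold iteration would await `getSharedEmbedFn(model, loadModel)` instead of loading directly, and only the first call pays the load cost. A failed load stays cached in this sketch; evicting the entry on rejection would be a reasonable refinement.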


adapters.set(adapter, f1s);
}
f1s.push(entry.metrics.f1);
}


Report adapter grouping splits key incorrectly for compound names

Low Severity

The report groups entries by adapter name using entry.key.split('_')[0]. The summary key format is ${adapter.name}_${threshold.toFixed(2)}. If an adapter name ever contains an underscore (e.g., a future redis_vl adapter), the split would incorrectly extract only the prefix before the first underscore, breaking the F1 trend grouping and best-F1 lookup logic.
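
Because the threshold suffix produced by toFixed(2) never contains an underscore, splitting on the last underscore instead of the first would avoid the problem. A sketch, with a hypothetical function name:

```typescript
// Parse a summary key of the form `${adapterName}_${threshold.toFixed(2)}`.
// Splitting at the LAST underscore keeps adapter names like "redis_vl" intact.
function parseSummaryKey(key: string): { adapter: string; threshold: number } {
  const sep = key.lastIndexOf("_");
  if (sep <= 0 || sep === key.length - 1) {
    throw new Error(`Malformed summary key: ${key}`);
  }
  return {
    adapter: key.slice(0, sep),
    threshold: Number(key.slice(sep + 1)),
  };
}
```

Storing the adapter name as its own field on each entry would remove the need to parse keys at all, at the cost of a small change to the summary format.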


@KIvanow KIvanow merged commit ba9e1a3 into master May 29, 2026
3 checks passed
@KIvanow KIvanow deleted the feature/cache-benchmark-ts-harness branch May 29, 2026 15:17
@github-actions github-actions Bot locked and limited conversation to collaborators May 29, 2026