Feat/benchmarking by KIvanow · Pull Request #223 · BetterDB-inc/monitor

KIvanow · 2026-05-27T08:12:13Z

Summary

Adds outcome-aware self-tuning to the Monitor's threshold recommendation engine and benchmarks it on four public datasets (STSb, SICK, SemBenchmarkLmArena, PAWS-Wiki). The autotuner loosens when too strict, tightens when too loose, and does nothing when the threshold is already optimal — with zero degradation across every configuration tested.

Changes

- Add signal quality guards to the recommendation engine: require the uncertain-hit fraction of all operations (not just hits) to exceed 15% before tightening; require hit rate > 80% before treating distant-hits as a tighten signal
- Add recall-cost guard — block tighten if estimated hit loss exceeds 15% of current hits
- Add outcome tracking — after each adjustment, compare current signal metrics against the snapshot from the previous adjustment; stop if the triggering signal did not improve by 20%
- Add velocity dampening — progressive step-size reduction for consecutive same-direction adjustments with oscillation detection
- Write tuning history with metrics snapshots in the apply dispatcher to enable outcome tracking
- Add STSb dataset loader (8,628 pairs, 3 genre categories, continuous scores)
- Add SICK dataset loader (9,927 pairs, score-derived categories)
- Add --match-threshold CLI option for continuous-score datasets
- Add autotune-full mode combining rerank + LLM judge + Monitor-driven autotuning
- Improve judge prompt — semantic equivalence framing instead of Q&A, strip "Answer:" prefix
- Add timeout retry with exponential backoff on Monitor API calls
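
The guard thresholds listed above (15% uncertain-hit fraction of all operations, 80% hit rate, 15% recall cost) can be sketched as a single gate in front of any tighten proposal. This is a minimal illustration, not the engine's code: the `WindowStats` fields, the `may_tighten` name, the reason strings, and the rule that any distant hit counts once the hit-rate gate is open are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Counts over one similarity window (all field names are hypothetical)."""
    total_ops: int               # hits + misses
    hits: int
    uncertain_hits: int          # hits close to the threshold
    distant_hits: int            # weak hits far from the query
    hits_lost_if_tightened: int  # hits that would fall outside the proposed threshold


def may_tighten(stats: WindowStats) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed tighten."""
    if stats.total_ops == 0 or stats.hits == 0:
        return False, "no_data"

    hit_rate = stats.hits / stats.total_ops
    # Measured against ALL operations, not just hits.
    uncertain_fraction_of_all = stats.uncertain_hits / stats.total_ops

    uncertain_signal = uncertain_fraction_of_all > 0.15
    # Assumption: any distant hit counts once the hit-rate gate is open.
    distant_signal = hit_rate > 0.80 and stats.distant_hits > 0
    if not (uncertain_signal or distant_signal):
        return False, "no_signal"

    # Recall-cost guard: refuse if more than 15% of current hits would be lost.
    if stats.hits_lost_if_tightened / stats.hits > 0.15:
        return False, "recall_cost"
    return True, "ok"
```

With these assumptions, a window where every hit would be lost by tightening is rejected by the recall-cost guard even though the uncertain-hit signal fires.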

Benchmark results (bare → autotune):

STSb θ=0.20: +2.8% F1 (loosened 0.20 → 0.22)
STSb θ=0.40: +8.6% precision (tightened 0.40 → 0.30)
SICK θ=0.10: +2.1% F1 (loosened 0.10 → 0.145)
SemBenchmarkLmArena θ=0.40: +2.9% F1 (tightened 0.40 → 0.30)
PAWS-Wiki: 0% change (wall confirmed, autotuner correctly does nothing)

Checklist

Unit / integration tests added
Docs added / updated
Roborev review passed — run roborev review --branch or /roborev-review-branch in Claude Code (internal)
Competitive analysis done / discussed (internal)
Blog post about it discussed (internal)

Note

Medium Risk
Changes proprietary threshold recommendation and apply paths that affect production cache behavior; benchmark and docs changes are lower risk but autotune now depends on Monitor API credentials and external calls.

Overview
Strengthens Monitor-driven semantic cache threshold tuning with richer signals (uncertain hits, distant weak hits, near-misses, low hit rate), a recall-cost guard, outcome tracking against prior adjustments, and velocity/oscillation dampening. Approved changes now append tuning history with metric snapshots in Valkey, and live config reads prefer the threshold field in __config.
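
As a rough illustration of outcome tracking, the comparison against the previous adjustment's snapshot might look like the sketch below. It assumes the triggering signal is a rate where lower is better (for example an uncertain-hit fraction) and that the 20% figure is a relative improvement; the function name and parameters are hypothetical.

```python
def signal_improved(before: float, after: float, min_improvement: float = 0.20) -> bool:
    """True if the triggering signal dropped by at least min_improvement (relative).

    `before` is the value stored in the snapshot at the previous adjustment,
    `after` is the same signal measured now. Lower is assumed to be better.
    """
    if before <= 0:
        return False  # nothing to improve on
    return (before - after) / before >= min_improvement


# If the signal barely moved, further adjustment is treated as ineffective.
keep_adjusting = signal_improved(before=0.25, after=0.22)
```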

Expands the cache-benchmark harness: STSb and SICK loaders, --match-threshold, autotune-full (rerank + judge + Monitor autotune), and BetterDB autotune wired to the Monitor recommend → propose → approve API (replacing in-process tuning), plus judge prompt tweaks and Monitor API retries. Adds Python semantic cache package documentation and minor benchmark/README/lockfile updates.

Reviewed by Cursor Bugbot for commit 5af5e5f.

…profiler

New package packages/cache-benchmark implementing a replay-based benchmark harness for comparing semantic cache implementations (BetterDB, RedisVL, GPTCache) against labeled query-pair datasets (vCache SemBenchmarkLmArena, PAWS-Wiki).

Key features:
- Three-mode benchmark: bare (cosine only), local (native rerankers, no external APIs), full (LLM judge + Cohere rerank with API keys)
- BetterDB adapter: cosine threshold + keyword-overlap rerank + OpenAI gpt-4o-mini judge gate on uncertain hits (full mode)
- RedisVL adapter: raw FT.SEARCH workaround for Valkey Search compatibility (VectorRangeQuery unsupported); native check() path for Redis Stack
- GPTCache adapter: FAISS+SQLite with SBERT crossencoder rerank (local) or Cohere Rerank API (full); GPTCACHE_LOCAL_EVALUATOR env var for alternatives
- Harness stores informative dummy responses ("Answer: {prompt}") so LLM-as-judge adapters receive meaningful text matching production conditions
- --debug-judge flag: wires gpt-4o-mini onto uncertain hits and writes every invocation to a JSONL log for investigation
- Dataset loaders for vCache/SemBenchmarkLmArena (ID_Set equivalence field, train split) and google-research-datasets/paws (labeled_final)
- Metrics: precision, recall, F1, FPR, p50/p95/p99 latency
- Threshold sweep CLI with per-(adapter, mode, dataset, threshold) JSON output and summary files
- Markdown report generator with per-adapter F1 sparklines
- Standalone latency profiler (scripts/latency_profile.py) with embed/network/parse breakdown and BetterDB embedding cache hypothesis test
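
For reference, the quality metrics named above follow the standard confusion-matrix definitions. The sketch below assumes a cache hit on a pair labeled equivalent is a true positive, a hit on a non-equivalent pair is a false positive, and misses map to false and true negatives accordingly; the function name and the zero-division convention are illustrative, not taken from the harness.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Precision, recall, F1 and false-positive rate from confusion counts.

    Empty denominators yield 0.0 instead of raising (an assumed convention).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}
```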

- vcache_lmarena.py: fix module docstring (wrong org name, wrong field name); move random import to top level; remove dead guard (random.sample already guarantees distinct elements)
- harness.py: remove unused populate_with_dummy_responses parameter; move asyncio and tqdm imports to module level
- redisvl_adapter.py: fix malformed module docstring (# lines inside """); remove unused cached_prompt variable; fix double-embed in _check_native by passing pre-computed vector to cache.check() instead of re-embedding; remove unused PROMPT_FIELD_NAME import
- gptcache_adapter.py: update stale comment in _check_full_cohere (harness now stores informative responses, not hashes); convert __main__ string literal to proper comments
- validate.py: fix default port in __main__ from 6379 to 6381
- latency_profile.py: wire --profile flag to gate the summary section

…is Stack 7.4.7)

Previous profile used redis/redis-stack-server:latest (Redis 7.4.7, sha256:798ab84d), which showed native RedisVL at 6.30ms p50 vs the Valkey workaround at 3.46ms, a misleading 84% gap driven by VectorRangeQuery overhead in Redis Stack's older search module.

Re-tested against redis:latest (Redis 8.6.3, Search 8.6.7): native RedisVL on Redis 8 is 3.44ms p50, statistically identical to the Valkey workaround (3.46ms, delta = 0.02ms). The latency is now dominated entirely by SBERT embedding compute (~3ms), not search internals.

Changes:
- redisvl_adapter.py: add "redis-os" as the canonical backend value for Redis 8 Open Source; keep "redis-stack" as a backward-compatible alias; switch env var to REDIS_OS_URL (falls back to REDIS_STACK_URL); update default port from 6383 to 6384; update error message and docstring
- latency_profile.py: document previous Redis Stack version in module docstring; hardcode legacy Redis Stack 7.4.7 results as LEGACY_REDIS_STACK_RESULT so both rows appear in the table without re-running; update backend from "redis-stack" to "redis-os"; include Redis version string in row label dynamically; update summary to compare Redis Stack vs Redis 8 and render the blog narrative verdict automatically

- validate.py: TARGETS_FILE used 4 .parent calls instead of 3, resolving to packages/validate_targets.yaml (does not exist) instead of packages/cache-benchmark/validate_targets.yaml. Always caused FileNotFoundError when running the validation gate.
- harness.py: _with_retry called adapter.close() + adapter.initialize() on transient errors, silently wiping all stored benchmark entries. RedisVL.close() drops the FT index and deletes all keys; GPTCache.close() removes the FAISS data directory. A single timeout mid-store phase would reset the cache to empty and corrupt precision/recall/F1 for the rest of that run. Removed the close/initialize from retry; the underlying client reconnects automatically on the next command after the sleep.
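
A retry helper in the spirit of the harness.py fix might look like the sketch below: it only sleeps and retries, and never tears down the adapter, so stored entries survive a transient error. The name `with_retry`, its parameters, and the choice of retryable exceptions are assumptions rather than the harness's actual signature.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def with_retry(
    op: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> T:
    """Run `op`, retrying transient failures with exponential backoff.

    Deliberately does NOT close or re-initialize anything between attempts:
    the client is expected to reconnect on the next command after the sleep.
    """
    for attempt in range(attempts):
        try:
            return await op()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the original error
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```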

The check() method only matched "redis-stack" for the native code path, but __init__ treats both "redis-os" and "redis-stack" as native backends. This caused backend="redis-os" to silently run through the Valkey workaround, producing misleading comparison results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
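
One way to avoid this class of mismatch is to decide "native or not" once, from a single shared set, and have both the constructor and the lookup path read that decision. The class and attribute names below are hypothetical and only illustrate the idea.

```python
# Single source of truth for which backend values use the native code path.
NATIVE_BACKENDS = frozenset({"redis-os", "redis-stack"})


class AdapterSketch:
    """Toy stand-in for an adapter; not the real RedisVL adapter."""

    def __init__(self, backend: str) -> None:
        self.backend = backend
        # Decide once, so every later branch agrees with __init__.
        self.native = backend in NATIVE_BACKENDS

    def check_path(self) -> str:
        """Report which code path a lookup would take."""
        return "native" if self.native else "valkey-workaround"
```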

…ation with outcome tracking

Add four safety mechanisms to the Monitor's threshold recommendation engine that turn the naive "see signal, adjust threshold" loop into a system that either improves or matches bare performance across every configuration tested.

Recommendation engine (cache-readonly.service.ts):
- Signal quality guards: require uncertain-hit fraction of ALL operations (not just hits) to exceed 15% before tightening; require hit rate > 80% before treating distant-hits as a tighten signal
- Recall-cost guard: block tighten if estimated hit loss exceeds 15% of current hits (the TP/FP distributions overlap too much for threshold adjustment alone)
- Outcome tracking: after each adjustment, compare current signal metrics against the snapshot from the previous adjustment; if the triggering signal did not improve by 20%, declare further adjustment ineffective
- Velocity dampening: progressive step-size reduction (1.0, 0.67, 0.50, 0.40, 0.33) for consecutive same-direction adjustments; cap at 5 consecutive moves; detect and break oscillation after 3 direction flips

Apply dispatcher (cache-apply.dispatcher.ts):
- Write tuning history with metrics snapshot to {prefix}:__tuning_history on every threshold proposal apply, enabling outcome tracking
- Compute signal metrics from similarity window at apply time so the recommendation engine can compare before/after without cross-service deps

Benchmark harness (cache-benchmark):
- Add STSb dataset loader (mteb/stsbenchmark-sts, 8,628 pairs, 3 genre categories, continuous 0-5 scores normalized to 0-1)
- Add SICK dataset loader (mteb/sickr-sts, 9,927 pairs, score-derived categories, dense [3,4) ambiguous middle band)
- Add --match-threshold CLI option for continuous-score datasets
- Add autotune-full mode: rerank + LLM judge + Monitor-driven autotuning
- Improve judge prompt: semantic equivalence framing instead of Q&A ("are these two texts semantically equivalent?" vs "is this response an acceptable answer?"), strip "Answer:" prefix from stored responses
- Add timeout retry with exponential backoff on Monitor API calls
- Increase httpx timeout from 10s to 30s for large similarity windows

Benchmark results across four datasets (STSb 5K, SICK 9.9K, SemBenchmarkLmArena 5K, PAWS-Wiki 8K), five thresholds, five modes:
- STSb θ=0.20: +2.8% F1 vs static (loosened 0.20 → 0.22)
- STSb θ=0.40: +8.6% precision vs static (tightened 0.40 → 0.30)
- SICK θ=0.10: +2.1% F1 vs static (loosened 0.10 → 0.145)
- SemBenchmarkLmArena θ=0.40: +2.9% F1 vs static (tightened 0.40 → 0.30)
- PAWS-Wiki: 0% change (wall confirmed, autotuner correctly does nothing)
- Zero degradation at already-optimal thresholds across all datasets
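
The velocity dampening described in this commit can be sketched as follows. The step scales, the cap of 5, and the 3-flip limit come from the commit message; everything else is an assumption, including the function name, the use of plain direction strings, and counting flips over the whole supplied history.

```python
from typing import Optional, Sequence

STEP_SCALES = (1.0, 0.67, 0.50, 0.40, 0.33)  # scale for the 1st..5th consecutive move
MAX_CONSECUTIVE = len(STEP_SCALES)           # cap at 5 same-direction moves
MAX_FLIPS = 3                                # oscillation limit


def dampened_step(base_step: float, history: Sequence[str], direction: str) -> Optional[float]:
    """Return the scaled step for the next move, or None if no move should be made.

    `history` holds previous directions ("tighten" / "loosen"), oldest first.
    """
    # Oscillation: too many direction changes in the recorded history.
    flips = sum(1 for a, b in zip(history, history[1:]) if a != b)
    if flips >= MAX_FLIPS:
        return None

    # Count how many of the most recent moves already went in this direction.
    run = 0
    for past in reversed(history):
        if past != direction:
            break
        run += 1
    if run >= MAX_CONSECUTIVE:
        return None
    return base_step * STEP_SCALES[run]
```

Under these assumptions the first move uses the full step, the third consecutive move in the same direction uses half of it, and a sixth is refused.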

KIvanow · 2026-05-27T08:55:28Z

+            discovery=DiscoveryOptions(enabled=is_autotune),
+            # 1s interval: short enough that a cloud Monitor threshold change is visible
+            # within the next few check() calls even on a fast benchmark run.
+            config_refresh=ConfigRefreshOptions(enabled=is_autotune, interval_ms=30_000),


This is fine and expected

Add docs/packages/semantic-cache-python.md covering betterdb-semantic-cache v0.4.0: installation, quick start, full configuration reference, all framework adapters (LangChain, OpenAI, Anthropic, LlamaIndex, LangGraph), embedding helpers, LLM-as-judge, reranking, cost tracking, threshold effectiveness, batch check, config refresh, discovery, telemetry, and TypeScript interoperability notes.

# Conflicts:
# packages/cache-benchmark/pyproject.toml
# packages/cache-benchmark/src/cache_benchmark/adapters/base.py
# packages/cache-benchmark/src/cache_benchmark/adapters/betterdb.py
# packages/cache-benchmark/src/cache_benchmark/adapters/gptcache_adapter.py
# packages/cache-benchmark/src/cache_benchmark/cli.py
# packages/cache-benchmark/uv.lock

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 6436ea9.

1. Dispatcher signal classification diverges from recommendation engine: computeMetricsSnapshot used simplified conditions (uncertainHitRate > 0.2) that omit the inner guards added to the recommendation engine (uncertainFractionOfAll > 0.15, hitRate > 0.8). The stored signal could differ from what actually triggered the recommendation, causing checkLastOutcome to compare the wrong metric.
2. Tighten recommendation can produce a loosening threshold: In the distant_hits path, target = p75 + uncertainty_band * 0.3 can exceed the current threshold when p75 is close to it. Clamp with Math.min(threshold, ...) so a tighten never produces a value above the current threshold.
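
The clamp from the second item, written out in Python for illustration (the engine itself is TypeScript). It assumes, as the benchmark results suggest, that a lower threshold is the stricter one; the function name is hypothetical.

```python
def tighten_target(threshold: float, p75: float, uncertainty_band: float) -> float:
    """Proposed threshold for a tighten in the distant_hits path.

    The raw target can land above the current threshold when p75 is close
    to it, so it is clamped: a tighten never returns more than `threshold`.
    """
    raw_target = p75 + uncertainty_band * 0.3
    return min(threshold, raw_target)
```

For example, with threshold 0.30, p75 0.29 and an uncertainty band of 0.05, the raw target is about 0.305 and the clamp returns 0.30, so the proposal can no longer loosen.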

… guards The test used extreme data (all 50 hits at exactly score 0.09) which triggered the recall-cost guard (100% of hits would be lost by tightening). Update test data to a realistic distribution: 85 strong hits well below the proposed threshold, 10 uncertain hits near the boundary, 5 misses. This exercises the tighten signal while passing all safety guards. Also add lrange stub to StubValkey for tuning history reads.

KIvanow and others added 6 commits May 25, 2026 02:04

cursor Bot reviewed May 27, 2026

KIvanow commented May 27, 2026

BetterDB-inc deleted a comment from cursor Bot May 27, 2026

KIvanow added 3 commits May 27, 2026 11:56

fix(cache-benchmark): use None check instead of falsy or for zero dis…

7c789f7

…tance best_distance of 0.0 (identical vectors) is falsy in Python, causing the or expression to fall through to the candidates fallback. Use an explicit None check instead.
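
The pitfall is easy to reproduce in isolation. In the sketch below, `fallback` stands in for whatever the candidates fallback yields; both function names are illustrative.

```python
from typing import Optional


def resolve_buggy(best_distance: Optional[float], fallback: float) -> float:
    # 0.0 is falsy, so an exact match (distance 0.0) is discarded here.
    return best_distance or fallback


def resolve_fixed(best_distance: Optional[float], fallback: float) -> float:
    # Only a missing value (None) triggers the fallback.
    return best_distance if best_distance is not None else fallback
```

With identical vectors, `resolve_buggy(0.0, 0.42)` returns the fallback `0.42`, while `resolve_fixed(0.0, 0.42)` keeps the true distance `0.0`.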

docs(cache-benchmark): update README with datasets, modes, and usage

1e66bbd

Rewrite README to cover all four datasets (STSb, SICK, SemBenchmarkLmArena, PAWS-Wiki), five modes (bare, local, full, autotune, autotune-full), usage examples, output file descriptions, and CLI reference.

cursor Bot reviewed May 27, 2026

Comment thread packages/cache-benchmark/src/cache_benchmark/adapters/betterdb.py

fix(cache-benchmark): correct configRefresh interval comments from 1s…

6436ea9

… to 30s The code sets interval_ms=30_000 (30 seconds) but five comments and two user-facing feature strings said "1s interval". Fix all to say "30s".

cursor Bot reviewed May 27, 2026

Comment thread proprietary/cache-proposals/cache-readonly.service.ts Outdated

Comment thread proprietary/cache-proposals/cache-apply.dispatcher.ts

KIvanow force-pushed the feat/benchmarking branch from 6904255 to ede7bbe Compare May 27, 2026 09:51

KIvanow merged commit 1dcc2f4 into master May 27, 2026
3 checks passed

KIvanow deleted the feat/benchmarking branch May 27, 2026 10:10

github-actions Bot locked and limited conversation to collaborators May 27, 2026

KIvanow merged 13 commits into master from feat/benchmarking
