Feat/benchmarking #223
Merged
…profiler
New package packages/cache-benchmark implementing a replay-based benchmark
harness for comparing semantic cache implementations (BetterDB, RedisVL,
GPTCache) against labeled query-pair datasets (vCache SemBenchmarkLmArena,
PAWS-Wiki).
Key features:
- Three-mode benchmark: bare (cosine only), local (native rerankers, no
external APIs), full (LLM judge + Cohere rerank with API keys)
- BetterDB adapter: cosine threshold + keyword-overlap rerank + OpenAI
gpt-4o-mini judge gate on uncertain hits (full mode)
- RedisVL adapter: raw FT.SEARCH workaround for Valkey Search compatibility
(VectorRangeQuery unsupported); native check() path for Redis Stack
- GPTCache adapter: FAISS+SQLite with SBERT crossencoder rerank (local) or
Cohere Rerank API (full); GPTCACHE_LOCAL_EVALUATOR env var for alternatives
- Harness stores informative dummy responses ("Answer: {prompt}") so
LLM-as-judge adapters receive meaningful text matching production conditions
- --debug-judge flag: wires gpt-4o-mini onto uncertain hits and writes every
invocation to a JSONL log for investigation
- Dataset loaders for vCache/SemBenchmarkLmArena (ID_Set equivalence field,
train split) and google-research-datasets/paws (labeled_final)
- Metrics: precision, recall, F1, FPR, p50/p95/p99 latency
- Threshold sweep CLI with per-(adapter, mode, dataset, threshold) JSON output
and summary files
- Markdown report generator with per-adapter F1 sparklines
- Standalone latency profiler (scripts/latency_profile.py) with embed/
network/parse breakdown and BetterDB embedding cache hypothesis test
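The metrics listed above can be sketched as follows. This is a minimal illustration, not the harness code: the `Confusion` container, the function names, and the nearest-rank percentile method are assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Confusion:
    """Counts for one (adapter, mode, dataset, threshold) run (hypothetical type)."""
    tp: int  # cache hit, pair labeled equivalent
    fp: int  # cache hit, pair labeled non-equivalent
    fn: int  # cache miss, pair labeled equivalent
    tn: int  # cache miss, pair labeled non-equivalent


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(c: Confusion) -> Dict[str, float]:
    precision = safe_div(c.tp, c.tp + c.fp)
    recall = safe_div(c.tp, c.tp + c.fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    fpr = safe_div(c.fp, c.fp + c.tn)
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer in 1..100."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Calling `percentile(latencies_ms, 50)`, `percentile(latencies_ms, 95)`, and `percentile(latencies_ms, 99)` on the per-query latencies would give the p50/p95/p99 figures.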
- vcache_lmarena.py: fix module docstring (wrong org name, wrong field
name); move random import to top level; remove dead guard (random.sample
already guarantees distinct elements)
- harness.py: remove unused populate_with_dummy_responses parameter; move
asyncio and tqdm imports to module level
- redisvl_adapter.py: fix malformed module docstring (# lines inside """);
remove unused cached_prompt variable; fix double-embed in _check_native by
passing pre-computed vector to cache.check() instead of re-embedding;
remove unused PROMPT_FIELD_NAME import
- gptcache_adapter.py: update stale comment in _check_full_cohere (harness
now stores informative responses, not hashes); convert __main__ string
literal to proper comments
- validate.py: fix default port in __main__ from 6379 to 6381
- latency_profile.py: wire --profile flag to gate the summary section
…is Stack 7.4.7)
Previous profile used redis/redis-stack-server:latest (Redis 7.4.7,
sha256:798ab84d), which showed native RedisVL at 6.30ms p50 vs the Valkey
workaround at 3.46ms, a misleading 84% gap driven by VectorRangeQuery
overhead in Redis Stack's older search module.
Re-tested against redis:latest (Redis 8.6.3, Search 8.6.7): native RedisVL
on Redis 8 is 3.44ms p50, statistically identical to the Valkey workaround
(3.46ms, delta = 0.02ms). The latency is now dominated entirely by SBERT
embedding compute (~3ms), not search internals.
Changes:
- redisvl_adapter.py: add "redis-os" as the canonical backend value for
Redis 8 Open Source; keep "redis-stack" as a backward-compatible alias;
switch env var to REDIS_OS_URL (falls back to REDIS_STACK_URL); update
default port from 6383 to 6384; update error message and docstring
- latency_profile.py: document previous Redis Stack version in module
docstring; hardcode legacy Redis Stack 7.4.7 results as
LEGACY_REDIS_STACK_RESULT so both rows appear in the table without
re-running; update backend from "redis-stack" to "redis-os"; include Redis
version string in row label dynamically; update summary to compare Redis
Stack vs Redis 8 and render the blog narrative verdict automatically
- validate.py: TARGETS_FILE used 4 .parent calls instead of 3, resolving to
packages/validate_targets.yaml (does not exist) instead of
packages/cache-benchmark/validate_targets.yaml. Always caused
FileNotFoundError when running the validation gate.
- harness.py: _with_retry called adapter.close() + adapter.initialize() on
transient errors, silently wiping all stored benchmark entries.
RedisVL.close() drops the FT index and deletes all keys; GPTCache.close()
removes the FAISS data directory. A single timeout mid-store phase would
reset the cache to empty and corrupt precision/recall/F1 for the rest of
that run. Removed the close/initialize from retry; the underlying client
reconnects automatically on the next command after the sleep.
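A retry helper in the spirit of the harness.py fix might look like the sketch below. It only sleeps and re-issues the operation; it never closes or re-initializes the adapter, so stored entries survive a transient failure. The function name `with_retry`, its parameters, the doubling delay, and the exception tuple are assumptions for illustration, not the actual `_with_retry` signature.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def with_retry(
    op: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,
    base_delay_s: float = 0.5,
    transient: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> T:
    """Re-run op() on transient errors without touching adapter state."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return await op()
        except transient:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Sleep only. No close()/initialize() here, since those would
            # drop the index or data directory and empty the cache mid-run.
            await asyncio.sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError("unreachable")
```

A store call would then be wrapped as `await with_retry(lambda: adapter.store(prompt, response))`, where `adapter.store` stands in for whatever coroutine the harness actually calls.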
The check() method only matched "redis-stack" for the native code path, but __init__ treats both "redis-os" and "redis-stack" as native backends. This caused backend="redis-os" to silently run through the Valkey workaround, producing misleading comparison results.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
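One way to keep __init__ and check() from drifting apart again is to route both through a single membership test. This is a hypothetical sketch: `NATIVE_BACKENDS` and `uses_native_path` are illustrative names, and "valkey" is assumed as the non-native backend value.

```python
# Backend values named in the commit message; both take the native path.
NATIVE_BACKENDS = frozenset({"redis-os", "redis-stack"})


def uses_native_path(backend: str) -> bool:
    """True when the adapter should use the native check() path."""
    return backend in NATIVE_BACKENDS
```

With one shared constant, adding a future alias changes both call sites at once.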
…ation with outcome tracking
Add four safety mechanisms to the Monitor's threshold recommendation engine
that turn the naive "see signal, adjust threshold" loop into a system that
either improves or matches bare performance across every configuration tested.
Recommendation engine (cache-readonly.service.ts):
- Signal quality guards: require uncertain-hit fraction of ALL operations
(not just hits) to exceed 15% before tightening; require hit rate > 80%
before treating distant-hits as a tighten signal
- Recall-cost guard: block tighten if estimated hit loss exceeds 15% of
current hits — the TP/FP distributions overlap too much for threshold
adjustment alone
- Outcome tracking: after each adjustment, compare current signal metrics
against the snapshot from the previous adjustment; if the triggering
signal did not improve by 20%, declare further adjustment ineffective
- Velocity dampening: progressive step-size reduction (1.0, 0.67, 0.50,
0.40, 0.33) for consecutive same-direction adjustments; cap at 5
consecutive moves; detect and break oscillation after 3 direction flips
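The guards and the dampening schedule above can be sketched in Python as follows (the engine itself lives in cache-readonly.service.ts). The numeric constants come from the commit message; the function names, the direction strings, and the choice to count direction flips over the whole history are assumptions.

```python
from typing import List, Optional

# Constants quoted in the commit message.
STEP_SCALES = (1.0, 0.67, 0.50, 0.40, 0.33)
MAX_CONSECUTIVE_MOVES = 5
MAX_DIRECTION_FLIPS = 3


def uncertain_signal_ok(uncertain_hits: int, total_ops: int) -> bool:
    """Uncertain hits must exceed 15% of ALL operations, not just hits."""
    return total_ops > 0 and uncertain_hits / total_ops > 0.15


def distant_signal_ok(hits: int, total_ops: int) -> bool:
    """Distant hits only count as a tighten signal when hit rate > 80%."""
    return total_ops > 0 and hits / total_ops > 0.80


def recall_cost_ok(estimated_lost_hits: int, current_hits: int) -> bool:
    """Block a tighten that would lose more than 15% of current hits."""
    if current_hits == 0:
        return True
    return estimated_lost_hits / current_hits <= 0.15


def dampened_step(
    base_step: float, history: List[str], direction: str
) -> Optional[float]:
    """Return the scaled step, or None when adjustment should stop.

    history holds prior directions ("tighten" / "loosen"), oldest first.
    """
    flips = sum(1 for a, b in zip(history, history[1:]) if a != b)
    if flips >= MAX_DIRECTION_FLIPS:
        return None  # oscillating between directions
    streak = 0
    for past in reversed(history):
        if past != direction:
            break
        streak += 1
    if streak >= MAX_CONSECUTIVE_MOVES:
        return None  # cap on consecutive same-direction moves reached
    return base_step * STEP_SCALES[streak]
```

The step scales follow 2/(n+2) for n = 0..4, so each consecutive move in the same direction shrinks the adjustment.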
Apply dispatcher (cache-apply.dispatcher.ts):
- Write tuning history with metrics snapshot to {prefix}:__tuning_history
on every threshold proposal apply, enabling outcome tracking
- Compute signal metrics from similarity window at apply time so the
recommendation engine can compare before/after without cross-service deps
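The before/after comparison this history enables can be sketched as below. The JSON field names, the helper names, and the `higher_is_better` switch are hypothetical; only the 20% improvement bar and the idea of storing a metrics snapshot with each applied proposal come from the commit message.

```python
import json
import time
from typing import Optional


def make_history_entry(
    old_threshold: float,
    new_threshold: float,
    signal: str,
    signal_value: float,
    now: Optional[float] = None,
) -> str:
    """Serialize one tuning-history record (field names are illustrative)."""
    return json.dumps({
        "ts": time.time() if now is None else now,
        "old_threshold": old_threshold,
        "new_threshold": new_threshold,
        "signal": signal,
        "signal_value": signal_value,
    })


def adjustment_was_effective(
    last_entry_json: str,
    current_signal_value: float,
    min_relative_gain: float = 0.20,
    higher_is_better: bool = False,
) -> bool:
    """True if the triggering signal improved by at least 20% since the snapshot."""
    before = json.loads(last_entry_json)["signal_value"]
    if higher_is_better:
        change = current_signal_value - before
    else:
        change = before - current_signal_value
    if before == 0:
        return change > 0
    return change / abs(before) >= min_relative_gain
```

When the check returns False, the engine would declare further adjustment ineffective rather than propose another move in the same direction.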
Benchmark harness (cache-benchmark):
- Add STSb dataset loader (mteb/stsbenchmark-sts, 8,628 pairs, 3 genre
categories, continuous 0-5 scores normalized to 0-1)
- Add SICK dataset loader (mteb/sickr-sts, 9,927 pairs, score-derived
categories, dense [3,4) ambiguous middle band)
- Add --match-threshold CLI option for continuous-score datasets
- Add autotune-full mode: rerank + LLM judge + Monitor-driven autotuning
- Improve judge prompt: semantic equivalence framing instead of Q&A
("are these two texts semantically equivalent?" vs "is this response
an acceptable answer?"), strip "Answer:" prefix from stored responses
- Add timeout retry with exponential backoff on Monitor API calls
- Increase httpx timeout from 10s to 30s for large similarity windows
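For the continuous-score datasets, the labeling step could be sketched like this. It assumes that --match-threshold is compared against the normalized 0-1 score and that the comparison is inclusive; both are assumptions, as are the function names.

```python
def normalize_score(raw: float, max_score: float = 5.0) -> float:
    """Map a 0-5 similarity score onto 0-1, clamping out-of-range values."""
    return min(max(raw / max_score, 0.0), 1.0)


def is_match(raw_score: float, match_threshold: float) -> bool:
    """Label a pair as equivalent when its normalized score reaches the threshold."""
    return normalize_score(raw_score) >= match_threshold
```

For example, with a threshold of 0.8 a pair scored 4.0 is labeled a match and a pair scored 3.0 is not.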
Benchmark results across four datasets (STSb 5K, SICK 9.9K,
SemBenchmarkLmArena 5K, PAWS-Wiki 8K), five thresholds, five modes:
- STSb θ=0.20: +2.8% F1 vs static (loosened 0.20 → 0.22)
- STSb θ=0.40: +8.6% precision vs static (tightened 0.40 → 0.30)
- SICK θ=0.10: +2.1% F1 vs static (loosened 0.10 → 0.145)
- SemBenchmarkLmArena θ=0.40: +2.9% F1 vs static (tightened 0.40 → 0.30)
- PAWS-Wiki: 0% change (wall confirmed, autotuner correctly does nothing)
- Zero degradation at already-optimal thresholds across all datasets
Add docs/packages/semantic-cache-python.md covering betterdb-semantic-cache v0.4.0: installation, quick start, full configuration reference, all framework adapters (LangChain, OpenAI, Anthropic, LlamaIndex, LangGraph), embedding helpers, LLM-as-judge, reranking, cost tracking, threshold effectiveness, batch check, config refresh, discovery, telemetry, and TypeScript interoperability notes.
KIvanow commented May 27, 2026
discovery=DiscoveryOptions(enabled=is_autotune),
# 1s interval: short enough that a cloud Monitor threshold change is visible
# within the next few check() calls even on a fast benchmark run.
config_refresh=ConfigRefreshOptions(enabled=is_autotune, interval_ms=30_000),
Member
Author
This is fine and expected.
…tance
best_distance of 0.0 (identical vectors) is falsy in Python, causing the "or" expression to fall through to the candidates fallback. Use an explicit None check instead.
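The pitfall is easy to reproduce in isolation. In this sketch the function names and the `fallback` parameter are illustrative stand-ins for the adapter's actual expression:

```python
from typing import Optional


def pick_distance_buggy(best_distance: Optional[float], fallback: float) -> float:
    # 0.0 is falsy, so an exact match (distance 0.0) falls through to fallback.
    return best_distance or fallback


def pick_distance_fixed(best_distance: Optional[float], fallback: float) -> float:
    # Only a missing value (None) should trigger the fallback.
    return best_distance if best_distance is not None else fallback
```

With `best_distance=0.0` and `fallback=0.42`, the first version returns 0.42 and the second returns 0.0.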
# Conflicts:
# packages/cache-benchmark/pyproject.toml
# packages/cache-benchmark/src/cache_benchmark/adapters/base.py
# packages/cache-benchmark/src/cache_benchmark/adapters/betterdb.py
# packages/cache-benchmark/src/cache_benchmark/adapters/gptcache_adapter.py
# packages/cache-benchmark/src/cache_benchmark/cli.py
# packages/cache-benchmark/uv.lock
Rewrite README to cover all four datasets (STSb, SICK, SemBenchmarkLmArena, PAWS-Wiki), five modes (bare, local, full, autotune, autotune-full), usage examples, output file descriptions, and CLI reference.
… to 30s
The code sets interval_ms=30_000 (30 seconds) but five comments and two user-facing feature strings said "1s interval". Fix all to say "30s".
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 6436ea9.
1. Dispatcher signal classification diverges from recommendation engine: computeMetricsSnapshot used simplified conditions (uncertainHitRate > 0.2) that omit the inner guards added to the recommendation engine (uncertainFractionOfAll > 0.15, hitRate > 0.8). The stored signal could differ from what actually triggered the recommendation, causing checkLastOutcome to compare the wrong metric.
2. Tighten recommendation can produce a loosening threshold: In the distant_hits path, target = p75 + uncertainty_band * 0.3 can exceed the current threshold when p75 is close to it. Clamp with Math.min(threshold, ...) so a tighten never produces a value above the current threshold.
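The clamp in the second fix can be illustrated with a small Python sketch (the engine itself is TypeScript and uses Math.min). The function name is hypothetical; the formula and the clamp are as described above, and thresholds are treated as distances, so a lower value is tighter.

```python
def tighten_target(threshold: float, p75: float, uncertainty_band: float) -> float:
    """Compute the distant_hits tighten target, never above the current threshold."""
    candidate = p75 + uncertainty_band * 0.3
    return min(threshold, candidate)
```

With threshold 0.30, p75 0.29, and a band of 0.05, the unclamped candidate is about 0.305, which would loosen; the clamp holds the result at 0.30.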
6904255 to ede7bbe
… guards
The test used extreme data (all 50 hits at exactly score 0.09) which triggered the recall-cost guard (100% of hits would be lost by tightening). Update test data to a realistic distribution: 85 strong hits well below the proposed threshold, 10 uncertain hits near the boundary, 5 misses. This exercises the tighten signal while passing all safety guards.
Also add lrange stub to StubValkey for tuning history reads.
Summary
Adds outcome-aware self-tuning to the Monitor's threshold recommendation engine and benchmarks it on four public datasets (STSb, SICK, SemBenchmarkLmArena, PAWS-Wiki). The autotuner loosens when too strict, tightens when too loose, and does nothing when the threshold is already optimal — with zero degradation across every configuration tested.
Changes
Benchmark results (bare → autotune):
Checklist
roborev review --branch or /roborev-review-branch in Claude Code (internal)
Note
Medium Risk
Changes proprietary threshold recommendation and apply paths that affect production cache behavior; benchmark and docs changes are lower risk but autotune now depends on Monitor API credentials and external calls.
Overview
Strengthens Monitor-driven semantic cache threshold tuning with richer signals (uncertain hits, distant weak hits, near-misses, low hit rate), a recall-cost guard, outcome tracking against prior adjustments, and velocity/oscillation dampening. Approved changes now append tuning history with metric snapshots in Valkey, and live config reads prefer the threshold field in __config.
Expands the cache-benchmark harness: STSb and SICK loaders, --match-threshold, autotune-full (rerank + judge + Monitor autotune), and BetterDB autotune wired to the Monitor recommend → propose → approve API (replacing in-process tuning), plus judge prompt tweaks and Monitor API retries. Adds Python semantic cache package documentation and minor benchmark/README/lockfile updates.
Reviewed by Cursor Bugbot for commit 5af5e5f.