
Feat/benchmarking #223

Merged
KIvanow merged 13 commits into master from feat/benchmarking
May 27, 2026
Conversation

@KIvanow

@KIvanow KIvanow commented May 27, 2026

Member

Summary

Adds outcome-aware self-tuning to the Monitor's threshold recommendation engine and benchmarks it on four public datasets (STSb, SICK, SemBenchmarkLmArena, PAWS-Wiki). The autotuner loosens when too strict, tightens when too loose, and does nothing when the threshold is already optimal — with zero degradation across every configuration tested.

Changes

  • Add signal quality guards to the recommendation engine: require the uncertain-hit fraction of all operations (not just hits) to exceed 15% before tightening; require hit rate > 80% before treating distant-hits as a tighten signal
  • Add recall-cost guard: block tighten if estimated hit loss exceeds 15% of current hits
  • Add outcome tracking: after each adjustment, compare current signal metrics against the snapshot from the previous adjustment; stop if the triggering signal did not improve by 20%
  • Add velocity dampening: progressive step-size reduction for consecutive same-direction adjustments, with oscillation detection
  • Write tuning history with metrics snapshots in the apply dispatcher to enable outcome tracking
  • Add STSb dataset loader (8,628 pairs, 3 genre categories, continuous scores)
  • Add SICK dataset loader (9,927 pairs, score-derived categories)
  • Add --match-threshold CLI option for continuous-score datasets
  • Add autotune-full mode combining rerank + LLM judge + Monitor-driven autotuning
  • Improve judge prompt: semantic equivalence framing instead of Q&A, strip "Answer:" prefix
  • Add timeout retry with exponential backoff on Monitor API calls
Benchmark results (bare → autotune):

  • STSb θ=0.20: +2.8% F1 (loosened 0.20 → 0.22)
  • STSb θ=0.40: +8.6% precision (tightened 0.40 → 0.30)
  • SICK θ=0.10: +2.1% F1 (loosened 0.10 → 0.145)
  • SemBenchmarkLmArena θ=0.40: +2.9% F1 (tightened 0.40 → 0.30)
  • PAWS-Wiki: 0% change (wall confirmed, autotuner correctly does nothing)

Checklist

  • Unit / integration tests added
  • Docs added / updated
  • Roborev review passed — run roborev review --branch or /roborev-review-branch in Claude Code (internal)
  • Competitive analysis done / discussed (internal)
  • Blog post about it discussed (internal)

Note

Medium Risk
Changes proprietary threshold recommendation and apply paths that affect production cache behavior; benchmark and docs changes are lower risk but autotune now depends on Monitor API credentials and external calls.

Overview
Strengthens Monitor-driven semantic cache threshold tuning with richer signals (uncertain hits, distant weak hits, near-misses, low hit rate), a recall-cost guard, outcome tracking against prior adjustments, and velocity/oscillation dampening. Approved changes now append tuning history with metric snapshots in Valkey, and live config reads prefer the threshold field in __config.

Expands the cache-benchmark harness: STSb and SICK loaders, --match-threshold, autotune-full (rerank + judge + Monitor autotune), and BetterDB autotune wired to the Monitor recommend → propose → approve API (replacing in-process tuning), plus judge prompt tweaks and Monitor API retries. Adds Python semantic cache package documentation and minor benchmark/README/lockfile updates.

Reviewed by Cursor Bugbot for commit 5af5e5f.

KIvanow and others added 6 commits May 25, 2026 02:04
…profiler

New package packages/cache-benchmark implementing a replay-based benchmark
harness for comparing semantic cache implementations (BetterDB, RedisVL,
GPTCache) against labeled query-pair datasets (vCache SemBenchmarkLmArena,
PAWS-Wiki).

Key features:
- Three-mode benchmark: bare (cosine only), local (native rerankers, no
  external APIs), full (LLM judge + Cohere rerank with API keys)
- BetterDB adapter: cosine threshold + keyword-overlap rerank + OpenAI
  gpt-4o-mini judge gate on uncertain hits (full mode)
- RedisVL adapter: raw FT.SEARCH workaround for Valkey Search compatibility
  (VectorRangeQuery unsupported); native check() path for Redis Stack
- GPTCache adapter: FAISS+SQLite with SBERT crossencoder rerank (local) or
  Cohere Rerank API (full); GPTCACHE_LOCAL_EVALUATOR env var for alternatives
- Harness stores informative dummy responses ("Answer: {prompt}") so
  LLM-as-judge adapters receive meaningful text matching production conditions
- --debug-judge flag: wires gpt-4o-mini onto uncertain hits and writes every
  invocation to a JSONL log for investigation
- Dataset loaders for vCache/SemBenchmarkLmArena (ID_Set equivalence field,
  train split) and google-research-datasets/paws (labeled_final)
- Metrics: precision, recall, F1, FPR, p50/p95/p99 latency
- Threshold sweep CLI with per-(adapter, mode, dataset, threshold) JSON output
  and summary files
- Markdown report generator with per-adapter F1 sparklines
- Standalone latency profiler (scripts/latency_profile.py) with embed/
  network/parse breakdown and BetterDB embedding cache hypothesis test
- vcache_lmarena.py: fix module docstring (wrong org name, wrong field name);
  move random import to top level; remove dead guard (random.sample already
  guarantees distinct elements)
- harness.py: remove unused populate_with_dummy_responses parameter; move
  asyncio and tqdm imports to module level
- redisvl_adapter.py: fix malformed module docstring (# lines inside """);
  remove unused cached_prompt variable; fix double-embed in _check_native by
  passing pre-computed vector to cache.check() instead of re-embedding; remove
  unused PROMPT_FIELD_NAME import
- gptcache_adapter.py: update stale comment in _check_full_cohere (harness now
  stores informative responses, not hashes); convert __main__ string literal to
  proper comments
- validate.py: fix default port in __main__ from 6379 to 6381
- latency_profile.py: wire --profile flag to gate the summary section
…is Stack 7.4.7)

Previous profile used redis/redis-stack-server:latest (Redis 7.4.7,
sha256:798ab84d) which showed native RedisVL at 6.30ms p50 vs the Valkey
workaround at 3.46ms — a misleading 84% gap driven by VectorRangeQuery
overhead in Redis Stack's older search module.

Re-tested against redis:latest (Redis 8.6.3, Search 8.6.7): native RedisVL
on Redis 8 is 3.44ms p50, statistically identical to the Valkey workaround
(3.46ms, delta = 0.02ms). The latency is now dominated entirely by SBERT
embedding compute (~3ms), not search internals.

Changes:
- redisvl_adapter.py: add "redis-os" as the canonical backend value for
  Redis 8 Open Source; keep "redis-stack" as a backward-compatible alias;
  switch env var to REDIS_OS_URL (falls back to REDIS_STACK_URL); update
  default port from 6383 to 6384; update error message and docstring
- latency_profile.py: document previous Redis Stack version in module
  docstring; hardcode legacy Redis Stack 7.4.7 results as LEGACY_REDIS_STACK_RESULT
  so both rows appear in the table without re-running; update backend from
  "redis-stack" to "redis-os"; include Redis version string in row label
  dynamically; update summary to compare Redis Stack vs Redis 8 and render
  the blog narrative verdict automatically
- validate.py: TARGETS_FILE used 4 .parent calls instead of 3, resolving
  to packages/validate_targets.yaml (does not exist) instead of
  packages/cache-benchmark/validate_targets.yaml. Always caused
  FileNotFoundError when running the validation gate.

- harness.py: _with_retry called adapter.close() + adapter.initialize()
  on transient errors, silently wiping all stored benchmark entries.
  RedisVL.close() drops the FT index and deletes all keys;
  GPTCache.close() removes the FAISS data directory. A single timeout
  mid-store phase would reset the cache to empty and corrupt
  precision/recall/F1 for the rest of that run. Removed the
  close/initialize from retry; the underlying client reconnects
  automatically on the next command after the sleep.
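
The TARGETS_FILE fix described above comes down to counting `.parent` hops. The sketch below reproduces the two resolutions with `pathlib`; the location of validate.py is an assumption inferred from the other paths in this PR, not something the commit message states.

```python
from pathlib import PurePosixPath

# Assumed location of validate.py (inferred from sibling paths in this PR).
validate_py = PurePosixPath(
    "packages/cache-benchmark/src/cache_benchmark/validate.py"
)

# Three hops: cache_benchmark -> src -> cache-benchmark (the package root).
three_up = validate_py.parent.parent.parent
# A fourth hop leaves the package and lands in packages/.
four_up = three_up.parent

print(three_up / "validate_targets.yaml")
# packages/cache-benchmark/validate_targets.yaml
print(four_up / "validate_targets.yaml")
# packages/validate_targets.yaml
```
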
The check() method only matched "redis-stack" for the native code path,
but __init__ treats both "redis-os" and "redis-stack" as native backends.
This caused backend="redis-os" to silently run through the Valkey
workaround, producing misleading comparison results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation with outcome tracking

Add four safety mechanisms to the Monitor's threshold recommendation engine
that turn the naive "see signal, adjust threshold" loop into a system that
either improves or matches bare performance across every configuration tested.

Recommendation engine (cache-readonly.service.ts):
- Signal quality guards: require uncertain-hit fraction of ALL operations
  (not just hits) to exceed 15% before tightening; require hit rate > 80%
  before treating distant-hits as a tighten signal
- Recall-cost guard: block tighten if estimated hit loss exceeds 15% of
  current hits — the TP/FP distributions overlap too much for threshold
  adjustment alone
- Outcome tracking: after each adjustment, compare current signal metrics
  against the snapshot from the previous adjustment; if the triggering
  signal did not improve by 20%, declare further adjustment ineffective
- Velocity dampening: progressive step-size reduction (1.0, 0.67, 0.50,
  0.40, 0.33) for consecutive same-direction adjustments; cap at 5
  consecutive moves; detect and break oscillation after 3 direction flips
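
As an illustration of the velocity dampening bullet, the following Python sketch applies the listed factors to a base step. The function name, the string encoding of directions, and the choice to count flips over the whole history are assumptions; only the factor table, the cap of 5, and the limit of 3 flips come from the commit message.

```python
from typing import List, Optional

STEP_FACTORS = (1.0, 0.67, 0.50, 0.40, 0.33)  # from the commit message
MAX_CONSECUTIVE_MOVES = 5
MAX_DIRECTION_FLIPS = 3


def dampened_step(
    base_step: float, history: List[str], direction: str
) -> Optional[float]:
    """Scale the next step, or return None when tuning should stop.

    history holds past directions, oldest first ("tighten" or "loosen").
    """
    # Oscillation: count direction changes, including the proposed move.
    moves = history + [direction]
    flips = sum(1 for prev, cur in zip(moves, moves[1:]) if prev != cur)
    if flips >= MAX_DIRECTION_FLIPS:
        return None

    # Streak: how many of the most recent moves share this direction.
    streak = 0
    for past in reversed(history):
        if past != direction:
            break
        streak += 1
    if streak >= MAX_CONSECUTIVE_MOVES:
        return None

    return base_step * STEP_FACTORS[streak]
```

The listed factors are close to 2/(n+1) for the n-th consecutive move, though the commit message does not say how they were derived.
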

Apply dispatcher (cache-apply.dispatcher.ts):
- Write tuning history with metrics snapshot to {prefix}:__tuning_history
  on every threshold proposal apply, enabling outcome tracking
- Compute signal metrics from similarity window at apply time so the
  recommendation engine can compare before/after without cross-service deps
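
One way to picture the before/after comparison that the tuning history enables is the Python sketch below. The entry layout, the field names, and the reading of "improve by 20%" as a relative drop in a lower-is-better metric are all assumptions; the PR only states that a metrics snapshot is stored with each applied proposal and compared on the next recommendation.

```python
import json
from typing import Dict

MIN_RELATIVE_IMPROVEMENT = 0.20  # "did not improve by 20%" in the PR text


def make_history_entry(
    old_threshold: float,
    new_threshold: float,
    signal: str,
    metrics: Dict[str, float],
) -> str:
    """Serialize one applied adjustment with its metrics snapshot.

    Hypothetical layout; the real entry format is not shown in the PR.
    """
    return json.dumps(
        {"from": old_threshold, "to": new_threshold,
         "signal": signal, "metrics": metrics}
    )


def last_adjustment_helped(entry_json: str, current: Dict[str, float]) -> bool:
    """Compare the triggering signal now against its stored snapshot.

    Assumes the signal is a rate where lower is better.
    """
    entry = json.loads(entry_json)
    signal = entry["signal"]
    before = entry["metrics"][signal]
    after = current[signal]
    if before <= 0:
        return False  # nothing measurable to improve on
    return (before - after) / before >= MIN_RELATIVE_IMPROVEMENT
```

Storing the name of the triggering signal next to the snapshot is what lets the comparison look at the right metric.
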

Benchmark harness (cache-benchmark):
- Add STSb dataset loader (mteb/stsbenchmark-sts, 8,628 pairs, 3 genre
  categories, continuous 0-5 scores normalized to 0-1)
- Add SICK dataset loader (mteb/sickr-sts, 9,927 pairs, score-derived
  categories, dense [3,4) ambiguous middle band)
- Add --match-threshold CLI option for continuous-score datasets
- Add autotune-full mode: rerank + LLM judge + Monitor-driven autotuning
- Improve judge prompt: semantic equivalence framing instead of Q&A
  ("are these two texts semantically equivalent?" vs "is this response
  an acceptable answer?"), strip "Answer:" prefix from stored responses
- Add timeout retry with exponential backoff on Monitor API calls
- Increase httpx timeout from 10s to 30s for large similarity windows
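
The timeout retry with exponential backoff can be sketched as below. This is a generic stdlib-only illustration, not the harness code: the function name, retry count, and base delay are assumptions, and the built-in `TimeoutError` stands in for whatever timeout exception the HTTP client raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    *,
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on timeout with delays of base_delay * 2**attempt."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == retries:
                raise  # out of retries: surface the timeout to the caller
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")  # loop always returns or raises
```

Note that the retry only sleeps and calls again; it does not close or reinitialize anything, which matches the earlier harness fix that removed the reset from the retry path.
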

Benchmark results across four datasets (STSb 5K, SICK 9.9K,
SemBenchmarkLmArena 5K, PAWS-Wiki 8K), five thresholds, five modes:
- STSb θ=0.20: +2.8% F1 vs static (loosened 0.20 → 0.22)
- STSb θ=0.40: +8.6% precision vs static (tightened 0.40 → 0.30)
- SICK θ=0.10: +2.1% F1 vs static (loosened 0.10 → 0.145)
- SemBenchmarkLmArena θ=0.40: +2.9% F1 vs static (tightened 0.40 → 0.30)
- PAWS-Wiki: 0% change (wall confirmed, autotuner correctly does nothing)
- Zero degradation at already-optimal thresholds across all datasets
discovery=DiscoveryOptions(enabled=is_autotune),
# 1s interval: short enough that a cloud Monitor threshold change is visible
# within the next few check() calls even on a fast benchmark run.
config_refresh=ConfigRefreshOptions(enabled=is_autotune, interval_ms=30_000),

Member Author


This is fine and expected.

Comment thread packages/cache-benchmark/src/cache_benchmark/adapters/gptcache_adapter.py Outdated
Add docs/packages/semantic-cache-python.md covering betterdb-semantic-cache
v0.4.0: installation, quick start, full configuration reference, all
framework adapters (LangChain, OpenAI, Anthropic, LlamaIndex, LangGraph),
embedding helpers, LLM-as-judge, reranking, cost tracking, threshold
effectiveness, batch check, config refresh, discovery, telemetry, and
TypeScript interoperability notes.

@BetterDB-inc BetterDB-inc deleted a comment from cursor Bot May 27, 2026
KIvanow added 3 commits May 27, 2026 11:56
…tance

best_distance of 0.0 (identical vectors) is falsy in Python, causing the
or expression to fall through to the candidates fallback. Use an explicit
None check instead.
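
The pitfall is easy to reproduce in isolation. In this sketch the function names and the fallback (the minimum of a candidate list) are hypothetical stand-ins, since the commit message does not show the adapter code.

```python
from typing import List, Optional


def pick_distance_buggy(best: Optional[float], candidates: List[float]) -> float:
    # 0.0 is falsy, so an exact match falls through to the fallback.
    return best or min(candidates)


def pick_distance_fixed(best: Optional[float], candidates: List[float]) -> float:
    # Only a missing value (None) should trigger the fallback.
    return best if best is not None else min(candidates)


print(pick_distance_buggy(0.0, [0.3, 0.5]))   # 0.3 (exact match lost)
print(pick_distance_fixed(0.0, [0.3, 0.5]))   # 0.0
print(pick_distance_fixed(None, [0.3, 0.5]))  # 0.3
```
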
# Conflicts:
#	packages/cache-benchmark/pyproject.toml
#	packages/cache-benchmark/src/cache_benchmark/adapters/base.py
#	packages/cache-benchmark/src/cache_benchmark/adapters/betterdb.py
#	packages/cache-benchmark/src/cache_benchmark/adapters/gptcache_adapter.py
#	packages/cache-benchmark/src/cache_benchmark/cli.py
#	packages/cache-benchmark/uv.lock
Rewrite README to cover all four datasets (STSb, SICK, SemBenchmarkLmArena,
PAWS-Wiki), five modes (bare, local, full, autotune, autotune-full), usage
examples, output file descriptions, and CLI reference.
Comment thread packages/cache-benchmark/src/cache_benchmark/adapters/betterdb.py
… to 30s

The code sets interval_ms=30_000 (30 seconds) but five comments and two
user-facing feature strings said "1s interval". Fix all to say "30s".

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 6436ea9.

Comment thread proprietary/cache-proposals/cache-readonly.service.ts Outdated
Comment thread proprietary/cache-proposals/cache-apply.dispatcher.ts
1. Dispatcher signal classification diverges from recommendation engine:
   computeMetricsSnapshot used simplified conditions (uncertainHitRate > 0.2)
   that omit the inner guards added to the recommendation engine
   (uncertainFractionOfAll > 0.15, hitRate > 0.8). The stored signal could
   differ from what actually triggered the recommendation, causing
   checkLastOutcome to compare the wrong metric.

2. Tighten recommendation can produce a loosening threshold:
   In the distant_hits path, target = p75 + uncertainty_band * 0.3 can
   exceed the current threshold when p75 is close to it. Clamp with
   Math.min(threshold, ...) so a tighten never produces a value above
   the current threshold.
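
The clamp in the second fix can be shown with a few lines of Python. The formula and the `min` clamp come from the commit message (which uses `Math.min` in the TypeScript service); the function name and the sample numbers are illustrative only.

```python
def distant_hits_tighten_target(
    threshold: float, p75: float, uncertainty_band: float
) -> float:
    """Target for a tighten on the distant_hits path, clamped to the current threshold."""
    raw = p75 + uncertainty_band * 0.3
    # Thresholds are distances here, so tightening means lowering the value.
    # Without the clamp, raw can land above the current threshold and loosen it.
    return min(threshold, raw)


# p75 close to the threshold: the raw target (about 0.305) would loosen, so it is clamped.
print(distant_hits_tighten_target(0.30, 0.29, 0.05))  # 0.3
```
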
@KIvanow KIvanow force-pushed the feat/benchmarking branch from 6904255 to ede7bbe Compare May 27, 2026 09:51
… guards

The test used extreme data (all 50 hits at exactly score 0.09) which
triggered the recall-cost guard (100% of hits would be lost by tightening).
Update test data to a realistic distribution: 85 strong hits well below
the proposed threshold, 10 uncertain hits near the boundary, 5 misses.
This exercises the tighten signal while passing all safety guards.

Also add lrange stub to StubValkey for tuning history reads.
@KIvanow KIvanow merged commit 1dcc2f4 into master May 27, 2026
3 checks passed
@KIvanow KIvanow deleted the feat/benchmarking branch May 27, 2026 10:10
@github-actions github-actions Bot locked and limited conversation to collaborators May 27, 2026