[Store] L2→L1 promotion background retry for hot LOCAL_DISK-only keys (V1.1) by Srinivasoo7 · Pull Request #2690 · kvcache-ai/Mooncake

Srinivasoo7 · 2026-06-30T23:03:08Z

Implements RFC #2233.

Summary

V1 promotion-on-hit (#2071) is Get-driven: when a hot LOCAL_DISK-only key passes the CountMinSketch frequency gate, the master queues an async promotion task. However, transient blockers — DRAM above watermark, promotion queue at cap, recoverable push failure — cause the key to miss its window. It only gets another chance on the next Get, which may not arrive promptly for bursty workloads where access patterns have subsided but the key should still be in DRAM.

V1.1 adds a master-side background scheduler that re-evaluates rejected promotion candidates without waiting for another Get.

Design

Get path stays non-blocking. TryPushPromotionQueue now returns a typed PromotionQueueResult instead of void. On a transient rejection (watermark, queue cap, recoverable push failure) it records/updates a PromotionCandidate under the key's TenantState.
PromotionCandidate stores: sketch_score (frequency at record time), first_seen, last_seen (refreshed on each rejection — extends effective TTL while the key stays hot), retry_after (exponential backoff), retry_count, last_reason, last_error. Uses steady_clock for monotonic interval tracking.
Bounded candidate set: global count kPromotionCandidateLimit (50 000), per-candidate kPromotionCandidateMaxRetries (8) and kPromotionCandidateTtl (60 s from last_seen).
Background scheduler: RunPromotionCandidateRetry runs in the existing eviction thread every 10 ms (early-exit when count == 0). It advances a ring-buffer cursor over kPromotionRetryShardBatch (64) shards per tick, collects up to kPromotionRetryBatchSize (128) due candidates, then calls TryPushPromotionQueue(..., record_candidate=false) to re-enter the V1 admission path.
Permanent resolution (object gone, MEMORY replica appeared, task already in flight) erases the candidate immediately under the shard lock.
Metadata reload calls ClearCandidatesForReload from both MetadataSerializer::Deserialize and Reset to reset all transient candidate state.

Metrics

All six counters are wired into MasterMetricManager and exposed via Prometheus:

Prometheus counter	Meaning
`master_promotion_candidate_recorded_total`	New candidate entries created
`master_promotion_candidate_admitted_total`	Candidates successfully promoted by background scheduler
`master_promotion_candidate_admission_rejected_total`	Candidates that hit a gate again on retry
`master_promotion_candidate_expired_evaluated_total`	Candidates exhausted after retries or TTL — were evaluated at least once
`master_promotion_candidate_expired_unevaluated_total`	Candidates that aged out before the scheduler ever reached them (scan budget too small)
`master_promotion_candidate_dropped_limit_total`	Candidates dropped at record time because the global count limit was reached

The expired_unevaluated / expired_evaluated distinction is implemented by checking retry_count == 0 at expiry time in the scan loop: a candidate with retry_count == 0 was never processed by the scheduler before its TTL elapsed.

Non-goals (unchanged from V1)

No synchronous promotion on Get
No proactive scan of cold keys — only Get-observed candidates
No LOCAL_DISK deletion after promotion
No new client-facing RPCs or heartbeat protocol changes

Test plan

Five new TEST_F cases added to promotion_on_hit_test.cpp:

RetryCandidate_WatermarkRejectionRecordsCandidate — watermark rejection records a candidate; a second Get deduplicates (count stays 1)
RetryCandidate_ExhaustedAfterMaxRetries — candidate is erased and expired_evaluated is incremented after kPromotionCandidateMaxRetries scans
RetryCandidate_ObjectRemovedMidRetry — object removed while candidate is pending; next scan cleans up with no spurious admitted increment
RetryCandidate_MultipleKeysTracked — multiple LOCAL_DISK-only keys tracked independently, recorded counter matches key count
RetryCandidate_ClearOnReload — ClearCandidatesForReload resets all candidate state and the global atomic counter to zero

cmake --build build-ccf-ninja --target promotion_on_hit_test -j4
./build-ccf-ninja/mooncake-store/tests/promotion_on_hit_test --gtest_filter='*RetryCandidate*' --gtest_brief=1

🤖 Generated with Claude Code

… (V1.1) Implements RFC kvcache-ai#2233: bounded background retry/admission layer on top of the V1 promotion-on-hit path (kvcache-ai#2071). ## Motivation V1 promotion-on-hit is Get-driven: a hot LOCAL_DISK-only key queues a promotion when it passes the CountMinSketch frequency gate. However, transient blockers (DRAM above watermark, promotion queue at cap, holder push failure) cause the key to miss its promotion window. It only retries on the next Get, which may not arrive promptly for bursty workloads. V1.1 adds a master-side background scheduler that re-evaluates rejected candidates without waiting for another Get. ## Design - `Get` path remains non-blocking. `TryPushPromotionQueue` now returns a typed `PromotionQueueResult` instead of `void`. On transient rejection (watermark, queue cap, recoverable push failure), it records/updates a `PromotionCandidate` entry under the key's `TenantState`. - `PromotionCandidate` stores: `sketch_score`, `first_seen`, `last_seen` (reset on each rejection to extend TTL while key stays hot), `retry_after` (exponential backoff), `retry_count`, `last_reason`, `last_error`. Uses `steady_clock` for monotonic interval tracking. - Global candidate set is bounded by `kPromotionCandidateLimit` (50 000), per-candidate by `kPromotionCandidateMaxRetries` (8) and `kPromotionCandidateTtl` (60 s from `last_seen`). - `RunPromotionCandidateRetry` runs in the eviction thread every 10 ms (early-exit when count == 0). It scans `kPromotionRetryShardBatch` (64) shards per tick via a ring-buffer cursor, collecting up to `kPromotionRetryBatchSize` (128) due candidates, then calls `TryPushPromotionQueue(..., record_candidate=false)` to re-enter the V1 admission path without re-recording candidates. - Candidate state is cleared on metadata snapshot reload/reset via `ClearCandidatesForReload`, hooked into `MetadataSerializer::Deserialize` and `Reset`. - Permanent resolution (object gone, MEMORY replica appeared, task already in flight) erases the candidate immediately. ## Metrics (all wired into MasterMetricManager / Prometheus) | Counter | Meaning | |---|---| | `master_promotion_candidate_recorded_total` | New candidate entries created | | `master_promotion_candidate_admitted_total` | Candidates promoted by background scheduler | | `master_promotion_candidate_admission_rejected_total` | Candidates that hit a gate again on retry | | `master_promotion_candidate_expired_evaluated_total` | Candidates exhausted after retries or TTL post-evaluation | | `master_promotion_candidate_expired_unevaluated_total` | Candidates aged out before scheduler ever reached them (scan budget too small) | | `master_promotion_candidate_dropped_limit_total` | Candidates dropped at record time due to global count limit | The `expired_unevaluated` / `expired_evaluated` distinction is implemented by checking `retry_count == 0` at expiry time in the scan loop. ## Non-goals (unchanged from V1) - No synchronous promotion on Get - No proactive scan of cold keys — only Get-observed candidates - No LOCAL_DISK deletion after promotion - No new client-facing RPCs or heartbeat protocol changes ## Tests Five new `TEST_F` cases in `promotion_on_hit_test.cpp`: - Watermark rejection records a candidate; duplicate Get does not add a second entry - Candidate exhausted and removed after `kPromotionCandidateMaxRetries` scans - Object removed while candidate is pending → next scan cleans up, no spurious admit metric - Multiple keys tracked independently with correct per-key counts - `ClearCandidatesForReload` resets all candidate state and the global atomic counter Closes kvcache-ai#2233

Srinivasoo7 · 2026-06-30T23:04:26Z

@stmatengss @LujhCoconut
PTAL

Thanks

gemini-code-assist

Code Review

This pull request introduces a promotion retry candidate mechanism to handle transient promotion failures (such as watermark or queue capacity rejections) by tracking candidates, applying exponential backoff, and retrying them in a background thread. It also adds corresponding metrics and unit tests. The review feedback identifies two important issues: a thread-safety violation in ClearCandidatesForReload() where metadata shards are accessed without acquiring write locks, and a logic issue in RecordOrUpdateCandidate() where retry_count is not reset to 0 upon a fresh hit, which can cause hot keys to be prematurely discarded.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-30T23:05:45Z

+void MasterService::ClearCandidatesForReload() {
+    for (auto& shard : metadata_shards_) {
+        for (auto& [tenant_id, tenant_state] : shard.tenants) {
+            (void)tenant_id;
+            tenant_state.promotion_candidates.clear();
+        }
+    }
+    promotion_candidate_count_.store(0, std::memory_order_relaxed);
+    promotion_retry_cursor_.store(0, std::memory_order_relaxed);
+}


In ClearCandidatesForReload(), the code iterates over metadata_shards_ and accesses shard.tenants directly without acquiring the shard's write lock (shard.mutex). This introduces a thread-safety violation (data race) if other threads are concurrently accessing or modifying the metadata shards. Since this method can be called during metadata reload or reset operations on a running service, we should safely acquire the write lock for each shard using MetadataShardAccessorRW.

void MasterService::ClearCandidatesForReload() { for (size_t i = 0; i < kNumShards; ++i) { MetadataShardAccessorRW shard(this, i); for (auto& [tenant_id, tenant_state] : shard->tenants) { (void)tenant_id; tenant_state.promotion_candidates.clear(); } } promotion_candidate_count_.store(0, std::memory_order_relaxed); promotion_retry_cursor_.store(0, std::memory_order_relaxed); }

gemini-code-assist · 2026-06-30T23:05:45Z

+    auto it = tenant_state.promotion_candidates.find(key);
+    if (it != tenant_state.promotion_candidates.end()) {
+        // Update existing entry: refresh last_seen, reset retry_after.
+        it->second.last_seen = now;
+        it->second.last_reason = reason;
+        it->second.last_error = last_error;
+        if (sketch_score > it->second.sketch_score) {
+            it->second.sketch_score = sketch_score;
+        }
+        it->second.retry_after = now;
+        return;
+    }


When a candidate already exists in RecordOrUpdateCandidate(), the code updates its last_seen and retry_after timestamps but does not reset retry_count to 0. This means that even if a key is actively and frequently accessed via Get, its retry_count will continue to increment on each background retry until it hits kPromotionCandidateMaxRetries (8) and gets prematurely erased. Resetting retry_count to 0 on a fresh Get hit ensures that hot keys get a fresh budget of retries, which aligns with the design goal of extending the effective TTL while the key stays hot.

auto it = tenant_state.promotion_candidates.find(key); if (it != tenant_state.promotion_candidates.end()) { // Update existing entry: refresh last_seen, reset retry_after and retry_count. it->second.last_seen = now; it->second.last_reason = reason; it->second.last_error = last_error; if (sketch_score > it->second.sketch_score) { it->second.sketch_score = sketch_score; } it->second.retry_after = now; it->second.retry_count = 0; return; }

Srinivasoo7 requested review from XucSh, YiXR, stmatengss and ykwd as code owners June 30, 2026 23:03

github-actions Bot added run-ci Store labels Jun 30, 2026

gemini-code-assist Bot reviewed Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Store] L2→L1 promotion background retry for hot LOCAL_DISK-only keys (V1.1)#2690

[Store] L2→L1 promotion background retry for hot LOCAL_DISK-only keys (V1.1)#2690
Srinivasoo7 wants to merge 1 commit into
kvcache-ai:mainfrom
Srinivasoo7:store/promotion-retry-v1.1

Srinivasoo7 commented Jun 30, 2026

Uh oh!

Srinivasoo7 commented Jun 30, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 30, 2026

Uh oh!

gemini-code-assist Bot Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Srinivasoo7 commented Jun 30, 2026

Summary

Design

Metrics

Non-goals (unchanged from V1)

Test plan

Uh oh!

Srinivasoo7 commented Jun 30, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant