[Store] L2→L1 promotion background retry for hot LOCAL_DISK-only keys (V1.1)#2690
[Store] L2→L1 promotion background retry for hot LOCAL_DISK-only keys (V1.1)#2690Srinivasoo7 wants to merge 1 commit into
Conversation
… (V1.1) Implements RFC kvcache-ai#2233: bounded background retry/admission layer on top of the V1 promotion-on-hit path (kvcache-ai#2071). ## Motivation V1 promotion-on-hit is Get-driven: a hot LOCAL_DISK-only key queues a promotion when it passes the CountMinSketch frequency gate. However, transient blockers (DRAM above watermark, promotion queue at cap, holder push failure) cause the key to miss its promotion window. It only retries on the next Get, which may not arrive promptly for bursty workloads. V1.1 adds a master-side background scheduler that re-evaluates rejected candidates without waiting for another Get. ## Design - `Get` path remains non-blocking. `TryPushPromotionQueue` now returns a typed `PromotionQueueResult` instead of `void`. On transient rejection (watermark, queue cap, recoverable push failure), it records/updates a `PromotionCandidate` entry under the key's `TenantState`. - `PromotionCandidate` stores: `sketch_score`, `first_seen`, `last_seen` (reset on each rejection to extend TTL while key stays hot), `retry_after` (exponential backoff), `retry_count`, `last_reason`, `last_error`. Uses `steady_clock` for monotonic interval tracking. - Global candidate set is bounded by `kPromotionCandidateLimit` (50 000), per-candidate by `kPromotionCandidateMaxRetries` (8) and `kPromotionCandidateTtl` (60 s from `last_seen`). - `RunPromotionCandidateRetry` runs in the eviction thread every 10 ms (early-exit when count == 0). It scans `kPromotionRetryShardBatch` (64) shards per tick via a ring-buffer cursor, collecting up to `kPromotionRetryBatchSize` (128) due candidates, then calls `TryPushPromotionQueue(..., record_candidate=false)` to re-enter the V1 admission path without re-recording candidates. - Candidate state is cleared on metadata snapshot reload/reset via `ClearCandidatesForReload`, hooked into `MetadataSerializer::Deserialize` and `Reset`. - Permanent resolution (object gone, MEMORY replica appeared, task already in flight) erases the candidate immediately. ## Metrics (all wired into MasterMetricManager / Prometheus) | Counter | Meaning | |---|---| | `master_promotion_candidate_recorded_total` | New candidate entries created | | `master_promotion_candidate_admitted_total` | Candidates promoted by background scheduler | | `master_promotion_candidate_admission_rejected_total` | Candidates that hit a gate again on retry | | `master_promotion_candidate_expired_evaluated_total` | Candidates exhausted after retries or TTL post-evaluation | | `master_promotion_candidate_expired_unevaluated_total` | Candidates aged out before scheduler ever reached them (scan budget too small) | | `master_promotion_candidate_dropped_limit_total` | Candidates dropped at record time due to global count limit | The `expired_unevaluated` / `expired_evaluated` distinction is implemented by checking `retry_count == 0` at expiry time in the scan loop. ## Non-goals (unchanged from V1) - No synchronous promotion on Get - No proactive scan of cold keys — only Get-observed candidates - No LOCAL_DISK deletion after promotion - No new client-facing RPCs or heartbeat protocol changes ## Tests Five new `TEST_F` cases in `promotion_on_hit_test.cpp`: - Watermark rejection records a candidate; duplicate Get does not add a second entry - Candidate exhausted and removed after `kPromotionCandidateMaxRetries` scans - Object removed while candidate is pending → next scan cleans up, no spurious admit metric - Multiple keys tracked independently with correct per-key counts - `ClearCandidatesForReload` resets all candidate state and the global atomic counter Closes kvcache-ai#2233
|
@stmatengss @LujhCoconut Thanks |
There was a problem hiding this comment.
Code Review
This pull request introduces a promotion retry candidate mechanism to handle transient promotion failures (such as watermark or queue capacity rejections) by tracking candidates, applying exponential backoff, and retrying them in a background thread. It also adds corresponding metrics and unit tests. The review feedback identifies two important issues: a thread-safety violation in ClearCandidatesForReload() where metadata shards are accessed without acquiring write locks, and a logic issue in RecordOrUpdateCandidate() where retry_count is not reset to 0 upon a fresh hit, which can cause hot keys to be prematurely discarded.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| void MasterService::ClearCandidatesForReload() { | ||
| for (auto& shard : metadata_shards_) { | ||
| for (auto& [tenant_id, tenant_state] : shard.tenants) { | ||
| (void)tenant_id; | ||
| tenant_state.promotion_candidates.clear(); | ||
| } | ||
| } | ||
| promotion_candidate_count_.store(0, std::memory_order_relaxed); | ||
| promotion_retry_cursor_.store(0, std::memory_order_relaxed); | ||
| } |
There was a problem hiding this comment.
In ClearCandidatesForReload(), the code iterates over metadata_shards_ and accesses shard.tenants directly without acquiring the shard's write lock (shard.mutex). This introduces a thread-safety violation (data race) if other threads are concurrently accessing or modifying the metadata shards. Since this method can be called during metadata reload or reset operations on a running service, we should safely acquire the write lock for each shard using MetadataShardAccessorRW.
void MasterService::ClearCandidatesForReload() {
for (size_t i = 0; i < kNumShards; ++i) {
MetadataShardAccessorRW shard(this, i);
for (auto& [tenant_id, tenant_state] : shard->tenants) {
(void)tenant_id;
tenant_state.promotion_candidates.clear();
}
}
promotion_candidate_count_.store(0, std::memory_order_relaxed);
promotion_retry_cursor_.store(0, std::memory_order_relaxed);
}| auto it = tenant_state.promotion_candidates.find(key); | ||
| if (it != tenant_state.promotion_candidates.end()) { | ||
| // Update existing entry: refresh last_seen, reset retry_after. | ||
| it->second.last_seen = now; | ||
| it->second.last_reason = reason; | ||
| it->second.last_error = last_error; | ||
| if (sketch_score > it->second.sketch_score) { | ||
| it->second.sketch_score = sketch_score; | ||
| } | ||
| it->second.retry_after = now; | ||
| return; | ||
| } |
There was a problem hiding this comment.
When a candidate already exists in RecordOrUpdateCandidate(), the code updates its last_seen and retry_after timestamps but does not reset retry_count to 0. This means that even if a key is actively and frequently accessed via Get, its retry_count will continue to increment on each background retry until it hits kPromotionCandidateMaxRetries (8) and gets prematurely erased. Resetting retry_count to 0 on a fresh Get hit ensures that hot keys get a fresh budget of retries, which aligns with the design goal of extending the effective TTL while the key stays hot.
auto it = tenant_state.promotion_candidates.find(key);
if (it != tenant_state.promotion_candidates.end()) {
// Update existing entry: refresh last_seen, reset retry_after and retry_count.
it->second.last_seen = now;
it->second.last_reason = reason;
it->second.last_error = last_error;
if (sketch_score > it->second.sketch_score) {
it->second.sketch_score = sketch_score;
}
it->second.retry_after = now;
it->second.retry_count = 0;
return;
}
Implements RFC #2233.
Summary
V1 promotion-on-hit (#2071) is Get-driven: when a hot
LOCAL_DISK-only key passes the CountMinSketch frequency gate, the master queues an async promotion task. However, transient blockers — DRAM above watermark, promotion queue at cap, recoverable push failure — cause the key to miss its window. It only gets another chance on the nextGet, which may not arrive promptly for bursty workloads where access patterns have subsided but the key should still be in DRAM.V1.1 adds a master-side background scheduler that re-evaluates rejected promotion candidates without waiting for another
Get.Design
Getpath stays non-blocking.TryPushPromotionQueuenow returns a typedPromotionQueueResultinstead ofvoid. On a transient rejection (watermark, queue cap, recoverable push failure) it records/updates aPromotionCandidateunder the key'sTenantState.PromotionCandidatestores:sketch_score(frequency at record time),first_seen,last_seen(refreshed on each rejection — extends effective TTL while the key stays hot),retry_after(exponential backoff),retry_count,last_reason,last_error. Usessteady_clockfor monotonic interval tracking.kPromotionCandidateLimit(50 000), per-candidatekPromotionCandidateMaxRetries(8) andkPromotionCandidateTtl(60 s fromlast_seen).RunPromotionCandidateRetryruns in the existing eviction thread every 10 ms (early-exit when count == 0). It advances a ring-buffer cursor overkPromotionRetryShardBatch(64) shards per tick, collects up tokPromotionRetryBatchSize(128) due candidates, then callsTryPushPromotionQueue(..., record_candidate=false)to re-enter the V1 admission path.ClearCandidatesForReloadfrom bothMetadataSerializer::DeserializeandResetto reset all transient candidate state.Metrics
All six counters are wired into
MasterMetricManagerand exposed via Prometheus:master_promotion_candidate_recorded_totalmaster_promotion_candidate_admitted_totalmaster_promotion_candidate_admission_rejected_totalmaster_promotion_candidate_expired_evaluated_totalmaster_promotion_candidate_expired_unevaluated_totalmaster_promotion_candidate_dropped_limit_totalThe
expired_unevaluated/expired_evaluateddistinction is implemented by checkingretry_count == 0at expiry time in the scan loop: a candidate withretry_count == 0was never processed by the scheduler before its TTL elapsed.Non-goals (unchanged from V1)
GetGet-observed candidatesLOCAL_DISKdeletion after promotionTest plan
Five new
TEST_Fcases added topromotion_on_hit_test.cpp:RetryCandidate_WatermarkRejectionRecordsCandidate— watermark rejection records a candidate; a secondGetdeduplicates (count stays 1)RetryCandidate_ExhaustedAfterMaxRetries— candidate is erased andexpired_evaluatedis incremented afterkPromotionCandidateMaxRetriesscansRetryCandidate_ObjectRemovedMidRetry— object removed while candidate is pending; next scan cleans up with no spuriousadmittedincrementRetryCandidate_MultipleKeysTracked— multiple LOCAL_DISK-only keys tracked independently,recordedcounter matches key countRetryCandidate_ClearOnReload—ClearCandidatesForReloadresets all candidate state and the global atomic counter to zerocmake --build build-ccf-ninja --target promotion_on_hit_test -j4 ./build-ccf-ninja/mooncake-store/tests/promotion_on_hit_test --gtest_filter='*RetryCandidate*' --gtest_brief=1🤖 Generated with Claude Code