Skip to content

[Store] L2→L1 promotion background retry for hot LOCAL_DISK-only keys (V1.1)#2690

Open
Srinivasoo7 wants to merge 1 commit into
kvcache-ai:mainfrom
Srinivasoo7:store/promotion-retry-v1.1
Open

[Store] L2→L1 promotion background retry for hot LOCAL_DISK-only keys (V1.1)#2690
Srinivasoo7 wants to merge 1 commit into
kvcache-ai:mainfrom
Srinivasoo7:store/promotion-retry-v1.1

Conversation

@Srinivasoo7

Copy link
Copy Markdown

Implements RFC #2233.

Summary

V1 promotion-on-hit (#2071) is Get-driven: when a hot LOCAL_DISK-only key passes the CountMinSketch frequency gate, the master queues an async promotion task. However, transient blockers — DRAM above watermark, promotion queue at cap, recoverable push failure — cause the key to miss its window. It only gets another chance on the next Get, which may not arrive promptly for bursty workloads where access patterns have subsided but the key should still be in DRAM.

V1.1 adds a master-side background scheduler that re-evaluates rejected promotion candidates without waiting for another Get.

Design

  • Get path stays non-blocking. TryPushPromotionQueue now returns a typed PromotionQueueResult instead of void. On a transient rejection (watermark, queue cap, recoverable push failure) it records/updates a PromotionCandidate under the key's TenantState.
  • PromotionCandidate stores: sketch_score (frequency at record time), first_seen, last_seen (refreshed on each rejection — extends effective TTL while the key stays hot), retry_after (exponential backoff), retry_count, last_reason, last_error. Uses steady_clock for monotonic interval tracking.
  • Bounded candidate set: global count kPromotionCandidateLimit (50 000), per-candidate kPromotionCandidateMaxRetries (8) and kPromotionCandidateTtl (60 s from last_seen).
  • Background scheduler: RunPromotionCandidateRetry runs in the existing eviction thread every 10 ms (early-exit when count == 0). It advances a ring-buffer cursor over kPromotionRetryShardBatch (64) shards per tick, collects up to kPromotionRetryBatchSize (128) due candidates, then calls TryPushPromotionQueue(..., record_candidate=false) to re-enter the V1 admission path.
  • Permanent resolution (object gone, MEMORY replica appeared, task already in flight) erases the candidate immediately under the shard lock.
  • Metadata reload calls ClearCandidatesForReload from both MetadataSerializer::Deserialize and Reset to reset all transient candidate state.

Metrics

All six counters are wired into MasterMetricManager and exposed via Prometheus:

Prometheus counter Meaning
master_promotion_candidate_recorded_total New candidate entries created
master_promotion_candidate_admitted_total Candidates successfully promoted by background scheduler
master_promotion_candidate_admission_rejected_total Candidates that hit a gate again on retry
master_promotion_candidate_expired_evaluated_total Candidates exhausted after retries or TTL — were evaluated at least once
master_promotion_candidate_expired_unevaluated_total Candidates that aged out before the scheduler ever reached them (scan budget too small)
master_promotion_candidate_dropped_limit_total Candidates dropped at record time because the global count limit was reached

The expired_unevaluated / expired_evaluated distinction is implemented by checking retry_count == 0 at expiry time in the scan loop: a candidate with retry_count == 0 was never processed by the scheduler before its TTL elapsed.

Non-goals (unchanged from V1)

  • No synchronous promotion on Get
  • No proactive scan of cold keys — only Get-observed candidates
  • No LOCAL_DISK deletion after promotion
  • No new client-facing RPCs or heartbeat protocol changes

Test plan

Five new TEST_F cases added to promotion_on_hit_test.cpp:

  • RetryCandidate_WatermarkRejectionRecordsCandidate — watermark rejection records a candidate; a second Get deduplicates (count stays 1)
  • RetryCandidate_ExhaustedAfterMaxRetries — candidate is erased and expired_evaluated is incremented after kPromotionCandidateMaxRetries scans
  • RetryCandidate_ObjectRemovedMidRetry — object removed while candidate is pending; next scan cleans up with no spurious admitted increment
  • RetryCandidate_MultipleKeysTracked — multiple LOCAL_DISK-only keys tracked independently, recorded counter matches key count
  • RetryCandidate_ClearOnReloadClearCandidatesForReload resets all candidate state and the global atomic counter to zero
cmake --build build-ccf-ninja --target promotion_on_hit_test -j4
./build-ccf-ninja/mooncake-store/tests/promotion_on_hit_test --gtest_filter='*RetryCandidate*' --gtest_brief=1

🤖 Generated with Claude Code

… (V1.1)

Implements RFC kvcache-ai#2233: bounded background retry/admission layer on top of
the V1 promotion-on-hit path (kvcache-ai#2071).

## Motivation

V1 promotion-on-hit is Get-driven: a hot LOCAL_DISK-only key queues a
promotion when it passes the CountMinSketch frequency gate. However,
transient blockers (DRAM above watermark, promotion queue at cap, holder
push failure) cause the key to miss its promotion window. It only retries
on the next Get, which may not arrive promptly for bursty workloads.

V1.1 adds a master-side background scheduler that re-evaluates rejected
candidates without waiting for another Get.

## Design

- `Get` path remains non-blocking. `TryPushPromotionQueue` now returns a
  typed `PromotionQueueResult` instead of `void`. On transient rejection
  (watermark, queue cap, recoverable push failure), it records/updates a
  `PromotionCandidate` entry under the key's `TenantState`.
- `PromotionCandidate` stores: `sketch_score`, `first_seen`, `last_seen`
  (reset on each rejection to extend TTL while key stays hot),
  `retry_after` (exponential backoff), `retry_count`, `last_reason`,
  `last_error`. Uses `steady_clock` for monotonic interval tracking.
- Global candidate set is bounded by `kPromotionCandidateLimit` (50 000),
  per-candidate by `kPromotionCandidateMaxRetries` (8) and
  `kPromotionCandidateTtl` (60 s from `last_seen`).
- `RunPromotionCandidateRetry` runs in the eviction thread every 10 ms
  (early-exit when count == 0). It scans `kPromotionRetryShardBatch` (64)
  shards per tick via a ring-buffer cursor, collecting up to
  `kPromotionRetryBatchSize` (128) due candidates, then calls
  `TryPushPromotionQueue(..., record_candidate=false)` to re-enter the
  V1 admission path without re-recording candidates.
- Candidate state is cleared on metadata snapshot reload/reset via
  `ClearCandidatesForReload`, hooked into `MetadataSerializer::Deserialize`
  and `Reset`.
- Permanent resolution (object gone, MEMORY replica appeared, task already
  in flight) erases the candidate immediately.

## Metrics (all wired into MasterMetricManager / Prometheus)

| Counter | Meaning |
|---|---|
| `master_promotion_candidate_recorded_total` | New candidate entries created |
| `master_promotion_candidate_admitted_total` | Candidates promoted by background scheduler |
| `master_promotion_candidate_admission_rejected_total` | Candidates that hit a gate again on retry |
| `master_promotion_candidate_expired_evaluated_total` | Candidates exhausted after retries or TTL post-evaluation |
| `master_promotion_candidate_expired_unevaluated_total` | Candidates aged out before scheduler ever reached them (scan budget too small) |
| `master_promotion_candidate_dropped_limit_total` | Candidates dropped at record time due to global count limit |

The `expired_unevaluated` / `expired_evaluated` distinction is implemented
by checking `retry_count == 0` at expiry time in the scan loop.

## Non-goals (unchanged from V1)

- No synchronous promotion on Get
- No proactive scan of cold keys — only Get-observed candidates
- No LOCAL_DISK deletion after promotion
- No new client-facing RPCs or heartbeat protocol changes

## Tests

Five new `TEST_F` cases in `promotion_on_hit_test.cpp`:
- Watermark rejection records a candidate; duplicate Get does not add a
  second entry
- Candidate exhausted and removed after `kPromotionCandidateMaxRetries`
  scans
- Object removed while candidate is pending → next scan cleans up, no
  spurious admit metric
- Multiple keys tracked independently with correct per-key counts
- `ClearCandidatesForReload` resets all candidate state and the global
  atomic counter

Closes kvcache-ai#2233
@Srinivasoo7

Copy link
Copy Markdown
Author

@stmatengss @LujhCoconut
PTAL

Thanks

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a promotion retry candidate mechanism to handle transient promotion failures (such as watermark or queue capacity rejections) by tracking candidates, applying exponential backoff, and retrying them in a background thread. It also adds corresponding metrics and unit tests. The review feedback identifies two important issues: a thread-safety violation in ClearCandidatesForReload() where metadata shards are accessed without acquiring write locks, and a logic issue in RecordOrUpdateCandidate() where retry_count is not reset to 0 upon a fresh hit, which can cause hot keys to be prematurely discarded.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +5148 to +5157
void MasterService::ClearCandidatesForReload() {
for (auto& shard : metadata_shards_) {
for (auto& [tenant_id, tenant_state] : shard.tenants) {
(void)tenant_id;
tenant_state.promotion_candidates.clear();
}
}
promotion_candidate_count_.store(0, std::memory_order_relaxed);
promotion_retry_cursor_.store(0, std::memory_order_relaxed);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In ClearCandidatesForReload(), the code iterates over metadata_shards_ and accesses shard.tenants directly without acquiring the shard's write lock (shard.mutex). This introduces a thread-safety violation (data race) if other threads are concurrently accessing or modifying the metadata shards. Since this method can be called during metadata reload or reset operations on a running service, we should safely acquire the write lock for each shard using MetadataShardAccessorRW.

void MasterService::ClearCandidatesForReload() {
    for (size_t i = 0; i < kNumShards; ++i) {
        MetadataShardAccessorRW shard(this, i);
        for (auto& [tenant_id, tenant_state] : shard->tenants) {
            (void)tenant_id;
            tenant_state.promotion_candidates.clear();
        }
    }
    promotion_candidate_count_.store(0, std::memory_order_relaxed);
    promotion_retry_cursor_.store(0, std::memory_order_relaxed);
}

Comment on lines +5041 to +5052
auto it = tenant_state.promotion_candidates.find(key);
if (it != tenant_state.promotion_candidates.end()) {
// Update existing entry: refresh last_seen, reset retry_after.
it->second.last_seen = now;
it->second.last_reason = reason;
it->second.last_error = last_error;
if (sketch_score > it->second.sketch_score) {
it->second.sketch_score = sketch_score;
}
it->second.retry_after = now;
return;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When a candidate already exists in RecordOrUpdateCandidate(), the code updates its last_seen and retry_after timestamps but does not reset retry_count to 0. This means that even if a key is actively and frequently accessed via Get, its retry_count will continue to increment on each background retry until it hits kPromotionCandidateMaxRetries (8) and gets prematurely erased. Resetting retry_count to 0 on a fresh Get hit ensures that hot keys get a fresh budget of retries, which aligns with the design goal of extending the effective TTL while the key stays hot.

    auto it = tenant_state.promotion_candidates.find(key);
    if (it != tenant_state.promotion_candidates.end()) {
        // Update existing entry: refresh last_seen, reset retry_after and retry_count.
        it->second.last_seen = now;
        it->second.last_reason = reason;
        it->second.last_error = last_error;
        if (sketch_score > it->second.sketch_score) {
            it->second.sketch_score = sketch_score;
        }
        it->second.retry_after = now;
        it->second.retry_count = 0;
        return;
    }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant