Cosmos: add cold-start metadata cache hedging (Java port of dotnet #5923) #49517
Draft
NaluTripician wants to merge 2 commits into
Introduces MetadataHedgingStrategy: bounded cross-region hedging for cold-start metadata cache population. Races a primary regional read against a hedge dispatched after a fixed SDK-derived threshold, returning the first acceptable winner. Wired into RxClientCollectionCache cold-start Collection reads, gated by a tri-state Configs opt-in that follows PPAF when unset. Includes 19 unit tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix latent bug: hedgeOutcome was cached, decoupling the hedge timer from downstream cancellation so a spurious hedge could fire ~threshold after the primary already won. Removed .cache() on the hedge branch (single consumer) so merge cancellation cancels the timer; added a regression test (fast primary win must not leak a late hedge).
- Guard the per-client budget permit against an assembly-time throw between tryAcquire and the doFinally release; a single AtomicBoolean now guards every release path against double-release.
- Thread isColdStart through RxCollectionCache.getByRid/getByName so forced metadata refreshes no longer hedge; only cold-start cache-miss population does.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
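To illustrate the permit guard described in the second commit, here is a minimal sketch using only java.util.concurrent. It is not the PR's code: the class HedgePermitSketch and its methods tryAcquire, dispatch, and available are hypothetical names, and the Reactor doFinally hook is stood in for by a plain Runnable that the caller invokes on completion.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical illustration of a non-blocking hedge budget with a one-shot release guard.
final class HedgePermitSketch {
    private final Semaphore budget;

    HedgePermitSketch(int maxConcurrentHedges) {
        this.budget = new Semaphore(maxConcurrentHedges);
    }

    /** Non-blocking acquire. Returns a releaser that gives the permit back at most once. */
    Optional<Runnable> tryAcquire() {
        if (!budget.tryAcquire()) {
            return Optional.empty(); // budget exhausted: caller falls back to primary-only
        }
        AtomicBoolean released = new AtomicBoolean(false);
        return Optional.of(() -> {
            // compareAndSet makes every release path idempotent.
            if (released.compareAndSet(false, true)) {
                budget.release();
            }
        });
    }

    /**
     * Acquires a permit and hands the releaser to the code that assembles the hedge.
     * If assembly throws before the releaser is wired in, the permit is released here.
     * Returns false when no permit was available.
     */
    boolean dispatch(Consumer<Runnable> assembleHedge) {
        Optional<Runnable> release = tryAcquire();
        if (!release.isPresent()) {
            return false;
        }
        try {
            assembleHedge.accept(release.get());
        } catch (RuntimeException e) {
            release.get().run(); // no leak on an assembly-time throw
            throw e;
        }
        return true;
    }

    int available() {
        return budget.availablePermits();
    }
}
```

Because the releaser is idempotent, it can be called both from the completion hook and from the catch block without returning more permits than were taken.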
Summary
Ports cold-start metadata cache hedging from the .NET Cosmos SDK (Azure/azure-cosmos-dotnet-v3#5923) to the Java SDK, adapted to the reactive (Project Reactor) model.
On cold-start metadata cache population (Collection read), the SDK now proactively dispatches a hedged cross-region request when the primary region hasn't responded within an SDK-derived threshold, reducing cold-start tail latency during regional brown-outs / PPAF failover.
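The race described above can be sketched roughly as follows. This is an illustration only: it uses CompletableFuture (Java 9+) instead of Project Reactor so that it runs without dependencies, and the class HedgedReadSketch, the method race, and its parameters are hypothetical names rather than the PR's API. It treats any exception as a failure, whereas the PR classifies regional failures and acceptable winners, and it does not cancel the losing call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical illustration of a threshold-delayed hedge; not MetadataHedgingStrategy itself.
final class HedgedReadSketch {

    /**
     * Starts the primary read immediately. The hedge is dispatched when the threshold
     * elapses without a winner, or earlier if the primary fails first (fail-fast).
     */
    static <T> CompletableFuture<T> race(
            Supplier<CompletableFuture<T>> primary,
            Supplier<CompletableFuture<T>> hedge,
            long thresholdMillis) {
        CompletableFuture<T> winner = new CompletableFuture<>();
        AtomicBoolean hedgeStarted = new AtomicBoolean(false);
        CompletableFuture<T> primaryCall = primary.get();

        Runnable startHedge = () -> {
            // Never dispatch once a winner exists, and never dispatch twice.
            if (winner.isDone() || !hedgeStarted.compareAndSet(false, true)) {
                return;
            }
            hedge.get().whenComplete((value, error) -> {
                if (error == null) {
                    winner.complete(value);
                } else {
                    // Hedge failed: defer to whatever the primary yields.
                    primaryCall.whenComplete((pv, pe) -> {
                        if (pe == null) {
                            winner.complete(pv);
                        } else {
                            winner.completeExceptionally(pe);
                        }
                    });
                }
            });
        };

        primaryCall.whenComplete((value, error) -> {
            if (error == null) {
                winner.complete(value);
            } else {
                startHedge.run(); // fail-fast hedge on primary failure
            }
        });

        // Timer branch: fires startHedge after the threshold; the guard above
        // turns it into a no-op when the primary has already won.
        CompletableFuture.delayedExecutor(thresholdMillis, TimeUnit.MILLISECONDS)
                .execute(startHedge);
        return winner;
    }
}
```

A caller would pass two suppliers that target different regions, for example `HedgedReadSketch.<String>race(() -> readFrom("regionA"), () -> readFrom("regionB"), 1500)`, where `readFrom` is a placeholder for a region-targeted read.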
What's included
New package
com.azure.cosmos.implementation.metadatahedging:

- MetadataHedgingStrategy (one per client): executeAsync races a primary regional read against a hedge dispatched after a fixed threshold (1.5s = first-attempt + 500ms), returns the first acceptable winner and cancels the loser. Non-blocking per-client Semaphore budget; fail-fast hedging when the primary returns a regional failure before the threshold; shared failure classification (isRegionalFailure, isAcceptableWinner with the hedge-branch 401/403 overlay).
- MetadataHedgingContext, MetadataHedgingResult, MetadataHedgeDiagnostics, MetadataHedgeEligibility, MetadataHedgeSkipReason, HedgeBranch.

Wiring & config:

- Configs.getMetadataHedgingForColdStartEnabled(): tri-state opt-in (null = follow PPAF, true = force on, false = off) via the COSMOS.METADATA_HEDGING_FOR_COLD_START_ENABLED system property / env var. Default behavior is unchanged (the strategy is only constructed when PPAF is enabled or the customer explicitly opts in).
- RxClientCollectionCache: injects GlobalEndpointManager + strategy (backward-compatible constructors); cold-start Collection reads route through the strategy with a region-targeted sender that handles both master-key and AAD auth per cloned branch.
- RxDocumentClientImpl: builds the strategy via createIfEnabled, resolving PPAF state from the per-partition automatic-failover manager.

Testing

- azure-cosmos compiles; checkstyle and spotbugs both pass (not disabled).
- MetadataHedgingStrategyTest: 19/19 unit tests pass covering opt-in resolution, createIfEnabled, regional-failure / acceptable-winner classification, all eligibility skip reasons, and executeAsync race paths (ineligible → primary-only, primary-wins-no-hedge, hedge-wins-on-primary-regional-failure, budget-exhausted fallback).

Scope notes

Phase-1 scope mirrors the .NET phased rollout. Deferred follow-ups: PartitionKeyRangeCache hedging (Java path goes through the full readPartitionKeyRanges query pipeline, not a direct store-model call), OpenTelemetry metrics, and the Gateway kill-switch account flag (hard-wired false in .NET Phase 1 too).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
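For reviewers, the tri-state opt-in listed under Wiring & config can be pictured with the sketch below. It is an assumption-laden illustration, not the Configs implementation: the class ColdStartHedgingOptInSketch and its methods are hypothetical, only the system-property lookup is shown, and the parsing rule (blank means unset, anything other than "true" means off) is assumed rather than taken from the PR.

```java
// Hypothetical illustration of the tri-state opt-in; the real logic lives in Configs.
final class ColdStartHedgingOptInSketch {
    static final String PROPERTY = "COSMOS.METADATA_HEDGING_FOR_COLD_START_ENABLED";

    /** Parses a raw setting: null or blank means "unset" (assumed parsing rule). */
    static Boolean parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        return Boolean.parseBoolean(raw.trim());
    }

    /** Reads the opt-in from the system property only (env var lookup omitted). */
    static Boolean readFromSystemProperty() {
        return parse(System.getProperty(PROPERTY));
    }

    /** null follows PPAF, true forces hedging on, false turns it off. */
    static boolean isEnabled(Boolean optIn, boolean ppafEnabled) {
        return optIn != null ? optIn : ppafEnabled;
    }
}
```

With this shape, an unset property leaves behavior tied to PPAF, so clients without PPAF and without an explicit opt-in never construct the strategy.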