[WIP] Default thin-client to enabled and add proxy connectivity-probe gate #49437
jeet1995 wants to merge 68 commits into
… QueryPlan proxy routing Add RNTBD token mappings for x-ms-cosmos-supported-query-features (0x002B) and x-ms-cosmos-query-version (0x002C) so the thin client proxy can read these values from the RNTBD body when processing QueryPlan requests. Previously these headers were only set as HTTP headers by QueryPlanRetriever and were lost when QueryPlan was routed through the proxy path, since ThinClientStoreModel serializes requests as RNTBD (not HTTP headers). IDs match server-side proxy definitions per ADO PR 1982503. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add testThinClientChangeFeedFullRange covering FeedRange.forFullRange() across multiple partition keys, and testThinClientChangeFeedPartitionKey covering FeedRange.forLogicalPartition with exact doc count + PK validation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Documents all 59 thin client E2E tests across query (50), point operations (3), change feed (3), and stored procedures (3) with SQL, query features covered, and known account-side blockers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… QueryPlan proxy routing Add RNTBD token mappings for x-ms-cosmos-supported-query-features (0x00F0) and x-ms-cosmos-query-version (0x00F1) so the thin client proxy can read these values from the RNTBD body when processing QueryPlan requests. IDs are provisional (0x00F0, 0x00F1) — must be coordinated with server-side proxy team. See ADO PR 1982503 for the proxy-side design. Note: The design doc listed 0x002B/0x002C but those are already assigned to PartitionKey/PartitionKeyRangeId in the Java SDK. Using 0x00F0/0x00F1 to avoid ID collision until final server-side IDs are assigned. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…BD instructions - Fix testGetCurrentDateTime: assert ISO 8601 format instead of exact match (gateway and proxy return slightly different timestamps) - Add DefaultAzureCredential support via COSMOS.USE_AAD_AUTH system property for accounts with disableLocalAuth=true - Add RNTBD class reference as .github/instructions/rntbd.instructions.md - Add pom.xml system properties for THINCLIENT_ENABLED, HTTP2_ENABLED, USE_AAD_AUTH - Add beforeSuiteReuse mode for degraded accounts Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
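The testGetCurrentDateTime fix above swaps an exact-match assertion for a format check. A minimal sketch of such a check follows, assuming the value is an ISO 8601 UTC instant string; the class and method names are hypothetical and are not the helper used in the test suite.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

// Hypothetical helper for illustration; not the assertion used in the actual test.
final class Iso8601CheckSketch {

    // Returns true when the value parses as an ISO 8601 UTC instant,
    // for example "2024-05-01T12:34:56.1234567Z".
    static boolean isIso8601Instant(String value) {
        if (value == null) {
            return false;
        }
        try {
            Instant.parse(value);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two backends may return timestamps that differ slightly;
        // both still satisfy the format check.
        System.out.println(isIso8601Instant("2024-05-01T12:34:56.1234567Z"));
        System.out.println(isIso8601Instant("2024-05-01T12:34:56.1299999Z"));
        System.out.println(isIso8601Instant("not-a-timestamp"));
    }
}
```

Checking the format rather than the value keeps the comparison between the two routes meaningful even when each backend evaluates the current time independently.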
- Switch baseline from Gateway V1 to Direct TCP to avoid JVM config interference (THINCLIENT_ENABLED/HTTP2_ENABLED affect Gateway V1) - Assert :10250 endpoint only on Gateway V2 results (not baseline) - Rename helpers: assertDirectAndThinClientMatch (was gateway) - Document seedTestData schema in Javadoc - Remove 'Expected to fail' comments (account has vector search enabled) - Clean up class/method Javadoc Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Build 6424287 went from 49+/2 distinct test failures down to two new test-only failure patterns, both rooted in the same cause: the probe gate (enabled by default, threshold=1) is flipping data-plane routing from the thin-client proxy to Gateway V1 in test accounts whose proxy has not yet deployed the /connectivity-probe endpoint.
New failures (both fixed in ad9e3df):
Fix: Disable the probe by default in …
No production impact: these are test-environment changes. The probe still defaults to …
Remaining single-shot failures (likely environmental, not probe-related):
Commit: ad9e3df
…teBase" This reverts commit ad9e3df.
…PartitionCircuitBreakerE2ETests Companion to the prior revert. The revert undid the global TestSuiteBase probe disable (which masked production behaviour). This commit adds the necessary per-class disable to the two test classes whose assertions explicitly require thinclient routing: CosmosNotFoundTests (thinclient group) and PerPartitionCircuitBreakerE2ETests (fi-thinclient-multi-master group). Both clear the property in their @AfterClass. Http2PingKeepaliveTest already has its own disable (restored by the revert). Production callers continue to get the connectivity probe ON by default with the production failure threshold. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update on the CI-failure fix for build 6424287
My earlier comment proposed disabling the connectivity probe globally in …
What I just pushed (commits …
Each … This keeps the rest of the suite honest about default behaviour while preventing the three classes that need a deterministic proxy route from failing while the proxy-side …
… into AzCosmos_GatewayV2_QueryPlanSupport # Conflicts: # sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java # sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdConstants.java
Covers 5 new scenarios in ThinClientRoutingGateTests: - ExecuteStoredProcedure on a StoredProcedure resource routes to thin client - Non-execute StoredProcedure ops (Create) route to Gateway V1 - OperationType.QueryPlan routes to thin client - QueryPlan returns false when probe is unhealthy - ExecuteStoredProcedure returns false when probe is unhealthy Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
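The five scenarios above can be read as a truth table over a single routing predicate. The sketch below is a simplified model under stated assumptions: the enums are hypothetical stand-ins for the SDK's resource and operation types, document operations are assumed eligible, and the gate is taken to be the conjunction given in the PR description (useThinClient, thin-client read locations present, probe healthy, request eligible).

```java
// Simplified, hypothetical stand-ins for the SDK's resource and operation types.
enum ResourceKind { DOCUMENT, STORED_PROCEDURE }

enum OperationKind { READ, QUERY, QUERY_PLAN, CREATE, EXECUTE_STORED_PROCEDURE }

// Illustrative model of the routing gate; not the SDK's implementation.
final class RoutingGateSketch {

    // Eligibility as exercised by the scenarios above:
    // QueryPlan and stored-procedure execution qualify, other stored-procedure ops do not.
    static boolean isEligible(ResourceKind resource, OperationKind operation) {
        if (operation == OperationKind.QUERY_PLAN) {
            return true;
        }
        if (resource == ResourceKind.STORED_PROCEDURE) {
            return operation == OperationKind.EXECUTE_STORED_PROCEDURE;
        }
        // Assumption: document operations are eligible.
        return resource == ResourceKind.DOCUMENT;
    }

    // Every condition must hold; otherwise the request goes to Gateway V1.
    static boolean routesToThinClient(boolean useThinClient,
                                      boolean hasThinClientReadLocations,
                                      boolean proxyHealthy,
                                      ResourceKind resource,
                                      OperationKind operation) {
        return useThinClient
            && hasThinClientReadLocations
            && proxyHealthy
            && isEligible(resource, operation);
    }

    public static void main(String[] args) {
        System.out.println(routesToThinClient(true, true, true,
            ResourceKind.STORED_PROCEDURE, OperationKind.EXECUTE_STORED_PROCEDURE));
        System.out.println(routesToThinClient(true, true, false,
            ResourceKind.DOCUMENT, OperationKind.QUERY_PLAN));
    }
}
```

Keeping the probe signal as one more conjunct means an unhealthy probe can only narrow the set of thin-client requests and never widen it.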
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Item 3: remove unused COSMOS.USE_AAD_AUTH system property from the thinclient profile in azure-cosmos-tests/pom.xml. The Maven property cosmos.use.aad.auth is not defined anywhere in the repo, so the substitution produced a literal that Boolean.parseBoolean reads as false; TestSuiteBase already falls back to the COSMOS_USE_AAD_AUTH env var. Item 4: in Http2PingKeepaliveTest, gate the THINCLIENT_PROBE_ENABLED=false override behind !THIN_CLIENT_ENABLED so the probe is disabled only in the pure Compute Gateway (Gateway V2) branch on port 443, where thin-client routing is off entirely and probe POSTs to port 10250 are pure noise. In the thin-client branch (port 10250) the probe is intentionally left enabled so the same production code path that gates thin-client routing is exercised. Mirror the conditional in afterClass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
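Item 3 turns on how a boolean flag is parsed. `Boolean.parseBoolean` returns true only for the string "true" (case-insensitive), so an unresolved Maven placeholder reads as false. The sketch below illustrates that, together with a property-then-environment lookup; the class name and the precedence order are assumptions made for the example, not the SDK's or TestSuiteBase's code.

```java
// Hypothetical sketch of flag resolution; names and precedence are assumptions.
final class BooleanFlagSketch {

    // Pure resolution step: a non-empty system property value wins,
    // then a non-empty environment value, then the default.
    static boolean resolve(String propertyValue, String envValue, boolean defaultValue) {
        if (propertyValue != null && !propertyValue.isEmpty()) {
            return Boolean.parseBoolean(propertyValue);
        }
        if (envValue != null && !envValue.isEmpty()) {
            return Boolean.parseBoolean(envValue);
        }
        return defaultValue;
    }

    // Convenience wrapper that reads the live JVM property and process environment.
    static boolean fromRuntime(String propertyName, String envName, boolean defaultValue) {
        return resolve(System.getProperty(propertyName), System.getenv(envName), defaultValue);
    }

    public static void main(String[] args) {
        // An unresolved placeholder is just a string that is not "true".
        System.out.println(Boolean.parseBoolean("${cosmos.use.aad.auth}"));
        System.out.println(fromRuntime("COSMOS.USE_AAD_AUTH", "COSMOS_USE_AAD_AUTH", false));
    }
}
```

Under the precedence assumed here, a placeholder literal in the system property would be parsed before the environment variable is consulted, which is one reason to drop a property that is never populated.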
Deep review summary
Item 1: Stored Procedure routing on thin-client ✅
Supported.
Item 2: …
- QueryPlanRetriever: clarify thin-client / Gateway V1 routing comment; explain that Gateway V1 is pinned only when PartitionKeyDefinition is unavailable to convert proxy queryRanges to EPK hex ranges. - RxDocumentClientImpl + DocumentQueryExecutionContextFactory: plumb the resolved DocumentCollection into validateCustomQueryForReadManyByPartitionKeys -> fetchQueryPlanForValidation so the validation query-plan request is thin-client-eligible instead of forcibly pinned to Gateway V1. - EndpointProbeClient: extract DiagnosticsSnapshot and ProbeResult into top-level types (EndpointProbeDiagnosticsSnapshot, EndpointProbeResult); remove @SuppressWarnings("unused"); use the failure reason in RED-cycle logs and toString(); verbose NPE message in the constructor naming the required dependency and wiring entry point. - CHANGELOG: move thin-client entry under Features Added; one-line summary with HTTP/2 + gatewayMode requirement and PR link. - CosmosNotFoundTests: add forceHttp1IfGatewayMode helper applied to every fast-group client construction so Gateway-mode clients in the fast group use HTTP/1.1 and are guaranteed not to route through the proxy. Keep the COSMOS.THINCLIENT_PROBE_ENABLED=false set/clear for the thinclient group's defensive teardown. - PerPartitionCircuitBreakerE2ETests: drop the THINCLIENT_PROBE_ENABLED property mutation now that probe behavior is asserted only by tests that opt in explicitly.
- Add ReadManyByPartitionKeyQueryPlanRoutingTest unit tests that pin the useGatewayMode gate in QueryPlanRetriever: gateway mode when DocumentCollection is null and partitioned mode when a PartitionKeyDefinition is present. - Add three readManyByPartitionKeys E2E tests to ThinClientQueryE2ETest that exercise the validation QueryPlan path through Direct TCP (baseline) and Gateway V2 (thin client), covering no-custom-query, projection+filter, and parameterized variants. Each thin-client diagnostics page is asserted to use the :10250 endpoint. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
…ill switch in forceUnhealthy Stage 1990 of build 6432767 had 83 failures in ClientRetryPolicyE2ETestsWithGatewayV2#serviceUnavailableWithGatewayV2 because the probe gate added by PR Azure#49437 flips routing from :10250 to :443 when CI test accounts lack /connectivity-probe (failureThreshold=1 trips proxyHealthy=false on the first failed probe and the assertion at TestSuiteBase#assertThinClientEndpointUsed then sees :443 and fails). Test fix (per CosmosNotFoundTests precedent, commit 3b392e3): - ClientRetryPolicyE2ETestsWithGatewayV2: disable probe in @BeforeClass / clear in @AfterClass. - ThinClientTestBase: same lifecycle in enableThinClientForTest / clearThinClientForTest so all ThinClient*E2ETest subclasses inherit the guard. Source fix: honor COSMOS.THINCLIENT_PROBE_ENABLED kill switch inside EndpointProbeClient.forceUnhealthy(), closing the item-5 invariant gap (previously only runProbeCycle() checked the flag, so a mismatch-driven forceUnhealthy could mutate proxyHealthy even with the probe disabled). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
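The interaction described above (a failure threshold of 1 flipping the gate on the first failed probe, and forceUnhealthy now honoring the kill switch) can be modeled with a small state machine. This is a sketch under stated assumptions: the class is hypothetical, it uses synchronized mutators rather than whatever concurrency scheme EndpointProbeClient uses, and the kill switch is passed in as a BooleanSupplier instead of being read from COSMOS.THINCLIENT_PROBE_ENABLED.

```java
import java.util.function.BooleanSupplier;

// Hypothetical model of the probe gate; not the SDK's EndpointProbeClient.
final class ProbeHysteresisSketch {
    private final int failureThreshold;
    private final int recoveryThreshold;
    private final BooleanSupplier probeEnabled;   // stands in for the kill switch
    private volatile boolean proxyHealthy = true; // optimistic until proven otherwise
    private int consecutiveFailures;
    private int consecutiveSuccesses;

    ProbeHysteresisSketch(int failureThreshold, int recoveryThreshold, BooleanSupplier probeEnabled) {
        this.failureThreshold = failureThreshold;
        this.recoveryThreshold = recoveryThreshold;
        this.probeEnabled = probeEnabled;
    }

    // Hot-path read: a single volatile load.
    boolean isProxyHealthy() {
        return proxyHealthy;
    }

    // Applies one probe cycle; state flips only after enough consecutive results.
    synchronized void applyCycleResult(boolean allEndpointsGreen) {
        if (!probeEnabled.getAsBoolean()) {
            return; // kill switch: leave state untouched
        }
        if (allEndpointsGreen) {
            consecutiveFailures = 0;
            consecutiveSuccesses++;
            if (!proxyHealthy && consecutiveSuccesses >= recoveryThreshold) {
                proxyHealthy = true;
            }
        } else {
            consecutiveSuccesses = 0;
            consecutiveFailures++;
            if (proxyHealthy && consecutiveFailures >= failureThreshold) {
                proxyHealthy = false;
            }
        }
    }

    // Pins the gate to unhealthy, unless the kill switch is off.
    synchronized void forceUnhealthy() {
        if (!probeEnabled.getAsBoolean()) {
            return;
        }
        consecutiveSuccesses = 0;
        proxyHealthy = false;
    }
}
```

With both thresholds at 1, a single failed cycle is enough to reroute traffic, which matches the CI behaviour reported above; larger thresholds trade detection latency for stability.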
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
When COSMOS.THINCLIENT_PROBE_ENABLED=false, no probe cycles run, so the probe-health signal is stale/meaningless. Treat probe as healthy in that case so the gate is driven solely by COSMOS.THINCLIENT_ENABLED (already folded into useThinClient at construction). This fixes regressions in CI test classes (GatewayReadConsistencyStrategyE2ETest, FaultInjectionWithAvailabilityStrategyTestsBase, ThinClientQueryE2ETest) where probe to /connectivity-probe fails on test accounts and flips routing to :443 instead of :10250. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
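The rule in this commit reduces to a two-input predicate. A minimal sketch, assuming hypothetical method names and that useThinClient already folds in COSMOS.THINCLIENT_ENABLED as the message states:

```java
// Hypothetical sketch of the gate described above; not the SDK's actual method.
final class ProbeDisabledGateSketch {

    // With the probe disabled no cycles run, so the stored health flag is ignored
    // and the probe is reported as healthy.
    static boolean isProxyProbeHealthy(boolean probeEnabled, boolean lastProbeHealthy) {
        return !probeEnabled || lastProbeHealthy;
    }

    // useThinClient is assumed to already reflect COSMOS.THINCLIENT_ENABLED.
    static boolean routeToThinClient(boolean useThinClient,
                                     boolean probeEnabled,
                                     boolean lastProbeHealthy) {
        return useThinClient && isProxyProbeHealthy(probeEnabled, lastProbeHealthy);
    }

    public static void main(String[] args) {
        // Probe disabled and a stale unhealthy flag: routing still follows useThinClient.
        System.out.println(routeToThinClient(true, false, false));
    }
}
```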
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
…onsistency tests Adds COSMOS.THINCLIENT_PROBE_ENABLED=false in @BeforeClass (and clears in @AfterClass) for PerPartitionCircuitBreakerE2ETests, PerPartitionAutomaticFailoverE2ETests, FaultInjectionServerErrorRuleOnGatewayV2Tests, and GatewayReadConsistencyStrategyE2ETest. These tests run against CI accounts whose /connectivity-probe endpoint is not deployed. Without this, the first failed probe trips proxyHealthy=false and routing falls back to Gateway V1 (port 443), causing assertThinClientEndpointUsed (which expects port 10250) to fail. Locally validated PPAF (16 tests, 0 failures), GRCS (3 tests, 0 failures), FI (3 tests, 0 failures). PPCB still in progress at commit time; pushing to let local + CI run in parallel. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
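The set-in-setup, clear-in-teardown pattern used by these test classes can also be written as a scoped override. The sketch below is a variation, not what the commit does: it restores the previous value instead of unconditionally clearing it, and the class name is hypothetical.

```java
// Hypothetical helper; the commit itself sets the property in @BeforeClass
// and clears it in @AfterClass rather than using a scoped guard.
final class SystemPropertyOverrideSketch implements AutoCloseable {
    private final String key;
    private final String previous;

    SystemPropertyOverrideSketch(String key, String value) {
        this.key = key;
        this.previous = System.getProperty(key); // remember what was there, possibly null
        System.setProperty(key, value);
    }

    // Restores the earlier value, or clears the property if none was set.
    @Override
    public void close() {
        if (previous == null) {
            System.clearProperty(key);
        } else {
            System.setProperty(key, previous);
        }
    }

    public static void main(String[] args) {
        String key = "COSMOS.THINCLIENT_PROBE_ENABLED";
        try (SystemPropertyOverrideSketch ignored = new SystemPropertyOverrideSketch(key, "false")) {
            System.out.println(System.getProperty(key));
        }
        System.out.println(System.getProperty(key));
    }
}
```

Restoring rather than clearing matters when a CI job passes the flag on the command line, since an unconditional clear would change behaviour for every class that runs afterwards in the same JVM.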
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
QueryPlan requests intentionally carry no RCS/CL headers (matches the V1 HTTP behavior). When the V2 thin-client routes the QueryPlan precursor through the same :10250 endpoint as the data query, the spy must skip the QueryPlan frame so the assertion checks the actual data-query frame. This mirrors the IS_QUERY_PLAN_REQUEST filter on the V1 path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
Motivation
Gateway V2 (a.k.a. ThinClient) is the new Compute-hosted data plane for Cosmos DB. To roll it out without forcing every SDK consumer to opt in, we need a user-transparent way for the Java SDK to:
Today the SDK has no client-side signal about ThinClient reachability. If a thin-client endpoint is degraded — bad HTTP/2 negotiation, TLS handshake failure, proxy 5xx — every eligible request is routed there and fails before falling back. This PR closes that gap by:
- Setting `COSMOS.THINCLIENT_ENABLED` to true by default, so any account whose topology advertises thin-client read locations starts routing eligible traffic through Gateway V2 automatically.
- Adding a probe (`EndpointProbeClient`) that `POST`s to `/connectivity-probe` over the thin-client HTTP/2 transport on every topology refresh. The probe result is `AND`-ed into the existing routing gate (`useThinClientStoreModel`) via `GlobalEndpointManager.isProxyProbeHealthy()`. If the probe trips, eligible requests transparently route to Gateway V1 until probes recover; the caller never sees the difference except in `CosmosDiagnostics`.

Key invariants the design preserves:
- … `boolean` read.
- `proxyHealthy = true` until proven otherwise, so a slow first probe never delays the first request.
- `CosmosClient` construction and topology refresh are protected by `onErrorResume`.
- `COSMOS.THINCLIENT_PROBE_ENABLED=false` short-circuits the probe HTTP I/O entirely while leaving routing on its last-known state (defaults to GREEN).
- `THINCLIENT_PROBE_FAILURE_THRESHOLD` consecutive RED cycles to flip GREEN→RED; `THINCLIENT_PROBE_RECOVERY_THRESHOLD` consecutive GREEN cycles to flip back, which prevents routing oscillation.

Call sequence
```mermaid
sequenceDiagram
    autonumber
    participant App as User App
    participant Client as CosmosClient<br/>(RxDocumentClientImpl)
    participant GEM as GlobalEndpointManager
    participant Probe as EndpointProbeClient
    participant H2 as ThinClient HTTP/2<br/>HttpClient
    participant Proxy as ThinClient<br/>Endpoint (Gateway V2)
    participant V1 as Gateway V1<br/>(fallback)

    rect rgba(200,220,255,0.25)
        Note over App,Probe: Bootstrap (one-time)
        App->>Client: new CosmosClient(...)
        Client->>Client: useThinClient = config.isThinClientEnabled() (now defaults true)
        Client->>GEM: setThinClientHttpClient(reactorHttpClient)
        GEM->>Probe: new EndpointProbeClient(httpClient)
        Note over Probe: proxyHealthy = true (optimistic)
    end

    rect rgba(200,255,210,0.25)
        Note over GEM,Proxy: Topology refresh (every refresh tick)
        GEM->>GEM: refreshLocationPrivateAsync()
        GEM->>GEM: runThinClientProbeCycleMono()
        GEM->>Probe: runProbeCycle(thinClientRegionalEndpoints)
        par Per regional endpoint
            Probe->>H2: POST <endpoint>/connectivity-probe
            H2->>Proxy: HTTP/2 frame
            Proxy-->>H2: 200 OK (or error / timeout)
            H2-->>Probe: ProbeResult(GREEN|RED)
        end
        Probe->>Probe: applyCycleResult() — apply hysteresis<br/>update proxyHealthy
        Note right of Probe: ALL endpoints 200 → GREEN<br/>any non-200 / timeout → RED<br/>flip only after N consecutive
    end

    rect rgba(255,235,200,0.25)
        Note over App,V1: Data-plane / QueryPlan request
        App->>Client: documents().readItem(...) / queryItems(...)
        Client->>Client: useThinClientStoreModel(request)
        Client->>GEM: isProxyProbeHealthy()
        GEM->>Probe: isProxyHealthy()
        Probe-->>GEM: true (GREEN) / false (RED)
        GEM-->>Client: gate result
        alt useThinClient && hasThinClientReadLocations && proxyHealthy && eligible
            Client->>Proxy: route through ThinClient (Gateway V2)
            Proxy-->>Client: response
        else gate failed
            Client->>V1: route through Gateway V1
            V1-->>Client: response
        end
        Client-->>App: response (transparent fallback)
    end
```

What's changing
- `COSMOS.THINCLIENT_ENABLED` defaults to `true`
- `com.azure.cosmos.implementation.EndpointProbeClient`: probe lifecycle, hysteresis, single-flight via `cycleInProgress` CAS, never-throw contract, body-drain inside the `Mono` lifecycle so reactor-netty releases pooled connections
- `GlobalEndpointManager.setThinClientHttpClient(...)` invoked from `RxDocumentClientImpl` when `useThinClient` is true; `runThinClientProbeCycleMono()` chained into `refreshLocationPrivateAsync`
- `RxDocumentClientImpl.shouldUseThinClientStoreModel(...)` now `AND`s in `isProxyProbeHealthy()`; stored procedures (`isExecuteStoredProcedureBasedRequest()`) explicitly eligible
- `EndpointProbeClient.DiagnosticsSnapshot` exposes last cycle result, consecutive-failure/success counters, last-state-change timestamp; reachable via `GlobalEndpointManager.getThinClientProbeDiagnostics()`
- When `hasThinClientReadLocations=true` but the resolved endpoint set is empty (eligibility/resolution mismatch), `forceUnhealthy(...)` pins the gate to RED so traffic falls back to Gateway V1

Configuration
| System property | Environment variable | Default |
|---|---|---|
| `COSMOS.THINCLIENT_ENABLED` | `COSMOS_THINCLIENT_ENABLED` | `true` (changed) |
| `COSMOS.THINCLIENT_PROBE_ENABLED` | `COSMOS_THINCLIENT_PROBE_ENABLED` | `true` |
| `COSMOS.THINCLIENT_PROBE_FAILURE_THRESHOLD` | `COSMOS_THINCLIENT_PROBE_FAILURE_THRESHOLD` | `1` |
| `COSMOS.THINCLIENT_PROBE_RECOVERY_THRESHOLD` | `COSMOS_THINCLIENT_PROBE_RECOVERY_THRESHOLD` | `1` |
| `COSMOS.THINCLIENT_PROBE_PATH` | `COSMOS_THINCLIENT_PROBE_PATH` | `/connectivity-probe` |
| `COSMOS.THINCLIENT_CONNECTION_TIMEOUT_IN_MS` | `COSMOS_THINCLIENT_CONNECTION_TIMEOUT_IN_MS` | … |

Testing
- `EndpointProbeClientTests`, `ThinClientProbeWiringTests`
- `ThinClientStoredProcedureE2ETest`, `ThinClientQueryE2ETest`, `Http2PingKeepaliveTest` (probe disabled in pure-Gateway-V1 branch only; see review item 4)
- `endpoint-probe-49437` (NEU, single-region thin-client-enabled account): 20-minute `Mixed 90/9/1` benchmark with `-DCOSMOS.HTTP2_ENABLED=true -DCOSMOS.THINCLIENT_ENABLED=true` confirms probes fire (`Http2PingHandler installed` + `ThinClient` log emissions verified in process logs); 196/199 thinclient profile tests pass.

Independent benchmark validation (probe-off c-sweep, 2026-06-13)
Second-source perf comparison vs `main` on a different account (`thin-client-mr-eventual-ci`, WCUS, multi-region Eventual; distinct from the `endpoint-probe-49437` NEU account in the existing Testing section) and a different VM (`Standard_D4s_v5`, westcentralus, colocated). Probe gate disabled (rationale below).

Setup
- `Standard_D4s_v5` (4 vCPU, 16 GiB), Ubuntu 22.04, westcentralus
- `thin-client-mr-eventual-ci`
- `-DCOSMOS.THINCLIENT_ENABLED=true -DCOSMOS.HTTP2_ENABLED=true -DCOSMOS.THINCLIENT_PROBE_ENABLED=false`
- `be37e742c36` vs `main` HEAD `175199ec727`
- `BenchmarkConfig` JSON, `connectionMode=GATEWAY`, `consistencyLevel=Eventual`, `maxRunningTimeDuration=PT1H`, ReadThroughput
- `:10250` (`ThinClientStoreModel`) on every run, confirmed in `benchmark.log`

Results
main

Health (every run)
- `metrics-check.json` `allPassed=true` for all 6 runs

Conclusion
No actionable perf regression at the meaningful concurrency points (c=20, c=50). The c=4 −11.4 % is single-run variance — it did not reproduce at c=20 (tied) and went the other way at c=50 (PR +2.7 %).
Caveat: probe disabled (`COSMOS.THINCLIENT_PROBE_ENABLED=false`)
The server-side `enableConnectivityProbe` federation flag is currently OFF on every PROD `*-fe` federation our test accounts live on. Verified via the CosmosDB repo (code default `false` in `FederationConfiguration.ThinProxy.cs:446`; no Phase-2 enablement PR has merged on either `releases/EN20260506` or `releases/EN20260409`). The proxy returns `503` to `POST /connectivity-probe`, the SDK probe gate (correctly) marks UNHEALTHY on cycle 1, and data plane falls back to Gateway V1, meaning we would measure fallback throughput, not thin-client throughput.
To exercise the actual `:10250` data path I disabled the SDK-side probe. The probe gate behavior itself should be re-validated on a federation where the server flag is `true` (the `test4` fleet that server PR 2107592 validated against) before this client PR ships to customers; otherwise customers will see Gateway V1 fallback in every region where Phase 2 has not rolled out yet.

Logs
Result directories preserved on the benchmark VM at `/home/azureuser/azure-sdk-for-java/sdk/cosmos/azure-cosmos-benchmark/results/20260613-SIMPLE-{main,PR-49437}-{commit}-c{4,20,50}/`. Each contains `benchmark.log`, `gc.log`, `monitor.csv`, `workload-config.json`, `metrics-check.json`, and per-metric Micrometer CSVs.