[WIP] Default thin-client to enabled and add proxy connectivity-probe gate#49437

Open
jeet1995 wants to merge 68 commits into
Azure:mainfrom
jeet1995:jeet1995/thin-client-probe-flow

@jeet1995 (Member) commented Jun 10, 2026

Motivation

Gateway V2 (a.k.a. ThinClient) is the new Compute-hosted data plane for Cosmos DB. To roll it out without forcing every SDK consumer to opt in, we need a user-transparent way for the Java SDK to:

  1. Default to ThinClient for eligible data-plane and QueryPlan requests so adoption tracks the federation roll-out automatically, and
  2. Fall back to Gateway V1 the moment the ThinClient fleet looks unhealthy from the client's perspective — without surfacing the failure as a user-visible error and without per-request overhead.

Today the SDK has no client-side signal about ThinClient reachability. If a thin-client endpoint is degraded — bad HTTP/2 negotiation, TLS handshake failure, proxy 5xx — every eligible request is routed there and fails before falling back. This PR closes that gap by:

  • Flipping COSMOS.THINCLIENT_ENABLED to true by default, so any account whose topology advertises thin-client read locations starts routing eligible traffic through Gateway V2 automatically.
  • Adding an out-of-band connectivity probe (EndpointProbeClient) that POSTs to /connectivity-probe over the thin-client HTTP/2 transport on every topology refresh. The probe result is AND-ed into the existing routing gate (useThinClientStoreModel) via GlobalEndpointManager.isProxyProbeHealthy(). If the probe trips, eligible requests transparently route to Gateway V1 until probes recover — the caller never sees the difference except in CosmosDiagnostics.

Key invariants the design preserves:

  • No per-request probe I/O. Probing is piggybacked on the existing topology-refresh cadence; the data-plane gate is a single in-memory boolean read.
  • Optimistic startup. proxyHealthy = true until proven otherwise, so a slow first probe never delays the first request.
  • Never trip client init. Every probe-wiring and probe-execution failure path is swallowed and logged — CosmosClient construction and topology refresh are protected by onErrorResume.
  • Kill switch. COSMOS.THINCLIENT_PROBE_ENABLED=false short-circuits the probe HTTP I/O entirely while leaving routing on its last-known state (defaults to GREEN).
  • Hysteresis. THINCLIENT_PROBE_FAILURE_THRESHOLD consecutive RED cycles to flip GREEN→RED; THINCLIENT_PROBE_RECOVERY_THRESHOLD consecutive GREEN cycles to flip back — prevents routing oscillation.
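
A minimal sketch of the hysteresis invariant, assuming one boolean outcome per probe cycle. The class and field names here are hypothetical and only illustrate the counting rule; this is not the PR's EndpointProbeClient.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical illustration of GREEN/RED hysteresis; not the SDK implementation. */
final class ProbeHysteresisSketch {
    private final int failureThreshold;   // consecutive RED cycles before GREEN -> RED
    private final int recoveryThreshold;  // consecutive GREEN cycles before RED -> GREEN
    private final AtomicBoolean proxyHealthy = new AtomicBoolean(true); // optimistic startup
    private int consecutiveFailures;
    private int consecutiveSuccesses;

    ProbeHysteresisSketch(int failureThreshold, int recoveryThreshold) {
        if (failureThreshold < 1 || recoveryThreshold < 1) {
            throw new IllegalArgumentException("thresholds must be >= 1");
        }
        this.failureThreshold = failureThreshold;
        this.recoveryThreshold = recoveryThreshold;
    }

    /** Applies one probe-cycle outcome; synchronized because the counters are plain ints. */
    synchronized void applyCycleResult(boolean cycleGreen) {
        if (cycleGreen) {
            consecutiveFailures = 0;
            consecutiveSuccesses++;
            if (!proxyHealthy.get() && consecutiveSuccesses >= recoveryThreshold) {
                proxyHealthy.set(true);
            }
        } else {
            consecutiveSuccesses = 0;
            consecutiveFailures++;
            if (proxyHealthy.get() && consecutiveFailures >= failureThreshold) {
                proxyHealthy.set(false);
            }
        }
    }

    /** Data-plane read: a single in-memory boolean, no I/O. */
    boolean isProxyHealthy() {
        return proxyHealthy.get();
    }
}
```

With both thresholds at 1 (the defaults listed under Configuration), every cycle outcome flips the gate immediately; larger values trade detection latency for stability.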

Call sequence

sequenceDiagram
    autonumber
    participant App as User App
    participant Client as CosmosClient<br/>(RxDocumentClientImpl)
    participant GEM as GlobalEndpointManager
    participant Probe as EndpointProbeClient
    participant H2 as ThinClient HTTP/2<br/>HttpClient
    participant Proxy as ThinClient<br/>Endpoint (Gateway V2)
    participant V1 as Gateway V1<br/>(fallback)

    rect rgba(200,220,255,0.25)
    Note over App,Probe: Bootstrap (one-time)
    App->>Client: new CosmosClient(...)
    Client->>Client: useThinClient = config.isThinClientEnabled()  (now defaults true)
    Client->>GEM: setThinClientHttpClient(reactorHttpClient)
    GEM->>Probe: new EndpointProbeClient(httpClient)
    Note over Probe: proxyHealthy = true (optimistic)
    end

    rect rgba(200,255,210,0.25)
    Note over GEM,Proxy: Topology refresh (every refresh tick)
    GEM->>GEM: refreshLocationPrivateAsync()
    GEM->>GEM: runThinClientProbeCycleMono()
    GEM->>Probe: runProbeCycle(thinClientRegionalEndpoints)
    par Per regional endpoint
        Probe->>H2: POST <endpoint>/connectivity-probe
        H2->>Proxy: HTTP/2 frame
        Proxy-->>H2: 200 OK (or error / timeout)
        H2-->>Probe: ProbeResult(GREEN|RED)
    end
    Probe->>Probe: applyCycleResult() — apply hysteresis<br/>update proxyHealthy
    Note right of Probe: ALL endpoints 200 → GREEN<br/>any non-200 / timeout → RED<br/>flip only after N consecutive
    end

    rect rgba(255,235,200,0.25)
    Note over App,V1: Data-plane / QueryPlan request
    App->>Client: documents().readItem(...) / queryItems(...)
    Client->>Client: useThinClientStoreModel(request)
    Client->>GEM: isProxyProbeHealthy()
    GEM->>Probe: isProxyHealthy()
    Probe-->>GEM: true (GREEN) / false (RED)
    GEM-->>Client: gate result
    alt useThinClient && hasThinClientReadLocations && proxyHealthy && eligible
        Client->>Proxy: route through ThinClient (Gateway V2)
        Proxy-->>Client: response
    else gate failed
        Client->>V1: route through Gateway V1
        V1-->>Client: response
    end
    Client-->>App: response (transparent fallback)
    end
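
The last phase of the diagram reduces to a four-way AND. A rough sketch, assuming the four inputs are already resolved to booleans; the enum and method names are hypothetical and do not mirror RxDocumentClientImpl.

```java
/** Hypothetical sketch of the routing decision in the diagram's final phase. */
final class RoutingGateSketch {
    enum Route { THIN_CLIENT_V2, GATEWAY_V1 }

    static Route chooseRoute(boolean useThinClient,
                             boolean hasThinClientReadLocations,
                             boolean proxyHealthy,
                             boolean eligible) {
        // Any false input sends the request to Gateway V1; no probe I/O happens here.
        boolean gate = useThinClient && hasThinClientReadLocations && proxyHealthy && eligible;
        return gate ? Route.THIN_CLIENT_V2 : Route.GATEWAY_V1;
    }
}
```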

What's changing

| Area | Change |
|---|---|
| Default | COSMOS.THINCLIENT_ENABLED defaults to true |
| New class | com.azure.cosmos.implementation.EndpointProbeClient: probe lifecycle, hysteresis, single-flight via cycleInProgress CAS, never-throw contract, body-drain inside the Mono lifecycle so reactor-netty releases pooled connections |
| Wiring | GlobalEndpointManager.setThinClientHttpClient(...) invoked from RxDocumentClientImpl when useThinClient is true; runThinClientProbeCycleMono() chained into refreshLocationPrivateAsync |
| Routing gate | RxDocumentClientImpl.shouldUseThinClientStoreModel(...) now ANDs in isProxyProbeHealthy(); stored procedures (isExecuteStoredProcedureBasedRequest()) explicitly eligible |
| Diagnostics | EndpointProbeClient.DiagnosticsSnapshot exposes last cycle result, consecutive-failure/success counters, last-state-change timestamp; reachable via GlobalEndpointManager.getThinClientProbeDiagnostics() |
| Safeguard | If hasThinClientReadLocations=true but the resolved endpoint set is empty (eligibility/resolution mismatch), forceUnhealthy(...) pins the gate to RED so traffic falls back to Gateway V1 |

Configuration

| Property | Env var | Default | Purpose |
|---|---|---|---|
| COSMOS.THINCLIENT_ENABLED | COSMOS_THINCLIENT_ENABLED | true (changed) | Master switch for thin-client routing |
| COSMOS.THINCLIENT_PROBE_ENABLED | COSMOS_THINCLIENT_PROBE_ENABLED | true | Kill switch for probe HTTP I/O. When false, no probes fire; routing stays at last-known state (defaults to GREEN/optimistic) |
| COSMOS.THINCLIENT_PROBE_FAILURE_THRESHOLD | COSMOS_THINCLIENT_PROBE_FAILURE_THRESHOLD | 1 | Consecutive RED cycles before GREEN→RED |
| COSMOS.THINCLIENT_PROBE_RECOVERY_THRESHOLD | COSMOS_THINCLIENT_PROBE_RECOVERY_THRESHOLD | 1 | Consecutive GREEN cycles before RED→GREEN |
| COSMOS.THINCLIENT_PROBE_PATH | COSMOS_THINCLIENT_PROBE_PATH | /connectivity-probe | Probe URI path |
| COSMOS.THINCLIENT_CONNECTION_TIMEOUT_IN_MS | COSMOS_THINCLIENT_CONNECTION_TIMEOUT_IN_MS | (existing) | Per-probe deadline |
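
For readers wiring these knobs in a test harness, here is a small sketch of how a property / env-var pair with a default could be resolved. The precedence shown (system property, then environment variable, then default) is an assumption for illustration; the PR does not state how the SDK orders them, and ProbeConfigSketch is a hypothetical helper, not an SDK class.

```java
import java.util.function.Function;

/** Hypothetical resolver; the precedence order is an assumption, not confirmed SDK behaviour. */
final class ProbeConfigSketch {
    static String resolve(String property, String envVar, String defaultValue,
                          Function<String, String> env) {
        String fromProperty = System.getProperty(property);
        if (fromProperty != null && !fromProperty.isEmpty()) {
            return fromProperty;
        }
        String fromEnv = env.apply(envVar);
        if (fromEnv != null && !fromEnv.isEmpty()) {
            return fromEnv;
        }
        return defaultValue;
    }

    /** Default of 1 taken from the table above. */
    static int failureThreshold() {
        return Integer.parseInt(resolve(
            "COSMOS.THINCLIENT_PROBE_FAILURE_THRESHOLD",
            "COSMOS_THINCLIENT_PROBE_FAILURE_THRESHOLD",
            "1",
            System::getenv));
    }

    /** Default of true taken from the table above. */
    static boolean probeEnabled() {
        return Boolean.parseBoolean(resolve(
            "COSMOS.THINCLIENT_PROBE_ENABLED",
            "COSMOS_THINCLIENT_PROBE_ENABLED",
            "true",
            System::getenv));
    }
}
```

Passing the environment lookup as a function keeps the resolver testable without mutating the process environment.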

Testing

  • Unit tests: EndpointProbeClientTests, ThinClientProbeWiringTests
  • E2E: ThinClientStoredProcedureE2ETest, ThinClientQueryE2ETest, Http2PingKeepaliveTest (probe disabled in pure-Gateway-V1 branch only — see review item 4)
  • Live validation against endpoint-probe-49437 (NEU, single-region thin-client-enabled account): 20-minute Mixed 90/9/1 benchmark with -DCOSMOS.HTTP2_ENABLED=true -DCOSMOS.THINCLIENT_ENABLED=true confirms probes fire (Http2PingHandler installed + ThinClient log emissions verified in process logs); 196/199 thinclient profile tests pass.

Independent benchmark validation (probe-off c-sweep, 2026-06-13)

Second-source perf comparison vs main on a different account (thin-client-mr-eventual-ci, WCUS, multi-region Eventual — distinct from the endpoint-probe-49437 NEU account in the existing Testing section) and a different VM (Standard_D4s_v5, westcentralus, colocated). Probe gate disabled (rationale below).

Setup

  • VM: Standard_D4s_v5 (4 vCPU, 16 GiB), Ubuntu 22.04, westcentralus
  • Account: thin-client-mr-eventual-ci
  • SDK flags: -DCOSMOS.THINCLIENT_ENABLED=true -DCOSMOS.HTTP2_ENABLED=true -DCOSMOS.THINCLIENT_PROBE_ENABLED=false
  • Refs: PR head be37e742c36 vs main HEAD 175199ec727
  • Workload: BenchmarkConfig JSON, connectionMode=GATEWAY, consistencyLevel=Eventual, maxRunningTimeDuration=PT1H, ReadThroughput
  • Routed via :10250 (ThinClientStoreModel) on every run — confirmed in benchmark.log

Results

| Concurrency | main | PR #49437 | Δ |
|---|---|---|---|
| 4 | 2,439 ops/s · 1.59 ms | 2,160 ops/s · 1.80 ms | PR −11.4 % (single-run variance, see below) |
| 20 | 9,882 ops/s · 1.98 ms | 9,878 ops/s · 1.98 ms | tied (−0.04 %) |
| 50 | 13,358 ops/s · 3.46 ms | 13,714 ops/s · 3.46 ms | PR +2.7 % |

Health (every run)

  • 0 FATAL / 0 ERROR (workload phase) / 0 caught exceptions / 0 non-2xx
  • Heap flat (4.9 GB peak / 8 GB max), no thread/FD leak, GC < 0.04 % CPU
  • Toolkit metrics-check.json allPassed=true for all 6 runs

Conclusion

No actionable perf regression at the meaningful concurrency points (c=20, c=50). The c=4 −11.4 % is single-run variance — it did not reproduce at c=20 (tied) and went the other way at c=50 (PR +2.7 %).
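
The throughput deltas can be recomputed from the ops/s figures in the Results table. A short sketch, assuming Δ is defined as (PR − main) / main; the class name is hypothetical.

```java
import java.util.Locale;

/** Recomputes the relative throughput change of the PR build against main. */
final class ThroughputDeltaSketch {
    static double percentDelta(double mainOpsPerSec, double prOpsPerSec) {
        return (prOpsPerSec - mainOpsPerSec) / mainOpsPerSec * 100.0;
    }

    public static void main(String[] args) {
        // Inputs are the ops/s values from the Results table.
        System.out.println(String.format(Locale.ROOT, "c=4   %.1f %%", percentDelta(2439, 2160)));   // -11.4 %
        System.out.println(String.format(Locale.ROOT, "c=20  %.2f %%", percentDelta(9882, 9878)));   // -0.04 %
        System.out.println(String.format(Locale.ROOT, "c=50  %.1f %%", percentDelta(13358, 13714))); // 2.7 %
    }
}
```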

Caveat — probe disabled (COSMOS.THINCLIENT_PROBE_ENABLED=false)

The server-side enableConnectivityProbe federation flag is currently OFF on every PROD *-fe federation our test accounts live on. Verified via the CosmosDB repo (code default false in FederationConfiguration.ThinProxy.cs:446; no Phase-2 enablement PR has merged on either releases/EN20260506 or releases/EN20260409). The proxy returns 503 to POST /connectivity-probe, the SDK probe gate (correctly) marks UNHEALTHY on cycle 1, and data plane falls back to Gateway V1 — meaning we would measure fallback throughput, not thin-client throughput.

To exercise the actual :10250 data path I disabled the SDK-side probe. The probe gate behavior itself should be re-validated on a federation where the server flag is true (the test4 fleet that server PR 2107592 validated against) before this client PR ships to customers — otherwise customers will see Gateway V1 fallback in every region where Phase 2 has not rolled out yet.

Logs

Result directories preserved on the benchmark VM at /home/azureuser/azure-sdk-for-java/sdk/cosmos/azure-cosmos-benchmark/results/20260613-SIMPLE-{main,PR-49437}-{commit}-c{4,20,50}/. Each contains benchmark.log, gc.log, monitor.csv, workload-config.json, metrics-check.json, and per-metric Micrometer CSVs.

jeet1995 and others added 30 commits January 20, 2026 18:20
… QueryPlan proxy routing

Add RNTBD token mappings for x-ms-cosmos-supported-query-features (0x002B)
and x-ms-cosmos-query-version (0x002C) so the thin client proxy can read
these values from the RNTBD body when processing QueryPlan requests.

Previously these headers were only set as HTTP headers by QueryPlanRetriever
and were lost when QueryPlan was routed through the proxy path, since
ThinClientStoreModel serializes requests as RNTBD (not HTTP headers).

IDs match server-side proxy definitions per ADO PR 1982503.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add testThinClientChangeFeedFullRange covering FeedRange.forFullRange()
across multiple partition keys, and testThinClientChangeFeedPartitionKey
covering FeedRange.forLogicalPartition with exact doc count + PK validation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Documents all 59 thin client E2E tests across query (50), point operations (3),
change feed (3), and stored procedures (3) with SQL, query features covered,
and known account-side blockers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… QueryPlan proxy routing

Add RNTBD token mappings for x-ms-cosmos-supported-query-features (0x00F0)
and x-ms-cosmos-query-version (0x00F1) so the thin client proxy can read
these values from the RNTBD body when processing QueryPlan requests.

IDs are provisional (0x00F0, 0x00F1) — must be coordinated with server-side
proxy team. See ADO PR 1982503 for the proxy-side design.

Note: The design doc listed 0x002B/0x002C but those are already assigned to
PartitionKey/PartitionKeyRangeId in the Java SDK. Using 0x00F0/0x00F1 to
avoid ID collision until final server-side IDs are assigned.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…BD instructions

- Fix testGetCurrentDateTime: assert ISO 8601 format instead of exact match
  (gateway and proxy return slightly different timestamps)
- Add DefaultAzureCredential support via COSMOS.USE_AAD_AUTH system property
  for accounts with disableLocalAuth=true
- Add RNTBD class reference as .github/instructions/rntbd.instructions.md
- Add pom.xml system properties for THINCLIENT_ENABLED, HTTP2_ENABLED, USE_AAD_AUTH
- Add beforeSuiteReuse mode for degraded accounts

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Switch baseline from Gateway V1 to Direct TCP to avoid JVM config
  interference (THINCLIENT_ENABLED/HTTP2_ENABLED affect Gateway V1)
- Assert :10250 endpoint only on Gateway V2 results (not baseline)
- Rename helpers: assertDirectAndThinClientMatch (was gateway)
- Document seedTestData schema in Javadoc
- Remove 'Expected to fail' comments (account has vector search enabled)
- Clean up class/method Javadoc

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

Member Author

Build 6424287 went from 49+/2 distinct test failures down to two new test-only failure patterns, both rooted in the same cause — the probe gate (enabled by default, threshold=1) is flipping data-plane routing from the thin-client proxy to Gateway V1 in test accounts whose proxy has not yet deployed the /connectivity-probe endpoint.

New failures (both fixed in ad9e3df):

  1. CosmosNotFoundTests.performBulkOnDeletedContainerWithGatewayV2 (45 failures, log 1986) — asserts response substatus 1003 from the thin-client routing path, observed 0 because requests went to Gateway V1.
  2. PerPartitionCircuitBreakerE2ETests.*Gateway (26 failures, log 2002) — TestSuiteBase.assertThinClientEndpointUsed could not find any request whose endpoint contained :10250/.

Fix: Disable the probe by default in TestSuiteBase's static initializer (only when the property is not already set), so all E2E tests inherit deterministic configuration-driven routing. Dedicated probe tests (EndpointProbeClientTests, ThinClientProbeWiringTests) set the property explicitly in @BeforeMethod and are unaffected. The per-class override in Http2PingKeepaliveTest is now redundant and was removed (its @AfterClass clear would have re-enabled the probe for any later E2E test sharing the JVM).

No production impact — these are test-environment changes. The probe still defaults to true in production via the DEFAULT_THINCLIENT_PROBE_ENABLED config; customers running with the property unset will get the probe enabled.

Remaining single-shot failures (likely environmental, not probe-related):

  • CosmosTracerTest.cosmosAsyncDatabase ThreadTimeoutException at 40s (log 4742) — Direct TCP mode, probe doesn't gate Direct routing. Watching next run.
  • DocumentQuerySpyWireContentTest.before_DocumentQuerySpyWireContentTest 429 RequestRateTooLarge (log 2251) — @BeforeClass setup throttling on shared test account; user-agent in the error correctly shows |F4 suffix confirming user-agent helper fix is working.
  • OrderbyDocumentQueryTest.before_OrderbyDocumentQueryTest 404 "Collection is not yet available for read" (log 2696) — proxy-side propagation race on freshly created collection.

Commit: ad9e3df

jeet1995 and others added 2 commits June 11, 2026 18:22
…PartitionCircuitBreakerE2ETests

Companion to the prior revert. The revert undid the global TestSuiteBase probe
disable (which masked production behaviour). This commit adds the necessary
per-class disable to the two test classes whose assertions explicitly require
thinclient routing: CosmosNotFoundTests (thinclient group) and
PerPartitionCircuitBreakerE2ETests (fi-thinclient-multi-master group). Both
clear the property in their @AfterClass. Http2PingKeepaliveTest already has
its own disable (restored by the revert). Production callers continue to get
the connectivity probe ON by default with the production failure threshold.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

Member Author

Update on the CI-failure fix for build 6424287

My earlier comment proposed disabling the connectivity probe globally in TestSuiteBase. That was the wrong call --- the probe is ON by default in production (default failure threshold = 1), and tests should reflect production behaviour rather than mask it.

What I just pushed (commits e381f3a4a26 revert + da6c6983b80):

  1. Reverted the global probe-disable in TestSuiteBase so production-equivalent defaults are restored for the broad test population.
  2. Added a per-class probe-disable only to the three test classes whose assertions explicitly require the data plane to land on the proxy:
    • CosmosNotFoundTests (thinclient group) --- asserts OWNER_RESOURCE_NOT_EXISTS / assertThinClientEndpointUsed.
    • PerPartitionCircuitBreakerE2ETests (fi-thinclient-multi-master group) --- asserts assertThinClientEndpointUsed for gateway-mode traffic.
    • Http2PingKeepaliveTest --- installs iptables DROP on port 10250, which would also defeat the probe before the PING handler fires. (Disable restored by the revert.)

Each @BeforeClass sets COSMOS.THINCLIENT_PROBE_ENABLED=false before client construction and the corresponding @AfterClass clears it so the JVM's other test classes are not polluted. Every other test class continues to run with the production default (probe ON, threshold = 1).

This keeps the rest of the suite honest about default behaviour while preventing the three classes that need a deterministic proxy route from failing while the proxy-side /connectivity-probe endpoint is still rolling out across CI accounts.
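
The set-before / clear-after lifecycle described above can be sketched in plain Java, without the TestNG annotations. ScopedSystemProperty is a hypothetical helper, not part of the test suite. As a variation on the comment's "clear" step, it restores whatever value was present before, so a class that runs after it sees the same configuration it would have seen otherwise.

```java
/** Hypothetical helper: sets a system property and restores the prior state on close. */
final class ScopedSystemProperty implements AutoCloseable {
    private final String key;
    private final String previous; // null means the property was unset before

    ScopedSystemProperty(String key, String value) {
        this.key = key;
        // System.setProperty returns the previous value, or null if there was none.
        this.previous = System.setProperty(key, value);
    }

    @Override
    public void close() {
        if (previous == null) {
            System.clearProperty(key);
        } else {
            System.setProperty(key, previous);
        }
    }
}
```

Usage would look like `try (ScopedSystemProperty p = new ScopedSystemProperty("COSMOS.THINCLIENT_PROBE_ENABLED", "false")) { ... }`, with client construction inside the block so the override is visible when the client reads its configuration.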

jeet1995 and others added 8 commits June 11, 2026 22:47
… into AzCosmos_GatewayV2_QueryPlanSupport

# Conflicts:
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdConstants.java
Covers 5 new scenarios in ThinClientRoutingGateTests:
- ExecuteStoredProcedure on a StoredProcedure resource routes to thin client
- Non-execute StoredProcedure ops (Create) route to Gateway V1
- OperationType.QueryPlan routes to thin client
- QueryPlan returns false when probe is unhealthy
- ExecuteStoredProcedure returns false when probe is unhealthy

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Item 3: remove unused COSMOS.USE_AAD_AUTH system property from the thinclient profile in azure-cosmos-tests/pom.xml. The Maven property cosmos.use.aad.auth is not defined anywhere in the repo, so the substitution produced a literal that Boolean.parseBoolean reads as false; TestSuiteBase already falls back to the COSMOS_USE_AAD_AUTH env var.

Item 4: in Http2PingKeepaliveTest, gate the THINCLIENT_PROBE_ENABLED=false override behind !THIN_CLIENT_ENABLED so the probe is disabled only in the pure Compute Gateway (Gateway V2) branch on port 443, where thin-client routing is off entirely and probe POSTs to port 10250 are pure noise. In the thin-client branch (port 10250) the probe is intentionally left enabled so the same production code path that gates thin-client routing is exercised. Mirror the conditional in afterClass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

Member Author

Deep review summary

Item 1 — Stored Procedure routing on thin-client ✅

Supported. RxDocumentClientImpl.shouldUseThinClientStoreModel (lines 9023–9047) routes isExecuteStoredProcedureBasedRequest() via the early-return on line 9033 and the inclusion list on line 9045. New ThinClientStoredProcedureE2ETest covers it end-to-end.

Item 2 — Mono.<Void>empty() at GlobalEndpointManager.refreshLocationPrivateAsync (lines 322, 337)

Necessary, not stylistic. The Mono.defer(...) lambda has three return paths (cached Mono, deferred Mono, Mono.empty()). Without the type witness, Mono.empty() infers Mono<Object> because Mono is invariant in its type parameter — and the outer method returns Mono<Void>, so the lambda fails to type-check. Suggestion: keep the witness; a one-line comment ("type witness needed; Mono is invariant") would make this self-documenting.
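
An analogous case using only the JDK, since Reactor is not needed to see the effect. This sketch assumes nothing about the SDK code; it only shows how an explicit type witness pins the type argument when the target type cannot reach the generic factory call. TypeWitnessSketch is a hypothetical name.

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

/** JDK-only analogue of the type-witness point; not the SDK's code. */
final class TypeWitnessSketch {
    static Stream<String> names(List<String> cached) {
        if (cached != null) {
            return cached.stream();
        }
        // Without the witness, Collections.emptyList().stream() is a Stream<Object>,
        // which is not assignable to Stream<String> because generic types are invariant.
        return Collections.<String>emptyList().stream();
    }
}
```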

Item 3 — azure-cosmos-tests/pom.xml orphan AAD line ✅ fixed

The thinclient profile referenced ${cosmos.use.aad.auth}, a Maven property defined nowhere in the repo (grep across all pom.xmls confirms). It would either no-op or interpolate to a literal. Removed in 56efc37.

Item 4 — Http2PingKeepaliveTest probe disable ✅ fixed

Probe disable now gated if (!THIN_CLIENT_ENABLED) in both @BeforeClass and @AfterClass. Rationale captured in the source comment: port 10250 = thin-client, port 443 = pure Compute Gateway.

  • Pure CG branch (THIN_CLIENT_ENABLED=false): probe POSTs to 10250 are pure noise — they cannot influence routing and only add traffic/log clutter, so disabling is correct.
  • Thin-client branch (THIN_CLIENT_ENABLED=true): probe stays enabled so the production probe gate is exercised by the test.

Applied in 56efc37.

Item 5 — COSMOS.THINCLIENT_PROBE_ENABLED kill-switch ✅ validated

Works as intended: when false, EndpointProbeClient.maybeRefreshAsync() short-circuits to Mono.empty() (lines 109–127) so zero probe HTTP I/O occurs. Routing then keeps using whatever proxyHealthy is set to (default true, line 71) — so the switch kills the probe wire calls without disabling thin-client routing. That is the correct kill-switch semantic.

FYI (non-blocking): forceUnhealthy() (line 177) flips proxyHealthy = false with no recovery counterpart. Today it is only invoked from the probe path itself, so disabling the probe makes that path unreachable — safe by construction. Worth a one-line comment near forceUnhealthy documenting that invariant, in case anyone ever calls it from a different code path with the probe disabled.


Items 3 and 4 are pushed as 56efc37 on this branch.

jeet1995 and others added 3 commits June 13, 2026 23:12
- QueryPlanRetriever: clarify thin-client / Gateway V1 routing comment;
  explain that Gateway V1 is pinned only when PartitionKeyDefinition is
  unavailable to convert proxy queryRanges to EPK hex ranges.
- RxDocumentClientImpl + DocumentQueryExecutionContextFactory: plumb the
  resolved DocumentCollection into validateCustomQueryForReadManyByPartitionKeys
  -> fetchQueryPlanForValidation so the validation query-plan request is
  thin-client-eligible instead of forcibly pinned to Gateway V1.
- EndpointProbeClient: extract DiagnosticsSnapshot and ProbeResult into
  top-level types (EndpointProbeDiagnosticsSnapshot, EndpointProbeResult);
  remove @SuppressWarnings("unused"); use the failure reason in RED-cycle
  logs and toString(); verbose NPE message in the constructor naming the
  required dependency and wiring entry point.
- CHANGELOG: move thin-client entry under Features Added; one-line summary
  with HTTP/2 + gatewayMode requirement and PR link.
- CosmosNotFoundTests: add forceHttp1IfGatewayMode helper applied to every
  fast-group client construction so Gateway-mode clients in the fast group
  use HTTP/1.1 and are guaranteed not to route through the proxy. Keep the
  COSMOS.THINCLIENT_PROBE_ENABLED=false set/clear for the thinclient group's
  defensive teardown.
- PerPartitionCircuitBreakerE2ETests: drop the THINCLIENT_PROBE_ENABLED
  property mutation now that probe behavior is asserted only by tests that
  opt in explicitly.
- Add ReadManyByPartitionKeyQueryPlanRoutingTest unit tests that pin the useGatewayMode gate in QueryPlanRetriever: gateway mode when DocumentCollection is null and partitioned mode when a PartitionKeyDefinition is present.

- Add three readManyByPartitionKeys E2E tests to ThinClientQueryE2ETest that exercise the validation QueryPlan path through Direct TCP (baseline) and Gateway V2 (thin client), covering no-custom-query, projection+filter, and parameterized variants. Each thin-client diagnostics page is asserted to use the :10250 endpoint.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

…ill switch in forceUnhealthy

Stage 1990 of build 6432767 had 83 failures in ClientRetryPolicyE2ETestsWithGatewayV2#serviceUnavailableWithGatewayV2 because the probe gate added by PR Azure#49437 flips routing from :10250 to :443 when CI test accounts lack /connectivity-probe (failureThreshold=1 trips proxyHealthy=false on the first failed probe and the assertion at TestSuiteBase#assertThinClientEndpointUsed then sees :443 and fails).

Test fix (per CosmosNotFoundTests precedent, commit 3b392e3):

  - ClientRetryPolicyE2ETestsWithGatewayV2: disable probe in @BeforeClass / clear in @AfterClass.
  - ThinClientTestBase: same lifecycle in enableThinClientForTest / clearThinClientForTest so all ThinClient*E2ETest subclasses inherit the guard.

Source fix: honor COSMOS.THINCLIENT_PROBE_ENABLED kill switch inside EndpointProbeClient.forceUnhealthy(), closing the item-5 invariant gap (previously only runProbeCycle() checked the flag, so a mismatch-driven forceUnhealthy could mutate proxyHealthy even with the probe disabled).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

When COSMOS.THINCLIENT_PROBE_ENABLED=false, no probe cycles run, so the
probe-health signal is stale/meaningless. Treat probe as healthy in that
case so the gate is driven solely by COSMOS.THINCLIENT_ENABLED (already
folded into useThinClient at construction). This fixes regressions in CI
test classes (GatewayReadConsistencyStrategyE2ETest,
FaultInjectionWithAvailabilityStrategyTestsBase, ThinClientQueryE2ETest)
where probe to /connectivity-probe fails on test accounts and flips routing
to :443 instead of :10250.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
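
The two kill-switch changes described in the commit messages above (forceUnhealthy honoring the switch, and the health read reporting healthy while the probe is disabled) can be pictured with a small sketch. KillSwitchGateSketch and its BooleanSupplier stand-in for the COSMOS.THINCLIENT_PROBE_ENABLED lookup are hypothetical; the real EndpointProbeClient may structure this differently.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a probe gate that honors a kill switch on both write and read. */
final class KillSwitchGateSketch {
    private final BooleanSupplier probeEnabled; // stands in for the config lookup
    private final AtomicBoolean proxyHealthy = new AtomicBoolean(true);

    KillSwitchGateSketch(BooleanSupplier probeEnabled) {
        this.probeEnabled = probeEnabled;
    }

    /** Pins the gate to unhealthy, unless the probe is disabled. */
    void forceUnhealthy() {
        if (!probeEnabled.getAsBoolean()) {
            return; // kill switch active: leave state untouched
        }
        proxyHealthy.set(false);
    }

    /** With the probe disabled the stored signal is stale, so report healthy. */
    boolean isProxyProbeHealthy() {
        if (!probeEnabled.getAsBoolean()) {
            return true;
        }
        return proxyHealthy.get();
    }
}
```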
@jeet1995

Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

…onsistency tests

Adds COSMOS.THINCLIENT_PROBE_ENABLED=false in @BeforeClass (and clears in @AfterClass) for PerPartitionCircuitBreakerE2ETests, PerPartitionAutomaticFailoverE2ETests, FaultInjectionServerErrorRuleOnGatewayV2Tests, and GatewayReadConsistencyStrategyE2ETest.

These tests run against CI accounts whose /connectivity-probe endpoint is not deployed. Without this, the first failed probe trips proxyHealthy=false and routing falls back to Gateway V1 (port 443), causing assertThinClientEndpointUsed (which expects port 10250) to fail.

Locally validated PPAF (16 tests, 0 failures), GRCS (3 tests, 0 failures), FI (3 tests, 0 failures). PPCB still in progress at commit time; pushing to let local + CI run in parallel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

QueryPlan requests intentionally carry no RCS/CL headers (matches the V1
HTTP behavior). When the V2 thin-client routes the QueryPlan precursor
through the same :10250 endpoint as the data query, the spy must skip
the QueryPlan frame so the assertion checks the actual data-query frame.

This mirrors the IS_QUERY_PLAN_REQUEST filter on the V1 path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995

Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
