Skip to content

xds/internal: fix xds security-config race flaky test#9183

Open
mbissa wants to merge 2 commits into
grpc:masterfrom
mbissa:fix-xds-security-config-race
Open

xds/internal: fix xds security-config race flaky test#9183
mbissa wants to merge 2 commits into
grpc:masterfrom
mbissa:fix-xds-security-config-race

Conversation

@mbissa

@mbissa mbissa commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Fixes #9015

Summary

Avoid recreating and closing xDS certificate provider wrappers when the incoming security configuration has not changed. This fixes a race condition where asynchronous connection handshakes fail because their active certificate provider wrappers are prematurely closed.

Root Cause

During a client connection handshake, the transport credentials layer retrieves trusted roots and identity certificates using wrappers (singleCloseWrappedProvider) around the active certificate providers.

If the balancer receives an xDS cluster configuration update while a handshake is in progress—even if the security configuration remains unchanged—it unconditionally recreated the provider wrappers and called Close() on the old ones. This swapped the active provider reference in the old wrapper to a closedProvider, causing the handshake to fail with provider instance is closed.

Fix

  • Cache the active SecurityConfig in clusterImplBalancer as currentSecCfg.
  • In handleSecurityConfig, compare the incoming security configuration with currentSecCfg using config.Equal. Skip recreating and closing provider wrappers if the configuration has not changed.
  • Export a test-only SetClientConnUpdateHookForTesting hook setter in clusterimpl to allow tests to synchronize on connection state update completion.

Verification

  • Implemented a deterministic regression test TestSecurityConfigUpdate_NoRaceOnSameConfig in clusterimpl_security_test.go that blocks a handshake mid-flight, updates the configuration, and verifies that the handshake succeeds. The test uses channel-based synchronization and the new update hook rather than arbitrary sleeps.
  • Verified that the regression test fails consistently without the fix and passes reliably with the fix.

RELEASE NOTES: none

@mbissa mbissa added this to the 1.83 Release milestone Jun 14, 2026
@mbissa

mbissa commented Jun 14, 2026

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the clusterImplBalancer to avoid closing active certificate providers when a configuration update is received with an unchanged security configuration, and adds a corresponding test case. The review feedback highlights a potential data race on the global clientConnUpdateHook variable, suggests a defensive nil-check for b.currentSecCfg before calling Equal, and recommends copying the global blockingProvidersChan to a local variable under a lock in the test to prevent concurrent access issues.

Comment thread internal/xds/balancer/clusterimpl/clusterimpl.go
Comment thread internal/xds/balancer/clusterimpl/clusterimpl.go Outdated
Comment thread internal/xds/balancer/clusterimpl/tests/clusterimpl_security_test.go Outdated
Comment thread internal/xds/balancer/clusterimpl/tests/clusterimpl_security_test.go Outdated
@codecov

codecov Bot commented Jun 14, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.20%. Comparing base (0f3086d) to head (1062688).
⚠️ Report is 19 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9183      +/-   ##
==========================================
- Coverage   83.28%   83.20%   -0.09%     
==========================================
  Files         418      419       +1     
  Lines       33741    33877     +136     
==========================================
+ Hits        28102    28188      +86     
- Misses       4232     4267      +35     
- Partials     1407     1422      +15     
Files with missing lines Coverage Δ
internal/xds/balancer/clusterimpl/clusterimpl.go 87.66% <100.00%> (-0.59%) ⬇️

... and 42 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@arjan-bal

Copy link
Copy Markdown
Contributor

From the maintainers chat

Similar to Java, keeping the provider alive until the handshake is complete makes sense. RPCs should not be failing when the security config is updated by the control plane

So the complete fix seems to be to gracefully close the provider. Ignoring duplicate updates doesn't address the case when the config is updated

@easwars do you think we should be avoiding RPC failures for any security config updates in general and not only for duplicates?

@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Jun 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test : ClientSideXDS_WithValidAndInvalidSecurityConfigurationSPIFFE

3 participants