Skip to content

xds/internal : Fix Connected metric test#9181

Open
mbissa wants to merge 2 commits into
grpc:masterfrom
mbissa:fix-connected-metric-reconnection-flakiness
Open

xds/internal : Fix Connected metric test#9181
mbissa wants to merge 2 commits into
grpc:masterfrom
mbissa:fix-connected-metric-reconnection-flakiness

Conversation

@mbissa

@mbissa mbissa commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Fixes #9141

Fix flaky TestConnectedMetric_Reconnection in xDS Client metrics test

Description / Root Cause

A race condition existed in TestConnectedMetric_Reconnection between the server-side stream open event and the client-side stream establishment registration:

  1. After restarting the listener, the test awaited waitForStreamSuccess(), which unblocks when the control plane executes the OnStreamOpen callback.
  2. Immediately after, the test scraped the client's async metrics, expecting XDSClientConnected to be 1.
  3. However, the client-side runner goroutine was still completing the NewStream call stack and had not yet set the internal streamEstablished flag to true.
  4. Under CPU contention or slower goroutine scheduling, the scrape captured 0, causing the test to block indefinitely and time out.

Solution

Introduced event-based synchronization to guarantee the client has completed stream establishment before scraping the metric value:

  1. Registered an OnStreamRequest callback on the test's management server that writes to a new requestReceived channel.
  2. In the test, blocked on <-requestReceived after the stream succeeds.
  3. Since a discovery request is only sent to the server after the client runner has successfully created the stream and set the streamEstablished flag, receiving the request on the server guarantees the client-side state is updated.

This completely removes the race condition without relying on periodic polling loops or arbitrary sleeps.

Verification

  • Reproduced the failure by injecting a 100ms delay on the client runner immediately before registering stream establishment, causing 100% failure rate.
  • Verified that the test now compiles and passes successfully (even with the delay active).

RELEASE NOTES: none

@mbissa mbissa added this to the 1.83 Release milestone Jun 13, 2026
@codecov

codecov Bot commented Jun 13, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.15%. Comparing base (0f3086d) to head (1f8472c).
⚠️ Report is 19 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9181      +/-   ##
==========================================
- Coverage   83.28%   83.15%   -0.14%     
==========================================
  Files         418      419       +1     
  Lines       33741    33858     +117     
==========================================
+ Hits        28102    28155      +53     
- Misses       4232     4276      +44     
- Partials     1407     1427      +20     

see 41 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mbissa

mbissa commented Jun 14, 2026

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the TestConnectedMetric_Reconnection test in internal/xds/clients/xdsclient/test/metrics_test.go to ensure that the client's first request is received by the management server before checking connection metrics. This is implemented by introducing a requestReceived channel and utilizing the OnStreamRequest callback. There are no review comments, and the changes look correct with no further feedback to provide.

@arjan-bal arjan-bal left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test already uses too many channels for synchronization (six); adding another makes it even harder to understand the test and prove that all events are correctly synchronized. Without this confidence, we'll keep introducing race conditions and fixing them after the fact. For example, here is another racy block:

// Clear channels to ensure fresh readings.
for len(streamOpened) > 0 {
<-streamOpened
}
waitForStreamSuccess()

It waits for a channel to be cleared and then immediately reads from that same channel.

I suggest simplifying the test by reducing the number of channels and waiting for a single event that guarantees the metrics have been emitted before checking them. A straightforward approach would be to have the metrics recorder pass all received metrics to the test's closure. The closure can then use a channel or grpcsync.Event to signal when the target metrics are received.

@arjan-bal arjan-bal assigned mbissa and unassigned arjan-bal Jun 15, 2026
@mbissa mbissa force-pushed the fix-connected-metric-reconnection-flakiness branch from 4c76588 to 1f8472c Compare June 15, 2026 08:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: ConnectedMetric_Reconnection

2 participants