client: avoid finishing a stream twice when stream creation fails#9191
Open
utkuozdemir wants to merge 1 commit into
Open
client: avoid finishing a stream twice when stream creation fails#9191utkuozdemir wants to merge 1 commit into
utkuozdemir wants to merge 1 commit into
Conversation
Author
|
For transparency: I used an LLM, namely Opus 4.8 heavily on reproducing and fixing this issue. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #9191 +/- ##
==========================================
+ Coverage 83.09% 83.20% +0.11%
==========================================
Files 419 419
Lines 33858 33858
==========================================
+ Hits 28134 28172 +38
+ Misses 4291 4264 -27
+ Partials 1433 1422 -11
🚀 New features to boost your workflow:
|
utkuozdemir
added a commit
to utkuozdemir/sidero-omni
that referenced
this pull request
Jun 19, 2026
Omni could stop responding to all API requests after a period without API activity, leaving the web UI on a blank page while static assets still loaded, and required a restart to recover. The connections that carry the API gateway and internal proxy traffic run in-process and relied on gRPC idle mode, which tears down and later re-establishes a connection that has been inactive. A bug in gRPC could leave such a connection permanently stuck once it went idle, so every subsequent request blocked forever. Idle mode provides no benefit for these connections because they are in-process and live for the whole lifetime of the process, so disable it. This avoids the stuck connection and also removes the latency of re-establishing it on the first request after a quiet period. A fix for the underlying gRPC issue has been submitted upstream in grpc/grpc-go#9191, but disabling idle mode here is the correct configuration regardless. Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
In `clientStream.withRetry`, when `newAttemptLocked` fails before an attempt is created, `withRetry` called `cs.finish` and then returned the error. `newClientStream` also runs `endOfClientStream` from its deferred cleanup on an error return, so the stream was finished twice and every `grpc.OnFinish` callback ran twice for a single RPC, breaking its documented "called only once" contract. One of those callbacks is the idleness manager's `OnCallEnd`. The extra `OnCallEnd` pushes the manager's int32 `activeCallsCount` one below its `-math.MaxInt32` idle sentinel and corrupts the counter. After that the channel can get permanently stuck in IDLE, and later RPCs fail with "context deadline exceeded while waiting for connections to become ready" even though the backend is reachable. It happens when RPCs are canceled around the idle-timeout boundary. On this path no attempt is created, so there is nothing to finish: the context is canceled by `newClientStreamWithParams`'s deferred cleanup, and `newClientStream`'s deferred `endOfClientStream` finishes the stream exactly once. Replace the `cs.finish` call with `cs.commitAttemptLocked` so the stream is still committed, releasing resources held for retries such as the config selector's `OnCommitted` callback, without being finished a second time. This mirrors the `retryLocked` give-up path. Add a regression test that fails before this change and passes after it. RELEASE NOTES: * client: Fix a bug where a ClientConn could get permanently stuck in IDLE after an RPC was canceled while the channel was exiting idle mode
19d6e56 to
c33f3be
Compare
Author
|
@Pranjali-2501 gentle nudge - can you or another member of the project have a look into this when you get a chance? Just in case this was missed. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
In
clientStream.withRetry, whennewAttemptLockedfails before an attempt is created,withRetrycalledcs.finishand then returned the error.newClientStreamalso runsendOfClientStreamfrom its deferred cleanup on an error return, so the stream was finished twice and everygrpc.OnFinishcallback ran twice for a single RPC, breaking its documented "called only once" contract.One of those callbacks is the idleness manager's
OnCallEnd. The extraOnCallEndpushes the manager's int32activeCallsCountone below its-math.MaxInt32idle sentinel and corrupts the counter. After that the channel can get permanently stuck in IDLE, and later RPCs fail with "context deadline exceeded while waiting for connections to become ready" even though the backend is reachable. It happens when RPCs are canceled around the idle-timeout boundary.On this path no attempt is created, so there is nothing to finish: the context is canceled by
newClientStreamWithParams's deferred cleanup, andnewClientStream's deferredendOfClientStreamfinishes the stream exactly once. Replace thecs.finishcall withcs.commitAttemptLockedso the stream is still committed, releasing resources held for retries such as the config selector'sOnCommittedcallback, without being finished a second time. This mirrors theretryLockedgive-up path.Add a regression test that fails before this change and passes after it.
RELEASE NOTES: