client: avoid finishing a stream twice when stream creation fails #9191

Open
utkuozdemir wants to merge 1 commit into grpc:master from utkuozdemir:fix/idle-double-finish-newattempt
@utkuozdemir

In `clientStream.withRetry`, when `newAttemptLocked` fails before an attempt is created, `withRetry` called `cs.finish` and then returned the error. `newClientStream` also runs `endOfClientStream` from its deferred cleanup on an error return, so the stream was finished twice and every `grpc.OnFinish` callback ran twice for a single RPC, breaking its documented "called only once" contract.
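The double finish can be reproduced in a few lines with a simplified model. This is only a sketch: the `stream` type, `newStream`, and `countFinishCalls` below are hypothetical stand-ins and not grpc-go's actual code. It shows a failure path that finishes the stream itself while the constructor's deferred cleanup finishes it again.

```go
package main

import (
	"errors"
	"fmt"
)

// stream is a hypothetical stand-in for a client stream.
// It is not grpc-go's clientStream.
type stream struct {
	onFinish []func(error)
}

// finish runs every registered callback. It has no guard against
// being called more than once.
func (s *stream) finish(err error) {
	for _, cb := range s.onFinish {
		cb(err)
	}
}

var errAttempt = errors.New("attempt creation failed")

// withRetry models the old failure path: it finishes the stream
// itself and then returns the error.
func (s *stream) withRetry() error {
	s.finish(errAttempt)
	return errAttempt
}

// newStream models a constructor whose deferred cleanup finishes
// the stream on any error return.
func newStream(cb func(error)) (_ *stream, err error) {
	s := &stream{onFinish: []func(error){cb}}
	defer func() {
		if err != nil {
			s.finish(err)
		}
	}()
	if err = s.withRetry(); err != nil {
		return nil, err
	}
	return s, nil
}

// countFinishCalls reports how often the callback ran for one failed call.
func countFinishCalls() int {
	calls := 0
	_, _ = newStream(func(error) { calls++ })
	return calls
}

func main() {
	fmt.Println("callback invocations:", countFinishCalls()) // 2 in this model
}
```

In this model the callback count is 2 for a single failed call, which is the "ran twice for a single RPC" symptom described above.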

One of those callbacks is the idleness manager's `OnCallEnd`. The extra `OnCallEnd` pushes the manager's int32 `activeCallsCount` one below its `-math.MaxInt32` idle sentinel and corrupts the counter. After that the channel can get permanently stuck in IDLE, and later RPCs fail with "context deadline exceeded while waiting for connections to become ready" even though the backend is reachable. It happens when RPCs are canceled around the idle-timeout boundary.

On this path no attempt is created, so there is nothing to finish: the context is canceled by `newClientStreamWithParams`'s deferred cleanup, and `newClientStream`'s deferred `endOfClientStream` finishes the stream exactly once. Replace the `cs.finish` call with `cs.commitAttemptLocked` so the stream is still committed, releasing resources held for retries such as the config selector's `OnCommitted` callback, without being finished a second time. This mirrors the `retryLocked` give-up path.
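The shape of the change can be sketched the same way: the failure path commits the stream but leaves finishing to the constructor's deferred cleanup. Again this is a hypothetical model (`stream`, `commit`, `open` are illustrative names), not the actual patch.

```go
package main

import (
	"errors"
	"fmt"
)

// stream is a hypothetical model, not grpc-go's clientStream.
type stream struct {
	committed bool
	commits   int // times the commit hook ran
	finishes  int // times the stream was finished
}

// commit releases retry-related resources once; repeated calls are no-ops.
func (s *stream) commit() {
	if s.committed {
		return
	}
	s.committed = true
	s.commits++
}

// finish records that the stream was finished.
func (s *stream) finish() {
	s.finishes++
}

var errAttempt = errors.New("attempt creation failed")

// withRetry models the failure path after the change:
// commit the stream, but do not finish it.
func (s *stream) withRetry() error {
	s.commit()
	return errAttempt
}

// open models a constructor whose deferred cleanup is the only
// place that finishes the stream on an error return. It returns the
// stream even on error so the counters can be inspected.
func open() (s *stream, err error) {
	s = &stream{}
	defer func() {
		if err != nil {
			s.finish()
		}
	}()
	err = s.withRetry()
	return s, err
}

func main() {
	s, err := open()
	fmt.Println("error:", err)
	fmt.Println("commits:", s.commits, "finishes:", s.finishes)
}
```

In this model the commit hook and the finish each run exactly once for the failed call.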

Add a regression test that fails before this change and passes after it.

RELEASE NOTES:

  • client: Fix a bug where a ClientConn could get permanently stuck in IDLE after an RPC was canceled while the channel was exiting idle mode

@utkuozdemir

Author

For transparency: I relied heavily on an LLM, namely Opus 4.8, to reproduce and fix this issue.

@codecov

codecov Bot commented Jun 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.20%. Comparing base (5c7f936) to head (19d6e56).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9191      +/-   ##
==========================================
+ Coverage   83.09%   83.20%   +0.11%     
==========================================
  Files         419      419              
  Lines       33858    33858              
==========================================
+ Hits        28134    28172      +38     
+ Misses       4291     4264      -27     
+ Partials     1433     1422      -11     
| Files with missing lines | Coverage Δ |
|---|---|
| stream.go | 81.73% <100.00%> (ø) |

... and 21 files with indirect coverage changes

utkuozdemir added a commit to utkuozdemir/sidero-omni that referenced this pull request on Jun 19, 2026:
Omni could stop responding to all API requests after a period without API activity, leaving the web UI on a blank page while static assets still loaded, and required a restart to recover.

The connections that carry the API gateway and internal proxy traffic run in-process and relied on gRPC idle mode, which tears down and later re-establishes a connection that has been inactive. A bug in gRPC could leave such a connection permanently stuck once it went idle, so every subsequent request blocked forever.

Idle mode provides no benefit for these connections because they are in-process and live for the whole lifetime of the process, so disable it. This avoids the stuck connection and also removes the latency of re-establishing it on the first request after a quiet period.

A fix for the underlying gRPC issue has been submitted upstream in grpc/grpc-go#9191, but disabling idle mode here is the correct configuration regardless.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
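
The commit message does not show how idle mode was disabled. One way to do it in grpc-go, shown here only as a configuration sketch, is the `grpc.WithIdleTimeout` dial option, where a zero duration disables idleness. The target address and the insecure credentials are placeholders for illustration, not taken from the commit.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// "localhost:50051" is a placeholder target for this sketch.
	conn, err := grpc.NewClient("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// A zero idle timeout disables idle mode for this channel.
		grpc.WithIdleTimeout(0),
	)
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}
	defer conn.Close()
}
```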
utkuozdemir force-pushed the fix/idle-double-finish-newattempt branch from 19d6e56 to c33f3be on June 19, 2026 21:45
@utkuozdemir

Author

@Pranjali-2501 gentle nudge: could you or another project member take a look at this when you get a chance? Just in case it was missed.
