Describe the bug
When BlobClient.downloadToFileWithResponse(...) performs a multi-chunk parallel download and the storage service returns a retryable status (observed with HTTP 500) for one chunk, the call can hang indefinitely — it never returns a value and never throws. With the synchronous client and a null timeout it blocks forever; the worker thread is permanently parked in Mono.block.
The root cause is an ordering bug in RequestRetryPolicy.attemptAsync: on a retryable response it captures the body, closes the response, and only then subscribes to the captured body to drain it before retrying. With the default Reactor Netty client, close() discards the inbound and releases the connection back to the (JVM-wide, shared) pool, so the drain subscription races connection recycling.
To Reproduce
Deterministic unit test (isolates the ordering defect). The test uses a MockHttpResponse whose body behavior is decided at subscription time by whether close() has already been called, mirroring a recycled pooled connection. The client returns a 500 with this racy body first, then a 200, so a correct policy must complete with 200.
- Body = Flux.never() after close → reproduces the hang (StepVerifier times out).
- Body = Flux.error(IllegalStateException("Only one connection receive subscriber allowed.")) after close → reproduces the skipped retry (surfaces the unrelated error instead of retrying).
Both variants fail against the current RequestRetryPolicy. The full test class (~120 lines, no live service needed) is attached: RequestRetryPolicyDrainRaceTest.java
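For readers without the attachment, the idea behind the racy body can be sketched with the JDK alone. This is not the attached test: CompletableFuture stands in for the Reactor Flux, and the class, enum, and method names (RacyBodyModel, AfterClose, subscribeToBody) are hypothetical. The only point it makes is that the body's behavior is picked when it is subscribed, not when it is captured.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * JDK-only model of the racy test body. CompletableFuture stands in for the
 * Reactor Flux used by the real test; all names here are hypothetical.
 */
final class RacyBodyModel {
    enum AfterClose { NEVER, ERROR }

    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final AfterClose afterClose;

    RacyBodyModel(AfterClose afterClose) {
        this.afterClose = afterClose;
    }

    /** Mirrors response.close(): from now on the connection counts as recycled. */
    void close() {
        closed.set(true);
    }

    /** Behavior is chosen when the body is subscribed, not when it is captured. */
    CompletableFuture<Void> subscribeToBody() {
        CompletableFuture<Void> drain = new CompletableFuture<>();
        if (!closed.get()) {
            // Subscribed before close: the body drains normally.
            drain.complete(null);
        } else if (afterClose == AfterClose.ERROR) {
            // Subscribed after close: the late subscriber is rejected.
            drain.completeExceptionally(
                new IllegalStateException("Only one connection receive subscriber allowed."));
        }
        // AfterClose.NEVER after close: the future stays incomplete, like Flux.never().
        return drain;
    }

    public static void main(String[] args) {
        RacyBodyModel body = new RacyBodyModel(AfterClose.NEVER);
        body.close(); // close first, then subscribe, as in the snippet under "Code Snippet"
        CompletableFuture<Void> drain = body.subscribeToBody();
        System.out.println("drain done after close-then-subscribe: " + drain.isDone());
    }
}
```

Subscribing before close() completes the drain in this model, so it is the order of the two calls that decides the result.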
Code Snippet
sdk/storage/azure-storage-common/src/main/java/com/azure/storage/common/policy/RequestRetryPolicy.java (attemptAsync, the retryable-status branch):
Flux<ByteBuffer> responseBody = response.getBody();
response.close(); // (1) closes/recycles the connection FIRST
if (responseBody == null) {
    return attemptAsync(context, next, originalRequest, newConsiderSecondary, newPrimaryTry,
        attempt + 1, suppressed);
} else {
    // (2) subscribes to the body of the already-closed response
    return responseBody.ignoreElements()
        .then(attemptAsync(context, next, originalRequest, newConsiderSecondary,
            newPrimaryTry, attempt + 1, suppressed));
}
This branch only runs for retryable status codes (429/500/503, see shouldStatusCodeBeRetried). For the Netty client, response.close() → com.azure.core.http.netty.implementation.NettyUtility#closeConnection → ChannelOperations.discard(), which drains the inbound on the event loop and recycles the connection. The drain subscription in (2) then attaches to connection.inbound().receive() (NettyAsyncHttpResponse#bodyIntern) on that same channel, racing the discard/recycle.
Three outcomes depending on who wins the race:
- Drain wins → drains, completes, .then(retry) fires. Works (the common case, which is why this is intermittent).
- Discard wins the single inbound subscription → the late drain subscription errors (IllegalStateException, "only one receive subscriber allowed"). Because A.then(B) propagates A's error without subscribing to B, the retry is silently skipped; onErrorResume (line ~196) may even rewrap it into the misleading "the provided Flux did not match the provided data size … not being replayable" exception.
- Connection already recycled (possibly serving a sibling chunk by then) → the late subscriber attaches to reset/reused channel operations and receives no terminal signal at all. The drain Mono never completes, .then(retry) never fires, no error is ever emitted → infinite hang.
downloadToFileWithResponse runs up to maxConcurrency (default 8) chunk range-GETs in parallel, so a freed connection has eager takers within microseconds, widening the window for the second and third outcomes above.
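The three outcomes follow from how a chained retry depends on the drain finishing. The sketch below models that with the JDK only: CompletableFuture.thenCompose stands in for Reactor's Mono.then, which is an assumption of the model (the two differ in other respects, such as laziness), and DrainThenRetryModel and observe are hypothetical names. The 200 ms wait exists only so the demo can report the hang; the real call with a null timeout has no such bound.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** JDK-only model: CompletableFuture stands in for the Reactor Mono used by RequestRetryPolicy. */
final class DrainThenRetryModel {

    /** Chains a retry after the drain and reports what the caller would observe. */
    static String observe(CompletableFuture<Void> drain) {
        // The retry stage only starts once the drain completes successfully.
        CompletableFuture<String> chained =
            drain.thenCompose(ignored -> CompletableFuture.completedFuture("retried -> 200"));
        try {
            return chained.get(200, TimeUnit.MILLISECONDS);
        } catch (ExecutionException e) {
            // The drain failed, so the retry stage was never started.
            return "retry skipped: " + e.getCause().getMessage();
        } catch (TimeoutException e) {
            // The drain never signalled, so neither a value nor an error arrived.
            return "hang: no terminal signal";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        // Outcome 1: the drain wins the race and completes.
        System.out.println(observe(CompletableFuture.<Void>completedFuture(null)));

        // Outcome 2: the late subscriber is rejected with an error.
        CompletableFuture<Void> rejected = new CompletableFuture<>();
        rejected.completeExceptionally(
            new IllegalStateException("Only one connection receive subscriber allowed."));
        System.out.println(observe(rejected));

        // Outcome 3: the late subscriber never receives a terminal signal.
        System.out.println(observe(new CompletableFuture<>()));
    }
}
```

Run as written, main prints one line per outcome: the successful retry, the skipped retry carrying the subscriber error, and the hang.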
Expected behavior
A single retryable 500 on one chunk should be retried (up to maxTries), and the download should either complete successfully or fail with a surfaced exception (e.g. BlobStorageException). It must never hang without a terminal signal.
Setup (please complete the following information):
Code references are against main, where azure-storage-blob is 12.35.0-beta.2 and azure-storage-common is 12.34.0-beta.2. The issue also appears to reproduce on the latest main. The versions below are from the environment where it was originally observed.
- azure-storage-blob: 12.31.3
- azure-core-http-netty: 1.16.2
- Default HttpClient (Reactor Netty), default ConnectionProvider
- JDK 11 (dump captured on OpenJDK 64-Bit Server VM 11.0.21+9-LTS, Linux/epoll)
Additional context
A jstack captured at the time of the incident is attached: jstack_202606142235_minimal_repro.txt
Passing a timeout works as a workaround, which I assume is partly why the bug has not been reported before. It still looks like a clear bug: downloadToFileWithResponse should also be usable without the timeout option.
Information Checklist
Kindly make sure that you have added all of the information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.