[BUG] BlobClient.downloadToFileWithResponse hangs forever after an HTTP 500 on one chunk #49507

@Heikki-T

Describe the bug
When BlobClient.downloadToFileWithResponse(...) performs a multi-chunk parallel download and the storage service returns a retryable status (observed with HTTP 500) for one chunk, the call can hang indefinitely — it never returns a value and never throws. With the synchronous client and a null timeout it blocks forever; the worker thread is permanently parked in Mono.block.

The root cause is an ordering bug in RequestRetryPolicy.attemptAsync: on a retryable response it captures the body, closes the response, and only then subscribes to the captured body to drain it before retrying. With the default Reactor Netty client, close() discards the inbound and releases the connection back to the (JVM-wide, shared) pool, so the drain subscription races connection recycling.

To Reproduce
A deterministic unit test isolates the ordering defect. It uses a MockHttpResponse whose body behavior is decided at subscription time by whether close() has already been called, mirroring a recycled pooled connection. The client returns a 500 with this racy body first, then a 200, so a correct policy must complete with 200.

  • Body = Flux.never() after close → reproduces the hang (StepVerifier times out).
  • Body = Flux.error(IllegalStateException("Only one connection receive subscriber allowed.")) after close → reproduces the skipped retry (surfaces the unrelated error instead of retrying).

Both variants fail against the current RequestRetryPolicy. The full test class (~120 lines, no live service needed) is attached: RequestRetryPolicyDrainRaceTest.java
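
For readers without the attachment, the idea behind the racy body can be modelled with the JDK alone. This is only a sketch, not the attached test and not SDK code: RacyBodyModel, RacyResponse, closeThenDrain and drainThenClose are hypothetical names, a CompletableFuture that is never completed stands in for Flux.never(), and the drain-before-close ordering is an assumed alternative rather than a fix taken from this report.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

/** JDK-only model of the racy body; hypothetical, not the SDK's MockHttpResponse. */
public class RacyBodyModel {

    /** Response whose body behavior is decided at the moment the drain is requested. */
    static final class RacyResponse {
        private final AtomicBoolean closed = new AtomicBoolean(false);

        void close() {
            closed.set(true);
        }

        CompletableFuture<Void> drainBody() {
            if (closed.get()) {
                // Already closed: no terminal signal ever arrives (stand-in for Flux.never()).
                return new CompletableFuture<Void>();
            }
            // Still open: the body drains and completes normally.
            return CompletableFuture.<Void>completedFuture(null);
        }
    }

    /** Ordering described in this issue: close first, then drain, then retry. */
    static CompletableFuture<Integer> closeThenDrain(RacyResponse first) {
        first.close();
        // 200 stands for the status of the retried attempt.
        return first.drainBody().thenApply(ignored -> 200);
    }

    /** Assumed alternative ordering: drain first, close afterwards, then retry. */
    static CompletableFuture<Integer> drainThenClose(RacyResponse first) {
        return first.drainBody()
            .whenComplete((ignored, error) -> first.close())
            .thenApply(ignored -> 200);
    }

    /** Waits briefly; reports the status, or "hang" if no terminal signal arrives in time. */
    static String outcome(CompletableFuture<Integer> result) throws Exception {
        try {
            return String.valueOf(result.get(200, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            return "hang";
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("close-then-drain: " + outcome(closeThenDrain(new RacyResponse())));
        System.out.println("drain-then-close: " + outcome(drainThenClose(new RacyResponse())));
    }
}
```

Running main prints "close-then-drain: hang" and "drain-then-close: 200". The bounded get(...) plays the role that StepVerifier's timeout plays in the real test.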

Code Snippet

sdk/storage/azure-storage-common/src/main/java/com/azure/storage/common/policy/RequestRetryPolicy.java (attemptAsync, the retryable-status branch):

Flux<ByteBuffer> responseBody = response.getBody();
response.close();                                   // (1) closes/recycles the connection FIRST

if (responseBody == null) {
    return attemptAsync(context, next, originalRequest, newConsiderSecondary, newPrimaryTry,
        attempt + 1, suppressed);
} else {
    return responseBody.ignoreElements()            // (2) subscribes to the body of the
        .then(attemptAsync(context, next, originalRequest, newConsiderSecondary,   // already-closed response
            newPrimaryTry, attempt + 1, suppressed));
}

This branch only runs for retryable status codes (429/500/503, see shouldStatusCodeBeRetried). For the Netty client, response.close() calls com.azure.core.http.netty.implementation.NettyUtility#closeConnection, which calls ChannelOperations.discard(); that drains the inbound on the event loop and recycles the connection. The drain subscription in (2) then attaches to connection.inbound().receive() (NettyAsyncHttpResponse#bodyIntern) on that same channel, racing the discard/recycle.

Three outcomes depending on who wins the race:

  1. Drain wins → drains, completes, .then(retry) fires. Works (the common case — which is why this is intermittent).
  2. Discard wins the single inbound subscription → the late drain subscription errors (IllegalStateException, "only one receive subscriber allowed"). Because A.then(B) propagates A's error without subscribing to B, the retry is silently skipped; onErrorResume (line ~196) may even rewrap it into the misleading "the provided Flux did not match the provided data size … not being replayable" exception.
  3. Connection already recycled (possibly serving a sibling chunk by then) → the late subscriber attaches to reset/reused channel operations and receives no terminal signal at all. The drain Mono never completes, .then(retry) never fires, no error is ever emitted → infinite hang.
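
The three outcomes can be illustrated with a JDK-only analogy. This is a sketch under stated assumptions, not Reactor or SDK code: DrainOutcomes and classify are hypothetical names, and CompletableFuture.thenCompose stands in for A.then(B) because it shares the relevant property that the second stage is only started when the first one completes normally.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

/** JDK-only analogy for drain.then(retry); hypothetical, not Reactor or SDK code. */
public class DrainOutcomes {

    /** Chains a retry after the given drain and reports what the caller observes. */
    static String classify(CompletableFuture<Void> drain) throws InterruptedException {
        AtomicBoolean retried = new AtomicBoolean(false);
        CompletableFuture<String> chained = drain.thenCompose(ignored -> {
            retried.set(true); // only reached when the drain completes normally
            return CompletableFuture.completedFuture("200");
        });

        String signal;
        try {
            signal = "completed " + chained.get(200, TimeUnit.MILLISECONDS);
        } catch (ExecutionException e) {
            // The drain's own error surfaces; the retry was never started.
            signal = "error " + e.getCause().getClass().getSimpleName();
        } catch (TimeoutException e) {
            // Neither a value nor an error: an unbounded wait would block forever.
            signal = "none";
        }
        return "signal=" + signal + ", retried=" + retried.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Outcome 1: the drain completes.
        System.out.println(classify(CompletableFuture.<Void>completedFuture(null)));

        // Outcome 2: the drain fails with an unrelated error.
        CompletableFuture<Void> failed = new CompletableFuture<Void>();
        failed.completeExceptionally(
            new IllegalStateException("Only one connection receive subscriber allowed."));
        System.out.println(classify(failed));

        // Outcome 3: the drain never emits a terminal signal.
        System.out.println(classify(new CompletableFuture<Void>()));
    }
}
```

In this model only the first case reaches the retry. The second surfaces the drain's error with retried=false, and the third yields no signal at all, which is the case that turns into a hang when the caller waits without a timeout.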

downloadToFileWithResponse runs up to maxConcurrency (default 8) chunk range-GETs in parallel, so a freed connection has eager takers within microseconds, widening the window for outcomes (2)/(3).

Expected behavior
A single retryable 500 on one chunk should be retried (up to maxTries), and the download should either complete successfully or fail with a surfaced exception (e.g. BlobStorageException). It must never hang without a terminal signal.
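
As a rough statement of that contract, the following JDK-only sketch always ends in one of two terminal results: a non-retryable status or an exception. It is an illustration only: ExpectedRetryModel and run are hypothetical names, the status codes come from the 429/500/503 list mentioned above, and IllegalStateException merely stands in for whatever exception the SDK would surface.

```java
import java.util.Arrays;
import java.util.Iterator;

/** JDK-only model of the expected retry contract; hypothetical, not SDK code. */
public class ExpectedRetryModel {

    /** Status codes treated as retryable in this model (429/500/503). */
    static boolean isRetryable(int status) {
        return status == 429 || status == 500 || status == 503;
    }

    /**
     * Consumes one simulated status per attempt. Returns the first non-retryable
     * status, or throws once maxTries attempts have all been retryable.
     */
    static int run(Iterator<Integer> statuses, int maxTries) {
        int last = -1;
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            last = statuses.next();
            if (!isRetryable(last)) {
                return last; // terminal result: a value
            }
        }
        // Terminal result: a surfaced failure, never a silent hang.
        throw new IllegalStateException(
            "gave up after " + maxTries + " tries, last status " + last);
    }

    public static void main(String[] args) {
        // One 500 followed by a 200: the retry succeeds.
        System.out.println(run(Arrays.asList(500, 200).iterator(), 4));

        // Every attempt returns 500: the failure is surfaced.
        try {
            run(Arrays.asList(500, 500, 500).iterator(), 3);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```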

Setup (please complete the following information):
Code references are against main, where azure-storage-blob is 12.35.0-beta.2 and azure-storage-common is 12.34.0-beta.2. The issue also appears to reproduce on the latest main. The versions below are from the environment where it was originally observed.

  • azure-storage-blob 12.31.3
  • azure-core-http-netty 1.16.2
  • Default HttpClient (Reactor Netty), default ConnectionProvider
  • JDK 11 (dump captured on OpenJDK 64-Bit Server VM 11.0.21+9-LTS, Linux/epoll)

Additional context
A jstack captured at the time of the incident is attached: jstack_202606142235_minimal_repro.txt
Passing a timeout works as a workaround, which I assume is part of the reason the bug has not been reported before. It still looks like a clear bug: it should be possible to use downloadToFileWithResponse without the timeout option.

Information Checklist
Kindly make sure that you have added all of the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added

Metadata

Assignees

No one assigned

    Labels

    customer-reported: Issues that are reported by GitHub users external to the Azure organization.
    needs-triage: Workflow: This is a new issue that needs to be triaged to the appropriate team.
    question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that
