[BUG] BlobClient.downloadToFileWithResponse hangs forever after an HTTP 500 on one chunk #49507

**Describe the bug**
When `BlobClient.downloadToFileWithResponse(...)` performs a multi-chunk parallel download and the storage service returns a retryable status (observed with **HTTP 500**) for one chunk, the call can **hang indefinitely** — it never returns a value and never throws. With the synchronous client and a `null` timeout it blocks forever; the worker thread is permanently parked in `Mono.block`.

The root cause is an ordering bug in `RequestRetryPolicy.attemptAsync`: on a retryable response it captures the body, **closes the response**, and only then subscribes to the captured body to drain it before retrying. With the default Reactor Netty client, `close()` discards the inbound and releases the connection back to the (JVM-wide, shared) pool, so the drain subscription races connection recycling.

**To Reproduce**
A deterministic unit test isolates the ordering defect. It uses a `MockHttpResponse` whose body behavior is decided *at subscription time* by whether `close()` has already been called, mirroring a recycled pooled connection. The mock client returns a 500 with this racy body first, then a 200, so a correct policy must complete with 200.

- Body = `Flux.never()` after close → reproduces the **hang** (StepVerifier times out).
- Body = `Flux.error(IllegalStateException("Only one connection receive subscriber allowed."))` after close → reproduces the **skipped retry** (surfaces the unrelated error instead of retrying).

Both fail against current `RequestRetryPolicy`. (Full test source available; ~120 lines, no live service needed.)

The full test class is attached: [RequestRetryPolicyDrainRaceTest.java](https://github.com/user-attachments/files/28955171/RequestRetryPolicyDrainRaceTest.java)
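
For readers who don't want to open the attachment, the core idea of the mock body can be modeled with JDK types only. This is a rough sketch, not the attached test: `java.util.concurrent.Flow` stands in for Reactor's `Flux`, and the class `RacyBodyModel` and its `drain` helper are hypothetical names made up for illustration. The only point it shows is that the body's behavior is chosen when a subscriber arrives, based on whether `close()` already ran.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.Flow;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

/** JDK-only stand-in for the racy response body (assumption: Flow replaces Reactor's Flux). */
final class RacyBodyModel implements Flow.Publisher<ByteBuffer> {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final boolean errorAfterClose;

    RacyBodyModel(boolean errorAfterClose) {
        this.errorAfterClose = errorAfterClose;
    }

    /** Models response.close(): later subscribers see a recycled connection. */
    void close() {
        closed.set(true);
    }

    @Override
    public void subscribe(Flow.Subscriber<? super ByteBuffer> subscriber) {
        subscriber.onSubscribe(new Flow.Subscription() {
            @Override public void request(long n) { }
            @Override public void cancel() { }
        });
        // Behavior is chosen at subscription time, not at construction time.
        if (!closed.get()) {
            subscriber.onComplete(); // drain finished before close
        } else if (errorAfterClose) {
            subscriber.onError(
                new IllegalStateException("Only one connection receive subscriber allowed."));
        }
        // Otherwise: closed and silent, like Flux.never(); no terminal signal is sent.
    }

    /** Subscribes a drain-style subscriber and reports which terminal signal arrived, if any. */
    static String drain(RacyBodyModel body) {
        AtomicReference<String> signal = new AtomicReference<>("none");
        body.subscribe(new Flow.Subscriber<ByteBuffer>() {
            @Override public void onSubscribe(Flow.Subscription s) { s.request(Long.MAX_VALUE); }
            @Override public void onNext(ByteBuffer item) { }
            @Override public void onError(Throwable t) { signal.set("error"); }
            @Override public void onComplete() { signal.set("complete"); }
        });
        return signal.get();
    }

    public static void main(String[] args) {
        RacyBodyModel drainedFirst = new RacyBodyModel(false);
        System.out.println(drain(drainedFirst)); // subscribed before close

        RacyBodyModel hang = new RacyBodyModel(false);
        hang.close();
        System.out.println(drain(hang)); // subscribed after close, silent variant

        RacyBodyModel skippedRetry = new RacyBodyModel(true);
        skippedRetry.close();
        System.out.println(drain(skippedRetry)); // subscribed after close, error variant
    }
}
```

In this model, a subscriber that arrives after `close()` either gets an error or gets nothing at all, which corresponds to the two failing variants listed above.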

**Code Snippet**

[sdk/storage/azure-storage-common/src/main/java/com/azure/storage/common/policy/RequestRetryPolicy.java](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-common/src/main/java/com/azure/storage/common/policy/RequestRetryPolicy.java#L177) (`attemptAsync`, the retryable-status branch):

```java
Flux<ByteBuffer> responseBody = response.getBody();
response.close();                                   // (1) closes/recycles the connection FIRST

if (responseBody == null) {
    return attemptAsync(context, next, originalRequest, newConsiderSecondary, newPrimaryTry,
        attempt + 1, suppressed);
} else {
    return responseBody.ignoreElements()            // (2) subscribes to the body of the
        .then(attemptAsync(context, next, originalRequest, newConsiderSecondary,   // already-closed response
            newPrimaryTry, attempt + 1, suppressed));
}
```

This branch only runs for retryable status codes (`429`/`500`/`503`, see `shouldStatusCodeBeRetried`). For the Netty client, `response.close()` →
`com.azure.core.http.netty.implementation.NettyUtility#closeConnection` → `ChannelOperations.discard()`, which drains the inbound on the event loop and recycles the connection. The drain subscription in (2) then attaches to `connection.inbound().receive()` (`NettyAsyncHttpResponse#bodyIntern`) on that same channel, racing the discard/recycle.

Three outcomes depending on who wins the race:

1. **Drain wins** → drains, completes, `.then(retry)` fires. Works (the common case — which is why this is intermittent).
2. **Discard wins the single inbound subscription** → the late drain subscription errors (`IllegalStateException`, "only one receive subscriber allowed"). Because `A.then(B)` propagates A's error without subscribing to B, the **retry is silently skipped**; `onErrorResume` (line ~196) may even rewrap it into the misleading *"the provided Flux did not match the provided data size … not being replayable"* exception.
3. **Connection already recycled** (possibly serving a sibling chunk by then) → the late subscriber attaches to reset/reused channel operations and receives **no terminal signal at all**. The drain `Mono` never completes, `.then(retry)` never fires, no error is ever emitted → **infinite hang**.
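
The three outcomes can be illustrated with a small JDK-only model. This is a sketch under assumptions, not the SDK code and not Reactor: `CompletableFuture` stands in for `Mono`, and `thenCompose` stands in for `.then(retry)`, on the assumption that the analogy holds because both skip the downstream stage when the upstream fails or never completes. The class name `DrainThenRetryModel` and the 200 ms bound are hypothetical and exist only so the model terminates.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

/** JDK-only model of drain.then(retry); CompletableFuture is an assumed stand-in for Mono. */
final class DrainThenRetryModel {

    /** Chains a retry after the drain and reports what a blocked caller would observe. */
    static String observe(CompletableFuture<Void> drain) {
        AtomicBoolean retryStarted = new AtomicBoolean(false);
        // The retry stage is only entered when the drain completes normally.
        CompletableFuture<String> result = drain.thenCompose(ignored -> {
            retryStarted.set(true);
            return CompletableFuture.completedFuture("200 OK");
        });
        try {
            // Bounded wait so the model terminates; the reported hang has no such bound.
            return result.get(200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "hang (retry started: " + retryStarted.get() + ")";
        } catch (ExecutionException e) {
            return "error: " + e.getCause().getMessage()
                + " (retry started: " + retryStarted.get() + ")";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        // Outcome 1: the drain completes, so the retry runs.
        System.out.println(observe(CompletableFuture.completedFuture(null)));

        // Outcome 2: the drain fails, so the retry is skipped and the drain error surfaces.
        CompletableFuture<Void> failed = new CompletableFuture<>();
        failed.completeExceptionally(
            new IllegalStateException("Only one connection receive subscriber allowed."));
        System.out.println(observe(failed));

        // Outcome 3: the drain never signals, so nothing downstream ever happens.
        System.out.println(observe(new CompletableFuture<>()));
    }
}
```

Only the first call reaches the retry stage. In the other two the retry is never started, which is the behavior described in outcomes 2 and 3.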

`downloadToFileWithResponse` runs up to `maxConcurrency` (default 8) chunk range-GETs in parallel, so a freed connection has eager takers within microseconds, widening the window for outcomes (2)/(3).


**Expected behavior**
A single retryable 500 on one chunk should be retried (up to `maxTries`), and the download should either complete successfully or fail with a surfaced exception (e.g. `BlobStorageException`). It must **never** hang without a terminal signal.

**Setup (please complete the following information):**
Code references are against `main`, where `azure-storage-blob` is `12.35.0-beta.2` and `azure-storage-common` is `12.34.0-beta.2`. The issue appears to affect the latest `main` as well. The versions below are from the environment where the issue was originally observed.

- `azure-storage-blob` — **12.31.3**
- `azure-core-http-netty` — **1.16.2**
- Default `HttpClient` (Reactor Netty), default `ConnectionProvider`
- JDK 11 (dump captured on `OpenJDK 64-Bit Server VM 11.0.21+9-LTS`, Linux/epoll)

**Additional context**
A jstack captured at the time of the incident is attached: [jstack_202606142235_minimal_repro.txt](https://github.com/user-attachments/files/28955168/jstack_202606142235_minimal_repro.txt)
Passing a timeout works as a workaround, which I assume is part of the reason the bug hasn't been reported before. It still looks like a clear bug: `downloadToFileWithResponse` should also be usable without the timeout option.
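
To illustrate why bounding the wait helps, here is a JDK-only sketch of the general idea: a missing terminal signal is converted into a surfaced failure once a deadline passes. This is not the SDK's timeout implementation. `DeadlineGuard`, `callWithDeadline`, and `stuckDownload` are hypothetical names, and `stuckDownload` only simulates a call that parks forever.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: turns "no terminal signal" into a surfaced failure. */
final class DeadlineGuard {

    static <T> T callWithDeadline(Callable<T> blockingCall, Duration deadline) {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "deadline-guard");
            t.setDaemon(true); // never keep the JVM alive for a stuck call
            return t;
        });
        Future<T> future = worker.submit(blockingCall);
        try {
            return future.get(deadline.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("no result within " + deadline, e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("call failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            worker.shutdownNow(); // interrupts the worker if it is still parked
        }
    }

    /** Stand-in for a download that hangs: parks until interrupted. */
    static String stuckDownload() throws InterruptedException {
        new CountDownLatch(1).await();
        return "unreachable";
    }

    public static void main(String[] args) {
        // A call that finishes in time returns its value unchanged.
        System.out.println(callWithDeadline(() -> "downloaded", Duration.ofSeconds(1)));
        try {
            // A call that never finishes fails once the deadline passes.
            callWithDeadline(DeadlineGuard::stuckDownload, Duration.ofMillis(200));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A guard like this only limits how long the caller waits. It does not make the skipped retry happen, so the ordering bug itself still needs a fix.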

**Information Checklist**
Kindly make sure that you have added all the following information above and check off the required fields; otherwise we will treat the issue as an incomplete report.
- [x] Bug Description Added
- [x] Repro Steps Added
- [x] Setup information Added

