Performance characteristics of Zarrs' Sync vs Async APIs

The sync and async APIs of zarrs have different performance characteristics.

In particular, because the zarrs async API is entirely runtime-agnostic, it **doesn't implement any parallelism natively**. It **only** makes concurrent fetches; CPU-bound work such as decoding is not parallelized across chunks.

This matters most for the zarrs methods that fetch or store multiple chunks in a single call.

Read more in the Zarrs docs' section on [Parallelism and Concurrency](https://docs.rs/zarrs/latest/zarrs/array/struct.Array.html#parallelism-and-concurrency).

-----
🤖 Written by Claude:

## Async aggregate reads don't parallelize over chunks

### Background

`zarrs` is async-runtime-agnostic and never spawns tasks internally. Its async
methods give **concurrency (overlapping storage I/O) but not parallelism**:
codec decode of multiple chunks is driven on a single polling task. Per the
upstream docs, `async_retrieve_array_subset` / `async_retrieve_chunks` "do not
parallelise over chunks and can be slow compared to the sync API." (A single
chunk's codec still uses rayon internally; what's lost is overlapping one
chunk's decode with the next chunk's decode/fetch.)

The sync API has no such constraint — sync `retrieve_array_subset` uses rayon
and parallelizes over chunks.
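
As a rough illustration of the concurrency-versus-parallelism distinction, here is a small Python analogy. It is not zarrs code: `fake_decode` is a hypothetical stand-in for a blocking decode step, and `time.sleep` plays the role of decode work so the effect is visible regardless of the GIL. Running the decode inline inside the coroutines keeps all of it on one thread, one chunk at a time; handing each decode to a worker lets them overlap.

```python
import asyncio
import threading
import time


def fake_decode(chunk_id: int) -> str:
    """Hypothetical stand-in for a blocking decode step.

    Returns the name of the thread that did the work.
    """
    time.sleep(0.05)  # sleep stands in for decode work
    return threading.current_thread().name


async def decode_inline(n_chunks: int) -> set[str]:
    """Decode inside the coroutines: one thread, one chunk at a time."""

    async def one(chunk_id: int) -> str:
        await asyncio.sleep(0)  # stand-in for an awaited storage fetch
        return fake_decode(chunk_id)  # blocks the event loop while it runs

    return set(await asyncio.gather(*(one(i) for i in range(n_chunks))))


async def decode_offloaded(n_chunks: int) -> set[str]:
    """Hand each decode to a worker thread so the decodes can overlap."""
    return set(
        await asyncio.gather(
            *(asyncio.to_thread(fake_decode, i) for i in range(n_chunks))
        )
    )
```

`decode_inline` reports a single thread name, while `decode_offloaded` reports several. The first shape is analogous to what the async aggregate methods do across chunks; the second is the shape a fan-out outside zarrs is meant to recover.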

### Where this bites us

| Method | Cross-chunk parallelism |
|---|---|
| sync `retrieve_array_subset` (src/array/sync.rs:69) | ✅ rayon |
| async `retrieve_array_subset` (src/array/async.rs:89) | ❌ serial decode |
| async `retrieve_chunks` | ❌ serial decode |

`future_into_py` (pyo3-async-runtimes, default multithreaded tokio runtime)
spawns each Python call as its own tokio task. So **per-chunk** async calls
fanned out from Python already parallelize for free:

```python
chunks = await asyncio.gather(*(arr.retrieve_chunk(c) for c in coords))
```

The problem is only the **aggregate** methods, which do their fan-out *inside* a
single zarrs call we can't reach into.

### Split by store type

- **Sync-capable store (local FS):** sync API is already the fast path (rayon).
  For async ergonomics without losing it, run the *sync* method inside
  `tokio::task::spawn_blocking` and return via `future_into_py`.
- **Async-only store (object_store / S3):** sync traits unavailable, so we're
  stuck with serial decode unless we build the fan-out ourselves.

### Proposed fix

Build one Rust-side fan-out helper and route async aggregate reads through it
instead of calling `async_retrieve_array_subset` directly:

1. Enumerate chunks overlapping the requested `ArraySubset`.
2. For each, `tokio::spawn` an `async_retrieve_chunk_subset` for the
   intersection (full chunk interior, clipped edges).
3. `join_all` and copy each decoded piece into its offset in the output array.

This is what zarrs does internally, but with `spawn` instead of inline `await` —
i.e. the "parallelism over chunks by spawning outside zarrs" the upstream docs
recommend. Reuse the helper for both `retrieve_chunks` and
`retrieve_array_subset`. Keep the sync API documented as the fast path for local
stores.
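
The enumerate, fetch, and copy steps above can be sketched in Python to show the bookkeeping. This is a minimal sketch assuming a regular chunk grid and a 2-D output; `overlapping_chunks`, `retrieve_subset_2d`, and the `fetch_chunk_subset` callback are hypothetical names, not zarrs or project APIs. In the real helper the per-chunk call would be `async_retrieve_chunk_subset` under `tokio::spawn`; `asyncio.create_task` here only mirrors the structure and does not by itself provide parallelism.

```python
import asyncio
from itertools import product


def overlapping_chunks(start, shape, chunk_shape):
    """Yield (chunk_index, start_in_chunk, piece_shape, offset_in_output)
    for every chunk of a regular grid that overlaps the requested region."""
    per_axis = []
    for s, n, c in zip(start, shape, chunk_shape):
        axis = []
        for ci in range(s // c, (s + n - 1) // c + 1):
            lo = max(s, ci * c)            # clipped start, array coordinates
            hi = min(s + n, (ci + 1) * c)  # clipped end (exclusive)
            axis.append((ci, lo - ci * c, hi - lo, lo - s))
        per_axis.append(axis)
    for combo in product(*per_axis):
        # Transpose per-axis tuples into per-field tuples.
        yield tuple(zip(*combo))


async def retrieve_subset_2d(fetch_chunk_subset, start, shape, chunk_shape):
    """Fan out one task per overlapping chunk, then assemble a 2-D result.

    `fetch_chunk_subset(chunk_index, start_in_chunk, piece_shape)` is a
    hypothetical coroutine function returning the decoded piece as rows.
    """
    plan = list(overlapping_chunks(start, shape, chunk_shape))
    tasks = [
        asyncio.create_task(fetch_chunk_subset(idx, in_chunk, piece))
        for idx, in_chunk, piece, _ in plan
    ]
    pieces = await asyncio.gather(*tasks)

    out = [[None] * shape[1] for _ in range(shape[0])]
    for (_, _, piece_shape, (r0, c0)), rows in zip(plan, pieces):
        for r, row in enumerate(rows):
            # Copy each decoded row into its offset in the output.
            out[r0 + r][c0 : c0 + piece_shape[1]] = row
    return out
```

Interior chunks come back whole and edge chunks come back clipped, because each piece's shape is computed from the intersection rather than from the chunk shape.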

When fanning out many chunks, consider lowering the codec `concurrent_target` to
avoid oversubscribing cores (each task may itself use intra-codec rayon).
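
One simple way to pick that value is to divide the available cores among the in-flight chunk tasks. The heuristic below is an assumption for illustration, not something zarrs prescribes, and `per_task_concurrent_target` is a hypothetical helper name; how the result is applied to the codec options is left out.

```python
import os
from typing import Optional


def per_task_concurrent_target(n_tasks: int, cores: Optional[int] = None) -> int:
    """Split the available cores evenly across chunk tasks (assumed heuristic).

    Never returns less than 1, so every task can still make progress.
    """
    cores = cores or os.cpu_count() or 1
    return max(1, cores // max(1, n_tasks))
```

With 16 cores and 4 chunk tasks in flight this gives each task a target of 4; once the number of tasks exceeds the core count, every task falls back to 1.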

### Decode stays in Rust

Note that `retrieve_encoded_chunk` returns *compressed* bytes; pushing decode to
Python would mean reimplementing codecs under the GIL. Decoding must stay in
Rust, off the GIL. `retrieve_encoded_chunk` remains useful for raw-bytes
passthrough or custom caching, not as the performance path.

