The zarrs sync and async APIs have different performance characteristics.
Because the zarrs async API is entirely runtime-agnostic, it does not implement any parallelism itself. It issues storage fetches concurrently, but CPU-bound work such as decoding is driven from a single task.
This matters most in the zarrs APIs that fetch or store multiple chunks.
Read more in the zarrs docs' section on Parallelism and Concurrency.
🤖 Written by Claude:
Async aggregate reads don't parallelize over chunks
Background
zarrs is async-runtime-agnostic and never spawns tasks internally. Its async
methods give concurrency (overlapping storage I/O) but not parallelism:
codec decode of multiple chunks is driven on a single polling task. Per the
upstream docs, async_retrieve_array_subset / async_retrieve_chunks "do not
parallelise over chunks and can be slow compared to the sync API." (A single
chunk's codec still uses rayon internally; what's lost is overlapping one
chunk's decode with the next chunk's decode/fetch.)
The sync API has no such constraint — sync retrieve_array_subset uses rayon
and parallelizes over chunks.
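To make the concurrency-versus-parallelism distinction concrete, here is a small Python asyncio analogy (not zarrs code). `fetch` and `decode` are hypothetical stand-ins for a storage read and a codec decode. The sketch only records which thread each decode ran on: decodes done inline all land on the thread driving the event loop, while decodes handed to a pool land on worker threads. Pure-Python work is still limited by the GIL, so this shows where the work runs, not a speedup.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


def decode(chunk_id: int) -> tuple[int, str]:
    """Stand-in for a CPU-bound codec decode; records the thread it ran on."""
    return chunk_id, threading.current_thread().name


async def fetch(chunk_id: int) -> int:
    """Stand-in for an async storage read (the part that does overlap)."""
    await asyncio.sleep(0.01)
    return chunk_id


async def inline_decode(chunk_id: int) -> tuple[int, str]:
    # Fetches overlap with each other, but decode runs on the polling thread.
    return decode(await fetch(chunk_id))


async def offloaded_decode(chunk_id: int, pool: ThreadPoolExecutor) -> tuple[int, str]:
    # Fetches overlap as before; decode is handed to a worker thread.
    fetched = await fetch(chunk_id)
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, decode, fetched)


async def main() -> tuple[set[str], set[str]]:
    ids = range(8)
    inline = await asyncio.gather(*(inline_decode(i) for i in ids))
    with ThreadPoolExecutor(max_workers=4, thread_name_prefix="decode") as pool:
        offloaded = await asyncio.gather(*(offloaded_decode(i, pool) for i in ids))
    return {name for _, name in inline}, {name for _, name in offloaded}


caller_thread = threading.current_thread().name
inline_threads, offloaded_threads = asyncio.run(main())
```

`inline_threads` contains only the thread that drives the event loop, which mirrors the single polling task described above.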
Where this bites us
| Method | Cross-chunk parallelism |
|---|---|
| sync retrieve_array_subset (src/array/sync.rs:69) | ✅ rayon |
| async retrieve_array_subset (src/array/async.rs:89) | ❌ serial decode |
| async retrieve_chunks | ❌ serial decode |

future_into_py (pyo3-async-runtimes, default multithreaded tokio runtime)
spawns each Python call as its own tokio task. So per-chunk async calls
fanned out from Python already parallelize for free:
chunks = await asyncio.gather(*(arr.retrieve_chunk(c) for c in coords))
The problem is only the aggregate methods, which do their fan-out inside a
single zarrs call we can't reach into.
Split by store type
- Sync-capable store (local FS): sync API is already the fast path (rayon).
For async ergonomics without losing it, run the sync method inside
tokio::task::spawn_blocking and return via future_into_py.
- Async-only store (object_store / S3): sync traits unavailable, so we're
stuck with serial decode unless we build the fan-out ourselves.
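The split above amounts to a small dispatch. A minimal Python sketch of its shape follows, with `asyncio.to_thread` playing the role that tokio::task::spawn_blocking would play on the Rust side. `Reader`, `sync_read` and `fan_out_read` are hypothetical names for illustration, not part of zarrs or of the bindings.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class Reader:
    """Hypothetical reader; sync_read is None for async-only stores."""
    sync_read: Optional[Callable[[], bytes]]
    fan_out_read: Callable[[], Awaitable[bytes]]


async def read_subset(reader: Reader) -> tuple[str, bytes]:
    if reader.sync_read is not None:
        # Sync-capable store: run the blocking sync path off the event loop.
        return "sync", await asyncio.to_thread(reader.sync_read)
    # Async-only store: fall back to our own per-chunk fan-out.
    return "fan-out", await reader.fan_out_read()


def sync_read() -> bytes:
    return b"decoded via sync path"


async def fan_out_read() -> bytes:
    return b"decoded via fan-out"


async def demo():
    local = Reader(sync_read=sync_read, fan_out_read=fan_out_read)
    remote = Reader(sync_read=None, fan_out_read=fan_out_read)
    return await read_subset(local), await read_subset(remote)


local_result, remote_result = asyncio.run(demo())
```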
Proposed fix
Build one Rust-side fan-out helper and route async aggregate reads through it
instead of calling async_retrieve_array_subset directly:
- Enumerate chunks overlapping the requested ArraySubset.
- For each, tokio::spawn an async_retrieve_chunk_subset for the intersection (full chunk interior, clipped edges).
- join_all and copy each decoded piece into its offset in the output array.
This is what zarrs does internally, but with spawn instead of inline await —
i.e. the "parallelism over chunks by spawning outside zarrs" the upstream docs
recommend. Reuse the helper for both retrieve_chunks and
retrieve_array_subset. Keep the sync API documented as the fast path for local
stores.
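The three steps can be sketched in Python to pin down the chunk arithmetic. This is an illustration under assumptions, not the Rust helper itself: `retrieve_chunk_subset` is a hypothetical stand-in for async_retrieve_chunk_subset, the array is 2-D, and it pretends position (r, c) holds the value `r * 100 + c`. `asyncio.create_task` marks where tokio::spawn would go. On Python's single-threaded event loop it gives no parallelism, so only the enumerate / spawn / join / copy structure carries over.

```python
import asyncio
from itertools import product

Range = tuple[int, int]  # half-open [start, stop) along one axis


def overlapping_chunks(subset: list[Range], chunk_shape: list[int]):
    """Yield (chunk_index, intersection) for each chunk touching a non-empty subset."""
    per_axis = [
        range(start // size, (stop - 1) // size + 1)
        for (start, stop), size in zip(subset, chunk_shape)
    ]
    for idx in product(*per_axis):
        # Clip the chunk's extent to the requested subset on every axis.
        inter = [
            (max(start, i * size), min(stop, (i + 1) * size))
            for i, (start, stop), size in zip(idx, subset, chunk_shape)
        ]
        yield idx, inter


async def retrieve_chunk_subset(idx, inter):
    """Hypothetical per-chunk read + decode of the region `inter` (array coordinates)."""
    await asyncio.sleep(0)  # pretend to hit storage
    (r0, r1), (c0, c1) = inter
    return [[r * 100 + c for c in range(c0, c1)] for r in range(r0, r1)]


async def retrieve_subset(subset: list[Range], chunk_shape: list[int]):
    (r0, r1), (c0, c1) = subset
    out = [[None] * (c1 - c0) for _ in range(r1 - r0)]
    pieces = list(overlapping_chunks(subset, chunk_shape))
    # One task per chunk; this is where the Rust helper would tokio::spawn.
    tasks = [asyncio.create_task(retrieve_chunk_subset(i, inter)) for i, inter in pieces]
    blocks = await asyncio.gather(*tasks)  # analogue of join_all
    for (_, inter), block in zip(pieces, blocks):
        (br0, _), (bc0, bc1) = inter
        for dr, row in enumerate(block):
            # Copy each decoded row into its offset within the output.
            out[br0 - r0 + dr][bc0 - c0 : bc1 - c0] = row
    return out


# Rows 3..9 and columns 2..7 of an array stored in 4x4 chunks.
result = asyncio.run(retrieve_subset([(3, 9), (2, 7)], [4, 4]))
```

With 4x4 chunks this request touches six chunks, and every one of them is clipped on at least one edge.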
When fanning out many chunks, consider lowering the codec concurrent_target to
avoid oversubscribing cores (each task may itself use intra-codec rayon).
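One possible way to pick the two numbers is to keep chunk-level tasks times per-codec threads within the core count. The heuristic below is an assumption for illustration, not something the zarrs docs prescribe, and `split_concurrency` is a hypothetical helper name. How the resulting codec target is passed to zarrs is left out.

```python
import os


def split_concurrency(n_chunks: int, cores: int) -> tuple[int, int]:
    """Return (chunk_tasks, codec_target) with chunk_tasks * codec_target <= cores."""
    chunk_tasks = max(1, min(n_chunks, cores))
    codec_target = max(1, cores // chunk_tasks)
    return chunk_tasks, codec_target


cores = os.cpu_count() or 1
many = split_concurrency(64, 8)   # many chunks: all cores go to chunk-level tasks
few = split_concurrency(2, 8)     # few chunks: leftover cores go to each codec
local = split_concurrency(16, cores)
```

With many chunks the codec target drops to 1, and with only a couple of chunks each decode keeps several threads.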
Decode stays in Rust
Note retrieve_encoded_chunk returns compressed bytes; pushing decode to
Python would mean reimplementing codecs under the GIL. Decoding must stay in
Rust, off the GIL. retrieve_encoded_chunk remains useful for raw-bytes
passthrough / custom caching, not as the performance path.