Performance characteristics of Zarrs' Sync vs Async APIs #50

Description

@kylebarron

zarrs' sync and async APIs have different performance characteristics.

In particular, because zarrs' async API is entirely runtime-agnostic, it doesn't implement any parallelism natively. Fetches run concurrently, but CPU-bound work such as decoding is not parallelized across chunks.

This matters most for the zarrs APIs that fetch or store multiple chunks in one call.

Read more in the Zarrs docs' section on Parallelism and Concurrency.


🤖 Written by Claude:

Async aggregate reads don't parallelize over chunks

Background

zarrs is async-runtime-agnostic and never spawns tasks internally. Its async
methods give concurrency (overlapping storage I/O) but not parallelism:
codec decode of multiple chunks is driven on a single polling task. Per the
upstream docs, async_retrieve_array_subset / async_retrieve_chunks "do not
parallelise over chunks and can be slow compared to the sync API." (A single
chunk's codec still uses rayon internally; what's lost is overlapping one
chunk's decode with the next chunk's decode/fetch.)

The sync API has no such constraint — sync retrieve_array_subset uses rayon
and parallelizes over chunks.
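
A minimal Python sketch of the concurrency-vs-parallelism distinction above, using asyncio as a stand-in for the Rust async runtime. `fetch`, `decode`, and `aggregate_read` are hypothetical names, not zarrs or binding APIs. The fetches overlap on the event loop, but every decode runs inline on the same thread:

```python
import asyncio
import threading


async def fetch(chunk_id: int) -> bytes:
    # Stand-in for storage I/O: awaiting yields to the event loop,
    # so many fetches can be in flight at once (concurrency).
    await asyncio.sleep(0.01)
    return bytes([chunk_id]) * 4


def decode(raw: bytes) -> tuple[int, int]:
    # Stand-in for CPU-bound codec work. Returns the decoded value and
    # the id of the thread that did the decoding.
    return sum(raw), threading.get_ident()


async def fetch_and_decode(chunk_id: int) -> tuple[int, int]:
    raw = await fetch(chunk_id)
    return decode(raw)  # runs inline on the thread driving the event loop


async def aggregate_read(n_chunks: int) -> list[tuple[int, int]]:
    # Everything is driven by one event-loop thread, like an aggregate
    # async call that never spawns: fetches overlap, decodes do not.
    return await asyncio.gather(*(fetch_and_decode(i) for i in range(n_chunks)))


results = asyncio.run(aggregate_read(8))
decode_threads = {tid for _, tid in results}
print(len(decode_threads))  # 1: every decode ran on the event-loop thread
```

The analogy is loose (Python's GIL adds its own serialization), but the shape is the same: without an explicit spawn onto other workers, overlapping I/O does not buy overlapping decode.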

Where this bites us

Method                                              | Cross-chunk parallelism
sync retrieve_array_subset (src/array/sync.rs:69)   | ✅ rayon
async retrieve_array_subset (src/array/async.rs:89) | ❌ serial decode
async retrieve_chunks                               | ❌ serial decode

future_into_py (pyo3-async-runtimes, default multithreaded tokio runtime)
spawns each Python call as its own tokio task. So per-chunk async calls
fanned out from Python already parallelize for free:

chunks = await asyncio.gather(*(arr.retrieve_chunk(c) for c in coords))

The problem is only the aggregate methods, which do their fan-out inside a
single zarrs call we can't reach into.

Split by store type

  • Sync-capable store (local FS): sync API is already the fast path (rayon).
    For async ergonomics without losing it, run the sync method inside
    tokio::task::spawn_blocking and return via future_into_py.
  • Async-only store (object_store / S3): sync traits unavailable, so we're
    stuck with serial decode unless we build the fan-out ourselves.
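
The sync-capable branch can be sketched in Python, with `asyncio.to_thread` playing the role that spawn_blocking plays on the Rust side. `sync_read_subset` and `read_subset_async` are hypothetical stand-ins, not binding APIs; the point is only that the blocking call runs off the event-loop thread while the caller still gets an awaitable:

```python
import asyncio
import threading


def sync_read_subset(n_chunks: int) -> tuple[list[int], int]:
    # Hypothetical stand-in for a blocking aggregate read that parallelizes
    # internally; here it just records the thread it ran on.
    return [i * 2 for i in range(n_chunks)], threading.get_ident()


async def read_subset_async(n_chunks: int) -> tuple[list[int], int]:
    # Run the blocking call on a worker thread so the event loop stays free.
    return await asyncio.to_thread(sync_read_subset, n_chunks)


async def main() -> tuple[list[int], bool]:
    loop_thread = threading.get_ident()
    data, worker_thread = await read_subset_async(4)
    return data, worker_thread != loop_thread


data, ran_off_loop = asyncio.run(main())
print(data)  # [0, 2, 4, 6]
```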

Proposed fix

Build one Rust-side fan-out helper and route async aggregate reads through it
instead of calling async_retrieve_array_subset directly:

  1. Enumerate chunks overlapping the requested ArraySubset.
  2. For each, tokio::spawn an async_retrieve_chunk_subset for the
    intersection (full chunk interior, clipped edges).
  3. join_all and copy each decoded piece into its offset in the output array.
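
The three steps above can be sketched in pure Python. This is a structural illustration only, assuming a regular 2-D chunk grid and a non-empty request; `overlapping_chunks`, `decode_region`, and `retrieve_subset` are hypothetical names, and the real helper would live in Rust with tokio::spawn where this uses `asyncio.to_thread` (which, under the GIL, would not actually parallelize Python-level decode):

```python
import asyncio
from itertools import product

CHUNK = (4, 4)  # assumed regular chunk shape


def overlapping_chunks(start, stop, chunk=CHUNK):
    """Yield (chunk_index, region) for chunks touching [start, stop).

    Each region is a tuple of per-axis (lo, hi) bounds in array
    coordinates, already clipped to the request.
    """
    ranges = [range(s // c, (e - 1) // c + 1) for s, e, c in zip(start, stop, chunk)]
    for idx in product(*ranges):
        region = tuple(
            (max(s, i * c), min(e, (i + 1) * c))
            for i, s, e, c in zip(idx, start, stop, chunk)
        )
        yield idx, region


def decode_region(region):
    # Stand-in for fetching and decoding one chunk subset. The "array" is
    # synthetic: element (r, c) has value r * 100 + c.
    (r0, r1), (c0, c1) = region
    return [[r * 100 + c for c in range(c0, c1)] for r in range(r0, r1)]


async def retrieve_subset(start, stop):
    # Step 1: enumerate the chunks overlapping the request.
    regions = [region for _, region in overlapping_chunks(start, stop)]
    # Step 2: one worker job per chunk intersection.
    pieces = await asyncio.gather(
        *(asyncio.to_thread(decode_region, reg) for reg in regions)
    )
    # Step 3: copy each decoded piece into its offset in the output.
    rows, cols = stop[0] - start[0], stop[1] - start[1]
    out = [[None] * cols for _ in range(rows)]
    for ((r0, _), (c0, c1)), piece in zip(regions, pieces):
        for dr, row in enumerate(piece):
            out[r0 - start[0] + dr][c0 - start[1] : c1 - start[1]] = row
    return out


result = asyncio.run(retrieve_subset((2, 3), (6, 9)))
```

The request (2, 3) to (6, 9) touches six chunks of the 4x4 grid; interior chunks contribute their full overlap and edge chunks are clipped, matching the "full chunk interior, clipped edges" case in step 2.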

This is what zarrs does internally, but with spawn instead of inline await,
i.e. the "parallelism over chunks by spawning outside zarrs" the upstream docs
recommend. Reuse the helper for both retrieve_chunks and
retrieve_array_subset. Keep the sync API documented as the fast path for local
stores.

When fanning out many chunks, consider lowering the codec concurrent_target to
avoid oversubscribing cores (each task may itself use intra-codec rayon).
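
One way to pick that lower target is to split the core count evenly across the chunk tasks in flight. This is a heuristic sketch, not a zarrs API; `codec_target_per_task` is a hypothetical helper whose result would be passed as the per-call codec concurrent target:

```python
import os
from typing import Optional


def codec_target_per_task(chunk_tasks: int, cores: Optional[int] = None) -> int:
    """Split a core budget evenly across concurrently decoding chunk tasks."""
    if chunk_tasks < 1:
        raise ValueError("chunk_tasks must be at least 1")
    # Fall back to the machine's core count, or 1 if it cannot be determined.
    cores = cores or os.cpu_count() or 1
    # Never go below 1: each task still needs one thread to make progress.
    return max(1, cores // chunk_tasks)


print(codec_target_per_task(4, cores=16))   # 4
print(codec_target_per_task(64, cores=16))  # 1
```

With many more chunk tasks than cores, the target bottoms out at 1, which leaves all parallelism to the cross-chunk fan-out.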

Decode stays in Rust

Note that retrieve_encoded_chunk returns compressed bytes; pushing decode to
Python would mean reimplementing codecs under the GIL. Decoding must stay in
Rust, off the GIL. retrieve_encoded_chunk remains useful for raw-bytes
passthrough / custom caching, not as the performance path.
