Add opt-in Zarr v3 sharding via shardShape -> sharding_indexed codec #171

Open: blasscoc wants to merge 1 commit into main from feat/zarr3-sharding

@blasscoc (Collaborator)

MDIO v3 datasets write one storage object per chunk, which forces a single trade-off: small chunks give fine read granularity but produce many tiny objects (painful on object stores), while large chunks give few big objects but coarsen every read. Zarr v3 sharding resolves this by packing many small inner chunks into one large shard (the storage object) behind a per-shard index, so reads still fetch only the inner chunks they need.
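To make the trade-off concrete, here is a small arithmetic sketch. The array, chunk, and shard shapes are made-up illustration values, and `count_objects` is a hypothetical helper, not part of MDIO.

```python
import math


def count_objects(array_shape, object_shape):
    """Number of grid cells needed to tile array_shape with object_shape.

    Uses integer ceiling division per axis so partial edge cells count too.
    """
    return math.prod(-(-n // s) for n, s in zip(array_shape, object_shape))


# Illustrative shapes only (assumptions, not taken from this PR).
array_shape = (1024, 1024, 1024)
chunk_shape = (64, 64, 64)      # fine read granularity
shard_shape = (512, 512, 512)   # one storage object per shard

unsharded_objects = count_objects(array_shape, chunk_shape)   # 16 * 16 * 16
sharded_objects = count_objects(array_shape, shard_shape)     # 2 * 2 * 2
chunks_per_shard = count_objects(shard_shape, chunk_shape)    # 8 * 8 * 8

print(unsharded_objects, sharded_objects, chunks_per_shard)
```

With these numbers the store holds 8 objects instead of 4096, while the smallest readable unit stays a 64x64x64 chunk.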

This exposes that as an opt-in: a RegularChunkShape may now carry an optional shardShape alongside chunkShape. When present, the v3 codec pipeline already built for the variable is nested inside a sharding_indexed codec (inner chunk_shape = chunkShape, index_codecs = [bytes, crc32c], index_location = end), and the array-level chunk_grid is rewritten to use shardShape as its chunk_shape. chunkShape becomes the inner read chunk; shardShape becomes the storage object.
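The metadata rewrite described above can be sketched on a plain dict that follows the Zarr v3 array metadata layout. This is an illustration of the resulting shape, not the code in this PR: `wrap_in_sharding` is a hypothetical name, and the example codec pipeline (bytes + zstd) and shapes are assumptions.

```python
def wrap_in_sharding(array_metadata, chunk_shape, shard_shape):
    """Return a copy of Zarr v3 array metadata with its codecs nested in a shard.

    Hypothetical helper: the existing pipeline becomes the inner codecs of a
    sharding_indexed codec, and the outer chunk grid is set to the shard shape.
    """
    sharding_codec = {
        "name": "sharding_indexed",
        "configuration": {
            "chunk_shape": list(chunk_shape),       # inner read chunk
            "codecs": list(array_metadata["codecs"]),  # pipeline built earlier
            "index_codecs": [
                {"name": "bytes", "configuration": {"endian": "little"}},
                {"name": "crc32c"},
            ],
            "index_location": "end",
        },
    }
    sharded = dict(array_metadata)  # shallow copy; the input is left untouched
    sharded["chunk_grid"] = {
        "name": "regular",
        "configuration": {"chunk_shape": list(shard_shape)},  # storage object
    }
    sharded["codecs"] = [sharding_codec]
    return sharded


# Assumed unsharded metadata for one variable (illustrative values).
unsharded = {
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [64, 64, 64]}},
    "codecs": [
        {"name": "bytes", "configuration": {"endian": "little"}},
        {"name": "zstd", "configuration": {"level": 3, "checksum": False}},
    ],
}

sharded = wrap_in_sharding(unsharded, chunk_shape=(64, 64, 64), shard_shape=(512, 512, 512))
```

After the call, the only top-level codec is sharding_indexed, and the original pipeline appears unchanged under its `codecs` key.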

Fully backward compatible: with no shardShape the metadata is unchanged (today's one-chunk-per-object behavior). shardShape must be a positive integer multiple of chunkShape on every axis, validated at spec-build time.
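The multiple-of-chunkShape rule could be checked along these lines. This is a minimal sketch of the constraint as stated, assuming shapes are plain integer sequences; `validate_shard_shape` and its error messages are hypothetical and may differ from the validator in this PR.

```python
def validate_shard_shape(chunk_shape, shard_shape):
    """Raise ValueError unless shard_shape is a positive integer multiple
    of chunk_shape on every axis. Hypothetical helper for illustration."""
    if len(chunk_shape) != len(shard_shape):
        raise ValueError(
            f"shardShape has {len(shard_shape)} axes, chunkShape has {len(chunk_shape)}"
        )
    for axis, (chunk, shard) in enumerate(zip(chunk_shape, shard_shape)):
        if chunk <= 0 or shard <= 0:
            raise ValueError(f"axis {axis}: sizes must be positive")
        if shard % chunk != 0:
            raise ValueError(
                f"axis {axis}: shardShape {shard} is not a multiple of chunkShape {chunk}"
            )


def is_valid(chunk_shape, shard_shape):
    """Convenience wrapper returning a bool instead of raising."""
    try:
        validate_shard_shape(chunk_shape, shard_shape)
    except ValueError:
        return False
    return True


ok = is_valid((64, 64, 64), (512, 512, 128))        # multiple on every axis
bad_multiple = is_valid((64, 64, 64), (512, 100, 128))  # 100 % 64 != 0
bad_rank = is_valid((64, 64), (512, 512, 512))      # axis count mismatch
```

A shardShape equal to chunkShape passes this check (multiple of 1), which gives one inner chunk per shard.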
