From 02a573cd70efa9284502e20a2420da6b9481e0e5 Mon Sep 17 00:00:00 2001 From: Emmanuel Mathot Date: Tue, 10 Mar 2026 11:11:56 +0100 Subject: [PATCH 1/2] docs: enhance architecture and examples with STAC integration details --- docs/architecture.md | 112 +++++++++++++++++++++++++++++++++++++++++++ docs/examples.md | 3 ++ docs/index.md | 1 + 3 files changed, 116 insertions(+) diff --git a/docs/architecture.md b/docs/architecture.md index 5a949e65..3b9717d1 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -200,6 +200,118 @@ output.zarr/ └── .zmetadata # Consolidated metadata ``` +## STAC Integration and Zarr URL Resolution + +### The Store Root vs. Hierarchy Path Problem + +Zarr is defined as a **key/value store protocol**, not a file format. A Zarr hierarchy lives inside a *store*, and nodes (groups, arrays) are identified by *hierarchy paths* that map to storage keys. For example, a group at hierarchy path `my/group` means the key `my/group/zarr.json` exists in the store. + +When a URL such as `https://example.stac/path/something.zarr/my/group` is encountered, two conceptually distinct pieces have been concatenated: + +| Concept | Example value | +|---|---| +| **Store root** | `https://example.stac/path/something.zarr` | +| **Hierarchy path** | `my/group` | + +Reverse-engineering this split from the flattened URL is inherently fragile: the `.zarr` suffix is **only advisory** (per the [Zarr v3 specification](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html)) and could legally appear in folder or group names. As confirmed by Zarr specification contributors, "the existence of a group or array at some path `p` is defined by `p/zarr.json` resolving to a group / array metadata document" — there is no requirement that the store root carry any particular suffix. + +### Canonical Resolution Algorithm for Implementors + +The following algorithm resolves a Zarr group or array URL unambiguously: + +``` +1. CHECK for a STAC `rel: store` link in the STAC Item or Collection. + → If present: store_root = store link href + group_path = asset href, stripped of the store_root prefix + → This is the ONLY fully unambiguous resolution path. Always prefer it. + +2. FALLBACK (no store link — fragile, best-effort only): + → Find the last path segment that ends with `.zarr`. + store_root = everything up to and including that segment + group_path = remaining path segments + → Document clearly that this heuristic may fail if `.zarr` appears + in group or array names. + +3. VERIFY programmatically (optional but recommended): + → Append `/zarr.json` to the candidate store_root. + If the response is valid Zarr group metadata: confirmed. + Otherwise: try removing the last path segment and retry. + → Note: this requires an additional HTTP request and should only be + used when a store link is absent and the heuristic is uncertain. +``` + +!!! warning "Why fragment notation (`#`) does not work here" + A URL fragment like `something.zarr#/my/group` would require the client to first fetch `something.zarr` as a resource. Since a Zarr store root is not a directly fetchable HTTP resource (only `something.zarr/zarr.json` is), the fragment model does not fit HTTP semantics for Zarr hierarchies. The explicit `rel: store` link is the correct mechanism. + +### STAC `rel: store` Link + +The [STAC Zarr Best Practices](https://github.com/radiantearth/stac-best-practices/blob/main/best-practices-zarr.md#store-link-relationship) define a `"store"` relationship for exactly this purpose. All EOPF-produced STAC Items and Collections **MUST** include this link: + +```json +"links": [ + { + "rel": "store", + "href": "s3://bucket/S2A_MSIL2A_20251008T100041.zarr", + "type": "application/vnd.zarr; version=3", + "title": "Zarr Store" + } +] +``` + +With this link present, a client parses a group asset href like `s3://bucket/S2A_MSIL2A_20251008T100041.zarr/measurements/reflectance/r10m` by simply stripping the store root prefix, yielding hierarchy path `measurements/reflectance/r10m` — no heuristics needed. + +!!! important "Best-practices assumption" + The STAC Zarr Best Practices assume that **all assets in a STAC object are contained within the same Zarr store**. If assets span multiple stores, each store must have its own `rel: store` link and the relationship between an asset and its store link must be made explicit (e.g., via a shared `store_id` convention or separate STAC Items). + +### URL Naming Constraint + +To minimise ambiguity for clients that must fall back to the heuristic path above, the following constraint applies to all EOPF Zarr products: + +- The `.zarr` suffix **SHOULD** appear **at most once** in the full URL — only at the store root. +- Group names, array names, and any intermediate path segments **MUST NOT** end with `.zarr`. + +Example: + +``` +✅ s3://bucket/S2A_MSIL2A_20251008T100041.zarr/measurements/reflectance/r10m +❌ s3://bucket/S2A_MSIL2A_20251008T100041.zarr/measurements.zarr/reflectance +``` + +### EOPF Product URL Anatomy + +For a Sentinel-2 L2A EOPF product the URL breakdown looks like: + +``` +Store root: s3://bucket/S2A_MSIL2A_20251008T100041.zarr + │ + └── measurements/ + └── reflectance/ ← group path in STAC asset href + ├── r10m/ + │ ├── b02 ← array accessed via band `name` field + │ ├── b03 + │ └── b04 + ├── r20m/ + └── r60m/ +``` + +The band `name` field in the STAC `bands` array is designed so that `asset_href + "/" + band_name` constructs the correct full Zarr array URL: + +```python +import zarr + +asset_href = "s3://bucket/S2A_MSIL2A.zarr/measurements/reflectance/r10m" +band_name = "b04" +red_band = zarr.open_array(asset_href + "/" + band_name, mode="r") +``` + +### Related Specifications + +- **[Zarr v3 specification](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html)** — defines the abstract store interface, hierarchy paths, and `zarr.json` metadata documents +- **[STAC Zarr Best Practices](https://github.com/radiantearth/stac-best-practices/blob/main/best-practices-zarr.md)** — defines the `rel: store` link, asset media types, and band representation patterns +- **[Zarr STAC Extension](https://github.com/stac-extensions/zarr)** — adds `zarr:node_type`, `zarr:zarr_format`, and `zarr:consolidated` fields to STAC assets + +--- + ## Metadata Architecture ### 1. CF Conventions Compliance diff --git a/docs/examples.md b/docs/examples.md index 65dc7991..1eb2eed5 100644 --- a/docs/examples.md +++ b/docs/examples.md @@ -2,6 +2,9 @@ Practical examples demonstrating common use cases for the EOPF GeoZarr library. +!!! tip "Understanding Zarr asset URLs" + An asset href like `s3://bucket/data.zarr/measurements/reflectance/r10m` is the concatenation of a **store root** (`s3://bucket/data.zarr`) and a **hierarchy path** (`measurements/reflectance/r10m`). Clients MUST use the `rel: store` link in the STAC object to split these unambiguously. See the [STAC Integration and Zarr URL Resolution](architecture.md#stac-integration-and-zarr-url-resolution) section of the Architecture docs for the full resolution algorithm and naming constraints. + ## Basic Examples ### Simple Local Conversion diff --git a/docs/index.md b/docs/index.md index b9c0f483..0f61bf61 100644 --- a/docs/index.md +++ b/docs/index.md @@ -15,6 +15,7 @@ Welcome to the EOPF GeoZarr library documentation. This library provides tools t - **[API Reference](api-reference.md)** - Complete Python API documentation - **[Examples](examples.md)** - Practical examples for common use cases - **[Architecture](architecture.md)** - Technical architecture and design principles + - **[STAC Integration and Zarr URL Resolution](architecture.md#stac-integration-and-zarr-url-resolution)** - How to unambiguously parse Zarr group/array URLs from STAC assets using the `rel: store` link - **[GeoZarr Mini Spec](geozarr-minispec.md)** - Implementation-specific GeoZarr specification ### Support From 1e3d005ef18b427de8601bf2513988a517c61b80 Mon Sep 17 00:00:00 2001 From: Emmanuel Mathot Date: Tue, 10 Mar 2026 11:17:30 +0100 Subject: [PATCH 2/2] docs: update architecture and examples to clarify Zarr asset access and metadata requirements --- docs/architecture.md | 106 +++++++++++++++++++++---------------------- docs/examples.md | 4 +- 2 files changed, 54 insertions(+), 56 deletions(-) diff --git a/docs/architecture.md b/docs/architecture.md index 3b9717d1..5f713e4a 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -202,48 +202,47 @@ output.zarr/ ## STAC Integration and Zarr URL Resolution -### The Store Root vs. Hierarchy Path Problem +### Every Zarr Path is openable by client -Zarr is defined as a **key/value store protocol**, not a file format. A Zarr hierarchy lives inside a *store*, and nodes (groups, arrays) are identified by *hierarchy paths* that map to storage keys. For example, a group at hierarchy path `my/group` means the key `my/group/zarr.json` exists in the store. +Zarr is a **key/value store protocol**, not a file format. Crucially for clients, this means that **any Zarr group path is itself a valid store entry point**. A URL like: -When a URL such as `https://example.stac/path/something.zarr/my/group` is encountered, two conceptually distinct pieces have been concatenated: +``` +s3://bucket/S2A_MSIL2A_20251008T100041.zarr/measurements/reflectance/r10m +``` + +is not a path that needs to be split or reverse-parsed to find some "real" store root. It *is* the store. The Zarr spec defines the existence of a node by whether `{path}/zarr.json` resolves to valid metadata — and any valid group path satisfies this. Clients like xarray, zarr-python, GDAL, and OpenLayers should therefore **open the asset href directly as a Zarr store**, without needing to know anything about the hierarchy above it. + +```python +import xarray as xr -| Concept | Example value | -|---|---| -| **Store root** | `https://example.stac/path/something.zarr` | -| **Hierarchy path** | `my/group` | +# Open the asset href directly — no splitting or parsing needed +asset_href = "s3://bucket/S2A_MSIL2A.zarr/measurements/reflectance/r10m" +ds = xr.open_dataset(asset_href, engine="zarr") +``` -Reverse-engineering this split from the flattened URL is inherently fragile: the `.zarr` suffix is **only advisory** (per the [Zarr v3 specification](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html)) and could legally appear in folder or group names. As confirmed by Zarr specification contributors, "the existence of a group or array at some path `p` is defined by `p/zarr.json` resolving to a group / array metadata document" — there is no requirement that the store root carry any particular suffix. +This is the fundamental principle: **the STAC asset href is the URL to open, and it works as a complete, self-contained Zarr store**. -### Canonical Resolution Algorithm for Implementors +### Consolidated Metadata Enables Standalone Group Access -The following algorithm resolves a Zarr group or array URL unambiguously: +A Zarr group becomes fully self-contained for clients when it carries [consolidated metadata](https://zarr.readthedocs.io/en/main/user-guide/consolidated_metadata.html). Consolidated metadata embeds the metadata of all descendant nodes inside the group's own `zarr.json` (Zarr v3) or `.zmetadata` (Zarr v2), so a client can discover the entire sub-hierarchy structure in a single request — no traversal, no requests to parent groups. -``` -1. CHECK for a STAC `rel: store` link in the STAC Item or Collection. - → If present: store_root = store link href - group_path = asset href, stripped of the store_root prefix - → This is the ONLY fully unambiguous resolution path. Always prefer it. - -2. FALLBACK (no store link — fragile, best-effort only): - → Find the last path segment that ends with `.zarr`. - store_root = everything up to and including that segment - group_path = remaining path segments - → Document clearly that this heuristic may fail if `.zarr` appears - in group or array names. - -3. VERIFY programmatically (optional but recommended): - → Append `/zarr.json` to the candidate store_root. - If the response is valid Zarr group metadata: confirmed. - Otherwise: try removing the last path segment and retry. - → Note: this requires an additional HTTP request and should only be - used when a store link is absent and the heuristic is uncertain. +All EOPF-produced Zarr groups pointed to by STAC assets **MUST** have consolidated metadata. This is indicated in the STAC asset using the [Zarr STAC Extension](https://github.com/stac-extensions/zarr) field `zarr:consolidated: true`. + +```json +"assets": { + "reflectance": { + "href": "s3://bucket/S2A_MSIL2A.zarr/measurements/reflectance/r10m", + "type": "application/vnd.zarr; version=3", + "zarr:consolidated": true, + "zarr:node_type": "group", + "zarr:zarr_format": 3 + } +} ``` -!!! warning "Why fragment notation (`#`) does not work here" - A URL fragment like `something.zarr#/my/group` would require the client to first fetch `something.zarr` as a resource. Since a Zarr store root is not a directly fetchable HTTP resource (only `something.zarr/zarr.json` is), the fragment model does not fit HTTP semantics for Zarr hierarchies. The explicit `rel: store` link is the correct mechanism. +With consolidated metadata present, a client opening `asset_href` directly has everything it needs to work with the group and its children — without any knowledge of the parent hierarchy. -### STAC `rel: store` Link +### Role of the `rel: store` Link The [STAC Zarr Best Practices](https://github.com/radiantearth/stac-best-practices/blob/main/best-practices-zarr.md#store-link-relationship) define a `"store"` relationship for exactly this purpose. All EOPF-produced STAC Items and Collections **MUST** include this link: @@ -253,24 +252,23 @@ The [STAC Zarr Best Practices](https://github.com/radiantearth/stac-best-practic "rel": "store", "href": "s3://bucket/S2A_MSIL2A_20251008T100041.zarr", "type": "application/vnd.zarr; version=3", - "title": "Zarr Store" + "title": "Zarr Store Root" } ] ``` -With this link present, a client parses a group asset href like `s3://bucket/S2A_MSIL2A_20251008T100041.zarr/measurements/reflectance/r10m` by simply stripping the store root prefix, yielding hierarchy path `measurements/reflectance/r10m` — no heuristics needed. - -!!! important "Best-practices assumption" - The STAC Zarr Best Practices assume that **all assets in a STAC object are contained within the same Zarr store**. If assets span multiple stores, each store must have its own `rel: store` link and the relationship between an asset and its store link must be made explicit (e.g., via a shared `store_id` convention or separate STAC Items). +Its purpose is **navigation and discovery**, not URL parsing: -### URL Naming Constraint +- It lets clients traverse or inspect the **full Zarr hierarchy** above the asset group (siblings, parent groups, global attributes). +- It provides a single stable reference to the underlying storage location, useful for tools that need to know where the data lives (e.g., to construct pre-signed URLs, or list all groups in a store). +- It allows a client to verify that all assets in the STAC object share coverage under the same store. -To minimise ambiguity for clients that must fall back to the heuristic path above, the following constraint applies to all EOPF Zarr products: +!!! note + Opening the `rel: store` href directly is equivalent to opening the top-level Zarr root — useful for exploring the complete dataset structure, but **not required** for using any individual asset. -- The `.zarr` suffix **SHOULD** appear **at most once** in the full URL — only at the store root. -- Group names, array names, and any intermediate path segments **MUST NOT** end with `.zarr`. +### URL Naming Constraint -Example: +Group names, array names, and any intermediate path segments **MUST NOT** end with `.zarr`. The `.zarr` suffix SHOULD appear at most once in a full URL — only at the store root level — as a human-readable convention. This avoids confusion when reading URLs, even though no client should rely on this suffix for parsing. ``` ✅ s3://bucket/S2A_MSIL2A_20251008T100041.zarr/measurements/reflectance/r10m @@ -279,19 +277,19 @@ Example: ### EOPF Product URL Anatomy -For a Sentinel-2 L2A EOPF product the URL breakdown looks like: +For a Sentinel-2 L2A EOPF product, the store and asset relationship looks like this: ``` -Store root: s3://bucket/S2A_MSIL2A_20251008T100041.zarr - │ - └── measurements/ - └── reflectance/ ← group path in STAC asset href - ├── r10m/ - │ ├── b02 ← array accessed via band `name` field - │ ├── b03 - │ └── b04 - ├── r20m/ - └── r60m/ +rel: store → s3://bucket/S2A_MSIL2A_20251008T100041.zarr (top-level root) + │ + └── measurements/ + └── reflectance/ ← asset href (open this directly as a store) + ├── r10m/ ← sub-group: open directly, or path-join with band name + │ ├── b02 ← array: asset_href + "/" + band_name + │ ├── b03 + │ └── b04 + ├── r20m/ + └── r60m/ ``` The band `name` field in the STAC `bands` array is designed so that `asset_href + "/" + band_name` constructs the correct full Zarr array URL: @@ -307,7 +305,7 @@ red_band = zarr.open_array(asset_href + "/" + band_name, mode="r") ### Related Specifications - **[Zarr v3 specification](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html)** — defines the abstract store interface, hierarchy paths, and `zarr.json` metadata documents -- **[STAC Zarr Best Practices](https://github.com/radiantearth/stac-best-practices/blob/main/best-practices-zarr.md)** — defines the `rel: store` link, asset media types, and band representation patterns +- **[STAC Zarr Best Practices](https://github.com/radiantearth/stac-best-practices/blob/main/best-practices-zarr.md)** — defines the `rel: store` link, asset media types, band representation patterns, and consolidated metadata guidance - **[Zarr STAC Extension](https://github.com/stac-extensions/zarr)** — adds `zarr:node_type`, `zarr:zarr_format`, and `zarr:consolidated` fields to STAC assets --- diff --git a/docs/examples.md b/docs/examples.md index 1eb2eed5..1a611644 100644 --- a/docs/examples.md +++ b/docs/examples.md @@ -2,8 +2,8 @@ Practical examples demonstrating common use cases for the EOPF GeoZarr library. -!!! tip "Understanding Zarr asset URLs" - An asset href like `s3://bucket/data.zarr/measurements/reflectance/r10m` is the concatenation of a **store root** (`s3://bucket/data.zarr`) and a **hierarchy path** (`measurements/reflectance/r10m`). Clients MUST use the `rel: store` link in the STAC object to split these unambiguously. See the [STAC Integration and Zarr URL Resolution](architecture.md#stac-integration-and-zarr-url-resolution) section of the Architecture docs for the full resolution algorithm and naming constraints. +!!! tip "Opening Zarr assets — no URL parsing required" + Every STAC asset href pointing to a Zarr group (e.g. `s3://bucket/data.zarr/measurements/reflectance/r10m`) **is itself a valid Zarr store** and can be opened directly by any Zarr-compatible client. No reverse-parsing or store-root extraction is needed. See [STAC Integration and Zarr URL Resolution](architecture.md#stac-integration-and-zarr-url-resolution) in the Architecture docs for the full model, the role of the `rel: store` link, and consolidated metadata requirements. ## Basic Examples