From 73231a87ba38479595d7acd873886af50556cfc8 Mon Sep 17 00:00:00 2001
From: Max Jones <14077947+maxrjones@users.noreply.github.com>
Date: Wed, 27 May 2026 22:53:47 -0400
Subject: [PATCH 1/3] docs: add Data structures concepts page with glossary
stubs
---
docs/concepts/data-structures.md | 151 ++++++++++++++++++
docs/concepts/images/gridded-vs-ungridded.svg | 71 ++++++++
docs/glossary.md | 89 ++++++++++-
mkdocs.yml | 1 +
4 files changed, 311 insertions(+), 1 deletion(-)
create mode 100644 docs/concepts/data-structures.md
create mode 100644 docs/concepts/images/gridded-vs-ungridded.svg
diff --git a/docs/concepts/data-structures.md b/docs/concepts/data-structures.md
new file mode 100644
index 0000000..f6a280d
--- /dev/null
+++ b/docs/concepts/data-structures.md
@@ -0,0 +1,151 @@
+# Data structures
+
+This page defines the data-representation concepts the rest of the guide leans
+on. Two orthogonal axes, *gridded vs. ungridded* and *structured vs.
+unstructured*, set up the vocabulary; **regridding** is the operation that moves
+data between representations; and a **datacube** is the analysis-ready destination.
+
+Reference material on satellite product taxonomy (swath geometry, analysis-ready
+data specifications, agency processing-level schemes) is out of scope here — see
+the glossary's
+[Satellite data products](../glossary.md#satellite-data-products) section,
+whose entries link to canonical external references.
+
+## Two spaces
+
+Most of what follows describes the relationship between two distinct spaces:
+
+- **Index space** — the integer-indexed structure of the array itself. A value
+ is addressed by its position `(i, j[, k])`. Has no inherent units.
+- **World space** (also "physical space" or, in geospatial work, "geographic
+ space") — where each value actually sits in a [CRS](../glossary.md). Has
+ units (meters, degrees, kelvin-along-a-vertical-axis, …) defined by the CRS.
+
+**Gridded** data is data that has both spaces, plus a mapping between them
+(the "grid geometry"). **Ungridded** data lives only in world space, with no
+index space at all — each record carries its own world-space coordinates.
+
+## Gridded vs. ungridded
+
+The first question to ask of any dataset: **is it on a grid at all?**
+
+- **Gridded** data lives in **both** index space and world space, with the
+ mapping between them — the "grid geometry" — stored separately from the
+ values. The mapping can take one of three common forms:
+ - an **affine transform plus a CRS** — the GeoTIFF / COG convention.
+ `(x, y) = affine @ (i, j)`; **no per-cell coordinate arrays stored**.
+ - **1-D coordinate arrays per axis** for a regular grid (`x[nx]`, `y[ny]`)
+ — the CF / NetCDF / Zarr convention.
+ - **2-D coordinate arrays per cell** for a curvilinear grid
+ (`x[i, j]`, `y[i, j]`).
+
+ Examples: a satellite Level-3 product on a regular lat/lon grid, a
+ climate-model output, a reanalysis dataset, a COG.
+
+- **Ungridded** data lives **only in world space** — no index space, no array
+ structure. Each value carries its own coordinates `(x[k], y[k])` in whatever
+ CRS the producer chose (geographic *or* projected). There's no row `i` and
+ column `j`, only individual observations. Examples: weather stations, ocean
+ buoys, GNSS receivers, lidar/GPS point clouds, in-situ vertical profiles,
+ aircraft tracks.
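+
+The three forms of grid geometry listed above can describe the same grid. As a
+minimal sketch in plain Python (the grid size, origin, and cell size are made-up
+values, and the column-before-row, cell-centre convention is an assumption
+borrowed from common raster tooling, not something this page prescribes), the
+block below builds all three for one small grid and contrasts them with an
+ungridded record list:
+
+```python
+# Hypothetical 3 x 4 grid of 0.5-degree cells; upper-left corner at (10.0, 50.0).
+ny, nx = 3, 4
+dx, dy = 0.5, -0.5        # dy < 0: the row index increases southward
+x0, y0 = 10.0, 50.0
+
+def affine_to_world(row, col):
+    """Form 1: affine transform applied to the cell centre (col + 0.5, row + 0.5)."""
+    return (x0 + dx * (col + 0.5), y0 + dy * (row + 0.5))
+
+# Form 2: one 1-D coordinate array per axis (cell centres).
+x_1d = [x0 + dx * (c + 0.5) for c in range(nx)]
+y_1d = [y0 + dy * (r + 0.5) for r in range(ny)]
+
+# Form 3: 2-D coordinate arrays, one (x, y) pair stored per cell.
+x_2d = [[x_1d[c] for c in range(nx)] for r in range(ny)]
+y_2d = [[y_1d[r] for c in range(nx)] for r in range(ny)]
+
+# All three give the same world-space location for row 1, column 2.
+assert affine_to_world(1, 2) == (x_1d[2], y_1d[1]) == (x_2d[1][2], y_2d[1][2])
+
+# Ungridded: no index space; every record carries its own coordinates.
+stations = [
+    {"x": 10.3, "y": 49.1, "value": 281.4},
+    {"x": 11.7, "y": 48.6, "value": 279.9},
+]
+```
+
+The affine form stores six numbers regardless of grid size, the 1-D form stores
+`nx + ny` coordinates, and the 2-D form stores `2 * nx * ny`, which is why the
+2-D form is usually reserved for curvilinear grids that need it.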
+
+{ width="100%" }
+
+The same physical quantity (say, surface temperature) can be represented either
+way. Ungridded observations are often the *input* to a regridding step that
+produces a gridded product (see [Regridding](#regridding-resampling) below).
+
+## Structured vs. unstructured
+
+A second, *orthogonal* axis applies to gridded data: **how is cell connectivity
+defined?** This axis says nothing about ungridded data — ungridded data has no
+grid topology at all.
+
+- **Structured grid:** cells form a regular logical array addressable by integer
+ indices `(i, j[, k])`; **connectivity is implicit** (neighbors of `(i, j)`
+ are `(i±1, j)` and `(i, j±1)`). Includes **regular** (rectilinear) grids and
+ **curvilinear** grids — logically rectangular but physically warped, common in
+ ocean models.
+- **Unstructured grid (mesh):** cells (triangles, polygons, sometimes mixed) are
+ joined by an **explicit connectivity list**; nodes have variable numbers of
+ neighbors. Examples: ICON, MPAS, FVCOM, finite-element meshes. Storage and
+ access patterns are fundamentally different from structured grids.
+- **Discrete Global Grid Systems (DGGS):** a third option that doesn't fit
+ neatly into structured-vs-unstructured. A DGGS tiles the *whole* sphere with
+ a single (often equal-area) cell family and a hierarchical refinement scheme;
+ cells are addressed by a **specialized cell ID**, with connectivity and
+ refinement encoded in the ID's arithmetic — no `(i, j)` array shape, and no
+ explicit connectivity list. Examples: **HEALPix** (equal-area quadrilateral
+ cells with ring/nested indexing), **H3** (hexagonal cells with a hierarchical
+ hex ID), **S2** (quadrilateral cells on a cubed sphere), **rHEALPix**,
+ **cubed-sphere**. Standardized by the
+ [OGC DGGS abstract specification](https://www.ogc.org/standards/dggs/).
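+
+The three connectivity models above differ mainly in where the neighbor
+information lives. The following is a rough sketch under stated assumptions: the
+function names and the two-triangle mesh are hypothetical, the mesh helper
+assumes an all-triangle mesh (so nodes sharing a face also share an edge), and
+the DGGS line relies only on the HEALPix nested-ordering rule that cell `p` has
+children `4p` to `4p + 3` at the next resolution.
+
+```python
+def structured_neighbors(i, j, shape):
+    """Structured grid: neighbors follow from index arithmetic alone."""
+    ni, nj = shape
+    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
+    return [(a, b) for a, b in candidates if 0 <= a < ni and 0 <= b < nj]
+
+# Unstructured mesh: an explicit connectivity list of node indices per face.
+# Hypothetical mesh of two triangles sharing the edge between nodes 1 and 2.
+faces = [(0, 1, 2), (1, 3, 2)]
+
+def mesh_node_neighbors(node, faces):
+    """Unstructured mesh: neighbors must be looked up in the connectivity list."""
+    linked = set()
+    for face in faces:
+        if node in face:
+            linked.update(face)
+    linked.discard(node)
+    return sorted(linked)
+
+def healpix_nested_parent(cell_id):
+    """DGGS: the parent cell is recovered from the cell ID itself."""
+    return cell_id // 4
+
+print(structured_neighbors(0, 0, (3, 4)))   # corner cell: two neighbors
+print(mesh_node_neighbors(0, faces))        # node 0 touches one triangle
+print(mesh_node_neighbors(1, faces))        # node 1 touches both triangles
+print(healpix_nested_parent(13))            # 13 is one of the children 12..15
+```
+
+Note that the mesh nodes have different neighbor counts, which is the property
+that rules out a fixed `(i, j)` array shape.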
+
+{ width="100%" }
+
+The two axes combine: a dataset can be **gridded + structured** (most satellite
+Level-3 products), **gridded + unstructured** (an ocean-model output on a
+triangular mesh), **gridded + DGGS** (a HEALPix cosmology map; an H3 hex map),
+or **ungridded** (irrelevant to this axis — there's no grid).
+
+## Regridding / resampling
+
+**Regridding** (or **resampling**) is the operation that moves data from one
+spatial sampling to another — from ungridded or unstructured input onto a
+regular grid, or between two grids. It is the verb connecting the preceding
+nouns to the datacube that follows. The diagram in
+[Gridded vs. ungridded](#gridded-vs-ungridded), read right-to-left, illustrates
+the simplest case: scattered ungridded points resampled onto a regular grid.
+
+The choice of **interpolation method** matters more than people expect:
+
+| Method | Behavior | Use for |
+|---|---|---|
+| **Nearest** | Picks the closest source value; preserves values exactly; blocky. | Categorical / class data (land cover, flags). |
+| **Bilinear** (linear) | Weighted average of the four surrounding cells; smooth; blurs sharp edges. | Smooth continuous fields (temperature, reflectance). |
+| **Conservative** | Area-weighted; preserves area-integrated totals across cells. | Extensive quantities — fluxes, precipitation, mass. |
+
+A useful rule of thumb: **match the method to the quantity.** The wrong choice
+silently corrupts downstream analysis. Bilinear on precipitation does not
+conserve total water, and bilinear on a categorical mask produces meaningless
+fractional categories, whereas nearest preserves the classes exactly.
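+
+The rule of thumb can be checked on a toy 1-D case. This is an illustrative
+sketch only: the helper functions are hypothetical and are not the API of any
+library named on this page, 1-D linear interpolation stands in for bilinear, and
+the conservative helper assumes equal-width source cells that nest exactly
+inside the target cells.
+
+```python
+def nearest_1d(src_x, src_v, dst_x):
+    """Copy the value of the closest source cell centre."""
+    out = []
+    for x in dst_x:
+        k = min(range(len(src_x)), key=lambda n: abs(src_x[n] - x))
+        out.append(src_v[k])
+    return out
+
+def linear_1d(src_x, src_v, dst_x):
+    """Interpolate linearly between the two bracketing source centres."""
+    out = []
+    for x in dst_x:
+        for k in range(len(src_x) - 1):
+            if src_x[k] <= x <= src_x[k + 1]:
+                w = (x - src_x[k]) / (src_x[k + 1] - src_x[k])
+                out.append((1 - w) * src_v[k] + w * src_v[k + 1])
+                break
+        else:
+            raise ValueError("target point lies outside the source range")
+    return out
+
+def conservative_coarsen_1d(src_v, factor):
+    """Area-weighted mean over blocks of `factor` equal-width source cells."""
+    return [sum(src_v[k:k + factor]) / factor for k in range(0, len(src_v), factor)]
+
+# Extensive quantity: rain depth in six 1 km cells, regridded to two 3 km cells.
+fine_x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
+rain = [0.0, 0.0, 9.0, 0.0, 0.0, 0.0]
+coarse_x = [1.5, 4.5]
+
+total_src = sum(rain) * 1.0                                        # depth x width
+total_linear = sum(linear_1d(fine_x, rain, coarse_x)) * 3.0
+total_conservative = sum(conservative_coarsen_1d(rain, 3)) * 3.0
+
+# Categorical quantity: class codes 1 and 3, resampled to four target points.
+class_x = [0.5, 1.5]
+classes = [1, 3]
+target_x = [0.6, 0.9, 1.1, 1.4]
+class_nearest = nearest_1d(class_x, classes, target_x)
+class_linear = linear_1d(class_x, classes, target_x)
+
+print(total_src, total_linear, total_conservative)
+print(class_nearest, class_linear)
+```
+
+In this setup the linear result misses the single wet cell entirely, the
+conservative result keeps the area-integrated total, and linear interpolation of
+the class codes yields values that correspond to no class.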
+
+Common Python tooling: **`xESMF`** (xarray-friendly, supports conservative
+regridding via ESMF), **`pyresample`** (especially for swath → grid),
+**`rasterio.warp.reproject`** (GDAL-backed, the GeoTIFF/COG path),
+**`scipy.interpolate`** for one-off cases.
+
+For empirical performance trade-offs across these tools, see Development Seed's
+[warp/resample profiling benchmark][warp-resample-profiling], which measures
+memory and time across local vs. S3 storage and NetCDF, Zarr, and GeoTIFF
+sources.
+
+[warp-resample-profiling]: https://developmentseed.org/warp-resample-profiling/
+
+## Datacube
+
+A **datacube** is a labeled, regularly-gridded N-dimensional array — dimensions
+carry coordinates (e.g. `time`, `level`, `lat`, `lon`, `band`), and the data is
+addressable by those coordinates rather than only by integer index. Datacubes
+typically have 3 to 5 dimensions.
+
+A datacube is inherently a **structured, gridded** representation. It is most
+often the *product* of gridding either ungridded or unstructured-mesh data onto
+regular grids — i.e. the destination of the previous section's operation.
+Common containers include Zarr (cloud-optimized), NetCDF, and HDF5; the
+in-memory representation is typically an Xarray `Dataset` when using Python.
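+
+To make "addressable by coordinates" concrete, here is a small plain-Python
+sketch. The `sel` helper, the coordinate values, and the cube contents are all
+hypothetical; the helper is only loosely analogous to label-based selection in
+Xarray and does not reproduce its behavior.
+
+```python
+# Hypothetical 2 x 2 x 3 cube with dimensions (time, lat, lon).
+coords = {
+    "time": ["2024-01", "2024-02"],
+    "lat": [10.0, 20.0],
+    "lon": [100.0, 110.0, 120.0],
+}
+values = [
+    [[1, 2, 3], [4, 5, 6]],
+    [[7, 8, 9], [10, 11, 12]],
+]
+
+def sel(values, coords, **labels):
+    """Look up one value by coordinate label instead of by integer position."""
+    out = values
+    for dim in coords:                      # dict order matches dimension order
+        out = out[coords[dim].index(labels[dim])]
+    return out
+
+by_label = sel(values, coords, time="2024-02", lat=10.0, lon=120.0)
+by_index = values[1][0][2]
+assert by_label == by_index
+```
+
+The label lookup works only because every axis has exactly one coordinate value
+per index, which is the structured, gridded property described above.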
+
+For a deeper look at how a datacube's dimensions reduce to common viewing
+shapes (maps, time series, profiles, animations), see the
+[visualization overview](../visualization/overview.md).
+
+{ width="100%" }
+
+## External references
+
+- **UGRID conventions** (unstructured-mesh in NetCDF):
+
+- **CF conventions:**
+- **xESMF** (regridding for xarray):
+- **pyresample** (geospatial resampling):
+- **Open Data Cube:**
diff --git a/docs/concepts/images/gridded-vs-ungridded.svg b/docs/concepts/images/gridded-vs-ungridded.svg
new file mode 100644
index 0000000..b0f4bc6
--- /dev/null
+++ b/docs/concepts/images/gridded-vs-ungridded.svg
@@ -0,0 +1,71 @@
+
+
diff --git a/docs/glossary.md b/docs/glossary.md
index c77a631..766d79d 100644
--- a/docs/glossary.md
+++ b/docs/glossary.md
@@ -75,7 +75,8 @@ Range request
Datacube
: A multi-dimensional array of data, for example time × level × latitude ×
- longitude × band, typically spanning 3 to 5 dimensions.
+ longitude × band, typically spanning 3 to 5 dimensions. See
+ [Data structures](concepts/data-structures.md#datacube).
Chunk
: A contiguous block of a chunked array, read and written as a unit. Chunk
@@ -102,3 +103,89 @@ STAC (SpatioTemporal Asset Catalog)
OPeNDAP
: A protocol for remote access to subsets of scientific datasets over HTTP.
+
+## Data structures
+
+Foundational concepts covered on the
+[Data structures](concepts/data-structures.md) page.
+
+Index space
+: The integer-indexed structure of an array. A value is addressed by its
+ position `(i, j[, k])`; has no inherent units. See
+ [Data structures](concepts/data-structures.md#two-spaces).
+
+World space
+: Where each value actually sits in a CRS — also "physical space" or
+ "geographic space". Has units defined by the CRS (meters, degrees, …). See
+ [Data structures](concepts/data-structures.md#two-spaces).
+
+Gridded
+: Data tied to the cells (or nodes) of a grid; a value's location is implied
+ by its index in the array. See
+ [Data structures](concepts/data-structures.md#gridded-vs-ungridded).
+
+Ungridded
+: Scattered or point observations that are not arranged on a grid; each
+ value carries its own explicit coordinates. See
+ [Data structures](concepts/data-structures.md#gridded-vs-ungridded).
+
+Structured grid
+: A grid whose cells form a regular logical array addressable by integer
+ indices, with connectivity implicit. Includes regular (rectilinear) and
+ curvilinear grids. See
+ [Data structures](concepts/data-structures.md#structured-vs-unstructured).
+
+Unstructured grid
+: A mesh whose cells are joined by an explicit connectivity list, with
+ variable numbers of neighbors per node. See
+ [Data structures](concepts/data-structures.md#structured-vs-unstructured).
+
+DGGS (Discrete Global Grid System)
+: A global tessellation of the sphere by a single cell family (often
+ equal-area), with hierarchical refinement and a specialized cell-ID
+ indexing scheme — connectivity is implicit in the ID arithmetic rather
+ than in `(i, j)` array shape or an explicit connectivity list. Examples:
+ HEALPix, H3, S2, cubed-sphere. See
+ [Data structures](concepts/data-structures.md#structured-vs-unstructured).
+
+Regridding (resampling)
+: The operation that moves data from one spatial sampling to another, for
+ example from ungridded points onto a regular grid. Method matters: nearest,
+ bilinear, and conservative each suit different quantities. See
+ [Data structures](concepts/data-structures.md#regridding-resampling).
+
+## Satellite data products
+
+Pointers to canonical external references for satellite Earth-observation
+product taxonomy. These terms are out of scope for an in-depth treatment here.
+
+Swath
+: The strip of Earth's surface observed by a sensor as the platform moves
+ along its orbit; the sensor's native acquisition geometry, indexed by
+ along-track × across-track with 2-D geolocation arrays. Typical of
+ Level-1/Level-2 products, distinct from a Level-3 product resampled onto a
+ regular map grid. See ESA Sentinel Online —
+ [Sentinel-1 product types and processing levels][s1-products].
+
+Analysis-ready data (ARD)
+: Satellite data processed to a minimum set of requirements and organized so
+ it can be analyzed immediately, with minimal further user effort, and
+ interoperably across products. The formal reference is **CEOS-ARD**
+ (formerly CARD4L). See CEOS — [Analysis Ready Data][ceos-ard].
+
+Data processing levels
+: A maturity ladder describing how far a product has been processed, from raw
+ instrument data (Level 0) to model output (Level 4). The numbers are *not*
+ portable across agencies — NASA, ESA/Copernicus, and USGS each define their
+ own scheme. See NASA Earthdata —
+ [Data Processing Levels][nasa-levels].
+
+Timeliness (NRT/STC/NTC)
+: ESA's latency axis for a product: Near Real Time (hours), Short Time
+ Critical, and Non Time Critical (best calibration accuracy). Independent of
+ processing level. See Copernicus SentiWiki — [Sentinel-3 products][s3-products].
+
+[s1-products]: https://sentinel.esa.int/web/sentinel/user-guides/sentinel-1-sar/product-types-processing-levels
+[ceos-ard]: https://ceos.org/ard/
+[nasa-levels]: https://www.earthdata.nasa.gov/learn/earth-observation-data-basics/data-processing-levels
+[s3-products]: https://sentiwiki.copernicus.eu/web/s3-products
diff --git a/mkdocs.yml b/mkdocs.yml
index d89f35f..2dc0d34 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -15,6 +15,7 @@ extra:
nav:
- "index.md"
- Glossary: "glossary.md"
+ - Data structures: "concepts/data-structures.md"
- Datacube Worst Practices:
- Common production gotchas:
- Tiny data chunks: "worst-practices/tiny-chunks.ipynb"
From 3b4475ea32a78b7612843e590b297fda6701f609 Mon Sep 17 00:00:00 2001
From: Max Jones <14077947+maxrjones@users.noreply.github.com>
Date: Wed, 27 May 2026 23:02:57 -0400
Subject: [PATCH 2/3] fix links
---
docs/glossary.md | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)
diff --git a/docs/glossary.md b/docs/glossary.md
index 766d79d..b42f649 100644
--- a/docs/glossary.md
+++ b/docs/glossary.md
@@ -164,14 +164,16 @@ Swath
along its orbit; the sensor's native acquisition geometry, indexed by
along-track × across-track with 2-D geolocation arrays. Typical of
Level-1/Level-2 products, distinct from a Level-3 product resampled onto a
- regular map grid. See ESA Sentinel Online —
- [Sentinel-1 product types and processing levels][s1-products].
+ regular map grid. See Copernicus SentiWiki —
+ [Sentinel-1 products][s1-products].
Analysis-ready data (ARD)
-: Satellite data processed to a minimum set of requirements and organized so
- it can be analyzed immediately, with minimal further user effort, and
- interoperably across products. The formal reference is **CEOS-ARD**
- (formerly CARD4L). See CEOS — [Analysis Ready Data][ceos-ard].
+: Any dataset that has been preprocessed such that it fulfills the quality
+ standards required by the analysis to be performed on it. For satellite
+ Earth observation specifically, CEOS-ARD (formerly CARD4L) is the
+ community standard. See Stern et al., *Frontiers in Climate* (2021) —
+ [Pangeo Forge: Crowdsourcing Analysis-Ready, Cloud Optimized Data
+ Production][ard-frontiers].
Data processing levels
: A maturity ladder describing how far a product has been processed, from raw
@@ -185,7 +187,7 @@ Timeliness (NRT/STC/NTC)
Critical, and Non Time Critical (best calibration accuracy). Independent of
processing level. See Copernicus SentiWiki — [Sentinel-3 products][s3-products].
-[s1-products]: https://sentinel.esa.int/web/sentinel/user-guides/sentinel-1-sar/product-types-processing-levels
-[ceos-ard]: https://ceos.org/ard/
+[s1-products]: https://sentiwiki.copernicus.eu/web/s1-products
+[ard-frontiers]: https://www.frontiersin.org/journals/climate/articles/10.3389/fclim.2021.782909/full
[nasa-levels]: https://www.earthdata.nasa.gov/learn/earth-observation-data-basics/data-processing-levels
-[s3-products]: https://sentiwiki.copernicus.eu/web/s3-products
+[s3-products]: https://sentiwiki.copernicus.eu/web/sentinel-3
From 4d532c1364635a8f7101e377a3f1cfd679f41236 Mon Sep 17 00:00:00 2001
From: Max Jones <14077947+maxrjones@users.noreply.github.com>
Date: Wed, 27 May 2026 23:17:34 -0400
Subject: [PATCH 3/3] add figure
---
docs/concepts/data-structures.md | 2 +-
docs/visualization/images/grid-types.svg | 103 +++++++++++++++++++++++
2 files changed, 104 insertions(+), 1 deletion(-)
create mode 100644 docs/visualization/images/grid-types.svg
diff --git a/docs/concepts/data-structures.md b/docs/concepts/data-structures.md
index f6a280d..33666ba 100644
--- a/docs/concepts/data-structures.md
+++ b/docs/concepts/data-structures.md
@@ -81,7 +81,7 @@ grid topology at all.
**cubed-sphere**. Standardized by the
[OGC DGGS abstract specification](https://www.ogc.org/standards/dggs/).
-{ width="100%" }
+{ width="100%" }
The two axes combine: a dataset can be **gridded + structured** (most satellite
Level-3 products), **gridded + unstructured** (an ocean-model output on a
diff --git a/docs/visualization/images/grid-types.svg b/docs/visualization/images/grid-types.svg
new file mode 100644
index 0000000..8cf3fe2
--- /dev/null
+++ b/docs/visualization/images/grid-types.svg
@@ -0,0 +1,103 @@
+
+