diff --git a/docs/concepts/data-structures.md b/docs/concepts/data-structures.md
new file mode 100644
index 0000000..33666ba
--- /dev/null
+++ b/docs/concepts/data-structures.md
@@ -0,0 +1,151 @@
+# Data structures
+
+This page defines the data-representation concepts the rest of the guide leans
+on. Two orthogonal axes (*gridded vs. ungridded* and *structured vs.
+unstructured*) set up the vocabulary, **regridding** is the operation that
+moves data between them, and a **datacube** is the analysis-ready destination.
+
+Reference material on satellite product taxonomy (swath geometry, analysis-ready
+data specifications, agency processing-level schemes) is out of scope here; see
+the glossary's
+[Satellite data products](../glossary.md#satellite-data-products) section,
+whose entries link to canonical external references.
+
+## Two spaces
+
+Most of what follows describes the relationship between two distinct spaces:
+
+- **Index space**: the integer-indexed structure of the array itself. A value
+  is addressed by its position `(i, j[, k])`. It has no inherent units.
+- **World space** (also "physical space" or, in geospatial work, "geographic
+  space"): where each value actually sits in a [CRS](../glossary.md). It has
+  units (meters, degrees, kelvin-along-a-vertical-axis, …) defined by the CRS.
+
+**Gridded** data has both spaces, plus a mapping between them (the "grid
+geometry"). **Ungridded** data lives only in world space, with no index space
+at all: each record carries its own world-space coordinates.
+
+## Gridded vs. ungridded
+
+The first question to ask of any dataset: **is it on a grid at all?**
+
+- **Gridded** data lives in **both** index space and world space, with the
+  mapping between them (the "grid geometry") stored separately from the
+  values. The mapping can take one of three common forms:
+  - an **affine transform plus a CRS**, the GeoTIFF / COG convention.
+    `(x, y) = affine @ (i, j)`; **no per-cell coordinate arrays stored**.
+  - **1-D coordinate arrays per axis** for a regular grid (`x[nx]`, `y[ny]`),
+    the CF / NetCDF / Zarr convention.
+  - **2-D coordinate arrays per cell** for a curvilinear grid
+    (`x[i, j]`, `y[i, j]`).
+
+  Examples: a satellite Level-3 product on a regular lat/lon grid, a
+  climate-model output, a reanalysis dataset, a COG.
+
+- **Ungridded** data lives **only in world space**: no index space, no array
+  structure. Each value carries its own coordinates `(x[k], y[k])` in whatever
+  CRS the producer chose (geographic *or* projected). There's no row `i` and
+  column `j`, only individual observations. Examples: weather stations, ocean
+  buoys, GNSS receivers, lidar/GPS point clouds, in-situ vertical profiles,
+  aircraft tracks.
+
+![Side-by-side: left panel shows a regular grid of colored cells with values addressed by integer indices; right panel shows scattered points each labeled with its own (lat, lon) coordinate pair.](images/gridded-vs-ungridded.svg){ width="100%" }
+
+The same physical quantity (say, surface temperature) can be represented either
+way. Ungridded observations are often the *input* to a regridding step that
+produces a gridded product (see [Regridding](#regridding-resampling) below).
+
+## Structured vs. unstructured
+
+A second, *orthogonal* axis applies to gridded data: **how is cell connectivity
+defined?** This axis says nothing about ungridded data, which has no grid
+topology at all.
+
+- **Structured grid:** cells form a regular logical array addressable by integer
+  indices `(i, j[, k])`; **connectivity is implicit** (neighbors of `(i, j)`
+  are `(i±1, j)` and `(i, j±1)`). Includes **regular** (rectilinear) grids and
+  **curvilinear** grids, which are logically rectangular but physically warped
+  and common in ocean models.
+- **Unstructured grid (mesh):** cells (triangles, polygons, sometimes mixed) are
+  joined by an **explicit connectivity list**; nodes have variable numbers of
+  neighbors.
+  Examples: ICON, MPAS, FVCOM, finite-element meshes. Storage and
+  access patterns are fundamentally different from those of structured grids.
+- **Discrete Global Grid Systems (DGGS):** a third option that doesn't fit
+  neatly into structured-vs-unstructured. A DGGS tiles the *whole* sphere with
+  a single (often equal-area) cell family and a hierarchical refinement scheme;
+  cells are addressed by a **specialized cell ID**, with connectivity and
+  refinement encoded in the ID's arithmetic: no `(i, j)` array shape, and no
+  explicit connectivity list. Examples: **HEALPix** (equal-area quadrilateral
+  cells with ring/nested indexing), **H3** (hexagonal cells with a hierarchical
+  hex ID), **S2** (quadrilateral cells on a cubed sphere), **rHEALPix**,
+  **cubed-sphere**. Standardized by the
+  [OGC DGGS abstract specification](https://www.ogc.org/standards/dggs/).
+
+![Four grid types: rectilinear (regular Cartesian cells), curvilinear (logically rectangular but spatially warped), discrete global grid system (hexagonal cells), and unstructured (irregular triangular mesh).](../visualization/images/grid-types.svg){ width="100%" }
+
+The two axes combine: a dataset can be **gridded + structured** (most satellite
+Level-3 products), **gridded + unstructured** (an ocean-model output on a
+triangular mesh), **gridded + DGGS** (a HEALPix cosmology map; an H3 hex map),
+or **ungridded** (irrelevant to this axis, since there's no grid).
+
+## Regridding / resampling
+
+**Regridding** (or **resampling**) is the operation that moves data from one
+spatial sampling to another: from ungridded or unstructured input onto a
+regular grid, or between two grids. It is the verb connecting the preceding
+nouns to the datacube that follows. The side-by-side diagram in
+[Gridded vs. ungridded](#gridded-vs-ungridded), read right-to-left,
+illustrates the simplest case: scattered ungridded points resampled onto a
+regular grid.
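A minimal sketch of that simplest case, in plain Python: scattered records are assigned to the cells of a regular grid by nearest-neighbour lookup. The function name `regrid_nearest`, the station values, and the use of plain Euclidean distance (reasonable only for coordinates in a projected CRS) are assumptions made for illustration; none of them comes from the libraries named on this page.

```python
import math


def regrid_nearest(obs, xs, ys):
    """Nearest-neighbour regridding of scattered points onto a regular grid.

    obs: iterable of (x, y, value) records living only in world space.
    xs, ys: 1-D cell-centre coordinates per axis (the grid geometry).
    Returns a list of rows, so a cell is addressed as grid[j][i].
    """
    obs = list(obs)
    grid = []
    for y in ys:
        row = []
        for x in xs:
            # Pick the observation closest to this cell centre.
            # Euclidean distance is an assumption: fine for projected
            # coordinates, not for lat/lon over large areas.
            _, _, value = min(
                obs, key=lambda o: math.hypot(o[0] - x, o[1] - y)
            )
            row.append(value)
        grid.append(row)
    return grid


# Hypothetical ungridded records: (x, y, temperature).
stations = [(0.1, 0.2, 10.0), (1.9, 0.1, 12.0), (1.0, 1.8, 15.0)]

# Grid geometry stored separately from the values: one 1-D array per axis.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 2.0]

grid = regrid_nearest(stations, xs, ys)
for row in grid:
    print(row)
```

After the call, a value is addressed by index (`grid[j][i]`) and its location is recovered from `xs[i]` and `ys[j]`, whereas each input record had to carry its own coordinates. The brute-force search is quadratic and only suitable for toy sizes.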
+
+The choice of **interpolation method** determines what the regridded values
+preserve:
+
+| Method | Behavior | Use for |
+|---|---|---|
+| **Nearest** | Picks the closest source value; preserves values exactly; blocky. | Categorical / class data (land cover, flags). |
+| **Bilinear** (linear) | Weighted average of the four surrounding cells; smooth; blurs sharp edges. | Smooth continuous fields (temperature, reflectance). |
+| **Conservative** | Area-weighted; preserves area-integrated totals across cells. | Extensive quantities: fluxes, precipitation, mass. |
+
+A useful rule of thumb: **match the method to the quantity.** The wrong choice
+silently corrupts downstream analysis: bilinear interpolation of precipitation
+does not conserve total water, and bilinear interpolation of a categorical mask
+produces meaningless fractional class values, whereas nearest-neighbor
+preserves the classes.
+
+Common Python tooling: **`xESMF`** (xarray-friendly, supports conservative
+regridding via ESMF), **`pyresample`** (especially for swath → grid),
+**`rasterio.warp.reproject`** (GDAL-backed, the GeoTIFF/COG path), and
+**`scipy.interpolate`** for one-off cases.
+
+For empirical performance trade-offs across these tools, see Development Seed's
+[warp/resample profiling benchmark][warp-resample-profiling], which measures
+memory and time across local vs. S3 storage and NetCDF, Zarr, and GeoTIFF
+sources.
+
+[warp-resample-profiling]: https://developmentseed.org/warp-resample-profiling/
+
+## Datacube
+
+A **datacube** is a labeled, regularly gridded N-dimensional array: dimensions
+carry coordinates (e.g. `time`, `level`, `lat`, `lon`, `band`), and the data is
+addressable by those coordinates rather than only by integer index. Typical
+datacubes have 3–5 dimensions.
+
+A datacube is inherently a **structured, gridded** representation. It is most
+often the *product* of gridding either ungridded or unstructured-mesh data onto
+regular grids, i.e. the destination of the previous section's operation.
+Common containers include Zarr (cloud-optimized), NetCDF, and HDF5; the
+in-memory representation is typically an Xarray `Dataset` when using Python.
+
+For a deeper look at how a datacube's dimensions reduce to common viewing
+shapes (maps, time series, profiles, animations), see the
+[visualization overview](../visualization/overview.md).
+
+![Dimensionality fan: a 4–5D datacube (t · z · y · x · band) reduces to a 2D map, a 1D timeseries, a 1D vertical profile, an animation (2D map swept over t), or a volumetric 3D rendering depending on which dimensions are held fixed and which are displayed.](../visualization/images/dimensionality-fan.svg){ width="100%" }
+
+## External references
+
+- **UGRID conventions** (unstructured-mesh in NetCDF):
+
+- **CF conventions:**
+- **xESMF** (regridding for xarray):
+- **pyresample** (geospatial resampling):
+- **Open Data Cube:**
diff --git a/docs/concepts/images/gridded-vs-ungridded.svg b/docs/concepts/images/gridded-vs-ungridded.svg
new file mode 100644
index 0000000..b0f4bc6
--- /dev/null
+++ b/docs/concepts/images/gridded-vs-ungridded.svg
@@ -0,0 +1,71 @@
+ Gridded
+ location implied by index (i, j)
+ data[3, 4] = 0.84 (position only; grid geometry stored separately)
+ Ungridded
+ each value carries its own (lat, lon)
+ (42.1,-73.5)
+ (41.8,-72.2)
+ (42.4,-71.0)
+ (41.2,-73.1)
+ (40.7,-72.6)
+ (41.5,-70.8)
+ (40.9,-73.4)
+ obs[k] = 0.84 at (x[k], y[k]) in some CRS
diff --git a/docs/glossary.md b/docs/glossary.md
index c77a631..b42f649 100644
--- a/docs/glossary.md
+++ b/docs/glossary.md
@@ -75,7 +75,8 @@ Range request
 
 Datacube
 :   A multi-dimensional array of data, for example time × level × latitude ×
-    longitude × band, typically spanning 3 to 5 dimensions.
+    longitude × band, typically spanning 3 to 5 dimensions. See
+    [Data structures](concepts/data-structures.md#datacube).
 
 Chunk
 :   A contiguous block of a chunked array, read and written as a unit. Chunk
@@ -102,3 +103,91 @@ STAC (SpatioTemporal Asset Catalog)
 
 OPeNDAP
 :   A protocol for remote access to subsets of scientific datasets over HTTP.
+
+## Data structures
+
+Foundational concepts covered on the
+[Data structures](concepts/data-structures.md) page.
+
+Index space
+:   The integer-indexed structure of an array. A value is addressed by its
+    position `(i, j[, k])`, which has no inherent units. See
+    [Data structures](concepts/data-structures.md#two-spaces).
+
+World space
+:   Where each value actually sits in a CRS; also called "physical space" or
+    "geographic space". Has units defined by the CRS (meters, degrees, …). See
+    [Data structures](concepts/data-structures.md#two-spaces).
+
+Gridded
+:   Data tied to the cells (or nodes) of a grid; a value's location is implied
+    by its index in the array. See
+    [Data structures](concepts/data-structures.md#gridded-vs-ungridded).
+
+Ungridded
+:   Scattered or point observations that are not arranged on a grid; each
+    value carries its own explicit coordinates. See
+    [Data structures](concepts/data-structures.md#gridded-vs-ungridded).
+
+Structured grid
+:   A grid whose cells form a regular logical array addressable by integer
+    indices, with connectivity implicit. Includes regular (rectilinear) and
+    curvilinear grids. See
+    [Data structures](concepts/data-structures.md#structured-vs-unstructured).
+
+Unstructured grid
+:   A mesh whose cells are joined by an explicit connectivity list, with
+    variable numbers of neighbors per node. See
+    [Data structures](concepts/data-structures.md#structured-vs-unstructured).
+
+DGGS (Discrete Global Grid System)
+:   A global tessellation of the sphere by a single cell family (often
+    equal-area), with hierarchical refinement and a specialized cell-ID
+    indexing scheme; connectivity is implicit in the ID arithmetic rather
+    than in `(i, j)` array shape or an explicit connectivity list.
+    Examples: HEALPix, H3, S2, cubed-sphere. See
+    [Data structures](concepts/data-structures.md#structured-vs-unstructured).
+
+Regridding (resampling)
+:   The operation that moves data from one spatial sampling to another, for
+    example from ungridded points onto a regular grid. Method matters: nearest,
+    bilinear, and conservative each suit different quantities. See
+    [Data structures](concepts/data-structures.md#regridding-resampling).
+
+## Satellite data products
+
+Pointers to canonical external references for satellite Earth-observation
+product taxonomy. These terms are out of scope for an in-depth treatment here.
+
+Swath
+:   The strip of Earth's surface observed by a sensor as the platform moves
+    along its orbit; the sensor's native acquisition geometry, indexed by
+    along-track × across-track with 2-D geolocation arrays. Typical of
+    Level-1/Level-2 products, distinct from a Level-3 product resampled onto a
+    regular map grid. See Copernicus SentiWiki:
+    [Sentinel-1 products][s1-products].
+
+Analysis-ready data (ARD)
+:   Any dataset that has been preprocessed such that it fulfills the quality
+    standards required by the analysis to be performed on it. For satellite
+    Earth observation specifically, CEOS-ARD (formerly CARD4L) is the
+    community standard. See Stern et al., *Frontiers in Climate* (2021),
+    [Pangeo Forge: Crowdsourcing Analysis-Ready, Cloud Optimized Data
+    Production][ard-frontiers].
+
+Data processing levels
+:   A maturity ladder describing how far a product has been processed, from raw
+    instrument data (Level 0) to model output (Level 4). The numbers are *not*
+    portable across agencies: NASA, ESA/Copernicus, and USGS each define their
+    own scheme. See NASA Earthdata:
+    [Data Processing Levels][nasa-levels].
+
+Timeliness (NRT/STC/NTC)
+:   ESA's latency axis for a product: Near Real Time (hours), Short Time
+    Critical, and Non Time Critical (best calibration accuracy). Independent of
+    processing level.
+    See Copernicus SentiWiki: [Sentinel-3 products][s3-products].
+
+[s1-products]: https://sentiwiki.copernicus.eu/web/s1-products
+[ard-frontiers]: https://www.frontiersin.org/journals/climate/articles/10.3389/fclim.2021.782909/full
+[nasa-levels]: https://www.earthdata.nasa.gov/learn/earth-observation-data-basics/data-processing-levels
+[s3-products]: https://sentiwiki.copernicus.eu/web/sentinel-3
diff --git a/docs/visualization/images/grid-types.svg b/docs/visualization/images/grid-types.svg
new file mode 100644
index 0000000..8cf3fe2
--- /dev/null
+++ b/docs/visualization/images/grid-types.svg
@@ -0,0 +1,103 @@
+ Grid Types
+ rectilinear
+ curvilinear
+ discrete global grid system
+ unstructured
diff --git a/mkdocs.yml b/mkdocs.yml
index d89f35f..2dc0d34 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -15,6 +15,7 @@ extra:
 nav:
   - "index.md"
   - Glossary: "glossary.md"
+  - Data structures: "concepts/data-structures.md"
   - Datacube Worst Practices:
     - Common production gotchas:
       - Tiny data chunks: "worst-practices/tiny-chunks.ipynb"