151 changes: 151 additions & 0 deletions docs/concepts/data-structures.md
@@ -0,0 +1,151 @@
# Data structures

This page defines the data-representation concepts the rest of the guide leans
on. Two orthogonal axes — *gridded vs. ungridded* and *structured vs.
unstructured* — set up the vocabulary, **regridding** is the operation that
moves data between them, and a **datacube** is the analysis-ready destination.

Reference material on satellite product taxonomy (swath geometry, analysis-ready
data specifications, agency processing-level schemes) is out of scope here — see
the glossary's
[Satellite data products](../glossary.md#satellite-data-products) section,
whose entries link to canonical external references.

## Two spaces

Most of what follows describes the relationship between two distinct spaces:

- **Index space** — the integer-indexed structure of the array itself. A value
is addressed by its position `(i, j[, k])`. Has no inherent units.
- **World space** (also "physical space" or, in geospatial work, "geographic
space"): where each value actually sits in a coordinate reference system
([CRS](../glossary.md)). Has units (meters, degrees, kelvin-along-a-vertical-axis, …) defined by the CRS.

**Gridded** data is data that has both spaces, plus a mapping between them
(the "grid geometry"). **Ungridded** data lives only in world space, with no
index space at all — each record carries its own world-space coordinates.
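
As a minimal sketch of the mapping between the two spaces, the snippet below applies a six-parameter affine transform by hand. The coefficient values, the `(col, row)` argument order, and the function names `index_to_world` and `world_to_index` are illustrative assumptions, not taken from any dataset or library.

```python
# Hypothetical affine coefficients for a north-up raster with 10 m cells.
#   x = A * col + B * row + C
#   y = D * col + E * row + F
A, B, C = 10.0, 0.0, 500_000.0      # x scale, x shear, x of the upper-left corner
D, E, F = 0.0, -10.0, 4_200_000.0   # y shear, y scale (negative: rows run south), y of the corner


def index_to_world(col, row):
    """Map an index-space position to world-space (x, y) in CRS units."""
    return (A * col + B * row + C, D * col + E * row + F)


def world_to_index(x, y):
    """Invert the mapping. Assumes no shear (B == D == 0), as above."""
    return ((x - C) / A, (y - F) / E)


corner = index_to_world(0, 0)       # upper-left corner of the first cell
center = index_to_world(0.5, 0.5)   # center of the first cell
round_trip = world_to_index(*index_to_world(3, 7))
```

Index space carries no units; the coefficients are what attach meters (or degrees) to a position. Whether integer indices refer to cell corners or cell centers is a convention that varies between tools, so it is worth checking before comparing coordinates.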

## Gridded vs. ungridded

The first question to ask of any dataset: **is it on a grid at all?**

- **Gridded** data lives in **both** index space and world space, with the
mapping between them (the "grid geometry") stored separately from the
values. The mapping can take one of three common forms:
- an **affine transform plus a CRS** — the GeoTIFF / COG convention.
`(x, y) = affine @ (i, j)`; **no per-cell coordinate arrays stored**.
- **1-D coordinate arrays per axis** for a regular grid (`x[nx]`, `y[ny]`)
— the CF / NetCDF / Zarr convention.
- **2-D coordinate arrays per cell** for a curvilinear grid
(`x[i, j]`, `y[i, j]`).

Examples: a satellite Level-3 product on a regular lat/lon grid, a
climate-model output, a reanalysis dataset, a COG.

- **Ungridded** data lives **only in world space** — no index space, no array
structure. Each value carries its own coordinates `(x[k], y[k])` in whatever
CRS the producer chose (geographic *or* projected). There's no row `i` and
column `j`, only individual observations. Examples: weather stations, ocean
buoys, GNSS receivers, lidar/GPS point clouds, in-situ vertical profiles,
aircraft tracks.

![Side-by-side: left panel shows a regular grid of colored cells with values addressed by integer indices; right panel shows scattered points each labeled with its own (lat, lon) coordinate pair.](images/gridded-vs-ungridded.svg){ width="100%" }

The same physical quantity (say, surface temperature) can be represented either
way. Ungridded observations are often the *input* to a regridding step that
produces a gridded product (see [Regridding](#regridding-resampling) below).
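
To make the contrast concrete, here is a small plain-Python sketch of the two layouts. All values and names (`lon`, `lat`, `temperature`, `stations`, `grid_location`) are hypothetical; real code would typically hold these in NumPy or xarray arrays rather than lists.

```python
# Gridded (regular): one 1-D coordinate array per axis, values in a 2-D array.
lon = [10.0, 10.5, 11.0]             # x[nx]
lat = [50.0, 50.5]                   # y[ny]
temperature = [                      # values[ny][nx], addressed by (j, i)
    [281.2, 281.9, 282.4],
    [280.1, 280.7, 281.3],
]


def grid_location(j, i):
    """World-space location of cell (j, i), looked up from the axis arrays."""
    return (lon[i], lat[j])


# Ungridded: no (j, i) at all; every record carries its own coordinates.
stations = [
    {"lon": 10.13, "lat": 50.42, "temperature": 280.9},
    {"lon": 10.87, "lat": 50.08, "temperature": 282.1},
]

grid_coord_count = len(lon) + len(lat)      # nx + ny numbers locate nx * ny values
station_coord_count = 2 * len(stations)     # two numbers per value, always
```

The gridded layout pays for coordinates once per axis, while the ungridded layout pays for them once per observation.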

## Structured vs. unstructured

A second, *orthogonal* axis applies to gridded data: **how is cell connectivity
defined?** This axis says nothing about ungridded data — ungridded data has no
grid topology at all.

- **Structured grid:** cells form a regular logical array addressable by integer
indices `(i, j[, k])`; **connectivity is implicit** (neighbors of `(i, j)`
are `(i±1, j)` and `(i, j±1)`). Includes **regular** (rectilinear) grids and
**curvilinear** grids — logically rectangular but physically warped, common in
ocean models.
- **Unstructured grid (mesh):** cells (triangles, polygons, sometimes mixed) are
joined by an **explicit connectivity list**; nodes have variable numbers of
neighbors. Examples: ICON, MPAS, FVCOM, finite-element meshes. Storage and
access patterns are fundamentally different from structured grids.
- **Discrete Global Grid Systems (DGGS):** a third option that doesn't fit
neatly into structured-vs-unstructured. A DGGS tiles the *whole* sphere with
a single (often equal-area) cell family and a hierarchical refinement scheme;
cells are addressed by a **specialized cell ID**, with connectivity and
refinement encoded in the ID's arithmetic — no `(i, j)` array shape, and no
explicit connectivity list. Examples: **HEALPix** (equal-area quadrilateral
cells with ring/nested indexing), **H3** (hexagonal cells with a hierarchical
hex ID), **S2** (quadrilateral cells on a cubed sphere), **rHEALPix**,
**cubed-sphere**. Standardized by the
[OGC DGGS abstract specification](https://www.ogc.org/standards/dggs/).

![Four grid types: rectilinear (regular Cartesian cells), curvilinear (logically rectangular but spatially warped), discrete global grid system (hexagonal cells), and unstructured (irregular triangular mesh).](../visualization/images/grid-types.svg){ width="100%" }

The two axes combine: a dataset can be **gridded + structured** (most satellite
Level-3 products), **gridded + unstructured** (an ocean-model output on a
triangular mesh), **gridded + DGGS** (a HEALPix cosmology map; an H3 hex map),
or **ungridded** (irrelevant to this axis — there's no grid).
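
The three ways of defining connectivity can be sketched in a few lines of plain Python. The mesh below is a made-up five-node example, and all function names are hypothetical. The DGGS lines use the HEALPix nested scheme, in which cell `p` splits into children `4p` to `4p + 3`; other DGGS families use different ID arithmetic.

```python
def structured_neighbors(i, j, ni, nj):
    """Structured grid: neighbors follow from index arithmetic alone."""
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b) for a, b in candidates if 0 <= a < ni and 0 <= b < nj]


# Unstructured mesh: connectivity must be stored explicitly.
# Hypothetical mesh: four triangles fanned around the central node 4.
triangles = [(0, 1, 4), (1, 2, 4), (2, 3, 4), (3, 0, 4)]


def node_neighbors(faces):
    """Build a node -> set-of-neighbor-nodes map from the connectivity list."""
    neighbors = {}
    for face in faces:
        for node in face:
            neighbors.setdefault(node, set()).update(n for n in face if n != node)
    return neighbors


mesh_neighbors = node_neighbors(triangles)


def healpix_nested_parent(cell_id):
    """DGGS (HEALPix nested scheme): the parent cell one level coarser."""
    return cell_id // 4


def healpix_nested_children(cell_id):
    """DGGS (HEALPix nested scheme): the four child cells one level finer."""
    return [4 * cell_id + k for k in range(4)]
```

On the structured grid nothing is stored beyond the array shape; on the mesh, node 4 has four neighbors while the rim nodes have three, which is why the list has to be carried with the data.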

## Regridding / resampling

**Regridding** (or **resampling**) is the operation that moves data from one
spatial sampling to another: from ungridded or unstructured input onto a
regular grid, or between two grids. It is the verb connecting the preceding
nouns to the datacube that follows. The diagram in
[Gridded vs. ungridded](#gridded-vs-ungridded), read right-to-left,
illustrates the simplest case: scattered ungridded points resampled onto a
regular grid.

The choice of **interpolation method** determines which properties of the data survive the move:

| Method | Behavior | Use for |
|---|---|---|
| **Nearest** | Picks the closest source value; preserves values exactly; blocky. | Categorical / class data (land cover, flags). |
| **Bilinear** (linear) | Weighted average of the four surrounding cells; smooth; blurs sharp edges. | Smooth continuous fields (temperature, reflectance). |
| **Conservative** | Area-weighted; preserves area-integrated totals across cells. | Extensive quantities — fluxes, precipitation, mass. |

A useful rule of thumb: **match the method to the quantity.** The wrong choice
silently corrupts downstream analysis. Bilinear on precipitation does not
conserve total water, and bilinear on a categorical mask produces nonsense
fractional categories, whereas nearest on the same mask preserves the classes.
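
The failure modes above can be reproduced with a one-dimensional toy example. This is a sketch under stated assumptions: cells are equal-width intervals starting at 0, `linear` stands in for bilinear (its 1-D analogue), and every function name and data value is made up for illustration. Real workflows would use one of the libraries listed below.

```python
def cell_centers(n, width):
    """Centers of n equal-width cells starting at x = 0."""
    return [(k + 0.5) * width for k in range(n)]


def nearest(src_x, src_v, x):
    """Value of the source cell whose center is closest to x."""
    k = min(range(len(src_x)), key=lambda k: abs(src_x[k] - x))
    return src_v[k]


def linear(src_x, src_v, x):
    """Linear interpolation between source centers, clamped at the ends."""
    if x <= src_x[0]:
        return src_v[0]
    if x >= src_x[-1]:
        return src_v[-1]
    for k in range(len(src_x) - 1):
        x0, x1 = src_x[k], src_x[k + 1]
        if x0 <= x <= x1:
            w = (x - x0) / (x1 - x0)
            return (1 - w) * src_v[k] + w * src_v[k + 1]


def conservative(src_v, src_width, n_dst, dst_width):
    """Overlap-weighted mean of the source cells inside each target cell."""
    out = []
    for m in range(n_dst):
        lo, hi = m * dst_width, (m + 1) * dst_width
        total = 0.0
        for k, v in enumerate(src_v):
            overlap = min(hi, (k + 1) * src_width) - max(lo, k * src_width)
            if overlap > 0:
                total += v * overlap
        out.append(total / dst_width)
    return out


# Extensive quantity: one rainy cell, coarsened from 6 cells to 2.
rain = [0.0, 0.0, 9.0, 0.0, 0.0, 0.0]
src_x, dst_x = cell_centers(6, 1.0), cell_centers(2, 3.0)
rain_linear = [linear(src_x, rain, x) for x in dst_x]
rain_conservative = conservative(rain, 1.0, 2, 3.0)
source_total = sum(v * 1.0 for v in rain)
linear_total = sum(v * 3.0 for v in rain_linear)
conservative_total = sum(v * 3.0 for v in rain_conservative)

# Categorical quantity: class codes 1 and 3, sampled between two cells.
land_cover = [1, 1, 3, 3]
lc_x = cell_centers(4, 1.0)
class_nearest = nearest(lc_x, land_cover, 2.25)
class_linear = linear(lc_x, land_cover, 2.25)
```

In this setup linear interpolation samples only at the target centers and misses the rainy cell entirely, so the total drops from 9 to 0, while the conservative result keeps the total at 9. On the class codes, nearest returns the valid class 3, whereas linear returns 2.5, which is not a class at all.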

Common Python tooling: **`xESMF`** (xarray-friendly, supports conservative
regridding via ESMF), **`pyresample`** (especially for swath → grid),
**`rasterio.warp.reproject`** (GDAL-backed, the GeoTIFF/COG path),
**`scipy.interpolate`** for one-off cases.

For empirical performance trade-offs across these tools, see Development Seed's
[warp/resample profiling benchmark][warp-resample-profiling], which measures
memory and time across local vs. S3 storage and NetCDF, Zarr, and GeoTIFF
sources.

[warp-resample-profiling]: https://developmentseed.org/warp-resample-profiling/

## Datacube

A **datacube** is a labeled, regularly-gridded N-dimensional array: dimensions
carry coordinates (e.g. `time`, `level`, `lat`, `lon`, `band`), and the data is
addressable by those coordinates rather than only by integer index. Datacubes
typically have 3 to 5 dimensions.

A datacube is inherently a **structured, gridded** representation. It is most
often the *product* of gridding either ungridded or unstructured-mesh data onto
regular grids — i.e. the destination of the previous section's operation.
Common containers include Zarr (cloud-optimized), NetCDF, and HDF5; the
in-memory representation is typically an Xarray `Dataset` when using Python.
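
What "addressable by coordinates" means can be shown with a deliberately tiny, plain-Python stand-in for a datacube. The dimension names, values, and the helpers `select` and `timeseries_at` are hypothetical; in practice an Xarray `Dataset` provides label-based selection through its `.sel(...)` method, so this sketch only illustrates the idea.

```python
# A toy datacube with dims (time, lat, lon), stored as nested lists.
dims = ("time", "lat", "lon")
coords = {
    "time": ["2024-01-01", "2024-01-02"],
    "lat": [50.0, 50.5],
    "lon": [10.0, 10.5, 11.0],
}
data = [
    [[281.2, 281.9, 282.4], [280.1, 280.7, 281.3]],   # 2024-01-01
    [[282.0, 282.6, 283.1], [280.9, 281.4, 282.0]],   # 2024-01-02
]


def select(**labels):
    """Look up one value by coordinate labels (exact match only)."""
    value = data
    for dim in dims:
        # Translate the label into an integer index along this dimension.
        value = value[coords[dim].index(labels[dim])]
    return value


def timeseries_at(lat, lon):
    """Hold lat and lon fixed and return (time, value) pairs."""
    j = coords["lat"].index(lat)
    i = coords["lon"].index(lon)
    return [(t, data[n][j][i]) for n, t in enumerate(coords["time"])]


point = select(time="2024-01-02", lat=50.5, lon=11.0)
series = timeseries_at(lat=50.0, lon=10.5)
```

The caller never touches `(n, j, i)` directly: the coordinate arrays translate labels into indices, and fixing some dimensions while keeping others is how a cube reduces to a time series or a map.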

For a deeper look at how a datacube's dimensions reduce to common viewing
shapes (maps, time series, profiles, animations), see the
[visualization overview](../visualization/overview.md).

![Dimensionality fan: a 4–5D datacube (t · z · y · x · band) reduces to a 2D map, a 1D timeseries, a 1D vertical profile, an animation (2D map swept over t), or a volumetric 3D rendering depending on which dimensions are held fixed and which are displayed.](../visualization/images/dimensionality-fan.svg){ width="100%" }

## External references

- **UGRID conventions** (unstructured-mesh in NetCDF):
<https://ugrid-conventions.github.io/ugrid-conventions/>
- **CF conventions:** <https://cfconventions.org/>
- **xESMF** (regridding for xarray): <https://xesmf.readthedocs.io/>
- **pyresample** (geospatial resampling): <https://pyresample.readthedocs.io/>
- **Open Data Cube:** <https://www.opendatacube.org/>
71 changes: 71 additions & 0 deletions docs/concepts/images/gridded-vs-ungridded.svg
91 changes: 90 additions & 1 deletion docs/glossary.md
@@ -75,7 +75,8 @@ Range request

Datacube
: A multi-dimensional array of data, for example time × level × latitude ×
longitude × band, typically spanning 3 to 5 dimensions.
longitude × band, typically spanning 3 to 5 dimensions. See
[Data structures](concepts/data-structures.md#datacube).

Chunk
: A contiguous block of a chunked array, read and written as a unit. Chunk
@@ -102,3 +103,91 @@ STAC (SpatioTemporal Asset Catalog)

OPeNDAP
: A protocol for remote access to subsets of scientific datasets over HTTP.

## Data structures

Foundational concepts covered on the
[Data structures](concepts/data-structures.md) page.

Index space
: The integer-indexed structure of an array. A value is addressed by its
position `(i, j[, k])`; has no inherent units. See
[Data structures](concepts/data-structures.md#two-spaces).

World space
: Where each value actually sits in a CRS — also "physical space" or
"geographic space". Has units defined by the CRS (meters, degrees, …). See
[Data structures](concepts/data-structures.md#two-spaces).

Gridded
: Data tied to the cells (or nodes) of a grid; a value's location is implied
by its index in the array. See
[Data structures](concepts/data-structures.md#gridded-vs-ungridded).

Ungridded
: Scattered or point observations that are not arranged on a grid; each
value carries its own explicit coordinates. See
[Data structures](concepts/data-structures.md#gridded-vs-ungridded).

Structured grid
: A grid whose cells form a regular logical array addressable by integer
indices, with connectivity implicit. Includes regular (rectilinear) and
curvilinear grids. See
[Data structures](concepts/data-structures.md#structured-vs-unstructured).

Unstructured grid
: A mesh whose cells are joined by an explicit connectivity list, with
variable numbers of neighbors per node. See
[Data structures](concepts/data-structures.md#structured-vs-unstructured).

DGGS (Discrete Global Grid System)
: A global tessellation of the sphere by a single cell family (often
equal-area), with hierarchical refinement and a specialized cell-ID
indexing scheme — connectivity is implicit in the ID arithmetic rather
than in `(i, j)` array shape or an explicit connectivity list. Examples:
HEALPix, H3, S2, cubed-sphere. See
[Data structures](concepts/data-structures.md#structured-vs-unstructured).

Regridding (resampling)
: The operation that moves data from one spatial sampling to another, for
example from ungridded points onto a regular grid. Method matters: nearest,
bilinear, and conservative each suit different quantities. See
[Data structures](concepts/data-structures.md#regridding-resampling).

## Satellite data products

Pointers to canonical external references for satellite Earth-observation
product taxonomy. These terms are out of scope for an in-depth treatment here.

Swath
: The strip of Earth's surface observed by a sensor as the platform moves
along its orbit; the sensor's native acquisition geometry, indexed by
along-track × across-track with 2-D geolocation arrays. Typical of
Level-1/Level-2 products, distinct from a Level-3 product resampled onto a
regular map grid. See Copernicus SentiWiki —
[Sentinel-1 products][s1-products].

Analysis-ready data (ARD)
: Any dataset that has been preprocessed such that it fulfills the quality
standards required by the analysis to be performed on it. For satellite
Earth observation specifically, CEOS-ARD (formerly CARD4L) is the
community standard. See Stern et al., *Frontiers in Climate* (2021) —
[Pangeo Forge: Crowdsourcing Analysis-Ready, Cloud Optimized Data
Production][ard-frontiers].

Data processing levels
: A maturity ladder describing how far a product has been processed, from raw
instrument data (Level 0) to model output (Level 4). The numbers are *not*
portable across agencies — NASA, ESA/Copernicus, and USGS each define their
own scheme. See NASA Earthdata —
[Data Processing Levels][nasa-levels].

Timeliness (NRT/STC/NTC)
: ESA's latency axis for a product: Near Real Time (hours), Short Time
Critical, and Non Time Critical (best calibration accuracy). Independent of
processing level. See Copernicus SentiWiki — [Sentinel-3 products][s3-products].

[s1-products]: https://sentiwiki.copernicus.eu/web/s1-products
[ard-frontiers]: https://www.frontiersin.org/journals/climate/articles/10.3389/fclim.2021.782909/full
[nasa-levels]: https://www.earthdata.nasa.gov/learn/earth-observation-data-basics/data-processing-levels
[s3-products]: https://sentiwiki.copernicus.eu/web/sentinel-3