Skip to content

hypertidy/gdalxarray

Repository files navigation

gdx

The goal of gdx is to integrate GDAL with xarray, especially for the multidimensional API which is still relatively underutilized.

Todo

  • apply xarray indexes when relevant in Raster
  • apply xarray indexes when relevant in Multidim
  • define stance on representation of “default transform”, and when GCPs, RPCs, or geolocation arrays are present
  • explore when we need to control driver choice (netcdf and hdf in particular)
  • explore registering as an xarray backend, via engine = "gdal"

Here’s a basic example:

from gdx import GDALBackendEntrypoint
backend = GDALBackendEntrypoint()
dsn =  "/vsicurl/https://projects.pawsey.org.au/idea-sealevel-glo-phy-l4-nrt-008-046/data.marine.copernicus.eu/SEALEVEL_GLO_PHY_L4_NRT_008_046/cmems_obs-sl_glo_phy-ssh_nrt_allsat-l4-duacs-0.125deg_P1D_202506/2025/08/nrt_global_allsat_phy_l4_20250825_20250825.nc"
ds = backend.open_dataset(f'vrt://{dsn}?sd_name=vgos', chunks = {}, multidim = False)
ds1 = backend.open_dataset(dsn, multidim = True, chunks = {}) 

We have a Raster xarray:

ds

<xarray.Dataset> Size: 17MB
Dimensions:  (x: 2880, y: 1440)
Coordinates:
  * x        (x) float64 23kB -180.0 -179.9 -179.8 -179.6 ... 179.6 179.8 179.9
  * y        (y) float64 12kB 90.0 89.88 89.75 89.62 ... -89.62 -89.75 -89.88
Data variables:
    band_1   (y, x) int32 17MB dask.array<chunksize=(1440, 2880), meta=np.ndarray>
Attributes:
    crs:           GEOGCS["unknown",DATUM["unnamed",SPHEROID["Spheroid",63781...
    geotransform:  (-180.0, 0.125, 0.0, 90.0, 0.0, -0.125)

and a Multidim xarray:

ds1

<xarray.Dataset> Size: 166MB
Dimensions:    (latitude: 1440, nv: 2, longitude: 2880, time: 1)
Coordinates:
  * latitude   (latitude) float32 6kB -89.94 -89.81 -89.69 ... 89.69 89.81 89.94
  * nv         (nv) int32 8B 0 1
  * longitude  (longitude) float32 12kB -179.9 -179.8 -179.7 ... 179.8 179.9
  * time       (time) float32 4B 2.763e+04
Data variables:
    lat_bnds   (latitude, nv) float32 12kB dask.array<chunksize=(1440, 2), meta=np.ndarray>
    lon_bnds   (longitude, nv) float32 23kB dask.array<chunksize=(2880, 2), meta=np.ndarray>
    sla        (time, latitude, longitude) int32 17MB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    err_sla    (time, latitude, longitude) int32 17MB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    ugosa      (time, latitude, longitude) int32 17MB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    err_ugosa  (time, latitude, longitude) int32 17MB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    vgosa      (time, latitude, longitude) int32 17MB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    err_vgosa  (time, latitude, longitude) int32 17MB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    adt        (time, latitude, longitude) int32 17MB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    ugos       (time, latitude, longitude) int32 17MB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    vgos       (time, latitude, longitude) int32 17MB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
    flag_ice   (time, latitude, longitude) int32 17MB dask.array<chunksize=(1, 1440, 2880), meta=np.ndarray>
Attributes: (12/44)
    Conventions:                     CF-1.6
    Metadata_Conventions:            Unidata Dataset Discovery v1.0
    cdm_data_type:                   Grid
    comment:                         Sea Surface Height measured by Altimetry...
    contact:                         servicedesk.cmems@mercator-ocean.eu
    creator_email:                   servicedesk.cmems@mercator-ocean.eu
    ...                              ...
    summary:                         DUACS Near-Real-Time Level-4 sea surface...
    time_coverage_duration:          P1D
    time_coverage_end:               2025-08-25T12:00:00Z
    time_coverage_resolution:        P1D
    time_coverage_start:             2025-08-24T12:00:00Z
    title:                           NRT merged all satellites Global Ocean G...

There’s one variable called ‘band_1’ for the raster:

ds.band_1.isel(x = 0)
# <xarray.DataArray 'band_1' (y: 1440)> Size: 6kB
# dask.array<getitem, shape=(1440,), dtype=int32, chunksize=(1440,), chunktype=numpy.ndarray>
# Coordinates:
#     x        float64 8B -180.0
#   * y        (y) float64 12kB 90.0 89.88 89.75 89.62 ... -89.62 -89.75 -89.88
# Attributes:
#     nodata:   -2147483647.0
#     scale:    0.0001
#     offset:   0.0

we can access actual values

## the raw values for now
ds.band_1.sel(x = 100, y = -50, method = "nearest").values
#> array(441, dtype=int32)

ds1.sla.isel(longitude = 0, latitude = 1000).values
#> array([2404], dtype=int32)

What about a ZARR from CMEMS?

from gdx import GDALBackendEntrypoint
dsn = 'ZARR:"/vsicurl/https://s3.waw3-1.cloudferro.com/mdl-arco-time-045/arco/SEALEVEL_GLO_PHY_L4_MY_008_047/cmems_obs-sl_glo_phy-ssh_my_allsat-l4-duacs-0.125deg_P1D_202411/timeChunked.zarr"'
backend = GDALBackendEntrypoint()
#ds = backend.open_dataset(filename_or_obj, multidim = True, chunks = None)
ds = backend.open_dataset(dsn, multidim = True, chunks = {})
ds

# <xarray.Dataset> Size: 2TB
# Dimensions:         (latitude: 1440, longitude: 2880, time: 11902, nv: 2)
# Coordinates:
#   * latitude        (latitude) float32 6kB -89.94 -89.81 -89.69 ... 89.81 89.94
#   * longitude       (longitude) float32 12kB -179.9 -179.8 ... 179.8 179.9
#   * time            (time) float32 48kB 1.571e+04 1.571e+04 ... 2.761e+04
#   * nv              (nv) int32 8B 0 1
# Data variables: (12/13)
#     adt             (time, latitude, longitude) int32 197GB dask.array<chunksize=(1, 512, 1024), meta=np.ndarray>
#     err_sla         (time, latitude, longitude) int32 197GB dask.array<chunksize=(1, 512, 1024), meta=np.ndarray>
#     err_ugosa       (time, latitude, longitude) int32 197GB dask.array<chunksize=(1, 512, 1024), meta=np.ndarray>
#     err_vgosa       (time, latitude, longitude) int32 197GB dask.array<chunksize=(1, 512, 1024), meta=np.ndarray>
#     flag_ice        (time, latitude, longitude) int32 197GB dask.array<chunksize=(1, 512, 1024), meta=np.ndarray>
#     lat_bnds        (latitude, nv) float32 12kB dask.array<chunksize=(1440, 2), meta=np.ndarray>
#     ...              ...
#     sla             (time, latitude, longitude) int32 197GB dask.array<chunksize=(1, 512, 1024), meta=np.ndarray>
#     tpa_correction  (time) int32 48kB dask.array<chunksize=(1,), meta=np.ndarray>
#     ugos            (time, latitude, longitude) int32 197GB dask.array<chunksize=(1, 512, 1024), meta=np.ndarray>
#     ugosa           (time, latitude, longitude) int32 197GB dask.array<chunksize=(1, 512, 1024), meta=np.ndarray>
#     vgos            (time, latitude, longitude) int32 197GB dask.array<chunksize=(1, 512, 1024), meta=np.ndarray>
#     vgosa           (time, latitude, longitude) int32 197GB dask.array<chunksize=(1, 512, 1024), meta=np.ndarray>
# Attributes: (12/43)
#     Conventions:                     CF-1.6
#     Metadata_Conventions:            Unidata Dataset Discovery v1.0
#     cdm_data_type:                   Grid
#     comment:                         Sea Surface Height measured by Altimetry...
#     contact:                         servicedesk.cmems@mercator-ocean.eu
#     coordinates:                     lat_bnds lon_bnds
#     ...                              ...
#     summary:                         SSALTO/DUACS Delayed-Time Level-4 sea su...
#     time_coverage_duration:          P1D
#     time_coverage_end:               2023-12-31T12:00:00Z
#     time_coverage_resolution:        P1D
#     time_coverage_start:             2023-12-30T12:00:00Z
#     title:                           DT merged all satellites Global Ocean Gr...

This example is a virtualized mosaic of NetCDF in multidim VRT.

big_virtual_mdim = "/vsicurl/https://gist.githubusercontent.com/mdsumner/18c5d302d00b9a456bb73d30ac758764/raw/f26e1b2e202f759d6aace4d7deb3e04ea3c85f15/mdim.vrt"

bvm = backend.open_dataset(big_virtual_mdim, multidim = True, chunks = {})
# <xarray.Dataset> Size: 3TB
# Dimensions:   (Time: 5479, st_ocean: 51, yt_ocean: 1500, xt_ocean: 3600)
# Coordinates:
#   * Time      (Time) float64 44kB 1.132e+04 1.132e+04 ... 1.68e+04 1.68e+04
#   * st_ocean  (st_ocean) float64 408B 2.5 7.5 12.5 ... 3.603e+03 4.509e+03
#   * yt_ocean  (yt_ocean) float64 12kB -74.95 -74.85 -74.75 ... 74.75 74.85 74.95
#   * xt_ocean  (xt_ocean) float64 29kB 0.05 0.15 0.25 0.35 ... 359.8 359.9 360.0
# Data variables:
#     temp      (Time, st_ocean, yt_ocean, xt_ocean) int16 3TB dask.array<chunksize=(5479, 51, 1500, 3600), meta=np.ndarray>
    

bvm.sel(xt_ocean = slice(140, 150), yt_ocean = slice(-55, -45), st_ocean = slice(8, 13)).isel(Time = -1).temp.values

# array([[[-30770, -30784, -30799, ..., -30418, -30424, -30445],
#         [-30755, -30771, -30788, ..., -30418, -30425, -30446],
#         [-30744, -30764, -30788, ..., -30417, -30426, -30448],
#         ...,
#         [-29852, -29868, -29889, ..., -29413, -29338, -29325],
#         [-29835, -29851, -29883, ..., -29385, -29327, -29324],
#         [-29821, -29840, -29879, ..., -29353, -29319, -29322]]],
#       shape=(1, 100, 100), dtype=int16)

A grib example

What a joy to simply be able to use GDAL for what it is good at without intermediate layers.

from gdx import GDALBackendEntrypoint
backend = GDALBackendEntrypoint()
backend.open_dataset("/vsicurl/https://noaa-nbm-grib2-pds.s3.amazonaws.com/blend.20251031/16/core/blend.t16z.core.f260.co.grib2", multidim = False, chunks = None)
<xarray.Dataset> Size: 989MB
Dimensions:                                                                           (
                                                                                       y: 1597,
                                                                                       x: 2345)
Coordinates:
  * x                                                                                 (x) float64 19kB ...
  * y                                                                                 (y) float64 13kB ...
  * crs                                                                               int64 8B ...
Data variables: (12/33)
    APTMP:2 m above ground:260 hour fcst                                              (y, x) float64 30MB ...
    CAPE:surface:260 hour fcst                                                        (y, x) float64 30MB ...
    CWASP:surface:260 hour fcst                                                       (y, x) float64 30MB ...
    DPT:2 m above ground:260 hour fcst                                                (y, x) float64 30MB ...
    DSWRF:surface:260 hour fcst                                                       (y, x) float64 30MB ...
    FICEAC:surface:254-260 hour acc fcst                                              (y, x) float64 30MB ...
    ...                                                                                ...
    WDIR:80 m above ground:260 hour fcst                                              (y, x) float64 30MB ...
    WDIR:10 m above ground:260 hour fcst                                              (y, x) float64 30MB ...
    GUST:10 m above ground:260 hour fcst                                              (y, x) float64 30MB ...
    WIND:30 m above ground:260 hour fcst                                              (y, x) float64 30MB ...
    WIND:80 m above ground:260 hour fcst                                              (y, x) float64 30MB ...
    WIND:10 m above ground:260 hour fcst                                              (y, x) float64 30MB ...
Indexes:
  ┌ x        RasterIndex
  └ y
    crs      CRSIndex (crs=PROJCS["unnamed",GEOGCS["Coordinate System imported from GRIB file" ...)

There’s a lot more to do, scaling works but I turned that off to test for now. .

Template a list of netcdf files and mosaic them to VRT, then open with this xarray backend. (Note this requires GDAL>=3.12.0 ).

month = "202501"
url = [f"/vsicurl/https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/{month}/oisst-avhrr-v02r01.{month}{(day+1):02d}.nc" for day in range(31)]
gdal.Run("mdim mosaic", input = url, output =  "oisst.vrt", array = "sst")
from gdx import GDALBackendEntrypoint
backend = GDALBackendEntrypoint()

backend.open_dataset("oisst.vrt", multidim = True)

# <xarray.Dataset> Size: 64MB
# Dimensions:  (lat: 720, lon: 1440, time: 31, zlev: 1)
# Coordinates:
#   * lat      (lat) float64 6kB -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
#   * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
#   * time     (time) float64 248B 1.717e+04 1.72e+04 ... 1.717e+04 1.717e+04
#   * zlev     (zlev) float64 8B 0.0
# Data variables:
#     sst      (time, zlev, lat, lon) int16 64MB ...
# 

Code of Conduct

Please note that the gdx project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

About

GDAL backend for xarray, enabling Classic or Multidim

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages