difference in check_dtype for vlen compared to h5py #228

@kmuehlbauer

pyfive.check_dtype(vlen=var.dtype) and h5py.check_dtype(vlen=var.dtype) return different results. This fails downstream in xarray when using engine="h5netcdf" with the pyfive backend.

In xarray, check_dtype is used to detect vlen strings, which are then decoded from object to U. See below.

MCVE (latest versions of pyfive, h5netcdf, and xarray; h5py=13.5.1, hdf5=1.14.6):

import h5py
import pyfive
import xarray as xr
import os

input_string = ["foó", "bár", "baź"]
original = xr.Dataset({"x": input_string})

kwargs = dict(encoding={"x": {"dtype": str}})
fname = "test.nc"
original.to_netcdf(fname, engine="h5netcdf", **kwargs)

print("----- PYFIVE --------------------")
with pyfive.File("test.nc") as fh:
    var = fh["x"]
    print(pyfive.check_dtype(vlen=var.dtype))
    print(var.dtype.metadata)
    print(fh["x"][...])

print("\n----- H5PY --------------------")
with h5py.File("test.nc") as fh:
    var = fh["x"]
    print(h5py.check_dtype(vlen=var.dtype))
    print(var.dtype.metadata)
    print(fh["x"][...])


backend = "h5py"
os.environ["H5NETCDF_READ_BACKEND"] = backend
print(f"\n----- xarray - h5netcdf - {backend} --------------------")
with xr.open_dataset("test.nc", engine="h5netcdf") as ds:
    print(ds["x"])

backend = "pyfive"
os.environ["H5NETCDF_READ_BACKEND"] = backend
print(f"\n----- xarray - h5netcdf - {backend} --------------------")
with xr.open_dataset("test.nc", engine="h5netcdf") as ds:
    print(ds["x"])

Output:

----- PYFIVE --------------------
string_info(encoding='ascii', length=None)
{'vlen': <class 'str'>}
['foó' 'bár' 'baź']

----- H5PY --------------------
<class 'str'>
{'vlen': <class 'str'>}
[b'fo\xc3\xb3' b'b\xc3\xa1r' b'ba\xc5\xba']

----- xarray - h5netcdf - h5py --------------------
<xarray.DataArray 'x' (x: 3)> Size: 36B
array(['foó', 'bár', 'baź'], dtype='<U3')
Coordinates:
  * x        (x) <U3 36B 'foó' 'bár' 'baź'

----- xarray - h5netcdf - pyfive --------------------
<xarray.DataArray 'x' (x: 3)> Size: 24B
array(['foó', 'bár', 'baź'], dtype=object)
Coordinates:
  * x        (x) object 24B 'foó' 'bár' 'baź'
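To make the mismatch concrete, here is a minimal sketch of the kind of identity check a downstream caller might apply to the check_dtype result. It is not xarray's actual code. `StringInfo`, `vlen_base_from_metadata`, and `is_vlen_string` are hypothetical names used only for illustration, and the sketch uses plain Python objects rather than importing h5py or pyfive.

```python
from collections import namedtuple

# Hypothetical stand-in for the value pyfive printed in this report;
# it is not imported from pyfive or h5py.
StringInfo = namedtuple("StringInfo", ["encoding", "length"])


def vlen_base_from_metadata(metadata):
    """Return the base type stored under the 'vlen' key, or None."""
    if not metadata:
        return None
    return metadata.get("vlen")


def is_vlen_string(check_result):
    """Identity check of the kind a downstream caller might apply (assumption)."""
    return check_result is str


# Both backends report {'vlen': str} as the dtype metadata in this report.
metadata = {"vlen": str}

h5py_like = vlen_base_from_metadata(metadata)            # matches the h5py result
pyfive_like = StringInfo(encoding="ascii", length=None)  # mirrors the pyfive result

print(is_vlen_string(h5py_like))    # True: the variable would be treated as a vlen string
print(is_vlen_string(pyfive_like))  # False: the variable would be left as object
```

Under these assumptions, returning the 'vlen' entry of the dtype metadata (which pyfive already exposes as `{'vlen': <class 'str'>}`) would satisfy the same check as the h5py result.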
