Creating HDF Datasets#
Dataset creation works almost exactly as in h5py. To further streamline working with HDF5 files, however, some features are added.
import h5rdmtoolbox as h5tbx
import numpy as np
import xarray as xr
h5tbx.use(None)
using("h5py")
The obligatory parameters for dataset creation, known from the base package h5py, are name and either data or shape. In addition, attributes can be passed right away during dataset creation:
with h5tbx.File() as h5:
    h5.create_dataset('x', shape=(4,),
                      attrs=dict(description='x coordinate'))
    h5.dump()
x (4) [float32]
    - description: x coordinate
The name of the dataset is its path within the HDF5 file. A dataset can be created even if the (sub-)groups in that path do not exist yet.
with h5tbx.File() as h5:
    h5.create_dataset('grp/subgrp/x', shape=(4,))
    h5.dump()
grp
    subgrp
        x (4) [float32]
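To make explicit which groups such a path implies, here is a small standard-library sketch. The helper name `implied_groups` is hypothetical and is not part of h5py or h5rdmtoolbox; it only splits the path along the `/` separators that HDF5 paths use.

```python
import posixpath


def implied_groups(dataset_path):
    """Return the group paths that must exist to hold dataset_path.

    Hypothetical helper for illustration only.
    """
    groups = []
    parent = posixpath.dirname(dataset_path.strip('/'))
    while parent:
        groups.append('/' + parent)
        parent = posixpath.dirname(parent)
    # Outermost group first
    return list(reversed(groups))


groups = implied_groups('grp/subgrp/x')
print(groups)
```

A dataset stored directly in the root group, e.g. `implied_groups('x')`, yields an empty list.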
Attributes#
Attributes also gain flexibility and additional features. One of the main ones is the ability to interpret attribute strings as “value and unit” quantities using the package pint:
Say we store the attribute length; most probably it will include the unit, e.g. 1 m. We could also have saved it as a dataset, but we did not. Calling .to_pint() on the returned object (which is a subclass of str) yields a pint.Quantity (see https://pint.readthedocs.io/en/stable/getting/tutorial.html for more info):
with h5tbx.File() as h5:
    h5.attrs['length'] = '1 m'
    p = h5.attrs.length.to_pint()
p
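The idea of a string that still behaves like a str but can be split into a value and a unit can be sketched without any dependencies. The class `QuantityString` and its method `split_quantity` are hypothetical names made up for this illustration; this is not the class h5rdmtoolbox returns, and it does none of the unit parsing or unit algebra that pint provides.

```python
class QuantityString(str):
    """Hypothetical str subclass; NOT the class used by h5rdmtoolbox."""

    def split_quantity(self):
        """Split a 'value unit' string into (float, unit string)."""
        magnitude, _, unit = self.strip().partition(' ')
        return float(magnitude), unit.strip()


length = QuantityString('1 m')
magnitude, unit = length.split_quantity()
print(magnitude, unit)
```

Because the object remains a plain string, it can be stored and compared like any other attribute value; only the explicit method call interprets it as a quantity.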
Dimension scales#
Dimension scales can be defined during dataset creation. Let time be the dimension scale and pressure be the dataset to which it is attached.
To make seamless use of HDF dimension scales, the feature is passed on to the user by returning an xarray.DataArray instead of a np.ndarray object. See the documentation on slicing datasets for more.
fname_dimcales = h5tbx.utils.generate_temporary_filename()
with h5tbx.File(fname_dimcales, 'w') as h5:
    h5.create_dataset('time', data=[0, 1, 2, 3, 4, 5],
                      make_scale=True,
                      attrs={'units': 's'})
    h5.create_dataset('pressure', data=np.random.rand(6),
                      attach_scale=h5['time'],
                      attrs={'units': 'Pa'})
    h5.dump()
pressure (time: 6) [float64]
    - units: Pa
time (6) [int64]
    - units: s
In order to be compliant with xarray objects, single value “dimension scales” are set via the attribute COORDINATES. An example is the location of the pressure sensor in our case. Let’s first create the datasets and then add them as attributes to “pressure”:
with h5tbx.File(fname_dimcales, 'r+') as h5:
    h5.create_dataset('x', data=5.32)
    h5.create_dataset('y', data=-3.1)
    h5['pressure'].attrs['COORDINATES'] = ('x', 'y')
    h5.dump()
pressure (time: 6) [float64]
    - units: Pa
time (6) [int64]
    - units: s
x 5.32 [float64]
y -3.1 [float64]
String datasets#
String datasets can be created very quickly. No standard_name, long_name or units need to be given. Units generally make no sense for strings, but long_name and standard_name can still be passed via the method parameters.
The dump method displays single strings but not lists of strings.
The return value when sliced is still an xarray.DataArray, since attributes should remain attached to the object. Use .values to get the raw string:
with h5tbx.File() as h5:
    h5.create_string_dataset('astr', 'hello_world')
    h5.create_string_dataset('string_list', ['hello', 'world'])
    h5.dump()
    print('> ', h5['astr'][()])
    print('> ', h5['astr'].values[()])
    print('> ', h5['string_list'][:])
    print('> ', h5['string_list'].values[:])
astr: [|S11] data=b'hello_world'
string_list: [|S5]
> hello_world
> b'hello_world'
> <xarray.DataArray (dim_0: 2)> Size: 40B
'hello' 'world'
Dimensions without coordinates: dim_0
> [b'hello' b'world']
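The raw values are byte strings, as the `b'...'` prefix in the output shows. A minimal sketch of turning them into ordinary Python strings, assuming the data is UTF-8 encoded (the byte literals below simply repeat the values printed above):

```python
raw_single = b'hello_world'
raw_list = [b'hello', b'world']

# Decode bytes to str; UTF-8 is an assumption about the stored encoding
text_single = raw_single.decode('utf-8')
text_list = [item.decode('utf-8') for item in raw_list]

# The fixed-length dtype |S11 corresponds to the 11 bytes of 'hello_world'
n_bytes = len(raw_single)

print(text_single, text_list, n_bytes)
```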
Time datasets#
Time data is stored as string datasets. Use create_time_dataset. Provide data as datetime objects and indicate the time format (simplest is to pass ‘iso’).
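Since the values end up as strings, it helps to see what such a string looks like. The sketch below uses only the standard library and assumes that 'iso' corresponds to the format string %Y-%m-%dT%H:%M:%S.%f, which is what the time_format attribute in the output of the next cell reports. The constant name `ISO_FORMAT` and the fixed start time are chosen for illustration.

```python
from datetime import datetime, timedelta

# Assumed to be what time_format='iso' stands for (see the time_format attribute)
ISO_FORMAT = '%Y-%m-%dT%H:%M:%S.%f'

start = datetime(2025, 5, 19, 16, 53, 6, 102846)
stamps = [start + timedelta(seconds=i) for i in range(3)]

# datetime -> str (what gets stored) and str -> datetime (what can be recovered)
encoded = [t.strftime(ISO_FORMAT) for t in stamps]
decoded = [datetime.strptime(s, ISO_FORMAT) for s in encoded]

print(encoded[0])
```

The round trip is lossless at microsecond resolution, so storing time data as formatted strings does not discard information for datetime objects.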
from datetime import datetime, timedelta
with h5tbx.File() as h5:
    now = datetime.now()
    h5.create_time_dataset('t',
                           data=[now, now + timedelta(seconds=1), now + timedelta(seconds=2)],
                           time_format='iso',
                           make_scale=True)
    h5.create_dataset('x', data=[1, 2, 3], attach_scale='t')
    # h5.dump()
    txr = h5['t'][()]
txr
<xarray.DataArray (dim_0: 3)> Size: 24B
2025-05-19T16:53:06.102846 2025-05-19T16:53:07.102846 2025-05-19T16:53:08.102846
Dimensions without coordinates: dim_0
Attributes:
    RDF_PREDICATE: {'time_format': 'https://matthiasprobst.github.io/pivmeta...
    RDF_TYPE: https://schema.org/DateTime
    time_format: %Y-%m-%dT%H:%M:%S.%f
Advanced dataset creation#
There is more to dataset creation. You can:
add attributes
with h5tbx.File() as h5:
    # unitless dataset; long_name is passed via the attrs parameter
    h5.create_dataset('ds', shape=(10,),
                      attrs=dict(long_name='a long name', anothera='another attr'))
make and attach scales (note the output of dump(): the scale “link” is shown)
with h5tbx.File() as h5:
    h5.create_dataset('x', data=[1, 2, 3],
                      attrs=dict(units='m', standard_name='x_coordinate'),
                      make_scale=True)
    h5.create_dataset('t', data=[20.1, 18.5, 24.7],
                      attrs=dict(units='degC', standard_name='temperature'),
                      attach_scale=h5['x'])
    print(h5.t.x)  # note that the dimension scale can be accessed using attribute-style syntax
    h5.dump()
<HDF5 dataset "x": shape (3,), type "<i8", convention "h5py">
t (x: 3) [float64]
    - standard_name: temperature
    - units: degC
x (3) [int64]
    - standard_name: x_coordinate
    - units: m
add xarray.DataArrays
arr = xr.DataArray(dims=('y', 'x'), data=np.random.rand(3, 2),
                   coords={'y': xr.DataArray(dims='y', data=[1, 2, 3],
                                             attrs={'units': 'm',
                                                    'standard_name': 'y_coordinate'}),
                           'x': xr.DataArray(dims='x',
                                             data=[0, 1],
                                             attrs={'standard_name': 'x_coordinate'})
                           },
                   attrs={'long_name': 'a long name',
                          'units': 'm/s'})
with h5tbx.File() as h5:
    h5.create_dataset('temperature', data=arr)
    h5.dump()
temperature (y: 3, x: 2) [float64]
    - long_name: a long name
    - units: m/s
x (2) [int64]
    - standard_name: x_coordinate
y (3) [int64]
    - standard_name: y_coordinate
    - units: m
add an xarray.Dataset
ds = xr.Dataset({'foo': [1,2,3], 'bar': ('x', [1, 2]), 'baz': np.pi})
ds
<xarray.Dataset> Size: 48B
Dimensions:  (foo: 3, x: 2)
Coordinates:
  * foo      (foo) int64 24B 1 2 3
Dimensions without coordinates: x
Data variables:
    bar      (x) int64 16B 1 2
    baz      float64 8B 3.142
try:
    with h5tbx.File() as h5:
        h5.create_dataset_from_xarray_dataset(ds)
except h5tbx.errors.UnitsError as e:
    print(e)
ds.foo.attrs['units']='m'
ds.foo.attrs['long_name']='foo'
ds.bar.attrs['units']='m'
ds.bar.attrs['long_name']='bar'
ds.baz.attrs['units']='m'
ds.baz.attrs['long_name']='baz'
ds
<xarray.Dataset> Size: 48B
Dimensions:  (foo: 3, x: 2)
Coordinates:
  * foo      (foo) int64 24B 1 2 3
Dimensions without coordinates: x
Data variables:
    bar      (x) int64 16B 1 2
    baz      float64 8B 3.142
with h5tbx.File() as h5:
    h5.create_dataset_from_xarray_dataset(ds)
A dataset can also be created by item assignment, i.e. via __setitem__:
with h5tbx.File() as h5:
    h5['x'] = ([1, 2, 3], dict(attrs={'hello': 'world'}, compression='gzip'))
    h5.dump()
x (3) [int64]
    - hello: world
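How item assignment with a (data, options) tuple can be mapped onto a create_dataset call is sketched below. The class `TinyFile` is a hypothetical stand-in that stores everything in a dictionary. It is not h5rdmtoolbox code, and the library's actual dispatch rules may differ.

```python
class TinyFile:
    """Hypothetical stand-in for a file object; not h5rdmtoolbox code."""

    def __init__(self):
        self.datasets = {}

    def create_dataset(self, name, data, attrs=None, **kwargs):
        # Store data, attributes and any remaining creation options
        self.datasets[name] = {'data': data,
                               'attrs': dict(attrs or {}),
                               'options': kwargs}
        return self.datasets[name]

    def __setitem__(self, name, value):
        # Accept either plain data or a (data, options-dict) tuple
        if isinstance(value, tuple) and len(value) == 2 and isinstance(value[1], dict):
            data, options = value
            self.create_dataset(name, data, **options)
        else:
            self.create_dataset(name, value)


f = TinyFile()
f['x'] = ([1, 2, 3], dict(attrs={'hello': 'world'}, compression='gzip'))
f['y'] = [4, 5]
print(f.datasets['x'])
```

The design choice is that the second tuple element carries exactly the keyword arguments that would otherwise be passed to create_dataset, so the assignment form stays a shorthand rather than a separate API.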