Creating HDF Datasets

Dataset creation works almost the same as in h5py. However, some features are added to further facilitate and streamline work with HDF5 files.

import h5rdmtoolbox as h5tbx
import numpy as np
import xarray as xr

h5tbx.use(None)

using("h5py")

The obligatory parameters for dataset creation, known from the base package h5py, are name and either data or shape. Additionally, attributes can be passed right away during dataset creation:

with h5tbx.File() as h5:
    h5.create_dataset('x', shape=(4,),
                      attrs=dict(description='x coordinate'))
    h5.dump()

/(1)

The name of the dataset is its path within the HDF5 file. A dataset can be created even if its parent (sub-)groups do not exist yet; as the dump below shows, they are created along the way.

with h5tbx.File() as h5:
    h5.create_dataset('grp/subgrp/x', shape=(4,))
    h5.dump()

/(1)
- grp(1)
  - subgrp(1)

Attributes

Attributes also gain flexibility and additional features. One of the main ones is the ability to interpret an attribute string as a quantity, i.e. a value with a unit, using the package pint:
Let’s say we store the attribute length; most probably it will include the unit, e.g. 1 m. We could also have saved it as a dataset, but we did not. Calling .to_pint() on the returned object (which is a subclass of str) yields a pint.Quantity (see https://pint.readthedocs.io/en/stable/getting/tutorial.html for more info):

with h5tbx.File() as h5:
    h5.attrs['length'] = '1 m'
    p = h5.attrs.length.to_pint()
p

1 m

Dimension scales

Dimension scales can be defined during dataset creation. Let time be the dimension scale and pressure be the dataset to which it is attached.
To make seamless use of HDF dimension scales, the feature is passed back to the user by returning an xarray.DataArray instead of an np.ndarray object. See the section on slicing datasets for more on this.

fname_dimcales = h5tbx.utils.generate_temporary_filename()
with h5tbx.File(fname_dimcales, 'w') as h5:
    h5.create_dataset('time', data=[0,1,2,3,4,5],
                      make_scale=True,
                      attrs={'units': 's'})
    h5.create_dataset('pressure', data=np.random.rand(6),
                      attach_scale=h5['time'],
                      attrs={'units': 'Pa'})
    h5.dump()

/(2)
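For comparison, the underlying HDF5 mechanism can be sketched in plain h5py, where marking and attaching a scale are two separate calls. This is a minimal sketch that assumes h5py and numpy are installed. That the make_scale and attach_scale keywords above map onto these calls is an assumption, and the file name is hypothetical (nothing is written to disk).

```python
import h5py
import numpy as np

# In-memory file: the "core" driver with backing_store=False writes nothing to disk.
with h5py.File("dimscale_sketch.h5", "w", driver="core", backing_store=False) as f:
    time = f.create_dataset("time", data=np.arange(6))
    time.attrs["units"] = "s"
    pressure = f.create_dataset("pressure", data=np.random.rand(6))
    pressure.attrs["units"] = "Pa"

    # Mark "time" as a dimension scale ...
    time.make_scale("time")
    # ... and attach it to the first (and only) axis of "pressure".
    pressure.dims[0].attach_scale(time)

    # The scale can be looked up again through the dimension of the dataset.
    attached = pressure.dims[0][0]
    scale_name = attached.name
    scale_values = attached[()]

print(scale_name, scale_values)
```

Plain h5py returns the pressure values as a numpy array and leaves the lookup of the attached scale to the user, which is the step the xarray.DataArray return value saves.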

To be compliant with xarray objects, single-value “dimension scales” are set via the attribute COORDINATES. An example is the location of the pressure sensor in our case. Let’s first create the datasets and then register their names in the COORDINATES attribute of “pressure”:

with h5tbx.File(fname_dimcales, 'r+') as h5:
    h5.create_dataset('x', data=5.32)
    h5.create_dataset('y', data=-3.1)
    h5['pressure'].attrs['COORDINATES'] = ('x', 'y')
    h5.dump()

/(4)

String datasets

String datasets can be created very quickly. Neither standard_name, long_name nor units has to be given. Units generally make no sense for strings anyway, but a long name and a standard name can still be passed via the method parameters.
The dump method will display single strings but not lists of strings.
When sliced, the return value is still an xarray.DataArray, because the attributes should remain attached to the object. Use .values to get the raw string:

with h5tbx.File() as h5:
    h5.create_string_dataset('astr', 'hello_world')
    h5.create_string_dataset('string_list', ['hello', 'world'])
    h5.dump()
    
    print('> ', h5['astr'][()])
    print('> ', h5['astr'].values[()])

    print('> ', h5['string_list'][:])
    print('> ', h5['string_list'].values[:])

/(2)

>  hello_world
>  b'hello_world'
>  <xarray.DataArray (dim_0: 2)> Size: 40B
'hello' 'world'
Dimensions without coordinates: dim_0
>  [b'hello' b'world']
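As the output shows, .values yields byte strings. If ordinary Python str objects are needed, they can be decoded. The sketch below assumes the strings were stored UTF-8 encoded and uses literal values copied from the output rather than reading them from a file; the variable names are hypothetical.

```python
# Byte strings as shown in the output above (copied literally, not read from a file).
raw_single = b'hello_world'
raw_list = [b'hello', b'world']

# Assumption: the strings were stored UTF-8 encoded.
single = raw_single.decode('utf-8')
many = [item.decode('utf-8') for item in raw_list]

print(single)
print(many)
```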

Time datasets

Time data is stored as a string dataset. Use create_time_dataset for this: provide the data as datetime objects and indicate the time format (the simplest option is to pass ‘iso’).

from datetime import datetime, timedelta

with h5tbx.File() as h5:
    now = datetime.now()
    h5.create_time_dataset('t',
                           data=[now, now + timedelta(seconds=1), now + timedelta(seconds=2)],
                           time_format='iso',
                           make_scale=True)
    h5.create_dataset('x', data=[1, 2, 3], attach_scale='t')
    # h5.dump()
    txr = h5['t'][()]
txr

<xarray.DataArray (dim_0: 3)> Size: 24B
2025-05-19T16:53:06.102846 2025-05-19T16:53:07.102846 2025-05-19T16:53:08.102846
Dimensions without coordinates: dim_0
Attributes:
    RDF_PREDICATE:  {'time_format': 'https://matthiasprobst.github.io/pivmeta...
    RDF_TYPE:       https://schema.org/DateTime
    time_format:    %Y-%m-%dT%H:%M:%S.%f
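The time_format attribute in the output is an ordinary strftime/strptime format string. How a stored string maps back to a datetime object can be sketched with the standard library alone. The values below are copied from the output above and the variable names are hypothetical. This illustrates the format, not how h5rdmtoolbox decodes the data internally.

```python
from datetime import datetime, timedelta

# Format string as shown in the time_format attribute of the output above.
time_format = '%Y-%m-%dT%H:%M:%S.%f'

# First stored time string, copied from the output above.
stored = '2025-05-19T16:53:06.102846'

# Parse the string into a datetime object and do date arithmetic with it.
t0 = datetime.strptime(stored, time_format)
t1 = t0 + timedelta(seconds=1)

# Writing the datetime back with the same format reproduces the stored string.
round_trip = t0.strftime(time_format)

print(t0, t1, round_trip == stored)
```

Because the format travels with the dataset as an attribute, a reader of the file can recover the timestamps without knowing how they were written.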