Quick Overview#

This chapter gives a quick overview of how to use the package. Detailed explanations can be found in the user guide.

Start by importing the package:

import h5rdmtoolbox as h5tbx

Difference to h5py package#

The h5RDMtoolbox is built upon the h5py package. The base functionality is kept, but convenience features and interfaces are added.

Filename#

A filename does not need to be provided when creating a new file. If none is provided, a temporary file is created. In addition, hdf_filename is provided as a property that allows working with the filename even after the file has been closed and with pathlib.Path objects instead of strings:

with h5tbx.use(None):
    with h5tbx.File() as h5:
        pass
h5.hdf_filename.name  # equal to h5.filename but a pathlib.Path and exists also after the file is closed
'tmp0.hdf'
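
Since hdf_filename is a pathlib.Path, the usual path operations are available. A minimal sketch (the expected behaviour is an assumption based on the temporary-file handling described above):

h5.hdf_filename.exists()  # expected True: the temporary file persists after closing
h5.hdf_filename.parent    # directory in which the temporary file was created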

Additional arguments#

Here and there, the toolbox allows one-liners, e.g. creating attributes during dataset creation. In the following example, the dataset is also marked as a dimension scale:

with h5tbx.File() as h5:
    ds_time = h5.create_dataset(name='time',
                                data=[0, 1, 2, 3],
                                attrs=dict(units='s',
                                           long_name='measurement time'),
                                make_scale=True)
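
For comparison, a sketch of the equivalent steps in plain h5py (assuming an already opened h5py.File object named h5); the attributes and the dimension scale have to be set in separate calls:

ds_time = h5.create_dataset('time', data=[0, 1, 2, 3])
ds_time.attrs['units'] = 's'
ds_time.attrs['long_name'] = 'measurement time'
ds_time.make_scale()  # mark the dataset as a dimension scale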

Datasets/xarray interface#

Data access will not return an np.ndarray but an xr.DataArray object. It is capable of storing attributes and coordinates (a concept similar to HDF dimension scales). Find out about all the possibilities this offers in xarray’s documentation.

Let’s create some sample data and see how this new return object can help:

import numpy as np

time = np.linspace(0, np.pi/4, 21) # units [s]
signal = np.sin(2*np.pi*3*time) # units [V], physical: [m/s]

with h5tbx.File() as h5:
    vel_hdf_filename = h5.hdf_filename # store for later use
    
    ds_time = h5.create_dataset(name='time',
                                data=time,
                                attrs=dict(units='s',
                                           long_name='measurement time'),
                                make_scale=True)
    
    ds_signal = h5.create_dataset(name='vel',
                                  data=signal,
                                  attrs=dict(units='m/s',
                                             long_name='air velocity in pipe'),
                                  attach_scale=ds_time)

Inspired by xarray, the methods sel and isel are implemented:

with h5tbx.File(vel_hdf_filename) as h5:
    vel2 = h5['vel'].sel(time=2, method='nearest')
vel2
<xarray.DataArray 'vel' ()>
0.7855
Coordinates:
    time     float64 0.7854
Attributes:
    long_name:  air velocity in pipe
    units:      m/s
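
isel works analogously, but selects by integer index instead of by coordinate value (a small sketch following the xarray semantics):

with h5tbx.File(vel_hdf_filename) as h5:
    vel_first = h5['vel'].isel(time=0)  # first sample along the time dimension
vel_first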

Another advantage is using the plotting utilities from xarray:

with h5tbx.File(vel_hdf_filename) as h5:
    vel_data = h5['vel'][:]
    vel_data.plot(marker='o')
    
vel_data  # this returns the interactive view of the array and its meta data
<xarray.DataArray 'vel' (time: 21)>
0.0 0.6745 0.9959 0.7962 0.1797 -0.5308 ... -0.6615 0.01737 0.6872 0.9973 0.7855
Coordinates:
  * time     (time) float64 0.0 0.03927 0.07854 0.1178 ... 0.7069 0.7461 0.7854
Attributes:
    long_name:  air velocity in pipe
    units:      m/s
[Figure: line plot of the dataset 'vel' over its coordinate 'time']

Natural Naming#

Until now, we used the conventional way of addressing variables and groups in a dictionary-like style. h5RDMtoolbox allows using “natural naming”, which means that we can address those objects as if they were attributes. Make sure h5tbx.config.natural_naming is set to True (the default).

Let’s first disable natural_naming:

with h5tbx.set_config(natural_naming=False):
    with h5tbx.File(vel_hdf_filename, 'r') as h5:
        try:
            ds = h5.vel[:]
        except Exception as e:
            print(e)
'File' object has no attribute 'vel'

Enable it:

with h5tbx.set_config(natural_naming=True):
    with h5tbx.File(vel_hdf_filename, 'r') as h5:
        ds = h5.vel[:]

Inspect file content#

Often it is necessary to inspect the content of a file (structure and metadata, not the raw data). Calling dump() on a group shows the content (datasets, groups and attributes) as a pretty and interactive (!) HTML representation. This is adopted from the xarray package; all credits for this idea go there. The representation here avoids showing data, though. Outside an IPython environment, call sdump() to get a string representation of the file.

with h5tbx.File(vel_hdf_filename) as h5:
    h5.dump()
time  (21) [float64]
      • long_name: measurement time
      • units: s
vel  (time: 21) [float64]
      • long_name: air velocity in pipe
      • units: m/s
with h5tbx.File(vel_hdf_filename) as h5:
    h5.sdump()
time: (21,), dtype: float64
    a: long_name: measurement time
    a: units: s
vel: (21,), dtype: float64
    a: long_name: air velocity in pipe
    a: units: m/s

Conventions#

The file content is controlled by means of a convention. This means that specific attributes are required for HDF groups or datasets.

They can be understood as rules, which are validated during usage. To make these rules effective, the convention must be imported and enabled. Conventions can be created by the user, too. More on this here.

For now, we select an existing one, which is published on Zenodo:

cv = h5tbx.convention.from_zenodo('10428822')
cv
Convention("h5rdmtoolbox-tutorial-convention")

From the above representation string of the convention object we can read which attributes are optional or required for file creation (__init__), dataset creation (create_dataset) or group creation (create_group).

Without enabling the convention, working with HDF5 files through the h5rdmtoolbox is almost the same as using h5py (apart from a few additional features which make life a bit easier):

with h5tbx.File() as h5:
    h5.dump()

Now, we enable the convention …

h5tbx.use(cv)
using("h5rdmtoolbox-tutorial-convention")

… and get an error, because we are not providing a “data_type”:

try:
    with h5tbx.File() as h5:
        pass
except Exception as e:
    print(e)
Convention "h5rdmtoolbox-tutorial-convention" expects standard attribute "data_type" to be provided as an argument during file creation.
import numpy as np

time = np.linspace(0, np.pi/4, 21) # units [s]
signal = np.sin(2*np.pi*3*time) # units [V], physical: [m/s]

with h5tbx.File(contact=h5tbx.__author_orcid__, data_type='experimental') as h5:
    vel_hdf_filename = h5.hdf_filename # store for later use
    
    ds_time = h5.create_dataset(name='time',
                                data=time, 
                                units='s',
                                long_name='measurement time',
                                make_scale=True)
    
    ds_signal = h5.create_dataset(name='vel',
                                  data=signal,
                                  units='m/s',
                                  long_name='air velocity in pipe',
                                  attach_scale=ds_time)
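
As a quick check (a sketch, assuming the standard attributes are stored as regular HDF5 attributes, as in the examples above), we can read them back:

with h5tbx.File(vel_hdf_filename) as h5:
    print(h5['vel'].attrs['units'], h5['vel'].attrs['long_name'])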

RDF#

The files can be described by RDF triples:

h5tbx.use(None)
with h5tbx.File() as h5:
    grp = h5.create_group('contact', attrs=dict(orcid='https://orcid.org/0000-0001-8729-0482'))   
    grp.rdf.predicate = 'https://schema.org/author'
    grp.rdf.type = 'http://xmlns.com/foaf/0.1/Person'  # what the content of group is, namely a foaf:Person
    grp.rdf.subject = 'https://orcid.org/0000-0001-8729-0482'  # corresponds to @ID in JSON-LD
    grp.rdf.predicate['orcid'] =  'http://w3id.org/nfdi4ing/metadata4ing#orcidId'
    grp.attrs['first_name', 'http://xmlns.com/foaf/0.1/firstName'] = 'Matthias'

    h5.dump()

One of the benefits is that the user can understand the meaning of the data. A machine-interpretable, standardized exchange format used by Semantic Web technologies is JSON-LD. The toolbox also allows exporting to this format:

print(h5tbx.jsonld.dump_file(h5.hdf_filename, skipND=1))
{
    "@context": {
        "owl": "http://www.w3.org/2002/07/owl#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "hdf5": "http://purl.allotrope.org/ontologies/hdf5/1.8#"
    },
    "@type": "hdf5:File",
    "hdf5:rootGroup": {
        "@type": "hdf5:Group",
        "hdf5:attribute": [],
        "hdf5:name": "/",
        "hdf5:member": [
            {
                "@type": "hdf5:Group",
                "hdf5:attribute": [
                    {
                        "@type": "hdf5:Attribute",
                        "hdf5:name": "first_name",
                        "hdf5:value": "Matthias",
                        "@id": "_:N71720321bfbd4118a91ab0dfaef755b8"
                    },
                    {
                        "@type": "hdf5:Attribute",
                        "hdf5:name": "orcid",
                        "hdf5:value": "https://orcid.org/0000-0001-8729-0482",
                        "@id": "_:N584a9bb96d4a48fdb8441a555dc7dc67"
                    }
                ],
                "hdf5:name": "/contact",
                "@id": "_:Nf4009cd3fac146998b47c7d47595c506"
            }
        ],
        "@id": "_:Ne37fdb358738490a9f32a9670eb95d09"
    },
    "@id": "_:Nb5ab11972a0742fe8fe573ee56a6f7ee"
}
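
Because the export is standard JSON-LD, it can be processed with common Semantic Web tooling. A minimal sketch, assuming a recent rdflib (≥ 6, with built-in JSON-LD support) is installed:

import rdflib

jsonld_str = h5tbx.jsonld.dump_file(h5.hdf_filename, skipND=1)
g = rdflib.Graph()
g.parse(data=jsonld_str, format='json-ld')

# iterate over the parsed triples, e.g. to feed them into a triple store later
for s, p, o in g:
    print(s, p, o)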

Databases#

The h5rdmtoolbox currently implements two solutions for using databases with HDF5 files. One maps metadata into a MongoDB database. The other uses the HDF5 file itself as a database and allows querying without any further steps.

In this quick tutorial, we use the second solution. More on the topic can be found in the documentation.

Let’s find the dataset named “/vel” (yes, trivial in this case, but just to get an idea). We use find_one because we want to find only one (the first) occurrence:

from h5rdmtoolbox.database import FileDB
res = FileDB(vel_hdf_filename).find_one({'$name': '/vel'})
print(res.name)
/vel

The same can be done from an opened file, too:

with h5tbx.File(vel_hdf_filename) as h5:
    res = h5.find_one({'$name': '/vel'})
res.name
'/vel'

Let’s find all datasets (using find) with the attribute “units” set to any value:

res = FileDB(vel_hdf_filename).find({'units': {'$regex': '.*'}})
for r in res:
    print(r)
<LDataset "/time" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp5.hdf" attrs=(units=s, long_name=measurement time)>
<LDataset "/vel" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp5.hdf" attrs=(long_name=air velocity in pipe, units=m/s)>
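
The query syntax follows MongoDB-like operators (e.g. $regex, $exists). As a further hedged example, we could search for datasets whose long_name mentions “velocity”:

res = FileDB(vel_hdf_filename).find({'long_name': {'$regex': '.*velocity.*'}})
for r in res:
    print(r.name)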

Layouts#

Layouts define how a file is expected to be organized: which groups and datasets must exist, which attributes are expected, and much more. Layouts define expectations and thus help with file exchange where multiple users are involved. In the jargon of the toolbox, we call these expectations “specifications”.

Design concept
The layout module makes use of the database solution for HDF5 files. The idea is that we should be able to formulate our expectations/specifications in the form of a query. For more detailed information, see here. So we write down queries that are expected to find HDF5 objects in a file when we validate one in the future.

Let’s design a simple one, which requires all datasets to have the attribute “units”:

from h5rdmtoolbox.layout import Layout
lay = Layout()

spec_all_dataset = lay.add(
    FileDB.find,  # query function
    flt={},
    objfilter='dataset',
    n=None
)

# The following specification is added to the previous one.
# It applies the query only to results found by the previous query.
spec_compression = spec_all_dataset.add(
    FileDB.find,
    flt={'units': {'$exists': True}}, # attribute "units" exists
    n=1
)

# We added one specification (with a sub-specification) to the layout. Let's check:
lay.specifications  # note that the second specification is not shown because it is nested inside the first one
[LayoutSpecification(kwargs={'flt': {}, 'objfilter': 'dataset'})]
res = lay.validate(vel_hdf_filename)
res.is_valid()
True
res.print_summary(exclude_keys=('kwargs', 'target_name', 'target_type'))
Summary of layout validation
+--------------------------------------+----------+--------+----------------------+---------------+-----------------------------------------+
| id                                   | called   |   flag | flag description     | description   | func                                    |
|--------------------------------------+----------+--------+----------------------+---------------+-----------------------------------------|
| 006512ac-67af-446c-aead-743f297c15aa | True     |     17 | SUCCESSFUL, OPTIONAL |               | h5rdmtoolbox.database.hdfdb.filedb.find |
| a29f4109-2a79-4142-bc45-d1baa57bd492 | True     |      1 | SUCCESSFUL           |               | h5rdmtoolbox.database.hdfdb.filedb.find |
| a29f4109-2a79-4142-bc45-d1baa57bd492 | True     |      1 | SUCCESSFUL           |               | h5rdmtoolbox.database.hdfdb.filedb.find |
+--------------------------------------+----------+--------+----------------------+---------------+-----------------------------------------+
--> Layout is valid

The above layout successfully validates the file.

Now, let’s add another specification:

  • The file must have one dataset named “pressure”.

  • The exact location within the file does not play a role.

  • This specific dataset must have the unit “Pa”.

  • The shape of the dataset must be equal to (21,).

lay.add(
    FileDB.find_one,  # query function
    flt={'$name': {'$regex': 'pressure'}, 
         '$shape': (21, ),
         'units': 'Pa'},
    objfilter='dataset',
    n=1
)
lay.specifications
[LayoutSpecification(kwargs={'flt': {}, 'objfilter': 'dataset'}),
 LayoutSpecification(kwargs={'flt': {'$name': {'$regex': 'pressure'}, '$shape': (21,), 'units': 'Pa'}, 'objfilter': 'dataset'})]

The validation now fails:

res = lay.validate(vel_hdf_filename)
res.is_valid()
2024-05-16_14:10:48,772 ERROR    [core.py:330] Applying spec. "LayoutSpecification(kwargs={'flt': {'$name': {'$regex': 'pressure'}, '$shape': (21,), 'units': 'Pa'}, 'objfilter': 'dataset'})" failed due to not matching the number of results: 1 != 0
False

Let’s add such a dataset:

with h5tbx.File(vel_hdf_filename, 'r+') as h5:
    h5.create_dataset('subgrp/pressure', shape=(21,), attrs={'units': 'Pa'})

And perform the validation again:

res = lay.validate(vel_hdf_filename)
res.is_valid()
True

Feel free to play with the layout specifications and the HDF5 file content. Note that some knowledge about performing queries with the underlying database is needed.

Repositories#

Finally, we can publish our data. The toolbox implements an interface to Zenodo. Using it with the sandbox (testing) environment requires an API token. For this, please provide the environment variable “ZENODO_SANDBOX_API_TOKEN”:

# %set_env ZENODO_SANDBOX_API_TOKEN=<your token>
from h5rdmtoolbox.repository import zenodo
from datetime import datetime
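
Outside an IPython session, the token can also be set programmatically before creating the deposit (a sketch; replace the placeholder with your own sandbox token):

import os
os.environ['ZENODO_SANDBOX_API_TOKEN'] = '<your token>'  # placeholder, never commit real tokens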

Create a new deposit (repo in the testing environment):

deposit = zenodo.ZenodoSandboxDeposit(None)

Prepare metadata according to the Zenodo API:

meta = zenodo.metadata.Metadata(
    version="1.0.0",
    title='H5TBX Quick Overview Test',
    description=f'The file created in the quick overview script using the h5rdmtoolbox version {h5tbx.__version__}.',
    creators=[zenodo.metadata.Creator(name="Probst, Matthias",
                                      affiliation="Karlsruhe Institute of Technology, Institute for Thermal Turbomachinery",
                                      orcid="0000-0001-8729-0482")],
    upload_type='dataset',
    access_right='open',
    keywords=['h5rdmtoolbox', 'tutorial', 'repository-test'],
    publication_date=datetime.now(),
)

Push the metadata to the repository:

deposit.metadata = meta

Upload the HDF5 file:
We could upload the HDF5 file alone (using .upload_file). However, HDF5 files are often very large, so it makes sense to upload a metadata file, too.

For this purpose, upload_hdf_file accepts a conversion callable. We will build one below. It extracts the metadata from the HDF5 file and stores it in a JSON file, which is automatically uploaded, too:

from h5rdmtoolbox import jsonld
import pathlib
class HDF_to_JSONLD:
    """Callable that extracts the metadata of an HDF5 file and writes it to a JSON-LD file."""
    def __init__(self, skipND):
        self.skipND = skipND  # passed on to jsonld.dump_file to skip N-dimensional dataset values

    def __call__(self, hdf_filename):
        # write the JSON-LD export next to the HDF5 file, with a .json suffix
        json_filename = pathlib.Path(hdf_filename).with_suffix('.json')
        with open(json_filename, 'w') as f:
            f.write(jsonld.dump_file(hdf_filename, skipND=self.skipND))
        return json_filename
        
deposit.upload_hdf_file(filename=vel_hdf_filename, metamapper=HDF_to_JSONLD(1))
deposit.get().json()['links']['html']
'https://sandbox.zenodo.org/deposit/55684'