Using HDF5 file(s) as a database

An HDF5 file can itself be regarded as a database; however, h5py does not implement query functions. This chapter shows how to find data within one or more HDF5 files based on attributes or properties, using practical examples.

In addition to the normal import, we will need some tutorial data:

import h5rdmtoolbox as h5tbx
from h5rdmtoolbox import tutorial

h5tbx.use(None)

using("h5py")

Attribute search

HDF5 attributes generally describe groups or datasets and are therefore ideal search targets.

One such attribute used in the test file is “standard_name”. The find method takes a filter argument flt of type dictionary, with a syntax similar to MongoDB's. The return value is a list of (lazy) objects:

with h5tbx.File(filename) as h5:
    res = h5.find(flt={'standard_name': 'velocity', 'units': 'm/s'})

res

[<LDataset "/group1/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=velocity)>,
 <LDataset "/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(check_value=14.2, units=m/s, standard_name=velocity)>,
 <LDataset "/group2/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(check_value=30.2, units=m/s, standard_name=velocity)>]

If we are only interested in the first occurrence, then find_one should be used.

with h5tbx.File(filename) as h5:
    res = h5.find_one(flt={'standard_name': 'velocity', 'units': 'm'})

res
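To make the filter semantics concrete, here is a minimal pure-Python sketch of dictionary-based matching. It is not the h5rdmtoolbox implementation: the `objects` mapping and the helper functions are hypothetical stand-ins, and the sketch assumes, as in MongoDB, that every key/value pair in the filter must match.

```python
# Hypothetical in-memory stand-in for HDF5 objects: path -> attributes.
objects = {
    "/velocity": {"standard_name": "velocity", "units": "m/s", "check_value": 14.2},
    "/group1/velocity": {"standard_name": "velocity", "units": "m/s"},
    "/group2/z": {"standard_name": "z_coordinate", "units": "m"},
}


def matches(attrs, flt):
    """Return True if every key/value pair of flt is present in attrs."""
    return all(key in attrs and attrs[key] == value for key, value in flt.items())


def find(objects, flt):
    """Collect all paths whose attributes satisfy the filter."""
    return [path for path, attrs in objects.items() if matches(attrs, flt)]


def find_one(objects, flt):
    """Return the first matching path, or None if nothing matches."""
    return next((path for path, attrs in objects.items() if matches(attrs, flt)), None)


hits = find(objects, {"standard_name": "velocity", "units": "m/s"})
no_hit = find_one(objects, {"standard_name": "velocity", "units": "m"})
print(hits, no_hit)
```

In this sketch, the second query finds nothing because no object carries both standard_name "velocity" and units "m".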

Both functions are also part of the database module and can be called on HDF5 filenames:

res_find = h5tbx.database.find(filename, flt={'standard_name': 'velocity'})
assert len(res_find) == 3
res_one = h5tbx.database.find_one(filename, flt={'standard_name': 'velocity'})

Property search

It can also be useful to search for properties of HDF5 objects, such as a specific shape or number of dimensions. To do so, prefix the property name with a dollar sign:

h5tbx.database.find(filename, flt={'$ndim': 0})

[<LDataset "/group2/z" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(standard_name=z_coordinate, units=m)>]
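The distinction between attribute keys and "$"-prefixed property keys can be illustrated with a small sketch. `FakeDataset`, `lookup` and the dataset shapes below are hypothetical and only mimic the idea; how h5rdmtoolbox resolves properties internally is not shown in this chapter.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDataset:
    """Hypothetical stand-in for an HDF5 dataset (not an h5py class)."""
    name: str
    shape: tuple
    attrs: dict = field(default_factory=dict)

    @property
    def ndim(self):
        return len(self.shape)


_MISSING = object()  # sentinel for "no such property or attribute"


def lookup(obj, key):
    """'$ndim' reads the property obj.ndim; 'units' reads obj.attrs['units']."""
    if key.startswith("$"):
        return getattr(obj, key[1:], _MISSING)
    return obj.attrs.get(key, _MISSING)


def find(objects, flt):
    """Return the names of all objects for which every filter entry matches."""
    return [o.name for o in objects
            if all(lookup(o, k) == v for k, v in flt.items())]


datasets = [
    FakeDataset("/velocity", (10,), {"units": "m/s"}),  # shape is an assumption
    FakeDataset("/group2/z", (), {"units": "m"}),       # 0-dimensional dataset
]
scalars = find(datasets, {"$ndim": 0})
mixed = find(datasets, {"$ndim": 1, "units": "m/s"})
print(scalars, mixed)
```

The second query in the sketch mixes a property key and an attribute key in one filter.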

Let’s search for all velocity datasets in the file. Velocity datasets are indicated by the attribute “standard_name”:

with h5tbx.File(filename) as h5:
    res = h5.find(flt={'standard_name': 'velocity'})

res

[<LDataset "/group2/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(check_value=30.2, units=m/s, standard_name=velocity)>,
 <LDataset "/group1/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=velocity)>,
 <LDataset "/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(check_value=14.2, units=m/s, standard_name=velocity)>]

If we are only interested in the first occurrence, then find_one() does the job:

with h5tbx.File(filename) as h5:
    res = h5.find_one(flt={'standard_name': 'velocity'})

res

<LDataset "/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(check_value=14.2, units=m/s, standard_name=velocity)>

Object filter

By providing objfilter='dataset' or objfilter='group', only objects of that type are included in the result:

h5tbx.database.find(filename, flt={'check_value': {'$exists': True}}, objfilter='group')

[<LGroup "/group2" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf">,
 <LGroup "/" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf">,
 <LGroup "/group1" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf">]

Special operators

$exists

In the example below, all objects that have an attribute “units” are returned:

results = h5tbx.database.find(filename, {'units': {'$exists': True}})
results

[<LDataset "/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(check_value=14.2, units=m/s, standard_name=velocity)>,
 <LDataset "/group2/z" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(standard_name=z_coordinate, units=m)>,
 <LDataset "/group1/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=velocity)>,
 <LDataset "/pressure1" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(standard_name=pressure, units=Pa, check_value=-140.3)>,
 <LDataset "/group2/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(check_value=30.2, units=m/s, standard_name=velocity)>,
 <LDataset "/group2/pressure2" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(units=kPa, check_value=-10.3, standard_name=pressure)>]
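As a rough illustration of what an existence check does, the following sketch evaluates filters of the form `{name: {'$exists': bool}}` on plain dictionaries. The `objects` mapping and `find_exists` are hypothetical. The `False` case follows MongoDB semantics, and whether h5rdmtoolbox handles it the same way is an assumption.

```python
# Hypothetical stand-in for HDF5 objects: path -> attributes.
objects = {
    "/velocity": {"units": "m/s", "standard_name": "velocity"},
    "/group2/z": {"units": "m", "standard_name": "z_coordinate"},
    "/group1": {"check_value": 0},
}


def find_exists(objects, flt):
    """Evaluate filters of the form {attr_name: {'$exists': bool}}."""
    hits = []
    for path, attrs in objects.items():
        # An object matches if presence of each key equals the requested flag.
        if all((key in attrs) == cond["$exists"] for key, cond in flt.items()):
            hits.append(path)
    return hits


with_units = find_exists(objects, {"units": {"$exists": True}})
without_units = find_exists(objects, {"units": {"$exists": False}})
print(with_units, without_units)
```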

$regex

Regular expressions are useful for finding objects by path name pattern:

results = h5tbx.database.find(filename, {'$name': {'$regex': '^.*/velocity$'}})
results

[<LDataset "/group2/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(check_value=30.2, units=m/s, standard_name=velocity)>,
 <LDataset "/group1/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=velocity)>,
 <LDataset "/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(check_value=14.2, units=m/s, standard_name=velocity)>]
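To see why the pattern `^.*/velocity$` selects exactly these paths, it can be tried with Python's `re` module on a plain list of names. The list below is illustrative, and the last two entries are hypothetical names added to show non-matches.

```python
import re

names = [
    "/velocity",
    "/group1/velocity",
    "/group2/velocity",
    "/group2/pressure2",
    "/velocity_old",   # hypothetical: extra characters after "velocity"
    "/velocity/raw",   # hypothetical: "velocity" is not the last component
]

# "^.*/" accepts any (possibly empty) prefix ending in a slash,
# "velocity$" requires the path to end with exactly that name.
pattern = re.compile(r"^.*/velocity$")
matched = [n for n in names if pattern.search(n)]
print(matched)
```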

$basename

“Basename” is not a property of an HDF5 object, but it is useful for searching. It is implemented as a “virtual property”:

results = h5tbx.database.find(filename, {'$basename': 'pressure2'})
results

[<LDataset "/group2/pressure2" in "/home/docs/.local/share/h5rdmtoolbox/2.5.3/tmp/tmp_1/tmp0.hdf" attrs=(units=kPa, check_value=-10.3, standard_name=pressure)>]

Mathematical operations (>, >=, <, <=)

Many of the datasets and groups in the example file have the attribute “check_value”. Let’s identify objects that match certain values of this attribute:

print('object names with check value greater than 0.1:')

results = h5tbx.database.find(filename, {'check_value': {'$gt': 0.1}})
for r in results:
    print('  ', r.name)


print('\nobject names with check value equal to 0:')

results = h5tbx.database.find(filename, {'check_value': {'$eq': 0}})
for r in results:
    print('  ', r.name)


print('\nobject names with check value lower equal -3.3:')


results = h5tbx.database.find(filename, {'check_value': {'$lte': -3.3}})
for r in results:
    print('  ', r.name)

object names with check value greater than 0.1:
   /velocity
   /group2/velocity

object names with check value equal to 0:
   /group1
   /group2
   /

object names with check value lower equal -3.3:
   /pressure1
   /group2/pressure2
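The comparison operators can be thought of as a lookup table from operator names to comparison functions. The sketch below reproduces the three queries above on a plain dictionary. `COMPARATORS` and `find_by_check_value` are hypothetical names, and the check values are those visible in the outputs of this chapter (objects without the attribute are marked with None).

```python
import operator

# Map MongoDB-style operator names to Python comparison functions.
COMPARATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$eq": operator.eq,
}

check_values = {
    "/": 0,
    "/group1": 0,
    "/group2": 0,
    "/velocity": 14.2,
    "/group2/velocity": 30.2,
    "/pressure1": -140.3,
    "/group2/pressure2": -10.3,
    "/group1/velocity": None,  # attribute not set on this object
}


def find_by_check_value(condition):
    """Return names whose check_value satisfies all operators in condition."""
    hits = []
    for name, value in check_values.items():
        if value is None:
            continue  # objects without the attribute can never match
        if all(COMPARATORS[op](value, ref) for op, ref in condition.items()):
            hits.append(name)
    return hits


greater = find_by_check_value({"$gt": 0.1})
zero = find_by_check_value({"$eq": 0})
lower_equal = find_by_check_value({"$lte": -3.3})
print(greater, zero, lower_equal)
```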

User-defined operator

Let’s take the regex example and turn it into a new operator:

The regex query filter {'$name': {'$regex': '^.*/velocity$'}} selects objects by their “basename”, that is, the last component of the name rather than the full internal HDF5 path.

All operator functions are stored in database.hdfdb.query. In fact, a "$basename" operator already exists, so we first remove it and then register our own function in its place:

from h5rdmtoolbox.database.hdfdb import query

# note, all operator functions are stored in this dictionary: query.operator
query.operator.pop('$basename', None)  # remove existing operator

def my_basename_operator(obj_name, basename) -> bool:
    """calling regex under the hood"""
    print(f'Checking if basename of object name "{obj_name}" is matching pattern "^.*/{basename}$"')
    return query._regex(obj_name, pattern=f'^.*/{basename}$')

query.operator['$basename'] = my_basename_operator

results = h5tbx.database.find(filename, {'$name': {'$basename': 'velocity'}}, 'dataset')

for r in results:
    print(r.name)

Checking if basename of object name "/group1/velocity" is matching pattern "^.*/velocity$"
Checking if basename of object name "/group2/pressure2" is matching pattern "^.*/velocity$"
Checking if basename of object name "/group2/velocity" is matching pattern "^.*/velocity$"
Checking if basename of object name "/group2/z" is matching pattern "^.*/velocity$"
Checking if basename of object name "/pressure1" is matching pattern "^.*/velocity$"
Checking if basename of object name "/velocity" is matching pattern "^.*/velocity$"
/group2/velocity
/group1/velocity
/velocity
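One detail worth keeping in mind when building a pattern from user input as above: the basename is interpolated into the regular expression, so characters such as "." act as regex metacharacters. The following standalone sketch (plain `re`, independent of h5rdmtoolbox; both function names are hypothetical) compares an escaped and an unescaped variant with the same `(obj_name, basename)` signature.

```python
import re


def basename_matches(obj_name, basename):
    """True if the last path component of obj_name equals basename literally."""
    return re.search(f"^.*/{re.escape(basename)}$", obj_name) is not None


def basename_matches_unescaped(obj_name, basename):
    """Same idea, but basename is interpreted as a regex fragment."""
    return re.search(f"^.*/{basename}$", obj_name) is not None


print(basename_matches("/group2/velocity", "velocity"))
print(basename_matches_unescaped("/a/vX1", "v.1"))  # "." matches any character
print(basename_matches("/a/vX1", "v.1"))            # "." is taken literally
```

Whether literal or pattern matching is preferable depends on how the operator is meant to be used; the escaped variant is the safer choice when basenames may contain dots or brackets.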

Working with the results

Let’s investigate the return values of queries. The method find_one returns a so-called “lazy” object, and the find method returns a list of such objects.

results = h5tbx.database.find_one(filename, {'$name': {'$basename': 'velocity'}}, 'dataset')

# let's directly plot the result:
results[()].plot(marker='o')

Checking if basename of object name "/group1/velocity" is matching pattern "^.*/velocity$"
Checking if basename of object name "/group2/pressure2" is matching pattern "^.*/velocity$"
Checking if basename of object name "/group2/velocity" is matching pattern "^.*/velocity$"
Checking if basename of object name "/group2/z" is matching pattern "^.*/velocity$"
Checking if basename of object name "/pressure1" is matching pattern "^.*/velocity$"
Checking if basename of object name "/velocity" is matching pattern "^.*/velocity$"

[<matplotlib.lines.Line2D at 0x722239287490>]

[Figure: line plot of the velocity dataset returned by find_one, drawn with circular markers]
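The idea behind a “lazy” object can be sketched without HDF5 at all: the object only remembers where the data lives and reads it when it is indexed. The class below is a hypothetical illustration that uses a JSON file as storage; it is not the LDataset class of h5rdmtoolbox, whose internals are not covered in this chapter.

```python
import json
import os
import tempfile


class LazyRecord:
    """Remembers file name and object name; reads data only when indexed."""

    def __init__(self, filename, name):
        self.filename = filename
        self.name = name
        self.reads = 0  # counts how often the file was actually opened

    def __getitem__(self, key):
        # The file is opened on demand and closed again right away.
        with open(self.filename) as f:
            data = json.load(f)[self.name]
        self.reads += 1
        return data if key == () else data[key]

    def __repr__(self):
        return f'<LazyRecord "{self.name}" in "{self.filename}">'


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.json")
    with open(path, "w") as f:
        json.dump({"/velocity": [1.0, 2.5, 4.0]}, f)

    lazy = LazyRecord(path, "/velocity")
    reads_before = lazy.reads  # nothing has been read yet
    values = lazy[()]          # read everything, like dataset[()]
    first = lazy[0]            # read again and pick one element

print(reads_before, values, first, lazy.reads)
```

This mirrors why a query result can still be indexed with [()] after the file has been closed: the data is fetched at access time.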