Using HDF5 file(s) as a database#

HDF5 files can themselves be considered databases; however, h5py does not implement query functions. This chapter walks through how to find data within one or more HDF5 files based on attributes or properties, using practical examples.

In addition to the normal import, we will need some tutorial data:

import h5rdmtoolbox as h5tbx
from h5rdmtoolbox import tutorial

h5tbx.use(None)
using("h5py")

Test file#

Throughout this section the following test file will be used:

filename = tutorial.generate_sample_file()
h5tbx.dump(filename)
/
    • check_value: 0
    • project: tutorial
    • timestamp: 2026-01-09T19:50:04.769322
    pressure1 (10) [float64]
      • check_value: -140.3
      • standard_name: pressure
      • units: Pa
    velocity (3) [int64]
      • check_value: 14.2
      • standard_name: velocity
      • units: m/s
    /group1
      • check_value: 0
      velocity (5) [int64]
        • standard_name: velocity
        • units: m/s
    /group2
      • check_value: 0
      pressure2 (10) [float64]
        • check_value: -10.3
        • standard_name: pressure
        • units: kPa
      velocity (5) [float64]
        • check_value: 30.2
        • standard_name: velocity
        • units: m/s
      z 5.4 [m] [float64]
        • standard_name: z_coordinate
        • units: m

Object filter#

By passing objfilter='dataset' or objfilter='group', only objects of that type are included in the result:

h5tbx.database.find(filename, flt={'check_value': {'$exists': True}}, objfilter='group')
[<LGroup "/group2" in "/home/docs/.local/share/h5rdmtoolbox/2.6.0/tmp/tmp_1/tmp0.hdf">,
 <LGroup "/" in "/home/docs/.local/share/h5rdmtoolbox/2.6.0/tmp/tmp_1/tmp0.hdf">,
 <LGroup "/group1" in "/home/docs/.local/share/h5rdmtoolbox/2.6.0/tmp/tmp_1/tmp0.hdf">]

Special operators#

$exists#

In the example below, all objects that have the attribute “units” are returned:

results = h5tbx.database.find(filename, {'units': {'$exists': True}})
results
[<LDataset "/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.6.0/tmp/tmp_1/tmp0.hdf" attrs=(check_value=14.2, standard_name=velocity, units=m/s)>,
 <LDataset "/group2/z" in "/home/docs/.local/share/h5rdmtoolbox/2.6.0/tmp/tmp_1/tmp0.hdf" attrs=(units=m, standard_name=z_coordinate)>,
 <LDataset "/group1/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.6.0/tmp/tmp_1/tmp0.hdf" attrs=(standard_name=velocity, units=m/s)>,
 <LDataset "/pressure1" in "/home/docs/.local/share/h5rdmtoolbox/2.6.0/tmp/tmp_1/tmp0.hdf" attrs=(standard_name=pressure, units=Pa, check_value=-140.3)>,
 <LDataset "/group2/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.6.0/tmp/tmp_1/tmp0.hdf" attrs=(standard_name=velocity, check_value=30.2, units=m/s)>,
 <LDataset "/group2/pressure2" in "/home/docs/.local/share/h5rdmtoolbox/2.6.0/tmp/tmp_1/tmp0.hdf" attrs=(check_value=-10.3, standard_name=pressure, units=kPa)>]
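To make the meaning of “$exists” concrete, the following is a minimal pure-Python sketch that mimics the check on plain dictionaries. It is not how h5rdmtoolbox implements the operator: the helper `attr_exists` and the `objects` mapping (a hand-copied subset of the attributes in the test file) are hypothetical, and the `should_exist=False` case is modelled on the MongoDB-style reading of the operator, which is an assumption here.

```python
# Hypothetical stand-in for part of the test file: object path -> attributes.
objects = {
    "/": {"check_value": 0, "project": "tutorial"},
    "/pressure1": {"check_value": -140.3, "standard_name": "pressure", "units": "Pa"},
    "/group1": {"check_value": 0},
    "/group1/velocity": {"standard_name": "velocity", "units": "m/s"},
}


def attr_exists(attrs: dict, name: str, should_exist: bool = True) -> bool:
    """Return True if the presence of `name` in `attrs` equals `should_exist`."""
    return (name in attrs) == should_exist


# Objects that carry a "units" attribute, and objects that do not.
with_units = [path for path, attrs in objects.items() if attr_exists(attrs, "units")]
without_units = [
    path for path, attrs in objects.items() if attr_exists(attrs, "units", False)
]

print(with_units)
print(without_units)
```

In this sketch only the two datasets carry “units”, while the groups do not, which mirrors the query result above where no group is returned.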

$regex#

Regular expressions are useful for finding objects by a pattern in their path name:

results = h5tbx.database.find(filename, {'$name': {'$regex': '^.*/velocity$'}})
results
[<LDataset "/group2/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.6.0/tmp/tmp_1/tmp0.hdf" attrs=(standard_name=velocity, check_value=30.2, units=m/s)>,
 <LDataset "/group1/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.6.0/tmp/tmp_1/tmp0.hdf" attrs=(standard_name=velocity, units=m/s)>,
 <LDataset "/velocity" in "/home/docs/.local/share/h5rdmtoolbox/2.6.0/tmp/tmp_1/tmp0.hdf" attrs=(check_value=14.2, standard_name=velocity, units=m/s)>]
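The pattern used above can be checked in isolation with Python's standard `re` module. The sketch below only illustrates what `^.*/velocity$` matches. The path list is partly taken from the test file, and `/velocity_old` is a hypothetical extra entry added to show a non-match.

```python
import re

# Paths from the test file plus one hypothetical path ("/velocity_old").
paths = [
    "/velocity",
    "/group1/velocity",
    "/group2/velocity",
    "/group2/pressure2",
    "/velocity_old",
]

# Any prefix, followed by "/velocity" at the very end of the path.
pattern = re.compile(r"^.*/velocity$")

matches = [path for path in paths if pattern.search(path)]
print(matches)
```

Because the pattern is anchored with `$`, a path such as `/velocity_old` is rejected even though it contains the word “velocity”.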

$basename#

“Basename” is not an actual property of an HDF5 object, but it is useful for queries. It is implemented as a “virtual property”:

results = h5tbx.database.find(filename, {'$basename': 'pressure2'})
results
[<LDataset "/group2/pressure2" in "/home/docs/.local/share/h5rdmtoolbox/2.6.0/tmp/tmp_1/tmp0.hdf" attrs=(check_value=-10.3, standard_name=pressure, units=kPa)>]
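As a rough illustration of what a basename comparison means for HDF5 paths, the sketch below uses `posixpath.basename` from the standard library. The helper `has_basename` is hypothetical and is not claimed to be how the toolbox evaluates `$basename`.

```python
import posixpath

# A few object paths from the test file.
paths = ["/pressure1", "/group2/pressure2", "/group2/velocity", "/group2/z"]


def has_basename(path: str, basename: str) -> bool:
    """Compare only the last component of an HDF5-style path."""
    return posixpath.basename(path) == basename


hits = [path for path in paths if has_basename(path, "pressure2")]
print(hits)
```

Only the last path component is compared, so `/pressure1` is not a hit for `pressure2`.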

Mathematical operations (>, >=, <, <=, ==)#

Most datasets and groups in the example file have the attribute “check_value”. Let’s identify the objects whose value of this attribute satisfies a given condition:

print('object names with check value greater than 0.1:')

results = h5tbx.database.find(filename, {'check_value': {'$gt': 0.1}})
for r in results:
    print('  ', r.name)


print('\nobject names with check value equal to 0:')

results = h5tbx.database.find(filename, {'check_value': {'$eq': 0}})
for r in results:
    print('  ', r.name)


print('\nobject names with check value lower equal -3.3:')


results = h5tbx.database.find(filename, {'check_value': {'$lte': -3.3}})
for r in results:
    print('  ', r.name)
object names with check value greater than 0.1:
   /velocity
   /group2/velocity

object names with check value equal to 0:
   /group1
   /group2
   /

object names with check value lower equal -3.3:
   /pressure1
   /group2/pressure2
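The three queries above can be reproduced on plain Python data to see why each object is selected. The following is a sketch under stated assumptions: the `COMPARATORS` mapping and the `select` helper are hypothetical and not the toolbox's implementation, and the `check_values` dictionary is copied by hand from the dump of the test file, with `None` marking objects that have no “check_value” attribute.

```python
import operator

# Assumed mapping from query operator names to Python comparison functions.
COMPARATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$eq": operator.eq,
}

# check_value per object, copied from the dump of the test file.
# None marks objects without a check_value attribute.
check_values = {
    "/": 0,
    "/pressure1": -140.3,
    "/velocity": 14.2,
    "/group1": 0,
    "/group1/velocity": None,
    "/group2": 0,
    "/group2/pressure2": -10.3,
    "/group2/velocity": 30.2,
    "/group2/z": None,
}


def select(op_name: str, reference: float) -> list:
    """Return the paths whose check_value satisfies `value <op> reference`."""
    compare = COMPARATORS[op_name]
    return [
        path
        for path, value in check_values.items()
        if value is not None and compare(value, reference)
    ]


print(select("$gt", 0.1))
print(select("$eq", 0))
print(select("$lte", -3.3))
```

Objects without the attribute are skipped in this sketch rather than compared, which is consistent with `/group1/velocity` and `/group2/z` appearing in none of the three result lists above.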

User-defined operator#

Let’s take the regex example and turn it into a new operator.

The regex query filter "{'$name': {'$regex': '^.*/velocity$'}}" matches on the “basename” of an object, that is, the last component of its internal HDF5 path rather than the full path.

All operator functions are stored in the dictionary query.operator of the module h5rdmtoolbox.database.hdfdb.query. A "$basename" operator already exists, so we first remove it and then register our own function under the same name:

from h5rdmtoolbox.database.hdfdb import query

# note, all operator functions are stored in this dictionary: query.operator
query.operator.pop('$basename', None)  # remove existing operator

def my_basename_operator(obj_name, basename) -> bool:
    """calling regex under the hood"""
    print(f'Checking if basename of object name "{obj_name}" is matching pattern "^.*/{basename}$"')
    return query._regex(obj_name, pattern=f'^.*/{basename}$')

query.operator['$basename'] = my_basename_operator
results = h5tbx.database.find(filename, {'$name': {'$basename': 'velocity'}}, 'dataset')

for r in results:
    print(r.name)
Checking if basename of object name "/group1/velocity" is matching pattern "^.*/velocity$"
Checking if basename of object name "/group2/pressure2" is matching pattern "^.*/velocity$"
Checking if basename of object name "/group2/velocity" is matching pattern "^.*/velocity$"
Checking if basename of object name "/group2/z" is matching pattern "^.*/velocity$"
Checking if basename of object name "/pressure1" is matching pattern "^.*/velocity$"
Checking if basename of object name "/velocity" is matching pattern "^.*/velocity$"
/group2/velocity
/group1/velocity
/velocity
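The registration mechanism shown above follows a common pattern: a dictionary maps operator names to functions, and the query code looks the function up by name. The sketch below reproduces that pattern with the standard library only. The `operators` dictionary and the `find_names` helper are hypothetical stand-ins and do not reflect the internals of `query.operator` or `find`.

```python
import re

# Hypothetical registry, playing the role that query.operator plays in the text.
operators = {
    "$regex": lambda value, pattern: re.search(pattern, value) is not None,
}


def basename_operator(obj_name: str, basename: str) -> bool:
    """Match when the last path component of `obj_name` equals `basename`."""
    # re.escape keeps characters such as "." in the basename literal.
    return operators["$regex"](obj_name, f"^.*/{re.escape(basename)}$")


# Registering a new operator is a plain dictionary assignment.
operators["$basename"] = basename_operator


def find_names(names, flt):
    """Apply a filter of the form {'$name': {<operator>: <argument>}}."""
    op_name, argument = next(iter(flt["$name"].items()))
    return [name for name in names if operators[op_name](name, argument)]


names = [
    "/group1/velocity",
    "/group2/pressure2",
    "/group2/velocity",
    "/group2/z",
    "/pressure1",
    "/velocity",
]
matches = find_names(names, {"$name": {"$basename": "velocity"}})
print(matches)
```

Unlike the example above, this sketch escapes the basename before building the pattern. That is a design choice for the illustration: it prevents a name containing regex metacharacters from being interpreted as a pattern.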

Working with the results

Let’s investigate the return values of queries. The method find_one returns a single so-called “lazy” object, whereas find returns all matching objects of this kind; the outputs above show them as a list of LDataset and LGroup entries.

results = h5tbx.database.find_one(filename, {'$name': {'$basename': 'velocity'}}, 'dataset')

# let's directly plot the result:
results[()].plot(marker='o')
Checking if basename of object name "/group1/velocity" is matching pattern "^.*/velocity$"
Checking if basename of object name "/group2/pressure2" is matching pattern "^.*/velocity$"
Checking if basename of object name "/group2/velocity" is matching pattern "^.*/velocity$"
Checking if basename of object name "/group2/z" is matching pattern "^.*/velocity$"
Checking if basename of object name "/pressure1" is matching pattern "^.*/velocity$"
Checking if basename of object name "/velocity" is matching pattern "^.*/velocity$"
[<matplotlib.lines.Line2D at 0x73c77be7d1b0>]
[Figure: line plot of the dataset returned by find_one, drawn with circle markers.]
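The text does not spell out what “lazy” means here. One common reading, assumed for this sketch, is that the object keeps only lightweight metadata (file name, object path, attributes) and fetches the actual data when it is indexed, as with `[()]` above. The class `LazyDataset` and the function `fake_loader` below are hypothetical and only illustrate that idea; they are not the toolbox's LDataset.

```python
class LazyDataset:
    """Holds metadata only; the data is loaded each time the object is indexed."""

    def __init__(self, filename, name, attrs, loader):
        self.filename = filename
        self.name = name
        self.attrs = dict(attrs)
        self._loader = loader  # called only when data is requested

    def __getitem__(self, key):
        data = self._loader(self.filename, self.name)
        # An empty tuple means "everything", mirroring the [()] idiom.
        return data if key == () else data[key]

    def __repr__(self):
        return f'<LazyDataset "{self.name}" in "{self.filename}">'


load_calls = []


def fake_loader(filename, name):
    """Stand-in for opening a file and reading a dataset."""
    load_calls.append((filename, name))
    return [1, 2, 3]


lazy = LazyDataset("example.hdf", "/velocity", {"units": "m/s"}, fake_loader)

units = lazy.attrs["units"]       # metadata access, nothing is loaded
calls_before = len(load_calls)

values = lazy[()]                 # triggers the loader
first = lazy[0]                   # triggers the loader again

print(lazy, units, values, first, len(load_calls))
```

In this sketch every indexing operation goes back to the loader, so no file handle has to stay open between queries. Whether the toolbox caches data or reopens the file each time is not stated in the text.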