Serverless HDF Database#

HDF files can be treated as a database themselves.

import h5rdmtoolbox as h5tbx

h5tbx.use(None)
using("h5py")

Search/Find within a single file#

A single file can already be considered a database. Say we have a nested file and want to identify a dataset or group based on an attribute.

The method .find() can be called from any group. It expects a dictionary as the filter. The syntax is very similar to that of pymongo. The filter request is built up as follows:

find({<keyword>: <value>}, <object_filter>)

  • keyword is either

    • an attribute name or

    • a property of a dataset or group object, like “name”, “shape”, “dtype”, …

  • To indicate that you are searching for a property, put a “$” in front of the keyword

  • The object filter can be either “$dataset” or “$group”. If it is not provided, results of both object types are returned
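
The rules above can be illustrated with a small, library-independent sketch. It is not how h5rdmtoolbox implements find(); FakeObject and matches are hypothetical names, used only to show how a “$” prefix could switch between a property lookup and an attribute lookup for exact matches:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeObject:
    """Hypothetical stand-in for an HDF5 dataset or group (not a library class)."""
    name: str
    kind: str  # "dataset" or "group"
    shape: tuple = ()
    attrs: dict = field(default_factory=dict)

    @property
    def basename(self) -> str:
        return self.name.rsplit("/", 1)[-1]


def matches(obj: FakeObject, query: dict, objfilter: Optional[str] = None) -> bool:
    """Return True if obj satisfies every exact-match entry of query."""
    if objfilter is not None and objfilter.lstrip("$") != obj.kind:
        return False
    for key, expected in query.items():
        if key.startswith("$"):
            # "$"-prefixed keys address object properties (name, shape, ...)
            actual = getattr(obj, key[1:], None)
        else:
            # all other keys address attributes
            actual = obj.attrs.get(key)
        if actual != expected:
            return False
    return True


objects = [
    FakeObject("/", "group", attrs={"project": "tutorial"}),
    FakeObject("/velocity", "dataset", (3,), {"units": "m/s"}),
    FakeObject("/group1/velocity", "dataset", (5,), {"units": "m/s"}),
    FakeObject("/group2/z", "dataset", (), {"units": "m"}),
]

by_property = [o.name for o in objects if matches(o, {"$basename": "velocity"}, "$dataset")]
by_attribute = [o.name for o in objects if matches(o, {"units": "m"})]
print(by_property)
print(by_attribute)
```

Here the first query selects by the property basename and the second by the attribute units; leaving out the object filter means that groups and datasets are both considered.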

Let’s build a file first:

with h5tbx.File() as h5:
    h5.write_iso_timestamp(name='timestamp', dt=None) # writes the current date time in iso format to the attribute
    h5.attrs['project'] = 'tutorial'
    h5.attrs['contact'] = {'name': 'John Doe', 'surname': 'Doe'}
    h5.create_dataset('velocity', data=[1,2,-1], attrs=dict(units='m/s', standard_name='x_velocity'))
    g = h5.create_group('group1')
    g.create_dataset('velocity', data=[4,0,-3,12,3], attrs=dict(units='m/s', standard_name='x_velocity'))
    g = h5.create_group('group2')
    g.create_dataset('velocity', data=[12,11.3,4.6,7.3,8.1], attrs=dict(units='m/s', standard_name='x_velocity'))
    g.create_dataset('z', data=5.4, attrs=dict(units='m', standard_name='z_coordinate'))
    h5.dump()
    filename = h5.hdf_filename
    • __h5rdmtoolbox_version__ : 0.13.0
    • contact : {'name': 'John Doe', 'surname': 'Doe'}
    • project : tutorial
    • timestamp : 2023-11-24T16:05:25.651953
      (3) [int64]
      • standard_name : x_velocity
      • units : m/s
        (5) [int64]
        • standard_name : x_velocity
        • units : m/s
        (5) [float64]
        • standard_name : x_velocity
        • units : m/s
        5.4 [m] (float64)
        • standard_name : z_coordinate
        • units : m
with h5tbx.File(filename) as h5:
    results = h5.find({'$basename': 'velocity'}, '$dataset')
    print(results)
[<HDF5 dataset "velocity": shape (5,), type "<f8", convention "h5py">, <HDF5 dataset "velocity": shape (3,), type "<i8", convention "h5py">, <HDF5 dataset "velocity": shape (5,), type "<i8", convention "h5py">]

The above code has one restriction: We cannot use results anymore after the file has been closed:

results
[<Closed HDF5 dataset (convention "h5py")>,
 <Closed HDF5 dataset (convention "h5py")>,
 <Closed HDF5 dataset (convention "h5py")>]

To keep access to the results even after the file has been closed, it is recommended to use h5tbx.FileDB. On this object, the same find calls can be executed, with the difference that so-called “lazy” datasets or groups are returned. These special objects are interfaces to the actual objects and allow accessing attributes and properties while the source file is closed. As long as the file exists, the dataset values can also be accessed by slicing the lazy dataset. This opens the original file, slices the dataset and returns the values before closing the file again:

results = h5tbx.FileDB(filename).find({'$basename': 'velocity'}, '$dataset')
results
[<LDataset "/group2/velocity" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=x_velocity)>,
 <LDataset "/velocity" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=x_velocity)>,
 <LDataset "/group1/velocity" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=x_velocity)>]

The result objects are “lazy objects”. All attributes and properties can be accessed while the file is closed. Only when a lazy dataset is sliced is the file actually opened:

results[0]
<LDataset "/group2/velocity" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=x_velocity)>
results[0][()]
<xarray.DataArray 'velocity' (dim_0: 5)>
12.0 11.3 4.6 7.3 8.1
Dimensions without coordinates: dim_0
Attributes:
    standard_name:  x_velocity
    units:          m/s
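
The idea behind such lazy objects can be sketched without the library. In the following minimal example, a JSON file stands in for the HDF file, and LazyRecord is a hypothetical class, not part of h5rdmtoolbox. Metadata is copied once when the object is created, and the file is re-opened only while a slice is being read:

```python
import json
import tempfile
from pathlib import Path


class LazyRecord:
    """Hypothetical lazy handle: metadata is cached, data is read on demand."""

    def __init__(self, filename: Path, name: str):
        self.filename = Path(filename)
        self.name = name
        # Read the metadata once and keep a copy; the file is closed afterwards.
        with open(self.filename) as f:
            self.attrs = dict(json.load(f)[name]["attrs"])

    def __getitem__(self, index):
        # Re-open the source file only for the duration of this access.
        with open(self.filename) as f:
            data = json.load(f)[self.name]["data"]
        return data[index]


# A JSON file stands in for the HDF file in this sketch.
tmpdir = Path(tempfile.mkdtemp())
source = tmpdir / "data.json"
source.write_text(json.dumps({
    "/group2/velocity": {"data": [12, 11.3, 4.6, 7.3, 8.1], "attrs": {"units": "m/s"}}
}))

lazy = LazyRecord(source, "/group2/velocity")
units = lazy.attrs["units"]  # served from the cached copy, no file access
first_two = lazy[0:2]        # opens the file, slices, closes it again
print(units, first_two)
```

As in the text above, slicing only works as long as the source file still exists, whereas the cached attributes remain available regardless.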

Shortcuts#

Sometimes the query dictionary above is a bit unwieldy. For example, finding any dataset that has a certain attribute, no matter its value, amounts to asking “Give me every dataset that has this attribute”.

To put this into a query, we need a regex formulation like so: {'standard_name': {'$regex': '.*'}}. The shortcut is to just pass the name of the attribute we are looking for, or a list of names:

h5tbx.FileDB(filename).find_one('standard_name', '$dataset')
<LDataset "/group2/velocity" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=x_velocity)>
h5tbx.FileDB(filename).find_one(['standard_name', 'units'], '$dataset')
<LDataset "/group1/velocity" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=x_velocity)>
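
The relation between the shortcut and the full query can be written down as a small helper. This is a sketch of the equivalence described above; expand_shortcut is a hypothetical name, and whether the library performs the expansion in exactly this way is an assumption:

```python
from typing import Dict, List, Union


def expand_shortcut(flt: Union[str, List[str], Dict]) -> Dict:
    """Turn an attribute name (or a list of names) into an 'attribute exists' query."""
    if isinstance(flt, str):
        return {flt: {"$regex": ".*"}}
    if isinstance(flt, (list, tuple)):
        return {name: {"$regex": ".*"} for name in flt}
    return flt  # already a full query dictionary


single = expand_shortcut("standard_name")
several = expand_shortcut(["standard_name", "units"])
print(single)
print(several)
```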

Advanced searches#

Advanced attribute searches#

Up to here, we searched for exact matches of attribute or property values. Sometimes different operators are required, e.g. to check whether a value is greater than a reference or matches a regular expression.

Here’s a list of what is possible:

  • $regex: Match the attribute string with a pattern

  • $gt: Find only objects where the attribute/property is greater than a given value

  • $lt: …

  • $eq: Exact match

results = h5tbx.FileDB(filename).find({'timestamp': {'$regex': '.*'}}, '$group')
results[0].attrs['timestamp']
'2023-11-24T16:05:25.651953'
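
One way to picture these operators is a lookup table that maps each operator string to a comparison function. The table and the function check below are a simplified, hypothetical sketch and do not reproduce the library's own operator implementations:

```python
import re

# Hypothetical operator table: each entry takes (stored value, reference value).
OPERATORS = {
    "$regex": lambda value, pattern: isinstance(value, str) and re.search(pattern, value) is not None,
    "$gt": lambda value, ref: value is not None and value > ref,
    "$lt": lambda value, ref: value is not None and value < ref,
    "$eq": lambda value, ref: value == ref,
}


def check(value, condition) -> bool:
    """Evaluate a plain value (exact match) or an {operator: reference} dictionary."""
    if not isinstance(condition, dict):
        return value == condition
    # All operators given in the dictionary must be satisfied.
    return all(OPERATORS[op](value, ref) for op, ref in condition.items())


print(check("2023-11-24T16:05:25.651953", {"$regex": ".*"}))
print(check(5.4, {"$gt": 2.3, "$lt": 10}))
print(check(None, {"$regex": ".*"}))
```

In this sketch a missing attribute (None) never matches “$regex”, which is what makes the pattern '.*' usable as an "attribute exists" test.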

Say we want to search for values within dictionaries, e.g. for the surname in the attribute contact (see file creation above):

h5tbx.FileDB(filename).find({'contact.surname': {'$regex': 'Doe'}})
[<LGroup "/" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf">]
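
The dotted key 'contact.surname' addresses a value inside a nested dictionary. A minimal sketch of such a lookup, with the hypothetical helper resolve_dotted (how the library resolves the key internally is not shown in this tutorial):

```python
def resolve_dotted(attrs: dict, dotted_key: str):
    """Follow 'a.b.c' through nested dictionaries; return None if any level is missing."""
    current = attrs
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


attrs = {"project": "tutorial", "contact": {"name": "John Doe", "surname": "Doe"}}
print(resolve_dotted(attrs, "contact.surname"))
print(resolve_dotted(attrs, "contact.email"))
```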

Advanced dataset searches#

It is also possible to search based on dataset values. Note that this can be very slow and memory-demanding if the query is performed on large datasets. The main use case is querying 0D datasets only.

To search based on dataset values, start the query with an operator string, for instance “$eq” or “$gt”:

# find the dataset "z". It is 0D with data=5.4. Although irrelevant for this example, it is generally
# better to have a chained search because then the time intensive comparison with dataset value
# is only performed on a pre-filtered result list
res1 = h5tbx.FileDB(filename).find({'standard_name': 'z_coordinate', '$eq': 5.4})
res2 = h5tbx.FileDB(filename).find({'standard_name': 'z_coordinate'}).find({'$eq': 5.4})
res2 == res1, res1
(True,
 [<LDataset "/group2/z" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m, standard_name=z_coordinate)>])

The query value above is a number. If we pass a dictionary instead, we can tell the query to perform an operation on the dataset before the comparison. In the example below, the mean() of each dataset found is computed first, before the “greater-than” comparison ($gt) is executed. Note that this can be computationally intensive if you have many and/or large datasets in your file(s).

results = h5tbx.FileDB(filename).find({'$gt': {'$mean': 2.3}})
results
[<LDataset "/group2/velocity" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=x_velocity)>,
 <LDataset "/group1/velocity" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=x_velocity)>,
 <LDataset "/group2/z" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m, standard_name=z_coordinate)>]
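
The "reduce first, then compare" logic can be reproduced with plain Python lists holding the values written to the tutorial file. The names COMPARE, REDUCE and value_matches are hypothetical, and the treatment of plain numbers (0D data only) is an assumption of this sketch rather than documented library behaviour:

```python
import operator
import statistics

COMPARE = {"$gt": operator.gt, "$lt": operator.lt, "$eq": operator.eq}
REDUCE = {"$mean": statistics.fmean}


def value_matches(data, query: dict) -> bool:
    """Check dataset values against e.g. {'$eq': 5.4} or {'$gt': {'$mean': 2.3}}."""
    for op, ref in query.items():
        if isinstance(ref, dict):
            # Reduce the data first, then compare the reduced value.
            reducer, ref_value = next(iter(ref.items()))
            values = data if isinstance(data, list) else [data]
            candidate = REDUCE[reducer](values)
        else:
            if isinstance(data, list):
                return False  # assumption: plain comparisons target 0D data only
            candidate, ref_value = data, ref
        if not COMPARE[op](candidate, ref_value):
            return False
    return True


datasets = {
    "/velocity": [1, 2, -1],
    "/group1/velocity": [4, 0, -3, 12, 3],
    "/group2/velocity": [12, 11.3, 4.6, 7.3, 8.1],
    "/group2/z": 5.4,
}

hits_mean = [name for name, d in datasets.items() if value_matches(d, {"$gt": {"$mean": 2.3}})]
hits_eq = [name for name, d in datasets.items() if value_matches(d, {"$eq": 5.4})]
print(hits_mean)
print(hits_eq)
```

The dataset /velocity drops out of the first query because its mean is below 2.3, which agrees with the result list shown above.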

User-defined operator#

With the operators shown so far, we cannot check whether the timestamp is later (or earlier) than a given timestamp, i.e. whether the data was written before or after a given point in time. For this, we need to write our own operator:

import re
from typing import Dict, Union
from datetime import datetime
from h5rdmtoolbox.database import file

def isodatetime_operator(value, flt: Union[datetime, Dict]) -> bool:
    if value is None:
        return False
        
    if isinstance(flt, dict):
        av_flt = ('$gt', '$gte', '$lt', '$lte', '$eq')
        for k in flt:
            if k not in av_flt:
                raise KeyError(f'Invalid filter operator: {k}, expected one of these: {av_flt}')
    elif isinstance(flt, str):
        raise TypeError(f'You must pass a datetime object, not {type(flt)}')
    else:
        # a bare datetime is treated as an exact-match request
        flt = {'$eq': flt}

    # only check attributes that are string or datetime
    if isinstance(value, str):
        pattern = r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+'
        match = re.search(pattern, value)
        if match is None:
            return False
        if match.group() == '':
            return False
        value = datetime.fromisoformat(value)
    elif isinstance(value, datetime):
        pass
    else:
        return False

    # Now perform the actual datetime comparison:
    for k, v in flt.items():
        if not h5tbx.database.file.operator[k](value, v):
            return False
    return True
file.operator['$isodatetime'] = isodatetime_operator
from datetime import datetime

results = h5tbx.FileDB(filename).find({'$name': '/',
                                              'timestamp': {'$isodatetime': {'$lt': datetime.now(),
                                                                             '$gt': datetime(2020, 6, 4)}}},
                                             '$group')
results
[<LGroup "/" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf">]

Find within one or multiple folders#

To search within a folder, call h5tbx.FolderDB. Pass rec=False if the files should not be searched for recursively:

h5tbx.database.Folder(filename.parent, rec=True).find_one({'$basename': 'velocity'}, '$dataset').name
'/velocity'
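
What recursive versus non-recursive file collection means can be shown with pathlib alone. The function collect_files and the file pattern "*.hdf" are assumptions of this sketch; which file suffixes the folder database actually picks up is not stated here:

```python
import tempfile
from pathlib import Path


def collect_files(folder: Path, rec: bool = True, pattern: str = "*.hdf"):
    """Return matching files directly in folder, or in all sub-folders too if rec is True."""
    folder = Path(folder)
    found = folder.rglob(pattern) if rec else folder.glob(pattern)
    return sorted(found)


# Build a small folder tree with two HDF-like files and one unrelated file.
root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "a.hdf").touch()
(root / "sub" / "b.hdf").touch()
(root / "notes.txt").touch()

flat = [p.name for p in collect_files(root, rec=False)]
deep = [p.name for p in collect_files(root, rec=True)]
print(flat)
print(deep)
```

A query over a folder can then be thought of as running the same single-file query on each collected file and concatenating the results.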

Examples of queries:#

h5tbx.database.File(filename).find({'$name': '/velocity'}, '$dataset')
[<LDataset "/velocity" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=x_velocity)>]
h5tbx.database.File(filename).find({'$basename': {'$regex': 'group'}}, '$group', rec=False)
[<LGroup "/group1" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf">,
 <LGroup "/group2" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf">]
h5tbx.database.File(filename).find({'$shape': (5,)}, '$dataset')
[<LDataset "/group2/velocity" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=x_velocity)>,
 <LDataset "/group1/velocity" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=x_velocity)>]
datasets = h5tbx.database.File(filename).find({'$ndim': 1}, '$dataset')
from h5rdmtoolbox.database import file
sorted(datasets)
[<LDataset "/group1/velocity" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=x_velocity)>,
 <LDataset "/group2/velocity" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=x_velocity)>,
 <LDataset "/velocity" in "/home/docs/.local/share/h5rdmtoolbox/tmp/tmp_1/tmp0.hdf" attrs=(units=m/s, standard_name=x_velocity)>]
with sorted(datasets)[0] as ds:
    r = file.find(ds, {'$name': {'$regex': 'group[0-9]'}}, '$dataset', False, True, False)
    print(r)
<HDF5 dataset "velocity": shape (5,), type "<i8", convention "h5py">

Query from an opened file:#

from pprint import pprint

with h5tbx.File(filename) as h5:
    print('find basename=velocity in root:')
    pprint(h5.find({'$basename': 'velocity'}, '$dataset'))
    
    print('\nfind name=/velocity in root:')
    pprint(h5.find({'$name': '/velocity'}, '$dataset'))
    
    print('\nfind name=/sub/velocity in root:')
    pprint(h5.find({'$name': '/sub/velocity'}, '$dataset'))
    
    print('\nfind basename=velocity in group1/:')
    pprint(h5['group1'].find({'$basename': 'velocity'}, '$dataset', rec=False))
    
    print('\nfind basename=velocity in root:')
    pprint(h5.find({'$basename': 'velocity'}, '$dataset', rec=False))
    
    print('\nfind basename=velocity in root:')
    pprint(h5.find({'$basename': 'velocity'}, '$dataset', rec=True))
find basename=velocity in root:
[<HDF5 dataset "velocity": shape (3,), type "<i8", convention "h5py">,
 <HDF5 dataset "velocity": shape (5,), type "<f8", convention "h5py">,
 <HDF5 dataset "velocity": shape (5,), type "<i8", convention "h5py">]

find name=/velocity in root:
[<HDF5 dataset "velocity": shape (3,), type "<i8", convention "h5py">]

find name=/sub/velocity in root:
[]

find basename=velocity in group1/:
[<HDF5 dataset "velocity": shape (5,), type "<i8", convention "h5py">]

find basename=velocity in root:
[<HDF5 dataset "velocity": shape (3,), type "<i8", convention "h5py">]

find basename=velocity in root:
[<HDF5 dataset "velocity": shape (3,), type "<i8", convention "h5py">,
 <HDF5 dataset "velocity": shape (5,), type "<f8", convention "h5py">,
 <HDF5 dataset "velocity": shape (5,), type "<i8", convention "h5py">]