First steps: HDF5 and databases

There are two ways of working with HDF5 and databases:

  1. Using HDF5 file(s) as a database itself.

  2. Writing HDF5 content into dedicated database solutions.

Both approaches are described in the next two chapters. The second one is currently implemented for MongoDB.

However, before we start, let’s understand how the database interface is designed.

Design idea

Regardless of whether we want to use HDF5 files themselves as databases or connect their content to third-party solutions, we need a common user interface. The h5RDMtoolbox provides an abstract class (HDF5DBInterface) from which any database interface must inherit.

from h5rdmtoolbox.database import HDF5DBInterface

Four methods must be implemented: insert_dataset, insert_group, find and find_one (and, of course, an __init__ method). The find and find_one methods return so-called lazy objects. In short, these are interfaces to the HDF5 dataset and group objects that allow accessing data while the source file is closed. Examples are given at the end.

from typing import List
from h5rdmtoolbox.database.lazy import LHDFObject

class MyDBInterface(HDF5DBInterface):

    def __init__(self, *args, **kwargs):
        """init the db"""
    
    def insert_dataset(self, *args, **kwargs):
        """inserting datasets into a database"""
    
    def insert_group(self, *args, **kwargs):
        """inserting groups into a database"""

    def find(self, *args, **kwargs) -> List[LHDFObject]:
        """find (many) objects according to the query parameters"""
        
    def find_one(self, *args, **kwargs) -> LHDFObject:
        """find a single object according to the query parameters"""

The usage of a database interface will then look like this for all database implementations:

import h5rdmtoolbox as h5tbx

mydb = MyDBInterface()

with h5tbx.File() as h5:
    h5.create_group('a group')
    h5.create_dataset('my_dataset', shape=(4, 2))
    # ... 
    mydb.insert_dataset(h5['my_dataset'])
    mydb.insert_group(h5['a group'])

many_res = mydb.find(...)
single_res = mydb.find_one(...)
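To make the role of the four methods more tangible, here is a toy sketch that uses only the Python standard library. It does not use the toolbox: SketchDBInterface is a hypothetical stand-in for HDF5DBInterface, InMemoryDB is an invented backend, and the method signatures (a name plus a dictionary of attributes, and a dictionary as query) are assumptions chosen for illustration. The real interface passes HDF5 objects and returns lazy objects instead of plain dictionaries.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, List, Optional


class SketchDBInterface(ABC):
    """Hypothetical stand-in for HDF5DBInterface (not the toolbox class)."""

    @abstractmethod
    def insert_dataset(self, name: str, attrs: Dict[str, Any]) -> None: ...

    @abstractmethod
    def insert_group(self, name: str, attrs: Dict[str, Any]) -> None: ...

    @abstractmethod
    def find(self, query: Dict[str, Any]) -> Iterator[Dict[str, Any]]: ...

    @abstractmethod
    def find_one(self, query: Dict[str, Any]) -> Optional[Dict[str, Any]]: ...


class InMemoryDB(SketchDBInterface):
    """Toy backend: every inserted object becomes one record in a list."""

    def __init__(self) -> None:
        self._records: List[Dict[str, Any]] = []

    def _insert(self, kind: str, name: str, attrs: Dict[str, Any]) -> None:
        self._records.append({"kind": kind, "name": name, "attrs": dict(attrs)})

    def insert_dataset(self, name, attrs):
        self._insert("dataset", name, attrs)

    def insert_group(self, name, attrs):
        self._insert("group", name, attrs)

    def find(self, query):
        # Yield every record whose attributes contain all query key/value pairs.
        for rec in self._records:
            attrs = rec["attrs"]
            if all(k in attrs and attrs[k] == v for k, v in query.items()):
                yield rec

    def find_one(self, query):
        # Return the first match, or None if nothing matches.
        return next(self.find(query), None)


db = InMemoryDB()
db.insert_group("a group", {"project": "demo"})
db.insert_dataset("my_dataset", {"project": "demo", "units": "m/s"})

matches = list(db.find({"project": "demo"}))
first = db.find_one({"units": "m/s"})
```

Because the calling code only relies on the four method names, swapping this toy backend for a file-based or MongoDB-based one would not change how insert and find are used.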

A word on lazy objects (return values of database find methods)

The find methods return so-called lazy objects: find_one returns a single lazy object, find returns several of them (as a list or generator). What is a lazy object?

There are two types: LDataset and LGroup, the lazy objects for datasets and groups. These objects are linked to HDF5 datasets and groups; the only difference is that the user can work with them even when the file is closed.

Example: The standard approach is to open a file whenever data or information needs to be accessed:

with h5tbx.File() as h5:
    h5.create_dataset('my_dataset', shape=(4, 2))

# some other code....

# after a while, we want to access the data again and need to reopen the file:
with h5tbx.File(h5.hdf_filename) as h5:
    ds = h5['my_dataset'][()]

The LDataset allows accessing a dataset without actively opening the file (the object takes care of it in the background):

with h5tbx.File() as h5:
    x = h5.create_dataset('x', data=[1, 2, 3, 4], make_scale=True)
    y = h5.create_dataset('y', data=[10, 20], make_scale=True)
    h5.create_dataset('my_dataset', shape=(4, 2), attach_scales=(x, y))

    lds = h5tbx.database.lazy.LDataset(h5['my_dataset'])

lds[()]  # access the data although the file is closed
<xarray.DataArray 'my_dataset' (x: 4, y: 2)> Size: 32B
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Coordinates:
  * x        (x) int64 32B 1 2 3 4
  * y        (y) int64 16B 10 20

The “laziness” behind this is that the object takes care of opening and closing the file in the background. It is just a convenient way of accessing data from an HDF5 file without the extra code and worry of properly opening and closing the file.
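The open-on-access pattern described above can be sketched with the standard library alone. This is not how LDataset is implemented: LazyRecord is a hypothetical class, and a JSON file stands in for the HDF5 file so the example runs without extra dependencies. The point is only that the handle stores the filename and the object name, and opens the file for the duration of each access.

```python
import json
import os
import tempfile


class LazyRecord:
    """Hypothetical lazy handle: remembers where the data lives, not the data."""

    def __init__(self, filename: str, name: str) -> None:
        self.filename = filename
        self.name = name

    def __getitem__(self, index):
        # Open the file only for this access; it is closed again on return.
        with open(self.filename) as f:
            data = json.load(f)[self.name]
        if index == ():
            return data  # mimic the [()] idiom: return everything
        return data[index]


# Write a stand-in "file" and close it again.
path = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(path, "w") as f:
    json.dump({"my_dataset": [1, 2, 3, 4]}, f)

# The handle holds no open file, yet data can be read at any time.
lazy = LazyRecord(path, "my_dataset")
everything = lazy[()]
last = lazy[-1]
part = lazy[1:3]
```

The trade-off of this design is that every access pays the cost of opening the file, in exchange for never leaving a file handle open by accident.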

Moreover, the object has additional functionality, such as slicing the array based on the dimension scales/coordinates:

lds.sel(x=4, y=20)
<xarray.DataArray 'my_dataset' ()> Size: 4B
0.0
Coordinates:
    x        int64 8B 4
    y        int64 8B 20

We can do the same thing with groups. It is just less useful because the datasets are usually of greater interest…