First steps: HDF5 and databases
There are two ways of working with HDF5 and databases:
Using HDF5 file(s) as a database itself.
Writing HDF5 content into dedicated database solutions.
Both ways are described in the next two chapters. The second approach is currently implemented for MongoDB.
However, before we start, let’s understand how the database interface is designed.
Design idea
Regardless of whether we want to use HDF5 files themselves as databases or connect their content to third-party solutions, we need a common user interface. The h5RDMtoolbox provides an abstract class (HDF5DBInterface) from which every database interface must inherit.
from h5rdmtoolbox.database import HDF5DBInterface
Four methods must be implemented: insert_dataset, insert_group, find and find_one (and of course an __init__ method). The return types for find and find_one are so-called lazy objects. In short: They are interfaces to the HDF5 dataset and group objects and allow accessing data while the source file is closed. Examples are given at the end.
from typing import List

from h5rdmtoolbox.wrapper.lazy import LHDFObject


class MyDBInterface(HDF5DBInterface):

    def __init__(self, *args, **kwargs):
        """init the db"""

    def insert_dataset(self, *args, **kwargs):
        """insert a dataset into the database"""

    def insert_group(self, *args, **kwargs):
        """insert a group into the database"""

    def find(self, *args, **kwargs) -> List[LHDFObject]:
        """find (many) objects according to the query parameters"""

    def find_one(self, *args, **kwargs) -> LHDFObject:
        """find a single object according to the query parameters"""
The usage of a database interface will then look like this for all database implementations:
import h5rdmtoolbox as h5tbx

mydb = MyDBInterface()

with h5tbx.File() as h5:
    h5.create_group('a group')
    h5.create_dataset('my_dataset', shape=(4, 2))
    # ...
    mydb.insert_dataset(h5['my_dataset'])
    mydb.insert_group(h5['a group'])

many_res = mydb.find(...)
single_res = mydb.find_one(...)
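To make the design idea concrete without a real backend, here is a minimal, self-contained sketch of the same pattern using only the standard library. It is not the toolbox's implementation: `SketchDBInterface` and `InMemoryDB` are hypothetical names, records are plain dictionaries rather than lazy objects, and the use of `abc` to enforce the four methods is an assumption about how such an abstract class could be written.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, List, Optional


class SketchDBInterface(ABC):
    """Hypothetical stand-in for an abstract database interface."""

    @abstractmethod
    def insert_dataset(self, name: str, attrs: Dict[str, Any]) -> None: ...

    @abstractmethod
    def insert_group(self, name: str, attrs: Dict[str, Any]) -> None: ...

    @abstractmethod
    def find(self, **query: Any) -> Iterator[Dict[str, Any]]: ...

    @abstractmethod
    def find_one(self, **query: Any) -> Optional[Dict[str, Any]]: ...


class InMemoryDB(SketchDBInterface):
    """Toy implementation that keeps plain dict records in a list."""

    def __init__(self) -> None:
        self._records: List[Dict[str, Any]] = []

    def insert_dataset(self, name, attrs):
        self._records.append({"kind": "dataset", "name": name, **attrs})

    def insert_group(self, name, attrs):
        self._records.append({"kind": "group", "name": name, **attrs})

    def find(self, **query):
        # Yield every record whose fields match all query items.
        for rec in self._records:
            if all(rec.get(k) == v for k, v in query.items()):
                yield rec

    def find_one(self, **query):
        # Return the first match, or None if nothing matches.
        return next(self.find(**query), None)


db = InMemoryDB()
db.insert_group("a group", {"owner": "me"})
db.insert_dataset("my_dataset", {"units": "m/s"})
hits = list(db.find(kind="dataset"))
first = db.find_one(units="m/s")
```

Because the base class declares the four methods as abstract, a subclass that omits any of them cannot be instantiated, so every backend ends up offering the same four entry points.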
A word on lazy objects (return values of database find methods)
The find methods return so-called lazy objects (or a generator of lazy objects). What is a lazy object?
There are two types: LDataset and LGroup, the lazy counterparts of datasets and groups. These objects are connected to HDF5 datasets and groups; the only difference is that the user can work with them even when the file is closed.
Example: The standard approach is to open a file whenever data or information needs to be accessed:
with h5tbx.File() as h5:
    h5.create_dataset('my_dataset', shape=(4, 2))

# some other code....

# after a while, we want to access the data again and need to reopen the file:
with h5tbx.File(h5.hdf_filename) as h5:
    ds = h5['my_dataset'][()]
The LDataset allows accessing a dataset without actively opening the file (the object takes care of it in the background):
with h5tbx.File() as h5:
    x = h5.create_dataset('x', data=[1, 2, 3, 4], make_scale=True)
    y = h5.create_dataset('y', data=[10, 20], make_scale=True)
    h5.create_dataset('my_dataset', shape=(4, 2), attach_scales=(x, y))
    lds = h5tbx.database.lazy.LDataset(h5['my_dataset'])

lds[()]  # access the data although the file is closed
<xarray.DataArray 'my_dataset' (x: 4, y: 2)>
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Coordinates:
* x (x) int32 1 2 3 4
* y (y) int32 10 20
The “laziness” behind this is that the object takes care of opening and closing the file in the background. It is just a convenient way of accessing data from an hdf file without the extra code and worries of properly opening and closing the file.
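The open-on-access idea can be illustrated with a small analogy that needs no HDF5 library. The sketch below is not how LDataset is implemented: `LazyRecord` is a hypothetical class, and a JSON file stands in for the HDF5 file. The handle stores only a file path and a key, and opens the file just for the duration of each read.

```python
import json
import os
import tempfile


class LazyRecord:
    """Hypothetical lazy handle: stores only a file path and a key."""

    def __init__(self, filename: str, key: str) -> None:
        self.filename = filename
        self.key = key

    def __getitem__(self, index):
        # Open the file only for the duration of this access.
        with open(self.filename) as f:
            data = json.load(f)[self.key]
        # The file is closed again before the value is returned.
        return data if index == () else data[index]


# Write a small file, then close it.
path = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(path, "w") as f:
    json.dump({"my_dataset": [1, 2, 3, 4]}, f)

lazy = LazyRecord(path, "my_dataset")
everything = lazy[()]  # full read, mirroring the lds[()] idiom
last_two = lazy[2:]    # partial read; the file is reopened for this call
```

The trade-off of this design is one open/close cycle per access, which is convenient for occasional reads but can be slower than keeping the file open for many reads in a row.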
Moreover, the object has additional functionality, such as slicing the array based on the dimension scales/coordinates:
lds.sel(x=4, y=20)
<xarray.DataArray 'my_dataset' ()>
0.0
Coordinates:
x int32 4
y int32 20
We can do the same thing with groups. It is just less useful because the datasets are usually of greater interest…
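Coming back to the coordinate-based selection `lds.sel(x=4, y=20)`: conceptually, each coordinate value is translated into a position along its dimension, and the array is then indexed by position. The following is a rough sketch of that idea with plain lists. The helper `sel` and the sample values are made up for illustration, and exact matching of coordinate values is assumed.

```python
from typing import Dict, List, Sequence


def sel(data: List[List[float]], coords: Dict[str, Sequence[int]],
        x: int, y: int) -> float:
    """Look up a value of a 2D array by coordinate values instead of indices."""
    ix = list(coords["x"]).index(x)  # position of the requested x value
    iy = list(coords["y"]).index(y)  # position of the requested y value
    return data[ix][iy]


# Same coordinates as in the example above; the data values are illustrative.
coords = {"x": [1, 2, 3, 4], "y": [10, 20]}
data = [[float(10 * i + j) for j in range(2)] for i in range(4)]

value = sel(data, coords, x=4, y=20)  # same element as data[3][1]
```

A coordinate value that does not exist in the scale raises a `ValueError` in this sketch, since `list.index` only finds exact matches.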