First steps: HDF5 and databases


There are two ways of working with HDF5 and databases:

1. Using the HDF5 file(s) themselves as a database.
2. Writing HDF5 content into dedicated database solutions.

Both approaches are described in the next two chapters. The second approach is currently implemented for MongoDB.

However, before we start, let’s understand how the database interface is designed.

Design idea

Regardless of whether we want to use HDF5 files themselves as databases or connect their content to third-party solutions, we need to write a user interface. The h5RDMtoolbox provides an abstract class (HDF5DBInterface) from which every database interface must inherit.

from h5rdmtoolbox.database import HDF5DBInterface

Four methods must be implemented: insert_dataset, insert_group, find and find_one (plus, of course, an __init__ method). find and find_one return so-called lazy objects: find returns all matches, find_one a single match. In short, lazy objects are interfaces to the HDF5 dataset and group objects that allow accessing data while the source file is closed. Examples are given at the end.

from typing import List

from h5rdmtoolbox.database import HDF5DBInterface
from h5rdmtoolbox.wrapper.lazy import LHDFObject

class MyDBInterface(HDF5DBInterface):

    def __init__(self, *args, **kwargs):
        """Initialize the database."""

    def insert_dataset(self, *args, **kwargs):
        """Insert a dataset into the database."""

    def insert_group(self, *args, **kwargs):
        """Insert a group into the database."""

    def find(self, *args, **kwargs) -> List[LHDFObject]:
        """Find all objects matching the query parameters."""

    def find_one(self, *args, **kwargs) -> LHDFObject:
        """Find a single object matching the query parameters."""

The usage of a database interface will then look like this for all database implementations:

import h5rdmtoolbox as h5tbx

mydb = MyDBInterface()

with h5tbx.File() as h5:
    h5.create_group('a group')
    h5.create_dataset('my_dataset', shape=(4, 2))
    # ... 
    mydb.insert_dataset(h5['my_dataset'])
    mydb.insert_group(h5['a group'])

many_res = mydb.find(...)
single_res = mydb.find_one(...)
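To make the design idea more concrete, here is a minimal, self-contained sketch of the same pattern that uses only the Python standard library. It is not the toolbox's implementation: ToyDBInterface, InMemoryDB and Record are hypothetical names introduced for illustration, and Record is only a rough stand-in for a lazy object (it remembers the file name, the object name and the attributes, but cannot read data). The sketch assumes that queries are plain dictionaries matched against attributes, which need not be how a real interface defines its queries.

```python
import abc
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Record:
    """Hypothetical stand-in for a lazy object: it only remembers
    where an HDF5 object lives and which attributes it carries."""
    filename: str
    name: str
    kind: str  # "dataset" or "group"
    attrs: Dict[str, Any] = field(default_factory=dict)


class ToyDBInterface(abc.ABC):
    """Hypothetical abstract base class mirroring the four required methods."""

    @abc.abstractmethod
    def insert_dataset(self, record: Record) -> None: ...

    @abc.abstractmethod
    def insert_group(self, record: Record) -> None: ...

    @abc.abstractmethod
    def find(self, query: Dict[str, Any]) -> List[Record]: ...

    @abc.abstractmethod
    def find_one(self, query: Dict[str, Any]) -> Optional[Record]: ...


class InMemoryDB(ToyDBInterface):
    """Toy backend that keeps all inserted records in a Python list."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def insert_dataset(self, record: Record) -> None:
        self._records.append(record)

    def insert_group(self, record: Record) -> None:
        self._records.append(record)

    def find(self, query: Dict[str, Any]) -> List[Record]:
        # A record matches if every key/value pair of the query
        # is present in its attributes.
        return [
            r for r in self._records
            if all(k in r.attrs and r.attrs[k] == v for k, v in query.items())
        ]

    def find_one(self, query: Dict[str, Any]) -> Optional[Record]:
        # Return the first match, or None if nothing matches.
        matches = self.find(query)
        return matches[0] if matches else None


db = InMemoryDB()
db.insert_group(Record("data.hdf", "/a group", "group", {"project": "demo"}))
db.insert_dataset(
    Record("data.hdf", "/my_dataset", "dataset", {"project": "demo", "units": "m/s"})
)

many_res = db.find({"project": "demo"})
single_res = db.find_one({"units": "m/s"})
```

Because the base class declares the four methods as abstract, Python refuses to instantiate a subclass that leaves one of them out. This is what makes the calling code look the same for every backend.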
