HDF5 + mongoDB

This database interface uses mongoDB as the backend database and accesses it via pymongo. The user communicates with it through the interface h5rdmtoolbox.database.MongoDB.

The idea is not to store all data in the mongoDB database; if we did, there would be no need to write an HDF5 file in the first place.
Instead, we use the database as metadata storage, which is much more efficient to search than the HDF5 files themselves. The steps are:

insert (metadata) into the database (the interface takes care of how this is done)
perform a query
the interface runs the query on mongoDB…
…and returns a lazy HDF object to the user
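
The steps above can be sketched in plain Python. This is only an illustration of the concept, not how h5rdmtoolbox implements it: the list `metadata_store` stands in for a mongoDB collection, and `LazyRef`, `insert`, `find` and the file name `digits.hdf` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LazyRef:
    """Hypothetical lazy object: it only remembers where the data lives."""
    filename: str
    name: str
    index: Optional[int] = None  # None means "the whole dataset"


# A plain list stands in for a mongoDB collection (assumption for this sketch).
metadata_store = []


def insert(filename, name, index, **metadata):
    """Step 1: store metadata only, never the array data itself."""
    metadata_store.append(
        {"filename": filename, "name": name, "index": index, **metadata}
    )


def find(**conditions):
    """Step 2: query the metadata and yield lazy references to the matches."""
    for doc in metadata_store:
        if all(doc.get(key) == value for key, value in conditions.items()):
            yield LazyRef(doc["filename"], doc["name"], doc["index"])


insert("digits.hdf", "images", 0, digit=0)
insert("digits.hdf", "images", 1, digit=3)
insert("digits.hdf", "images", 2, digit=3)

hits = list(find(digit=3))
print(hits)
```

The array data would only be read from the HDF5 file once the user actually slices such a reference, which is what keeps the search itself cheap.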

[Figure: mongoDB concept (mongoDB_concept.png)]

import pymongo
# from pymongo import MongoClient
from mongomock import MongoClient  # for docs or testing only

from h5rdmtoolbox import tutorial
import h5rdmtoolbox as h5tbx

import numpy as np
from pprint import pprint

h5tbx.use(None)

using("h5py")

First things first: Connection to the DB

Create a client that connects to the mongoDB server (mongod). For these docs, mongomock stands in for a running server:

client = MongoClient()
client

mongomock.MongoClient('localhost', 27017)

Create a database and a (test) collection named “digits”:

db = client['h5database_notebook_tutorial']
collection = db['digits']

# drop all content in order to start from scratch:
collection.drop()

Test data

We will take test data from scikit-learn, namely the hand-written digits dataset (https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html):

# ! pip install scikit-learn
from sklearn.datasets import load_digits
digits = load_digits()

Fill an HDF5 file with the loaded data. For each image, we additionally compute the mean pixel value and two gray-level co-occurrence matrix (GLCM) properties (dissimilarity and correlation). These three datasets, together with the true digit of each image, are attached to the first dimension of the image dataset as HDF dimension scales:

from skimage.feature import graycomatrix, graycoprops

filename = h5tbx.utils.generate_temporary_filename(suffix='.hdf')

with h5tbx.File(filename, 'w') as h5:
    ds_trg = h5.create_dataset('digit',
                               data=digits.target,
                               make_scale=True)
    ds_img = h5.create_dataset('images',
                               shape=(len(digits.images), 8, 8))
    
    ds_mean = h5.create_dataset('mean',
                                shape=(len(digits.images), ),
                                make_scale=True)
    ds_diss = h5.create_dataset('dissimilarity',
                                shape=(len(digits.images), ),
                                make_scale=True)
    ds_corr = h5.create_dataset('correlation',
                                shape=(len(digits.images), ),
                                make_scale=True)
    
    
    for i, img in enumerate(digits.images):
        ds_img[i, :, :] = img
        ds_mean[i] = np.mean(img)
        
        glcm = graycomatrix(img.astype(int), distances=[5], angles=[0], levels=256,
                            symmetric=True, normed=True)
        ds_diss[i] = graycoprops(glcm, 'dissimilarity')[0, 0]
        ds_corr[i] = graycoprops(glcm, 'correlation')[0, 0]
        
    ds_img.dims[0].attach_scale(ds_trg)
    ds_img.dims[0].attach_scale(ds_mean)
    ds_img.dims[0].attach_scale(ds_diss)
    ds_img.dims[0].attach_scale(ds_corr)
    h5.dump()

Insert (metadata) into the database

To insert data from the HDF5 file into the DB, we first need to initialize the mongoDB interface class.

from h5rdmtoolbox.database import mongo

mdb = mongo.MongoDB(collection=collection)
mdb.collection.drop()  # clean the collection just to be certain that it is really empty

Next, we have two options:

Insert a full dataset
Insert slices of a dataset according to the dimension scales
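
To make the difference between the two options concrete, here is a minimal sketch of the metadata documents each option could produce. The document layout (`name`, `shape`, `slice`) and the helper `build_documents` are assumptions made for illustration; the schema that h5rdmtoolbox actually writes may differ.

```python
# Toy data: three "images" and one dimension scale ("digit") along axis 0.
images_shape = (3, 8, 8)
digit_scale = [0, 3, 3]


def build_documents(name, shape, scales, axis=None):
    """Return metadata documents for one dataset (illustrative schema only)."""
    if axis is None:
        # Option 1: a single document describing the full array.
        return [{"name": name, "shape": shape, "slice": None}]
    # Option 2: one document per index along `axis`, each carrying the
    # dimension-scale values that belong to that index.
    sub_shape = shape[:axis] + shape[axis + 1:]
    documents = []
    for i in range(shape[axis]):
        doc = {"name": name, "shape": sub_shape,
               "slice": {"axis": axis, "index": i}}
        doc.update({scale: values[i] for scale, values in scales.items()})
        documents.append(doc)
    return documents


full = build_documents("images", images_shape, {"digit": digit_scale}, axis=None)
sliced = build_documents("images", images_shape, {"digit": digit_scale}, axis=0)

print(len(full), len(sliced))
print(sliced[1])
```

Under this assumed layout, only the per-slice documents carry the scale values, so only they can be filtered by a condition such as a specific digit.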

We go through both options below to understand what each one means:

First option

Let’s insert the dataset “images” into the database:

with h5tbx.File(filename) as h5:
    mdb.insert_dataset(h5['images'], axis=None)

This will result in one document:

collection.count_documents({})  # or mdb.collection.count_documents({})

Let’s find it (quite trivial…):

res = mdb.find_one({})

Let’s slice the (lazy) dataset and ask for its shape:

res[()].shape

We asked for the shape in order to compare it with the second option for inserting multi-dimensional arrays into mongoDB:

Second option

Now we set axis to 0. This “cuts” the image dataset into $N=1797$ subarrays along the first dimension and inserts them individually. Let’s do it and see afterwards what the advantage is:

mdb.collection.drop()  # clean the collection

with h5tbx.File(filename) as h5:
    mdb.insert_dataset(h5['images'], axis=0, update=False)

Count the number of documents inserted. It is equal to the number of images (1797):

collection.count_documents({})  # or mdb.collection.count_documents({})

The image dataset has dimension scales (you may want to inspect the content of the HDF file again in the output of the dump() call above). By searching, for example, for digit=3, we find the correct slice for an image corresponding to this condition:

one_res = mdb.find_one({'digit': {'$eq': 3}})
one_res[()].shape

With the first option (axis=None), we would have to work with the full xarray object and determine ourselves which slice corresponds to a digit equal to 3. With the sliced insertion, the mongoDB query does this for us and returns the matching image directly:

one_res[()].plot(cmap='gray')

import matplotlib.pyplot as plt

res = mdb.find({'digit': {'$eq': 3}})

# plot the first 5 results:
for i in range(5):
    next(res)[()].plot()
    plt.show()
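
The filters used in the queries above, such as {'digit': {'$eq': 3}}, follow mongoDB's query syntax, where the shorthand {'digit': 3} means the same as the explicit `$eq` form. The following pure-Python sketch mimics that matching for a small subset of comparison operators. It assumes flat documents with scalar values; `matches` and the sample documents are hypothetical and do not reproduce every mongoDB rule.

```python
import operator

# Subset of mongoDB comparison operators, mapped to Python comparisons.
OPERATORS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(document, query):
    """Return True if `document` satisfies every field condition in `query`."""
    for field, condition in query.items():
        if field not in document:
            return False
        value = document[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"$gte": 3, "$lt": 5}
            if not all(OPERATORS[op](value, operand)
                       for op, operand in condition.items()):
                return False
        elif value != condition:
            # Shorthand form, e.g. {"digit": 3}
            return False
    return True


# Made-up metadata documents, one per image slice.
documents = [
    {"digit": 0, "mean": 4.5},
    {"digit": 3, "mean": 5.0},
    {"digit": 3, "mean": 6.5},
    {"digit": 7, "mean": 5.5},
]

explicit = [d for d in documents if matches(d, {"digit": {"$eq": 3}})]
shorthand = [d for d in documents if matches(d, {"digit": 3})]
combined = [d for d in documents
            if matches(d, {"digit": {"$eq": 3}, "mean": {"$gt": 6}})]

print(len(explicit), len(shorthand), len(combined))
```

Combining several fields in one filter, as in the last query, is how a search could also be narrowed by the other dimension scales (mean, dissimilarity, correlation).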
