HDF5 + MongoDB#

This database interface uses pymongo as the backend. Only the metadata (or part of it) is stored in the database, not the raw data. To understand how dataset raw data is linked to the database, see the respective chapter in this document.

import pymongo
from pymongo import MongoClient

from h5rdmtoolbox import tutorial
import h5rdmtoolbox as h5tbx

import numpy as np
from pprint import pprint

h5tbx.use(None)

h5tbx.__version__
'0.13.0'

First things first: Connection to the DB:#

Connect to the local MongoDB server:

client = MongoClient()
client
MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)

Create a database and a (test) collection named “digits”:

db = client['h5database_notebook_tutorial']
collection = db['digits']

# drop all content in order to start from scratch:
collection.drop()

Test data#

We will take test data from scikit-learn, namely the hand-written digits (https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html):

from sklearn.datasets import load_digits  # ! pip install scikit-learn
digits = load_digits()

Fill an HDF5 file with the loaded data. In addition, we compute the mean count and two gray-level co-occurrence properties (dissimilarity and correlation). These three datasets, together with the true digit of each image, are linked to the image dataset via HDF dimension scales:

from skimage.feature import graycomatrix, graycoprops

filename = h5tbx.utils.generate_temporary_filename(suffix='.hdf')

with h5tbx.File(filename, 'w') as h5:
    ds_trg = h5.create_dataset('digit',
                               data=digits.target,
                               make_scale=True)
    ds_img = h5.create_dataset('images',
                               shape=(len(digits.images), 8, 8))
    
    ds_mean = h5.create_dataset('mean',
                                shape=(len(digits.images), ),
                                make_scale=True)
    ds_diss = h5.create_dataset('dissimilarity',
                                shape=(len(digits.images), ),
                                make_scale=True)
    ds_corr = h5.create_dataset('correlation',
                                shape=(len(digits.images), ),
                                make_scale=True)
    
    
    for i, img in enumerate(digits.images):
        ds_img[i, :, :] = img
        ds_mean[i] = np.mean(img)
        
        glcm = graycomatrix(img.astype(int), distances=[5], angles=[0], levels=256,
                            symmetric=True, normed=True)
        ds_diss[i] = graycoprops(glcm, 'dissimilarity')[0, 0]
        ds_corr[i] = graycoprops(glcm, 'correlation')[0, 0]
        
    ds_img.dims[0].attach_scale(ds_trg)
    ds_img.dims[0].attach_scale(ds_mean)
    ds_img.dims[0].attach_scale(ds_diss)
    ds_img.dims[0].attach_scale(ds_corr)
    h5.dump()
    • __h5rdmtoolbox_version__ : 0.9.0
    correlation (1797) [float32]
    digit (1797) [int32]
    dissimilarity (1797) [float32]
    images (digit: 1797, 8, 8) [float32]
    mean (1797) [float32]

Filling the database#

To insert data from the HDF5 file into the DB, we need the “mongo” accessor:

from h5rdmtoolbox.database import mongo
with h5tbx.File(filename) as h5:
    h5['images'].mongo.insert(0, collection, update=False)

Count the number of documents inserted:

collection.count_documents({})
1797

Find one:#

one_res = collection.find_one({'digit': {'$eq': 3}})
one_res
{'_id': ObjectId('650afa3eb61edd6511e8dc77'),
 'filename': 'C:\\Users\\da4323\\AppData\\Local\\h5rdmtoolbox\\h5rdmtoolbox\\tmp\\tmp_0\\tmp0.hdf',
 'name': '/images',
 'basename': 'images',
 'file_creation_time': datetime.datetime(2023, 9, 20, 13, 57, 13, 651000),
 'shape': [1797, 8, 8],
 'ndim': 3,
 'hdfobj': 'dataset',
 'slice': [[3, 4, 1], [0, None, 1], [0, None, 1]],
 'digit': 3,
 'mean': 4.171875,
 'dissimilarity': 4.875,
 'correlation': -0.3547935485839844}

We found only one entry because we asked for only one. Note that the result dictionary provides a “slice” entry. This is the slice within the 3D array in the HDF5 file, which we can use to extract the corresponding image. There is even an accessor method that simplifies this:

import matplotlib.pyplot as plt

with h5tbx.File(filename) as h5:
    plt.figure(figsize=(4,3))
    h5.images.mongo.slice(one_res['slice']).plot(cmap='gray')
../_images/4f6ebb492434473da7af5999cf24d7d91097ed1d16fd17431e05765b88d2a196.png
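Under the hood, the stored “slice” triplets are ordinary (start, stop, step) values. As a rough sketch (the helper name is made up for illustration), they can be turned into Python slice objects and applied manually with plain h5py:

```python
def to_slices(slice_list):
    """Convert stored [start, stop, step] triplets into a tuple of slice objects."""
    return tuple(slice(start, stop, step) for start, stop, step in slice_list)

# e.g. the 'slice' entry of the query result above:
slices = to_slices([[3, 4, 1], [0, None, 1], [0, None, 1]])
print(slices)  # (slice(3, 4, 1), slice(0, None, 1), slice(0, None, 1))

# with the file at hand, plain h5py could then read the same data:
# import h5py
# with h5py.File(one_res['filename'], 'r') as h5:
#     img = h5[one_res['name']][slices]
```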

Find many:#

Let’s query a range of data. The mean count shall be above 5 counts and the digit greater than 3 and at most 8:

collection.count_documents({'mean': {'$gt': 5}, 'digit': {'$gt': 3, '$lte': 8}})
many_res = collection.find({'mean': {'$gt': 5}, 'digit': {'$gt': 3, '$lte': 8}})
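A caveat when building such range queries: a Python dict literal cannot hold the same key twice, so writing `'digit': {'$gt': 3}, 'digit': {'$lte': 8}` silently keeps only the second condition. Both bounds must share one operator document, or be combined with an explicit `$and`:

```python
# duplicate keys: the second 'digit' entry silently replaces the first
broken = {'mean': {'$gt': 5}, 'digit': {'$gt': 3}, 'digit': {'$lte': 8}}
print(broken)  # {'mean': {'$gt': 5}, 'digit': {'$lte': 8}}

# correct: both bounds in one operator document ...
query = {'mean': {'$gt': 5}, 'digit': {'$gt': 3, '$lte': 8}}

# ... or, equivalently, an explicit $and:
query_and = {'$and': [{'mean': {'$gt': 5}},
                      {'digit': {'$gt': 3}},
                      {'digit': {'$lte': 8}}]}
```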

Inspect the result with the help of pandas:

import pandas as pd
df = pd.DataFrame(data=[r for r in many_res.rewind()])
df.drop('_id', inplace=True, axis=1)
df.drop('filename', inplace=True, axis=1)
df.drop('slice', inplace=True, axis=1)

pd.plotting.scatter_matrix(df[['mean', 'dissimilarity', 'correlation', 'digit']], hist_kwds={'bins': 20})
df.head()
name basename file_creation_time shape ndim hdfobj digit mean dissimilarity correlation
0 /images images 2023-09-20 13:57:13.651 [1797, 8, 8] 3 dataset 2 5.375000 6.291667 -0.360439
1 /images images 2023-09-20 13:57:13.651 [1797, 8, 8] 3 dataset 5 5.343750 7.250000 -0.430473
2 /images images 2023-09-20 13:57:13.651 [1797, 8, 8] 3 dataset 8 5.578125 7.416667 -0.469501
3 /images images 2023-09-20 13:57:13.651 [1797, 8, 8] 3 dataset 0 5.031250 7.875000 -0.520575
4 /images images 2023-09-20 13:57:13.651 [1797, 8, 8] 3 dataset 3 5.015625 6.375000 -0.383382
../_images/9772842ef37694f8d1c8501e75ceb2b6e287849981fdca07939715fd49e81c1f.png

Query for other metadata#

Instead of a single dataset, we could have inserted the full file: either only the content of one group, or really all data in the file, i.e. a recursive run that inserts everything. Let’s do that:

db = client['h5database_notebook_tutorial']
collection_full_digits = db['full_digits']

# drop all content in order to start from scratch:
collection_full_digits.drop()
with h5tbx.File(filename) as h5:
    h5.mongo.insert(collection_full_digits, recursive=True)

The first entry looks like this:

collection_full_digits.find_one({})
{'_id': ObjectId('650afa3fbf7512e9120cbd28'),
 '__h5rdmtoolbox_version__': '0.9.0',
 'basename': '',
 'file_creation_time': datetime.datetime(2023, 9, 20, 13, 57, 13, 651000),
 'filename': 'C:\\Users\\da4323\\AppData\\Local\\h5rdmtoolbox\\h5rdmtoolbox\\tmp\\tmp_0\\tmp0.hdf',
 'hdfobj': 'group',
 'name': '/'}

It is the data of the root group and shows all attributes of that group.

Comparison to file query#

The MongoDB approach maps the metadata to the NoSQL database MongoDB. This step is not needed for the “serverless” solution, so there are some differences:

  • Mapping vs. no mapping: For one-off queries, the mapping into MongoDB adds overhead without much benefit

  • Moving files: If files are moved, the filenames must be updated in the database. At the moment, this must be done manually!

  • Query time: Let’s find out which approach is faster in the test below…
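For the second point, a minimal sketch of such a manual update, assuming files were moved from one directory to another (the prefixes and the helper name are made up for illustration):

```python
def remap_filename(filename: str, old_prefix: str, new_prefix: str) -> str:
    """Replace the leading path prefix of a stored filename (hypothetical helper)."""
    if filename.startswith(old_prefix):
        return new_prefix + filename[len(old_prefix):]
    return filename

print(remap_filename('/old/data/tmp0.hdf', '/old/data/', '/new/data/'))  # /new/data/tmp0.hdf

# with a live collection, the documents could then be rewritten like this:
# import re
# for doc in collection.find({'filename': {'$regex': '^' + re.escape('/old/data/')}}):
#     collection.update_one({'_id': doc['_id']},
#                           {'$set': {'filename': remap_filename(doc['filename'],
#                                                                '/old/data/', '/new/data/')}})
```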

For a fair comparison, we need to change the starting point. Above, we mapped one file containing all images into MongoDB. Let’s now write $n$ files, with $n$ being the number of images. This time we only write the image and put the digit into an attribute of the dataset:

db_dir = h5tbx.utils.generate_temporary_directory()
for i in range(digits.images.shape[0]):
    with h5tbx.File(db_dir / f'f{i}.hdf', 'w') as h5:
        ds_img = h5.create_dataset('images', data=digits.images[i], attrs={'digit': digits.target[i]})
%%time
means = [float(ds[()].mean()) for ds in h5tbx.FileDB(db_dir).find({'digit': 3})]
plt.hist(means)
_ = plt.xlabel('mean pixel count')
CPU times: total: 1.53 s
Wall time: 2.85 s
../_images/3bd1fd74fbf76965432f740fce4ce2f79a5b2c072a4872e8d98636d429856ee5.png

Prepare the collection:

import h5py

new_collection = db['digits_individual_files']

# drop all content in order to start from scratch:
new_collection.drop()

For a fair comparison, we must include the filling of the database. The rest of the code is essentially what happens in the background of the code above…

%%time
for f in db_dir.glob('*.hdf'):
    with h5tbx.File(f) as h5:
        h5['images'].mongo.insert(None, new_collection, update=False)

mongoDB_means=[]
for r in new_collection.find({'digit': 3}):
    with h5py.File(str(r['filename']), mode='r') as h5:
        mongoDB_means.append(h5['images'][:,:].mean())
        
plt.hist(mongoDB_means)
_ = plt.xlabel('mean pixel count')
CPU times: total: 859 ms
Wall time: 4.21 s
../_images/45ec6f85bc6af994d0fc29cc85beecd4a050b3968a2718cec2e72070e0c7a5f7.png

The MongoDB approach is quite fast if we consider the search part alone. In total it is slower, as expected. However, given that all the file opening and closing happens sequentially in the file-based solution as well, the time difference is not big!
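To back such a comparison up, one could time the query phase and the file-reading phase separately. A small sketch with a generic timing helper (the helper name is made up; the commented lines show how it could be applied to the collection from above):

```python
import time

def timed(fn):
    """Run fn() and return its result together with the elapsed wall time in seconds."""
    t0 = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - t0

# sanity check with a cheap computation:
result, elapsed = timed(lambda: sum(range(1000)))
print(result, elapsed >= 0.0)  # 499500 True

# with a live collection, the query phase alone could be timed like this:
# docs, t_query = timed(lambda: list(new_collection.find({'digit': 3})))
# the file-reading phase would then be timed separately over docs.
```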