HDF5 + mongoDB#
This database solution uses pymongo as the backend. Only metadata (or part of it) is stored in the database, not the raw data. To understand how dataset raw data is linked to the database, see the respective chapter in this document.
from pymongo import MongoClient
import h5rdmtoolbox as h5tbx
import numpy as np
from pprint import pprint
h5tbx.use(None)
h5tbx.__version__
'0.13.0'
First things first: Connection to the DB#
Connect to the MongoDB server (a local mongod instance) via a client:
client = MongoClient()
client
MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)
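Note that MongoClient connects lazily: the call above succeeds even if no mongod is running, and the first real database operation then blocks until the server-selection timeout. A defensive sketch using plain pymongo options (the URI and timeout values are just examples):
from pymongo.errors import ServerSelectionTimeoutError

client = MongoClient('mongodb://localhost:27017', serverSelectionTimeoutMS=2000)
try:
    client.admin.command('ping')  # cheap round trip to verify the server is reachable
except ServerSelectionTimeoutError:
    print('no running mongod found on localhost:27017')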
Create a database and a (test) collection named “digits”:
db = client['h5database_notebook_tutorial']
collection = db['digits']
# drop all content in order to start from scratch:
collection.drop()
Test data#
We will take test data from scikit-learn, namely the hand-written digits dataset (https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html):
from sklearn.datasets import load_digits # ! pip install scikit-learn
digits = load_digits()
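A quick look at what was loaded, namely 1797 images of 8x8 pixels and the corresponding true digits:
digits.images.shape, digits.target.shape  # ((1797, 8, 8), (1797,))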
Fill an HDF5 file with the loaded data. We additionally compute the mean pixel count and two gray-level co-occurrence (GLCM) properties (dissimilarity and correlation). Those three datasets, together with the true digit of each image, are attached to the image dataset as HDF5 dimension scales:
from skimage.feature import graycomatrix, graycoprops
filename = h5tbx.utils.generate_temporary_filename(suffix='.hdf')
with h5tbx.File(filename, 'w') as h5:
ds_trg = h5.create_dataset('digit',
data=digits.target,
make_scale=True)
ds_img = h5.create_dataset('images',
shape=(len(digits.images), 8, 8))
ds_mean = h5.create_dataset('mean',
shape=(len(digits.images), ),
make_scale=True)
ds_diss = h5.create_dataset('dissimilarity',
shape=(len(digits.images), ),
make_scale=True)
ds_corr = h5.create_dataset('correlation',
shape=(len(digits.images), ),
make_scale=True)
for i, img in enumerate(digits.images):
ds_img[i, :, :] = img
ds_mean[i] = np.mean(img)
glcm = graycomatrix(img.astype(int), distances=[5], angles=[0], levels=256,
symmetric=True, normed=True)
ds_diss[i] = graycoprops(glcm, 'dissimilarity')[0, 0]
ds_corr[i] = graycoprops(glcm, 'correlation')[0, 0]
ds_img.dims[0].attach_scale(ds_trg)
ds_img.dims[0].attach_scale(ds_mean)
ds_img.dims[0].attach_scale(ds_diss)
ds_img.dims[0].attach_scale(ds_corr)
h5.dump()
/ (root group)
  __h5rdmtoolbox_version__ : 0.9.0
  correlation:    (1797) [float32]
  digit:          (1797) [int32]
  dissimilarity:  (1797) [float32]
  images:         (digit: 1797, 8, 8) [float32]
  mean:           (1797) [float32]
Filling the database#
To insert data from the HDF5 file into the DB, we need the accessor “mongo”:
from h5rdmtoolbox.database import mongo
with h5tbx.File(filename) as h5:
    # axis 0: one document per slice along the first dimension, i.e. one per image
    h5['images'].mongo.insert(0, collection, update=False)
Count the number of documents inserted (one per image):
collection.count_documents({})
1797
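Each document describes one image slice; the attached dimension scales (digit, mean, dissimilarity, correlation) became plain document fields. A quick check with a pymongo projection:
collection.find_one({}, {'digit': 1, 'mean': 1, '_id': 0})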
Find one#
one_res = collection.find_one({'digit': {'$eq': 3}})
one_res
{'_id': ObjectId('650afa3eb61edd6511e8dc77'),
'filename': 'C:\\Users\\da4323\\AppData\\Local\\h5rdmtoolbox\\h5rdmtoolbox\\tmp\\tmp_0\\tmp0.hdf',
'name': '/images',
'basename': 'images',
'file_creation_time': datetime.datetime(2023, 9, 20, 13, 57, 13, 651000),
'shape': [1797, 8, 8],
'ndim': 3,
'hdfobj': 'dataset',
'slice': [[3, 4, 1], [0, None, 1], [0, None, 1]],
'digit': 3,
'mean': 4.171875,
'dissimilarity': 4.875,
'correlation': -0.3547935485839844}
We get a single entry because find_one returns only one document. Note that the result dictionary provides a “slice” entry: it locates this image within the 3D array in the HDF5 file, so we can use it to slice the array. There is even a method in the accessor to simplify this:
import matplotlib.pyplot as plt
with h5tbx.File(filename) as h5:
plt.figure(figsize=(4,3))
h5.images.mongo.slice(one_res['slice']).plot(cmap='gray')
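Under the hood, this amounts to turning the stored [start, stop, step] triples into Python slice objects and applying them with plain h5py. A minimal sketch (to_slices is a hypothetical helper, not part of the toolbox):
import h5py

def to_slices(slice_list):
    # convert [[start, stop, step], ...] into a tuple of slice objects
    return tuple(slice(*s) for s in slice_list)

with h5py.File(one_res['filename'], 'r') as h5:
    img = h5[one_res['name']][to_slices(one_res['slice'])]  # shape (1, 8, 8)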

Find many#
Let’s query a range of data: the mean count shall be above 5 and the digit greater than 3 and at most 8:
collection.count_documents({'mean': {'$gt': 5}, 'digit': {'$gt': 3, '$lte': 8}})
many_res = collection.find({'mean': {'$gt': 5}, 'digit': {'$gt': 3, '$lte': 8}})
Inspect the result with the help of pandas:
import pandas as pd
df = pd.DataFrame(data=[r for r in many_res.rewind()])
df.drop('_id', inplace=True, axis=1)
df.drop('filename', inplace=True, axis=1)
df.drop('slice', inplace=True, axis=1)
pd.plotting.scatter_matrix(df[['mean', 'dissimilarity', 'correlation', 'digit']], hist_kwds={'bins': 20})
df.head()
| | name | basename | file_creation_time | shape | ndim | hdfobj | digit | mean | dissimilarity | correlation |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /images | images | 2023-09-20 13:57:13.651 | [1797, 8, 8] | 3 | dataset | 2 | 5.375000 | 6.291667 | -0.360439 |
| 1 | /images | images | 2023-09-20 13:57:13.651 | [1797, 8, 8] | 3 | dataset | 5 | 5.343750 | 7.250000 | -0.430473 |
| 2 | /images | images | 2023-09-20 13:57:13.651 | [1797, 8, 8] | 3 | dataset | 8 | 5.578125 | 7.416667 | -0.469501 |
| 3 | /images | images | 2023-09-20 13:57:13.651 | [1797, 8, 8] | 3 | dataset | 0 | 5.031250 | 7.875000 | -0.520575 |
| 4 | /images | images | 2023-09-20 13:57:13.651 | [1797, 8, 8] | 3 | dataset | 3 | 5.015625 | 6.375000 | -0.383382 |

Query for other metadata#
So far we have only inserted a single dataset. We could instead insert the content of a single group or, with a recursive run, all objects in the file. Let’s do that:
db = client['h5database_notebook_tutorial']
collection_full_digits = db['full_digits']
# drop all content in order to start from scratch:
collection_full_digits.drop()
with h5tbx.File(filename) as h5:
h5.mongo.insert(collection_full_digits, recursive=True)
The first entry looks like this:
collection_full_digits.find_one({})
{'_id': ObjectId('650afa3fbf7512e9120cbd28'),
'__h5rdmtoolbox_version__': '0.9.0',
'basename': '',
'file_creation_time': datetime.datetime(2023, 9, 20, 13, 57, 13, 651000),
'filename': 'C:\\Users\\da4323\\AppData\\Local\\h5rdmtoolbox\\h5rdmtoolbox\\tmp\\tmp_0\\tmp0.hdf',
'hdfobj': 'group',
'name': '/'}
It is the document of the root group and shows all attributes of that group.
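Since every HDF5 object is now a document, we can, for example, count groups and datasets via the hdfobj field shown above:
for kind in ('group', 'dataset'):
    print(kind, collection_full_digits.count_documents({'hdfobj': kind}))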
Comparison to file query#
The mongoDB approach maps the metadata into the NoSQL database mongoDB. This step is not needed for the “serverless” file-based solution. So there are differences:
- Mapping vs. no mapping: For one-off queries, the mapping into mongoDB does not pay off.
- Moving files: If files are moved, the filenames must be updated in the database. This currently has to be done manually (see the sketch after this list)!
- Query time: Let’s find out what is faster in the test below…
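A sketch of such a manual filename update with plain pymongo, assuming the files were moved to a hypothetical new_dir:
from pathlib import Path

new_dir = Path('/new/location')  # hypothetical target directory
for doc in collection.find({}, {'filename': 1}):
    new_path = new_dir / Path(doc['filename']).name
    collection.update_one({'_id': doc['_id']}, {'$set': {'filename': str(new_path)}})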
For a fair comparison, we need to change the starting point. Above, we mapped one file containing all images into mongoDB. Let’s now write $n$ files, with $n$ being the number of images. This time we only write the image and store the digit as an attribute of the dataset:
db_dir = h5tbx.utils.generate_temporary_directory()
for i in range(digits.images.shape[0]):
with h5tbx.File(db_dir / f'f{i}.hdf', 'w') as h5:
ds_img = h5.create_dataset('images', data=digits.images[i], attrs={'digit': digits.target[i]})
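A quick sanity check that one file per image was written:
len(list(db_dir.glob('*.hdf')))  # 1797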
%%time
means = [float(ds[()].mean()) for ds in h5tbx.FileDB(db_dir).find({'digit': 3})]
plt.hist(means)
_ = plt.xlabel('mean pixel count')
CPU times: total: 1.53 s
Wall time: 2.85 s

Prepare the collection:
import h5py
new_collection = db['digits_individual_files']
# drop all content in order to start from scratch:
new_collection.drop()
For a fair comparison, we must include the time to fill the database. The rest of the code is essentially what happens in the background of the FileDB query above…
%%time
# fill the database: one document per file (axis=None: one document for the whole dataset)
for f in db_dir.glob('*.hdf'):
    with h5tbx.File(f) as h5:
        h5['images'].mongo.insert(None, new_collection, update=False)

# query the database, then read the raw data from the referenced files
mongoDB_means = []
for r in new_collection.find({'digit': 3}):
    with h5py.File(str(r['filename']), mode='r') as h5:
        mongoDB_means.append(h5['images'][:, :].mean())

plt.hist(mongoDB_means)
_ = plt.xlabel('mean pixel count')
CPU times: total: 859 ms
Wall time: 4.21 s
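To back up the claim below that the query itself is fast, the two phases can be timed separately (a sketch reusing the calls from the cell above):
import time

new_collection.drop()  # start from scratch so the fill phase is measured fully

t0 = time.perf_counter()
for f in db_dir.glob('*.hdf'):
    with h5tbx.File(f) as h5:
        h5['images'].mongo.insert(None, new_collection, update=False)
t1 = time.perf_counter()

for r in new_collection.find({'digit': 3}):
    with h5py.File(str(r['filename']), mode='r') as h5:
        h5['images'][()].mean()
t2 = time.perf_counter()

print(f'fill DB: {t1 - t0:.2f} s, query + read: {t2 - t1:.2f} s')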

The mongoDB approach is pretty fast if we consider the query part only. In total, i.e. including filling the database, it is slower, as expected. However, given all the sequential file opening and closing, just as in the file-based solution, the time difference is not big!