{ "cells": [ { "cell_type": "markdown", "id": "26a6643f-3d3c-4cf4-99c7-32ddd5a4ec0a", "metadata": {}, "source": [ "# HDF5 + mongoDB\n", "\n", "This database interface uses `pymongo` to communicate with a mongoDB backend. The user interacts with it through the interface `h5rdmtoolbox.database.MongoDB`.\n", "\n", "The idea is not to store all data in a mongoDB database; otherwise there would be no need to write an HDF5 file in the first place.\n", "We rather use the database as a **metadata storage**, which is much more efficient to search through. So the steps are:\n", "1. insert (meta)data into the database (the interface takes care of how this is done)\n", "2. perform a query through the interface\n", "3. the interface runs the query on mongoDB...\n", "4. ...and returns a lazy HDF object to the user\n", "\n", "![mongoDB concept](../../_static/mongoDB_concept.png)" ] },
{ "cell_type": "code", "execution_count": null, "id": "1e3c96fc-62ac-4462-8e87-d131aeb52c53", "metadata": {}, "outputs": [], "source": [ "import pymongo\n", "# from pymongo import MongoClient\n", "from mongomock import MongoClient  # for docs or testing only\n", "\n", "from h5rdmtoolbox import tutorial\n", "import h5rdmtoolbox as h5tbx\n", "\n", "import numpy as np\n", "from pprint import pprint\n", "\n", "h5tbx.use(None)" ] },
{ "cell_type": "markdown", "id": "d13b2108-f659-495c-84f0-db7a206acfdb", "metadata": {}, "source": [ "## First things first: connecting to the DB\n", "Create a client that connects to the `mongod` server (here, `mongomock` stands in for a real server):" ] },
{ "cell_type": "code", "execution_count": null, "id": "b40e8640-b166-4792-913b-6ce39765eef4", "metadata": {}, "outputs": [], "source": [ "client = MongoClient()\n", "client" ] },
{ "cell_type": "markdown", "id": "1b4f6cf9-bcfb-4f24-9052-303bdfcd0515", "metadata": {}, "source": [ "Create a database and a (test) collection named \"digits\":" ] },
{ "cell_type": "code", "execution_count": null, "id": "a8f3cb21-e1e3-4969-9103-a48ac49547b9", "metadata": {}, "outputs": [], "source": [ "db = client['h5database_notebook_tutorial']\n", "collection = db['digits']\n", "\n", "# drop all content in order to start from scratch:\n", "collection.drop()" ] },
{ "cell_type": "markdown", "id": "9417ee13-271e-43aa-9a23-7dee3c296ff7", "metadata": {}, "source": [ "## Test data\n", "We will take test data from scikit-learn, namely the hand-written digits 
(https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html):" ] },
{ "cell_type": "code", "execution_count": null, "id": "111e3c0d-6d5d-4931-9aac-e4f9e6575088", "metadata": {}, "outputs": [], "source": [ "# ! pip install scikit-learn\n", "from sklearn.datasets import load_digits\n", "digits = load_digits()" ] },
{ "cell_type": "markdown", "id": "52d15e57-0fac-4c5c-8570-9aaba55f7fbe", "metadata": {}, "source": [ "Fill an HDF5 file with the loaded data. We additionally compute the mean pixel value and two gray-level co-occurrence matrix (GLCM) properties (dissimilarity and correlation) for each image. These three datasets, together with the true digit of each image, are attached to the image dataset as HDF dimension scales:" ] },
{ "cell_type": "code", "execution_count": null, "id": "3c7f5e27-c173-4a14-941c-2614466219a5", "metadata": {}, "outputs": [], "source": [ "from skimage.feature import graycomatrix, graycoprops\n", "\n", "filename = h5tbx.utils.generate_temporary_filename(suffix='.hdf')\n", "\n", "with h5tbx.File(filename, 'w') as h5:\n", "    ds_trg = h5.create_dataset('digit',\n", "                               data=digits.target,\n", "                               make_scale=True)\n", "    ds_img = h5.create_dataset('images',\n", "                               shape=(len(digits.images), 8, 8))\n", "\n", "    ds_mean = h5.create_dataset('mean',\n", "                                shape=(len(digits.images), ),\n", "                                make_scale=True)\n", "    ds_diss = h5.create_dataset('dissimilarity',\n", "                                shape=(len(digits.images), ),\n", "                                make_scale=True)\n", "    ds_corr = h5.create_dataset('correlation',\n", "                                shape=(len(digits.images), ),\n", "                                make_scale=True)\n", "\n", "    # fill the image dataset and compute the per-image properties\n", "    for i, img in enumerate(digits.images):\n", "        ds_img[i, :, :] = img\n", "        ds_mean[i] = np.mean(img)\n", "\n", "        glcm = graycomatrix(img.astype(int), distances=[5], angles=[0], levels=256,\n", "                            symmetric=True, normed=True)\n", "        ds_diss[i] = graycoprops(glcm, 'dissimilarity')[0, 0]\n", "        ds_corr[i] = graycoprops(glcm, 'correlation')[0, 0]\n", "\n", "    # attach all four scales to the first (image index) dimension\n", "    ds_img.dims[0].attach_scale(ds_trg)\n", "    ds_img.dims[0].attach_scale(ds_mean)\n", "    ds_img.dims[0].attach_scale(ds_diss)\n", "    ds_img.dims[0].attach_scale(ds_corr)\n", "    h5.dump()" ] },
{ "cell_type": "markdown", "id": "3410054e-7640-4c2e-ada3-fa7ea8b99e60", "metadata": {}, "source": [ "## Insert (metadata) into the database\n", "\n", "To insert data from the HDF5 file into the DB, we first need to initialize the mongoDB interface class." ] },
{ "cell_type": "code", "execution_count": null, "id": "16c8fddc-d3f2-4987-925f-db07a1337b8b", "metadata": {}, "outputs": [], "source": [ "from h5rdmtoolbox.database import mongo" ] },
{ "cell_type": "code", "execution_count": null, "id": "1f5d4d82-78ce-49fc-8883-bb89bb1a6666", "metadata": {}, "outputs": [], "source": [ "mdb = mongo.MongoDB(collection=collection)\n", "mdb.collection.drop()  # clean the collection just to be certain that it is really empty" ] },
{ "cell_type": "markdown", "id": "cad679e4-8196-44fa-a1ec-28483d9ca7f5", "metadata": {}, "source": [ "Next, we have two options:\n", "1. Insert a full dataset\n", "2. Insert slices of a dataset according to the dimension scales\n", "\n", "We try both options below to see how they differ:" ] },
{ "cell_type": "markdown", "id": "3d84959c-3e82-450e-b65d-4755dde8ef45", "metadata": {}, "source": [ "**First option**\n", "\n", "Let's insert the dataset \"images\" into the database:" ] },
{ "cell_type": "code", "execution_count": null, "id": "afcd3f7d-754e-4f55-9fd3-d8cfef7c4a66", "metadata": {}, "outputs": [], "source": [ "with h5tbx.File(filename) as h5:\n", "    mdb.insert_dataset(h5['images'], axis=None)" ] },
{ "cell_type": "markdown", "id": "acf056ab-5499-4286-bcc2-7dea13aeaca0", "metadata": {}, "source": [ "This will result in one document:" ] },
{ "cell_type": "code", "execution_count": null, "id": "6d8e16c7-cd04-4b25-bb18-382bb98b2ee0", "metadata": {}, "outputs": [], "source": [ "collection.count_documents({})  # or mdb.collection.count_documents({})" ] },
{ "cell_type": "markdown", "id": "a769fec6-cebb-42d9-9bc2-86dbc4ee5918", "metadata": {}, "source": [ "Let's find it (quite trivial...):" ] },
{ "cell_type": "code", "execution_count": null, "id": "df71d034-ebcf-4c2d-811f-24dff3ae59fc", "metadata": {}, "outputs": [], "source": [ "res = mdb.find_one({})" ] },
{ "cell_type": "markdown", "id": "67a5bf9e-6a5d-490c-abc0-328ec1f9c542", "metadata": {}, "source": [ "Let's slice the (lazy) dataset and ask for its shape:" ] },
{ "cell_type": "code", "execution_count": null, "id": "615cea84-62d8-416c-ba3d-8609663f7dd8", "metadata": {}, "outputs": [], "source": [ "res[()].shape" ] },
{ "cell_type": "markdown", "id": "b67ae0a4-7bd4-4d2b-a54a-37f92cc57820", "metadata": {}, "source": [ "We asked for the shape in order to compare it with the result of the second option for inserting multi-dimensional arrays into the mongoDB:" ] },
{ "cell_type": "markdown", "id": "48f9803d-1820-4ab5-8941-df6bda13bef4", "metadata": {}, "source": [ "**Second option**\n", "\n", "Now, we will set `axis` to 0. This \"cuts\" the image dataset along its first dimension into $N=1797$ subarrays and inserts them individually. Let's do it and find out what the advantage is afterwards:" ] },
{ "cell_type": "code", "execution_count": null, "id": "34e31813-62f1-4af6-bb1d-b4dac415a5b1", "metadata": {}, "outputs": [], "source": [ "mdb.collection.drop()  # clean the collection" ] },
{ "cell_type": "code", "execution_count": null, "id": "40d84010-8add-4b34-b090-64340880bcf9", "metadata": {}, "outputs": [], "source": [ "with h5tbx.File(filename) as h5:\n", "    mdb.insert_dataset(h5['images'], axis=0, update=False)" ] },
{ "cell_type": "markdown", "id": "cd81f6a0-7628-4785-9d76-c5d05422a8dd", "metadata": {}, "source": [ "Count the number of documents inserted. It is equal to the number of images (1797)."
] },
{ "cell_type": "code", "execution_count": null, "id": "951f2bec-1e21-4f17-8932-6cb6a2d1edc1", "metadata": {}, "outputs": [], "source": [ "collection.count_documents({})  # or mdb.collection.count_documents({})" ] },
{ "cell_type": "markdown", "id": "c844d58c-7b0f-4057-af0b-d194242f613f", "metadata": {}, "source": [ "The image dataset has dimension scales (you may want to look again at the output of the `dump()` call at the beginning). By searching, e.g., for digit=3, we find the slice of an image that satisfies this condition:" ] },
{ "cell_type": "code", "execution_count": null, "id": "5e95ad39-9b21-417f-8c09-e2bc6b135e62", "metadata": {}, "outputs": [], "source": [ "one_res = mdb.find_one({'digit': {'$eq': 3}})\n", "one_res[()].shape" ] },
{ "cell_type": "markdown", "id": "219bbfc6-4118-4376-8eac-fa1f327d484d", "metadata": {}, "source": [ "With the first option, we would have to work with the full xarray object and determine ourselves which slice corresponds to a digit equal to 3. With the second option, the mongoDB query does this for us:" ] },
{ "cell_type": "code", "execution_count": null, "id": "80f1e871-f048-434b-9372-8fec5a85fbc8", "metadata": {}, "outputs": [], "source": [ "one_res[()].plot(cmap='gray')" ] },
{ "cell_type": "code", "execution_count": null, "id": "2ab19a91-e44d-49bd-ac32-5cc83b57beec", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "res = mdb.find({'digit': {'$eq': 3}})\n", "\n", "# plot the first 5 results:\n", "for i in range(5):\n", "    next(res)[()].plot()\n", "    plt.show()" ] } ],
"metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }