{
"cells": [
{
"cell_type": "markdown",
"id": "26a6643f-3d3c-4cf4-99c7-32ddd5a4ec0a",
"metadata": {},
"source": [
"# HDF5 + mongoDB\n",
"\n",
"This database interface uses `pymongo` as the backend database. The user communicates via the interface `h5rdmtoolbox.database.MongoDB`.\n",
"\n",
"The idea is not to store all data in a mongoDB database. Then we would not have to write an HDF5 file in the first place.
\n",
"We rather use the database as a **metadata storage**, which is much more efficient to search through. So the steps are:\n",
"1. insert (metadata) into the database (the interface takes care how this is done)\n",
"2. perform query\n",
"3. the interface collects performs query on mongoDB...\n",
"4. ...and the interface returns a lazy HDF object to the user\n",
"\n",
"\n",
"\n",
"
\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "1e3c96fc-62ac-4462-8e87-d131aeb52c53",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"using(\"h5py\")"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pymongo\n",
"# from pymongo import MongoClient\n",
"from mongomock import MongoClient # for docs or testing only\n",
"\n",
"from h5rdmtoolbox import tutorial\n",
"import h5rdmtoolbox as h5tbx\n",
"\n",
"import numpy as np\n",
"from pprint import pprint\n",
"\n",
"h5tbx.use(None)"
]
},
{
"cell_type": "markdown",
"id": "d13b2108-f659-495c-84f0-db7a206acfdb",
"metadata": {},
"source": [
"## First things first: Connection to the DB:\n",
"Connect to the mongod client:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b40e8640-b166-4792-913b-6ce39765eef4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"mongomock.MongoClient('localhost', 27017)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"client = MongoClient()\n",
"client"
]
},
{
"cell_type": "markdown",
"id": "1b4f6cf9-bcfb-4f24-9052-303bdfcd0515",
"metadata": {},
"source": [
"Create a database and a (test) collection named \"digits\":"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a8f3cb21-e1e3-4969-9103-a48ac49547b9",
"metadata": {},
"outputs": [],
"source": [
"db = client['h5database_notebook_tutorial']\n",
"collection = db['digits']\n",
"\n",
"# drop all content in order to start from scratch:\n",
"collection.drop()"
]
},
{
"cell_type": "markdown",
"id": "9417ee13-271e-43aa-9a23-7dee3c296ff7",
"metadata": {},
"source": [
"## Testdata\n",
"We will take test data from scikit-learn, namely the hand-written digits ((https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html):"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "111e3c0d-6d5d-4931-9aac-e4f9e6575088",
"metadata": {},
"outputs": [],
"source": [
"# ! pip install scikit-learn\n",
"from sklearn.datasets import load_digits\n",
"digits = load_digits()"
]
},
{
"cell_type": "markdown",
"id": "52d15e57-0fac-4c5c-8570-9aaba55f7fbe",
"metadata": {},
"source": [
"Fill a HDF5 file with the loaded data. We additionally compute the mean count and two gray occurance properties (dissimilarity and correlation). Those three datasets together with the true digit of the image are linked to the image via HDF dimension scales:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "3c7f5e27-c173-4a14-941c-2614466219a5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"