{ "cells": [ { "cell_type": "markdown", "id": "8b08d985-9c1b-4613-80d8-2965af5b8034", "metadata": {}, "source": [ "# First steps: HDF5 and databases\n", "\n", "There are two ways of working with HDF5 and databases:\n", "1. Using the HDF5 file(s) themselves as a database.\n", "2. Writing HDF5 content into dedicated database solutions.\n", "\n", "Both approaches are described in the next two chapters. The second approach is currently implemented as an interface to [mongoDB](https://pymongo.readthedocs.io/en/stable/).\n", "\n", "However, before we start, let's understand how the database interface is designed." ] }, { "cell_type": "markdown", "id": "7dd41460-d2d8-4794-893a-2ed8017c607b", "metadata": {}, "source": [ "## Design idea\n", "\n", "Regardless of whether we want to use HDF5 files themselves as databases or connect the content to third-party solutions, we need to write a user interface. The `h5RDMtoolbox` provides an abstract class (`HDF5DBInterface`) from which any database interface must inherit." ] }, { "cell_type": "code", "execution_count": 1, "id": "c03a9649-c89b-4d0b-b5a3-cff64a6316da", "metadata": {}, "outputs": [], "source": [ "from h5rdmtoolbox.database import HDF5DBInterface" ] }, { "cell_type": "markdown", "id": "c3d18bbc-08dd-4378-b36e-a65165507f52", "metadata": {}, "source": [ "Four methods must be implemented: `insert_dataset`, `insert_group`, `find` and `find_one` (and, of course, an `__init__` method). The return types for `find` and `find_one` are so-called *lazy* objects. In short, they are interfaces to the HDF5 dataset and group objects and allow accessing data while the source file is closed. Examples are given at the end." 
] }, { "cell_type": "code", "execution_count": 2, "id": "4d4c6cd6-6262-4abe-af1d-adec503a9b84", "metadata": {}, "outputs": [], "source": [ "from typing import List\n", "from h5rdmtoolbox.wrapper.lazy import LHDFObject\n", "\n", "class MyDBInterface(HDF5DBInterface):\n", "\n", "    def __init__(self, *args, **kwargs):\n", "        \"\"\"init the db\"\"\"\n", "\n", "    def insert_dataset(self, *args, **kwargs):\n", "        \"\"\"insert a dataset into the database\"\"\"\n", "\n", "    def insert_group(self, *args, **kwargs):\n", "        \"\"\"insert a group into the database\"\"\"\n", "\n", "    def find(self, *args, **kwargs) -> List[LHDFObject]:\n", "        \"\"\"find (many) objects according to the query parameters\"\"\"\n", "\n", "    def find_one(self, *args, **kwargs) -> LHDFObject:\n", "        \"\"\"find a single object according to the query parameters\"\"\"" ] }, { "cell_type": "markdown", "id": "8f54c90f-7b73-4e78-b93b-f3c11d9e273e", "metadata": {}, "source": [ "The usage of a database interface then looks like this for all database implementations:" ] }, { "cell_type": "code", "execution_count": 3, "id": "f52f38e7-2bb7-402c-bea0-627598671901", "metadata": {}, "outputs": [], "source": [ "import h5rdmtoolbox as h5tbx\n", "\n", "mydb = MyDBInterface()\n", "\n", "with h5tbx.File() as h5:\n", "    h5.create_group('a group')\n", "    h5.create_dataset('my_dataset', shape=(4, 2))\n", "    # ...\n", "    mydb.insert_dataset(h5['my_dataset'])\n", "    mydb.insert_group(h5['a group'])\n", "\n", "many_res = mydb.find(...)\n", "single_res = mydb.find_one(...)" ] }, { "cell_type": "markdown", "id": "9e2bcb7b-321f-449f-a665-e5260dfe465e", "metadata": {}, "source": [ "## A word on lazy objects (return values of database find methods)" ] }, { "cell_type": "markdown", "id": "9a994739-d1df-42bc-8544-57ad042a8e1a", "metadata": {}, "source": [ "The **return types** of the find methods are so-called lazy objects (or a generator of lazy objects). 
What is a lazy object?\n", "\n", "There are two types: `LDataset` and `LGroup`, the lazy objects for datasets and groups. These objects are connected to HDF5 datasets and groups. The only difference is that the user can work with them **even if the file is closed**.\n", "\n", "Example: The standard approach is to open a file whenever data or information needs to be accessed:" ] }, { "cell_type": "code", "execution_count": 4, "id": "b131f615-1ebb-43f8-9d4c-bf3c73591060", "metadata": {}, "outputs": [], "source": [ "with h5tbx.File() as h5:\n", "    h5.create_dataset('my_dataset', shape=(4, 2))\n", "\n", "# some other code...\n", "\n", "# after a while, we want to access the data again and must reopen the file:\n", "with h5tbx.File(h5.hdf_filename) as h5:\n", "    ds = h5['my_dataset'][()]" ] }, { "cell_type": "markdown", "id": "a67262e1-c9d0-4367-91cc-c93e9cc6f296", "metadata": {}, "source": [ "The `LDataset` allows accessing a dataset without actively opening the file (the object takes care of it in the background):" ] }, { "cell_type": "code", "execution_count": 5, "id": "74cc251f-23ca-4f86-b0f9-7f57f7dbd139", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.DataArray 'my_dataset' (x: 4, y: 2)>\n",
"0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"Coordinates:\n",
" * x (x) int32 1 2 3 4\n",
" * y (y) int32 10 20\n",
"<xarray.DataArray 'my_dataset' ()>\n",
"0.0\n",
"Coordinates:\n",
" x int32 4\n",
" y int32 20