{ "cells": [ { "cell_type": "markdown", "id": "8b08d985-9c1b-4613-80d8-2965af5b8034", "metadata": {}, "source": [ "# First steps: HDF5 and databases\n", "\n", "There are two ways of working with HDF5 and databases:\n", "1. Using the HDF5 file(s) themselves as a database.\n", "2. Writing HDF5 content into a dedicated database solution.\n", "\n", "Both approaches are described in the next two chapters. The second approach is currently implemented as a [MongoDB](https://pymongo.readthedocs.io/en/stable/) interface.\n", "\n", "Before we start, however, let's look at how the database interface is designed." ] }, { "cell_type": "markdown", "id": "7dd41460-d2d8-4794-893a-2ed8017c607b", "metadata": {}, "source": [ "## Design idea\n", "\n", "Regardless of whether we use HDF5 files themselves as databases or connect their content to third-party solutions, we need a common user interface. The `h5RDMtoolbox` provides an abstract class (`HDF5DBInterface`) from which every database interface must inherit." ] }, { "cell_type": "code", "execution_count": 1, "id": "c03a9649-c89b-4d0b-b5a3-cff64a6316da", "metadata": {}, "outputs": [], "source": [ "from h5rdmtoolbox.database import HDF5DBInterface" ] }, { "cell_type": "markdown", "id": "c3d18bbc-08dd-4378-b36e-a65165507f52", "metadata": {}, "source": [ "Four methods must be implemented: `insert_dataset`, `insert_group`, `find` and `find_one` (and, of course, an `__init__` method). The `find` and `find_one` methods return so-called *lazy* objects. In short, these are interfaces to the HDF5 dataset and group objects that allow accessing data while the source file is closed. Examples are given at the end." 
] }, { "cell_type": "code", "execution_count": 2, "id": "4d4c6cd6-6262-4abe-af1d-adec503a9b84", "metadata": {}, "outputs": [], "source": [ "from typing import List\n", "from h5rdmtoolbox.wrapper.lazy import LHDFObject\n", "\n", "class MyDBInterface(HDF5DBInterface):\n", "\n", "    def __init__(self, *args, **kwargs):\n", "        \"\"\"init the db\"\"\"\n", "\n", "    def insert_dataset(self, *args, **kwargs):\n", "        \"\"\"insert a dataset into the database\"\"\"\n", "\n", "    def insert_group(self, *args, **kwargs):\n", "        \"\"\"insert a group into the database\"\"\"\n", "\n", "    def find(self, *args, **kwargs) -> List[LHDFObject]:\n", "        \"\"\"find (many) objects according to the query parameters\"\"\"\n", "\n", "    def find_one(self, *args, **kwargs) -> LHDFObject:\n", "        \"\"\"find one object according to the query parameters\"\"\"" ] }, { "cell_type": "markdown", "id": "8f54c90f-7b73-4e78-b93b-f3c11d9e273e", "metadata": {}, "source": [ "Using a database interface then looks the same for every database implementation:" ] }, { "cell_type": "code", "execution_count": 3, "id": "f52f38e7-2bb7-402c-bea0-627598671901", "metadata": {}, "outputs": [], "source": [ "import h5rdmtoolbox as h5tbx\n", "\n", "mydb = MyDBInterface()\n", "\n", "with h5tbx.File() as h5:\n", "    h5.create_group('a group')\n", "    h5.create_dataset('my_dataset', shape=(4, 2))\n", "    # ...\n", "    mydb.insert_dataset(h5['my_dataset'])\n", "    mydb.insert_group(h5['a group'])\n", "\n", "many_res = mydb.find(...)\n", "single_res = mydb.find_one(...)" ] }, { "cell_type": "markdown", "id": "9e2bcb7b-321f-449f-a665-e5260dfe465e", "metadata": {}, "source": [ "## A word on lazy objects (return values of database find methods)" ] }, { "cell_type": "markdown", "id": "9a994739-d1df-42bc-8544-57ad042a8e1a", "metadata": {}, "source": [ "The **return types** of the find methods are so-called lazy objects (or a generator of lazy objects). 
What is a lazy object?\n", "\n", "There are two types: `LDataset` and `LGroup`, the lazy objects for datasets and groups. These objects are connected to HDF5 datasets and groups, the only difference being that the user can work with them **even if the file is closed**.\n", "\n", "Example: The standard approach is to open a file whenever data or information needs to be accessed:" ] }, { "cell_type": "code", "execution_count": 4, "id": "b131f615-1ebb-43f8-9d4c-bf3c73591060", "metadata": {}, "outputs": [], "source": [ "with h5tbx.File() as h5:\n", "    h5.create_dataset('my_dataset', shape=(4, 2))\n", "\n", "# some other code...\n", "\n", "# after a while, we want to access the data again and need to reopen the file:\n", "with h5tbx.File(h5.hdf_filename) as h5:\n", "    ds = h5['my_dataset'][()]" ] }, { "cell_type": "markdown", "id": "a67262e1-c9d0-4367-91cc-c93e9cc6f296", "metadata": {}, "source": [ "The `LDataset` allows accessing a dataset without explicitly opening the file (the object takes care of it in the background):" ] }, { "cell_type": "code", "execution_count": 5, "id": "74cc251f-23ca-4f86-b0f9-7f57f7dbd139", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.DataArray 'my_dataset' (x: 4, y: 2)>\n",
       "0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
       "Coordinates:\n",
       "  * x        (x) int32 1 2 3 4\n",
       "  * y        (y) int32 10 20
" ], "text/plain": [ "<xarray.DataArray 'my_dataset' (x: 4, y: 2)>\n", "0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", "Coordinates:\n", "  * x        (x) int32 1 2 3 4\n", "  * y        (y) int32 10 20" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with h5tbx.File() as h5:\n", "    x = h5.create_dataset('x', data=[1, 2, 3, 4], make_scale=True)\n", "    y = h5.create_dataset('y', data=[10, 20], make_scale=True)\n", "    h5.create_dataset('my_dataset', shape=(4, 2), attach_scales=(x, y))\n", "\n", "    lds = h5tbx.database.lazy.LDataset(h5['my_dataset'])\n", "\n", "lds[()]  # access the data although the file is closed" ] }, { "cell_type": "markdown", "id": "3b73c359-8972-4f7a-922c-d6614774791e", "metadata": {}, "source": [ "The \"laziness\" here is that the object takes care of opening and closing the file in the background. It is simply a convenient way of accessing data from an HDF5 file without the extra code and the worry of properly opening and closing the file.\n", "\n", "Moreover, the object has additional functionality, such as selecting data based on the dimension scales (coordinates):" ] }, { "cell_type": "code", "execution_count": 6, "id": "3bfc43bf-f896-4a84-b3a2-ed7d7da6144c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.DataArray 'my_dataset' ()>\n",
       "0.0\n",
       "Coordinates:\n",
       "    x        int32 4\n",
       "    y        int32 20
" ], "text/plain": [ "<xarray.DataArray 'my_dataset' ()>\n", "0.0\n", "Coordinates:\n", "    x        int32 4\n", "    y        int32 20" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lds.sel(x=4, y=20)" ] }, { "cell_type": "markdown", "id": "35c72746-0315-43f4-a17c-6776da340a93", "metadata": {}, "source": [ "The same works for groups via `LGroup`. This is usually less useful, though, because datasets tend to be of greater interest." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.19" } }, "nbformat": 4, "nbformat_minor": 5 }