{ "cells": [ { "cell_type": "markdown", "id": "ee60b8cd-e14d-4607-9122-26c71e4e025c", "metadata": {}, "source": [ "# Getting Started with \"Layouts\"\n", "\n", "Layout (definitions) are a mean to validate the setup and metadata content of an HDF5 file. The `h5rdmtoolbox` provides a way which allows defining exected dataset, groups, attributes and properties. This is intended to be used as a validation step before any further handling with an HDF5 file.\n", "\n", "Different to conventions, a layout validates an existing HDF5 files. The below image illustrates a typical workflow:\n", "1. The file is created while a convention may be in place. The convention supports during the creation process and makes sure that some required data will be available. However, it is not possible to capture all requirements, especially structural ones, e.g. a certain dataset must exist.\n", "2. The structural and conditional testing can be described by means of a layout definition. A layout validates an already written file. If it succeeds, follow-up steps like sharing, storing or additional processing steps could take place.\n", "\n", "\"../../_static/layout_workflow.svg\"\n",\n", "\n", "Let's learn about it by practical examples:" ] }, { "cell_type": "code", "execution_count": 1, "id": "b24fad61-0117-4d27-8a9d-b8389f655873", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Failed to import module h5tbx\n" ] } ], "source": [ "from h5rdmtoolbox import layout # import the layout module" ] }, { "cell_type": "markdown", "id": "3a553a51-1573-4b28-9d91-26f8740b4c74", "metadata": {}, "source": [ "## 1. Create a layout\n", "\n", "The core class is called `Layout`, which will take so-called (layout) specifications:" ] }, { "cell_type": "code", "execution_count": 2, "id": "2e66e20d-712b-429c-aca7-640b572bc026", "metadata": {}, "outputs": [], "source": [ "lay = layout.Layout()" ] }, { "cell_type": "markdown", "id": "90673c5c-bbd2-475a-b191-d40e5f2f2e5a", "metadata": {}, "source": [ "Currently, the layout has no specifications:" ] }, { "cell_type": "code", "execution_count": 3, "id": "804e832a-0374-41fb-8f8a-65365df73e12", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lay.specifications" ] }, { "cell_type": "markdown", "id": "95002eca-127d-45ad-89b1-1558acdc22ae", "metadata": {}, "source": [ "### 1.1 Adding a specification\n", "\n", "Let's add a specification. For this we call `.add()`. We will add information for a query request, which will be performed later, when we validate a file (layout).\n", "\n", "The most important argument of a specification is a query-function. It can be any function that returns a list of HDF5 objects which are found based on keyword arguments requested by that function.\n", "\n", "Such a query function exists in h5rdmtoolbox, see the HDF5 database class: [`h5rdmtoolbox.database.hdfdb.FileDB`](../database/hdfDB.ipynb).\n", "\n", "As a first example, we request the following for all files to be validated with our layout:\n", "- all dataset must be compressed with \"gzip\"\n", "- the dataset with name \"/u\" must exist" ] }, { "cell_type": "code", "execution_count": 4, "id": "762c996d-af0c-4d7e-a794-31c9c788ae76", "metadata": {}, "outputs": [], "source": [ "from h5rdmtoolbox.database import hdfdb" ] }, { "cell_type": "code", "execution_count": 5, "id": "ba490b3c-820c-4dca-a94c-6e7f5f211dee", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[LayoutSpecification(description=\"At least one dataset exists\", kwargs={'flt': {}, 'objfilter': 'dataset'}),\n", " LayoutSpecification(description=\"Dataset \"/u\" exists\", kwargs={'flt': {'$name': '/u'}, 'objfilter': 'dataset'})]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the file must have datasets (this spec makes more sense with the following spec)\n", "spec_all_dataset = lay.add(\n", " hdfdb.FileDB.find, # query function\n", " flt={},\n", " objfilter='dataset',\n", " n={'$gte': 1}, # at least one dataset must exist\n", " description='At least one dataset exists'\n", ")\n", "\n", "# all datasets must be compressed with gzip (conditional spec. only called if parent spec is successful)\n", "spec_compression = spec_all_dataset.add(\n", " hdfdb.FileDB.find, # query function\n", " flt={'$compression': 'gzip'}, # query parameter\n", " n=1,\n", " description='Compression of any dataset is \"gzip\"'\n", ")\n", "\n", "# the file must have the dataset \"/u\"\n", "spec_ds_u = lay.add(\n", " hdfdb.FileDB.find, # query function\n", " flt={'$name': '/u'},\n", " objfilter='dataset',\n", " n=1,\n", " description='Dataset \"/u\" exists'\n", ")\n", "\n", "# we added one specification to the layout:\n", "lay.specifications" ] }, { "cell_type": "markdown", "id": "54fcadf7-50c3-44ff-a610-6ab25c7d92d0", "metadata": {}, "source": [ "**Note:** We added three specifications: The first (`spec_all_dataset`) and the last (`spec_ds_u`) specification were added to layout class. The second specification (`spec_compression`) was added to the first specification and therefore is a *conditional specification*. This means, that it is only called, if the parent specification was successful. Also note, that a child specification is called on all result objects of the parent specification. In our case, `spec_compression` is called on all datasets objects in the file." ] }, { "cell_type": "code", "execution_count": 6, "id": "fdc586ea-4175-4c4b-a884-a2a16daebc01", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[LayoutSpecification(description=\"Compression of any dataset is \"gzip\"\", kwargs={'flt': {'$compression': 'gzip'}})]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lay.specifications[0].specifications" ] }, { "cell_type": "markdown", "id": "5303fc07-e8c6-4f0e-9abd-1def29794760", "metadata": {}, "source": [ "## 2. Validate a file" ] }, { "cell_type": "markdown", "id": "939c72a1-8a23-4114-9598-40a600c1c717", "metadata": {}, "source": [ "**Example data**\n", "\n", "To test our layout, we need some example data" ] }, { "cell_type": "code", "execution_count": 7, "id": "b0ff9809-1c42-4987-83d7-e5ef7bcc2623", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n", " \n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import h5rdmtoolbox as h5tbx\n", "with h5tbx.File() as h5:\n", " h5.create_dataset('u', shape=(3, 5), compression='lzf')\n", " h5.create_dataset('v', shape=(3, 5), compression='gzip')\n", " h5.create_group('instruments', attrs={'description': 'Instrument data'})\n", " h5.dump()" ] }, { "cell_type": "markdown", "id": "4a7c54bf-817a-4082-a87c-25fa3569dd10", "metadata": {}, "source": [ "Let's perform the validation. We expect one failed validation, because dataset \"u\" has the wrong compression type:" ] }, { "cell_type": "code", "execution_count": 8, "id": "a5b3846d-63f6-486a-8cb0-6bb12e307e2c", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-04-07_18:34:55,177 ERROR [core.py:320] Applying spec. \"LayoutSpecification(description=\"Compression of any dataset is \"gzip\"\", kwargs={'flt': {'$compression': 'gzip'}})\" failed due to not matching the number of results: 1 != 0\n" ] } ], "source": [ "res = lay.validate(h5.hdf_filename)" ] }, { "cell_type": "code", "execution_count": 9, "id": "bd88b18f-113e-4683-9cc7-c012f3e2468d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res.specifications[0].specifications[0].results[0].validation_flag" ] }, { "cell_type": "markdown", "id": "f6d4a866-98b2-4a57-a26b-93f48785c3da", "metadata": {}, "source": [ "The summary gives a comprehensive set of information about the performed calls on targets (datasets or groups) and their outcomes:" ] }, { "cell_type": "code", "execution_count": 14, "id": "c327ff0f-d366-42ee-90a9-08d21b85ea6b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Summary of layout validation\n", "+----------+--------+------------------------+--------------------------------------+---------------+---------------+\n", "| called | flag | flag description | description | target_type | target_name |\n", "|----------+--------+------------------------+--------------------------------------+---------------+---------------|\n", "| True | 1 | SUCCESSFUL | At least one dataset exists | Group | tmp0.hdf |\n", "| True | 10 | FAILED, INVALID_NUMBER | Compression of any dataset is \"gzip\" | Dataset | /u |\n", "| True | 1 | SUCCESSFUL | Compression of any dataset is \"gzip\" | Dataset | /v |\n", "| True | 1 | SUCCESSFUL | Dataset \"/u\" exists | Group | tmp0.hdf |\n", "+----------+--------+------------------------+--------------------------------------+---------------+---------------+\n", "--> Layout validation found issues!\n" ] } ], "source": [ "res.print_summary(exclude_keys=['id', 'kwargs', 'func'])" ] }, { "cell_type": "markdown", "id": "e2873694-9080-4ece-a874-1c892e6c54c9", "metadata": {}, "source": [ "The error log message shows us that one specification was not successful. The `is_valid()` call therefore will return `False`, too:" ] }, { "cell_type": "code", "execution_count": 11, "id": "de3744f2-e3d9-4388-852a-486601f88ade", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res.is_valid()" ] }, { "cell_type": "code", "execution_count": 12, "id": "3e98e206-0f47-4d82-a213-b4d4b7964b86", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hdfdb.FileDB(h5.hdf_filename).find({'$compression': 'gzip'})" ] }, { "cell_type": "markdown", "id": "462e2b06-a48d-4d38-9ae2-94c9fde202ee", "metadata": {}, "source": [ "The \"compression-specification\" got called twice and failed one (for dataset \"u\"):" ] }, { "cell_type": "code", "execution_count": 13, "id": "e371cada-69b2-45dc-a759-2537ed1b9218", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2, 1)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spec_compression.n_calls, spec_compression.n_fails" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" } }, "nbformat": 4, "nbformat_minor": 5 }