{ "cells": [ { "cell_type": "markdown", "id": "0052861b-12a3-4960-9279-7b8d03752a9d", "metadata": {}, "source": [ "# Creating HDF Datasets\n", "\n", "Dataset creation works almost the same as in `h5py`. To further streamline working with HDF5 files, some features are added." ] }, { "cell_type": "code", "execution_count": 1, "id": "7e7ee488-fc61-4edd-a944-d5613969c769", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "using(\"h5py\")" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import h5rdmtoolbox as h5tbx\n", "import numpy as np\n", "import xarray as xr\n", "\n", "h5tbx.use(None)" ] }, { "cell_type": "markdown", "id": "a6baa8c7-9bc8-4c45-9632-9322c72cda9d", "metadata": {}, "source": [ "As in the base package `h5py`, the required parameters for dataset creation are `name` and either `data` or `shape`. In addition, attributes can be passed directly during dataset creation:" ] }, { "cell_type": "code", "execution_count": 2, "id": "0bf9a1d3-5af4-46f1-8954-38dc7d1d0542", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n", " \n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with h5tbx.File() as h5:\n", " h5.create_dataset('x', shape=(4,),\n", " attrs=dict(description='x coordinate'))\n", " h5.dump()" ] }, { "cell_type": "markdown", "id": "7074b05a-d5d0-487b-aaed-fb5dee043a6b", "metadata": {}, "source": [ "The name of the dataset is the path within the HDF5 file. It is possible to create the dataset although the (sub-)groups don't exist." ] }, { "cell_type": "code", "execution_count": 3, "id": "ba439684-87de-477e-bfc9-ca1645cc5fa7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n", " \n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with h5tbx.File() as h5:\n", " h5.create_dataset('grp/subgrp/x', shape=(4,))\n", " h5.dump()" ] }, { "cell_type": "markdown", "id": "831ed780-ad93-4181-9989-1b4a8aab87a0", "metadata": {}, "source": [ "## Attributes\n", "\n", "More flexibility and additional features are given also to attributes. One of the main ones to mention is the ability to intepret the attribute strings as \"value and quantity\" using the package `pint`:
\n", "Suppose we store the attribute `length`; it will most likely include the unit, e.g. `1 m`. We could also have saved it as a dataset, but we did not. Calling `.to_pint()` on the returned object (which is a subclass of `str`) yields a `pint.Quantity` (see https://pint.readthedocs.io/en/stable/getting/tutorial.html for more info):" ] }, { "cell_type": "code", "execution_count": 4, "id": "c9234e61-6011-4fd2-8cda-ac280680333f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "1 m" ], "text/latex": [ "$\\begin{pmatrix}1\\end{pmatrix}\\ \\mathrm{m}$" ], "text/plain": [ "array(1) " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with h5tbx.File() as h5:\n", "    h5.attrs['length'] = '1 m'\n", "    p = h5.attrs.length.to_pint()\n", "p" ] }, { "cell_type": "markdown", "id": "ab70316c-64c5-4a56-a920-3caf83322963", "metadata": {}, "source": [ "## Dimension scales\n", "\n", "Dimension scales can be defined during dataset creation. Let `time` be the dimension scale and `pressure` be the dataset to which it is attached.
\n", "To make seamless use of HDF dimension scales, they are passed back to the user by returning an `xarray.DataArray` instead of a `np.ndarray` object. For more on this, see [slicing datasets](./DatasetSlicing.ipynb)." ] }, { "cell_type": "code", "execution_count": 5, "id": "ceb9c820-070c-4cd7-b944-5e130bd7e7dd", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "pressure (time: 6) [float64]\n", "• units: Pa\n", "time (6) [int32]\n", "• units: s\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fname_dimcales = h5tbx.utils.generate_temporary_filename()\n", "with h5tbx.File(fname_dimcales, 'w') as h5:\n", " h5.create_dataset('time', data=[0,1,2,3,4,5],\n", " make_scale=True,\n", " attrs={'units': 's'})\n", " h5.create_dataset('pressure', data=np.random.rand(6),\n", " attach_scale=((h5['time'])),\n", " attrs={'units': 'Pa'})\n", " h5.dump()" ] }, { "cell_type": "markdown", "id": "f1fdf25f-a193-458e-813e-b129f9134947", "metadata": {}, "source": [ "In order to be compliant with `xarray` objects, single value \"dimension scales\" are set via the attribute `COORDINATES`. An example is the location of the pressure sensor in our case. Let's first create the datasets and then add them as attributes to \"pressure\":" ] }, { "cell_type": "code", "execution_count": 6, "id": "59560eec-0225-43bc-975d-3ba1812468fa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "pressure (time: 6) [float64]\n", "• units: Pa\n", "time (6) [int32]\n", "• units: s\n", "x 5.32 [] (float64)\n", "y -3.1 [] (float64)\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with h5tbx.File(fname_dimcales, 'r+') as h5:\n", " h5.create_dataset('x', data=5.32)\n", " h5.create_dataset('y', data=-3.1)\n", " h5['pressure'].attrs['COORDINATES'] = ('x', 'y')\n", " h5.dump()" ] }, { "cell_type": "markdown", "id": "5283b85d-bcf8-420e-91c1-fc7dadab6bf6", "metadata": {}, "source": [ "### String datasets\n", "String datasets can be created very quickly. No standard_name, long_name or units *must* be given. As units generally anyhow makes no sense, there is still the option to pass long and standard name via the method parameters.
\n", "The dump method will display single strings but not lists of strings.
\n", "Slicing still returns an `xarray.DataArray`, so that the attributes stay attached to the object. Use `.values` to get the raw string:" ] }, { "cell_type": "code", "execution_count": 7, "id": "a854cf32-74d8-47a9-a34c-1921b36913c8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "astr: [|S11] data=b'hello_world'\n", "string_list: [|S5]\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "> hello_world\n", "> b'hello_world'\n", "> \n", "'hello' 'world'\n", "Dimensions without coordinates: dim_0\n", "> [b'hello' b'world']\n" ] } ], "source": [ "with h5tbx.File() as h5:\n", " h5.create_string_dataset('astr', 'hello_world')\n", " h5.create_string_dataset('string_list', ['hello', 'world'])\n", " h5.dump()\n", " \n", " print('> ', h5['astr'][()])\n", " print('> ',h5['astr'].values[()])\n", " \n", " print('> ', h5['string_list'][:])\n", " print('> ',h5['string_list'].values[:])" ] }, { "cell_type": "markdown", "id": "19f5a8dc-26a0-4178-9752-0de9c34d8137", "metadata": {}, "source": [ "#### Time datasets\n", "\n", "Time data is stored as string datasets. Use `create_time_dataset`. Provide data as `datetime` objects and indicate the time format (simplest is to pass 'iso'). " ] }, { "cell_type": "code", "execution_count": 19, "id": "4cd1032e-ec57-4427-95cc-efacc2c9e14d", "metadata": {}, "outputs": [], "source": [ "from datetime import datetime, timedelta" ] }, { "cell_type": "code", "execution_count": 27, "id": "617ccb67-7801-4ec4-a1e0-07944a2eef80", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray (dim_0: 3)>\n",
       "2024-06-18T12:26:35.885516 2024-06-18T12:26:36.885516 2024-06-18T12:26:37.885516\n",
       "Dimensions without coordinates: dim_0\n",
       "Attributes:\n",
       "    @TYPE:          https://schema.org/DateTime\n",
       "    RDF_PREDICATE:  {'time_format': 'https://matthiasprobst.github.io/pivmeta...\n",
       "    time_format:    %Y-%m-%dT%H:%M:%S.%f
" ], "text/plain": [ "\n", "2024-06-18T12:26:35.885516 2024-06-18T12:26:36.885516 2024-06-18T12:26:37.885516\n", "Dimensions without coordinates: dim_0\n", "Attributes:\n", " @TYPE: https://schema.org/DateTime\n", " RDF_PREDICATE: {'time_format': 'https://matthiasprobst.github.io/pivmeta...\n", " time_format: %Y-%m-%dT%H:%M:%S.%f" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with h5tbx.File() as h5:\n", " now = datetime.now()\n", " h5.create_time_dataset('t',\n", " data=[now, now + timedelta(seconds=1), now + timedelta(seconds=2)],\n", " time_format='iso',\n", " make_scale=True)\n", " h5.create_dataset('x', data=[1, 2, 3], attach_scale='t')\n", " # h5.dump()\n", " txr = h5['t'][()]\n", "txr" ] }, { "cell_type": "markdown", "id": "9d0d95f9-8dbe-4755-ad03-2258a51beae4", "metadata": {}, "source": [ "### Advanced dataset creation\n", "\n", "There is more to dataset creation. You can:\n", "- add attributes" ] }, { "cell_type": "code", "execution_count": 8, "id": "4b9899ee-8b74-4521-a8a5-6d132e3ef744", "metadata": {}, "outputs": [], "source": [ "with h5tbx.File() as h5:\n", " h5.create_dataset('ds', shape=(10, ), attrs=dict(long_name='a long name', anothera='another attr')) # unitless dataset. long_name is passed via parameter attrs" ] }, { "cell_type": "markdown", "id": "0ca2a909-b2f7-44eb-8339-169c5fd62b3a", "metadata": {}, "source": [ "- make and attach scales (Note the output using `dump()`: the scale \"link\" is shown)" ] }, { "cell_type": "code", "execution_count": 9, "id": "4d2e9458-d5c9-414c-9def-44277d11aa47", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/html": [ "\n", "
\n", "t (x: 3) [float64]\n", "• standard_name: temperature\n", "• units: degC\n", "x (3) [int32]\n", "• standard_name: x_coordinate\n", "• units: m\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with h5tbx.File() as h5:\n", " h5.create_dataset('x', data=[1,2,3], attrs=dict(units='m', standard_name='x_coordinate'), make_scale=True)\n", " h5.create_dataset('t', data=[20.1, 18.5, 24.7], attrs=dict(units='degC', standard_name='temperature'), attach_scale=h5['x'])\n", " print(h5.t.x) # note, that you can access the dimension scale using attribute-style-syntax\n", " h5.dump()" ] }, { "cell_type": "markdown", "id": "4e810d95-ae9f-43c8-8560-1f054ddd50eb", "metadata": {}, "source": [ "- add `xarry.DataArrays`" ] }, { "cell_type": "code", "execution_count": 10, "id": "9e8f826f-0579-405d-9c1d-bf47733c66fb", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "temperature (y: 3, x: 2) [float64]\n", "• long_name: a long name\n", "• units: m/s\n", "x (2) [int32]\n", "• standard_name: x_coordinate\n", "y (3) [int32]\n", "• standard_name: y_coordinate\n", "• units: m\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "arr = xr.DataArray(dims=('y', 'x'), data=np.random.rand(3, 2),\n", " coords={'y': xr.DataArray(dims='y', data=[1, 2, 3],\n", " attrs={'units': 'm',\n", " 'standard_name': 'y_coordinate'}),\n", " 'x': xr.DataArray(dims='x',\n", " data=[0, 1],\n", " attrs={'standard_name': 'x_coordinate'})\n", " },\n", " attrs={'long_name': 'a long name',\n", " 'units': 'm/s'})\n", "\n", "with h5tbx.File() as h5:\n", " h5.create_dataset('temperature', data=arr)\n", " h5.dump()" ] }, { "cell_type": "markdown", "id": "4c629265-da4a-4f50-b887-f16aa3a385c8", "metadata": {}, "source": [ "- add `xarry.Dataset`" ] }, { "cell_type": "code", "execution_count": 11, "id": "23d2c88c-8707-4ee0-938c-20eb90699ca6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset>\n",
       "Dimensions:  (foo: 3, x: 2)\n",
       "Coordinates:\n",
       "  * foo      (foo) int32 1 2 3\n",
       "Dimensions without coordinates: x\n",
       "Data variables:\n",
       "    bar      (x) int32 1 2\n",
       "    baz      float64 3.142
" ], "text/plain": [ "\n", "Dimensions: (foo: 3, x: 2)\n", "Coordinates:\n", " * foo (foo) int32 1 2 3\n", "Dimensions without coordinates: x\n", "Data variables:\n", " bar (x) int32 1 2\n", " baz float64 3.142" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = xr.Dataset({'foo': [1,2,3], 'bar': ('x', [1, 2]), 'baz': np.pi})\n", "ds" ] }, { "cell_type": "code", "execution_count": 12, "id": "e167343f-12c2-400b-9c1f-c2b338758471", "metadata": {}, "outputs": [], "source": [ "try:\n", " with h5tbx.File() as h5:\n", " h5.create_dataset_from_xarray_dataset(ds)\n", "except h5tbx.errors.UnitsError as e:\n", " print(e)" ] }, { "cell_type": "code", "execution_count": 13, "id": "00ad35bc-dbfa-4da9-8c27-8f727e2555c5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset>\n",
       "Dimensions:  (foo: 3, x: 2)\n",
       "Coordinates:\n",
       "  * foo      (foo) int32 1 2 3\n",
       "Dimensions without coordinates: x\n",
       "Data variables:\n",
       "    bar      (x) int32 1 2\n",
       "    baz      float64 3.142
" ], "text/plain": [ "\n", "Dimensions: (foo: 3, x: 2)\n", "Coordinates:\n", " * foo (foo) int32 1 2 3\n", "Dimensions without coordinates: x\n", "Data variables:\n", " bar (x) int32 1 2\n", " baz float64 3.142" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.foo.attrs['units']='m'\n", "ds.foo.attrs['long_name']='foo'\n", "\n", "ds.bar.attrs['units']='m'\n", "ds.bar.attrs['long_name']='bar'\n", "\n", "ds.baz.attrs['units']='m'\n", "ds.baz.attrs['long_name']='baz'\n", "\n", "ds" ] }, { "cell_type": "code", "execution_count": 14, "id": "52b1701c-11bb-4b8f-aa26-d7a77e56b126", "metadata": {}, "outputs": [], "source": [ "with h5tbx.File() as h5:\n", " h5.create_dataset_from_xarray_dataset(ds)" ] }, { "cell_type": "markdown", "id": "3c4808bb-63cb-4886-b2ea-4fd863238e48", "metadata": {}, "source": [ "We may also create a dataset by using the `__setitem__`:" ] }, { "cell_type": "code", "execution_count": 15, "id": "a52f72b8-5cdc-4ce0-adfe-67ab83d650ef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "x (3) [int32]\n", "• hello: world\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with h5tbx.File() as h5:\n", " h5['x'] = ([1,2,3], dict(attrs={'hello': 'world'}, compression='gzip'))\n", " h5.dump()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" } }, "nbformat": 4, "nbformat_minor": 5 }