Standard Name Convention#

The “Standard Name Convention” is one realization of a convention promoted by the toolbox. It is based on the idea that every dataset must have a physical unit (or none if it is dimensionless) and that datasets must be identifiable via an identifier attribute rather than by the dataset name itself.

The key standard attributes are

  • standard_name: A human- and machine-readable dataset identifier based on construction rules and listed in a “Standard Name Table”,

  • standard_name_table: A list of standard names together with their base (SI) units and comprehensive descriptions. It also includes additional information about how a standard_name can be transformed into a new standard_name,

  • units: The unit attribute of a dataset. It need not be an SI unit, but it must be convertible to the SI unit registered for the standard name in the Standard Name Table,

  • long_name: An alternative name if no standard_name is applicable.

This concept was first introduced by the Climate and Forecast (CF) community and is known as the CF convention. The h5RDMtoolbox adopts the concept and implements a general version of it, so that users can define their own discipline- or problem-specific standard name convention.

Main benefits of the convention are:

  • achieving self-describing files, which are both human- and machine-interpretable,

  • validating the correctness of dataset identifiers (standard_name) and their units,

  • allowing unit-aware processing of data.

This chapter walks you through the concept and shows how to apply it.

import h5rdmtoolbox as h5tbx
import warnings
warnings.filterwarnings('ignore')

from h5rdmtoolbox.convention.standard_names.table import StandardNameTable

Standard Name Tables#

Example 1: cf-convention#

The Standard Name Table should be defined in a document (typically XML or YAML). The corresponding object can then be initialized with the respective constructor method (from_yaml, from_web, …).

For reading the original CF-convention table, do the following:

cf = StandardNameTable.from_web("https://cfconventions.org/Data/cf-standard-names/79/src/cf-standard-name-table.xml",
                               known_hash='4c29b5ad70f6416ad2c35981ca0f9cdebf8aab901de5b7e826a940cf06f9bae4')
cf

The standard names are items of the table object:

cf['x_wind']
cf['x_wind'].units
cf['x_wind'].description
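
To make the idea of a table document more concrete, the following is a minimal sketch that parses a small inline excerpt laid out like the CF standard name table XML, using only the Python standard library. It is not how StandardNameTable reads the file internally; the function name parse_table and the placeholder descriptions are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Inline excerpt mimicking the layout of the CF standard name table XML.
# The description texts are placeholders, not the official CF descriptions.
CF_EXCERPT = """
<standard_name_table>
  <version_number>79</version_number>
  <entry id="x_wind">
    <canonical_units>m s-1</canonical_units>
    <description>Placeholder description of x_wind.</description>
  </entry>
  <entry id="air_temperature">
    <canonical_units>K</canonical_units>
    <description>Placeholder description of air_temperature.</description>
  </entry>
</standard_name_table>
"""


def parse_table(xml_text):
    """Hypothetical helper: map each standard name to its units and description."""
    root = ET.fromstring(xml_text.strip())
    table = {}
    for entry in root.iter("entry"):
        table[entry.get("id")] = {
            "units": entry.findtext("canonical_units", default=""),
            "description": entry.findtext("description", default=""),
        }
    return table


table = parse_table(CF_EXCERPT)
print(table["x_wind"]["units"])
```

Each entry carries its identifier, its canonical unit and a description, which is the information the table object exposes through item access.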

Example 2: User defined table#

Initializing standard name tables from a web resource should be the standard process, because a project or community might have defined a table and published it under a DOI.

In particular, the h5RDMtoolbox supports tables that are published on Zenodo:

snt = StandardNameTable.from_zenodo(10428795)
snt

Here are the standard names of the table:

snt.names

In a notebook, we can also get a nice overview of the table by calling dump():

snt.dump()

Transformation of base standard names#

Not every allowed standard name needs to be listed in the table. Additional names can be derived from the listed ones by so-called transformations. There are two ways to transform a standard name:

  1. Using affixes: Adding a prefix or a suffix

  2. Applying a mathematical operation to the name

1. Adding affixes#

Note that ‘x_velocity’ is not part of the table:

'x_velocity' in snt

… but ‘velocity’ is, and it is a vector. The vector property tells us whether we can add a “vector component name” as a prefix, e.g. “x” or “y”:

snt['velocity'].is_vector()

The available vector components are defined in the table:

snt.affixes['component'].values

Thus, by indexing “x_velocity” the table checks whether the prefix is valid and if yes returns the new (transformed) standard name:

snt['x_velocity']
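
The prefix check described above can be sketched in plain Python. This is only an illustration of the idea, not the toolbox's implementation; the dictionary BASE_NAMES, the tuple COMPONENTS and the function resolve are hypothetical stand-ins for the table content.

```python
# Hypothetical stand-in for the base names listed in a standard name table.
BASE_NAMES = {
    "velocity": {"units": "m/s", "vector": True},
    "static_pressure": {"units": "Pa", "vector": False},
}

# Hypothetical stand-in for the values of the "component" affix.
COMPONENTS = ("x", "y", "z")


def resolve(name):
    """Return (base_name, component) or raise KeyError if the name is invalid."""
    if name in BASE_NAMES:
        return name, None  # listed directly, no transformation needed
    prefix, sep, base = name.partition("_")
    is_vector = base in BASE_NAMES and BASE_NAMES[base]["vector"]
    if sep and prefix in COMPONENTS and is_vector:
        return base, prefix  # valid component prefix on a vector quantity
    raise KeyError(f"'{name}' is neither listed nor a valid transformation")


print(resolve("x_velocity"))
```

A name such as 'x_static_pressure' is rejected in this sketch, because 'static_pressure' is not marked as a vector and therefore cannot take a component prefix.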

2. Applying a mathematical operation#

When processing data, datasets are often transformed with a mathematical function, for example by taking the square or by taking the derivative of one quantity with respect to (wrt) another one. Some mathematical operations like these are supported in the current version, e.g.:

snt['derivative_of_x_velocity_wrt_x_coordinate']
snt['square_of_static_pressure']
snt['arithmetic_mean_of_static_pressure']
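
One way to picture how such names could be interpreted is pattern matching on the name combined with a rule for the resulting unit. The sketch below is an assumption about the principle, not the toolbox's code: BASE_UNITS and units_of are hypothetical, and the units are kept as plain strings instead of real unit objects.

```python
import re

# Hypothetical units of names that are already known (listed or affix-transformed).
BASE_UNITS = {
    "x_velocity": "m/s",
    "x_coordinate": "m",
    "static_pressure": "Pa",
}


def units_of(name):
    """Derive a unit string for a name built from the three operations shown above."""
    if name in BASE_UNITS:
        return BASE_UNITS[name]
    match = re.fullmatch(r"derivative_of_(.+)_wrt_(.+)", name)
    if match:
        # A derivative has the unit of the numerator divided by that of the denominator.
        return f"({units_of(match.group(1))})/({units_of(match.group(2))})"
    match = re.fullmatch(r"square_of_(.+)", name)
    if match:
        return f"({units_of(match.group(1))})**2"
    match = re.fullmatch(r"arithmetic_mean_of_(.+)", name)
    if match:
        return units_of(match.group(1))  # averaging does not change the unit
    raise KeyError(name)


for transformed in (
    "derivative_of_x_velocity_wrt_x_coordinate",
    "square_of_static_pressure",
    "arithmetic_mean_of_static_pressure",
):
    print(transformed, "->", units_of(transformed))
```

Because the function calls itself on the inner name, nested operations such as the square of a derivative are handled by the same rules.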

Usage with HDF5 files#

Let’s apply the convention to HDF5 files. For convenience, we take the existing tutorial convention and remove some standard attributes in order to limit the example to the attributes relevant to the standard name convention:

zenodo_cv = h5tbx.convention.from_zenodo('https://zenodo.org/record/8357399')
sn_cv = zenodo_cv.pop('contact', 'comment', 'references', 'data_type')
sn_cv.name = 'standard name convention'
sn_cv.register()

h5tbx.use(sn_cv)
sn_cv

Let’s find out which standard names are available. We do this by creating a file and retrieving the attribute standard_name_table. The convention sets it by default, so it is available without explicitly setting it:

with h5tbx.File() as h5:
    snt = h5.standard_name_table

print('The available (base) standard names are: ', snt.names)

One possible dataset name based on the standard name table is “x_velocity”. This works because “component” is in the list of affixes, and its transformation pattern shows that it is a prefix. Since “x” is among the available components, “x_velocity” is a valid transformed standard name for the given table:

print('Available affixes: ', snt.affixes.keys())

print('\nValues for the component prefix:')
snt.affixes['component']

Let’s access the name from the table. It exists and the description is adjusted, too:

snt['x_velocity']

Creating an x-velocity dataset:

with h5tbx.File() as h5:
    h5.create_dataset('u', data=[1,2,3], standard_name='x_velocity', units='km/s')
    h5.dump()
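
The dataset above uses 'km/s', which is not an SI unit but is convertible to the registered 'm/s'. The following pure-Python sketch illustrates what such a compatibility check amounts to. It is a simplified assumption for illustration: the lookup table UNITS and the functions is_compatible and to_si are hypothetical, and a real implementation would rely on a unit library rather than a hand-written table.

```python
# Hypothetical lookup: unit -> (factor to the SI unit, (length, time) exponents).
UNITS = {
    "m/s": (1.0, (1, -1)),
    "km/s": (1000.0, (1, -1)),
    "m": (1.0, (1, 0)),
    "s": (1.0, (0, 1)),
}


def is_compatible(units, si_units):
    """Two units are compatible if their dimension exponents are identical."""
    return UNITS[units][1] == UNITS[si_units][1]


def to_si(values, units):
    """Scale the values so that they are expressed in the SI unit."""
    factor = UNITS[units][0]
    return [value * factor for value in values]


print(is_compatible("km/s", "m/s"))
print(is_compatible("m", "m/s"))
print(to_si([1, 2, 3], "km/s"))
```

In this sketch, 'km/s' passes the check against 'm/s', whereas 'm' does not, which mirrors the rule that a unit must be convertible to the SI unit registered for the standard name.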

Usage with HDF5 files (update)#

from ssnolib import SSNO
with h5tbx.File(mode='w') as h5:
    ds = h5.create_dataset('u', data=3)
    ds.attrs['standard_name', SSNO.hasStandardName] = 'x_velocity'
    ds.rdf.object['standard_name'] = SSNO.StandardName  # https://matthiasprobst.github.io/ssno#StandardName
    
    ds = h5.create_dataset('v', data=3)
    ds.attrs['standard_name', SSNO.hasStandardName] = 'y_velocity'
    ds.rdf.object['standard_name'] = SSNO.StandardName  # https://matthiasprobst.github.io/ssno#StandardName
    h5.dump(collapsed=False)

hdf_filename = h5.hdf_filename