HDF5 and RDF: FAIR Attributes#

According to principle F1 of the FAIR Principles, (meta)data, and therefore also attributes, shall be assigned globally unique and persistent identifiers.

Here’s what www.go-fair.org says about it:

“Globally unique and persistent identifiers remove ambiguity in the meaning of your published data by assigning a unique identifier to every element of metadata and every concept/measurement in your dataset. In this context, identifiers consist of an internet link (e.g., a URL that resolves to a web page that defines the concept such as a particular human protein). Many data repositories will automatically generate globally unique and persistent identifiers to deposited datasets. Identifiers can help other people understand exactly what you mean, and they allow computers to interpret your data in a meaningful way (i.e., computers that are searching for your data or trying to automatically integrate them). Identifiers are essential to the human-machine interoperation that is key to the vision of Open Science. In addition, identifiers will help others to properly cite your work when reusing your data.”

The h5rdmtoolbox allows associating attributes (and their data) with identifiers. For this, both the name and the value of an attribute may be given an IRI (Internationalized Resource Identifier). The following outlines how this is done.

Concept#

We can interpret HDF5 objects, their attribute names and attribute values as RDF triples (subject-predicate-object), where…

… a group or dataset is a subject
… the attribute name is a predicate
… and the attribute value is an object
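The mapping can be sketched in plain Python, independent of the h5rdmtoolbox API. The names `attrs`, `predicate_iris` and `to_triples` are hypothetical and only serve the illustration; the IRIs are the ones that appear in the outputs further down this page.

```python
# Illustration only: plain Python data, no HDF5 file and no h5rdmtoolbox calls.
FOAF = "http://xmlns.com/foaf/0.1/"
M4I = "http://w3id.org/nfdi4ing/metadata4ing#"

# Subject: the IRI that identifies the group "contact"
subject_iri = "https://orcid.org/0000-0001-8729-0482"

# Attribute names and values as they would be stored in the HDF5 group
attrs = {
    "first_name": "Matthias",
    "orcid": "https://orcid.org/0000-0001-8729-0482",
}

# Predicates: one IRI per attribute name
predicate_iris = {
    "first_name": FOAF + "firstName",
    "orcid": M4I + "orcidId",
}


def to_triples(subject, attributes, predicates):
    """Return (subject, predicate, object) tuples for attributes that have a predicate IRI."""
    return [
        (subject, predicates[name], value)
        for name, value in attributes.items()
        if name in predicates
    ]


triples = to_triples(subject_iri, attrs, predicate_iris)
for s, p, o in triples:
    print(s, p, o)
```

Attributes without an entry in `predicate_iris` are simply skipped, which mirrors the idea that an attribute only becomes part of the RDF description once an IRI has been assigned to it.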

In the following, we would like to describe the content of an HDF5 file. It will contain a dataset of random data generated by a person, who can be identified by a researcher ID (ORCID).

We as humans may understand the content of such an HDF5 file. For machines to interpret the data, we need to associate URIs with the HDF5 objects. In fact, sometimes it is not clear to humans either what is meant by a certain attribute. A URI removes this ambiguity. Think of the attribute “contact”, which we will define: is it a person or an organization? Note that URI and IRI are used synonymously here, although an IRI extends a URI by allowing a larger set of characters.
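The difference between IRI and URI can be made concrete with a short, simplified sketch using only the Python standard library. The IRI below is a made-up placeholder on example.org, and percent-encoding the path as shown is only one part of a complete IRI-to-URI conversion.

```python
from urllib.parse import quote

# Hypothetical IRI containing a non-ASCII character ("ü"); example.org is a placeholder domain.
iri = "https://example.org/people/Büttner"

# Percent-encode the UTF-8 bytes of the non-ASCII character,
# leaving the scheme separator and the path slashes untouched.
uri = quote(iri, safe=":/")

print(uri)
```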

Let’s build the example step by step. We start by creating the group “contact”:

import h5rdmtoolbox as h5tbx

How to make use of the FAIR HDF5 file?#

There are three ways in which the IRI assignments above help us and in which we might want to use the information:

Visual inspection by dumping the content to screen: this outlines the file (meta)content, and clicking on an attribute with an IRI leads to the resource that explains the attribute (data).
Extraction of a JSON-LD file: this is useful for other processes, and the file can be investigated further with tools like the JSON-LD Playground.
Access to the IRIs in (Python) code.

1. Visual inspection#

The dump() method now adds IRI icons. Click on an icon to be redirected to the corresponding resource:

h5tbx.dump(hdf_filename, collapsed=False)

/(3)
- contact @type: http://xmlns.com/foaf/0.1/Person @id: https://orcid.org/0000-0001-8729-0482 https://schema.org/author(0)
  - first_name http://xmlns.com/foaf/0.1/firstName: Matthias
  - orcid http://w3id.org/nfdi4ing/metadata4ing#orcidId: https://orcid.org/0000-0001-8729-0482
- grp(1)
- processing_info @id: http://w3id.org/nfdi4ing/metadata4ing#ProcessingStep(0)
  - end_time https://schema.org/startTime: 2025-02-08T18:25:56.866437
  - has_participants http://purl.obolibrary.org/obo/RO_0000057: /contact
  - output http://purl.obolibrary.org/obo/RO_0002234: /grp/random_velocity
  - start_time https://schema.org/startTime: 2025-02-08T18:25:56.866432

2. JSON-LD extraction#

Write the JSON-LD file and share it with others or upload it to a repository. The toolbox provides dump methods through the jsonld module. The output might look a bit overwhelming; however, dedicated scripts can work with it perfectly well, while humans can still read it (with a bit of practice and patience…):

print(
    h5tbx.dump_jsonld(
        hdf_filename,
        indent=2,
        context={'m4i': 'http://w3id.org/nfdi4ing/metadata4ing#',
                 'foaf': 'http://xmlns.com/foaf/0.1/'}
    )
)

{
  "@context": {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "hdf5": "http://purl.allotrope.org/ontologies/hdf5/1.8#",
    "m4i": "http://w3id.org/nfdi4ing/metadata4ing#",
    "obo": "http://purl.obolibrary.org/obo/",
    "schema": "https://schema.org/"
  },
  "@graph": [
    {
      "@id": "_:N899b84cb1ef549f78859b6b3368d53c4",
      "@type": "hdf5:File",
      "hdf5:rootGroup": {
        "@id": "_:Ned95842074914f2d81088c185de05d46",
        "@type": "hdf5:Group",
        "hdf5:member": [
          {
            "@id": "https://orcid.org/0000-0001-8729-0482",
            "@type": [
              "hdf5:Group",
              "foaf:Person"
            ],
            "foaf:firstName": "Matthias",
            "hdf5:attribute": [
              {
                "@id": "_:N2876255d964141228159780f7529c057",
                "@type": "hdf5:Attribute",
                "hdf5:name": "first_name",
                "hdf5:value": "Matthias"
              },
              {
                "@id": "_:N4d499b0fb0574507a2bd795a263d4f95",
                "@type": "hdf5:Attribute",
                "hdf5:name": "orcid",
                "hdf5:value": {
                  "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
                  "@value": "https://orcid.org/0000-0001-8729-0482"
                }
              }
            ],
            "hdf5:name": "/contact",
            "m4i:orcidId": {
              "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
              "@value": "https://orcid.org/0000-0001-8729-0482"
            }
          },
          {
            "@id": "_:N98522107156844909950b3b966ef0638",
            "@type": "hdf5:Group",
            "hdf5:member": {
              "@id": "_:N9fb9ccc69ae140098536046df253b7ea",
              "@type": "hdf5:Dataset",
              "hdf5:attribute": [
                {
                  "@id": "_:Ne722f7465524431fb1c1d1f7064d96bc",
                  "@type": "hdf5:Attribute",
                  "hdf5:name": "quantity_kind",
                  "hdf5:value": [
                    "velocity",
                    {
                      "@id": "http://qudt.org/vocab/quantitykind/Velocity",
                      "http://www.w3.org/2004/02/skos/core#prefLabel": "velocity"
                    }
                  ]
                },
                {
                  "@id": "_:Nb134d8d2306842d792d09f693ef0a833",
                  "@type": "hdf5:Attribute",
                  "hdf5:name": "units",
                  "hdf5:value": [
                    "m/s",
                    {
                      "@id": "http://qudt.org/vocab/unit/M-PER-SEC",
                      "http://www.w3.org/2004/02/skos/core#prefLabel": "m/s"
                    }
                  ]
                }
              ],
              "hdf5:datatype": "H5T_FLOAT",
              "hdf5:dimension": 1,
              "hdf5:name": "/grp/random_velocity",
              "hdf5:size": 100,
              "m4i:hasKindOfQuantity": {
                "@id": "http://qudt.org/vocab/quantitykind/Velocity"
              },
              "m4i:hasUnit": {
                "@id": "http://qudt.org/vocab/unit/M-PER-SEC"
              }
            },
            "hdf5:name": "/grp"
          },
          {
            "@id": "m4i:ProcessingStep",
            "@type": "hdf5:Group",
            "hdf5:attribute": [
              {
                "@id": "_:N9c8bd14fab634010a2fdd7c75a6305ed",
                "@type": "hdf5:Attribute",
                "hdf5:name": "end_time",
                "hdf5:value": "2025-02-08T18:25:56.866437"
              },
              {
                "@id": "_:N632d9498a3d141609a68f77bff6f1d22",
                "@type": "hdf5:Attribute",
                "hdf5:name": "has_participants",
                "hdf5:value": "/contact"
              },
              {
                "@id": "_:N53bea4e5125a47d49cd420d5c2e78581",
                "@type": "hdf5:Attribute",
                "hdf5:name": "output",
                "hdf5:value": "/grp/random_velocity"
              },
              {
                "@id": "_:N22fc83d7f4604bb1bac46de88ffe73e6",
                "@type": "hdf5:Attribute",
                "hdf5:name": "start_time",
                "hdf5:value": "2025-02-08T18:25:56.866432"
              }
            ],
            "hdf5:name": "/processing_info",
            "obo:RO_0000057": "/contact",
            "obo:RO_0002234": "/grp/random_velocity",
            "schema:startTime": [
              "2025-02-08T18:25:56.866437",
              "2025-02-08T18:25:56.866432"
            ]
          }
        ],
        "hdf5:name": "/"
      }
    }
  ]
}
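The keys in the output above are compact IRIs such as `m4i:orcidId`, which are resolved against the prefixes in "@context". The following sketch shows this expansion in plain Python; the helper `expand` is hypothetical and covers only simple prefix terms, not the full JSON-LD expansion algorithm.

```python
# Prefix map copied from the "@context" of the output above
context = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "hdf5": "http://purl.allotrope.org/ontologies/hdf5/1.8#",
    "m4i": "http://w3id.org/nfdi4ing/metadata4ing#",
    "obo": "http://purl.obolibrary.org/obo/",
    "schema": "https://schema.org/",
}


def expand(term, ctx):
    """Expand 'prefix:suffix' to a full IRI if the prefix is defined in ctx."""
    prefix, sep, suffix = term.partition(":")
    if sep and prefix in ctx and not suffix.startswith("//"):
        return ctx[prefix] + suffix
    # Full IRIs, keywords such as "@id" and blank nodes ("_:...") stay unchanged
    return term


print(expand("m4i:orcidId", context))
print(expand("obo:RO_0000057", context))
```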

Dump it to a file rather than to the screen:

from h5rdmtoolbox import jsonld

# Write the JSON-LD representation to a file instead of printing it
with open('hdf_meta.jsonld', 'w') as f:
    jsonld.dump(hdf_filename, f, indent=2,
                context={'m4i': 'http://w3id.org/nfdi4ing/metadata4ing#',
                         'foaf': 'http://xmlns.com/foaf/0.1/'})

3. Access IRI in code#

You may want to look up the IRI of an HDF5 object or attribute from Python. For example, while working with the file, you may ask “Hey, what is ‘contact’ exactly?” or “What does the attribute ‘orcid’ mean?”

with h5tbx.File(hdf_filename) as h5:
    person_iri = h5.contact.rdf.subject
    orcid_iri = h5.contact.rdf.predicate['orcid']

… Well, “contact” is identified by the following IRI, the ORCID of the person (its type, “Person”, is defined by the FOAF ontology):

person_iri

'https://orcid.org/0000-0001-8729-0482'

… and “orcid” is a predicate defined by the metadata4ing ontology:

orcid_iri

'http://w3id.org/nfdi4ing/metadata4ing#orcidId'

3.1 Find data based on IRIs#

from rdflib import Graph

# Load the JSON-LD file written above into an RDF graph
graph = Graph()
graph.parse('hdf_meta.jsonld', format='json-ld')

<Graph identifier=N4807bf7e412e4f11a930b98e82909f73 (<class 'rdflib.graph.Graph'>)>

Note that we need to provide the prefixes in the query if the JSON-LD data/file does not include the context.

res = graph.query("""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX m4i: <http://w3id.org/nfdi4ing/metadata4ing#>

SELECT ?id ?orcid
WHERE {
    ?id a foaf:Person .
    ?id m4i:orcidId ?orcid .
}
""")

for r in res:
    print(r)
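To make explicit what the two query patterns ask for, here is a sketch of the same lookup in plain Python over a hand-written triple list. It does not use rdflib; the triples are copied by hand from the contact group shown earlier, and the ORCID predicate is the IRI printed for `orcid_iri` above.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_FIRST_NAME = "http://xmlns.com/foaf/0.1/firstName"
M4I_ORCID_ID = "http://w3id.org/nfdi4ing/metadata4ing#orcidId"

orcid = "https://orcid.org/0000-0001-8729-0482"

# Hand-written triples describing the group "contact"
triples = [
    (orcid, RDF_TYPE, FOAF_PERSON),
    (orcid, M4I_ORCID_ID, orcid),
    (orcid, FOAF_FIRST_NAME, "Matthias"),
]

# First pattern: every subject that is a foaf:Person
persons = {s for s, p, o in triples if p == RDF_TYPE and o == FOAF_PERSON}

# Second pattern, joined on the subject: the ORCID value of each of these persons
rows = [(s, o) for s, p, o in triples if p == M4I_ORCID_ID and s in persons]

for row in rows:
    print(row)
```

A row is only returned if both patterns match the same subject, so a predicate IRI that does not occur in the data yields an empty result.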

4. Examples#

4.1 Read metadata from JSON and write to HDF5#

Suppose we want to store information about the software used in the HDF5 file. This information exists as a JSON-LD file based on the codemeta ontology. For this example, we use the h5rdmtoolbox codemeta.json file from the GitHub repository:

from h5rdmtoolbox.utils import download_file
from pprint import pprint

Download the file:

codemeta_url = 'https://raw.githubusercontent.com/matthiasprobst/h5RDMtoolbox/main/codemeta.json'
downloaded_filename = download_file(codemeta_url)

2025-02-08_18:25:57,95 WARNING  [utils.py:70] No hash given! This is recommended when downloading files from the web.

Read the data with ontolutils.dquery:

from ontolutils import dquery

data = dquery(subject='schema:SoftwareSourceCode',
              source=downloaded_filename,
              context={"schema": "http://schema.org/"})
pprint(data[0])

{'@context': {'applicationCategory': 'http://schema.org/applicationCategory',
              'author': 'http://schema.org/author',
              'codeRepository': 'http://schema.org/codeRepository',
              'description': 'http://schema.org/description',
              'license': 'http://schema.org/license',
              'name': 'http://schema.org/name',
              'operatingSystem': 'http://schema.org/operatingSystem',
              'programmingLanguage': 'http://schema.org/programmingLanguage',
              'version': 'http://schema.org/version'},
 '@id': '_:N16ed2ed22a90463fb1ae5448a299053e',
 '@type': 'http://schema.org/SoftwareSourceCode',
 'applicationCategory': 'file:///home/docs/.cache/h5rdmtoolbox/1.6.2/Engineering',
 'author': [{'@id': 'https://orcid.org/0000-0001-9560-500X',
             '@type': 'http://schema.org/Person',
             'affiliation': {'@id': 'https://ror.org/04t3en479',
                             '@type': 'http://schema.org/Organization',
                             'name': 'Karlsruhe Institute of Technology, '
                                     'Institute of Thermal Turbomachinery'},
             'familyName': 'Pritz',
             'givenName': 'Balazs'},
            {'@id': 'https://orcid.org/0000-0001-8729-0482',
             '@type': 'http://schema.org/Person',
             'affiliation': {'@id': 'https://ror.org/04t3en479',
                             '@type': 'http://schema.org/Organization',
                             'name': 'Karlsruhe Institute of Technology, '
                                     'Institute of Thermal Turbomachinery'},
             'email': 'matth.probst@gmail.com',
             'familyName': 'Probst',
             'givenName': 'Matthias'},
            {'@id': 'https://orcid.org/0000-0002-4116-0065',
             '@type': 'http://schema.org/Person',
             'affiliation': {'@id': 'https://ror.org/04t3en479',
                             '@type': 'http://schema.org/Organization',
                             'name': 'Karlsruhe Institute of Technology, '
                                     'Institute of Thermal Turbomachinery'},
             'familyName': 'Büttner',
             'givenName': 'Lucas'}],
 'codeRepository': 'git+https://github.com/matthiasprobst/h5RDMtoolbox.git',
 'description': 'Supporting a FAIR Research Data lifecycle using Python and '
                'HDF5.',
 'license': 'https://spdx.org/licenses/MIT',
 'name': 'h5RDMtoolbox',
 'operatingSystem': ['Linux', 'Windows', 'macOS'],
 'programmingLanguage': ['Python 3',
                         'Python 3.8',
                         'Python 3.9',
                         'Python 3.10',
                         'Python 3.11',
                         'Python 3.12'],
 'version': '1.6.2'}

The data are written into the HDF5 file by using jsonld.to_hdf():

with h5tbx.File('test.hdf', 'w') as h5:
    jsonld.to_hdf(data=data[0],
                 grp=h5.create_group('software_code'))
    h5.dump(False)

/(1)
- software_code @type: http://schema.org/SoftwareSourceCode(3)
  - applicationCategory http://schema.org/applicationCategory: ///home/docs/.cache/h5rdmtoolbox/1.6.2/Engineering
  - codeRepository http://schema.org/codeRepository: //github.com/matthiasprobst/h5RDMtoolbox.git
  - description http://schema.org/description: Supporting a FAIR Research Data lifecycle using Python and HDF5.
  - license http://schema.org/license: https://spdx.org/licenses/MIT
  - name http://schema.org/name: h5RDMtoolbox
  - operatingSystem http://schema.org/operatingSystem: Linux, Windows, macOS
  - programmingLanguage http://schema.org/programmingLanguage: Python 3, Python 3.8, Python 3.9, Python 3.10, Python 3.11, Python 3.12
  - version http://schema.org/version: 1.6.2
  - author1 @type: http://schema.org/Person @id: https://orcid.org/0000-0001-9560-500X http://schema.org/author(1)
    - familyName: Pritz
    - givenName: Balazs
    - affiliation @type: http://schema.org/Organization @id: https://ror.org/04t3en479(0)
      - name http://schema.org/name: Karlsruhe Institute of Technology, Institute of Thermal Turbomachinery
  - author2 @type: http://schema.org/Person @id: https://orcid.org/0000-0001-8729-0482 http://schema.org/author(1)
    - email: matth.probst@gmail.com
    - familyName: Probst
    - givenName: Matthias
    - affiliation @type: http://schema.org/Organization @id: https://ror.org/04t3en479(0)
      - name http://schema.org/name: Karlsruhe Institute of Technology, Institute of Thermal Turbomachinery
  - author3 @type: http://schema.org/Person @id: https://orcid.org/0000-0002-4116-0065 http://schema.org/author(1)
    - familyName: Büttner
    - givenName: Lucas
    - affiliation @type: http://schema.org/Organization @id: https://ror.org/04t3en479(0)
      - name http://schema.org/name: Karlsruhe Institute of Technology, Institute of Thermal Turbomachinery
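The layout visible in the dump above (scalars become attributes, nested dictionaries become subgroups, lists of dictionaries become numbered subgroups such as author1, author2) can be imitated with a small recursive function. This is not the implementation of jsonld.to_hdf(), only a sketch of the mapping as it appears in the output; the function `flatten` and the shortened `record` are hypothetical.

```python
def flatten(node, path=""):
    """Yield (group_path, attribute_name, value) for a nested JSON-LD-like dict."""
    for key, value in node.items():
        if key.startswith("@"):
            continue  # JSON-LD keywords carry RDF information, not attribute data
        if isinstance(value, dict):
            # Nested object -> subgroup
            yield from flatten(value, f"{path}/{key}")
        elif isinstance(value, list) and all(isinstance(v, dict) for v in value):
            # List of objects -> numbered subgroups (author1, author2, ...)
            for i, item in enumerate(value, start=1):
                yield from flatten(item, f"{path}/{key}{i}")
        elif isinstance(value, list):
            # List of scalars -> one comma-separated attribute value
            yield path or "/", key, ", ".join(str(v) for v in value)
        else:
            yield path or "/", key, value


# Shortened, hand-written excerpt of the codemeta record shown above
record = {
    "@type": "http://schema.org/SoftwareSourceCode",
    "name": "h5RDMtoolbox",
    "operatingSystem": ["Linux", "Windows", "macOS"],
    "author": [
        {"@id": "https://orcid.org/0000-0001-9560-500X", "familyName": "Pritz"},
        {"@id": "https://orcid.org/0000-0001-8729-0482", "familyName": "Probst"},
    ],
}

rows = list(flatten(record, "/software_code"))
for group, name, value in rows:
    print(group, name, value)
```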

import numpy as np
import ontolutils
from pprint import pprint

with h5tbx.File(mode='w') as h5:
    _ = h5.create_dataset('test_dataset', data=np.array([[1, 2], [3, 4], [5.4, 1.9]]))
    _ = h5.create_dataset('test_dataset 2', data=4.5)
    h5.create_dataset('grp/subgrp/vel', data=4)
    # Assign the IRI schema:name to the root attribute "name"
    h5.attrs['name', ontolutils.SCHEMA.name] = 'test attr'
    jd = jsonld.dumpd(h5, structural=True)  # JSON-LD as a dictionary
    jds = jsonld.dumps(h5, structural=True, indent=2)  # JSON-LD as a string

pprint(jd, indent=1)

{'@context': {'hdf5': 'http://purl.allotrope.org/ontologies/hdf5/1.8#',
              'schema': 'https://schema.org/'},
 '@graph': [{'@id': '_:N68426685fdd847a9bf851a64f9224af2',
             '@type': 'hdf5:File',
             'hdf5:rootGroup': {'@id': '_:N105fe10f726c4f598cbb5f7a02ae8cd4',
                                '@type': 'hdf5:Group',
                                'hdf5:attribute': {'@id': '_:N0f4c18d313454375a72694259570dc88',
                                                   '@type': 'hdf5:Attribute',
                                                   'hdf5:name': 'name',
                                                   'hdf5:value': 'test attr'},
                                'hdf5:member': [{'@id': '_:Nc09f66ce0f034b3fa74f946857778668',
                                                 '@type': 'hdf5:Group',
                                                 'hdf5:member': {'@id': '_:N848bba62383e467ea4ee87af6161113f',
                                                                 '@type': 'hdf5:Group',
                                                                 'hdf5:member': {'@id': '_:N5fe689468dcd4ef5aa022c2b89199b6d',
                                                                                 '@type': 'hdf5:Dataset',
                                                                                 'hdf5:datatype': 'H5T_INTEGER',
                                                                                 'hdf5:dimension': 0,
                                                                                 'hdf5:name': '/grp/subgrp/vel',
                                                                                 'hdf5:size': 1,
                                                                                 'hdf5:value': '4'},
                                                                 'hdf5:name': '/grp/subgrp'},
                                                 'hdf5:name': '/grp'},
                                                {'@id': '_:Nc3ba532f96f240cfafd0293f546701e3',
                                                 '@type': 'hdf5:Dataset',
                                                 'hdf5:datatype': 'H5T_FLOAT',
                                                 'hdf5:dimension': 2,
                                                 'hdf5:name': '/test_dataset',
                                                 'hdf5:size': 6,
                                                 'hdf5:value': '[[1.0, 2.0], '
                                                               '[3.0, 4.0], '
                                                               '[5.4, 1.9]]'},
                                                {'@id': '_:Nee51ec48994d4e6c957434ca9923d73a',
                                                 '@type': 'hdf5:Dataset',
                                                 'hdf5:datatype': 'H5T_FLOAT',
                                                 'hdf5:dimension': 0,
                                                 'hdf5:name': '/test_dataset 2',
                                                 'hdf5:size': 1,
                                                 'hdf5:value': 4.5}],
                                'hdf5:name': '/',
                                'schema:name': 'test attr'}}]}
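A structural dump like the one above can be traversed with a few lines of plain Python, for example to list all datasets. The sketch below works on a shortened, hand-written copy of the root group (blank-node IDs and values omitted); the helper `iter_datasets` is hypothetical. Note that "hdf5:member" holds a single dictionary when a group has one member and a list when it has several, so both cases are handled.

```python
def iter_datasets(node):
    """Yield every node of type hdf5:Dataset at or below the given node."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    if "hdf5:Dataset" in types:
        yield node
    members = node.get("hdf5:member", [])
    if isinstance(members, dict):  # a single member is not wrapped in a list
        members = [members]
    for member in members:
        yield from iter_datasets(member)


# Shortened copy of the "hdf5:rootGroup" entry from the output above
root = {
    "@type": "hdf5:Group",
    "hdf5:name": "/",
    "hdf5:member": [
        {
            "@type": "hdf5:Group",
            "hdf5:name": "/grp",
            "hdf5:member": {
                "@type": "hdf5:Group",
                "hdf5:name": "/grp/subgrp",
                "hdf5:member": {
                    "@type": "hdf5:Dataset",
                    "hdf5:name": "/grp/subgrp/vel",
                    "hdf5:datatype": "H5T_INTEGER",
                    "hdf5:size": 1,
                },
            },
        },
        {"@type": "hdf5:Dataset", "hdf5:name": "/test_dataset",
         "hdf5:datatype": "H5T_FLOAT", "hdf5:size": 6},
        {"@type": "hdf5:Dataset", "hdf5:name": "/test_dataset 2",
         "hdf5:datatype": "H5T_FLOAT", "hdf5:size": 1},
    ],
}

names = [d["hdf5:name"] for d in iter_datasets(root)]
print(names)
```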

Describing attribute meanings without RDF#

Sometimes there is no IRI defined (yet), but an additional comment on the attribute is still needed. This can be done as follows:

with h5tbx.File() as h5:
    grp = h5.create_group('contact')

    # Set an attribute as usual
    grp.attrs['type'] = 'Contact'

    # Update the attribute definition afterwards:
    grp.rdf['type'].definition = 'The role of the Person'

    # Alternatively, it can be assigned simultaneously via h5tbx.Attribute:
    grp.attrs['fname'] = h5tbx.Attribute(value='Matthias',
                                        definition='The first name of the contact')
    h5.dump(False)

    jdict = h5.dump_jsonld(h5.hdf_filename, indent=2)

/(1)
- contact(0)
  - fname
    The first name of the contact
    : Matthias
  - type
    The role of the Person
    : Contact

print(jdict)

{
  "@context": {
    "hdf5": "http://purl.allotrope.org/ontologies/hdf5/1.8#",
    "skos": "http://www.w3.org/2004/02/skos/core#"
  },
  "@graph": [
    {
      "@id": "_:N23153257b06049459e384cd5ec85d23a",
      "@type": "hdf5:File",
      "hdf5:rootGroup": {
        "@id": "_:Nbb2fa720b14b4eb1934f83615da0003c",
        "@type": "hdf5:Group",
        "hdf5:member": {
          "@id": "_:N06b39eac8e4f46cb89ab002c9ff7b816",
          "@type": "hdf5:Group",
          "hdf5:attribute": [
            {
              "@id": "_:Nf43866a6a34d48928333167defb9c093",
              "@type": "hdf5:Attribute",
              "hdf5:name": "fname",
              "hdf5:value": "Matthias",
              "skos:definition": "The first name of the contact"
            },
            {
              "@id": "_:N92a9383ef06d43929063f38c1959e5c5",
              "@type": "hdf5:Attribute",
              "hdf5:name": "type",
              "hdf5:value": "Contact",
              "skos:definition": "The role of the Person"
            }
          ],
          "hdf5:name": "/contact"
        },
        "hdf5:name": "/"
      }
    }
  ]
}
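Because the definitions end up as "skos:definition" entries, they can be collected from the JSON-LD string into a small glossary. The following sketch uses only the standard library and a shortened, hand-written copy of the output above; the names `iter_nodes` and `glossary` are hypothetical.

```python
import json

# Shortened copy of the JSON-LD string printed above (blank-node IDs omitted)
jsonld_text = """
{
  "@graph": [
    {
      "@type": "hdf5:File",
      "hdf5:rootGroup": {
        "@type": "hdf5:Group",
        "hdf5:member": {
          "@type": "hdf5:Group",
          "hdf5:attribute": [
            {"@type": "hdf5:Attribute", "hdf5:name": "fname",
             "hdf5:value": "Matthias",
             "skos:definition": "The first name of the contact"},
            {"@type": "hdf5:Attribute", "hdf5:name": "type",
             "hdf5:value": "Contact",
             "skos:definition": "The role of the Person"}
          ],
          "hdf5:name": "/contact"
        },
        "hdf5:name": "/"
      }
    }
  ]
}
"""


def iter_nodes(obj):
    """Yield every dictionary found anywhere inside a parsed JSON structure."""
    if isinstance(obj, dict):
        yield obj
        for value in obj.values():
            yield from iter_nodes(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_nodes(item)


doc = json.loads(jsonld_text)

# Map attribute name -> human-readable definition
glossary = {
    node["hdf5:name"]: node["skos:definition"]
    for node in iter_nodes(doc)
    if "skos:definition" in node
}
print(glossary)
```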
