HDF5 and RDF: Toward FAIR Attributes

HDF5 files are often described as self-describing, meaning they contain internal metadata about their structure, such as groups, datasets, datatypes, and attributes. Tools like h5py, HDFView, or h5dump can parse and display this structure without external documentation.

However, this self-description is:

Structural
Syntactic
Low-level

It tells you what is stored, but not what it means. For example, an attribute named "units" with the value "counts" says nothing about whether it refers to photon counts, electrical pulses, or normalized integers — and it’s unlikely to align with shared standards or ontologies.

To make HDF5 data understandable, interoperable, and reusable, especially by machines, semantic annotation is essential.

FAIR Principle F1: Use Globally Unique and Persistent Identifiers#

According to F1 of the FAIR Principles:

“Globally unique and persistent identifiers remove ambiguity in the meaning of your published data by assigning a unique identifier to every element of metadata and every concept/measurement in your dataset. In this context, identifiers consist of an internet link (e.g., a URL that resolves to a web page that defines the concept such as a particular human protein). Many data repositories will automatically generate globally unique and persistent identifiers to deposited datasets. Identifiers can help other people understand exactly what you mean, and they allow computers to interpret your data in a meaningful way (i.e., computers that are searching for your data or trying to automatically integrate them). Identifiers are essential to the human-machine interoperation that is key to the vision of Open Science. In addition, identifiers will help others to properly cite your work when reusing your data.”

In this context, each concept or attribute — such as a physical unit, method, instrument, or material — should be identified using a globally resolvable identifier, such as a URI or IRI.
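As a rough illustration of what "globally resolvable" means in practice, the following sketch checks whether a metadata value is an absolute HTTP(S) identifier or just a bare label. The function name `looks_like_resolvable_iri` is hypothetical and not part of any library discussed here. The check is purely syntactic and says nothing about persistence or whether the link actually resolves.

```python
from urllib.parse import urlparse


def looks_like_resolvable_iri(identifier: str) -> bool:
    """Return True if the string has an http(s) scheme and a host part."""
    parts = urlparse(identifier)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# A bare label versus identifiers that point to a defined concept
for candidate in ("counts",
                  "http://qudt.org/vocab/unit/M-PER-SEC",
                  "https://orcid.org/0000-0001-8729-0482"):
    print(candidate, looks_like_resolvable_iri(candidate))
```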

Using `h5rdmtoolbox` for Semantic Annotation#

The h5rdmtoolbox provides functionality to link HDF5 attributes to identifiers, enabling FAIR-compliant metadata. For each attribute, both the name and the value can be associated with an IRI (Internationalized Resource Identifier), connecting the metadata to shared vocabularies, ontologies, or data catalogs.

The following section demonstrates how to annotate HDF5 attributes with semantic identifiers using h5rdmtoolbox.

Concept: Representing HDF5 Metadata as RDF Triples#

HDF5 metadata can be interpreted in terms of RDF triples, the foundational structure of the Semantic Web. An RDF triple consists of:

a subject – the thing being described (e.g., a group or dataset)
a predicate – the property or relationship (e.g., an attribute name)
an object – the value or target of the property (e.g., an attribute value)

So, each attribute in an HDF5 file can naturally be viewed as a semantic statement:

subject = HDF5 object (group or dataset)
predicate = attribute name
object = attribute value

This interpretation enables the transformation of binary HDF5 metadata into structured, queryable, and machine-interpretable knowledge.
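The mapping from attributes to triples can be sketched without any HDF5 library. In the minimal example below, a plain dictionary stands in for an HDF5 file (object path mapped to its attributes). The dictionary `hdf_like` and the function `attributes_as_triples` are illustrative assumptions, not part of h5rdmtoolbox.

```python
from typing import Any, Dict, Iterator, Tuple

Triple = Tuple[str, str, Any]

# Hypothetical in-memory stand-in for an HDF5 file: object path -> attributes
hdf_like: Dict[str, Dict[str, Any]] = {
    "/contact": {"first_name": "Matthias"},
    "/grp/random_velocity": {"units": "m/s", "quantity_kind": "velocity"},
}


def attributes_as_triples(objects: Dict[str, Dict[str, Any]]) -> Iterator[Triple]:
    """Yield one (subject, predicate, object) triple per attribute."""
    for path, attrs in objects.items():
        for name, value in attrs.items():
            # subject = HDF5 object, predicate = attribute name, object = value
            yield (path, name, value)


triples = list(attributes_as_triples(hdf_like))
for triple in triples:
    print(triple)
```

At this stage the subjects and predicates are still local names; they only become unambiguous once they are replaced by IRIs.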

From Human Understanding to Machine Interpretability#

Humans may be able to understand the contents of an HDF5 file based on its naming conventions or documentation. For example, a dataset of random data might include an attribute like "creator": "Alice", and we understand that “Alice” refers to a person.

However, machines cannot reliably interpret such informal metadata. To make meaning explicit and unambiguous, we must associate globally unique identifiers, such as URIs, with HDF5 components.

For example, the attribute "contact" is ambiguous: is it a person or an organization? A well-chosen URI can clarify this by linking to a concept from a standard vocabulary or ontology.
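One way to picture this step is a lookup table from local attribute names to predicate IRIs. The sketch below uses the FOAF and metadata4ing IRIs that appear later on this page and emits N-Triples-style lines. The helper `to_ntriples` and the extra attribute `note` are made up for illustration, and writing every value as a plain string literal is a simplification.

```python
# Predicate IRIs as used in the example output later on this page
PREDICATE_IRIS = {
    "first_name": "http://xmlns.com/foaf/0.1/firstName",
    "orcid": "http://w3id.org/nfdi4ing/metadata4ing#orcidId",
}


def to_ntriples(subject_iri, attrs, predicate_iris):
    """Return one N-Triples-style line per attribute that has a known IRI."""
    lines = []
    for name, value in attrs.items():
        predicate = predicate_iris.get(name)
        if predicate is None:
            continue  # no shared meaning known for this attribute name
        # Escape backslashes and double quotes inside the literal
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject_iri}> <{predicate}> "{escaped}" .')
    return lines


contact_attrs = {
    "first_name": "Matthias",
    "orcid": "https://orcid.org/0000-0001-8729-0482",
    "note": "internal",  # hypothetical attribute without an IRI
}
lines = to_ntriples("https://orcid.org/0000-0001-8729-0482",
                    contact_attrs, PREDICATE_IRIS)
print("\n".join(lines))
```

Attributes without an entry in the table are skipped, which mirrors the situation of informal metadata that a machine cannot interpret.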

In the following section, we will semantically annotate an HDF5 file that contains a dataset of random values, along with metadata about its creator — identified using an ORCID iD. This demonstrates how to convert conventional metadata into machine-readable RDF.

import h5rdmtoolbox as h5tbx

How to make use of the FAIR HDF5 file?#

The IRI assignments above can be used in three ways:

Visual inspection: dumping the content to the screen outlines the file's data and metadata, and clicking an attribute's IRI opens the resource that explains it.
JSON-LD extraction: the metadata can be exported to a JSON-LD file for use by other processes, or inspected further with tools such as the JSON-LD Playground.
Access in code: the IRIs can be read directly in Python.

1. Visual inspection#

The dump() method now adds an icon to each attribute that has an IRI. Clicking an icon opens the linked resource:

h5tbx.dump(hdf_filename, collapsed=False)

/(3)
- contact @type: http://xmlns.com/foaf/0.1/Person @id: https://orcid.org/0000-0001-8729-0482 https://schema.org/author(0)
  - first_name http://xmlns.com/foaf/0.1/firstName: Matthias
  - orcid http://w3id.org/nfdi4ing/metadata4ing#orcidId: https://orcid.org/0000-0001-8729-0482
- grp(1)
- processing_info @id: http://w3id.org/nfdi4ing/metadata4ing#ProcessingStep(0)
  - end_time https://schema.org/startTime: 2025-08-15T13:53:53.388446
  - has_participants http://purl.obolibrary.org/obo/RO_0000057: /contact
  - output http://purl.obolibrary.org/obo/RO_0002234: /grp/random_velocity
  - start_time https://schema.org/startTime: 2025-08-15T13:53:53.388441

2. JSON-LD extraction#

Write the metadata to a JSON-LD or Turtle (ttl) file and share it with others or deposit it in a repository. The toolbox provides dump methods through the jsonld module and, in newer versions of h5tbx, the serialize method, which can write several linked data formats.

The output may look overwhelming at first, but scripts can process it reliably, and humans can still read it with some practice and patience:

print(
    h5tbx.serialize(
        hdf_filename,
        format="ttl",
        indent=2,
        context={'m4i': 'http://w3id.org/nfdi4ing/metadata4ing#',
                 'foaf': 'http://xmlns.com/foaf/0.1/',
                 'obo': 'http://purl.obolibrary.org/obo/'}
    )
)

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .
@prefix m4i: <http://w3id.org/nfdi4ing/metadata4ing#> .
@prefix ns1: <http://purl.obolibrary.org/obo/> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

hdf:H5T_IEEE_F64LE a hdf:Datatype .

m4i:ProcessingStep ns1:RO_0000057 "/contact"^^xsd:string ;
    ns1:RO_0002234 "/grp/random_velocity"^^xsd:string ;
    schema:startTime "2025-08-15T13:53:53.388441"^^xsd:string,
        "2025-08-15T13:53:53.388446"^^xsd:string .

<https://orcid.org/0000-0001-8729-0482> a foaf:Person ;
    m4i:orcidId "https://orcid.org/0000-0001-8729-0482"^^xsd:string ;
    foaf:firstName "Matthias"^^xsd:string .

[] a hdf:File ;
    hdf:rootGroup [ a hdf:Group ;
            hdf:member [ a hdf:Group ;
                    hdf:attribute [ a hdf:StringAttribute ;
                            hdf:data "Matthias"^^xsd:string ;
                            hdf:name "first_name" ],
                        [ a hdf:StringAttribute ;
                            hdf:data "https://orcid.org/0000-0001-8729-0482"^^xsd:string ;
                            hdf:name "orcid" ] ;
                    hdf:name "/contact"^^xsd:string ;
                    dcterms:relation <https://orcid.org/0000-0001-8729-0482> ],
                [ a hdf:Group ;
                    hdf:member [ a hdf:Dataset ;
                            hdf:attribute [ a hdf:StringAttribute ;
                                    hdf:data "velocity"^^xsd:string ;
                                    hdf:name "quantity_kind" ],
                                [ a hdf:StringAttribute ;
                                    hdf:data "m/s"^^xsd:string ;
                                    hdf:name "units" ] ;
                            hdf:dataspace [ a hdf:SimpleDataspace ;
                                    hdf:dimension [ a hdf:DataspaceDimension ;
                                            hdf:dimensionIndex 0 ;
                                            hdf:size 100 ] ] ;
                            hdf:datatype hdf:H5T_IEEE_F64LE,
                                "H5T_FLOAT" ;
                            hdf:layout hdf:H5D_CONTIGUOUS ;
                            hdf:maximumSize 100 ;
                            hdf:name "/grp/random_velocity" ;
                            hdf:rank 1 ;
                            hdf:size 100 ;
                            m4i:hasKindOfQuantity <http://qudt.org/vocab/quantitykind/Velocity> ;
                            m4i:hasUnit <http://qudt.org/vocab/unit/M-PER-SEC> ] ;
                    hdf:name "/grp"^^xsd:string ],
                [ a hdf:Group ;
                    hdf:attribute [ a hdf:StringAttribute ;
                            hdf:data "2025-08-15T13:53:53.388446"^^xsd:string ;
                            hdf:name "end_time" ],
                        [ a hdf:StringAttribute ;
                            hdf:data "/contact"^^xsd:string ;
                            hdf:name "has_participants" ],
                        [ a hdf:StringAttribute ;
                            hdf:data "/grp/random_velocity"^^xsd:string ;
                            hdf:name "output" ],
                        [ a hdf:StringAttribute ;
                            hdf:data "2025-08-15T13:53:53.388441"^^xsd:string ;
                            hdf:name "start_time" ] ;
                    hdf:name "/processing_info"^^xsd:string ;
                    dcterms:relation m4i:ProcessingStep ] ;
            hdf:name "/"^^xsd:string ] .
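The role of the context passed to serialize, and of the @prefix lines in the output, can be illustrated with a small stdlib-only sketch: a prefix table lets full IRIs be shortened to prefixed names and expanded again. The functions `compact_iri` and `expand_iri` are hypothetical helpers written for this illustration and are not the toolbox's or rdflib's implementation.

```python
CONTEXT = {
    "m4i": "http://w3id.org/nfdi4ing/metadata4ing#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "obo": "http://purl.obolibrary.org/obo/",
}


def compact_iri(iri, context):
    """Shorten an IRI to prefix:local using the longest matching namespace."""
    best = None
    for prefix, namespace in context.items():
        if iri.startswith(namespace):
            if best is None or len(namespace) > len(best[1]):
                best = (prefix, namespace)
    if best is None:
        return iri  # no known namespace, keep the full IRI
    return f"{best[0]}:{iri[len(best[1]):]}"


def expand_iri(name, context):
    """Expand prefix:local back to a full IRI if the prefix is known."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return name


print(compact_iri("http://xmlns.com/foaf/0.1/Person", CONTEXT))
print(expand_iri("obo:RO_0000057", CONTEXT))
```

IRIs outside the table, such as an ORCID iD, pass through unchanged in both directions.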

3. Access IRI in code#

You may want to look up the IRI of an attribute from Python while working with the HDF5 file, for example to answer “What exactly is ‘contact’?” or “What does the attribute ‘orcid’ mean?”

with h5tbx.File(hdf_filename) as h5:
    person_iri = h5.contact.rdf.subject
    orcid_iri = h5.contact.rdf.predicate['orcid']

… “contact” is identified by its subject IRI, here an ORCID iD (the group itself is typed as a FOAF Person):

person_iri

'https://orcid.org/0000-0001-8729-0482'

… and “orcid” is a predicate defined by the metadata4ing ontology:

orcid_iri

'http://w3id.org/nfdi4ing/metadata4ing#orcidId'

3.1 Find data based on IRIs#

from rdflib import Graph

graph = Graph()
graph.parse('hdf_meta.jsonld', format='json-ld')

<Graph identifier=Nc46e35ad2cdb44aa8c4f57ac8269bb20 (<class 'rdflib.graph.Graph'>)>

Note that the PREFIX declarations must be provided in the query if the JSON-LD file does not include the context.

res = graph.query("""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX m4i: <http://w3id.org/nfdi4ing/metadata4ing#>

SELECT ?id ?orcid
WHERE {
    ?id a foaf:Person .
    ?id m4i:orcidId ?orcid .
}
""")

for r in res:
    print(r)
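To see what such a query does conceptually, the same two triple patterns can be matched by hand over a list of tuples. This is a stdlib-only sketch, not how rdflib evaluates SPARQL. The triples are copied from the Turtle output shown earlier, with the predicate m4i:orcidId as it appears there, and the function name `persons_with_orcid` is made up for illustration.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
M4I = "http://w3id.org/nfdi4ing/metadata4ing#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PERSON = "https://orcid.org/0000-0001-8729-0482"

# Triples taken from the serialized example above
triples = [
    (PERSON, RDF_TYPE, FOAF + "Person"),
    (PERSON, M4I + "orcidId", "https://orcid.org/0000-0001-8729-0482"),
    (PERSON, FOAF + "firstName", "Matthias"),
]


def persons_with_orcid(triples):
    """Match '?id a foaf:Person' and '?id m4i:orcidId ?orcid' on the same ?id."""
    # First pattern: collect every subject typed as foaf:Person
    persons = {s for s, p, o in triples if p == RDF_TYPE and o == FOAF + "Person"}
    # Second pattern: keep orcidId statements whose subject is such a person
    return [(s, o) for s, p, o in triples if p == M4I + "orcidId" and s in persons]


matches = persons_with_orcid(triples)
for row in matches:
    print(row)
```

Both patterns share the variable ?id, which is why the second step only keeps subjects found in the first.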

4. Examples:#

4.1 Read metadata from JSON and write to HDF5#

Suppose we want to store information about the software used in the HDF5 file. This information exists as a JSON-LD file based on the codemeta ontology. For this example, we use the h5rdmtoolbox codemeta.json file from the GitHub repository:

from h5rdmtoolbox.utils import download_file
from pprint import pprint

Download the file:

codemeta_url = 'https://raw.githubusercontent.com/matthiasprobst/h5RDMtoolbox/main/codemeta.json'
downloaded_filename = download_file(codemeta_url)

2025-08-15_13:53:53,709 WARNING  [utils.py:70] No hash given! This is recommended when downloading files from the web.

Read the data with ontolutils.dquery:

from ontolutils import dquery

data = dquery(subject='schema:SoftwareSourceCode',
              source=downloaded_filename,
              context={"schema": "http://schema.org/"})
pprint(data[0])

{'@context': {'applicationCategory': 'http://schema.org/applicationCategory',
              'author': 'http://schema.org/author',
              'codeRepository': 'http://schema.org/codeRepository',
              'description': 'http://schema.org/description',
              'license': 'http://schema.org/license',
              'name': 'http://schema.org/name',
              'operatingSystem': 'http://schema.org/operatingSystem',
              'programmingLanguage': 'http://schema.org/programmingLanguage',
              'version': 'http://schema.org/version'},
 '@id': '_:Na73c96daa7f24d03bbc2201f4189bdda',
 '@type': 'http://schema.org/SoftwareSourceCode',
 'applicationCategory': 'file:///home/docs/.cache/h5rdmtoolbox/2.2.1/Engineering',
 'author': [{'@id': 'https://orcid.org/0000-0001-9560-500X',
             '@type': 'http://schema.org/Person',
             'affiliation': {'@id': 'https://ror.org/04t3en479',
                             '@type': 'http://schema.org/Organization',
                             'name': 'Karlsruhe Institute of Technology, '
                                     'Institute of Thermal Turbomachinery'},
             'familyName': 'Pritz',
             'givenName': 'Balazs'},
            {'@id': 'https://orcid.org/0000-0002-4116-0065',
             '@type': 'http://schema.org/Person',
             'affiliation': {'@id': 'https://ror.org/04t3en479',
                             '@type': 'http://schema.org/Organization',
                             'name': 'Karlsruhe Institute of Technology, '
                                     'Institute of Thermal Turbomachinery'},
             'familyName': 'Büttner',
             'givenName': 'Lucas'},
            {'@id': 'https://orcid.org/0000-0001-8729-0482',
             '@type': 'http://schema.org/Person',
             'affiliation': {'@id': 'https://ror.org/04t3en479',
                             '@type': 'http://schema.org/Organization',
                             'name': 'Karlsruhe Institute of Technology, '
                                     'Institute of Thermal Turbomachinery'},
             'email': 'matth.probst@gmail.com',
             'familyName': 'Probst',
             'givenName': 'Matthias'}],
 'codeRepository': 'git+https://github.com/matthiasprobst/h5RDMtoolbox.git',
 'description': 'Supporting a FAIR Research Data lifecycle using Python and '
                'HDF5.',
 'license': 'https://spdx.org/licenses/MIT',
 'name': 'h5RDMtoolbox',
 'operatingSystem': ['Linux', 'Windows', 'macOS'],
 'programmingLanguage': ['Python 3',
                         'Python 3.9',
                         'Python 3.10',
                         'Python 3.11',
                         'Python 3.12'],
 'version': '2.2.1'}

The data are written into the HDF5 file by using jsonld.to_hdf():

from h5rdmtoolbox.wrapper import jsonld

with h5tbx.File('test.hdf', 'w') as h5:
    jsonld.to_hdf(data=data[0],
                 grp=h5.create_group('software_code'))
    h5.dump(False)

/(1)
- software_code @type: http://schema.org/SoftwareSourceCode(3)
  - applicationCategory http://schema.org/applicationCategory: ///home/docs/.cache/h5rdmtoolbox/2.2.1/Engineering
  - codeRepository http://schema.org/codeRepository: //github.com/matthiasprobst/h5RDMtoolbox.git
  - description http://schema.org/description: Supporting a FAIR Research Data lifecycle using Python and HDF5.
  - license http://schema.org/license: https://spdx.org/licenses/MIT
  - name http://schema.org/name: h5RDMtoolbox
  - operatingSystem http://schema.org/operatingSystem: Linux, Windows, macOS
  - programmingLanguage http://schema.org/programmingLanguage: Python 3, Python 3.9, Python 3.10, Python 3.11, Python 3.12
  - version http://schema.org/version: 2.2.1
  - author1 @type: http://schema.org/Person @id: https://orcid.org/0000-0001-9560-500X http://schema.org/author(1)
    - familyName: Pritz
    - givenName: Balazs
    - affiliation @type: http://schema.org/Organization @id: https://ror.org/04t3en479(0)
      - name http://schema.org/name: Karlsruhe Institute of Technology, Institute of Thermal Turbomachinery
  - author2 @type: http://schema.org/Person @id: https://orcid.org/0000-0002-4116-0065 http://schema.org/author(1)
    - familyName: Büttner
    - givenName: Lucas
    - affiliation @type: http://schema.org/Organization @id: https://ror.org/04t3en479(0)
      - name http://schema.org/name: Karlsruhe Institute of Technology, Institute of Thermal Turbomachinery
  - author3 @type: http://schema.org/Person @id: https://orcid.org/0000-0001-8729-0482 http://schema.org/author(1)
    - email: matth.probst@gmail.com
    - familyName: Probst
    - givenName: Matthias
    - affiliation @type: http://schema.org/Organization @id: https://ror.org/04t3en479(0)
      - name http://schema.org/name: Karlsruhe Institute of Technology, Institute of Thermal Turbomachinery
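The layout in the dump above (nested objects become subgroups, lists of objects become numbered groups such as author1 and author2, lists of plain values become one joined string) can be mimicked with a short recursive function. This is only a sketch of the observable mapping, not the implementation of jsonld.to_hdf(). The function `flatten_jsonld` is hypothetical, and storing @type and @id as ordinary keys is a simplification.

```python
def flatten_jsonld(node, path="", out=None):
    """Map a nested JSON-LD-like dict to {group_path: {attribute: value}}."""
    if out is None:
        out = {}
    attrs = out.setdefault(path or "/", {})
    for key, value in node.items():
        if key == "@context":
            continue  # the context is not data
        if isinstance(value, dict):
            # Nested object -> subgroup
            flatten_jsonld(value, f"{path}/{key}", out)
        elif isinstance(value, list) and all(isinstance(v, dict) for v in value):
            # List of objects -> numbered subgroups (key1, key2, ...)
            for i, item in enumerate(value, start=1):
                flatten_jsonld(item, f"{path}/{key}{i}", out)
        elif isinstance(value, list):
            # List of plain values -> one comma-separated attribute
            attrs[key] = ", ".join(str(v) for v in value)
        else:
            attrs[key] = value
    return out


sample = {
    "@type": "http://schema.org/SoftwareSourceCode",
    "name": "h5RDMtoolbox",
    "operatingSystem": ["Linux", "Windows", "macOS"],
    "author": [
        {"@id": "https://orcid.org/0000-0001-9560-500X", "familyName": "Pritz"},
        {"@id": "https://orcid.org/0000-0002-4116-0065", "familyName": "Büttner"},
    ],
}
layout = flatten_jsonld(sample, "/software_code")
for group, attributes in layout.items():
    print(group, attributes)
```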

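import numpy as np
import ontolutils

# Create a temporary file with two datasets and one annotated root attribute
with h5tbx.File(mode='w') as h5:
    _ = h5.create_dataset('test_dataset', data=np.array([[1, 2], [3, 4], [5.4, 1.9]]))
    h5.create_dataset('grp/subgrp/vel', data=4)
    # The second key assigns schema:name as the predicate IRI of the attribute
    h5.attrs['name', ontolutils.SCHEMA.name] = 'test attr'

    # Serialize to Turtle, including the HDF5 structure (see output below)
    ttl = h5tbx.serialize(h5.filename, structural=True, format="ttl")
print(ttl)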

@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

hdf:H5T_IEEE_F64LE a hdf:Datatype .

hdf:H5T_INTEL_I64 a hdf:Datatype .

[] a hdf:File ;
    hdf:rootGroup [ a hdf:Group ;
            hdf:attribute [ a hdf:StringAttribute ;
                    hdf:data "test attr"^^xsd:string ;
                    hdf:name "name" ] ;
            hdf:member [ a hdf:Group ;
                    hdf:member [ a hdf:Group ;
                            hdf:member [ a hdf:Dataset ;
                                    hdf:dataspace [ a hdf:ScalarDataspace ] ;
                                    hdf:datatype hdf:H5T_INTEL_I64,
                                        "H5T_INTEGER" ;
                                    hdf:layout hdf:H5D_CONTIGUOUS ;
                                    hdf:maximumSize -1 ;
                                    hdf:name "/grp/subgrp/vel" ;
                                    hdf:rank 0 ;
                                    hdf:size 1 ;
                                    hdf:value "4" ] ;
                            hdf:name "/grp/subgrp"^^xsd:string ] ;
                    hdf:name "/grp"^^xsd:string ],
                [ a hdf:Dataset ;
                    hdf:dataspace [ a hdf:SimpleDataspace ;
                            hdf:dimension [ a hdf:DataspaceDimension ;
                                    hdf:dimensionIndex 0 ;
                                    hdf:size 3 ],
                                [ a hdf:DataspaceDimension ;
                                    hdf:dimensionIndex 1 ;
                                    hdf:size 2 ] ] ;
                    hdf:datatype hdf:H5T_IEEE_F64LE,
                        "H5T_FLOAT" ;
                    hdf:layout hdf:H5D_CONTIGUOUS ;
                    hdf:maximumSize 6 ;
                    hdf:name "/test_dataset" ;
                    hdf:rank 2 ;
                    hdf:size 6 ] ;
            hdf:name "/"^^xsd:string ;
            schema:name "test attr"^^xsd:string ] .

Describing attribute meanings without RDF#

Sometimes no IRI is defined (yet) for an attribute, but it still needs an explanatory comment. This can be done as follows:

with h5tbx.File() as h5:
    grp = h5.create_group('contact')

    # Set an attribute as usual
    grp.attrs['type'] = 'Contact'

    # Update the attribute definition afterwards:
    grp.rdf['type'].definition = 'The role of the Person'

    # Alternatively, it can be assigned simultaneously via h5tbx.Attribute:
    grp.attrs['fname'] = h5tbx.Attribute(value='Matthias',
                                        definition='The first name of the contact')
    h5.dump(False)

    jdict = h5.dump_jsonld(indent=2)

/(1)
- contact(0)
  - fname
    The first name of the contact
    : Matthias
  - type
    The role of the Person
    : Contact

print(jdict)

{
  "@context": {
    "hdf": "http://purl.allotrope.org/ontologies/hdf5/1.8#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  },
  "@graph": [
    {
      "@id": "_:tmp4.hdf",
      "@type": "hdf:File",
      "hdf:rootGroup": {
        "@id": "_:tmp4.hdf/"
      }
    },
    {
      "@id": "_:tmp4.hdf/",
      "@type": "hdf:Group",
      "hdf:member": {
        "@id": "_:tmp4.hdf/contact"
      },
      "hdf:name": "/"
    },
    {
      "@id": "_:tmp4.hdf/contact",
      "@type": "hdf:Group",
      "hdf:attribute": [
        {
          "@id": "_:tmp4.hdf/contact@type"
        },
        {
          "@id": "_:tmp4.hdf/contact@fname"
        }
      ],
      "hdf:name": "/contact"
    },
    {
      "@id": "_:tmp4.hdf/contact@type",
      "@type": "hdf:StringAttribute",
      "hdf:data": "Contact",
      "hdf:name": "type"
    },
    {
      "@id": "_:tmp4.hdf/contact@fname",
      "@type": "hdf:StringAttribute",
      "hdf:data": "Matthias",
      "hdf:name": "fname"
    }
  ]
}
