HDF5 and RDF: Toward FAIR Attributes#

HDF5 files are often described as self-describing, meaning they contain internal metadata about their structure, such as groups, datasets, datatypes, and attributes. Tools like h5py, HDFView, or h5dump can parse and display this structure without external documentation.

However, this self-description is:

  • Structural

  • Syntactic

  • Low-level

It tells you what is stored, but not what it means. For example, an attribute named "units" with the value "counts" says nothing about whether it refers to photon counts, electrical pulses, or normalized integers — and it’s unlikely to align with shared standards or ontologies.

To make HDF5 data understandable, interoperable, and reusable, especially by machines, semantic annotation is essential.


FAIR Principle F1: Use Globally Unique and Persistent Identifiers#

According to F1 of the FAIR Principles:

“Globally unique and persistent identifiers remove ambiguity in the meaning of your published data by assigning a unique identifier to every element of metadata and every concept/measurement in your dataset. In this context, identifiers consist of an internet link (e.g., a URL that resolves to a web page that defines the concept such as a particular human protein). Many data repositories will automatically generate globally unique and persistent identifiers to deposited datasets. Identifiers can help other people understand exactly what you mean, and they allow computers to interpret your data in a meaningful way (i.e., computers that are searching for your data or trying to automatically integrate them). Identifiers are essential to the human-machine interoperation that is key to the vision of Open Science. In addition, identifiers will help others to properly cite your work when reusing your data.”

In this context, each concept or attribute — such as a physical unit, method, instrument, or material — should be identified using a globally resolvable identifier, such as a URI or IRI.
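
As a rough illustration of what "globally resolvable" means in practice, the following sketch checks whether a string is at least an absolute HTTP(S) IRI. The helper name `looks_like_web_iri` is hypothetical, and the check is purely syntactic: it verifies neither persistence nor that the identifier actually resolves.

```python
from urllib.parse import urlsplit


def looks_like_web_iri(value: str) -> bool:
    """Syntactic check: absolute http(s) identifier with a host part."""
    parts = urlsplit(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# A bare string carries no globally unique meaning ...
print(looks_like_web_iri("m/s"))
# ... whereas an IRI from a shared vocabulary can be looked up by anyone.
print(looks_like_web_iri("http://qudt.org/vocab/unit/M-PER-SEC"))
```

A check like this can serve as a first sanity test before an identifier is attached to an attribute.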


Using h5rdmtoolbox for Semantic Annotation#

The h5rdmtoolbox provides functionality to link HDF5 attributes to identifiers, enabling FAIR-compliant metadata. For each attribute, both the name and the value can be associated with an IRI (Internationalized Resource Identifier), connecting the metadata to shared vocabularies, ontologies, or data catalogs.

The following section demonstrates how to annotate HDF5 attributes with semantic identifiers using h5rdmtoolbox.

Concept: Representing HDF5 Metadata as RDF Triples#

HDF5 metadata can be interpreted in terms of RDF triples, the foundational structure of the Semantic Web. An RDF triple consists of:

  • a subject – the thing being described (e.g., a group or dataset)

  • a predicate – the property or relationship (e.g., an attribute name)

  • an object – the value or target of the property (e.g., an attribute value)

So, each attribute in an HDF5 file can naturally be viewed as a semantic statement:

subject = HDF5 object (group or dataset)
predicate = attribute name
object = attribute value

This interpretation enables the transformation of binary HDF5 metadata into structured, queryable, and machine-interpretable knowledge.
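
The mapping above can be sketched in plain Python, independent of any HDF5 library. The nested dictionary standing in for a file and the helper `attributes_to_triples` are hypothetical; they only illustrate how each attribute yields one (subject, predicate, object) statement.

```python
from typing import Dict, Iterator, Tuple

Triple = Tuple[str, str, object]

# Stand-in for an HDF5 file: object path -> {attribute name: attribute value}
hdf_like: Dict[str, Dict[str, object]] = {
    "/contact": {"first_name": "Matthias"},
    "/grp/random_velocity": {"units": "m/s", "quantity_kind": "velocity"},
}


def attributes_to_triples(objects: Dict[str, Dict[str, object]]) -> Iterator[Triple]:
    """Yield one triple per attribute: (object path, attribute name, attribute value)."""
    for path, attrs in objects.items():
        for name, value in attrs.items():
            yield (path, name, value)


for triple in attributes_to_triples(hdf_like):
    print(triple)
```

At this stage, subject and predicate are still local names. They only become globally meaningful once they are replaced by, or linked to, IRIs.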


From Human Understanding to Machine Interpretability#

Humans may be able to understand the contents of an HDF5 file based on its naming conventions or documentation. For example, a dataset of random data might include an attribute like "creator": "Alice", and we understand that “Alice” refers to a person.

However, machines cannot reliably interpret such informal metadata. To make meaning explicit and unambiguous, we must associate globally unique identifiers, such as URIs, with HDF5 components.

For example, the attribute "contact" is ambiguous: is it a person or an organization? A well-chosen URI can clarify this by linking to a concept from a standard vocabulary or ontology.
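
A minimal sketch of this disambiguation, assuming a hypothetical lookup table that records a type IRI for each HDF5 object. The two FOAF classes are existing terms, while the paths, the table `rdf_types`, and the function `describe` are made up for illustration.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical annotation table: HDF5 object path -> IRI of the concept it represents
rdf_types = {
    "/contact": FOAF + "Person",
    "/institute": FOAF + "Organization",
}


def describe(path: str) -> str:
    """Tell a person from an organization by the type IRI, not by the group name."""
    type_iri = rdf_types.get(path)
    if type_iri == FOAF + "Person":
        return "person"
    if type_iri == FOAF + "Organization":
        return "organization"
    return "unknown"


print(describe("/contact"))
print(describe("/institute"))
print(describe("/notes"))  # no annotation available
```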


In the following section, we will semantically annotate an HDF5 file that contains a dataset of random values, along with metadata about its creator — identified using an ORCID iD. This demonstrates how to convert conventional metadata into machine-readable RDF.

import h5rdmtoolbox as h5tbx

Describing an HDF5 file with persistent metadata#

Example part 1: A contact person#

In this example, we create an HDF5 group that contains the relevant contact data of the author of the file. The content of the group thus describes the contact person, so the group itself represents a person. The group gets the predicate “has author”, which relates the HDF5 file to the author:

with h5tbx.File() as h5:
    grp = h5.create_group('contact', attrs=dict(orcid='https://orcid.org/0000-0001-8729-0482'))
    grp.rdf.predicate = 'https://schema.org/author'  # relates the file to the content of the group
    grp.rdf.type = 'http://xmlns.com/foaf/0.1/Person'  # what the content of the group is, namely a foaf:Person
    grp.rdf.subject = 'https://orcid.org/0000-0001-8729-0482'  # corresponds to @id in JSON-LD
    grp.rdf.predicate['orcid'] = 'http://w3id.org/nfdi4ing/metadata4ing#orcidId'  # meaning of the attribute name
    grp.attrs['first_name', 'http://xmlns.com/foaf/0.1/firstName'] = 'Matthias'  # value and predicate in one step

    o = grp.rdf.predicate['orcid']  # read the predicate IRI back

    h5.dump(collapsed=False)

hdf_filename = h5.hdf_filename

Using the rdf accessor, we can assign internationalized resource identifiers (IRIs) to HDF5 objects (datasets, groups, attributes). An IRI identifies a web resource and points to a definition in an ontology. For example, “contact” is a “Person”, which is defined in the FOAF ontology: ‘http://xmlns.com/foaf/0.1/Person’. The person “has a researcher ID”. This predicate is described in the M4i (metadata4ing) ontology: ‘http://w3id.org/nfdi4ing/metadata4ing#orcidId’

Assigning metadata to the file rather than the root group#

If we want to describe the file using attributes, the root group “/” is the natural place. However, we may want to distinguish between the actual file and its root group. For this, we can use the accessor frdf, which allows assigning RDF triples to the file.

In the following example, we add the creation date as a root group attribute but declare it as a property of the file rather than of the root group.

Using the serialize() function, the content is displayed as linked data (here in Turtle format):

from datetime import datetime

with h5tbx.File(mode='w') as h5:
    h5.frdf.subject = "https://example.org/myfile-id" # the ID of this file container
    
    h5.attrs["creation_date"] = datetime.today()
    h5.frdf["creation_date"].predicate = "http://purl.org/dc/terms/created"

print(h5tbx.serialize(h5.hdf_filename, format="ttl", structural=False, semantic=True, indent=2))
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://example.org/myfile-id> dcterms:created "20250815135353347986"^^xsd:string .

From now on, let’s save some work and use the namespacelib module of the ontolutils package, which simplifies working with namespaces so that we don’t have to type the full IRIs. Some popular namespaces are implemented in the rdflib package, too:

from ontolutils.namespacelib import M4I, OBO, QUDT_UNIT, QUDT_KIND
from rdflib.namespace import FOAF

As a result, we can type the following:

M4I.orcidId  # equal to http://w3id.org/nfdi4ing/metadata4ing#orcidId
rdflib.term.URIRef('http://w3id.org/nfdi4ing/metadata4ing#orcidId')
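
Conceptually, such a namespace object does little more than concatenate a base IRI with a term name. The following simplified stand-in illustrates the idea. `MiniNamespace` is a hypothetical class written for this example and is not how ontolutils or rdflib implement their namespaces (which, as the output above shows, return `URIRef` objects rather than plain strings).

```python
class MiniNamespace:
    """Build full IRIs from a base IRI via attribute access."""

    def __init__(self, base: str):
        self._base = base

    def __getattr__(self, term: str) -> str:
        # Only called for attributes that are not found the normal way,
        # i.e. for vocabulary terms such as `orcidId`.
        if term.startswith("_"):
            raise AttributeError(term)
        return self._base + term


M4I_SKETCH = MiniNamespace("http://w3id.org/nfdi4ing/metadata4ing#")
print(M4I_SKETCH.orcidId)
```

Note that this sketch accepts any term name, whereas a curated namespace class can restrict attribute access to terms that actually exist in the vocabulary.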

Example part 2: A random data dataset#

Next, we add the random data dataset with units. We can also state which kind of quantity the data represents, in our case velocity. Without this specification, the meaning of the values would not be clear to the user (or a machine):

import numpy as np

with h5tbx.File(hdf_filename, mode='r+') as h5:    
    ds = h5.create_dataset('grp/random_velocity', data=np.random.random(100))
    ds.attrs.create('units',
                    rdf_predicate=M4I.hasUnit,
                    data='m/s',
                    rdf_object=QUDT_UNIT.M_PER_SEC)
    ds.attrs.create('quantity_kind',
                     data='velocity',
                     rdf_predicate=M4I.hasKindOfQuantity,
                     rdf_object=QUDT_KIND.Velocity)

    h5.dump(collapsed=False)

Now, let’s go further and describe how the random dataset was created and that the contact was involved in it:

from datetime import datetime

with h5tbx.File(hdf_filename, mode='r+') as h5:  
    proc = h5.create_group('processing_info')
    proc.rdf.subject = M4I.ProcessingStep
    proc.attrs['has_participants', OBO.has_participant] = h5['contact']
    start_time = datetime.today()
    end_time = datetime.today()
    proc.attrs.create('start_time', data=start_time,
                      rdf_predicate='https://schema.org/startTime')
    proc.attrs.create('end_time', data=end_time,
                      rdf_predicate='https://schema.org/endTime')
    proc.attrs['output', 'http://purl.obolibrary.org/obo/RO_0002234'] = h5['grp/random_velocity'].name
h5tbx.dump(hdf_filename, collapsed=False)

Example part 3: Associating an entity#

Until now, we used IRIs to assign meaning to HDF5 attributes, e.g. proc.rdf.subject = M4I.ProcessingStep.

Sometimes, data cannot be expressed by a single IRI because there is no globally unique identifier. In that case, we can describe the object as an entity with its own properties.

In the example below, the attribute “standard_name” of the dataset “u” refers to “x_velocity” being the Standard name of the HDF5 dataset “u”. A Standard name has a name, a description and an SI unit, and it may be associated with a Standard Name Table in which it is listed. In our case, the Standard name “x_velocity” has no globally unique identifier, hence we need to describe it by a JSON-LD string:

sn_xvel = """{
    "@context": {
        "ssno": "https://matthiasprobst.github.io/ssno#"
    },
    "@type": "ssno:StandardName",
    "ssno:standardName": "x_velocity",
    "ssno:unit": "http://qudt.org/vocab/unit/M-PER-SEC",
    "ssno:description": "X-component of a velocity vector."
}"""
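
To see what the `@context` does, the compact keys can be expanded by hand with the standard library. This is only a sketch: the helper `expand` is hypothetical and handles just the simple prefix mapping used here, not the full JSON-LD expansion algorithm.

```python
import json

# Same JSON-LD document as above, repeated so that the snippet runs on its own.
sn_xvel = """{
    "@context": {"ssno": "https://matthiasprobst.github.io/ssno#"},
    "@type": "ssno:StandardName",
    "ssno:standardName": "x_velocity",
    "ssno:unit": "http://qudt.org/vocab/unit/M-PER-SEC",
    "ssno:description": "X-component of a velocity vector."
}"""


def expand(term: str, context: dict) -> str:
    """Replace a known prefix (e.g. 'ssno:') by its base IRI; leave other strings untouched."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return term


doc = json.loads(sn_xvel)
ctx = doc["@context"]
# Expand the keys, then the value of @type (the only value that is a compact IRI here).
expanded = {expand(key, ctx): value for key, value in doc.items() if key != "@context"}
expanded["@type"] = expand(doc["@type"], ctx)

for key, value in expanded.items():
    print(key, "->", value)
```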

Let’s assign this entity to the attribute “standard_name”. Note that when dumping the data, the “LD” icon (linked data) appears:

with h5tbx.File() as h5:
    h5.create_dataset("u", data=[1,2,3], attrs={"standard_name": "x_velocity"})
    h5.u.rdf["standard_name"].predicate = "https://matthiasprobst.github.io/ssno#hasStandardName"
    h5.u.rdf["standard_name"].object = sn_xvel
    h5.dump(False)
    
    serialization = h5.serialize(fmt="ttl", structural=False)

The Turtle serialization shows that “standard_name” is correctly associated with our JSON-LD string for the ssno:StandardName:

print(serialization)
@prefix ssno: <https://matthiasprobst.github.io/ssno#> .

[] ssno:hasStandardName [ a ssno:StandardName ;
            ssno:description "X-component of a velocity vector." ;
            ssno:standardName "x_velocity" ;
            ssno:unit "http://qudt.org/vocab/unit/M-PER-SEC" ] .

How to make use of the FAIR HDF5 file?#

There are three ways in which the above IRI assignments help us and in which we might want to use the information:

  1. Visual inspection by dumping the content to screen: This outlines the file (meta)content, and we can click on the attributes with IRIs to open the resource that explains the attribute (data).

  2. Extraction of a JSON-LD file: This is useful for other processes. We can also investigate this file further with tools like the JSON-LD Playground.

  3. Access to the IRIs in (Python) code.

1. Visual inspection#

The dump() method now adds IRI icons. Click on one of them to get redirected to the respective resource:

h5tbx.dump(hdf_filename, collapsed=False)

2. JSON-LD extraction#

Write the JSON-LD or Turtle (ttl) file and share it with others or upload it to a repository. The toolbox provides dump methods through the jsonld module or, in newer versions of h5tbx, the serialize method, which can write various linked data formats.

The output might look a bit overwhelming; however, dedicated scripts can work with it perfectly well, while humans can still read it (with a bit of practice and patience…):

print(
    h5tbx.serialize(
        hdf_filename,
        format="ttl",
        indent=2,
        context={'m4i': 'http://w3id.org/nfdi4ing/metadata4ing#',
                 'foaf': 'http://xmlns.com/foaf/0.1/',
                 'obo': 'http://purl.obolibrary.org/obo/'}
    )
)
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .
@prefix m4i: <http://w3id.org/nfdi4ing/metadata4ing#> .
@prefix ns1: <http://purl.obolibrary.org/obo/> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

hdf:H5T_IEEE_F64LE a hdf:Datatype .

m4i:ProcessingStep ns1:RO_0000057 "/contact"^^xsd:string ;
    ns1:RO_0002234 "/grp/random_velocity"^^xsd:string ;
    schema:startTime "2025-08-15T13:53:53.388441"^^xsd:string,
        "2025-08-15T13:53:53.388446"^^xsd:string .

<https://orcid.org/0000-0001-8729-0482> a foaf:Person ;
    m4i:orcidId "https://orcid.org/0000-0001-8729-0482"^^xsd:string ;
    foaf:firstName "Matthias"^^xsd:string .

[] a hdf:File ;
    hdf:rootGroup [ a hdf:Group ;
            hdf:member [ a hdf:Group ;
                    hdf:attribute [ a hdf:StringAttribute ;
                            hdf:data "Matthias"^^xsd:string ;
                            hdf:name "first_name" ],
                        [ a hdf:StringAttribute ;
                            hdf:data "https://orcid.org/0000-0001-8729-0482"^^xsd:string ;
                            hdf:name "orcid" ] ;
                    hdf:name "/contact"^^xsd:string ;
                    dcterms:relation <https://orcid.org/0000-0001-8729-0482> ],
                [ a hdf:Group ;
                    hdf:member [ a hdf:Dataset ;
                            hdf:attribute [ a hdf:StringAttribute ;
                                    hdf:data "velocity"^^xsd:string ;
                                    hdf:name "quantity_kind" ],
                                [ a hdf:StringAttribute ;
                                    hdf:data "m/s"^^xsd:string ;
                                    hdf:name "units" ] ;
                            hdf:dataspace [ a hdf:SimpleDataspace ;
                                    hdf:dimension [ a hdf:DataspaceDimension ;
                                            hdf:dimensionIndex 0 ;
                                            hdf:size 100 ] ] ;
                            hdf:datatype hdf:H5T_IEEE_F64LE,
                                "H5T_FLOAT" ;
                            hdf:layout hdf:H5D_CONTIGUOUS ;
                            hdf:maximumSize 100 ;
                            hdf:name "/grp/random_velocity" ;
                            hdf:rank 1 ;
                            hdf:size 100 ;
                            m4i:hasKindOfQuantity <http://qudt.org/vocab/quantitykind/Velocity> ;
                            m4i:hasUnit <http://qudt.org/vocab/unit/M-PER-SEC> ] ;
                    hdf:name "/grp"^^xsd:string ],
                [ a hdf:Group ;
                    hdf:attribute [ a hdf:StringAttribute ;
                            hdf:data "2025-08-15T13:53:53.388446"^^xsd:string ;
                            hdf:name "end_time" ],
                        [ a hdf:StringAttribute ;
                            hdf:data "/contact"^^xsd:string ;
                            hdf:name "has_participants" ],
                        [ a hdf:StringAttribute ;
                            hdf:data "/grp/random_velocity"^^xsd:string ;
                            hdf:name "output" ],
                        [ a hdf:StringAttribute ;
                            hdf:data "2025-08-15T13:53:53.388441"^^xsd:string ;
                            hdf:name "start_time" ] ;
                    hdf:name "/processing_info"^^xsd:string ;
                    dcterms:relation m4i:ProcessingStep ] ;
            hdf:name "/"^^xsd:string ] .

3. Access IRI in code#

You may want to access the IRI of an attribute with Python within the HDF5 file. E.g. while working with the file, you may ask “Hey, what is ‘contact’ exactly?” or “What does the attribute ‘orcid’ mean?”

with h5tbx.File(hdf_filename) as h5:
    person_iri = h5.contact.rdf.subject
    orcid_iri = h5.contact.rdf.predicate['orcid']

… Well, “contact” is identified by the ORCID iD of the person (its type, a “Person” defined by the FOAF ontology, is stored separately in rdf.type):

person_iri
'https://orcid.org/0000-0001-8729-0482'

… and “orcid” is a predicate defined by the metadata4ing ontology:

orcid_iri
'http://w3id.org/nfdi4ing/metadata4ing#orcidId'

3.1 Find data based on IRIs#

import rdflib.graph as g

graph = g.Graph()
graph.parse('hdf_meta.jsonld', format='json-ld')
<Graph identifier=Nc46e35ad2cdb44aa8c4f57ac8269bb20 (<class 'rdflib.graph.Graph'>)>

Note that we need to provide the prefixes if the JSON-LD data/file does not include the context.

res = graph.query("""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX m4i: <http://w3id.org/nfdi4ing/metadata4ing#>

SELECT ?id ?orcid
WHERE {
    ?id a foaf:Person .
    ?id m4i:orcidId ?orcid .
    }
""")
for r in res:
    print(r)
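
For readers less familiar with SPARQL, the logic of such a query (two triple patterns joined on the same subject) can be mimicked in plain Python. The triples below are copied by hand from the Turtle output shown earlier. The function `find_persons_with_orcid` is a hypothetical stand-in and not a replacement for a real SPARQL engine.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_FIRST_NAME = "http://xmlns.com/foaf/0.1/firstName"
M4I_ORCID_ID = "http://w3id.org/nfdi4ing/metadata4ing#orcidId"

person = "https://orcid.org/0000-0001-8729-0482"
triples = [
    (person, RDF_TYPE, FOAF_PERSON),
    (person, M4I_ORCID_ID, "https://orcid.org/0000-0001-8729-0482"),
    (person, FOAF_FIRST_NAME, "Matthias"),
]


def find_persons_with_orcid(graph):
    """Return (id, orcid) pairs matching: ?id a foaf:Person . ?id m4i:orcidId ?orcid ."""
    # First pattern: collect all subjects typed as foaf:Person.
    persons = {s for s, p, o in graph if p == RDF_TYPE and o == FOAF_PERSON}
    # Second pattern: keep orcidId statements whose subject is one of those persons.
    return [(s, o) for s, p, o in graph if p == M4I_ORCID_ID and s in persons]


for row in find_persons_with_orcid(triples):
    print(row)
```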

4. Examples:#

4.1 Read metadata from JSON and write to HDF5#

Suppose we want to store information about the software used in the HDF5 file. It exists as a JSON-LD file based on the codemeta ontology. For this example, we use the codemeta.json file from the h5rdmtoolbox GitHub repository:

from h5rdmtoolbox.utils import download_file
from pprint import pprint

Download the file:

codemeta_url = 'https://raw.githubusercontent.com/matthiasprobst/h5RDMtoolbox/main/codemeta.json'
dowloaded_filename = download_file(codemeta_url)
2025-08-15_13:53:53,709 WARNING  [utils.py:70] No hash given! This is recommended when downloading files from the web.
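
The warning refers to integrity checking of downloaded files. Independent of the `download_file` helper (whose hash-related parameters are not shown here), a checksum can be computed with the standard library and compared against a value published by the provider of the file. The function name `sha256_of` is hypothetical, and the demonstration uses a temporary file so that it runs without network access.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration with a temporary file standing in for the downloaded one.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
tmp_path = tmp.name

expected = hashlib.sha256(b"hello").hexdigest()  # normally published by the provider
print(sha256_of(tmp_path) == expected)
os.remove(tmp_path)
```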

Read the data with ontolutils.dquery:

from ontolutils import dquery
data = dquery(subject='schema:SoftwareSourceCode',
              source=dowloaded_filename,
              context={"schema": "http://schema.org/"})
pprint(data[0])
{'@context': {'applicationCategory': 'http://schema.org/applicationCategory',
              'author': 'http://schema.org/author',
              'codeRepository': 'http://schema.org/codeRepository',
              'description': 'http://schema.org/description',
              'license': 'http://schema.org/license',
              'name': 'http://schema.org/name',
              'operatingSystem': 'http://schema.org/operatingSystem',
              'programmingLanguage': 'http://schema.org/programmingLanguage',
              'version': 'http://schema.org/version'},
 '@id': '_:Na73c96daa7f24d03bbc2201f4189bdda',
 '@type': 'http://schema.org/SoftwareSourceCode',
 'applicationCategory': 'file:///home/docs/.cache/h5rdmtoolbox/2.2.1/Engineering',
 'author': [{'@id': 'https://orcid.org/0000-0001-9560-500X',
             '@type': 'http://schema.org/Person',
             'affiliation': {'@id': 'https://ror.org/04t3en479',
                             '@type': 'http://schema.org/Organization',
                             'name': 'Karlsruhe Institute of Technology, '
                                     'Institute of Thermal Turbomachinery'},
             'familyName': 'Pritz',
             'givenName': 'Balazs'},
            {'@id': 'https://orcid.org/0000-0002-4116-0065',
             '@type': 'http://schema.org/Person',
             'affiliation': {'@id': 'https://ror.org/04t3en479',
                             '@type': 'http://schema.org/Organization',
                             'name': 'Karlsruhe Institute of Technology, '
                                     'Institute of Thermal Turbomachinery'},
             'familyName': 'Büttner',
             'givenName': 'Lucas'},
            {'@id': 'https://orcid.org/0000-0001-8729-0482',
             '@type': 'http://schema.org/Person',
             'affiliation': {'@id': 'https://ror.org/04t3en479',
                             '@type': 'http://schema.org/Organization',
                             'name': 'Karlsruhe Institute of Technology, '
                                     'Institute of Thermal Turbomachinery'},
             'email': 'matth.probst@gmail.com',
             'familyName': 'Probst',
             'givenName': 'Matthias'}],
 'codeRepository': 'git+https://github.com/matthiasprobst/h5RDMtoolbox.git',
 'description': 'Supporting a FAIR Research Data lifecycle using Python and '
                'HDF5.',
 'license': 'https://spdx.org/licenses/MIT',
 'name': 'h5RDMtoolbox',
 'operatingSystem': ['Linux', 'Windows', 'macOS'],
 'programmingLanguage': ['Python 3',
                         'Python 3.9',
                         'Python 3.10',
                         'Python 3.11',
                         'Python 3.12'],
 'version': '2.2.1'}

The data are written into the HDF5 file by using jsonld.to_hdf():

from h5rdmtoolbox.wrapper import jsonld
with h5tbx.File('test.hdf', 'w') as h5:
    jsonld.to_hdf(data=data[0],
                 grp=h5.create_group('software_code'))
    h5.dump(False)

The structural description of a file (groups, datasets, datatypes) can be serialized together with the semantic annotations:

import ontolutils
with h5tbx.File(mode='w') as h5:
    _ = h5.create_dataset('test_dataset', data=np.array([[1, 2], [3, 4], [5.4, 1.9]]))
    h5.create_dataset('grp/subgrp/vel', data=4)
    h5.attrs['name', ontolutils.SCHEMA.name] = 'test attr'

    ttl = h5tbx.serialize(h5.filename, structural=True, format="ttl")
print(ttl)
@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

hdf:H5T_IEEE_F64LE a hdf:Datatype .

hdf:H5T_INTEL_I64 a hdf:Datatype .

[] a hdf:File ;
    hdf:rootGroup [ a hdf:Group ;
            hdf:attribute [ a hdf:StringAttribute ;
                    hdf:data "test attr"^^xsd:string ;
                    hdf:name "name" ] ;
            hdf:member [ a hdf:Group ;
                    hdf:member [ a hdf:Group ;
                            hdf:member [ a hdf:Dataset ;
                                    hdf:dataspace [ a hdf:ScalarDataspace ] ;
                                    hdf:datatype hdf:H5T_INTEL_I64,
                                        "H5T_INTEGER" ;
                                    hdf:layout hdf:H5D_CONTIGUOUS ;
                                    hdf:maximumSize -1 ;
                                    hdf:name "/grp/subgrp/vel" ;
                                    hdf:rank 0 ;
                                    hdf:size 1 ;
                                    hdf:value "4" ] ;
                            hdf:name "/grp/subgrp"^^xsd:string ] ;
                    hdf:name "/grp"^^xsd:string ],
                [ a hdf:Dataset ;
                    hdf:dataspace [ a hdf:SimpleDataspace ;
                            hdf:dimension [ a hdf:DataspaceDimension ;
                                    hdf:dimensionIndex 0 ;
                                    hdf:size 3 ],
                                [ a hdf:DataspaceDimension ;
                                    hdf:dimensionIndex 1 ;
                                    hdf:size 2 ] ] ;
                    hdf:datatype hdf:H5T_IEEE_F64LE,
                        "H5T_FLOAT" ;
                    hdf:layout hdf:H5D_CONTIGUOUS ;
                    hdf:maximumSize 6 ;
                    hdf:name "/test_dataset" ;
                    hdf:rank 2 ;
                    hdf:size 6 ] ;
            hdf:name "/"^^xsd:string ;
            schema:name "test attr"^^xsd:string ] .

Describing attribute meanings without RDF#

Sometimes, no IRI is defined (yet), but an attribute still needs an explanatory comment. This can be done as follows:

with h5tbx.File() as h5:
    grp = h5.create_group('contact')

    # Set an attribute as usual
    grp.attrs['type'] = 'Contact'

    # Update the attribute definition afterwards:
    grp.rdf['type'].definition = 'The role of the Person'

    # Alternatively, it can be assigned simultaneously via h5tbx.Attribute:
    grp.attrs['fname'] = h5tbx.Attribute(value='Matthias',
                                        definition='The first name of the contact')
    h5.dump(False)

    jdict = h5.dump_jsonld(indent=2)
      • fname
        DThe first name of the contact
        : Matthias
      • type
        DThe role of the Person
        : Contact
print(jdict)
{
  "@context": {
    "hdf": "http://purl.allotrope.org/ontologies/hdf5/1.8#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  },
  "@graph": [
    {
      "@id": "_:tmp4.hdf",
      "@type": "hdf:File",
      "hdf:rootGroup": {
        "@id": "_:tmp4.hdf/"
      }
    },
    {
      "@id": "_:tmp4.hdf/",
      "@type": "hdf:Group",
      "hdf:member": {
        "@id": "_:tmp4.hdf/contact"
      },
      "hdf:name": "/"
    },
    {
      "@id": "_:tmp4.hdf/contact",
      "@type": "hdf:Group",
      "hdf:attribute": [
        {
          "@id": "_:tmp4.hdf/contact@type"
        },
        {
          "@id": "_:tmp4.hdf/contact@fname"
        }
      ],
      "hdf:name": "/contact"
    },
    {
      "@id": "_:tmp4.hdf/contact@type",
      "@type": "hdf:StringAttribute",
      "hdf:data": "Contact",
      "hdf:name": "type"
    },
    {
      "@id": "_:tmp4.hdf/contact@fname",
      "@type": "hdf:StringAttribute",
      "hdf:data": "Matthias",
      "hdf:name": "fname"
    }
  ]
}
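
Scripts that consume such a flat `@graph` typically index the nodes by `@id` first and then follow the references. The sketch below does this with the standard library for a shortened copy of the output above. The helper `attributes_of` is hypothetical.

```python
import json

# Shortened copy of the JSON-LD output above (only the nodes needed here).
jsonld_text = """
{
  "@graph": [
    {"@id": "_:tmp4.hdf/contact", "@type": "hdf:Group",
     "hdf:attribute": [{"@id": "_:tmp4.hdf/contact@type"},
                       {"@id": "_:tmp4.hdf/contact@fname"}],
     "hdf:name": "/contact"},
    {"@id": "_:tmp4.hdf/contact@type", "@type": "hdf:StringAttribute",
     "hdf:data": "Contact", "hdf:name": "type"},
    {"@id": "_:tmp4.hdf/contact@fname", "@type": "hdf:StringAttribute",
     "hdf:data": "Matthias", "hdf:name": "fname"}
  ]
}
"""

# Index all nodes by their identifier for constant-time lookup.
nodes = {node["@id"]: node for node in json.loads(jsonld_text)["@graph"]}


def attributes_of(group_id: str) -> dict:
    """Resolve the hdf:attribute references of a group into {name: data}."""
    refs = nodes[group_id].get("hdf:attribute", [])
    if isinstance(refs, dict):  # a single reference is serialized as an object, not a list
        refs = [refs]
    return {nodes[r["@id"]]["hdf:name"]: nodes[r["@id"]]["hdf:data"] for r in refs}


print(attributes_of("_:tmp4.hdf/contact"))
```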