Getting Started with Linked Data#

This notebook shows how to enrich HDF5 metadata with semantic identifiers and export it as RDF.

With h5rdmtoolbox, HDF5 objects and attributes can be interpreted as RDF triples:

  • the HDF5 object becomes the subject

  • the attribute name becomes the predicate

  • the attribute value becomes the object

This makes HDF5 metadata machine-interpretable and easier to reuse in FAIR data workflows.
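
The mapping above can be pictured with a few lines of plain Python. This is only a conceptual sketch and not how h5rdmtoolbox builds its graph; the Triple tuple and the attrs_to_triples helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # the HDF5 object, identified here by its path
    predicate: str  # the attribute name
    obj: object     # the attribute value


def attrs_to_triples(hdf_path, attrs):
    """Turn the attributes of one HDF5 object into (subject, predicate, object) triples."""
    return [Triple(hdf_path, name, value) for name, value in attrs.items()]


# Hypothetical attributes of a dataset stored at "/temperature"
triples = attrs_to_triples(
    "/temperature",
    {"units": "degree_Celsius", "description": "Room temperature measurements"},
)

for t in triples:
    print(t.subject, t.predicate, t.obj)
```

At this stage the predicate is still just the string "units"; nothing ties it to a shared vocabulary yet.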

Why linked data for HDF5?#

Standard HDF5 attributes are useful for humans, but often ambiguous for machines. For example, an attribute like "units": "degree_Celsius" is readable, but it does not by itself identify a shared concept from a controlled vocabulary.

By attaching IRIs to attribute names and values, we can make the meaning explicit. This allows HDF5 metadata to be exported as RDF (serialized, for example, as JSON-LD or Turtle) and used in semantic web and FAIR workflows.
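
To illustrate what attaching IRIs changes, the sketch below maps an attribute name and value to IRIs and formats the result as one N-Triples line. It uses only the standard library. The lookup tables, the to_ntriple helper, and the subject IRI under example.org are assumptions made for this example and are not part of h5rdmtoolbox; the hasUnit and DEG_C IRIs are the ones used later in this notebook.

```python
# Hypothetical lookup tables: attribute name -> predicate IRI, value -> object IRI
PREDICATES = {"units": "http://w3id.org/nfdi4ing/metadata4ing#hasUnit"}
OBJECTS = {"degree_Celsius": "http://qudt.org/vocab/unit/DEG_C"}


def to_ntriple(subject_iri, name, value):
    """Format one attribute as an N-Triples line, using IRIs where a mapping exists."""
    predicate = PREDICATES[name]
    if value in OBJECTS:
        # The value names a shared concept, so emit it as an IRI
        obj = "<" + OBJECTS[value] + ">"
    else:
        # No mapping known: fall back to a plain string literal
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        obj = '"' + escaped + '"'
    return f"<{subject_iri}> <{predicate}> {obj} ."


line = to_ntriple("https://example.org/file.h5#temperature", "units", "degree_Celsius")
print(line)
```

A value without a mapping, such as "kelvin" in this toy table, stays a plain literal, which is exactly the ambiguity described above.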

The ld module provides:

  • Conversion of HDF5 files to RDF graphs

  • Export to JSON-LD and Turtle formats

  • SHACL validation (see shacl_validation)

  • Semantic mapping of attributes to ontologies

import numpy as np
import rdflib
import h5rdmtoolbox as h5tbx
from pathlib import Path

Create an HDF5 file with semantic annotations#

In this example, we create a small HDF5 file containing temperature measurements and their mean. We annotate:

  • the dataset description using schema.org

  • the unit using Metadata4Ing and a QUDT unit IRI

  • the dataset type and data predicate for a derived quantity

# Namespace of the Metadata4Ing ontology, used for predicates and types below
M4I = rdflib.Namespace("http://w3id.org/nfdi4ing/metadata4ing#")

hdf_filename = "linked_data_example.h5"

with h5tbx.File(hdf_filename, "w") as h5:
    ds = h5.create_dataset("temperature", data=np.array([20.0, 21.0, 19.0, 22.0]))

    # Set value, predicate IRI, and object IRI of the attribute in one step
    ds.attrs["units"] = h5tbx.Attribute(
        value="degree_Celsius",
        rdf_predicate=M4I.hasUnit,
        rdf_object="http://qudt.org/vocab/unit/DEG_C",
    )
    # Shorthand: the second element of the key is the predicate IRI
    ds.attrs["description", "https://schema.org/description"] = "Room temperature measurements"

    ds_mean = h5.create_dataset("mean_temperature", data=np.mean(ds[()]))
    # Same unit annotation in two steps: predicate first, then the object IRI
    ds_mean.attrs["units", M4I.hasUnit] = "degree_Celsius"
    ds_mean.rdf["units"].object = "http://qudt.org/vocab/unit/DEG_C"
    ds_mean.attrs["description", "https://schema.org/description"] = "Mean room temperature"

    # Give the dataset an RDF type and name the predicate that carries its value
    ds_mean.rdf.type = M4I.NumericalVariable
    ds_mean.rdf.data_predicate = M4I.hasNumericalValue

Inspect the file#

Before exporting RDF, it helps to inspect the HDF5 content and confirm that the semantic annotations were written as expected.

h5tbx.dump(hdf_filename, collapsed=False)

Export RDF as Turtle#

h5rdmtoolbox can serialize HDF5 metadata as RDF. A useful first format is Turtle, because it is compact and readable.

ttl = h5tbx.serialize(
    hdf_filename,
    format="ttl",
    structural=False,
    semantic=True,
    indent=2,
    context={
        "m4i": "http://w3id.org/nfdi4ing/metadata4ing#",
        "schema": "https://schema.org/",
    },
)

print(ttl)
@prefix m4i: <http://w3id.org/nfdi4ing/metadata4ing#> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[] a m4i:NumericalVariable ;
    m4i:hasNumericalValue "20.5"^^xsd:float ;
    m4i:hasUnit <http://qudt.org/vocab/unit/DEG_C> ;
    schema:description "Mean room temperature" .

[] m4i:hasUnit <http://qudt.org/vocab/unit/DEG_C> ;
    schema:description "Room temperature measurements" .
/home/docs/checkouts/readthedocs.org/user_builds/h5rdmtoolbox/envs/stable/lib/python3.10/site-packages/h5rdmtoolbox/wrapper/core.py:287: UserWarning: Not providing a file-uri is not good practice because it will generate blank nodes. Consider providing an URI such as the DOI URL for example.
  warnings.warn(

Export RDF as JSON-LD#

JSON-LD is often the easiest RDF serialization to exchange with web-based tools and downstream FAIR workflows.

jsonld = h5tbx.serialize(
    hdf_filename,
    format="json-ld",
    structural=False,
    semantic=True,
    indent=2,
    context={
        "m4i": "http://w3id.org/nfdi4ing/metadata4ing#",
        "schema": "https://schema.org/",
    },
)

print(jsonld[:2000])  # print only the first part to keep the notebook compact
{
  "@context": {
    "m4i": "http://w3id.org/nfdi4ing/metadata4ing#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "schema": "https://schema.org/"
  },
  "@graph": [
    {
      "@id": "_:linked_data_example.h5/mean_temperature",
      "@type": "m4i:NumericalVariable",
      "m4i:hasNumericalValue": {
        "@type": "http://www.w3.org/2001/XMLSchema#float",
        "@value": "20.5"
      },
      "m4i:hasUnit": {
        "@id": "http://qudt.org/vocab/unit/DEG_C"
      },
      "schema:description": "Mean room temperature"
    },
    {
      "@id": "_:linked_data_example.h5/temperature",
      "m4i:hasUnit": {
        "@id": "http://qudt.org/vocab/unit/DEG_C"
      },
      "schema:description": "Room temperature measurements"
    }
  ]
}
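
The keys in the output above, such as schema:description, are compact IRIs that expand through the prefixes listed in @context. The sketch below shows that expansion with the standard json module on a trimmed copy of the output. It is a simplified reading of prefix expansion only; a real JSON-LD processor handles many more cases, and the expand helper is a hypothetical name.

```python
import json

# Trimmed copy of the JSON-LD printed above
jsonld_text = """
{
  "@context": {
    "m4i": "http://w3id.org/nfdi4ing/metadata4ing#",
    "schema": "https://schema.org/"
  },
  "@graph": [
    {
      "@id": "_:linked_data_example.h5/temperature",
      "m4i:hasUnit": {"@id": "http://qudt.org/vocab/unit/DEG_C"},
      "schema:description": "Room temperature measurements"
    }
  ]
}
"""

doc = json.loads(jsonld_text)
context = doc["@context"]


def expand(term):
    """Expand a compact IRI such as 'schema:description' using the @context prefixes."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return term  # keywords and full IRIs are returned unchanged


for node in doc["@graph"]:
    for key, value in node.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @id and @type
        print(node["@id"], expand(key), value)
```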

Parse the exported RDF with rdflib#

Once exported, the metadata can be loaded into a standard RDF library and queried.

from rdflib import Graph

g = Graph()
g.parse(data=jsonld, format="json-ld")

print(f"Number of triples: {len(g)}")
Number of triples: 6

Query the graph#

We can now ask semantic questions about the data, for example: which resources have a schema:description and what is the value?

We use h5tbx.sparql(...) here (the built-in sparql(...) method of an opened file is an alternative), which generates the RDF data and applies the query in the background. For better readability, we provide a base file URI and request the result as a pandas DataFrame:

query = """
PREFIX schema: <https://schema.org/>

SELECT ?resource ?description
WHERE {
    ?resource schema:description ?description .
}
"""

res = h5tbx.sparql(hdf_filename, query, as_dataframe=True, file_uri="https://example.org#")
res
   resource                                            description
0  https://example.org#linked_data_example.h5/mea...  Mean room temperature
1  https://example.org#linked_data_example.h5/tem...  Room temperature measurements
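
Conceptually, the single triple pattern in the query selects every triple whose predicate is schema:description. The standard-library sketch below mimics that pattern over a hand-written list of tuples. It is not how a SPARQL engine works internally, and the shortened ex: subject names are placeholders rather than the IRIs returned above.

```python
SCHEMA_DESCRIPTION = "https://schema.org/description"
M4I_HAS_UNIT = "http://w3id.org/nfdi4ing/metadata4ing#hasUnit"

# Hand-written stand-in for the exported graph (subjects are placeholder names)
triples = [
    ("ex:mean_temperature", SCHEMA_DESCRIPTION, "Mean room temperature"),
    ("ex:mean_temperature", M4I_HAS_UNIT, "http://qudt.org/vocab/unit/DEG_C"),
    ("ex:temperature", SCHEMA_DESCRIPTION, "Room temperature measurements"),
    ("ex:temperature", M4I_HAS_UNIT, "http://qudt.org/vocab/unit/DEG_C"),
]

# Counterpart of: SELECT ?resource ?description
#                 WHERE { ?resource schema:description ?description . }
rows = [(s, o) for s, p, o in triples if p == SCHEMA_DESCRIPTION]

for resource, description in rows:
    print(resource, "->", description)
```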

Structural vs semantic RDF#

h5rdmtoolbox can export:

  • semantic/contextual RDF: only the semantics you explicitly added

  • structural RDF: information derived from the HDF5 structure

  • full RDF: a combination of both

ttl_semantic = h5tbx.serialize(hdf_filename, format="ttl", structural=False, semantic=True)
ttl_structural = h5tbx.serialize(hdf_filename, format="ttl", structural=True, semantic=False)
ttl_full = h5tbx.serialize(hdf_filename, format="ttl", structural=True, semantic=True)

print("semantic length:", len(ttl_semantic))
print("structural length:", len(ttl_structural))
print("full length:", len(ttl_full))
semantic length: 436
structural length: 2410
full length: 2410
print(ttl_semantic[:1000])
@prefix m4i: <http://w3id.org/nfdi4ing/metadata4ing#> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[] a m4i:NumericalVariable ;
    m4i:hasNumericalValue "20.5"^^xsd:float ;
    m4i:hasUnit <http://qudt.org/vocab/unit/DEG_C> ;
    schema:description "Mean room temperature" .

[] m4i:hasUnit <http://qudt.org/vocab/unit/DEG_C> ;
    schema:description "Room temperature measurements" .
print(ttl_structural[:2000])
@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .
@prefix m4i: <http://w3id.org/nfdi4ing/metadata4ing#> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

hdf:H5T_IEEE_F64LE a hdf:Datatype .

[] a hdf:File ;
    hdf:rootGroup [ a hdf:Group ;
            hdf:member [ a hdf:Dataset,
                        m4i:NumericalVariable ;
                    hdf:attribute [ a hdf:StringAttribute ;
                            hdf:data "Mean room temperature" ;
                            hdf:name "description" ],
                        [ a hdf:StringAttribute ;
                            hdf:data "degree_Celsius" ;
                            hdf:name "units" ] ;
                    hdf:dataspace [ a hdf:ScalarDataspace ] ;
                    hdf:datatype hdf:H5T_FLOAT,
                        hdf:H5T_IEEE_F64LE ;
                    hdf:layout hdf:H5D_CONTIGUOUS ;
                    hdf:maximumSize -1 ;
                    hdf:name "/mean_temperature" ;
                    hdf:rank 0 ;
                    hdf:size 1 ;
                    hdf:value 2.05e+01 ;
                    m4i:hasNumericalValue "20.5"^^xsd:float ;
                    m4i:hasUnit <http://qudt.org/vocab/unit/DEG_C> ;
                    schema:description "Mean room temperature" ],
                [ a hdf:Dataset ;
                    hdf:attribute [ a hdf:StringAttribute ;
                            hdf:data "Room temperature measurements" ;
                            hdf:name "description" ],
                        [ a hdf:StringAttribute ;
                            hdf:data "degree_Celsius" ;
                            hdf:name "units" ] ;
                    hdf:dataspace [ a hdf:SimpleDataspace ;
                            hdf:dimension [ a hdf:DataspaceDimension ;
                                    hdf:dimensionIndex 0 ;
                                    hdf:size 4 ] ] ;
                    hdf:datatype hdf:H5T_FLOAT,
             
print(ttl_full[:2000])
@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .
@prefix m4i: <http://w3id.org/nfdi4ing/metadata4ing#> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

hdf:H5T_IEEE_F64LE a hdf:Datatype .

[] a hdf:File ;
    hdf:rootGroup [ a hdf:Group ;
            hdf:member [ a hdf:Dataset,
                        m4i:NumericalVariable ;
                    hdf:attribute [ a hdf:StringAttribute ;
                            hdf:data "Mean room temperature" ;
                            hdf:name "description" ],
                        [ a hdf:StringAttribute ;
                            hdf:data "degree_Celsius" ;
                            hdf:name "units" ] ;
                    hdf:dataspace [ a hdf:ScalarDataspace ] ;
                    hdf:datatype hdf:H5T_FLOAT,
                        hdf:H5T_IEEE_F64LE ;
                    hdf:layout hdf:H5D_CONTIGUOUS ;
                    hdf:maximumSize -1 ;
                    hdf:name "/mean_temperature" ;
                    hdf:rank 0 ;
                    hdf:size 1 ;
                    hdf:value 2.05e+01 ;
                    m4i:hasNumericalValue "20.5"^^xsd:float ;
                    m4i:hasUnit <http://qudt.org/vocab/unit/DEG_C> ;
                    schema:description "Mean room temperature" ],
                [ a hdf:Dataset ;
                    hdf:attribute [ a hdf:StringAttribute ;
                            hdf:data "Room temperature measurements" ;
                            hdf:name "description" ],
                        [ a hdf:StringAttribute ;
                            hdf:data "degree_Celsius" ;
                            hdf:name "units" ] ;
                    hdf:dataspace [ a hdf:SimpleDataspace ;
                            hdf:dimension [ a hdf:DataspaceDimension ;
                                    hdf:dimensionIndex 0 ;
                                    hdf:size 4 ] ] ;
                    hdf:datatype hdf:H5T_FLOAT,
             

Next step: Validate metadata with SHACL#

Because the metadata is available as RDF, it can be validated with SHACL. This is useful when you want to check whether required metadata is present and correctly typed.

A full SHACL example is covered in a separate notebook, but the key idea is that you can define RDF constraints and validate the exported graph against them.
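
The idea of checking for required metadata can be sketched without any SHACL tooling. The snippet below is plain Python, not SHACL: it checks that every subject in a hand-written triple list carries a set of required predicates. The REQUIRED set, the missing_predicates helper, and the ex:pressure entry are hypothetical; a real validation would express the same constraint as a SHACL shape and run it against the exported graph.

```python
from collections import defaultdict

# Hypothetical rule: every resource needs a description and a unit
REQUIRED = {
    "https://schema.org/description",
    "http://w3id.org/nfdi4ing/metadata4ing#hasUnit",
}


def missing_predicates(triples, required=REQUIRED):
    """Return {subject: required predicates that the subject lacks}."""
    seen = defaultdict(set)
    for subject, predicate, _ in triples:
        seen[subject].add(predicate)
    return {s: required - preds for s, preds in seen.items() if required - preds}


# Hand-written stand-in for an exported graph (subjects are placeholder names)
triples = [
    ("ex:temperature", "https://schema.org/description", "Room temperature measurements"),
    ("ex:temperature", "http://w3id.org/nfdi4ing/metadata4ing#hasUnit", "http://qudt.org/vocab/unit/DEG_C"),
    ("ex:pressure", "https://schema.org/description", "Pressure measurements"),  # unit is missing
]

report = missing_predicates(triples)
print(report)
```

An empty report means every resource satisfies the rule; here only ex:pressure is flagged because it has no unit.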