Getting Started with Linked Data#
This notebook shows how to enrich HDF5 metadata with semantic identifiers and export it as RDF.
With h5rdmtoolbox, HDF5 objects and attributes can be interpreted as RDF triples:
the HDF5 object becomes the subject
the attribute name becomes the predicate
the attribute value becomes the object
This makes HDF5 metadata machine-interpretable and easier to reuse in FAIR data workflows.
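The mapping above can be sketched in plain Python. This is an illustration only and not how h5rdmtoolbox builds its graph internally: the function name `attrs_to_triples` and the subject IRI under `https://example.org/` are made up for this sketch.

```python
def attrs_to_triples(subject, attrs):
    """Turn HDF5-style attributes into (subject, predicate, object) triples."""
    return [(subject, name, value) for name, value in attrs.items()]


# Hypothetical subject IRI standing in for an HDF5 dataset.
dataset = "https://example.org/file.h5/temperature"
attrs = {
    "units": "degree_Celsius",
    "description": "Room temperature measurements",
}

triples = attrs_to_triples(dataset, attrs)
for s, p, o in triples:
    print(s, p, o)
```

At this stage the predicates are still plain strings such as "units"; replacing them with IRIs is what the annotations later in this notebook do.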
Why linked data for HDF5?#
Standard HDF5 attributes are useful for humans, but often ambiguous for machines.
For example, an attribute like "units": "degree_Celsius" is readable, but it does not
by itself identify a shared concept from a controlled vocabulary.
By attaching IRIs to attribute names and values, we can make the meaning explicit. This allows HDF5 metadata to be exported as RDF (serialized as, e.g., JSON-LD or Turtle) and used in semantic web and FAIR workflows.
The ld module provides:
Conversion of HDF5 files to RDF graphs
Export to JSON-LD and Turtle formats
SHACL validation (see shacl_validation)
Semantic mapping of attributes to ontologies
import numpy as np
import rdflib
import h5rdmtoolbox as h5tbx
from pathlib import Path
Create an HDF5 file with semantic annotations#
In this example, we create a small HDF5 file containing temperature and pressure data. We annotate:
the dataset description using schema.org
the unit using Metadata4Ing and a QUDT unit IRI
the dataset type and data predicate for a derived quantity
M4I = rdflib.Namespace("http://w3id.org/nfdi4ing/metadata4ing#")
hdf_filename = "linked_data_example.h5"
with h5tbx.File(hdf_filename, "w") as h5:
    ds = h5.create_dataset("temperature", data=np.array([20.0, 21.0, 19.0, 22.0]))
    ds.attrs["units"] = h5tbx.Attribute(
        value="degree_Celsius",
        rdf_predicate=M4I.hasUnit,
        rdf_object="http://qudt.org/vocab/unit/DEG_C",
    )
    ds.attrs["description", "https://schema.org/description"] = "Room temperature measurements"
    ds_mean = h5.create_dataset("mean_temperature", data=np.mean(ds[()]))
    ds_mean.attrs["units", M4I.hasUnit] = "degree_Celsius"
    ds_mean.rdf["units"].object = "http://qudt.org/vocab/unit/DEG_C"
    ds_mean.attrs["description", "https://schema.org/description"] = "Mean room temperature"
    ds_mean.rdf.type = M4I.NumericalVariable
    ds_mean.rdf.data_predicate = M4I.hasNumericalValue
Inspect the file#
Before exporting RDF, it helps to inspect the HDF5 content and confirm that the semantic annotations were written as expected.
h5tbx.dump(hdf_filename, collapsed=False)
- mean_temperature: 20.5 [degree_Celsius] [float64]
  - description (https://schema.org/description): Mean room temperature
  - units (http://w3id.org/nfdi4ing/metadata4ing#hasUnit): degree_Celsius (http://qudt.org/vocab/unit/DEG_C)
- temperature: (4) [float64]
  - description (https://schema.org/description): Room temperature measurements
  - units (http://w3id.org/nfdi4ing/metadata4ing#hasUnit): degree_Celsius (http://qudt.org/vocab/unit/DEG_C)
Export RDF as Turtle#
h5rdmtoolbox can serialize HDF5 metadata as RDF. A useful first format is Turtle,
because it is compact and readable.
ttl = h5tbx.serialize(
    hdf_filename,
    format="ttl",
    structural=False,
    semantic=True,
    indent=2,
    context={
        "m4i": "http://w3id.org/nfdi4ing/metadata4ing#",
        "schema": "https://schema.org/",
    },
)
print(ttl)
@prefix m4i: <http://w3id.org/nfdi4ing/metadata4ing#> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
[] a m4i:NumericalVariable ;
m4i:hasNumericalValue "20.5"^^xsd:float ;
m4i:hasUnit <http://qudt.org/vocab/unit/DEG_C> ;
schema:description "Mean room temperature" .
[] m4i:hasUnit <http://qudt.org/vocab/unit/DEG_C> ;
schema:description "Room temperature measurements" .
/home/docs/checkouts/readthedocs.org/user_builds/h5rdmtoolbox/envs/stable/lib/python3.10/site-packages/h5rdmtoolbox/wrapper/core.py:287: UserWarning: Not providing a file-uri is not good practice because it will generate blank nodes. Consider providing an URI such as the DOI URL for example.
warnings.warn(
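The warning above concerns how subjects are named. Judging from the JSON-LD and SPARQL outputs later in this notebook, nodes exported without a file URI get blank-node identifiers such as `_:linked_data_example.h5/temperature`, whereas passing a file URI yields full IRIs. The helper `node_id` below is hypothetical and only mimics that naming pattern; it is not the library's implementation, and the exact form of the resulting IRI is an assumption.

```python
def node_id(filename, hdf_path, file_uri=None):
    """Build a node identifier: blank node without file_uri, IRI with it."""
    local = f"{filename}{hdf_path}"
    if file_uri is None:
        # Blank node: only meaningful inside a single graph.
        return f"_:{local}"
    # IRI: can be referenced from other graphs (assumed concatenation scheme).
    return f"{file_uri}{local}"


blank = node_id("linked_data_example.h5", "/temperature")
named = node_id("linked_data_example.h5", "/temperature", file_uri="https://example.org#")
print(blank)
print(named)
```

In practice, a persistent identifier such as a DOI URL is a better base than the `https://example.org#` placeholder used here.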
Export RDF as JSON-LD#
JSON-LD is often the easiest RDF serialization to exchange with web-based tools and downstream FAIR workflows.
jsonld = h5tbx.serialize(
    hdf_filename,
    format="json-ld",
    structural=False,
    semantic=True,
    indent=2,
    context={
        "m4i": "http://w3id.org/nfdi4ing/metadata4ing#",
        "schema": "https://schema.org/",
    },
)
print(jsonld[:2000]) # print only the first part to keep the notebook compact
{
"@context": {
"m4i": "http://w3id.org/nfdi4ing/metadata4ing#",
"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
"schema": "https://schema.org/"
},
"@graph": [
{
"@id": "_:linked_data_example.h5/mean_temperature",
"@type": "m4i:NumericalVariable",
"m4i:hasNumericalValue": {
"@type": "http://www.w3.org/2001/XMLSchema#float",
"@value": "20.5"
},
"m4i:hasUnit": {
"@id": "http://qudt.org/vocab/unit/DEG_C"
},
"schema:description": "Mean room temperature"
},
{
"@id": "_:linked_data_example.h5/temperature",
"m4i:hasUnit": {
"@id": "http://qudt.org/vocab/unit/DEG_C"
},
"schema:description": "Room temperature measurements"
}
]
}
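The `@context` block maps short prefixes to namespace IRIs, so a key like `m4i:hasUnit` stands for a full IRI. A minimal sketch of that expansion follows. It only handles the `prefix:local` form; real JSON-LD processors cover many more cases. The function name `expand` is an assumption for this illustration.

```python
def expand(term, context):
    """Expand a compact IRI such as 'm4i:hasUnit' using a prefix mapping."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    # Already an absolute IRI, or the prefix is not declared in the context.
    return term


context = {
    "m4i": "http://w3id.org/nfdi4ing/metadata4ing#",
    "schema": "https://schema.org/",
}

print(expand("m4i:hasUnit", context))
print(expand("schema:description", context))
print(expand("http://qudt.org/vocab/unit/DEG_C", context))
```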
Parse the exported RDF with rdflib#
Once exported, the metadata can be loaded into a standard RDF library and queried.
from rdflib import Graph
g = Graph()
g.parse(data=jsonld, format="json-ld")
print(f"Number of triples: {len(g)}")
Number of triples: 6
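The count of six can be checked by hand against the JSON-LD output: every key of a node other than `@id` contributes one statement about that node. The sketch below does this with the standard library only, on a copy of the two nodes shown earlier. It is a simplified count for this particular document, not a general JSON-LD to RDF conversion, and `count_statements` is a made-up helper name.

```python
import json

doc = json.loads("""
{
  "@graph": [
    {
      "@id": "_:linked_data_example.h5/mean_temperature",
      "@type": "m4i:NumericalVariable",
      "m4i:hasNumericalValue": {
        "@type": "http://www.w3.org/2001/XMLSchema#float",
        "@value": "20.5"
      },
      "m4i:hasUnit": {"@id": "http://qudt.org/vocab/unit/DEG_C"},
      "schema:description": "Mean room temperature"
    },
    {
      "@id": "_:linked_data_example.h5/temperature",
      "m4i:hasUnit": {"@id": "http://qudt.org/vocab/unit/DEG_C"},
      "schema:description": "Room temperature measurements"
    }
  ]
}
""")


def count_statements(node):
    """Count one statement per key, skipping the subject identifier '@id'."""
    total = 0
    for key, value in node.items():
        if key == "@id":
            continue
        # A list value would contribute one statement per entry.
        total += len(value) if isinstance(value, list) else 1
    return total


total = sum(count_statements(node) for node in doc["@graph"])
print(total)  # 4 statements for mean_temperature + 2 for temperature = 6
```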
Query the graph#
We can now ask semantic questions about the data, for example: which resources have
a schema:description and what is the value?
We use h5tbx.sparql(...) here (a built-in sparql(...) method is also available), which generates the RDF data and applies the query in the background. For better readability, we provide a base file URI and request the result as a pandas DataFrame:
query = """
PREFIX schema: <https://schema.org/>
SELECT ?resource ?description
WHERE {
?resource schema:description ?description .
}
"""
res = h5tbx.sparql(hdf_filename, query, as_dataframe=True, file_uri="https://example.org#")
res
| | resource | description |
|---|---|---|
| 0 | https://example.org#linked_data_example.h5/mea... | Mean room temperature |
| 1 | https://example.org#linked_data_example.h5/tem... | Room temperature measurements |
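Conceptually, the `WHERE` clause is a triple pattern: the predicate is fixed and the two variables are bound to whatever subjects and objects match. The sketch below imitates that for a single pattern over a plain list of tuples. It is not a SPARQL engine, the subject names are shortened placeholders, and `match` is a hypothetical helper.

```python
SCHEMA_DESCRIPTION = "https://schema.org/description"
M4I_HAS_UNIT = "http://w3id.org/nfdi4ing/metadata4ing#hasUnit"
DEG_C = "http://qudt.org/vocab/unit/DEG_C"

# Triples mirroring the semantic export; subjects are placeholder names.
triples = [
    ("mean_temperature", SCHEMA_DESCRIPTION, "Mean room temperature"),
    ("mean_temperature", M4I_HAS_UNIT, DEG_C),
    ("temperature", SCHEMA_DESCRIPTION, "Room temperature measurements"),
    ("temperature", M4I_HAS_UNIT, DEG_C),
]


def match(triples, predicate):
    """Return (subject, object) bindings for every triple with this predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]


rows = match(triples, SCHEMA_DESCRIPTION)
for resource, description in rows:
    print(resource, "->", description)
```

Swapping the predicate for `M4I_HAS_UNIT` would instead list each resource with its unit IRI.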
Structural vs semantic RDF#
h5rdmtoolbox can export:
semantic/contextual RDF: only the semantics you explicitly added
structural RDF: information derived from the HDF5 structure
full RDF: a combination of both
ttl_semantic = h5tbx.serialize(hdf_filename, format="ttl", structural=False, semantic=True)
ttl_structural = h5tbx.serialize(hdf_filename, format="ttl", structural=True, semantic=False)
ttl_full = h5tbx.serialize(hdf_filename, format="ttl", structural=True, semantic=True)
print("semantic length:", len(ttl_semantic))
print("structural length:", len(ttl_structural))
print("full length:", len(ttl_full))
semantic length: 436
structural length: 2410
full length: 2410
print(ttl_semantic[:1000])
@prefix m4i: <http://w3id.org/nfdi4ing/metadata4ing#> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
[] a m4i:NumericalVariable ;
m4i:hasNumericalValue "20.5"^^xsd:float ;
m4i:hasUnit <http://qudt.org/vocab/unit/DEG_C> ;
schema:description "Mean room temperature" .
[] m4i:hasUnit <http://qudt.org/vocab/unit/DEG_C> ;
schema:description "Room temperature measurements" .
print(ttl_structural[:2000])
@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .
@prefix m4i: <http://w3id.org/nfdi4ing/metadata4ing#> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
hdf:H5T_IEEE_F64LE a hdf:Datatype .
[] a hdf:File ;
hdf:rootGroup [ a hdf:Group ;
hdf:member [ a hdf:Dataset,
m4i:NumericalVariable ;
hdf:attribute [ a hdf:StringAttribute ;
hdf:data "Mean room temperature" ;
hdf:name "description" ],
[ a hdf:StringAttribute ;
hdf:data "degree_Celsius" ;
hdf:name "units" ] ;
hdf:dataspace [ a hdf:ScalarDataspace ] ;
hdf:datatype hdf:H5T_FLOAT,
hdf:H5T_IEEE_F64LE ;
hdf:layout hdf:H5D_CONTIGUOUS ;
hdf:maximumSize -1 ;
hdf:name "/mean_temperature" ;
hdf:rank 0 ;
hdf:size 1 ;
hdf:value 2.05e+01 ;
m4i:hasNumericalValue "20.5"^^xsd:float ;
m4i:hasUnit <http://qudt.org/vocab/unit/DEG_C> ;
schema:description "Mean room temperature" ],
[ a hdf:Dataset ;
hdf:attribute [ a hdf:StringAttribute ;
hdf:data "Room temperature measurements" ;
hdf:name "description" ],
[ a hdf:StringAttribute ;
hdf:data "degree_Celsius" ;
hdf:name "units" ] ;
hdf:dataspace [ a hdf:SimpleDataspace ;
hdf:dimension [ a hdf:DataspaceDimension ;
hdf:dimensionIndex 0 ;
hdf:size 4 ] ] ;
hdf:datatype hdf:H5T_FLOAT,
print(ttl_full[:2000])
@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .
@prefix m4i: <http://w3id.org/nfdi4ing/metadata4ing#> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
hdf:H5T_IEEE_F64LE a hdf:Datatype .
[] a hdf:File ;
hdf:rootGroup [ a hdf:Group ;
hdf:member [ a hdf:Dataset,
m4i:NumericalVariable ;
hdf:attribute [ a hdf:StringAttribute ;
hdf:data "Mean room temperature" ;
hdf:name "description" ],
[ a hdf:StringAttribute ;
hdf:data "degree_Celsius" ;
hdf:name "units" ] ;
hdf:dataspace [ a hdf:ScalarDataspace ] ;
hdf:datatype hdf:H5T_FLOAT,
hdf:H5T_IEEE_F64LE ;
hdf:layout hdf:H5D_CONTIGUOUS ;
hdf:maximumSize -1 ;
hdf:name "/mean_temperature" ;
hdf:rank 0 ;
hdf:size 1 ;
hdf:value 2.05e+01 ;
m4i:hasNumericalValue "20.5"^^xsd:float ;
m4i:hasUnit <http://qudt.org/vocab/unit/DEG_C> ;
schema:description "Mean room temperature" ],
[ a hdf:Dataset ;
hdf:attribute [ a hdf:StringAttribute ;
hdf:data "Room temperature measurements" ;
hdf:name "description" ],
[ a hdf:StringAttribute ;
hdf:data "degree_Celsius" ;
hdf:name "units" ] ;
hdf:dataspace [ a hdf:SimpleDataspace ;
hdf:dimension [ a hdf:DataspaceDimension ;
hdf:dimensionIndex 0 ;
hdf:size 4 ] ] ;
hdf:datatype hdf:H5T_FLOAT,
Next step: Validate metadata with SHACL#
Because the metadata is available as RDF, it can be validated with SHACL. This is useful when you want to check whether required metadata is present and correctly typed.
A full SHACL example is covered in a separate notebook, but the key idea is that you define RDF constraints and validate the exported graph against them.
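To give a feeling for what such a constraint expresses, the sketch below checks in plain Python that every node of a given type carries certain properties. This is not SHACL and does not use a SHACL engine. The rule in `REQUIRED`, the `no_unit` node, and the `check` helper are all invented for this illustration.

```python
# Hypothetical rule: a NumericalVariable needs a value and a unit.
REQUIRED = {
    "m4i:NumericalVariable": ["m4i:hasNumericalValue", "m4i:hasUnit"],
}

nodes = [
    {
        "@id": "mean_temperature",
        "@type": "m4i:NumericalVariable",
        "m4i:hasNumericalValue": "20.5",
        "m4i:hasUnit": "http://qudt.org/vocab/unit/DEG_C",
    },
    # Deliberately incomplete node to trigger a violation.
    {"@id": "no_unit", "@type": "m4i:NumericalVariable", "m4i:hasNumericalValue": "1.0"},
]


def check(nodes, required):
    """Return (node id, missing property) pairs for nodes that break a rule."""
    violations = []
    for node in nodes:
        for prop in required.get(node.get("@type"), []):
            if prop not in node:
                violations.append((node["@id"], prop))
    return violations


violations = check(nodes, REQUIRED)
print(violations)
```

A SHACL shape states the same kind of rule declaratively in RDF, so it can be shared and applied by any conforming validator.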