SHACL Validation for HDF Files

The Shapes Constraint Language (SHACL) is a W3C standard for validating RDF data against a set of rules, called shapes. These shapes describe the expected structure and constraints of RDF graphs, such as required properties, allowed values, or data types.

SHACL is useful because it helps ensure that RDF data is consistent, complete, and conforms to a predefined schema or ontology. In workflows like extracting RDF from HDF files, SHACL allows you to automatically verify that the generated data matches the expected structure (e.g., required datasets, naming conventions, or value constraints).

By applying SHACL validation using the validate_hdf function from the h5RDMtoolbox, you can detect errors early, enforce data quality, and make your data more reliable for downstream processing and analysis.

The following example demonstrates this.

import h5rdmtoolbox as h5tbx

import rdflib
import pyshacl

Define a SHACL Shape

The following SHACL shape validates all nodes of type hdf:Dataset. It checks their hdf:name property and requires that its value is either “/velocity” or “/data/velocity”. If a dataset has a different name, it fails validation.

shacl_shape = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<VelocityShape1>
    a sh:NodeShape ;
    sh:targetClass hdf:Dataset ;
    sh:property [
        sh:path hdf:name ;
        sh:or (
            [ sh:hasValue "/velocity" ]
            [ sh:hasValue "/data/velocity" ]
        ) ;
    ] ."""
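To make the semantics of this shape concrete, here is a plain-Python analogue of the check it expresses. This is only an illustration of the rule, not how a SHACL engine works internally; `ALLOWED_NAMES` and `find_name_violations` are hypothetical names introduced for this sketch.

```python
# Plain-Python analogue of the SHACL shape above (illustration only).
ALLOWED_NAMES = {"/velocity", "/data/velocity"}


def find_name_violations(dataset_names):
    """Return the dataset names that the shape would reject."""
    return [name for name in dataset_names if name not in ALLOWED_NAMES]


print(find_name_violations(["/velocity", "/data/velocity"]))  # []
print(find_name_violations(["/pressure"]))                    # ['/pressure']
print(find_name_violations([]))                               # []
```

Note the last case: the shape only restricts the names of datasets that exist. It does not require a velocity dataset to be present, so a file without any datasets has no focus nodes and conforms.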

Validate an HDF5 File

Let’s create two test files first, one that conforms to the shape and one that does not:

with h5tbx.File("valid.hdf", "w") as h5:
    h5.create_group("data")
    h5.create_dataset("velocity", data=1)
    h5.create_dataset("data/velocity", data=1)

with h5tbx.File("invalid.hdf", "w") as h5:
    h5.create_group("data")
    h5.create_dataset("pressure", data=1)

Now we can import the function and validate both files:

from h5rdmtoolbox.ld.shacl import validate_hdf
res = validate_hdf(hdf_source="valid.hdf", shacl_data=shacl_shape)
res
ValidationResult(conforms=True, results_graph=<Graph identifier=N30fa7858aecf4297b4d18ae5041468d9 (<class 'rdflib.graph.Graph'>)>, results_text='Validation Report\nConforms: True\n', messages=[], nodes=[])
res = validate_hdf(hdf_source="invalid.hdf", shacl_data=shacl_shape)
res
ValidationResult(conforms=False, results_graph=<Graph identifier=N2ff55a738e334c6da29207921f9115c5 (<class 'rdflib.graph.Graph'>)>, results_text='Validation Report\nConforms: False\nResults (1):\nConstraint Violation in OrConstraintComponent (http://www.w3.org/ns/shacl#OrConstraintComponent):\n\tSeverity: sh:Violation\n\tSource Shape: [ sh:or ( [ sh:hasValue Literal("/velocity") ] [ sh:hasValue Literal("/data/velocity") ] ) ; sh:path hdf:name ]\n\tFocus Node: <https://example.org/hdf5file#invalid.hdf/pressure>\n\tValue Node: Literal("/pressure")\n\tResult Path: hdf:name\n\tMessage: Node Literal("/pressure") must conform to one or more shapes in [ sh:hasValue Literal("/velocity") ] , [ sh:hasValue Literal("/data/velocity") ]\n', messages=['Node Literal("/pressure") must conform to one or more shapes in [ sh:hasValue Literal("/velocity") ] , [ sh:hasValue Literal("/data/velocity") ]'], nodes=[rdflib.term.URIRef('https://example.org/hdf5file#invalid.hdf/pressure')])

Above we see that the file “invalid.hdf” does not conform to the shape. The result also tells us what the issue is:

print(res.results_text)
Validation Report
Conforms: False
Results (1):
Constraint Violation in OrConstraintComponent (http://www.w3.org/ns/shacl#OrConstraintComponent):
	Severity: sh:Violation
	Source Shape: [ sh:or ( [ sh:hasValue Literal("/velocity") ] [ sh:hasValue Literal("/data/velocity") ] ) ; sh:path hdf:name ]
	Focus Node: <https://example.org/hdf5file#invalid.hdf/pressure>
	Value Node: Literal("/pressure")
	Result Path: hdf:name
	Message: Node Literal("/pressure") must conform to one or more shapes in [ sh:hasValue Literal("/velocity") ] , [ sh:hasValue Literal("/data/velocity") ]
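In a script, you may prefer to stop as soon as a file fails validation rather than inspect the result by hand. The following is a minimal sketch of such a guard. It relies only on the `conforms` and `messages` attributes shown in the output above; `ShaclValidationError` and `require_conformance` are hypothetical names, and the `SimpleNamespace` objects are stand-ins so that the sketch runs without an HDF5 file.

```python
from types import SimpleNamespace


class ShaclValidationError(ValueError):
    """Hypothetical exception for a failed SHACL validation."""


def require_conformance(result, source):
    """Return `result` if it conforms, otherwise raise with all messages."""
    if result.conforms:
        return result
    details = "\n".join(f"  - {msg}" for msg in result.messages)
    raise ShaclValidationError(f"{source} does not conform:\n{details}")


# Stand-in results; in practice pass the object returned by validate_hdf.
ok = SimpleNamespace(conforms=True, messages=[])
bad = SimpleNamespace(conforms=False, messages=["name must be /velocity"])

require_conformance(ok, "valid.hdf")  # passes silently

error_text = ""
try:
    require_conformance(bad, "invalid.hdf")
except ShaclValidationError as err:
    error_text = str(err)
print(error_text)
```

Raising an exception keeps the failure visible in automated pipelines, where a printed report is easy to overlook.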

Application Profiles from NFDI4Ing AIMS

NFDI4Ing provides AIMS as a service for creating application profiles. These profiles describe which metadata a resource should provide and how that metadata should be constrained. Since AIMS application profiles are expressed as SHACL shapes, they can be used directly with h5RDMtoolbox.

This fits naturally into a scientific HDF5 workflow: researchers store measurements in HDF5, annotate datasets with machine-readable RDF terms, and then validate the file against an application profile before sharing, cataloging, or publishing it. In practice, a project, lab, data steward, or community can define the metadata requirements in AIMS once, and h5RDMtoolbox can check local HDF5 files against those requirements. This catches missing metadata early, makes FAIR requirements explicit, and helps align local files with reusable community profiles.

The following compact example mirrors an AIMS-style profile for an m4i:NumericalVariable. It requires a numerical value, a QUDT unit, and a QUDT quantity kind.

application_profile = """
@prefix m4i: <http://w3id.org/nfdi4ing/metadata4ing#> .
@prefix qudt: <http://qudt.org/schema/qudt/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<NumericalVariableProfile>
    a sh:NodeShape ;
    sh:targetClass m4i:NumericalVariable ;
    sh:property [
        sh:path m4i:hasNumericalValue ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:or (
            [ sh:datatype xsd:double ]
            [ sh:datatype xsd:float ]
            [ sh:datatype xsd:integer ]
        ) ;
        sh:message "A numerical variable needs exactly one integer, float, or double value." ;
    ] ;
    sh:property [
        sh:path m4i:hasUnit ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:class qudt:Unit ;
        sh:message "A numerical variable needs exactly one QUDT unit." ;
    ] ;
    sh:property [
        sh:path m4i:hasKindOfQuantity ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:class qudt:QuantityKind ;
        sh:message "A numerical variable needs exactly one QUDT quantity kind." ;
    ] .
"""

The profile uses QUDT classes in sh:class constraints. For a self-contained example, we provide the small amount of ontology information needed by the validator locally through ont_graph. In a production workflow this graph can come from the same controlled vocabularies referenced by the application profile.

from rdflib import Graph
from ontolutils import M4I, QUDT_UNIT, QUDT_KIND

ont_graph = Graph().parse(
    data="""
@prefix qudt: <http://qudt.org/schema/qudt/> .
@prefix unit: <http://qudt.org/vocab/unit/> .
@prefix quantitykind: <http://qudt.org/vocab/quantitykind/> .

unit:M-PER-SEC a qudt:Unit .
quantitykind:Velocity a qudt:QuantityKind .
""",
    format="turtle",
)

Now the HDF5 dataset is annotated with the same semantics expected by the application profile. h5RDMtoolbox serializes those annotations to RDF and validates the resulting graph against the SHACL profile.

with h5tbx.File() as h5:
    ds = h5.create_dataset(
        "velocity",
        data=4.5,
        attrs={"units": "m/s", "quantity_kind": "velocity"},
    )
    ds.rdf.type = M4I.NumericalVariable
    ds.rdf.data_predicate = M4I.hasNumericalValue
    ds.rdf["units"].predicate = M4I.hasUnit
    ds.rdf["units"].object = QUDT_UNIT.M_PER_SEC
    ds.rdf["quantity_kind"].predicate = M4I.hasKindOfQuantity
    ds.rdf["quantity_kind"].object = QUDT_KIND.Velocity
    aims_valid_filename = h5.hdf_filename

res = h5tbx.validate_hdf(
    hdf_source=aims_valid_filename,
    shacl_data=application_profile,
    ont_graph=ont_graph,
    inference="rdfs",
)
res.conforms
True

If the required semantic metadata is missing, the same AIMS application profile reports the problem before the file is passed to downstream analysis, a data catalog, or a repository.

with h5tbx.File() as h5:
    ds = h5.create_dataset("velocity", data=4.5)
    ds.rdf.type = M4I.NumericalVariable
    ds.rdf.data_predicate = M4I.hasNumericalValue
    aims_invalid_filename = h5.hdf_filename

res = h5tbx.validate_hdf(
    hdf_source=aims_invalid_filename,
    shacl_data=application_profile,
    ont_graph=ont_graph,
    inference="rdfs",
)
res.messages
['A numerical variable needs exactly one QUDT unit.',
 'A numerical variable needs exactly one QUDT quantity kind.']
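The pre-publication check described in this section can be extended to a whole batch of files. Below is a minimal sketch, assuming a validator callable that returns an object with `conforms` and `messages` attributes, as `validate_hdf` does above. `collect_failures` and `fake_validate` are hypothetical names; the fake validator only exists so that the sketch runs without HDF5 files.

```python
from types import SimpleNamespace


def collect_failures(filenames, validate):
    """Map each non-conforming file to its list of validation messages."""
    failures = {}
    for filename in filenames:
        result = validate(filename)
        if not result.conforms:
            failures[filename] = list(result.messages)
    return failures


def fake_validate(filename):
    # Hypothetical stand-in: files whose name contains "invalid" fail.
    if "invalid" in filename:
        return SimpleNamespace(conforms=False, messages=["missing unit"])
    return SimpleNamespace(conforms=True, messages=[])


# In practice, replace fake_validate with a wrapper such as:
#   lambda f: h5tbx.validate_hdf(hdf_source=f, shacl_data=application_profile,
#                                ont_graph=ont_graph, inference="rdfs")
report = collect_failures(["valid.hdf", "invalid.hdf"], fake_validate)
print(report)  # {'invalid.hdf': ['missing unit']}
```

Passing the validator in as a callable keeps the batch logic independent of the profile and ontology graph, so the same loop can be reused for different application profiles.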