SHACL Validation for HDF Files

The Shapes Constraint Language (SHACL) is a W3C standard for validating RDF data against a set of rules, called shapes. These shapes describe the expected structure and constraints of RDF graphs, such as required properties, allowed values, or data types.

SHACL is useful because it helps ensure that RDF data is consistent, complete, and conforms to a predefined schema or ontology. In workflows like extracting RDF from HDF files, SHACL allows you to automatically verify that the generated data matches the expected structure (e.g., required datasets, naming conventions, or value constraints).

By applying SHACL validation using the validate_hdf function from the h5RDMtoolbox, you can detect errors early, enforce data quality, and make your data more reliable for downstream processing and analysis.

The following example demonstrates this.

import h5rdmtoolbox as h5tbx

import rdflib
import pyshacl

Define a SHACL Shape

The following SHACL shape validates all nodes of type hdf:Dataset. It checks their hdf:name property and requires that its value is either “/velocity” or “/data/velocity”. If a dataset has a different name, it fails validation.

shacl_shape = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<VelocityShape1>
    a sh:NodeShape ;
    sh:targetClass hdf:Dataset ;
    sh:property [
        sh:path hdf:name ;
        sh:or (
            [ sh:hasValue "/velocity" ]
            [ sh:hasValue "/data/velocity" ]
        ) ;
    ] ."""
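
The same restriction can likely be written more compactly with sh:in, which lists the allowed values directly instead of combining sh:hasValue constraints through sh:or. The following is a minimal sketch based on the SHACL specification; the variable name shacl_shape_in and the shape name VelocityShape2 are hypothetical, and the shape has not been tested against the h5RDMtoolbox here.

```python
# Alternative shape (hypothetical name VelocityShape2): sh:in lists the allowed
# values directly instead of combining sh:hasValue constraints with sh:or.
shacl_shape_in = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .

<VelocityShape2>
    a sh:NodeShape ;
    sh:targetClass hdf:Dataset ;
    sh:property [
        sh:path hdf:name ;
        sh:in ( "/velocity" "/data/velocity" ) ;
    ] ."""
```

Neither variant requires hdf:name to be present: both constraints only apply to values that exist, so a dataset without any hdf:name would pass. Adding sh:minCount 1 to the property shape would make the property mandatory.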

Validate an HDF5 File

Let’s first create two test files, one that conforms to the shape and one that does not:

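# File whose datasets all carry a name allowed by the shape
with h5tbx.File("valid.hdf", "w") as h5:
    h5.create_group("data")
    h5.create_dataset("velocity", data=1)
    h5.create_dataset("data/velocity", data=1)

# File with a dataset name ("/pressure") that the shape does not allow
with h5tbx.File("invalid.hdf", "w") as h5:
    h5.create_group("data")
    h5.create_dataset("pressure", data=1)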

Now we can import validate_hdf and validate both files:

from h5rdmtoolbox.ld.shacl import validate_hdf
res = validate_hdf(hdf_source="valid.hdf", shacl_data=shacl_shape)
res
ValidationResult(conforms=True, results_graph=<Graph identifier=N3695fd4c2b9f411ca49c63618519dfe9 (<class 'rdflib.graph.Graph'>)>, results_text='Validation Report\nConforms: True\n', messages=[], nodes=[])
res = validate_hdf(hdf_source="invalid.hdf", shacl_data=shacl_shape)
res
ValidationResult(conforms=False, results_graph=<Graph identifier=N5815c04cb05c497ea6e39103622cdd86 (<class 'rdflib.graph.Graph'>)>, results_text='Validation Report\nConforms: False\nResults (1):\nConstraint Violation in OrConstraintComponent (http://www.w3.org/ns/shacl#OrConstraintComponent):\n\tSeverity: sh:Violation\n\tSource Shape: [ sh:or ( [ sh:hasValue Literal("/velocity") ] [ sh:hasValue Literal("/data/velocity") ] ) ; sh:path hdf:name ]\n\tFocus Node: <https://example.org/hdf5file#invalid.hdf/pressure>\n\tValue Node: Literal("/pressure")\n\tResult Path: hdf:name\n\tMessage: Node Literal("/pressure") must conform to one or more shapes in [ sh:hasValue Literal("/velocity") ] , [ sh:hasValue Literal("/data/velocity") ]\n', messages=['Node Literal("/pressure") must conform to one or more shapes in [ sh:hasValue Literal("/velocity") ] , [ sh:hasValue Literal("/data/velocity") ]'], nodes=[rdflib.term.URIRef('https://example.org/hdf5file#invalid.hdf/pressure')])

The output above shows that the file “invalid.hdf” does not conform to the shape. The result also tells us what the issue is:

print(res.results_text)
Validation Report
Conforms: False
Results (1):
Constraint Violation in OrConstraintComponent (http://www.w3.org/ns/shacl#OrConstraintComponent):
	Severity: sh:Violation
	Source Shape: [ sh:or ( [ sh:hasValue Literal("/velocity") ] [ sh:hasValue Literal("/data/velocity") ] ) ; sh:path hdf:name ]
	Focus Node: <https://example.org/hdf5file#invalid.hdf/pressure>
	Value Node: Literal("/pressure")
	Result Path: hdf:name
	Message: Node Literal("/pressure") must conform to one or more shapes in [ sh:hasValue Literal("/velocity") ] , [ sh:hasValue Literal("/data/velocity") ]
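
When several files need checking, the fields shown in the result above (conforms, messages, nodes) can be condensed into one line per file. The following is a minimal sketch: summarize, validate_many and fake_validator are hypothetical helper names and not part of the h5RDMtoolbox. The stand-in validator only mimics the result fields seen above so that the sketch runs without HDF files; in practice validate_hdf would be passed instead.

```python
from types import SimpleNamespace


def summarize(source, result):
    """Build a one-line summary from a result's conforms, messages and nodes."""
    if result.conforms:
        return f"{source}: conforms"
    nodes = ", ".join(str(node) for node in result.nodes)
    return f"{source}: {len(result.messages)} violation(s) at {nodes}"


def validate_many(sources, shacl_data, validator):
    """Call validator(hdf_source=..., shacl_data=...) per file and summarize."""
    return [
        summarize(source, validator(hdf_source=source, shacl_data=shacl_data))
        for source in sources
    ]


def fake_validator(hdf_source, shacl_data):
    """Hypothetical stand-in for validate_hdf; returns objects with the same fields."""
    if hdf_source == "valid.hdf":
        return SimpleNamespace(conforms=True, messages=[], nodes=[])
    return SimpleNamespace(
        conforms=False,
        messages=["name not allowed"],
        nodes=["https://example.org/hdf5file#invalid.hdf/pressure"],
    )


summaries = validate_many(["valid.hdf", "invalid.hdf"], "", fake_validator)
for line in summaries:
    print(line)
```

With the real function, the call would presumably look like validate_many(["valid.hdf", "invalid.hdf"], shacl_shape, validate_hdf), since validate_hdf accepts the same two keyword arguments.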