This repository contains a prototype workflow for automating data validation in federated data spaces. The framework addresses the challenge of ensuring data quality and compliance across distributed data environments where multiple organizations share data products while maintaining autonomy over their data governance.
In federated data spaces, organizations need to validate data quality and policy compliance across heterogeneous data sources. Current approaches fall short in three areas:
- Policy Semantics: interpreting open vocabularies such as the Open Digital Rights Language (ODRL)
- Data Syntactics: adapting to multi-modal data types (CSV, DICOM, ML models)
- Transparency & Traceability: describing what was validated, how, and by whom, to promote trust
Our approach uses knowledge graphs as the foundation for automated data validation, enabling policy-driven validation across federated environments while preserving data sovereignty.
The approach is built upon a federated architectural framework and encompasses the following layers:
- Data Product Layer: Contains data products and their infrastructure extensions (e.g., connectors).
- Data Platform Layer: Functions as a central gateway for data management processes, including data asset registration and analytical studies.
- Federated Computational Governance Layer: Establishes guidelines and artifacts for data product management and governance.
Central to the Federated Computational Governance Layer is the Semantic Data Model (SDM), a graph-based structure that captures the system metadata and annotations needed to automate validation processes. The SDM serves as the backbone for policy interpretation and validation rule generation.
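As a quick illustration, the SDM can be loaded and inspected with standard RDF tooling. The snippet below is a minimal sketch using rdflib, assuming the repository layout shown in the table further down; the query simply lists every typed resource in the graph.

```python
# Minimal sketch: load the Semantic Data Model and list its typed resources.
# Assumes rdflib is installed and the sdm.ttl path matches the repository layout.
from rdflib import Graph

sdm = Graph()
sdm.parse("FederatedComputationalGovernance/SemanticDataModel/sdm.ttl", format="turtle")

query = """
SELECT DISTINCT ?resource ?class
WHERE { ?resource a ?class . }
"""
for resource, cls in sdm.query(query):
    print(resource, cls)
```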
The validation workflow operates through two main phases:
- Planning Phase (PLANNER): processes data product metadata along with graph transformation rules, generates policy checkers that serve as data validation plans, and ensures these plans align with federated governance policies.
- Execution Phase (EXECUTOR): translates policy checkers into executable User-Defined Functions (UDFs), executes the validation logic over data sources, and generates validation reports and annotations. (A conceptual sketch of both phases follows the workflow steps below.)
A typical validation request proceeds as follows:
- Data consumer initiates an analytical service request
- Request is processed through the validation workflow
- PLANNER generates appropriate validation plans based on policies and data product metadata
- EXECUTOR translates plans into executable code
- Validation results are annotated in the SDM and can be queried
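The sketch below illustrates the two phases conceptually. The function names and signatures are hypothetical simplifications, not the notebook API; the actual implementations live in parser/planner.ipynb and translator/executor.ipynb.

```python
# Conceptual sketch of the two phases (hypothetical function names, not the notebook API).
from rdflib import Graph

def plan(dp_metadata, rules):
    """PLANNER: apply each graph transformation rule (SPARQL CONSTRUCT) to the
    data product metadata and accumulate the results into a policy-checker graph."""
    checker = Graph()
    for rule in rules:
        for triple in dp_metadata.query(rule):   # CONSTRUCT results iterate as triples
            checker.add(triple)
    return checker

def execute(checker, udf_registry):
    """EXECUTOR: resolve each operation in the policy checker to a UDF
    (built from code metadata) and collect a validation report."""
    report = {}
    for operation in set(checker.subjects()):    # simplified traversal of checker steps
        udf = udf_registry.get(operation)
        if udf is not None:
            report[operation] = udf()            # run the validation logic for this step
    return report
```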
The framework provides:
- Policy-based data validation
- Support for multi-modal data
- Traceability and transparency
Folder | File | Description |
---|---|---|
FRAMEWORK | | |
FederatedComputationalGovernance | ComputationalCatalogues/p*.json | Federation-defined policies in JSON-LD |
FederatedComputationalGovernance | ComputationalCatalogues/common_data_models.json | Common Data Models in JSON-LD |
FederatedComputationalGovernance | FederatedTeam/tbox.ttl | Terminology Box for the Semantic Data Model |
FederatedComputationalGovernance | SemanticDataModel/sdm.ttl | Semantic Data Model with all annotations |
DataProductLayer | DataProduct1 | CSV with patient demographics (Bakas et al. 2021) |
DataProductLayer | DataProduct2 | DICOM image (Bakas et al. 2021) |
DataProductLayer | DataProduct3 | ML model (manually trained from Data Product 1) |
DataPlatformLayer | Registration/profiler.ipynb | Notebook to automate bootstrapping of data sources |
DataPlatformLayer | Registration/integrator.ipynb | Notebook to generate mappings from Data Products to Common Data Models |
DataPlatformLayer | Registration/dpX.json | Data Products integrated with Data Contracts |
WORKFLOW | | |
Connector | parser/planner.ipynb | Planner implementation |
Connector | parser/rules | Graph transformation rules as SPARQL CONSTRUCT queries |
Connector | translator/executor.ipynb | Executor implementation |
Connector | translator/code_metadata.json | Code metadata in JSON-LD |
Connector | experiments/policy_times | Measures processing times |
Connector | experiments/transforming_bottleneck | Workflow vs. RDFValidation |
Connector | experiments/ecosystem | Measures parsing scalability |
- Python 3.8+
- RDF processing libraries (rdflib, SPARQLWrapper)
- Jupyter Notebook environment
- Basic understanding of semantic web technologies
To reproduce the framework and workflow, you can follow these steps:
- Clone the repository
- Install required dependencies using
pip install -r requirements.txt
Running the Complete Workflow
The demo.ipynb notebook in the demo folder provides a complete walkthrough of the validation workflow, including:
- Federation Metadata population (Registration of Data Products, Common Data Models and Policies)
- Policy Checker Generation (Planner)
- Dynamic Data Validation Code Generation (Executor)
- Example Validation Reports and Traceability Queries (a minimal query sketch follows this list)
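As a minimal illustration of such a traceability query, the snippet below lists every annotation attached to a data product node in the SDM. The data product URI is hypothetical and should be replaced by one registered during the demo; the sdm.ttl path follows the layout in the table above.

```python
# Minimal sketch of a traceability query over the SDM.
# The data product URI below is hypothetical; use one registered during the demo.
from rdflib import Graph, URIRef

sdm = Graph()
sdm.parse("FederatedComputationalGovernance/SemanticDataModel/sdm.ttl", format="turtle")

dp = URIRef("http://www.semanticweb.org/acraf/ontologies/2024/healthmesh/abox#DataProduct1")

# List every annotation (predicate/object pair) attached to the data product,
# including any validation reports linked to it.
for predicate, obj in sdm.predicate_objects(subject=dp):
    print(predicate, obj)
```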
Efficiency experiment scripts can be found in Connector/ValidationFramework/experiments, which contains three subfolders with scripts to reproduce the results.
To extend the framework with new data validation use cases, you will need to:
- Add a new graph transformation rule in the Connector/ValidationFramework/planner/rules folder as a SPARQL CONSTRUCT query
Example Graph Transformation Rule:
PREFIX odrl: <http://www.w3.org/ns/odrl/2/>
PREFIX tb: <http://www.semanticweb.org/acraf/ontologies/2024/healthmesh/tbox#>
PREFIX ab: <http://www.semanticweb.org/acraf/ontologies/2024/healthmesh/abox#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
<{op_uri}> a tb:Operation;
tb:hasAbstract ab:LoadData;
tb:hasOutput ab:data;
tb:hasInput ?param.
<{pc}> tb:nextStep <{op_uri}>.
}
WHERE {
<{dp}> tb:hasTA ?ta .
?ta tb:typeAcces ?access .
?access rdfs:label ?label ;
tb:path ?path .
FILTER(?label = ab:Static)
BIND(?path AS ?param)
}
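The placeholders {op_uri}, {pc}, and {dp} are filled in at planning time. The snippet below is a minimal sketch of how such a rule template could be instantiated and applied with rdflib; the rule filename and URIs are illustrative, and the actual mechanism is implemented in the planner notebook.

```python
# Minimal sketch: instantiate a rule template and apply it with rdflib.
# The rule filename and URIs are illustrative placeholders.
from rdflib import Graph

ABOX = "http://www.semanticweb.org/acraf/ontologies/2024/healthmesh/abox#"

with open("Connector/ValidationFramework/planner/rules/load_data.rq") as f:  # hypothetical filename
    template = f.read()

rule = (template
        .replace("{op_uri}", ABOX + "Op1")            # operation node to create
        .replace("{pc}", ABOX + "PolicyChecker1")     # policy checker being extended
        .replace("{dp}", ABOX + "DataProduct1"))      # data product under validation

metadata = Graph()
metadata.parse("FederatedComputationalGovernance/SemanticDataModel/sdm.ttl", format="turtle")

# A CONSTRUCT query yields new triples that extend the policy checker.
for triple in metadata.query(rule):
    print(triple)
```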
- Add a code metadata instance in Connector/ValidationFramework/executor/code_metadata.json
Example Code Metadata Instance in JSON-LD:
{
"@id": "ab:Imp1",
"@type": "tb:Implementation",
"tb:hasCode": [
{
"@id": "ab:Imp1Code",
"@type": "tb:Code",
"tb:code": "pandas.read_csv(p)"
}
],
"tb:hasParameters": [
{
"@id": "ab:Imp1CodeParam1",
"@type": "tb:Parameter",
"tb:name": "p",
"tb:type": "str"
}
],
"tb:dependsOn": [
{
"@id": "ab:Imp1CodeDep1",
"tb:name": "pandas",
"@type": "tb:Library"
}
],
"tb:forOp": [
{
"@id": "ab:LoadData",
"@type": "tb:Operation"
}
],
"tb:forType": [
{
"@id": "ab:Tabular",
"@type": "tb:DatasetTypeTemplate"
}
],
"tb:returns": "pandas.DataFrame"
},
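The snippet below is a minimal sketch of how the executor could turn such an entry into a runnable UDF. It assumes code_metadata.json contains a JSON array of entries like the one above; the CSV path at the end is hypothetical, and the actual translation logic lives in the executor notebook.

```python
# Minimal sketch: build a UDF from a code metadata entry.
# Assumes code_metadata.json holds a JSON array of entries like the one above.
import importlib
import json

with open("Connector/ValidationFramework/executor/code_metadata.json") as f:
    entries = json.load(f)

imp = entries[0]                                              # e.g. ab:Imp1 above
params = [p["tb:name"] for p in imp["tb:hasParameters"]]      # -> ["p"]
body = imp["tb:hasCode"][0]["tb:code"]                        # -> "pandas.read_csv(p)"

# Import the declared dependencies and expose them to the generated function.
namespace = {dep["tb:name"]: importlib.import_module(dep["tb:name"])
             for dep in imp["tb:dependsOn"]}

# Generate a function whose signature matches the declared parameters.
exec(f"def udf({', '.join(params)}):\n    return {body}", namespace)
load_data = namespace["udf"]

df = load_data("DataProductLayer/DataProduct1/demographics.csv")  # hypothetical CSV path
```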
For best practices, we recommend:
- Use standard vocabularies (DCAT, DQV, etc.) for better interoperability (see the sketch after this list)
- Follow semantic web principles for policy definitions
- Ensure validation logic is deterministic and reproducible
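For instance, a data product entry in the SDM could be annotated with DCAT and DQV terms alongside the framework's own vocabulary. The snippet below is a hedged sketch: the data product URI and the completeness measurement are illustrative, not part of the shipped model.

```python
# Minimal sketch: annotate a data product with DCAT/DQV terms for interoperability.
# The data product URI and the completeness measurement are illustrative.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCAT, RDF, XSD

DQV = Namespace("http://www.w3.org/ns/dqv#")
ABOX = Namespace("http://www.semanticweb.org/acraf/ontologies/2024/healthmesh/abox#")

g = Graph()
dp = ABOX.DataProduct1
measurement = ABOX.DataProduct1Completeness

g.add((dp, RDF.type, DCAT.Dataset))
g.add((dp, DQV.hasQualityMeasurement, measurement))
g.add((measurement, RDF.type, DQV.QualityMeasurement))
g.add((measurement, DQV.value, Literal(0.98, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```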
Planned future work includes:
- Support for more expressive policy patterns and use cases
- Integration and deployment within existing Data Space components (e.g., Eclipse DataSpace Components)
- Optimizations and user-framework interaction interfaces