
Data Validation Workflow

This repository contains a prototype workflow for automating data validation in federated data spaces. The framework addresses the challenge of ensuring data quality and compliance across distributed data environments where multiple organizations share data products while maintaining autonomy over their data governance.

Problem Statement

In federated data spaces, organizations need to validate data quality and policy compliance across heterogeneous data sources. Current approaches fall short in:

  • Policy Semantics: Interpreting open vocabularies such as the Open Digital Rights Language (ODRL)
  • Data Syntactics: Adapting to multi-modal data types (CSV, DICOM, ML models)
  • Transparency & Traceability: Describing what validation was performed, how, and by whom, in order to promote trust

Architectural Framework Overview

Our approach uses knowledge graphs as the foundation for automated data validation, enabling policy-driven validation across federated environments while preserving data sovereignty.

The approach is built upon a federated architectural framework and encompasses the following layers:

  • Data Product Layer: Contains data products and their infrastructure extensions (e.g., connectors).
  • Data Platform Layer: Functions as a central gateway for data management processes, including data asset registration and analytical studies.
  • Federated Computational Governance Layer: Establishes guidelines and artifacts for data product management and governance.

Semantic Data Model (SDM)

Central to the Federated Computational Governance Layer is the Semantic Data Model - a graph-based structure that captures system metadata and annotations needed to automate validation processes. The SDM serves as the backbone for policy interpretation and validation rule generation.
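
As a minimal sketch of how the SDM can be inspected (assuming rdflib and the sdm.ttl file listed in the folder overview below; the query itself is illustrative and not part of the prototype):

from rdflib import Graph

# Load the Semantic Data Model shipped with the prototype.
sdm = Graph()
sdm.parse("FederatedComputationalGovernance/SemanticDataModel/sdm.ttl", format="turtle")

# Quick overview: count the instances per class annotated in the SDM.
overview_query = """
SELECT ?cls (COUNT(?s) AS ?n)
WHERE { ?s a ?cls }
GROUP BY ?cls
"""
for row in sdm.query(overview_query):
    print(row.cls, row.n)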

Data Validation Workflow

The validation workflow operates through two main phases:

  1. Planning Phase (PLANNER)

  • Processes data product metadata along with graph transformation rules
  • Generates policy checkers that serve as data validation plans
  • Ensures validation plans align with federated governance policies

  2. Execution Phase (EXECUTOR)

  • Translates policy checkers into executable User-Defined Functions (UDFs)
  • Executes validation logic over data sources
  • Generates validation reports and annotations
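
As a rough illustration of this translation step (a hedged sketch, not the actual executor code; build_udf and the CSV path are hypothetical), a code metadata entry such as the one shown later in this README, with tb:code "pandas.read_csv(p)" and parameter p, could be wrapped into a callable UDF roughly like this:

import pandas  # declared as a tb:Library dependency in code_metadata.json

def build_udf(code, param_names):
    """Hypothetical helper: wrap a tb:code expression into a Python UDF.

    code        -- the expression stored under tb:code, e.g. "pandas.read_csv(p)"
    param_names -- the declared tb:Parameter names, e.g. ["p"]
    """
    def udf(*args):
        # Bind call arguments to the declared parameter names and evaluate
        # the stored expression in that scope.
        scope = dict(zip(param_names, args))
        scope["pandas"] = pandas
        return eval(code, scope)
    return udf

# Example: load a tabular data product through the generated UDF
# (the CSV path below is illustrative).
load_data = build_udf("pandas.read_csv(p)", ["p"])
df = load_data("DataProductLayer/DataProduct1/data.csv")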

Workflow Steps

  1. Data consumer initiates an analytical service request
  2. Request is processed through the validation workflow
  3. PLANNER generates appropriate validation plans based on policies and data product metadata
  4. EXECUTOR translates plans into executable code
  5. Validation results are annotated in the SDM and can be queried

Key Features

  • Policy-based data validation
  • Support for multi-modal data
  • Traceability and Transparency

Prototype Folders and Files

| Folder | File | Description |
|---|---|---|
| FRAMEWORK | | |
| FederatedComputationalGovernance | ComputationalCatalogues/p*.json | Federation-defined Policies in JSON-LD |
| FederatedComputationalGovernance | ComputationalCatalogues/common_data_models.json | Common Data Models in JSON-LD |
| FederatedComputationalGovernance | FederatedTeam/tbox.ttl | Terminology Box for the Semantic Data Model |
| FederatedComputationalGovernance | SemanticDataModel/sdm.ttl | Semantic Data Model with all annotations |
| DataProductLayer | DataProduct1 | CSV with Patient Demographics (Bakas et al. 2021) |
| DataProductLayer | DataProduct2 | DICOM Image (Bakas et al. 2021) |
| DataProductLayer | DataProduct3 | ML Model (manually trained from Data Product 1) |
| DataPlatformLayer | Registration/profiler.ipynb | Notebook to automate bootstrapping of data sources |
| DataPlatformLayer | Registration/integrator.ipynb | Notebook to generate mappings from DataProducts to CommonDataModels |
| DataPlatformLayer | Registration/dpX.json | Data Products integrated with Data Contracts |
| WORKFLOW | | |
| Connector | parser/planner.ipynb | Planner implementation |
| Connector | parser/rules | Graph Transformation Rules as SPARQL CONSTRUCT queries |
| Connector | translator/executor.ipynb | Executor implementation |
| Connector | translator/code_metadata.json | Code metadata in JSON-LD |
| Connector | experiments/policy_times | Measure processing times |
| Connector | experiments/transforming_bottleneck | Workflow vs RDF validation |
| Connector | experiments/ecosystem | Measure parsing scalability |

Quick-Start Instructions

Prerequisites

  • Python 3.8+
  • RDF processing libraries (rdflib, SPARQLWrapper)
  • Jupyter Notebook environment
  • Basic understanding of semantic web technologies

To reproduce the framework and workflow, you can follow these steps:

  1. Clone the repository
  2. Install required dependencies using pip install -r requirements.txt

Reproduce Workflow

Running the Complete Workflow

The demo.ipynb notebook in the demo folder provides a complete walkthrough of the validation workflow, including:

  • Federation Metadata population (Registration of Data Products, Common Data Models and Policies)
  • Policy Checker Generation (Planner)
  • Dynamic Data Validation Code Generation (Executor)
  • Example Validation Reports and Traceability Queries
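
For example, a traceability query could ask which operations each generated policy checker triggered. The sketch below is hypothetical and only uses the tb:nextStep and tb:Operation terms that appear in the transformation rules; consult tbox.ttl and demo.ipynb for the actual report vocabulary:

from rdflib import Graph

sdm = Graph()
sdm.parse("FederatedComputationalGovernance/SemanticDataModel/sdm.ttl", format="turtle")

# List each policy checker together with the operations planned for it.
traceability_query = """
PREFIX tb: <http://www.semanticweb.org/acraf/ontologies/2024/healthmesh/tbox#>
SELECT ?checker ?op ?abstractOp
WHERE {
    ?checker tb:nextStep ?op .
    ?op a tb:Operation ;
        tb:hasAbstract ?abstractOp .
}
"""
for row in sdm.query(traceability_query):
    print(row.checker, row.op, row.abstractOp)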

Reproduce (TP6 - Efficiency) Experiments

Efficiency experiment scripts can be found in Connector/ValidationFramework/experiments, which contains three subfolders with scripts to reproduce the results.

Extensions

To extend the framework with new data validation use cases, you will need to:

  1. Add a new graph transformation rule in the Connector/ValidationFramework/planner/rules folder as a SPARQL CONSTRUCT query

Example Graph Transformation Rule:

PREFIX odrl: <http://www.w3.org/ns/odrl/2/>
PREFIX tb: <http://www.semanticweb.org/acraf/ontologies/2024/healthmesh/tbox#>
PREFIX ab: <http://www.semanticweb.org/acraf/ontologies/2024/healthmesh/abox#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
    <{op_uri}> a tb:Operation;
        tb:hasAbstract ab:LoadData;
        tb:hasOutput ab:data;
        tb:hasInput ?param.
    <{pc}> tb:nextStep <{op_uri}>.
}
WHERE {
    <{dp}> tb:hasTA ?ta .
    ?ta tb:typeAcces ?access .
    ?access rdfs:label ?label ;
    tb:path ?path .
    FILTER(?label = ab:Static)
    BIND(?path AS ?param)
}
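
The {op_uri}, {pc} and {dp} placeholders are filled in at planning time. A minimal sketch of how such a rule could be instantiated and applied with rdflib (the actual logic lives in planner.ipynb; the rule file name and the local URI names below are illustrative):

from pathlib import Path
from rdflib import Graph

ABOX = "http://www.semanticweb.org/acraf/ontologies/2024/healthmesh/abox#"

sdm = Graph()
sdm.parse("FederatedComputationalGovernance/SemanticDataModel/sdm.ttl", format="turtle")

# Read the rule template and substitute the placeholders for a concrete
# data product, policy checker and new operation node.
template = Path("Connector/ValidationFramework/planner/rules/load_data.rq").read_text()
rule = (template
        .replace("{dp}", ABOX + "DataProduct1")
        .replace("{pc}", ABOX + "PolicyChecker1")
        .replace("{op_uri}", ABOX + "LoadDataOp1"))

# Run the CONSTRUCT rule and merge the resulting triples (the validation plan
# step) back into the Semantic Data Model.
new_step = sdm.query(rule).graph
sdm += new_step
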
  2. Add a code metadata instance in Connector/ValidationFramework/executor/code_metadata.json

Example Code Metadata Instance in JSON-LD:

{
  "@id": "ab:Imp1",
  "@type": "tb:Implementation",
  "tb:hasCode": [
    {
      "@id": "ab:Imp1Code",
      "@type": "tb:Code",
      "tb:code": "pandas.read_csv(p)"
    }
  ],
  "tb:hasParameters": [
    {
      "@id": "ab:Imp1CodeParam1",
      "@type": "tb:Parameter",
      "tb:name": "p",
      "tb:type": "str"
    }
  ],
  "tb:dependsOn": [
    {
      "@id": "ab:Imp1CodeDep1",
      "tb:name": "pandas",
      "@type": "tb:Library"
    }
  ],
  "tb:forOp": [
    {
      "@id": "ab:LoadData",
      "@type": "tb:Operation"
    }
  ],
  "tb:forType": [
    {
      "@id": "ab:Tabular",
      "@type": "tb:DatasetTypeTemplate"
    }
  ],
  "tb:returns": "pandas.DataFrame"
}

For best practices, we recommend the following:

  • Use standard vocabularies (DCAT, DQV, etc.) for better interoperability
  • Follow semantic web principles for policy definitions
  • Ensure validation logic is deterministic and reproducible

Future Work

  • Support for more expressive policy patterns and use cases
  • Integration and deployment within existing Data Space components (e.g., Eclipse DataSpace Components)
  • Optimizations and user-framework interaction interfaces

About

Code for the master's thesis: Contract-based data validation framework
