
Market research on versioning tools


Original issue: https://github.com/kedro-org/kedro/issues/3933

Data versioning (Miro Board)

Why "data versioning" is important?

Data versioning is the practice of tracking and managing changes to datasets over time. This includes capturing versions of data as it evolves, enabling reproducibility, rollback capabilities, and auditability. Data versioning is crucial for maintaining data integrity and ensuring that data pipelines and machine learning models are reproducible and reliable.

Feature Comparison Matrix


1. Delta Lake

Click here to see Delta Lake's versioning workflow

Delta Lake, by Databricks, is an open-source storage layer that enables building a lakehouse architecture on top of data lakes. It is designed to provide ACID (Atomicity, Consistency, Isolation, Durability) transactions, scalable metadata handling, and unified streaming and batch data processing. Built on top of Apache Spark, Delta Lake enhances data lakes by addressing common challenges such as data reliability and consistency.
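
As a rough sketch of this versioning model (assuming a Spark session with the delta-spark package installed; the table path is illustrative), each write below commits a new version to the Delta transaction log, and an earlier version can be read back via time travel:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is available; session config follows the Delta Lake docs
spark = (
    SparkSession.builder.appName("delta-versioning-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/delta/events"  # illustrative table location

# Each write commits a new version to the Delta transaction log
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)
spark.range(5, 10).write.format("delta").mode("overwrite").save(path)

# Read the latest version, then "time travel" back to version 0
latest = spark.read.format("delta").load(path)
version0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
```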

Strengths

  1. ACID Transactions: Delta Lake provides strong consistency guarantees through ACID transactions, ensuring data integrity and reliability.
  2. Unified Batch and Streaming Processing: Delta Lake supports both batch and streaming data processing in a unified manner.
  3. Time Travel: Delta Lake's time travel feature allows users to query historical versions of data.
  4. Schema Enforcement and Evolution: Delta Lake enforces schemas at write time and supports schema evolution, allowing changes to the schema without breaking existing queries.
  5. Scalability and Performance: Delta Lake optimizes storage and querying through techniques like data compaction and Z-Ordering.
  6. Integration with Spark: Built on top of Apache Spark, Delta Lake integrates seamlessly with the Spark ecosystem, enabling powerful data processing capabilities.
  7. Rich Ecosystem and Enterprise Support: Backed by Databricks, Delta Lake benefits from a mature ecosystem and commercial support.

Weaknesses

  1. Limited Direct Support for Unstructured Data: Delta Lake is primarily designed for structured and semi-structured data.
  2. Complexity in Setup and Management: Setting up and managing Delta Lake can be complex, particularly for teams not familiar with Spark.
  3. Tight Coupling with Apache Spark: Delta Lake is heavily dependent on Apache Spark for its operations.

2. DVC

Click here to see DVC's versioning workflow

DVC, or Data Version Control, is an open-source tool designed specifically for data science and machine learning projects. It combines the version-control model of Git with functionality tailored to large datasets, allowing users to track data changes, collaborate efficiently, and ensure project reproducibility by referencing specific data versions. In short, DVC is "Git for data": just as Git tracks changes to your code, DVC tracks changes to your data.
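
A minimal sketch of consuming a pinned data version through DVC's Python API; the repository URL, file path, and Git tag below are placeholders for illustration only:

```python
import dvc.api

# Open a DVC-tracked file at the data version pinned by a Git tag.
# Repo URL, path, and tag are hypothetical.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example-org/example-project",
    rev="v1.0",
) as f:
    header = f.readline()
    print(header)
```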

Strengths

  1. Integration with Git: DVC seamlessly integrates with Git, leveraging familiar version control workflows for managing datasets and models. This integration makes it easy for teams already using Git to adopt DVC without significant changes to their workflow.
  2. Efficient Large File Management: DVC efficiently handles large files by storing them in remote storage backends and only keeping metadata in the Git repository. This avoids bloating the Git repository and ensures efficient data management.
  3. Reproducibility: DVC's pipeline management and experiment tracking features ensure that data workflows are reproducible. Users can recreate specific experiment runs by tracking versions of data, models, and code.
  4. Flexible Remote Storage: DVC supports various remote storage options, including AWS S3, Google Cloud Storage, Azure Blob Storage, and more. This flexibility allows users to choose storage solutions that best fit their needs.
  5. Experiment Management: DVC's experiment management capabilities, including checkpointing and comparing experiment runs, provide a robust framework for tracking and optimizing machine learning experiments.
  6. Open Source and Community Support: DVC is open source, with an active community contributing to its development and providing support. This ensures continuous improvement and a wealth of shared knowledge and resources.

Weaknesses

  1. CLI Focused: DVC introduces new concepts and CLI commands that users need to learn, which can be a barrier for those not familiar with command-line tools or version control systems.
  2. Limited Scalability for Large Projects: Managing very large projects with complex data pipelines can become cumbersome, as DVC requires careful organization and management of its metadata files and configuration.
  3. Limited Native UI: While DVC provides a powerful CLI, its native graphical user interface (UI) options are limited. Users often rely on third-party tools or custom-built interfaces for visualization and management.
  4. Dependency on Git: DVC's strong dependency on Git means that it might not be suitable for environments where Git is not the primary version control system, or where users are not familiar with Git workflows.
  5. Complexity of Collaborative Configurations: Collaboration with others requires multiple configurations such as setting up remote storage, defining roles, and providing access to each contributor, which can be frustrating and time-consuming.
  6. Inefficient Data Addition Process: Adding new data to storage requires pulling the existing data and recomputing hashes before pushing the full dataset back.
  7. Lack of Relational Database Features: DVC lacks relational database features, making it a poor fit for teams that need relational database capabilities.

3. Apache Hudi

Click here to see Hudi's versioning workflow

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data management framework that helps manage large datasets stored in data lakes. It brings core warehouse and database functionality directly to a data lake. Hudi is designed to provide efficient data ingestion, storage, and query capabilities with strong support for incremental data processing. It enables data engineers to build near real-time data pipelines with support for transactions, indexing, and upserts (updates and inserts).
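
A minimal sketch of Hudi's upsert-and-time-travel flow via its Spark datasource (assuming a Spark session configured with the Hudi Spark bundle; the table path, field names, and commit instant are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-versioning-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", "2024-08-01"), (2, "bob", "2024-08-02")],
    ["id", "name", "ts"],
)

hudi_options = {
    "hoodie.table.name": "users",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

# Each upsert produces a new commit on Hudi's timeline
df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/users")

# Time travel: read the table as of an earlier commit instant (illustrative value)
snapshot = (
    spark.read.format("hudi")
    .option("as.of.instant", "20240801000000")
    .load("/tmp/hudi/users")
)
```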

Strengths

  1. Efficient Incremental Processing: Hudi excels at incremental data processing, allowing for efficient upserts (updates and inserts) and deletes.
  2. ACID Transactions: Hudi supports ACID transactions, ensuring data consistency and reliability.
  3. Real-Time Data Ingestion: Hudi is designed to support near real-time data ingestion and processing, making it suitable for streaming data applications.
  4. Time Travel and Historical Queries: Hudi supports time travel queries, allowing users to access historical versions of data efficiently.
  5. Schema Evolution: Supports schema evolution, allowing for changes to the schema without significant overhead.
  6. Integration with Big Data Ecosystem: Hudi integrates seamlessly with Apache Spark, Apache Hive, Presto, and other big data tools.

Weaknesses

  1. Complexity in Setup and Management: Hudi can be complex to set up and manage, particularly for teams not familiar with the Hadoop ecosystem.
  2. Limited Support for Unstructured Data: Hudi is primarily focused on structured and semi-structured data.
  3. Performance Overhead: Managing frequent updates and maintaining indexes can introduce performance overhead.
  4. Maturity and Ecosystem: While rapidly maturing, Hudi’s ecosystem may not be as mature as some traditional data management tools.

4. Apache Iceberg

Click here to see Iceberg's versioning workflow

Apache Iceberg is an open-source table format for managing large-scale datasets in data lakes, designed for petabyte-scale data. It ensures data consistency, integrity, and performance, and works efficiently with big data processing engines like Apache Spark, Apache Flink, and Apache Hive. Iceberg combines the reliability and simplicity of SQL tables with high performance, enabling multiple engines to safely work with the same tables simultaneously.
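
A minimal sketch of Iceberg's snapshot-based versioning through its Spark integration (assuming a Spark session already configured with an Iceberg catalog named demo; the table name and snapshot id are illustrative):

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg catalog named "demo" is configured on the Spark session
spark = SparkSession.builder.appName("iceberg-versioning-demo").getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, name STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, 'a'), (2, 'b')")  # commits a new snapshot

# Every commit is recorded in the table's snapshot metadata
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()

# Time travel to a specific snapshot (substitute a real id from the query above)
old = (
    spark.read.format("iceberg")
    .option("snapshot-id", 1234567890)
    .load("demo.db.events")
)
```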

Strengths

  1. Schema and Partition Evolution: Supports non-disruptive schema changes and partition evolution, allowing tables to adapt to changing requirements without data rewriting.
  2. Snapshot Isolation and Time Travel: Offers robust snapshot isolation, enabling time travel to query historical versions of data.
  3. Hidden Partitioning: Abstracts partitioning details from users, simplifying query writing while ensuring efficient data access.
  4. Integration with Multiple Big Data Engines: Supports integration with Apache Spark, Flink, Hive, and other big data processing engines.
  5. Atomic Operations: Ensures atomicity for operations like appends, deletes, and updates, providing strong consistency guarantees.

Weaknesses

  1. Complexity in Setup and Management: Setting up and managing Iceberg tables can be complex, particularly for teams not familiar with big data ecosystems.
  2. Limited Direct Support for Unstructured Data: Primarily designed for structured and semi-structured data.
  3. Ecosystem Maturity: While rapidly maturing, Apache Iceberg's ecosystem is newer compared to some competitors like Delta Lake.

5. Pachyderm

Click here to see Pachyderm's versioning workflow

Pachyderm is an open-source data engineering platform that provides data versioning, pipeline management, and reproducibility for large-scale data processing. It combines data lineage and version control with the ability to manage complex data pipelines, making it an ideal tool for data science and machine learning workflows.
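
A minimal sketch of Pachyderm's commit-based workflow, driving the pachctl CLI from Python (assumes a running Pachyderm cluster, a configured pachctl, and a local file named sample.txt; the repo name is illustrative):

```python
import subprocess

def pachctl(*args: str) -> None:
    """Run a pachctl command and fail loudly on errors."""
    subprocess.run(["pachctl", *args], check=True)

# Create a versioned data repository
pachctl("create", "repo", "images")

# Each put-file creates a new commit on the branch
pachctl("put", "file", "images@master:/sample.txt", "-f", "sample.txt")

# Inspect the commit (version) history of the branch
pachctl("list", "commit", "images@master")
```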

Strengths

  1. Comprehensive Data Lineage: Automatically tracks data transformations, making it easy to audit and trace the source of any data.
  2. Robust Versioning: Provides Git-like version control for data, ensuring all changes are tracked and reproducible.
  3. Scalability and Performance: Built to handle large datasets and complex workflows efficiently.
  4. Integration with Kubernetes: Benefits from Kubernetes’ powerful orchestration capabilities for scaling and managing resources.
  5. Reproducibility: Ensures that every step in a data pipeline can be reproduced exactly, which is critical for reliable data science and machine learning workflows.

Weaknesses

  1. Complexity: Can be complex to set up and manage, especially for users unfamiliar with Kubernetes.
  2. Learning Curve: Has a steep learning curve due to its powerful but intricate features.
  3. Resource Intensive: Requires significant computational resources, particularly for large-scale data processing tasks.

Code versioning (Miro Board)

Why "code versioning" is important?

Code versioning is the practice of managing changes to source code over time. It involves tracking and controlling modifications to the codebase to ensure that all changes are recorded, identifiable, and reversible. Code versioning is a fundamental practice in software development and is typically facilitated by version control systems (VCS).

Key Aspects of Code Versioning

  1. Version Control Systems (VCS)
  • Centralized VCS: A single central repository where all versions of the code are stored.
  • Distributed VCS: Each developer has a local copy of the repository, including its full history.
  2. Repositories: A repository is a storage location for the codebase, including all versions of the code and its history.
  3. Commits: A commit is a record of changes made to the code. Each commit includes a unique identifier, a message describing the changes, and metadata such as the author and timestamp.
  4. Branches: Branches allow developers to work on different features, bug fixes, or experiments in parallel without affecting the main codebase. Branches can be merged back into the main branch once the changes are ready.
  5. Tags: Tags are used to mark specific points in the repository's history as significant, such as releases or milestones.
  6. Merging: Merging combines changes from different branches into a single branch, resolving any conflicts that arise from simultaneous modifications.
  7. Conflict Resolution: When changes from different branches conflict, developers must resolve these conflicts to integrate the changes.
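
To make these concepts concrete, a minimal sketch that drives Git from Python (assumes Git 2.28 or newer and a configured user identity; paths, branch names, and messages are illustrative):

```python
import subprocess
from pathlib import Path

REPO = Path("demo-repo")

def git(*args: str) -> None:
    subprocess.run(["git", *args], cwd=REPO, check=True)

# Repository: initialise a repo (Git >= 2.28 for the -b flag) and commit a file
REPO.mkdir(exist_ok=True)
git("init", "-b", "main")
(REPO / "app.py").write_text("print('hello')\n")
git("add", "app.py")
git("commit", "-m", "Initial commit")

# Branch: work on a feature in parallel, then merge it back into main
git("checkout", "-b", "feature/greeting")
(REPO / "app.py").write_text("print('hello, world')\n")
git("commit", "-am", "Expand greeting")
git("checkout", "main")
git("merge", "feature/greeting")

# Tag: mark the merged state as a release milestone
git("tag", "v1.0.0")
```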

Feature Comparison Matrix


Click here to see Git's versioning workflow


Model versioning (Miro Board)

Why "model versioning" is important?

Model versioning refers to the practice of managing different versions of machine learning models to track changes, ensure reproducibility, and manage deployments. It involves maintaining records of model parameters, architecture, training data, and performance metrics for each version of the model. This practice is crucial for model experimentation, collaboration, auditability, and continuous integration/continuous deployment (CI/CD) processes in machine learning workflows.

Feature Comparison Matrix


1. MLflow

Click here to see MLflow's versioning workflow
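
As a rough sketch of the registry-based workflow linked above, the snippet below logs parameters and metrics for a run and registers the trained model, which creates a new model version on each registration (assumes MLflow points at a registry-capable tracking backend; the experiment and model names are illustrative):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumes MLflow is pointed at a registry-capable tracking backend, e.g.:
# mlflow.set_tracking_uri("http://localhost:5000")

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

mlflow.set_experiment("iris-demo")  # illustrative experiment name
with mlflow.start_run():
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Registering under the same name creates version 1, then 2, 3, ... on later runs
    mlflow.sklearn.log_model(model, "model", registered_model_name="iris-classifier")
```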

2. DVC

Click here to see DVC's versioning workflow

3. Weights & Biases

Click here to see W&B's versioning workflow

4. TensorBoard

Click here to see TensorBoard's versioning workflow

5. Neptune.ai

Click here to see Neptune.ai's versioning workflow
