
ODC-EP 008 - New lineage API for postgis driver and ODCv2

Overview

The existing data model and API for lineage/source data is constrained by decisions made a long time ago to meet requirements that no longer exist. The current API is a significant barrier to efficient reimplementation of key operational bottlenecks in the datacube index layer.

Several elements of this EP have been flagged previously in:

This Enhancement Proposal outlines a new data model and API for dataset lineage and a migration path within the context of the ODCv2 road map.

Proposed By

Paul Haesler (@SpacemanPaul)

State

  • Under Discussion
  • In Progress
  • Completed
  • Rejected
  • Deferred

Motivation

Issues with the current implementation of lineage/sources include:

  • A lineage relationship between two datasets can only be recorded in an index if both datasets already exist in that index. This unnecessarily complicates indexing and prevents recording derivation from datasets stored in another index (external lineage).
  • A unique index on (source_dataset, classifier) requires arbitrary multiplication of classifiers (e.g. ard1, ard2, etc. for geomedian) and, in general, a rewrite of the source EO3 document.
  • The "source_field" search API greatly complicates search and is rarely used.
  • Lineage trees are only handled in the API as fully populated trees of Dataset objects, which is not compatible with external lineage.

Summary of proposal:

  • Index drivers can declare whether they support the old lineage API or the new "external lineage" API, with the old API being deprecated, then dropped in v1.9 and v2.0 respectively.
  • Decouple source and destination id columns from dataset table - allow lineage of external datasets to be tracked by id.
  • Ability to optionally associate external ids with a named external index.
  • New index resource API for saving, updating, removing and retrieving lineage trees (dataset ids only).
  • Internal API to convert between lineage trees and a flattened, indexed representation suitable for database representation and enforcing lineage consistency across the database.
  • Updates to Dataset model to support lineage trees.
  • Simple API that works in both the sourcewards and derivedwards directions.
  • Drop support for old lineage-related API and CLI methods/options.
  • Backwards-compatible extensions to Doc2Dataset interface to support the new lineage model.

Proposal

Migration Path

  1. Add new "supports" flag to AbstractIndexDriver: supports_external_lineage. Defaults to False.
  2. supports_external_lineage=False means the index driver is fully compatible with existing APIs, and does not support new API features proposed herein.
  3. supports_external_lineage=True means the index driver supports the new API features proposed herein, and is therefore not fully compatible with legacy API.

Subsequent v1.8.x releases will support both APIs, as per the active driver's supports_external_lineage flag.

In v1.9.x loading an index driver that does not support external lineage will generate a deprecation warning.

In v2.0.x all index drivers must support external lineage, and the legacy API and data model will no longer be supported.
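
For illustration, client code could branch on the proposed flag during the transition period. The sketch below assumes the flag is surfaced on the index object reachable from a Datacube instance; only the flag name comes from this EP, the rest is illustrative.

import datacube

dc = datacube.Datacube(app="lineage-demo")
dsid = "00000000-0000-0000-0000-000000000000"   # illustrative dataset id

if getattr(dc.index, "supports_external_lineage", False):
    # New-style driver: lineage is returned as LineageTree objects.
    ds = dc.index.datasets.get(dsid, include_sources=True)
    source_tree = ds.source_tree
else:
    # Legacy driver: lineage is returned as nested Dataset objects.
    ds = dc.index.datasets.get(dsid, include_sources=True)
    sources = ds.sources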

Database Representation

Lineage relations are recorded in a many-to-many relation table, similar to how lineage is tracked now, except that the source and derived dataset ids are not required to exist in the dataset table.

Other, simpler representations are possible (e.g. just storing LineageTrees as JSON blobs - the LineageTree class is documented below), which would allow much faster reads and writes. That approach might be fine for the in-memory index driver, but I think the postgis driver needs to be able to enforce lineage consistency across a whole index. Users who don't care about lineage consistency probably don't care about lineage at all.

Lineage/Dataset-source table

| Column | Description | Type | Null? | Unique/Indexes and other comments |
|---|---|---|---|---|
| derived_id | Derived Dataset ID | UUID | N | No enforced referential integrity to dataset table; unique with source_id |
| source_id | Source Dataset ID | UUID | N | No enforced referential integrity to dataset table; unique with derived_id |
| classifier | Lineage Type | String | N | No unique indexes |

Foreign Dataset Home table (optional):

| Column | Description | Type | Null? | Unique/Indexes and other comments |
|---|---|---|---|---|
| dataset_id | Dataset ID | UUID | N | No enforced referential integrity to dataset table; Primary Key |
| home | An ODC index | Char/Text | N | |

This table records an optional text value that may be associated with particular datasets referenced by the lineage table, which may be external to the index. The home field is provided to record an identifier for the database/index in which the dataset is known to reside. It is not interpreted by the API, but could be used to hold e.g. an index name from a shared config file, a database connection string, or a URI.

It is not required that an external dataset be registered with a home in this table, and datasets that do exist in the index may also be registered in this table. The value and significance of home is entirely user-defined.
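
For concreteness, the two tables might be declared roughly as follows in SQLAlchemy (which the postgis driver already uses). This is a sketch only: the table names, schema and exact column types are assumptions, not the final implementation.

from sqlalchemy import Column, MetaData, String, Table, Text, UniqueConstraint
from sqlalchemy.dialects.postgresql import UUID

metadata = MetaData()

dataset_lineage = Table(
    "dataset_lineage", metadata,
    # Note: no foreign keys to the dataset table, so external ids are allowed.
    Column("derived_id", UUID(as_uuid=True), nullable=False),
    Column("source_id", UUID(as_uuid=True), nullable=False),
    Column("classifier", String, nullable=False),
    UniqueConstraint("derived_id", "source_id", name="uq_lineage_pair"),
)

dataset_home = Table(
    "dataset_home", metadata,
    # Again no foreign key: the dataset may live in another (external) index.
    Column("dataset_id", UUID(as_uuid=True), primary_key=True),
    Column("home", Text, nullable=False),
)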

Standard API Representation

In the current/legacy API, lineage information is always represented as a nesting of complete datasets under the sources property of the root Dataset object.

In the proposed API, lineage information is represented by a LineageTree:

datacube.model.LineageTree:

from dataclasses import dataclass
from enum import Enum
from uuid import UUID
from typing import Mapping, Optional, Sequence

class LineageDirection(Enum):
    SOURCES = 1   # Tree shows all source datasets of the root node.
    DERIVED = 2   # Tree shows all derived datasets of the root node.

@dataclass
class LineageTree:
    direction: LineageDirection   # Whether this is a node in a source tree or a derived tree
    dataset_id: UUID              # The dataset id associated with this node
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None
                                  # An optional sequence of lineage nodes of the same direction as this node. The keys of the mapping
                                  # are classifier strings.  children=None means that there may be children in the database.  children={}
                                  # means there are no children in the database.
                                  # children represent source datasets or derived datasets depending on the direction.
                                  # child nodes must have the same `direction` as the parent node.
    home: Optional[str] = None    # The home index associated with this node's dataset

LineageTree may be implemented as a NamedTuple or dataclass or a fully fledged class (i.e. TBD).

A LineageTree may represent the sources of the root dataset (and the sources' sources, and the sources' sources' sources, etc.) OR datasets derived from the root dataset (and datasets derived from datasets derived from the root dataset, etc.), but not both at once.
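
A small hand-built example of a source-direction tree, assuming the dataclass form shown above (the ids, classifier and home values are illustrative only):

from uuid import uuid4

# Two "ard" sources feeding a geomedian-style dataset.
ard1 = LineageTree(direction=LineageDirection.SOURCES, dataset_id=uuid4(),
                   children={})                  # known to have no further sources
ard2 = LineageTree(direction=LineageDirection.SOURCES, dataset_id=uuid4(),
                   children=None,                # sources unknown / not loaded
                   home="landsat_ard_index")     # illustrative home value

geomedian_tree = LineageTree(
    direction=LineageDirection.SOURCES,
    dataset_id=uuid4(),
    children={"ard": [ard1, ard2]},              # one classifier, multiple sources
)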

Optional new properties to be added to the Dataset model and its constructor:

source_tree: Optional[LineageTree]=None    # Assumed to be of "source" direction
derived_tree: Optional[LineageTree]=None   # Assumed to be of "derived" direction

v1.8.x: The existing optional sources property will not be populated by an index driver that has supports_external_lineage set, and source_tree and derived_tree will not be populated by an index driver that does not.

v1.9.x: The sources property becomes deprecated.

v2.0.0: The sources property will be removed.

Common internal index driver representation

The datacube.models.lineage.LineageRelations class will be provided to support converting back and forth, and validating consistency, between the flattened dataset relations stored in the index under the proposed database representation above, and the LineageTrees presented to end-users in the public API.

from dataclasses import dataclass
from typing import Mapping, Optional, Tuple
from uuid import UUID

from datacube.model import LineageTree   # as defined above


class InconsistentLineageException(Exception):
    """
    Exception to raise on detecting inconsistent lineage.
    """


@dataclass
class LineageRelation:
    classifier: str
    source_id: UUID
    derived_id: UUID


class LineageRelations:
    """
    A dynamically updatable indexed network of LineageRelations that enforces internal consistency.
    """
    def __init__(self,
                 tree: Optional[LineageTree] = None,
                 max_depth: int = 0,
                 merge_with: Optional["LineageRelations"] = None) -> None:
        """
        Create an empty LineageRelations object.  
        Merge in the data from a LineageTree and/or another LineageRelations object if supplied, raising 
        InconsistentLineageException if either merge would result in inconsistent lineage.
        """

    def merge_new_lineage_relation(self, rel: LineageRelation) -> None:
        """
        Merge a new LineageRelation into this object.  Can raise InconsistentLineageException.
        """

    def merge(self, pool: "LineageRelations") -> None:
        """
        Merge another LineageRelations object into this one. Can raise InconsistentLineageException.
        """

    def merge_tree(self, tree: LineageTree,
                   parent_id: Optional[UUID] = None,
                   max_depth: int = 0) -> None:
        """
        Merge a LineageTree into this object. Can raise InconsistentLineageException on inconsistent lineage,
        or circular dependencies or other potential triggers of infinite recursion.

        Tree is walked to the requested maximum depth.  If max_depth is zero (the default), recurse indefinitely.

        parent_id is to support recursive implementation and is normally left as None.
        """

    def relations_diff(self,
                       existing_relations: Optional["LineageRelations"] = None,
                       allow_updates: bool = False
                       ) -> Tuple[
                           Mapping[Tuple[UUID, UUID], str],
                           Mapping[Tuple[UUID, UUID], str],
                           Mapping[UUID, str],
                           Mapping[UUID, str]
                       ]:
        """
        To support the "replace" flag on Lineage API methods, as described below.

        If the current object contains lineage information from a new LineageTree object passed in to be added to the index,
        and existing_relations represents all potentially overlapping lineage data stored in the index, then this method returns
        a tuple of mappings containing: new relations to be added to the database, existing relations to be updated in the database,
        new homes to be added to the database and existing homes to be updated to the database.

        Raises InconsistentLineageException if allow_updates is False and either update mapping is not empty.

        Tuple[UUID, UUID] keys represent (source_id, derived_id) pairs.
        """

Old/Legacy Lineage API methods

datacube.index.hl.Doc2Dataset

| Constructor argument | Current (v1.8.x) | Proposed (v2.0.x) |
|---|---|---|
| index | The ODC Index that newly constructed Dataset models are intended to be saved into. | No change. |
| products | List of product names (existing in index) to consider for matching. Default None, meaning consider all products in index. | No change. |
| exclude_products | List of product names (existing in index) to exclude from matching. Default None, meaning no explicit exclusions. | No change. |
| fail_on_missing_lineage | Fail if any datasets referenced in lineage do not exist in index. Default False. | Only False supported. |
| verify_lineage | Check that nested lineage documents match versions already in database, and fail if they don't. Default True. Ignored for EO3 documents. | Ignored (as all documents are EO3). |
| skip_lineage | Strip out and ignore all lineage information. Overrides fail_on_missing_lineage and verify_lineage if set. Default False. | No change. |
| eo3 | Pre-process EO3 documents: auto/True/False. Default auto. | All documents are EO3, so False is not supported and auto == True. |
| home_index | Proposed new argument. | Optional string. If provided and the implementation supports the foreign dataset home table, all lineage dataset ids will be recorded as belonging to this home index. |
| source_tree | Proposed new argument. | Optional source-direction LineageTree. If provided and skip_lineage is not set, lineage information is taken from this tree instead of from the dataset document. |
| max_depth | Proposed new argument. | Maximum depth to read source_tree to. Default is 0, meaning recurse indefinitely. |

The result of calling a Doc2Dataset object is a DatasetOrError, which is defined as:

DatasetOrError = Union[
    Tuple[Dataset, None],
    Tuple[None, Union[str, Exception]]
]

Currently, lineage information is packed into the Dataset object as nested Dataset objects in the sources property.

For index drivers supporting the new data model described above, and exclusively in ODCv2, lineage information will instead be packed into the Dataset object as a source-direction LineageTree in the source_tree property, as discussed above (possibly truncated with respect to the source_tree passed in, as per max_depth).
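
A sketch of indexing with the proposed extensions (home_index, source_tree and max_depth are the new arguments proposed above; the document and URI are hypothetical placeholders):

from datacube.index.hl import Doc2Dataset

# eo3_doc: a parsed EO3 dataset document (placeholder, not shown here)
resolver = Doc2Dataset(
    dc.index,
    home_index="ard_production_db",   # record lineage dataset ids as living here
    source_tree=geomedian_tree,       # use this tree instead of the document's lineage
    max_depth=2,                      # keep at most two levels of sources
)
ds, err = resolver(eo3_doc, "s3://bucket/path/to/metadata.yaml")
if err is None:
    dc.index.datasets.add(ds)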

datacube.index.abstract.AbstractDatasetResource

get(id, include_sources=False)

Retrieve a Dataset from the index. If include_sources is True then the full source lineage information is packaged in the returned Dataset.

Current/Legacy behaviour: Source lineage information returned as nested Dataset objects in the sources field of the root Dataset, always fully recursive.

Proposed/v2 behaviour: Source lineage information returned as a LineageTree object in the source_tree field of the root (and only) Dataset.

Add new parameter include_deriveds=False - if True, also return derived lineage information as a LineageTree object in the derived_tree field of the Dataset.

Add new parameter max_depth=0 - limits the depth of source and/or derived lineage tree returned. (0/default = no limit)
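
Proposed usage (the dataset id is illustrative; include_deriveds and max_depth are the new parameters described above):

ds = dc.index.datasets.get(
    "00000000-0000-0000-0000-000000000000",
    include_sources=True,
    include_deriveds=True,   # proposed new parameter
    max_depth=2,             # proposed new parameter
)
print(ds.source_tree)    # source-direction LineageTree (or None)
print(ds.derived_tree)   # derived-direction LineageTree (or None)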

add(ds, with_lineage=True)

Current/Legacy behaviour:

# :param with_lineage:
#           - ``True (default)`` attempt adding lineage datasets if missing
#           - ``False`` record lineage relations, but do not attempt
#             adding lineage datasets to the db

(where lineage data is assumed to be stored in sources field of ds.)

Proposed/v2 behaviour:

  1. Lineage data is assumed to be stored in source_tree and derived_tree fields of ds.

  2. The with_lineage argument is ignored (and is dropped altogether in v2). Always record lineage relations only (both sourcewards and derivedwards, as provided in the Dataset object).

search(..., source_fields=None)

The legacy API allows searching by metadata on source datasets.

Proposed: drop support for the source_fields argument in v2.

get_derived(id)

Currently: Return a list of datasets that are derived from the named dataset id.

Proposed:

  • Raise NotImplementedError (and a deprecation warning in the legacy driver in 1.9).

New Lineage API

Add a new Index Resource, lineage (i.e. dc.index.lineage, like the existing dc.index.products and dc.index.datasets).

Abstract class definition with docstrings of API:


class AbstractLineageResource(ABC):
    """
    Abstract base class for the Lineage portion of an index api.

    All LineageResource implementations should inherit from this base class.

    Note that this is a "new" resource only supported by new index drivers with `supports_external_lineage`
    set to True.  If a driver does NOT support external lineage, it can use LegacyLineageResource below,
    which is a minimal implementation of this resource that raises a NotImplementedError for all methods.
    """

    @abstractmethod
    def get_derived_tree(self, id: DSID, max_depth: int = 0) -> LineageTree:
        """
        Extract a LineageTree from the index, with:
            - "id" at the root of the tree.
            - "derived" direction (i.e. datasets derived from id, datasets derived from
              datasets derived from id, etc.)
            - maximum depth as requested (default 0 = unlimited depth)

        Tree may be empty (i.e. just the root node) if no lineage for id is stored.

        :param id: the id of the dataset at the root of the returned tree
        :param max_depth: Maximum recursion depth.  Default/Zero = unlimited depth
        :return: A derived-direction Lineage tree with id at the root.
        """

    @abstractmethod
    def get_source_tree(self, id: DSID, max_depth: int = 0) -> LineageTree:
        """
        Extract a LineageTree from the index, with:
            - "id" at the root of the tree.
            - "source" direction (i.e. datasets id was derived from, the dataset ids THEY were derived from, etc.)
            - maximum depth as requested (default 0 = unlimited depth)

        Tree may be empty (i.e. just the root node) if no lineage for id is stored.

        :param id: the id of the dataset at the root of the returned tree
        :param max_depth: Maximum recursion depth.  Default/Zero = unlimited depth
        :return: A source-direction Lineage tree with id at the root.
        """

    @abstractmethod
    def add(self, tree: LineageTree, max_depth: int = 0, replace: bool = False) -> None:
        """
        Add or update a LineageTree into the Index.

        If the provided tree is inconsistent with lineage data already
        recorded in the database, a ValueError is raised by default.
        If replace is True, the provided tree is treated as authoritative
        and the database is updated to match.

        Raise Exception on error.

        :param tree: The LineageTree to add to the index
        :param max_depth: Maximum recursion depth. Default/Zero = unlimited depth
        :param replace: If True, update database to match tree exactly.
        """

    @abstractmethod
    def remove(self, id_: DSID, direction: LineageDirection, max_depth: int = 0) -> None:
        """
        Remove lineage information from the Index.

        Raise Exception on error.
        
        :param id_: The Dataset ID to start removing lineage from.
        :param direction: The direction in which to remove lineage (from id_)
        :param max_depth: The maximum depth to which to remove lineage (0/default = no limit)
        """

Bulk read/write methods.

Current bulk read/write methods are flagged as being unstable (i.e. subject to further change).

I propose adding new bulk read/write methods for lineage data. These would operate with flat records (not full LineageTrees) and would be similar in format to the existing (unstable) bulk read/write methods. Details of these new bulk methods, and stabilisation of the existing read/write and cloning API methods, are deferred to a future EP.

CLI changes

Some existing CLI commands/options will have to be updated to reflect the API changes above, and new CLI commands for handling lineage will need to be added. In particular, a CLI command to index lineage information ONLY (i.e. don't index the dataset, just extract and save the lineage info) would be desirable.

The detailed specifications for these are deferred to a future EP.

Notes on Lineage consistency

The above design can easily detect/fix:

  1. a dataset ID with a different home to what is recorded in the index.
  2. a lineage relationship between two dataset ids with a different classifier to what is recorded in the index.

The above design as it stands cannot detect all cases of circular dependency when saving lineage information, although it can detect circular dependencies within a constructed LineageTree, and it can catch cyclic relationships when loading from the index without falling into an infinite loop. It is hard to see how cyclic relationships could be prevented in the general case without imposing a limit on the maximum depth of lineage trees.

Feedback

Edit and add your comments here

Known outstanding issues

The current EO3 metadata standard (i.e. what is recorded in the JSON/STAC or YAML dataset metadata document, as read at indexing time) supports:

  1. Source lineage only, and only one level deep.
  2. Multiple source IDs for a single classifier, though these are currently overwritten and flattened by the ODC at index time (see the sketch below).
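
For example, the lineage section of a current EO3 document, once parsed, amounts to a one-level mapping of classifier to source ids (the ids below are illustrative):

lineage = {
    "ard": [
        "11111111-1111-1111-1111-111111111111",   # multiple source ids are allowed
        "22222222-2222-2222-2222-222222222222",   # for a single classifier
    ],
}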

Although the API enhancements in this EP can proceed with the current EO3 metadata format, this EP may be an appropriate place to consider adding (optional) extensions to the EO3 format to support:

  1. nested lineage; and
  2. derived as well as source lineage

If any extensions to the EO3 format are to be made, they should be made as part of a larger effort to draft a more formal definition of the EO3 format.

Voting

Enhancement Proposal Team

  • Paul Haesler (@SpacemanPaul)

Links
