Paul Haesler edited this page Feb 1, 2023 · 23 revisions

ODC-EP 008 - New lineage API for postgis driver and ODCv2

Overview

The existing data model and API for lineage/source data is constrained by decisions made a long time ago to meet requirements that no longer exist. The current API is a significant barrier to efficient reimplementation of key operational bottlenecks in the datacube index layer.

Several elements of this EP have been flagged previously in:

This Enhancement Proposal outlines a new data model and API for dataset lineage and a migration path within the context of the ODCv2 road map.

Proposed By

Paul Haesler (@SpacemanPaul)

State

  • Under Discussion
  • In Progress
  • Completed
  • Rejected
  • Deferred

Motivation

Issues with the current implementation of lineage/sources include:

  • A lineage relationship between two datasets can only be recorded in an index if both datasets already exist in the index. This unnecessarily complicates indexing and prevents the recording of derivation from datasets stored in another index (external lineage).
  • A unique index on (source_dataset, classifier) requires arbitrary multiplication of classifiers (e.g. ard1, ard2, etc. for geomedian), which in general requires rewriting the source eo3 document.
  • The "source_field" search API greatly complicates the search API and is rarely used.
  • Lineage trees are only handled in the API as fully populated trees of Dataset objects, which is not compatible with external lineage.

Proposal

Migration Path

  1. Add new "supports" flag to AbstractIndexDriver: supports_external_lineage. Defaults to False.
  2. supports_external_lineage=False means the index driver is fully compatible with existing APIs, and does not support new API features proposed herein.
  3. supports_external_lineage=True means the index driver supports the new API features proposed herein, and is therefore not fully compatible with legacy API.

Subsequent v1.8.x releases will support both APIs, as per the active driver's supports_external_lineage flag.

In v1.9.x loading an index driver that does not support external lineage will generate a deprecation warning.

In v2.0.x all index drivers must support external lineage, and the legacy API and data model will no longer be supported.
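
The release-by-release behaviour above can be sketched as a small capability check. This is an illustration only: the two driver classes and the check_lineage_support function are hypothetical stand-ins, not part of the datacube API, and the release strings are simplified labels for the v1.8.x, v1.9.x and v2.0.x series.

```python
import warnings


class LegacyDriver:
    """Hypothetical stand-in for a driver without external lineage support."""
    supports_external_lineage = False


class NewDriver:
    """Hypothetical stand-in for a driver implementing this proposal."""
    supports_external_lineage = True


def check_lineage_support(driver, release: str) -> str:
    """Return "new" or "legacy" depending on the driver flag and release series.

    release is one of "1.8", "1.9" or "2.0" (simplified labels).
    """
    # A missing flag is treated as False, mirroring the proposed default.
    if getattr(driver, "supports_external_lineage", False):
        return "new"
    if release == "1.8":
        return "legacy"
    if release == "1.9":
        warnings.warn(
            "Index driver does not support external lineage",
            DeprecationWarning,
            stacklevel=2,
        )
        return "legacy"
    # From v2.0.x the legacy data model is no longer supported.
    raise RuntimeError("v2.0 requires supports_external_lineage=True")
```

In this sketch a legacy driver keeps working silently under "1.8", warns under "1.9", and is rejected under "2.0".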

Minimal Database Representation

An index driver that supports_external_lineage must implement a database table equivalent to EITHER of the following:

Lineage relations as nodes in a many-to-many network database

  • Similar to how lineage is tracked now, except dataset id is not enforced to exist in the database.
  • Slower to read and write
  • More storage-efficient
  • Better indexed for search
  • Enforces lineage consistency across the index.
  • Fast consistent editing of individual nodes and relationships across whole index.

Lineage/Dataset-source table

| Column | Description | Type | Null? | Unique Indexes and other comments |
| --- | --- | --- | --- | --- |
| derived_id | Derived Dataset ID | UUID | N | no enforced referential integrity to dataset table; unique with source_id |
| source_id | Source Dataset ID | UUID | N | no enforced referential integrity to dataset table; unique with derived_id |
| classifier | Lineage Type | String | N | no unique indexes |

Foreign Dataset Home table (optional):

| Column | Description | Type | Null? | Unique Indexes and other comments |
| --- | --- | --- | --- | --- |
| dataset_id | Dataset ID | UUID | N | no enforced referential integrity to dataset table; Primary Key |
| home | An ODC index | Char/Text | N | |

This table records an optional text value that may be associated with particular datasets referenced by the lineage table that may be external. The home field is provided to record an identifier for the database/index that the dataset is known to reside in. It is not interpreted by the API, but could be used to contain e.g. an index name from a shared config file, a database connection string, or a uri.

If the table is implemented, it is not required that an external database be registered with a home in this table, and datasets that do exist in the index may also be registered in this table. The value and significance of home is entirely user-defined.

index.supports_external_home should be set to True if this table is provided. A driver that does not implement this table SHOULD always treat a passed-in home as None, and return home as None.
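
As a rough illustration of the two tables described above, the following sketch builds them in an in-memory SQLite database. The table names dataset_lineage and dataset_home are assumptions made for this example, and UUIDs are stored as text because SQLite has no UUID type; a real driver would use its own schema and types.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset_lineage (
    derived_id TEXT NOT NULL,   -- no foreign key to a dataset table
    source_id  TEXT NOT NULL,   -- no foreign key to a dataset table
    classifier TEXT NOT NULL,   -- not part of any unique index
    UNIQUE (derived_id, source_id)
);
CREATE TABLE dataset_home (
    dataset_id TEXT NOT NULL PRIMARY KEY,
    home       TEXT NOT NULL    -- opaque, user-defined value
);
""")

derived = str(uuid.uuid4())
src_a, src_b = str(uuid.uuid4()), str(uuid.uuid4())

# Two sources can share one classifier, so "ard1"/"ard2" style names are not needed.
conn.executemany(
    "INSERT INTO dataset_lineage VALUES (?, ?, ?)",
    [(derived, src_a, "ard"), (derived, src_b, "ard")],
)
# None of these ids exist in any dataset table; src_a is marked as living elsewhere.
conn.execute("INSERT INTO dataset_home VALUES (?, ?)", (src_a, "other_index"))

# Sources of the derived dataset, with home where one has been registered.
rows = conn.execute(
    "SELECT l.source_id, l.classifier, h.home "
    "FROM dataset_lineage l "
    "LEFT JOIN dataset_home h ON h.dataset_id = l.source_id "
    "WHERE l.derived_id = ? ORDER BY l.source_id",
    (derived,),
).fetchall()
```

Sources without a row in the home table simply come back with home set to None, which matches the optional nature of the table.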

Lineage relations as JSON blobs

  • Similar to how non-lineage dataset metadata is stored now.
  • Much faster to read and write
  • Less storage-efficient
  • No lineage consistency across the index

home values MUST be stored verbatim.

Standard API Representation

In the current/legacy API, lineage information is always represented as a nesting of complete datasets under the sources property of the root Dataset object.

In the proposed API, lineage information is represented by a LineageTree:

datacube.model.LineageTree:

from dataclasses import dataclass
from enum import Enum
from uuid import UUID
from typing import Mapping, Optional, Sequence

class LineageDirection(Enum):
    SOURCES = 1   # Tree shows all source datasets of the root node.
    DERIVED = 2   # Tree shows all derived datasets of the root node.

@dataclass
class LineageTree:
    direction: LineageDirection   # Whether this is a node in a source tree or a derived tree
    dataset_id: UUID              # The dataset id associated with this node
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None
                                  # An optional mapping from classifier strings to sequences of lineage nodes of the
                                  # same direction as this node.  children=None means that there may be children in
                                  # the database.  children={} means there are no children in the database.
                                  # children represent source datasets or derived datasets depending on the direction.
    home: Optional[str] = None    # The home index associated with this node's dataset

LineageTree may be implemented as a NamedTuple, a dataclass or a fully fledged class (i.e. TBD).

A LineageTree may represent the sources of the root dataset (and the sources' sources, and the sources' sources' sources, etc.) OR datasets derived from the root dataset (and datasets derived from datasets derived from the root dataset, etc.), but not both at once.
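
A short sketch of building a source tree may help make the children semantics concrete. The class definitions repeat the proposed model so the example is self-contained; the depth helper and the dataset roles (geomedian, ard, level1) are illustrative assumptions, not part of the proposal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping, Optional, Sequence
from uuid import UUID, uuid4


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


@dataclass
class LineageTree:
    direction: LineageDirection
    dataset_id: UUID
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None
    home: Optional[str] = None


def depth(tree: LineageTree) -> int:
    """Hypothetical helper: number of populated levels below this node."""
    if not tree.children:  # None (not loaded) or {} (known to be childless)
        return 0
    return 1 + max(
        (depth(child) for nodes in tree.children.values() for child in nodes),
        default=0,
    )


S = LineageDirection.SOURCES
level1 = LineageTree(S, uuid4(), children={})        # known to have no sources
ard_a = LineageTree(S, uuid4(), children={"level1": [level1]})
ard_b = LineageTree(S, uuid4(), home="other_index")  # children=None: not loaded
geomedian = LineageTree(S, uuid4(), children={"ard": [ard_a, ard_b]})
```

Here ard_a and ard_b share the classifier "ard", and ard_b is an external dataset whose own sources are unknown to this index.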

Optional new properties to be added to the Dataset model and its constructor:

source_tree: Optional[LineageTree]=None    # Assumed to be of "source" direction
derived_tree: Optional[LineageTree]=None   # Assumed to be of "derived" direction

v1.8.x: The existing optional sources property will not be populated by an index driver that supports_external_lineage, and source_tree and derived_tree will not be populated by an index driver that does not.

v1.9.x: The sources property becomes deprecated.

v2.0.0: The sources property will be removed.

Old/Legacy Lineage API methods

datacube.index.hl.Doc2Dataset

| Constructor argument | Current (v1.8.x) | Proposed (v2.0.x) |
| --- | --- | --- |
| index | The ODC Index that newly constructed Dataset models are intended to be saved into. | No change. |
| products | List of product names (existing in index) to consider for matching. Default None, meaning consider all products in index. | No change. |
| exclude_products | List of product names (existing in index) to exclude from matching. Default None, meaning no explicit exclusions. | No change. |
| fail_on_missing_lineage | Fail if any datasets referenced in lineage do not exist in index. Default False. | Only False supported. |
| verify_lineage | Check that nested lineage documents match versions already in database, and fail if they don't. Default True. Ignored for EO3 documents. | Ignored (as all documents are EO3). |
| skip_lineage | Strip out and ignore all lineage information. Overrides fail_on_missing_lineage and verify_lineage if set. Default False. | No change. |
| eo3 | Pre-process EO3 documents: auto/True/False. Default auto. | All documents are EO3, so False is not supported and auto==True. |
| home_index | Proposed new argument. | Optional string. If provided and the implementation supports the foreign dataset home table, all lineage dataset ids will be recorded as belonging to this home index. |
| source_tree | Proposed new argument. | Optional source-direction LineageTree. If provided and skip_lineage is not True, lineage information is taken from this tree instead of from the dataset document. |

The result of calling a Doc2Dataset object is a DatasetOrError, which is defined as:

DatasetOrError = Union[
    Tuple[Dataset, None],
    Tuple[None, Union[str, Exception]]
]
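
A caller typically has to unpack this pair before using it. The sketch below shows one way to do so; the Dataset class here is a minimal stand-in for datacube.model.Dataset, and unwrap is a hypothetical helper rather than an existing datacube function.

```python
from typing import Tuple, Union


class Dataset:
    """Minimal stand-in for datacube.model.Dataset (illustration only)."""

    def __init__(self, id_: str):
        self.id = id_


DatasetOrError = Union[
    Tuple[Dataset, None],
    Tuple[None, Union[str, Exception]],
]


def unwrap(result: DatasetOrError) -> Dataset:
    """Return the dataset, or raise if the document could not be resolved."""
    ds, err = result
    if ds is None:
        # The error slot may hold either an exception or a plain message.
        if isinstance(err, Exception):
            raise err
        raise ValueError(str(err))
    return ds
```

Raising on failure is only one possible policy; a bulk indexer might instead log the error and continue with the next document.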

Currently, lineage information is packed into the Dataset object as nested Dataset objects in the sources property.

For index drivers supporting the new data model described above, and exclusively in ODCv2, lineage information will instead be packed into the Dataset object as a source LineageTree in the source_tree property, as discussed above.

datacube.index.abstract.AbstractDatasetResource (and driver specific implementations thereof)

get(id, include_sources=False)

Retrieve a Dataset from the index. If include_sources is True then the full recursive source lineage information is packaged in the returned Dataset.

Current/Legacy behaviour: Source lineage information returned as nested Dataset objects in the sources field of the root Dataset.

Proposed/v2 behaviour: Source lineage information returned as a LineageTree object in the source_tree field of the root (and only) Dataset.

Add new parameter include_deriveds=False - if true, also return derived lineage information as a LineageTree object in the derived_tree field of the Dataset.

add(ds, with_lineage=True)

Current/Legacy behaviour:

# :param with_lineage:
#           - ``True (default)`` attempt adding lineage datasets if missing
#           - ``False`` record lineage relations, but do not attempt
#             adding lineage datasets to the db

(where lineage data is assumed to be stored in sources field of ds.)

Proposed/v2 behaviour:

  1. Lineage data is assumed to be stored in source_tree and derived_tree fields of ds.

  2. with_lineage argument is ignored (and is dropped altogether in v2). Always record lineage relations only (both sourcewards and derivedwards, as provided in the Dataset object).

search(..., source_fields=None)

Legacy API allows searching by metadata on source dataset.

Proposal: v2 drops support for the source_fields argument.

get_derived(id)

Currently: Return a list of datasets that are derived from the named dataset id.

Proposed:

  • Do not implement the old API in a driver that supports_external_lineage (deprecation warning).

New Lineage API

Add a new Index Resource lineage (i.e. dc.index.lineage, like the existing dc.index.products and dc.index.datasets).

Methods:

  • Add a new method get_derived_tree(id, max_depth=0) that returns a derived-direction LineageTree with the named dataset ID at the root. Recurse to at most max_depth levels (0, the default, means no maximum depth: keep building as deep as the recorded lineage goes).
  • Add corresponding new get_source_tree(id, max_depth=0) method that returns a source-direction LineageTree with the named dataset ID at the root.
  • New method add(tree: LineageTree, replace=False) - add lineage information only to the index. Merges with existing lineage, if any. Conflicts between existing lineage and the supplied tree trigger an error, unless replace is True, in which case the conflicting information is removed and the supplied tree is treated as authoritative.
  • New method remove_lineage(id, direction, max_depth=0): remove all source or derived lineage for a dataset id, to the specified recursion depth (0, the default, means keep recursing all the way).
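
The max_depth semantics of the tree-building methods can be sketched with an in-memory stand-in for the index. This is not the proposed implementation: the flat list of (derived_id, source_id, classifier) tuples replaces the database, the function takes that list as an extra argument, and the lineage is assumed to be acyclic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple
from uuid import UUID, uuid4


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


@dataclass
class LineageTree:
    direction: LineageDirection
    dataset_id: UUID
    children: Optional[Dict[str, List["LineageTree"]]] = None
    home: Optional[str] = None


# Flat relation records: (derived_id, source_id, classifier)
Relation = Tuple[UUID, UUID, str]


def get_source_tree(relations: List[Relation], id_: UUID, max_depth: int = 0) -> LineageTree:
    """Build a source-direction tree rooted at id_ (0 means unlimited depth)."""

    def build(node_id: UUID, level: int) -> LineageTree:
        if max_depth and level >= max_depth:
            # Depth limit reached: children=None, i.e. "there may be more".
            return LineageTree(LineageDirection.SOURCES, node_id)
        children: Dict[str, List[LineageTree]] = {}
        for derived, source, classifier in relations:
            if derived == node_id:
                children.setdefault(classifier, []).append(build(source, level + 1))
        # An empty dict here means "known to have no sources".
        return LineageTree(LineageDirection.SOURCES, node_id, children=children)

    return build(id_, 0)


a, b, c = uuid4(), uuid4(), uuid4()
relations = [(a, b, "ard"), (b, c, "level1")]  # a derives from b, b from c
full = get_source_tree(relations, a)
shallow = get_source_tree(relations, a, max_depth=1)
```

With max_depth=1 the direct sources of the root are returned but their own children are left as None, which distinguishes "not loaded" from "no sources recorded".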

Bulk read/write methods.

Current bulk read/write methods are flagged as being unstable (i.e. subject to further change).

I propose adding new bulk read/write methods for lineage data. These would operate with flat records (not full LineageTrees) and would be similar in format to the existing (unstable) bulk read/write methods. Details of these new bulk methods, and stabilisation of the existing read/write and cloning API methods, are deferred to a future EP.

CLI changes

Some existing CLI commands/options will have to be updated to reflect the API changes above, and new CLI commands for handling lineage will need to be added. In particular, a CLI command to index lineage information ONLY (i.e. don't index the dataset, just extract and save the lineage info.) would be desirable.

The detailed specifications for these are deferred to a future EP.

Feedback

Edit and add your comments here

Known outstanding issues

Current EO3 metadata standard (i.e. what is recorded in the json stac or yaml dataset metadata document, as read at indexing time) supports:

  1. Source lineage only, only one level-deep.
  2. DOES support multiple source IDs for a single classifier, but this is currently overwritten and flattened by the ODC at index time.
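
To illustrate how the current one-level EO3 lineage could map onto the proposed model without flattening, here is a sketch that converts a parsed document into a source-direction LineageTree. It assumes the lineage section is a mapping from classifier to a list of source IDs; source_tree_from_eo3 and the example document are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List, Mapping, Optional
from uuid import UUID


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


@dataclass
class LineageTree:
    direction: LineageDirection
    dataset_id: UUID
    children: Optional[Dict[str, List["LineageTree"]]] = None
    home: Optional[str] = None


def source_tree_from_eo3(doc: Mapping[str, Any]) -> LineageTree:
    """Hypothetical helper: build a one-level source tree from a parsed EO3 document."""
    children = {
        classifier: [
            # children stays None: the document says nothing about deeper levels.
            LineageTree(LineageDirection.SOURCES, UUID(str(source_id)))
            for source_id in source_ids
        ]
        for classifier, source_ids in doc.get("lineage", {}).items()
    }
    return LineageTree(LineageDirection.SOURCES, UUID(str(doc["id"])), children=children)


# Made-up document fragment with two sources under one classifier.
doc = {
    "id": "7b0d2f3e-6c1a-4d55-9b1e-0f6a2c9d4e11",
    "lineage": {
        "ard": [
            "1f2e3d4c-5b6a-4788-99aa-bbccddeeff00",
            "0a1b2c3d-4e5f-4601-8234-56789abcdef0",
        ]
    },
}
tree = source_tree_from_eo3(doc)
```

Both source IDs are kept under the single "ard" classifier, which the legacy unique index on (source_dataset, classifier) could not represent.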

Although the API enhancements in this EP can proceed with the current EO3 metadata format, this EP may be an appropriate place to consider adding (optional) extensions to the EO3 format to support:

  1. nested lineage; and
  2. derived as well as source lineage

If any extensions to the EO3 format are to be made, they should be made as part of a larger effort to draft a more formal definition of the EO3 format.

Voting

Enhancement Proposal Team

  • Paul Haesler (@SpacemanPaul)

Links
