Paul Haesler edited this page Jan 17, 2023 · 23 revisions

ODC-EP 008 - New lineage API for postgis driver and ODCv2

Overview

The existing data model and API for lineage/source data is a barrier to efficient implementation, and has other issues as raised below.

Several elements of this EP have been flagged previously in:

This Enhancement Proposal outlines a specific data model, API and migration path within the context of the ODCv2 road map.

Proposed By

Paul Haesler (@SpacemanPaul)

State

  • Under Discussion
  • In Progress
  • Completed
  • Rejected
  • Deferred

Motivation

Issues with the current implementation of lineage/sources include:

  • A lineage relationship between two datasets can only be recorded in an index if both datasets already exist in that index. This unnecessarily complicates indexing, and prevents the recording of derivation from datasets stored in another index.
  • The unique index on (source_dataset, classifier) forces arbitrary multiplication of classifiers (e.g. ard1, ard2, etc. for a geomedian).
  • The "source_field" search API greatly complicates the search API and is rarely used.
  • Lineage trees are only handled in the API as fully populated trees of Dataset objects, which presents obstacles to addressing any of the above issues.

Proposal

Migration Path

  1. Add new "supports" flag to AbstractIndexDriver: supports_external_lineage. Defaults to False.
  2. supports_external_lineage=False means the index driver is fully compatible with existing APIs, and does not support new API features proposed herein.
  3. supports_external_lineage=True means the index driver supports the new API features proposed herein, and is therefore not fully compatible with legacy API.

From the release of ODCv2, all index drivers must have supports_external_lineage=True and the legacy API will no longer be available.
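A minimal sketch of how calling code might branch on the proposed flag during the migration period. The classes below are hypothetical stand-ins, not the real datacube driver classes; only the supports_external_lineage attribute and its default of False come from this proposal.

```python
class AbstractIndexDriver:
    """Hypothetical stand-in for the real base class; only the proposed flag is modelled."""
    supports_external_lineage = False  # proposed default


class LegacyDriver(AbstractIndexDriver):
    """Inherits the default, so it stays on the legacy lineage API."""


class ExternalLineageDriver(AbstractIndexDriver):
    """Opts in to the new lineage API proposed in this EP."""
    supports_external_lineage = True


def lineage_api(driver: AbstractIndexDriver) -> str:
    # Callers that must work with both kinds of driver can branch on the flag.
    if driver.supports_external_lineage:
        return "lineage-tree"
    return "legacy-nested-datasets"


print(lineage_api(LegacyDriver()))
print(lineage_api(ExternalLineageDriver()))
```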

Minimal Database Representation

Lineage/Dataset-source table

An index driver that supports_external_lineage must implement a database table equivalent to:

| Column | Description | Type | Null? | Unique Indexes and other comments |
| --- | --- | --- | --- | --- |
| derived_id | Derived Dataset ID | UUID | N | no enforced referential integrity to dataset table; unique with source_id |
| source_id | Source Dataset ID | UUID | N | no enforced referential integrity to dataset table; unique with derived_id |
| classifier | Lineage Type | String | N | no unique indexes |
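The table above could be sketched as follows, using Python's built-in sqlite3 module purely for illustration. The table name dataset_lineage, the storage of UUIDs as text, and the reading of "unique with" as a single composite constraint on (derived_id, source_id) are assumptions, not part of the proposal.

```python
import sqlite3
from uuid import uuid4

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dataset_lineage (      -- hypothetical table name
        derived_id TEXT NOT NULL,       -- UUID as text; no foreign key to the dataset table
        source_id  TEXT NOT NULL,       -- UUID as text; no foreign key to the dataset table
        classifier TEXT NOT NULL,       -- lineage type; not part of any unique index
        UNIQUE (derived_id, source_id)
    )
    """
)

geomedian = str(uuid4())
ard_a, ard_b = str(uuid4()), str(uuid4())

# Two sources of one derived dataset can share the classifier "ard",
# and neither source needs to exist in any dataset table.
conn.executemany(
    "INSERT INTO dataset_lineage VALUES (?, ?, ?)",
    [(geomedian, ard_a, "ard"), (geomedian, ard_b, "ard")],
)

# Recording the same (derived, source) pair twice is rejected.
try:
    conn.execute(
        "INSERT INTO dataset_lineage VALUES (?, ?, ?)", (geomedian, ard_a, "other")
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM dataset_lineage").fetchone()[0]
```

Because the classifier is outside the unique constraint, the ard1, ard2 workaround described under Motivation is no longer needed in this sketch.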

Foreign Dataset Home table (optional):

An index driver that supports_external_lineage MAY implement a database table equivalent to:

| Column | Description | Type | Null? | Unique Indexes and other comments |
| --- | --- | --- | --- | --- |
| dataset_id | Dataset ID | UUID | N | no enforced referential integrity to dataset table; Primary Key |
| home | An ODC index | String | N | |

This table records the external index in which a particular dataset referenced by the lineage table resides. If the table is implemented, an external dataset is not required to be registered with a home in this table, and datasets that do exist in the local index may also be registered in it.
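As an illustration of the optional nature of the home table, the sketch below joins it to the lineage table with a LEFT JOIN, so sources without a registered home still appear. It uses sqlite3 for convenience; the table names and the short string ids (standing in for UUIDs) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset_lineage (
        derived_id TEXT NOT NULL,
        source_id  TEXT NOT NULL,
        classifier TEXT NOT NULL,
        UNIQUE (derived_id, source_id)
    );
    CREATE TABLE dataset_home (
        dataset_id TEXT PRIMARY KEY,   -- no foreign key to the dataset table
        home       TEXT NOT NULL       -- label of an ODC index
    );
    """
)

conn.executemany(
    "INSERT INTO dataset_lineage VALUES (?, ?, ?)",
    [("geomedian-1", "ard-1", "ard"), ("geomedian-1", "ard-2", "ard")],
)
# Only one of the two sources has a registered home.
conn.execute("INSERT INTO dataset_home VALUES (?, ?)", ("ard-1", "partner_index"))

rows = conn.execute(
    """
    SELECT l.source_id, h.home
    FROM dataset_lineage AS l
    LEFT JOIN dataset_home AS h ON h.dataset_id = l.source_id
    WHERE l.derived_id = ?
    ORDER BY l.source_id
    """,
    ("geomedian-1",),
).fetchall()
```

A source with no row in the home table comes back with home set to None, which lines up with the Optional home field used elsewhere in this proposal.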

Standard API Representation

In the current/legacy API, lineage information is always represented as a nesting of complete datasets under the sources property of the root Dataset object.

In the proposed API, lineage information is represented by a LineageTree:

datacube.index.types.LineageTree:

from enum import Enum
from uuid import UUID
from typing import Mapping, NamedTuple, Optional, Sequence

class LineageDirection(Enum):
    SOURCES = 1   # Tree shows all source datasets of the root node.
    DERIVED = 2   # Tree shows all derived datasets of the root node.

class LineageTree(NamedTuple):
    direction: LineageDirection   # Whether this is a node in a source tree or a derived tree
    dataset_id: UUID              # The dataset id associated with this node
    home: Optional[str] = None    # The home index associated with this node's dataset
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None
                                  # An optional mapping from classifier strings to sequences of lineage nodes of the
                                  # same direction as this node.  children=None means that there may be children in
                                  # the database.  children={} means there are no children in the database.
                                  # children represent source datasets or derived datasets depending on the direction.

A LineageTree may represent the sources of the root dataset (and the sources' sources, and the sources' sources' sources, etc.) OR datasets derived from the root dataset (and datasets derived from datasets derived from the root dataset, etc.), but not both at once.
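The following sketch builds a small source-direction tree and walks it depth-first. The type definitions repeat the ones proposed above so the example is self-contained; the walk helper and the dataset ids are hypothetical and exist only for illustration.

```python
from enum import Enum
from typing import Iterator, Mapping, NamedTuple, Optional, Sequence, Tuple
from uuid import UUID, uuid4


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


class LineageTree(NamedTuple):
    direction: LineageDirection
    dataset_id: UUID
    home: Optional[str] = None
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None


def walk(tree: LineageTree, depth: int = 0) -> Iterator[Tuple[int, UUID]]:
    """Hypothetical helper: yield (depth, dataset_id) for every node, depth-first."""
    yield depth, tree.dataset_id
    # children=None (unknown) and children={} (none) both yield nothing further.
    for nodes in (tree.children or {}).values():
        for node in nodes:
            yield from walk(node, depth + 1)


scene_a, scene_b, ard_a, ard_b, geomedian = (uuid4() for _ in range(5))
S = LineageDirection.SOURCES

tree = LineageTree(S, geomedian, children={
    # Both ARD sources share one classifier; no "ard1"/"ard2" workaround is needed.
    "ard": [
        LineageTree(S, ard_a, children={"level1": [LineageTree(S, scene_a, children={})]}),
        # scene_b has children=None: its own sources may exist but were not loaded.
        LineageTree(S, ard_b, children={"level1": [LineageTree(S, scene_b)]}),
    ],
})

visited = list(walk(tree))
```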

Optional new properties to be added to the Dataset model:

source_tree: Optional[LineageTree]    # Assumed to be of "source" direction
derived_tree: Optional[LineageTree]   # Assumed to be of "derived" direction

The existing optional sources property will be deprecated, and will not be populated by an index driver that supports_external_lineage.

Old/Legacy Lineage API methods

datacube.index.hl.Doc2Dataset

| Constructor argument | Current (v1.8.x) | Proposed (v2.0.x) |
| --- | --- | --- |
| index | The ODC Index that newly constructed Dataset models are intended to be saved into. | No change. |
| products | List of product names (existing in index) to consider for matching. Default None, meaning consider all products in index. | No change. |
| exclude_products | List of product names (existing in index) to exclude from matching. Default None, meaning no explicit exclusions. | No change. |
| fail_on_missing_lineage | Fail if any datasets referenced in lineage do not exist in index. Default False. | Only False supported. |
| verify_lineage | Check that nested lineage documents match versions already in database, and fail if they don't. Default True. Ignored for EO3 documents. | Ignored (as all documents are EO3). |
| skip_lineage | Strip out and ignore all lineage information. Overrides fail_on_missing_lineage and verify_lineage if set. Default False. | No change. |
| eo3 | Pre-process EO3 documents: auto/True/False. Default auto. | All documents are EO3, so False is not supported and auto==True. |
| home_index | (proposed new argument) | Optional string. If provided and the implementation supports the foreign dataset home table, all lineage dataset ids will be recorded as belonging to this home index. |
| source_tree | (proposed new argument) | Optional source-direction LineageTree. If provided and skip_lineage is not True, lineage information is taken from this tree instead of from the dataset document. |

The result of calling a Doc2Dataset object is a DatasetOrError, which is defined as:

DatasetOrError = Union[
    Tuple[Dataset, None],
    Tuple[None, Union[str, Exception]]
]
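A caller typically unpacks this two-element result and checks which half is populated. The sketch below assumes a stand-in Dataset class and a hypothetical unwrap helper; neither is part of the datacube API.

```python
from typing import Tuple, Union


class Dataset:
    """Hypothetical stand-in for the real Dataset model."""

    def __init__(self, id_: str) -> None:
        self.id = id_


DatasetOrError = Union[
    Tuple[Dataset, None],
    Tuple[None, Union[str, Exception]],
]


def unwrap(result: DatasetOrError) -> Dataset:
    """Hypothetical helper: return the dataset, or raise if resolution failed."""
    dataset, error = result
    if dataset is None:
        # The error half may be a message string or an exception instance.
        raise ValueError(f"could not resolve dataset: {error}")
    return dataset


ok = unwrap((Dataset("abc"), None))

try:
    unwrap((None, "no matching product"))
    failed = False
except ValueError as exc:
    failed = True
    message = str(exc)
```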

Currently, lineage information is packed into the Dataset object as nested Dataset objects.

For index drivers supporting the new data model described above, and exclusively in ODCv2, lineage information will instead be packed into the Dataset object as a source LineageTree in the source_tree property, as discussed above.

datacube.index.abstract.AbstractDatasetResource (and driver specific implementations thereof)

get(id, include_sources=False)

Retrieve a Dataset from the index. If include_sources is True then the full recursive source lineage information is packaged in the returned Dataset.

Current/Legacy behaviour: Source lineage information returned as nested Dataset objects in the sources field of the root Dataset.

Proposed/v2 behaviour: Source lineage information returned as a LineageTree object in the source_tree field of the root (and only) Dataset.

Add new parameter include_deriveds=False. If True, also return derived lineage information as a LineageTree object in the derived_tree field of the Dataset.

add(ds, with_lineage=True)

Current/Legacy behaviour:

# :param with_lineage:
#           - ``True (default)`` attempt adding lineage datasets if missing
#           - ``False`` record lineage relations, but do not attempt
#             adding lineage datasets to the db

(where lineage data is assumed to be stored in sources field of ds.)

Proposed/v2 behaviour:

  1. Lineage data is assumed to be stored in source_tree and derived_tree fields of ds.

  2. The with_lineage argument is ignored (and is dropped altogether in v2). Always record lineage relations only (both sourcewards and derivedwards).

search(..., source_fields=None)

Legacy API allows searching by metadata on source dataset.

Proposed: v2 drops support for the source_fields argument.

get_derived(id) (and get_derived_ids(id))

Currently: Return a list of datasets (or dataset ids) that are derived from the named dataset id.

Proposed:

  • Replace both methods with a new method get_derived_tree(id) that returns a derived-direction LineageTree with the named ID at the root.
  • Add corresponding new get_source_tree(id) method that returns a source-direction LineageTree with the named ID at the root.
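One way an index driver might assemble either kind of tree from rows of the lineage table is sketched below. The build_tree helper is hypothetical and assumes that all relevant rows are already in memory and contain no cycles; a real driver would more likely query the database level by level.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, Iterable, List, Mapping, NamedTuple, Optional, Sequence, Tuple
from uuid import UUID


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


class LineageTree(NamedTuple):
    direction: LineageDirection
    dataset_id: UUID
    home: Optional[str] = None
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None


# One row of the lineage table: (derived_id, source_id, classifier).
LineageRow = Tuple[UUID, UUID, str]


def build_tree(root: UUID, direction: LineageDirection,
               rows: Iterable[LineageRow]) -> LineageTree:
    """Hypothetical helper: build a fully populated tree from acyclic rows."""
    by_parent: Dict[UUID, List[Tuple[str, UUID]]] = defaultdict(list)
    for derived_id, source_id, classifier in rows:
        if direction is LineageDirection.SOURCES:
            by_parent[derived_id].append((classifier, source_id))
        else:
            by_parent[source_id].append((classifier, derived_id))

    def node(dataset_id: UUID) -> LineageTree:
        children: Dict[str, List[LineageTree]] = defaultdict(list)
        for classifier, child_id in by_parent.get(dataset_id, []):
            children[classifier].append(node(child_id))
        # Every row is available here, so a leaf gets children={} (known to have none).
        return LineageTree(direction, dataset_id, children=dict(children))

    return node(root)


scene, ard, geomedian = UUID(int=1), UUID(int=2), UUID(int=3)
rows = [(ard, scene, "level1"), (geomedian, ard, "ard")]

source_tree = build_tree(geomedian, LineageDirection.SOURCES, rows)
derived_tree = build_tree(scene, LineageDirection.DERIVED, rows)
```

In this sketch the classifier stored on a row is reused as the mapping key in both directions; whether a derived-direction tree should be keyed that way is left open by the proposal.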

Bulk read/write methods.

Current bulk read/write methods are flagged as being unstable (i.e. subject to further change).

Propose adding new bulk read/write methods for lineage data. These would operate on flat records (not full LineageTrees) and would be similar in format to the existing (unstable) bulk read/write methods.
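To illustrate the relationship between a LineageTree and flat records, the sketch below flattens a tree into (derived_id, source_id, classifier) tuples mirroring the lineage table columns. The record layout and the flatten helper are assumptions; the proposal does not fix the format of the bulk methods.

```python
from enum import Enum
from typing import List, Mapping, NamedTuple, Optional, Sequence, Tuple
from uuid import UUID


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


class LineageTree(NamedTuple):
    direction: LineageDirection
    dataset_id: UUID
    home: Optional[str] = None
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None


# Assumed flat record layout: (derived_id, source_id, classifier).
LineageRow = Tuple[UUID, UUID, str]


def flatten(tree: LineageTree) -> List[LineageRow]:
    """Hypothetical helper: list every parent/child relation in the tree."""
    rows: List[LineageRow] = []
    for classifier, nodes in (tree.children or {}).items():
        for child in nodes:
            if tree.direction is LineageDirection.SOURCES:
                # In a source tree the parent node is the derived dataset.
                rows.append((tree.dataset_id, child.dataset_id, classifier))
            else:
                # In a derived tree the parent node is the source dataset.
                rows.append((child.dataset_id, tree.dataset_id, classifier))
            rows.extend(flatten(child))
    return rows


scene, ard, geomedian = UUID(int=1), UUID(int=2), UUID(int=3)
S = LineageDirection.SOURCES
tree = LineageTree(S, geomedian, children={
    "ard": [LineageTree(S, ard, children={"level1": [LineageTree(S, scene)]})],
})

flat_rows = flatten(tree)
```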

Stabilisation of bulk read/write and cloning API methods is a subject for another EP.

Feedback

Known outstanding issues

Current EO3 metadata standard (i.e. what is recorded in the yaml dataset metadata document, as read at indexing time) supports:

  1. Source lineage only, and only one level deep.
  2. DOES support multiple source IDs for a single classifier, but this is currently overwritten and flattened at index time.
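The two points above can be sketched as a conversion from a dataset document to a one-level source tree. This assumes the document carries a lineage mapping from classifier to a list of source ids; the source_tree_from_eo3 helper and the sample document are hypothetical.

```python
from enum import Enum
from typing import Any, Mapping, NamedTuple, Optional, Sequence
from uuid import UUID


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


class LineageTree(NamedTuple):
    direction: LineageDirection
    dataset_id: UUID
    home: Optional[str] = None
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None


def source_tree_from_eo3(doc: Mapping[str, Any],
                         home: Optional[str] = None) -> LineageTree:
    """Hypothetical helper: build a one-level source tree from a dataset document."""
    S = LineageDirection.SOURCES
    children = {
        # Every id under a classifier is kept, rather than flattened to one.
        classifier: [LineageTree(S, UUID(str(source_id)), home=home)
                     for source_id in source_ids]
        for classifier, source_ids in doc.get("lineage", {}).items()
    }
    # Source nodes keep children=None: the document says nothing about their own sources.
    return LineageTree(S, UUID(str(doc["id"])), children=children)


doc = {
    "id": "00000000-0000-0000-0000-000000000003",
    "lineage": {
        "ard": [
            "00000000-0000-0000-0000-000000000001",
            "00000000-0000-0000-0000-000000000002",
        ],
    },
}

tree = source_tree_from_eo3(doc, home="partner_index")
```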

Although the API enhancements in this EP can proceed with the current EO3 metadata format, this EP may be an appropriate place to consider adding (optional) extensions to the EO3 format to support:

  1. nested lineage; and
  2. derived as well as source lineage

If any extensions to the EO3 format are to be made, they should be made as part of a larger effort to draft a more formal, standardised definition of the EO3 format.

Voting

Enhancement Proposal Team

  • Paul Haesler (@SpacemanPaul)

Links
