ODC EP 008
The existing data model and API for lineage/source data is a barrier to efficient implementation, and has other issues as raised below.
Several elements of this EP have been flagged previously in:
- ODC-EP03 Replace the ODC Index and Internal Database API
- ODC-EP06 Extract Geometry utilities into a Separate Package
- Overhaul of index driver layer
- ODCv2 Road Map
This Enhancement Proposal outlines a specific data model, API and migration path within the context of the ODCv2 road map.
Paul Haesler (@SpacemanPaul)
- Under Discussion
- In Progress
- Completed
- Rejected
- Deferred
Issues with the current implementation of lineage/sources include:
- A lineage relationship between two datasets can only be recorded in an index if both datasets already exist in that index. This unnecessarily complicates indexing and prevents the recording of derivation from datasets stored in another index.
- There is a unique index on (source_dataset, classifier), which requires arbitrary multiplication of classifiers (e.g. ard1, ard2, etc. for geomedian).
- The "source_field" search API greatly complicates the search API and is rarely used.
- Lineage trees are only handled in the API as fully populated trees of Dataset objects, which presents obstacles to addressing any of the above issues.
- Add a new "supports" flag to AbstractIndexDriver: supports_external_lineage. Defaults to False.
- supports_external_lineage=False means the index driver is fully compatible with existing APIs, and does not support the new API features proposed herein.
- supports_external_lineage=True means the index driver supports the new API features proposed herein, and is therefore not fully compatible with the legacy API.
From the release of ODCv2, all index drivers must have supports_external_lineage=True and the legacy API will no longer be available.
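As a rough illustration of how calling code might branch on the flag, the sketch below uses stand-in driver classes. Only the flag name supports_external_lineage and its default of False come from this proposal; the subclass names and the lineage_api helper are hypothetical.

```python
class AbstractIndexDriver:
    """Stand-in for the real base class; only the flag is modelled."""
    supports_external_lineage = False  # default, per the proposal


class LegacyDriver(AbstractIndexDriver):
    """Hypothetical driver that inherits the default (legacy API only)."""


class ExternalLineageDriver(AbstractIndexDriver):
    """Hypothetical driver that opts in to the new lineage API."""
    supports_external_lineage = True


def lineage_api(driver: AbstractIndexDriver) -> str:
    """Hypothetical helper: report which lineage API a driver offers."""
    return "external" if driver.supports_external_lineage else "legacy"


print(lineage_api(LegacyDriver()))
print(lineage_api(ExternalLineageDriver()))
```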
An index driver that supports_external_lineage must implement a database table equivalent to:
Column | Description | Type | Null? | Unique Indexes and other comments |
---|---|---|---|---|
derived_id | Derived Dataset ID | UUID | N | no enforced referential integrity to dataset table; unique with source_id |
source_id | Source Dataset ID | UUID | N | no enforced referential integrity to dataset table; unique with derived_id |
classifier | Lineage Type | String | N | no unique indexes |
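The constraints in this table can be sketched with an in-memory SQLite database. This is only an illustration: the table name dataset_lineage is an assumption, and UUIDs are stored as text because SQLite has no native UUID type.

```python
import sqlite3
from uuid import uuid4

conn = sqlite3.connect(":memory:")
# No foreign keys: neither ID has to exist in a dataset table.
conn.execute(
    """
    CREATE TABLE dataset_lineage (
        derived_id TEXT NOT NULL,
        source_id  TEXT NOT NULL,
        classifier TEXT NOT NULL,
        UNIQUE (derived_id, source_id)
    )
    """
)

geomedian = str(uuid4())
ard_a, ard_b = str(uuid4()), str(uuid4())

# Two sources of the same derived dataset may share the classifier "ard".
conn.executemany(
    "INSERT INTO dataset_lineage VALUES (?, ?, ?)",
    [(geomedian, ard_a, "ard"), (geomedian, ard_b, "ard")],
)

# Recording the same (derived, source) pair twice violates the unique index.
try:
    conn.execute(
        "INSERT INTO dataset_lineage VALUES (?, ?, ?)", (geomedian, ard_a, "ard")
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM dataset_lineage").fetchone()[0]
```

Because classifier carries no unique index, the ard1, ard2 workaround described earlier is not needed under this layout.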
An index driver that supports_external_lineage MAY implement a database table equivalent to:
Column | Description | Type | Null? | Unique Indexes and other comments |
---|---|---|---|---|
dataset_id | Dataset ID | UUID | N | no enforced referential integrity to dataset table; Primary Key |
home | An ODC index | String | N |
This table records the external database index that a particular dataset referenced by the lineage table resides in. If the table is implemented, an external dataset is not required to be registered with a home in this table, and datasets that do exist in the index may also be registered in it.
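A minimal sketch of how the optional home table might be consulted alongside the lineage table, again using in-memory SQLite. The table names dataset_lineage and dataset_home are assumptions, and short placeholder strings stand in for UUIDs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset_lineage (
        derived_id TEXT NOT NULL,
        source_id  TEXT NOT NULL,
        classifier TEXT NOT NULL,
        UNIQUE (derived_id, source_id)
    );
    CREATE TABLE dataset_home (
        dataset_id TEXT NOT NULL PRIMARY KEY,
        home       TEXT NOT NULL
    );
    """
)
conn.execute("INSERT INTO dataset_lineage VALUES ('derived-1', 'source-1', 'ard')")
conn.execute("INSERT INTO dataset_lineage VALUES ('derived-1', 'source-2', 'ard')")
# Only one of the two sources has a registered home.
conn.execute("INSERT INTO dataset_home VALUES ('source-1', 'other_index')")

# LEFT JOIN keeps sources without a registered home; their home is None.
homes = dict(
    conn.execute(
        """
        SELECT l.source_id, h.home
        FROM dataset_lineage AS l
        LEFT JOIN dataset_home AS h ON h.dataset_id = l.source_id
        WHERE l.derived_id = 'derived-1'
        """
    ).fetchall()
)
```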
In the current/legacy API, lineage information is always represented as a nesting of complete datasets under the sources property of the root Dataset object.
In the proposed API, lineage information is represented by a LineageTree (datacube.index.types.LineageTree):

    from enum import Enum
    from typing import Mapping, NamedTuple, Optional, Sequence
    from uuid import UUID


    class LineageDirection(Enum):
        SOURCES = 1  # Tree shows all source datasets of the root node.
        DERIVED = 2  # Tree shows all derived datasets of the root node.


    class LineageTree(NamedTuple):
        direction: LineageDirection  # Whether this is a node in a source tree or a derived tree
        dataset_id: UUID  # The dataset id associated with this node
        home: Optional[str] = None  # The home index associated with this node's dataset
        # An optional mapping from classifier strings to sequences of lineage nodes of the
        # same direction as this node. children=None means that there may be children in the
        # database. children={} means there are no children in the database.
        # children represent source datasets or derived datasets depending on the direction.
        children: Optional[Mapping[str, Sequence["LineageTree"]]] = None

A LineageTree may represent the sources of the root dataset (and the sources' sources, and the sources' sources' sources, etc.) OR datasets derived from the root dataset (and datasets derived from datasets derived from the root dataset, etc.), but not both at once.
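As an illustration of these semantics, the sketch below builds a small source-direction tree and walks it. The type definitions are repeated so the example runs on its own; the count_nodes helper and the dataset names are hypothetical.

```python
from enum import Enum
from typing import Mapping, NamedTuple, Optional, Sequence
from uuid import UUID, uuid4


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


class LineageTree(NamedTuple):
    direction: LineageDirection
    dataset_id: UUID
    home: Optional[str] = None
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None


def count_nodes(tree: LineageTree) -> int:
    """Hypothetical helper: count the nodes reachable from ``tree``."""
    total = 1
    for nodes in (tree.children or {}).values():
        total += sum(count_nodes(node) for node in nodes)
    return total


src = LineageDirection.SOURCES
# children={}: this dataset is known to have no sources in the database.
level1 = LineageTree(src, uuid4(), children={})
ard_a = LineageTree(src, uuid4(), children={"level1": [level1]})
# children=None: sources may exist but were not loaded (dataset lives elsewhere).
ard_b = LineageTree(src, uuid4(), home="other_index")
# Two sources share the classifier "ard".
geomedian = LineageTree(src, uuid4(), children={"ard": [ard_a, ard_b]})

print(count_nodes(geomedian))
```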
Optional new properties to be added to the Dataset model:

    source_tree: Optional[LineageTree]  # Assumed to be of "source" direction
    derived_tree: Optional[LineageTree]  # Assumed to be of "derived" direction

The existing optional sources property will be deprecated and not populated by an index driver that supports_external_lineage.
Constructor argument | Current (v1.8.x) | Proposed (v2.0.x) |
---|---|---|
index | The ODC Index that newly constructed Dataset models are intended to be saved into. | No change. |
products | List of product names (existing in index) to consider for matching. Default None, meaning consider all products in index. | No change. |
exclude_products | List of product names (existing in index) to exclude from matching. Default None, meaning no explicit exclusions. | No change. |
fail_on_missing_lineage | Fail if any datasets referenced in lineage do not exist in index. Default False. | Only False supported. |
verify_lineage | Check that nested lineage documents match versions already in database, and fail if they don't. Default True. Ignored for EO3 documents. | Ignored (as all documents are EO3). |
skip_lineage | Strip out and ignore all lineage information. Overrides fail_on_missing_lineage and verify_lineage if set. Default False. | No change. |
eo3 | Pre-process EO3 documents: auto/True/False. Default auto. | All documents are EO3, so False is not supported and auto==True. |
home_index | Not present (proposed new argument). | Optional string. If provided and the implementation supports the foreign dataset home table, all lineage dataset ids will be recorded as belonging to this home index. |
source_tree | Not present (proposed new argument). | Optional source-direction LineageTree. If provided and skip_lineage is False, lineage information is taken from this tree instead of from the dataset document. |
The result of calling a Doc2Dataset object is a DatasetOrError, which is defined as:

    DatasetOrError = Union[
        Tuple[Dataset, None],
        Tuple[None, Union[str, Exception]]
    ]

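A caller typically unpacks this pair and checks which half is None. The sketch below shows that pattern with stand-ins: the Dataset class is a stub, and resolve is a hypothetical function playing the role of a Doc2Dataset call, not the real implementation.

```python
from typing import Tuple, Union


class Dataset:
    """Stub standing in for datacube.model.Dataset; only an id is modelled."""

    def __init__(self, id_: str) -> None:
        self.id = id_


DatasetOrError = Union[
    Tuple[Dataset, None],
    Tuple[None, Union[str, Exception]],
]


def resolve(doc: dict) -> DatasetOrError:
    """Hypothetical resolver standing in for a Doc2Dataset call."""
    if "id" not in doc:
        return None, "document has no id"
    return Dataset(doc["id"]), None


def describe(result: DatasetOrError) -> str:
    """Unpack the pair: exactly one of the two elements is None."""
    ds, err = result
    if ds is None:
        return f"error: {err}"
    return f"ok: {ds.id}"


print(describe(resolve({"id": "abc"})))
print(describe(resolve({})))
```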
Currently, lineage information is packed into the Dataset object as nested Dataset objects.
For index drivers supporting the new data model described above, and exclusively in ODCv2, lineage information will instead be packed into the Dataset object as a source LineageTree in the source_tree property, as discussed above.
Retrieve a Dataset from the index. If include_sources is True then the full recursive source lineage information is packaged in the returned Dataset.
Current/Legacy behaviour: Source lineage information is returned as nested Dataset objects in the sources field of the root Dataset.
Proposed/v2 behaviour: Source lineage information is returned as a LineageTree object in the source_tree field of the root (and only) Dataset.
Add a new parameter include_deriveds=False. If True, derived lineage information is also returned as a LineageTree object in the derived_tree field of the Dataset.
Current/Legacy behaviour:
# :param with_lineage:
# - ``True (default)`` attempt adding lineage datasets if missing
# - ``False`` record lineage relations, but do not attempt
# adding lineage datasets to the db
(Lineage data is assumed to be stored in the sources field of ds.)
Proposed/v2 behaviour:
- Lineage data is assumed to be stored in the source_tree and derived_tree fields of ds.
- The with_lineage argument is ignored (and is dropped altogether in v2). Always record lineage relations only (both sourcewards and derivedwards).
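Recording "lineage relations only" amounts to reducing both trees to rows of the lineage table. The sketch below shows one way that reduction might look; the flatten helper and the Relation alias are hypothetical, and the type definitions are repeated so the example is self-contained.

```python
from enum import Enum
from typing import Iterator, Mapping, NamedTuple, Optional, Sequence, Tuple
from uuid import UUID


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


class LineageTree(NamedTuple):
    direction: LineageDirection
    dataset_id: UUID
    home: Optional[str] = None
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None


Relation = Tuple[UUID, UUID, str]  # (derived_id, source_id, classifier)


def flatten(tree: LineageTree) -> Iterator[Relation]:
    """Yield one relation per parent-child edge, whatever the tree direction."""
    for classifier, nodes in (tree.children or {}).items():
        for node in nodes:
            if tree.direction is LineageDirection.SOURCES:
                # In a source tree the child is the source of its parent.
                yield (tree.dataset_id, node.dataset_id, classifier)
            else:
                # In a derived tree the child is derived from its parent.
                yield (node.dataset_id, tree.dataset_id, classifier)
            yield from flatten(node)


ds = UUID(int=1)
source_tree = LineageTree(
    LineageDirection.SOURCES,
    ds,
    children={
        "ard": [
            LineageTree(LineageDirection.SOURCES, UUID(int=2)),
            LineageTree(LineageDirection.SOURCES, UUID(int=3)),
        ]
    },
)
derived_tree = LineageTree(
    LineageDirection.DERIVED,
    ds,
    children={"ard": [LineageTree(LineageDirection.DERIVED, UUID(int=4))]},
)

# Both directions collapse into the same (derived, source, classifier) shape.
relations = set(flatten(source_tree)) | set(flatten(derived_tree))
```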
Legacy API allows searching by metadata on source dataset.
Propose v2 drop support for source_fields argument.
Currently: Return a list of datasets (or dataset ids) that are derived from the named dataset id.
Proposed:
- Replace both methods with a new method
get_derived_tree(id)
that returns a derived-directionLineageTree
with the named ID at the root. - Add corresponding new
get_source_tree(id)
method that returns a source-directionLineageTree
with the named ID at the root.
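To make the two proposed methods concrete, the sketch below assembles trees from flat (derived_id, source_id, classifier) rows held in a list. It is not the index implementation: the functions take the rows as an explicit argument, build_tree is a hypothetical helper, and the rows are assumed to be acyclic.

```python
from enum import Enum
from typing import Dict, List, Mapping, NamedTuple, Optional, Sequence, Tuple
from uuid import UUID


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


class LineageTree(NamedTuple):
    direction: LineageDirection
    dataset_id: UUID
    home: Optional[str] = None
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None


Relation = Tuple[UUID, UUID, str]  # (derived_id, source_id, classifier)


def build_tree(
    records: Sequence[Relation], root: UUID, direction: LineageDirection
) -> LineageTree:
    """Recursively attach every row whose parent side matches ``root``."""
    children: Dict[str, List[LineageTree]] = {}
    for derived_id, source_id, classifier in records:
        if direction is LineageDirection.SOURCES:
            parent, child = derived_id, source_id
        else:
            parent, child = source_id, derived_id
        if parent == root:
            subtree = build_tree(records, child, direction)
            children.setdefault(classifier, []).append(subtree)
    # All rows were searched, so an empty mapping means "no children".
    return LineageTree(direction, root, children=children)


def get_source_tree(records: Sequence[Relation], id_: UUID) -> LineageTree:
    return build_tree(records, id_, LineageDirection.SOURCES)


def get_derived_tree(records: Sequence[Relation], id_: UUID) -> LineageTree:
    return build_tree(records, id_, LineageDirection.DERIVED)


a, b, c, d = (UUID(int=i) for i in range(1, 5))
# b is derived from a; c and d are both derived from b.
records = [(b, a, "level1"), (c, b, "ard"), (d, b, "ard")]

sources_of_c = get_source_tree(records, c)
derived_from_a = get_derived_tree(records, a)
```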
Current bulk read/write methods are flagged as being unstable (i.e. subject to further change).
Propose adding new bulk read/write methods for lineage data. These would operate on flat records (not full LineageTrees) and would be similar in format to the existing (unstable) bulk read/write methods.
Stabilisation of bulk read/write and cloning API methods is a subject for another EP.
Current EO3 metadata standard (i.e. what is recorded in the yaml dataset metadata document, as read at indexing time) supports:
- Source lineage only, only one level-deep.
- DOES support multiple source IDs for a single classifier, but this is currently overwritten and flattened at index time.
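The two points above can be sketched by converting the lineage section of a dataset document into a one-level source tree. The document is a cut-down dict rather than a full EO3 file, and source_tree_from_doc is a hypothetical helper, not part of the existing API. Leaf nodes keep children=None because the document says nothing about the sources' own sources.

```python
from enum import Enum
from typing import Mapping, NamedTuple, Optional, Sequence
from uuid import UUID


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


class LineageTree(NamedTuple):
    direction: LineageDirection
    dataset_id: UUID
    home: Optional[str] = None
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None


def source_tree_from_doc(doc: dict) -> LineageTree:
    """Hypothetical helper: one-level source tree from a document's lineage section."""
    sources = LineageDirection.SOURCES
    children = {
        classifier: [LineageTree(sources, UUID(source_id)) for source_id in ids]
        for classifier, ids in doc.get("lineage", {}).items()
    }
    return LineageTree(sources, UUID(doc["id"]), children=children)


# Cut-down document: one classifier with two source IDs.
doc = {
    "id": "00000000-0000-0000-0000-000000000001",
    "lineage": {
        "ard": [
            "00000000-0000-0000-0000-000000000002",
            "00000000-0000-0000-0000-000000000003",
        ]
    },
}

tree = source_tree_from_doc(doc)
```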
Although the API enhancements in this EP can proceed with the current EO3 metadata format, this EP may be an appropriate place to consider adding (optional) extensions to the EO3 format to support:
- nested lineage; and
- derived as well as source lineage
If any extensions to the EO3 format are to be made, they should be made as part of a larger effort to draft a more formal, standardised definition of the EO3 format.
- Paul Haesler (@SpacemanPaul)