ODC EP 008
The existing data model and API for lineage/source data is constrained by decisions made a long time ago to meet requirements that no longer exist. The current API is a significant barrier to efficient reimplementation of key operational bottlenecks in the datacube index layer.
Several elements of this EP have been flagged previously in:
- ODC-EP03 Replace the ODC Index and Internal Database API
- ODC-EP06 Extract Geometry utilities into a Separate Package
- Overhaul of index driver layer
- ODCv2 Road Map
This Enhancement Proposal outlines a new data model and API for dataset lineage and a migration path within the context of the ODCv2 road map.
Paul Haesler (@SpacemanPaul)
- Under Discussion
- In Progress
- Completed
- Rejected
- Deferred
Issues with the current implementation of lineage/sources include:
- A lineage relationship between two datasets can only be recorded in an index if both datasets already exist in the index. This unnecessarily complicates indexing, and prevents the recording of derivation from datasets stored in another index (external lineage).
- A unique index on (source_dataset, classifier) requires arbitrary multiplication of classifiers (e.g. ard1, ard2, etc. for geomedian), which in general requires rewriting the source eo3 document.
- The "source_field" search API greatly complicates the search API and is rarely used.
- Lineage trees are only handled in the API as fully populated trees of Dataset objects, which is not compatible with external lineage.
- Add new "supports" flag to `AbstractIndexDriver`: `supports_external_lineage`. Defaults to False.
- `supports_external_lineage=False` means the index driver is fully compatible with existing APIs, and does not support new API features proposed herein.
- `supports_external_lineage=True` means the index driver supports the new API features proposed herein, and is therefore not fully compatible with legacy API.
Subsequent v1.8.x releases will support both APIs, as per the active driver's `supports_external_lineage` flag.
In v1.9.x loading an index driver that does not support external lineage will generate a deprecation warning.
In v2.0.x all index drivers must support external lineage, and the legacy API and data model will no longer be supported.
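The flag and the version schedule above can be sketched as follows. This is a minimal illustration only: the `AbstractIndexDriver` class here is a stand-in for the real base class, and `select_lineage_api`, the two driver subclasses, and the returned strings are hypothetical names invented for the example.

```python
import warnings


class AbstractIndexDriver:
    """Stand-in for the real base class; only the proposed flag is modelled."""
    supports_external_lineage: bool = False  # proposed default


class LegacyDriver(AbstractIndexDriver):
    """Hypothetical driver that keeps the legacy lineage API."""


class ExternalLineageDriver(AbstractIndexDriver):
    """Hypothetical driver that implements the new lineage API."""
    supports_external_lineage = True


def select_lineage_api(driver: AbstractIndexDriver, version: tuple) -> str:
    """Pick the lineage API for a driver, following the proposed schedule.

    `version` is a (major, minor) tuple, e.g. (1, 9). Hypothetical helper.
    """
    if driver.supports_external_lineage:
        return "lineage_tree"
    if version >= (2, 0):
        # v2.0.x: all drivers must support external lineage.
        raise RuntimeError("index driver must support external lineage in v2")
    if version >= (1, 9):
        # v1.9.x: legacy drivers still load, with a deprecation warning.
        warnings.warn(
            "index drivers without external lineage support are deprecated",
            DeprecationWarning,
            stacklevel=2,
        )
    return "legacy_sources"
```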
An index driver that `supports_external_lineage` must implement a database table equivalent to EITHER of the following:
- Similar to how lineage is tracked now, except dataset id is not enforced to exist in the database.
- Slower to read and write
- More storage-efficient
- Better indexed for search
- Enforces lineage consistency across the index.
- Fast consistent editing of individual nodes and relationships across whole index.
Column | Description | Type | Null? | Unique Indexes and other comments |
---|---|---|---|---|
derived_id | Derived Dataset ID | UUID | N | no enforced referential integrity to dataset table; unique with source_id |
source_id | Source Dataset ID | UUID | N | no enforced referential integrity to dataset table; unique with derived_id |
classifier | Lineage Type | String | N | no unique indexes |
Column | Description | Type | Null? | Unique Indexes and other comments |
---|---|---|---|---|
dataset_id | Dataset ID | UUID | N | no enforced referential integrity to dataset table; Primary Key |
home | An ODC index | Char/Text | N | |
This table records an optional text value that may be associated with particular datasets referenced by the lineage table that may be external. The home field is provided to record an identifier for the database/index that the dataset is known to reside in. It is not interpreted by the API, but could be used to contain e.g. an index name from a shared config file, a database connection string, or a uri.
If the table is implemented, it is not required that an external database be registered with a home in this table, and datasets that do exist in the index may also be registered in this table. The value and significance of home is entirely user-defined.
`index.supports_external_home` should be set to True if this table is provided. A driver that does not implement this table SHOULD always treat a passed-in home as None, and return home as None.
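As an illustration of the two tables above, here is a minimal sketch using Python's built-in `sqlite3` module. The table names `dataset_lineage` and `dataset_home` are assumptions made for the example, and UUIDs are stored as text because SQLite has no native UUID type; a real driver would use its own backend and naming.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Lineage relations: no foreign keys to a dataset table, so either end
    -- of a relation may refer to a dataset held in another index.
    CREATE TABLE dataset_lineage (
        derived_id TEXT NOT NULL,
        source_id  TEXT NOT NULL,
        classifier TEXT NOT NULL,
        UNIQUE (derived_id, source_id)
    );
    -- Optional home table: free-text home per dataset id, stored verbatim.
    CREATE TABLE dataset_home (
        dataset_id TEXT NOT NULL PRIMARY KEY,
        home       TEXT NOT NULL
    );
    """
)

derived, src_a, src_b = (str(uuid.uuid4()) for _ in range(3))

# Two sources may share one classifier; no "ard1", "ard2" workaround needed.
conn.executemany(
    "INSERT INTO dataset_lineage VALUES (?, ?, ?)",
    [(derived, src_a, "ard"), (derived, src_b, "ard")],
)
# Record that one source is known to live in another (hypothetical) index.
conn.execute("INSERT INTO dataset_home VALUES (?, ?)", (src_a, "other_index"))

rows = conn.execute(
    "SELECT source_id, classifier FROM dataset_lineage WHERE derived_id = ?",
    (derived,),
).fetchall()
```

Neither dataset id needs to exist anywhere else in the database for these inserts to succeed, which is the point of dropping referential integrity.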
- Similar to how non-lineage dataset metadata is stored now.
- Much faster to read and write
- Less storage-efficient
- No lineage consistency across the index
`home` values MUST be stored verbatim.
In the current/legacy API, lineage information is always represented as a nesting of complete datasets under the `sources` property of the root Dataset object.
In the proposed API, lineage information is represented by a LineageTree (`datacube.model.LineageTree`):

    from dataclasses import dataclass
    from enum import Enum
    from uuid import UUID
    from typing import Mapping, Optional, Sequence

    class LineageDirection(Enum):
        SOURCES = 1  # Tree shows all source datasets of the root node.
        DERIVED = 2  # Tree shows all derived datasets of the root node.

    @dataclass
    class LineageTree:
        direction: LineageDirection  # Whether this is a node in a source tree or a derived tree
        dataset_id: UUID  # The dataset id associated with this node
        children: Optional[Mapping[str, Sequence["LineageTree"]]] = None
        # An optional mapping of classifier strings to sequences of lineage nodes
        # of the same direction as this node.
        # children=None means that there may be children in the database.
        # children={} means there are no children in the database.
        # children represent source datasets or derived datasets depending on the direction.
        home: Optional[str] = None  # The home index associated with this node's dataset
LineageTree may be implemented as a NamedTuple, a dataclass, or a fully fledged class (i.e. TBD).
A LineageTree may represent the sources of the root dataset (and the sources' sources, and the sources' sources' sources, etc.) OR datasets derived from the root dataset (and datasets derived from datasets derived from the root dataset, etc.), but not both at once.
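To make the structure concrete, the sketch below builds a small source-direction tree and walks it. The dataclass is repeated so that the example is self-contained; the `walk` helper and the classifier names `ard` and `level1` are illustrative assumptions, not part of the proposal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterator, Mapping, Optional, Sequence, Tuple
from uuid import UUID, uuid4


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


@dataclass
class LineageTree:
    direction: LineageDirection
    dataset_id: UUID
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None
    home: Optional[str] = None


def walk(tree: LineageTree, depth: int = 0) -> Iterator[Tuple[int, UUID]]:
    """Yield (depth, dataset_id) for every node, root first (hypothetical helper)."""
    yield depth, tree.dataset_id
    for nodes in (tree.children or {}).values():
        for node in nodes:
            yield from walk(node, depth + 1)


SOURCES = LineageDirection.SOURCES
geomedian, ard_a, ard_b, level1 = uuid4(), uuid4(), uuid4(), uuid4()

tree = LineageTree(
    SOURCES,
    geomedian,
    children={
        # Two sources under one classifier.
        "ard": [
            LineageTree(
                SOURCES,
                ard_a,
                # children={} : this node is known to have no sources.
                children={"level1": [LineageTree(SOURCES, level1, children={})]},
            ),
            # children=None (the default): sources unknown, possibly external.
            LineageTree(SOURCES, ard_b, home="other_index"),
        ]
    },
)

depths = [depth for depth, _ in walk(tree)]
```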
Optional new properties to be added to the `Dataset` model and its constructor:
source_tree: Optional[LineageTree]=None # Assumed to be of "source" direction
derived_tree: Optional[LineageTree]=None # Assumed to be of "derived" direction
v1.8.x: The existing optional `sources` property will not be populated by an index driver that `supports_external_lineage`, and `source_tree` and `derived_tree` will not be populated by an index driver that does not.
v1.9.x: The sources property becomes deprecated.
v2.0.0: The sources property will be removed.
Constructor argument | Current (v1.8.x) | Proposed (v2.0.x) |
---|---|---|
index | The ODC Index that newly constructed Dataset models are intended to be saved into. | No change. |
products | List of product names (existing in index) to consider for matching. Default None, meaning consider all products in index. | No change. |
exclude_products | List of product names (existing in index) to exclude from matching. Default None, meaning no explicit exclusions. | No change. |
fail_on_missing_lineage | Fail if any datasets referenced in lineage do not exist in index. Default False. | Only False supported. |
verify_lineage | Check that nested lineage documents match versions already in database, and fail if they don't. Default True. Ignored for EO3 documents. | Ignored (as all documents are EO3). |
skip_lineage | Strip out and ignore all lineage information. Overrides fail_on_missing_lineage and verify_lineage if set. Default False. | No change. |
eo3 | Pre-process EO3 documents: auto/True/False. Default auto. | All documents are EO3, so False is not supported and auto==True. |
home_index | Proposed new argument. | Optional string. If provided and the implementation supports the foreign dataset home table, all lineage dataset ids will be recorded as belonging to this home index. |
source_tree | Proposed new argument. | Optional source-direction LineageTree. If provided and skip_lineage is not True, lineage information is taken from this tree instead of from the dataset document. |
The result of calling a Doc2Dataset object is a `DatasetOrError`, which is defined as:

    DatasetOrError = Union[
        Tuple[Dataset, None],
        Tuple[None, Union[str, Exception]]
    ]
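A caller typically unpacks this pair and checks which half is populated. The sketch below shows one way to do that; the `Dataset` class is a bare stand-in and `unwrap` is a hypothetical helper, not part of the ODC API.

```python
from typing import Tuple, Union


class Dataset:
    """Bare stand-in for datacube.model.Dataset, for illustration only."""

    def __init__(self, id: str) -> None:
        self.id = id


DatasetOrError = Union[
    Tuple[Dataset, None],
    Tuple[None, Union[str, Exception]],
]


def unwrap(result: DatasetOrError) -> Dataset:
    """Return the dataset, or raise if the result carries an error."""
    dataset, error = result
    if dataset is not None:
        return dataset
    if isinstance(error, Exception):
        raise error
    # The error half may also be a plain message string.
    raise ValueError(str(error))
```

Raising on the error half suits scripts that should stop at the first bad document; a bulk indexer would more likely log the error and continue.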
Currently, lineage information is packed into the Dataset object as nested Dataset objects in the `sources` property.
For index drivers supporting the new data model described above, and exclusively in ODCv2, lineage information will instead be packed into the Dataset object as a source `LineageTree` in the `source_tree` property, as discussed above.
Retrieve a Dataset from the index. If `include_sources` is True then the full recursive source lineage information is packaged in the returned Dataset.
Current/Legacy behaviour: Source lineage information returned as nested Dataset objects in the `sources` field of the root Dataset.
Proposed/v2 behaviour: Source lineage information returned as a `LineageTree` object in the `source_tree` field of the root (and only) Dataset.
Add new parameter `include_deriveds=False`. If true, also return derived lineage information as a `LineageTree` object in the `derived_tree` field of the Dataset.
Current/Legacy behaviour:
# :param with_lineage:
# - ``True (default)`` attempt adding lineage datasets if missing
# - ``False`` record lineage relations, but do not attempt
# adding lineage datasets to the db
(where lineage data is assumed to be stored in the `sources` field of `ds`.)
Proposed/v2 behaviour:
- Lineage data is assumed to be stored in the `source_tree` and `derived_tree` fields of `ds`.
- The `with_lineage` argument is ignored (and is dropped altogether in v2). Always record lineage relations only (both sourcewards and derivedwards, as provided in the Dataset object).
The legacy API allows searching by metadata on source datasets.
Propose that v2 drops support for the `source_fields` argument.
Currently: Return a list of datasets that are derived from the named dataset id.
Proposed:
- Do not implement the old API in a driver that `supports_external_lineage`; issue a deprecation warning otherwise.
Add a new Index Resource `lineage` (i.e. `dc.index.lineage`, like the existing `dc.index.products` and `dc.index.datasets`).
Methods:
- Add a new method `get_derived_tree(id, max_depth=0)` that returns a derived-direction `LineageTree` with the named dataset ID at the root. Recurse to `max_depth` levels of recursion (0, the default, means no maximum depth: keep building as deep as you have to).
- Add a corresponding new `get_source_tree(id, max_depth=0)` method that returns a source-direction `LineageTree` with the named dataset ID at the root.
- New method `add(tree: LineageTree, replace=False)`: add lineage information only to the index. Merges with existing lineage, if any. Conflicts between existing lineage and the supplied tree trigger an error, unless `replace` is True, in which case the conflicting information is removed and the supplied tree is treated as authoritative.
- New method `remove_lineage(id, direction, max_depth=0)`: remove all source or derived lineage for a dataset id, to the specified recursion depth (0, the default, means keep recursing all the way).
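The semantics of `add`, `get_source_tree` and `get_derived_tree` can be sketched with an in-memory store. Everything here is illustrative: `InMemoryLineage` is a hypothetical stand-in for `dc.index.lineage`, a "conflict" is assumed to mean the same (derived, source) pair with a different classifier, and the sketch assumes the lineage graph is acyclic.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Mapping, Optional, Sequence, Tuple
from uuid import UUID, uuid4


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


@dataclass
class LineageTree:
    direction: LineageDirection
    dataset_id: UUID
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None
    home: Optional[str] = None


SOURCES, DERIVED = LineageDirection.SOURCES, LineageDirection.DERIVED


class InMemoryLineage:
    """Hypothetical stand-in for the proposed lineage resource."""

    def __init__(self) -> None:
        # (derived_id, source_id) -> classifier
        self._relations: Dict[Tuple[UUID, UUID], str] = {}

    def add(self, tree: LineageTree, replace: bool = False) -> None:
        for classifier, nodes in (tree.children or {}).items():
            for node in nodes:
                if tree.direction is SOURCES:
                    key = (tree.dataset_id, node.dataset_id)
                else:
                    key = (node.dataset_id, tree.dataset_id)
                existing = self._relations.get(key)
                # Assumed conflict rule: same pair, different classifier.
                if existing not in (None, classifier) and not replace:
                    raise ValueError(f"conflicting lineage for {key}")
                self._relations[key] = classifier
                self.add(node, replace=replace)

    def _build(self, ds_id, direction, max_depth, depth=0) -> LineageTree:
        if max_depth and depth >= max_depth:
            # Depth limit reached: children unknown, not empty.
            return LineageTree(direction, ds_id, children=None)
        children = defaultdict(list)
        for (derived, source), classifier in self._relations.items():
            parent, child = (derived, source) if direction is SOURCES else (source, derived)
            if parent == ds_id:
                children[classifier].append(
                    self._build(child, direction, max_depth, depth + 1)
                )
        return LineageTree(direction, ds_id, children=dict(children))

    def get_source_tree(self, id: UUID, max_depth: int = 0) -> LineageTree:
        return self._build(id, SOURCES, max_depth)

    def get_derived_tree(self, id: UUID, max_depth: int = 0) -> LineageTree:
        return self._build(id, DERIVED, max_depth)


lineage = InMemoryLineage()
level1, ard, geomedian = uuid4(), uuid4(), uuid4()
lineage.add(
    LineageTree(SOURCES, geomedian, children={
        "ard": [LineageTree(SOURCES, ard, children={
            "level1": [LineageTree(SOURCES, level1)],
        })],
    })
)
full = lineage.get_source_tree(geomedian)
shallow = lineage.get_source_tree(geomedian, max_depth=1)
downstream = lineage.get_derived_tree(level1)
```

Note that nodes cut off by `max_depth` come back with `children=None` rather than `{}`, matching the distinction drawn in the `LineageTree` definition. A real implementation would also need to make `add` atomic, since this sketch can leave partial state behind when a conflict is raised.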
Current bulk read/write methods are flagged as being unstable (i.e. subject to further change).
I propose adding new bulk read/write methods for lineage data. These would operate with flat records (not full LineageTrees) and would be similar in format to the existing (unstable) bulk read/write methods. Details of these new bulk methods, and stabilisation of the existing read/write and cloning API methods, are deferred to a future EP.
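The record format for the bulk methods is not specified here. Purely as an illustration of what "flat records" could look like, the sketch below flattens a `LineageTree` into `(derived_id, source_id, classifier)` tuples mirroring the columns of the lineage table; the `flatten` helper and the tuple layout are assumptions, not the deferred design.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterator, Mapping, Optional, Sequence, Tuple
from uuid import UUID, uuid4


class LineageDirection(Enum):
    SOURCES = 1
    DERIVED = 2


@dataclass
class LineageTree:
    direction: LineageDirection
    dataset_id: UUID
    children: Optional[Mapping[str, Sequence["LineageTree"]]] = None
    home: Optional[str] = None


def flatten(tree: LineageTree) -> Iterator[Tuple[UUID, UUID, str]]:
    """Yield (derived_id, source_id, classifier) for every edge in the tree."""
    for classifier, nodes in (tree.children or {}).items():
        for node in nodes:
            if tree.direction is LineageDirection.SOURCES:
                # In a source tree the parent is derived from the child.
                yield tree.dataset_id, node.dataset_id, classifier
            else:
                # In a derived tree the child is derived from the parent.
                yield node.dataset_id, tree.dataset_id, classifier
            yield from flatten(node)


SOURCES = LineageDirection.SOURCES
level1, ard, geomedian = uuid4(), uuid4(), uuid4()
tree = LineageTree(SOURCES, geomedian, children={
    "ard": [LineageTree(SOURCES, ard, children={
        "level1": [LineageTree(SOURCES, level1)],
    })],
})
records = list(flatten(tree))
```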
Some existing CLI commands/options will have to be updated to reflect the API changes above, and new CLI commands for handling lineage will need to be added. In particular, a CLI command to index lineage information ONLY (i.e. don't index the dataset, just extract and save the lineage info.) would be desirable.
The detailed specifications for these are deferred to a future EP.
Edit and add your comments here
Current EO3 metadata standard (i.e. what is recorded in the json stac or yaml dataset metadata document, as read at indexing time) supports:
- Source lineage only, only one level-deep.
- DOES support multiple source IDs for a single classifier, but this is currently overwritten and flattened by the ODC at index time.
Although the API enhancements in this EP can proceed with the current EO3 metadata format, this EP may be an appropriate place to consider adding (optional) extensions to the EO3 format to support:
- nested lineage; and
- derived as well as source lineage
If any extensions to the EO3 format are to be made, they should be made as part of a larger effort to draft a more formal definition of the EO3 format.
- Paul Haesler (@SpacemanPaul)