Skip to content

Commit

Permalink
Add developer notes on questions RE laziness of execution
Browse files Browse the repository at this point in the history
  • Loading branch information
sadielbartholomew committed Nov 9, 2021
1 parent 8a4005b commit 7a01cbe
Showing 1 changed file with 76 additions and 0 deletions.
76 changes: 76 additions & 0 deletions cf/data/README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -22,3 +22,79 @@ The mask hardness is most easily reset with the

`cf.Data.__setitem__` and `cf.Data.where` are examples of methods that
need to reset the mask in this manner.


Laziness
--------

To *be* lazy, or *not to be* lazy (in `cf.Data` itself)?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Central to Dask is lazy execution i.e. delayed computation:
Dask operations essentially construct a graph of calculations
or transformations (etc.) that are ready to run later,
and only get evaluated together when requested with
a `<dask object>.compute` call.

We want to utilise this laziness because it is central to the
efficiency from using Dask, but to what extent to do we want
to incorporate laziness into `cf.Data`? Namely, for
an arbitary `cf.Data` method previously returning some result
(say, a Boolean or an array), which of these should we return:

1. The **pre-computed result**, i.e. the outcome from running
`compute` on the result graph constructed in the method
(e.g. the same Boolean or an array, etc., as before); or
2. The **uncomputed result**, i.e. a Dask object which only
evaluates to the result in (1) when either the user or
the code under-the-hood, later, runs a `compute`?

Arguments for choice (1) [advantages to (1) and disadvantages to (2)]:

* The simpler choice:
* means output is the same as before so documentation is easier and
less change relative to previous versions;
* logging and error handling can remain simple and as-is, whereas
choice (2) would mean we don't know whether a given log or error
message, dependent on the outcome, is applicable, so we can't
call it immediately (perhaps at all?). We might have to defer to
standard Dask messages, which would reduce specificity and clarity.
* Testing will be simpler, as with (2) we would have to add `compute`
calls in at appropriate points before running test assertions, etc.
* Inspection methods can return as they do now, whereas with choice (2)
we would have to work out what to show when certain aspects aren't
yet computed.

Arguments for choice (2):

* The technically more complicated but more efficient choice, overall:
* This choice is more efficient when we build up chains of operations,
because it avoids intermediate computation meaning parallelisation can
be optimised more comprehensively by Dask.

There may be at least one other option:

3. Could we, temporariliy or permanently, make use of a common keyword
argument such as `precompute` on methods so users and under-the-hood in
the code we can dictate whether or not to return the pre-computed or
uncomputed result? That would give extra flexibility, but mean more
boilerplate code (which can be consolidated somewhat, but at best
will require some extra lines per method).

If so, what would the best default be, `True` or `False`?

I think we need to ensure we are consistent in our approach, so choose either
(1), (2) or (3) (or another alternative), rather than a mixture, as that
will be a maintenance nightmare!


Logging and error handling
^^^^^^^^^^^^^^^^^^^^^^^^^^

When Dask operations are uncomputed, we don't know whether certain logging
and error messages are applicable or not.

Can we raise these in a delayed way, when we don't want to compute
early, in the case we are in the middle of under-the-hood operations and
also perhaps if we choose case (2) from the above points on extent of
laziness? How can it be done?

1 comment on commit 7a01cbe

@davidhassell
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Sadie - thanks for writing this up. I think the .array attribute could be thought of as cf-python's version of .compute(). For methods that return new cf.Data objects (such as transpose) I think they should be lazy.Other methods, perhaps not. __repr__ and equals come to mind.

Please sign in to comment.