From 69e251348d1eeaa9115d5ff361ab87aa08e874c1 Mon Sep 17 00:00:00 2001 From: milancurcic Date: Fri, 7 Jul 2023 13:49:51 -0400 Subject: [PATCH 1/7] Begin reworking the Usage page in the docs --- docs/usage.rst | 125 ++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 114 insertions(+), 11 deletions(-) diff --git a/docs/usage.rst b/docs/usage.rst index 21c57855..f31871d0 100644 --- a/docs/usage.rst +++ b/docs/usage.rst @@ -3,7 +3,117 @@ Usage ===== -CloudDrift provides an easy way to convert Lagrangian datasets into +The CloudDrift library provides functions for: + +* Easy access to cloud-ready Lagrangian ragged-array datasets; +* Common Lagrangian analysis tasks on ragged arrays; +* Adapting custom Lagrangian datasets into ragged arrays. + +Let's start by importing the library and accessing a ready-to-use ragged-array +dataset. + +Accessing ragged-array Lagrangian datasets +------------------------------------------ + +We recommend importing ``clouddrift`` using the ``cd`` shorthand for convenience: + +>>> import clouddrift as cd + +CloudDrift provides a set of Lagrangian datasets that are ready to use. +They can be accessed via the ``datasets`` submodule. +In this example, we will load NOAA's Global Drifter Program (GDP) hourly +dataset, which is hosted in a public AWS bucket as a cloud-optimized Zarr +dataset: + +>>> ds = cd.datasets.gdp1h() +>>> ds + +Dimensions: (traj: 17324, obs: 165754333) +Coordinates: + ids (obs) int64 ... + lat (obs) float32 ... + lon (obs) float32 ... + time (obs) datetime64[ns] ... +Dimensions without coordinates: traj, obs +Data variables: (12/55) + BuoyTypeManufacturer (traj) |S20 ... + BuoyTypeSensorArray (traj) |S20 ... + CurrentProgram (traj) float64 ... + DeployingCountry (traj) |S20 ... + DeployingShip (traj) |S20 ... + DeploymentComments (traj) |S20 ... + ... ... + sst1 (obs) float64 ... + sst2 (obs) float64 ... + typebuoy (traj) |S10 ... + typedeath (traj) int8 ... + ve (obs) float32 ...
+ vn (obs) float32 ... +Attributes: (12/16) + Conventions: CF-1.6 + acknowledgement: Elipot, Shane; Sykulski, Adam; Lumpkin, Rick; Centurio... + contributor_name: NOAA Global Drifter Program + contributor_role: Data Acquisition Center + date_created: 2022-12-09T06:02:29.684949 + doi: 10.25921/x46c-3620 + ... ... + processing_level: Level 2 QC by GDP drifter DAC + publisher_email: aoml.dftr@noaa.gov + publisher_name: GDP Drifter DAC + publisher_url: https://www.aoml.noaa.gov/phod/gdp + summary: Global Drifter Program hourly data + title: Global Drifter Program hourly drifting buoy collection + +The ``gdp1h`` function returns an Xarray ``Dataset`` instance of the ragged-array dataset. +While the dataset is quite large, around a dozen GB, it is not downloaded to your +local machine. Instead, the dataset is accessed directly from the cloud, and only +the data that is needed for the analysis is downloaded. This is possible thanks to +the cloud-optimized Zarr format, which allows for efficient access to the data +stored in the cloud. + +Let's look at some variables in this dataset: + +>>> ds.lon + +[165754333 values with dtype=float32] +Coordinates: + ids (obs) int64 ... + lat (obs) float32 ... + lon (obs) float32 ... + time (obs) datetime64[ns] ... +Dimensions without coordinates: obs +Attributes: + long_name: Longitude + units: degrees_east + +You see that this array is very long--it has 165754333 elements. +This is because in a ragged array, many varying-length arrays are laid out as a +contiguous 1-dimensional array in memory. + +Let's look at the dataset dimensions: + +>>> ds.dims +Frozen({'traj': 17324, 'obs': 165754333}) + +The ``traj`` dimension has 17324 elements, which is the number of individual +trajectories in the dataset. +The sum of their lengths equals the length of the ``obs`` dimension. 
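As an aside on the layout just described, here is a minimal, self-contained sketch of how varying-length trajectories pack into one contiguous array and how their bounds are recovered from the per-trajectory lengths. The toy ``rowsize`` and ``lon`` arrays are illustrative only, not taken from the dataset above:

```python
import numpy as np

# Three toy trajectories of lengths 3, 2, and 4 stored back-to-back
# in one contiguous 1-dimensional array (the "obs" dimension).
rowsize = np.array([3, 2, 4])
lon = np.array([0.0, 0.1, 0.2,        # trajectory 0
                5.0, 5.1,             # trajectory 1
                9.0, 9.1, 9.2, 9.3])  # trajectory 2

# Prepending 0 to the cumulative sum of the row sizes yields the
# start/end bounds of each trajectory within the contiguous array.
idx = np.insert(np.cumsum(rowsize), 0, 0)
print(idx)  # [0 3 5 9]

# Slice out the second trajectory (longitudes 5.0 and 5.1).
print(lon[idx[1]:idx[2]])
```

The sum of ``rowsize`` equals the length of the flat array, mirroring how the sum of the trajectory lengths equals the length of the ``obs`` dimension.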
+These dimensions, their lengths, and the ``count`` (or ``rowsize``) +variable are used internally to make CloudDrift's analysis functions aware of +the bounds of each contiguous array within the ragged-array data structure. + +Doing common analysis tasks on ragged arrays +-------------------------------------------- + +Now that we have a ragged-array dataset loaded as an Xarray ``Dataset`` instance, +let's do some common analysis tasks on it. + +TODO + +Adapting custom Lagrangian datasets into ragged arrays +------------------------------------------------------ + +CloudDrift provides an easy way to convert custom Lagrangian datasets into `contiguous ragged arrays `_. .. code-block:: python This snippet is specific to the hourly GDP dataset, however, you can use the ``RaggedArray`` class directly to convert other custom datasets into a ragged -array structure that is analysis ready via Xarray or Awkward Array packages. -We provide step-by-step guides to convert the individual trajectories from the -Global Drifter Program (GDP) hourly and 6-hourly datasets, the drifters from the -`CARTHE `_ experiment, and a typical output from a numerical -Lagrangian experiment in our -`repository of example Jupyter Notebooks `_. +array structure that is analysis ready via Xarray or Awkward Array packages. +The functions to do that are defined in the ``clouddrift.adapters`` submodule. You can use these examples as a reference to ingest your own or other custom -Lagrangian datasets into ``RaggedArray``. - -In the future, ``clouddrift`` will be including functions to perform typical -oceanographic Lagrangian analyses. +Lagrangian datasets into ``RaggedArray``.
\ No newline at end of file From 40ef6a6cbd2372354e173b37a73f94e37dc2a7a9 Mon Sep 17 00:00:00 2001 From: milancurcic Date: Thu, 13 Jul 2023 10:56:39 -0400 Subject: [PATCH 2/7] Check for count or rowsize in subset --- clouddrift/analysis.py | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/clouddrift/analysis.py b/clouddrift/analysis.py index 1f6362f8..b31c7a17 100644 --- a/clouddrift/analysis.py +++ b/clouddrift/analysis.py @@ -977,6 +977,18 @@ def subset(ds: xr.Dataset, criteria: dict) -> xr.Dataset: ValueError If one of the variable in a criterion is not found in the Dataset """ + # Normally we expect the ragged-array dataset to have a "count" variable. + # However, some datasets may have a "rowsize" variable instead, e.g. if they + # have not gotten up to speed with our new convention. We check for both. + if "count" in ds.variables: + count_var = "count" + elif "rowsize" in ds.variables: + count_var = "rowsize" + else: + raise ValueError( + "Ragged-array Dataset ds must have a 'count' or 'rowsize' variable." 
+ ) + mask_traj = xr.DataArray(data=np.ones(ds.dims["traj"], dtype="bool"), dims=["traj"]) mask_obs = xr.DataArray(data=np.ones(ds.dims["obs"], dtype="bool"), dims=["obs"]) @@ -990,7 +1002,7 @@ def subset(ds: xr.Dataset, criteria: dict) -> xr.Dataset: raise ValueError(f"Unknown variable '{key}'.") # remove data when trajectories are filtered - traj_idx = np.insert(np.cumsum(ds["count"].values), 0, 0) + traj_idx = np.insert(np.cumsum(ds[count_var].values), 0, 0) for i in np.where(~mask_traj)[0]: mask_obs[slice(traj_idx[i], traj_idx[i + 1])] = False @@ -1006,7 +1018,7 @@ def subset(ds: xr.Dataset, criteria: dict) -> xr.Dataset: # apply the filtering for both dimensions ds_sub = ds.isel({"traj": mask_traj, "obs": mask_obs}) # update the count - ds_sub["count"].values = segment( + ds_sub[count_var].values = segment( ds_sub.ids, 0.5, count=segment(ds_sub.ids, -0.5) ) return ds_sub From 9605e255292aa1b85cb9bd1cbfa95638de5b84d7 Mon Sep 17 00:00:00 2001 From: milancurcic Date: Thu, 13 Jul 2023 10:56:53 -0400 Subject: [PATCH 3/7] Subset to a smaller dataset in Usage --- docs/usage.rst | 55 +++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 54 insertions(+), 1 deletion(-) diff --git a/docs/usage.rst b/docs/usage.rst index f31871d0..9dee169b 100644 --- a/docs/usage.rst +++ b/docs/usage.rst @@ -107,8 +107,61 @@ Doing common analysis tasks on ragged arrays Now that we have a ragged-array dataset loaded as an Xarray ``Dataset`` instance, let's do some common analysis tasks on it. +Our dataset is on a remote server and fairly large (a dozen GB or so), so let's +first subset it to several trajectories so that we can more easily work with it. +The variable ``ID`` is the unique identifier for each trajectory: -TODO +>>> ds.ID[:10].values +array([2578, 2582, 2583, 2592, 2612, 2613, 2622, 2623, 2931, 2932]) + +>>> from clouddrift.analysis import subset + +``subset`` allows you to subset a ragged array by some criterion. 
+In this case, we will subset it by the ``ID`` variable: + +>>> ds_sub = subset(ds, {"ID": list(ds.ID[:5])}) +>>> ds_sub + +Dimensions: (traj: 5, obs: 13612) +Coordinates: + ids (obs) int64 2578 2578 2578 2578 ... 2612 2612 2612 + lat (obs) float32 ... + lon (obs) float32 ... + time (obs) datetime64[ns] ... +Dimensions without coordinates: traj, obs +Data variables: (12/55) + BuoyTypeManufacturer (traj) |S20 ... + BuoyTypeSensorArray (traj) |S20 ... + CurrentProgram (traj) float64 ... + DeployingCountry (traj) |S20 ... + DeployingShip (traj) |S20 ... + DeploymentComments (traj) |S20 ... + ... ... + sst1 (obs) float64 ... + sst2 (obs) float64 ... + typebuoy (traj) |S10 ... + typedeath (traj) int8 ... + ve (obs) float32 ... + vn (obs) float32 ... +Attributes: (12/16) + Conventions: CF-1.6 + acknowledgement: Elipot, Shane; Sykulski, Adam; Lumpkin, Rick; Centurio... + contributor_name: NOAA Global Drifter Program + contributor_role: Data Acquisition Center + date_created: 2022-12-09T06:02:29.684949 + doi: 10.25921/x46c-3620 + ... ... + processing_level: Level 2 QC by GDP drifter DAC + publisher_email: aoml.dftr@noaa.gov + publisher_name: GDP Drifter DAC + publisher_url: https://www.aoml.noaa.gov/phod/gdp + summary: Global Drifter Program hourly data + title: Global Drifter Program hourly drifting buoy collection + +You see that we now have a subset of the original dataset, with 5 trajectories +and a total of 13612 observations. +This subset is small enough to work with easily and quickly for demonstration +purposes. 
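To illustrate the bookkeeping such a subsetting step relies on, here is a self-contained sketch of how a trajectory-level mask propagates to the observation dimension. This mirrors, rather than reproduces, the library's internals; the toy ``rowsize`` and ``mask_traj`` arrays are hypothetical stand-ins:

```python
import numpy as np

# Toy example: observations per trajectory, and a trajectory-level mask
# that keeps trajectories 0 and 2 but drops trajectory 1.
rowsize = np.array([3, 2, 4])
mask_traj = np.array([True, False, True])

# Bounds of each trajectory within the contiguous "obs" dimension.
traj_idx = np.insert(np.cumsum(rowsize), 0, 0)

# Start with all observations kept, then mask out the observations
# belonging to each dropped trajectory.
mask_obs = np.ones(rowsize.sum(), dtype=bool)
for i in np.where(~mask_traj)[0]:
    mask_obs[traj_idx[i]:traj_idx[i + 1]] = False

print(mask_obs)  # observations of trajectory 1 are masked out
```

Applying ``mask_obs`` to any observation-dimension variable then yields a smaller, still-contiguous ragged array whose row sizes match the kept trajectories.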
Adapting custom Lagrangian datasets into ragged arrays ------------------------------------------------------ From 8a03feae6fa943366d88d38de26ef851a6c6f50f Mon Sep 17 00:00:00 2001 From: milancurcic Date: Thu, 13 Jul 2023 15:43:29 -0400 Subject: [PATCH 4/7] Fix ordered list in docstring --- clouddrift/analysis.py | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/clouddrift/analysis.py b/clouddrift/analysis.py index b31c7a17..438b31a5 100644 --- a/clouddrift/analysis.py +++ b/clouddrift/analysis.py @@ -696,12 +696,9 @@ def velocity_from_position( Difference scheme can take one of three values: - 1. "forward" (default): finite difference is evaluated as - dx[i] = dx[i+1] - dx[i]; - 2. "backward": finite difference is evaluated as - dx[i] = dx[i] - dx[i-1]; - 3. "centered": finite difference is evaluated as - dx[i] = (dx[i+1] - dx[i-1]) / 2. + #. "forward" (default): finite difference is evaluated as ``dx[i] = x[i+1] - x[i]``; + #. "backward": finite difference is evaluated as ``dx[i] = x[i] - x[i-1]``; + #. "centered": finite difference is evaluated as ``dx[i] = (x[i+1] - x[i-1]) / 2``. Forward and backward schemes are effectively the same except that the position at which the velocity is evaluated is shifted one element down in From f77e580d7e5d01a71bdca9bd766380c5c1920800 Mon Sep 17 00:00:00 2001 From: milancurcic Date: Thu, 13 Jul 2023 15:43:54 -0400 Subject: [PATCH 5/7] Switch to sphinx-book-theme --- docs/conf.py | 4 +--- docs/requirements.txt | 2 +- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/conf.py b/docs/conf.py index afccaf01..f83af931 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -49,9 +49,7 @@ # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes.
-# -html_theme = "pydata_sphinx_theme" # alabaster, sphinx_rtd_theme -# html_theme = "sphinx_rtd_theme" +html_theme = "sphinx_book_theme" # alabaster, sphinx_rtd_theme # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, diff --git a/docs/requirements.txt b/docs/requirements.txt index 3f732c0d..1274f913 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,2 +1,2 @@ sphinx -pydata_sphinx_theme +sphinx-book-theme From 1d074367db81c394f65b6f5dd7fb06698ca0234b Mon Sep 17 00:00:00 2001 From: milancurcic Date: Thu, 13 Jul 2023 15:44:55 -0400 Subject: [PATCH 6/7] Update copyright year --- docs/conf.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/conf.py b/docs/conf.py index f83af931..be81ffe5 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -19,7 +19,7 @@ # -- Project information ----------------------------------------------------- project = "CloudDrift" -copyright = "2022, CloudDrift" +copyright = "2022-2023, CloudDrift" author = "Philippe Miron" # -- General configuration --------------------------------------------------- From 3ef8a00c37e1d6317048a385fa9404175fdf0171 Mon Sep 17 00:00:00 2001 From: milancurcic Date: Thu, 13 Jul 2023 16:42:28 -0400 Subject: [PATCH 7/7] Complete the mini-tutorial --- docs/usage.rst | 72 +++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 71 insertions(+), 1 deletion(-) diff --git a/docs/usage.rst b/docs/usage.rst index 9dee169b..c5bb07bf 100644 --- a/docs/usage.rst +++ b/docs/usage.rst @@ -160,8 +160,78 @@ Attributes: (12/16) You see that we now have a subset of the original dataset, with 5 trajectories and a total of 13612 observations. -This subset is small enough to work with easily and quickly for demonstration +This subset is small enough to quickly and easily work with for demonstration purposes. 
+Let's see how we can compute the mean and maximum velocities of each trajectory. +To start, we'll need to obtain the velocities over all trajectory times. +Although the GDP dataset already comes with velocity variables, we won't use +them here so that we can learn how to compute them ourselves from positions. +``clouddrift`` provides the ``velocity_from_position`` function that allows you +to do just that. + +>>> from clouddrift.analysis import velocity_from_position + +At a minimum, ``velocity_from_position`` requires three input parameters: +consecutive x- and y-coordinates and time, so we could do: + +>>> u, v = velocity_from_position(ds_sub.lon, ds_sub.lat, ds_sub.time) + +``velocity_from_position`` returns two arrays, ``u`` and ``v``, which are the +zonal and meridional velocities, respectively. +By default, it assumes that the coordinates are in degrees, and it handles the +great circle path calculation and longitude wraparound under the hood. +However, recall that ``ds_sub.lon``, ``ds_sub.lat``, and ``ds_sub.time`` are +ragged arrays, so we need a different approach to calculate velocities while +respecting the trajectory boundaries. +For this, we can use the ``apply_ragged`` function, which applies a function +to each trajectory in a ragged array and returns the concatenated result. + +>>> from clouddrift.analysis import apply_ragged +>>> u, v = apply_ragged(velocity_from_position, [ds_sub.lon, ds_sub.lat, ds_sub.time], ds_sub.rowsize) + +``u`` and ``v`` here are still ragged arrays, which means that the five +contiguous trajectories are concatenated into 1-dimensional arrays. + +Now, let's compute the velocity magnitude in meters per second.
+The time in this dataset is loaded in nanoseconds by default: + +>>> ds_sub.time.values +array(['2005-04-15T20:00:00.000000000', '2005-04-15T21:00:00.000000000', + '2005-04-15T22:00:00.000000000', ..., + '2005-10-02T03:00:00.000000000', '2005-10-02T04:00:00.000000000', + '2005-10-02T05:00:00.000000000'], dtype='datetime64[ns]') + +The computed velocities are thus in meters per nanosecond, so to obtain the +velocity magnitude in meters per second, we'll need to multiply them by ``1e9``. + +>>> import numpy as np +>>> velocity_magnitude = np.sqrt(u**2 + v**2) * 1e9 +>>> velocity_magnitude +array([0.28053388, 0.6164632 , 0.89032112, ..., 0.2790803 , 0.20095603, + 0.20095603]) + +>>> velocity_magnitude.mean(), velocity_magnitude.max() +(0.22115242718877506, 1.6958275672626286) + +However, these aren't the results we are looking for! Recall that we have the +velocity magnitude of five different trajectories concatenated into one array. +This means that we need to use ``apply_ragged`` again to compute the mean and +maximum values: + +>>> apply_ragged(np.mean, [velocity_magnitude], ds_sub.rowsize) +array([0.32865148, 0.17752435, 0.1220523 , 0.13281067, 0.14041268]) +>>> apply_ragged(np.max, [velocity_magnitude], ds_sub.rowsize) +array([1.69582757, 1.36804354, 0.97343434, 0.60353528, 1.05044213]) + +And there you go! We used ``clouddrift`` to: + +#. Load a real-world Lagrangian dataset from the cloud; +#. Subset the dataset by trajectory IDs; +#. Compute the velocity vectors and their magnitudes for each trajectory; +#. Compute the mean and maximum velocity magnitudes for each trajectory. + +``clouddrift`` offers many more functions for common Lagrangian analysis tasks. +Please explore the `API `_ +to learn about other functions and how to use them. Adapting custom Lagrangian datasets into ragged arrays ------------------------------------------------------