
Add demo for dataset generation. #7

Closed

Conversation

ethanluoyc

This PR includes a demonstration for logging environment interactions and generating datasets in HDF5 format.

Please refer to README.md for an explanation of the dataset specification.

@WillDudley
Contributor

As I'm unfamiliar with the dm ecosystem, I find this impressive yet a tad difficult to understand. It'll be very useful in the near future; for now I'd vote for focusing on a simple MVP inspired by MDPDatasets.

@ethanluoyc
Author

The PR consists of two parts,

  1. A spec of how RL episodic data is serialized and stored in HDF5 datasets.
  2. A proof-of-concept experiment showing how to use an RL library (Acme in this case) together with a logger to generate the data and convert it into spec-conforming HDF5 datasets.

For 1, I think it's important to describe formally what the spec looks like. My proposal is to leverage the spec already defined by RLDS and store a flattened nested group of datasets in HDF5. Ideally, all datasets released by Kabuki should use the same format. I described a few things in D4RL that I wish we could avoid in future datasets, e.g., the omission of terminal observations, which can cause problems for some lines of offline RL research. If anything, I think future datasets should aim to capture lossless information whenever possible so that they can be utilized by a wider community. This also means that we should support storing episodes that

  1. have additional metadata;
  2. use nested observations (dicts, tuples, etc.).

I worry less about 2 at the moment because I feel it's too soon to say we have a good logger implementation, and it's often easier to migrate to a different logger than to migrate the datasets. Also, there is a good chance that people will generate datasets without using a logger (e.g., they may curate a subset of experience to release). Having said this, I think the EnvLogger library I was using is pretty good despite lacking native support for Gym. I would expect that if we put a logger implementation in Kabuki, it should support the same functionality.
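To make the flattened-nested-group idea concrete, here is a minimal sketch of what storing one episode might look like, assuming h5py and NumPy; `flatten` and `write_episode` are hypothetical helper names, and the field layout only loosely follows RLDS conventions:

```python
import h5py
import numpy as np

def flatten(nested, prefix=""):
    """Flatten a nested dict of arrays into {"a/b": array} path keys."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}/"))
        else:
            flat[path] = np.asarray(value)
    return flat

def write_episode(filename, episode_index, steps):
    """Store one episode as a flattened group of HDF5 datasets."""
    with h5py.File(filename, "a") as f:
        group = f.create_group(f"episodes/{episode_index}")
        for path, array in flatten(steps).items():
            group.create_dataset(path, data=array)

# Example: nested observations (a dict), plus RLDS-style step fields.
steps = {
    "observation": {
        "position": np.zeros((5, 3), dtype=np.float32),
        "velocity": np.zeros((5, 3), dtype=np.float32),
    },
    "action": np.zeros((5, 2), dtype=np.float32),
    "reward": np.zeros((5,), dtype=np.float32),
    "is_terminal": np.array([0, 0, 0, 0, 1], dtype=bool),
}
write_episode("demo.hdf5", 0, steps)
```

Because the nesting is encoded in the dataset paths, readers can reconstruct the original dict structure without any schema beyond the spec itself.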

I feel we could open a discussion about whether that covers a broad range of use cases and whether there are scenarios where it won't suffice. Ideally, this would be the format that all future RL datasets in Kabuki use, so it's also important to consider the implications of:

  • Scale and performance: whether the format allows us to easily scale to large datasets. The datasets in D4RL are at most a few hundred megabytes. In that regime, storing the entire thing in a single HDF5 file suffices. During training, I can even load the entire dataset into memory, put it on the GPU, and train my offline RL agent without worrying about IO inefficiency or memory issues. However, moving forward we should discuss whether what was used for D4RL would also suffice for Kabuki, if the ultimate goal is to share a common set of infrastructure for RL datasets that can vary in size and modality. Of course, we can always revisit this later when we move beyond the small scale.
  • Simplicity: the format should be easy to process. Users of the datasets should easily be able to extract episodes of experience from the dataset. Providing a dataset loader that only exposes transitions of the form (s, a, r, s') will not suffice for some offline methods such as Decision Transformers. Ideally, Kabuki provides this functionality out of the box so that people don't have to reinvent the wheel.
  • Use widely adopted formats: we should use a format that is widely recognized in the community. HDF5 is a very good candidate, but frankly, I have never found it user-friendly for storing time series (episodic data). The EnvLogger library I used as an example here supports file formats such as Riegeli, which is not very widely used. Some of the datasets in RLDS use TFRecords together with tf.data, which depends on TensorFlow.
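On the scale point, one mitigation HDF5 itself offers is chunked storage, which lets training code read slices without loading the whole dataset into memory. A rough sketch, assuming h5py; the file and dataset names are made up for illustration:

```python
import h5py
import numpy as np

# Create a large dataset with chunking and compression; with compression
# enabled, unwritten chunks take no space on disk.
with h5py.File("large_demo.hdf5", "w") as f:
    f.create_dataset(
        "observations",
        shape=(1_000_000, 84),
        dtype=np.float32,
        chunks=(4096, 84),   # HDF5 does IO one chunk at a time
        compression="gzip",
    )

# During training, slice out only the rows a minibatch needs; h5py reads
# just the chunks the slice touches, not the whole array.
with h5py.File("large_demo.hdf5", "r") as f:
    batch = f["observations"][10_000:10_256]
    print(batch.shape)  # (256, 84)
```

Choosing a chunk shape that matches the access pattern (here, contiguous runs of steps) is what keeps per-batch IO proportional to the batch, not the dataset.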

I agree that we should maybe focus on an MVP for now, but here are some of my thoughts, and I hope they are useful in one way or another. It would also be very good to hear from potential dataset contributors who have specific questions or points they would like to clarify.
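On the simplicity point, the gap between transition-level and episode-level access can be sketched as follows; this assumes NumPy, and the helper names (`transitions`, `returns_to_go`) are hypothetical:

```python
import numpy as np

def transitions(episode):
    """(s, a, r, s') view of an episode -- enough for Q-learning-style methods."""
    obs, act, rew = episode["observation"], episode["action"], episode["reward"]
    for t in range(len(rew) - 1):
        yield obs[t], act[t], rew[t], obs[t + 1]

def returns_to_go(rewards, discount=1.0):
    """Per-step future return -- the signal Decision Transformer conditions on,
    which cannot be computed from isolated (s, a, r, s') tuples."""
    rtg = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        rtg[t] = running
    return rtg

episode = {
    "observation": np.arange(4, dtype=np.float32).reshape(4, 1),
    "action": np.zeros((4, 1), dtype=np.float32),
    "reward": np.array([1.0, 0.0, 2.0, 1.0], dtype=np.float32),
}
print(returns_to_go(episode["reward"]))  # [4. 3. 3. 1.]
```

A loader that only yields `transitions(...)` destroys the episode structure that `returns_to_go` needs, which is why episode-level access matters in the format.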

Regarding MDPDatasets: I believe @WillDudley is referring to #6. I briefly looked through the PR, and here are some of my thoughts.

  1. Is there a particular reason we need to have the dataset definition in Cython?
  2. The definitions of episode and transition seem a bit simplistic. I know it's WIP, so no worries if you are working on that. We should consider supporting nested observations, observations that are not floating-point arrays, etc. I saw you can save some metadata with a JSON file, but we should also support use cases where we save per-step/per-episode metadata (e.g., the physics state of a MuJoCo simulation, information about goals, or the log-probs of the policy).
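To illustrate the metadata distinction, here is a hypothetical episode record sketching per-step and per-episode metadata alongside the core step fields; all field names are illustrative, not part of any existing spec:

```python
import numpy as np

T = 5  # number of steps in this episode

episode = {
    "observation": np.zeros((T, 7), dtype=np.float32),
    "action": np.zeros((T, 2), dtype=np.float32),
    "reward": np.zeros((T,), dtype=np.float32),
    # Per-step metadata: shares the leading step dimension T.
    "step_metadata": {
        "log_prob": np.zeros((T,), dtype=np.float32),          # behavior-policy log-probs
        "physics_state": np.zeros((T, 67), dtype=np.float64),  # e.g. MuJoCo qpos/qvel
    },
    # Per-episode metadata: one value for the whole episode.
    "episode_metadata": {
        "goal": np.array([1.0, -1.0], dtype=np.float32),
        "policy_id": "sac_seed0",
    },
}

# A loader can validate the convention: every per-step field has leading dim T.
for name, arr in episode["step_metadata"].items():
    assert arr.shape[0] == T, name
```

The point is that a single JSON sidecar covers the per-episode half, but per-step metadata needs array storage with the same leading dimension as the step fields.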

@WillDudley WillDudley mentioned this pull request Oct 29, 2022
@WillDudley
Contributor

Thanks for the detail! I'll comment on my PR first:

My PR was and is intended to be a very rough prototype for an MVP. D3RL's MDPDataset was chosen because I've had some experience playing with MDPDatasets before, meaning it was relatively easy for me to copy/paste over to Kabuki.

The fact that d3rl uses Cython wasn't a factor in choosing it, so I don't have much to comment on regarding that.

My PR focuses more on the hosting side of things, namely

  • the process of downloading datasets
  • the process of uploading datasets
  • adhering to dataset naming conventions

@WillDudley
Contributor

> The PR consists of two parts, […] I think for future datasets we should aim for capturing lossless information whenever possible so that these datasets can be utilized by a wider community. […]

I fully agree that the more information recorded the better. I've also been thinking that users should have the flexibility to use their own logger if they wish. Making it easy for a user to convert their dataset/buffer into our format is important.

> I feel we could open a discussion about whether that covers a broad range of use cases […]
>
> • Scale and performance: whether the format allows us to easily scale to large datasets. […]

For sure. Some datasets will likely be fairly large, though I can't say to what extent. The question is: are there any considerations we need to take into account now to reduce complications with scaling in the future?

> • Simplicity: the format should be easy to process. […]

Indeed. Users may get overwhelmed if there are too many columns in a dataset, but datasets should provide maximal opportunity for various models. Certain niche columns may need to be separated somehow if they end up being fairly data-intensive.

> • Use widely adopted formats: we should use a format that is widely recognized in the community. […]

I'm unfamiliar with best practices for storing time-series data; I just used HDF5 for my PR due to familiarity. The user wouldn't really interact with the file format, so my vote is for whatever's fast and reliable.

> I agree that we should maybe focus on an MVP for now, but here are just some of my thoughts […]

Oh, this is very useful! Please continue on this! I'm away tomorrow but back Monday evening :)

@WillDudley
Contributor

Another thing to consider is the security of the file format, but that should be fine as long as it's not pickle.

@WillDudley WillDudley marked this pull request as draft November 2, 2022 01:28
@WillDudley
Contributor

Hey, it would help if you could elaborate on what additional data fields d3rl could do with. You say "Providing a dataset loader that only exposes transitions of the form (s, a, r, s') will not suffice for some offline methods such as Decision Transformers" without elaborating.

The first and last states can be inferred from their position. In any case, I'm adding term and trunc to the dataset to distinguish between the two.
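For what it's worth, with term and trunc stored per step, episode boundaries can be recovered from a flat step stream; a small sketch assuming NumPy, where `episode_boundaries` is a hypothetical helper:

```python
import numpy as np

def episode_boundaries(terminations, truncations):
    """Split a flat step stream into episodes using per-step
    terminated/truncated flags; an episode ends when either is set."""
    done = np.logical_or(terminations, truncations)
    ends = np.flatnonzero(done)
    starts = np.concatenate(([0], ends[:-1] + 1))
    # Half-open [start, end) ranges, one per episode.
    return [(int(s), int(e) + 1) for s, e in zip(starts, ends)]

term = np.array([0, 0, 1, 0, 0, 0, 0], dtype=bool)
trunc = np.array([0, 0, 0, 0, 0, 0, 1], dtype=bool)
print(episode_boundaries(term, trunc))  # [(0, 3), (3, 7)]
```

The distinction matters downstream: a loader can bootstrap from the final state of a truncated episode but not a terminated one.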

The only info RLDS provides that d3rl doesn't is the discount factor, but that isn't really used anyway.

@WillDudley WillDudley closed this Nov 15, 2022