Improves MinariDataset load speed to speed up sampling, fixes total_steps bug and adds test coverage #129

balisujohn · 2023-08-01T00:19:23Z

Description

Previously, loading a MinariDataset always iterated over all episodes to calculate total_steps. Now total_steps is read from the MinariStorage when the MinariDataset is created, assuming no filter is specified. Empirically, this speeds up sampling considerably, because we no longer iterate over every episode regardless of how many we are sampling.
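The idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ToyStorage`, `load_total_steps`, and the `episodes_read` counter are hypothetical names, and the sketch assumes the stored `total_steps` metadata is kept in sync with the episodes.

```python
from typing import Dict, List, Optional


class ToyStorage:
    """Hypothetical stand-in for a storage backend: episodes plus stored metadata."""

    def __init__(self, episode_lengths: List[int]) -> None:
        self._episode_lengths = episode_lengths
        # Metadata written when the data is stored (assumed to stay in sync).
        self.attrs: Dict[str, int] = {
            "total_episodes": len(episode_lengths),
            "total_steps": sum(episode_lengths),
        }
        # Counts per-episode reads, for illustration only.
        self.episodes_read = 0

    def episode_length(self, index: int) -> int:
        self.episodes_read += 1
        return self._episode_lengths[index]


def load_total_steps(
    storage: ToyStorage, episode_indices: Optional[List[int]] = None
) -> int:
    if episode_indices is None:
        # No filter: use the stored metadata and touch no episodes.
        return storage.attrs["total_steps"]
    # Filtered subset: the stored total does not apply, so sum the selected episodes.
    return sum(storage.episode_length(i) for i in episode_indices)


storage = ToyStorage([10, 20, 30])
unfiltered_total = load_total_steps(storage)        # metadata only
reads_after_unfiltered = storage.episodes_read
filtered_total = load_total_steps(storage, [0, 2])  # iterates over two episodes
```

In the unfiltered case the cost no longer depends on the number of episodes, which is consistent with the speedup reported here.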

Since the patch required changing how total_steps is loaded, I also extended test coverage for total_steps and found some bugs in how it is calculated in some settings, which this PR also patches.

Performance of Minari vs. Pickle for sampling 10 episodes across different dataset sizes.

Performance prior to patch:

Type of change

Bug fix (non-breaking change which fixes an issue)

Checklist:

I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
I have run pytest -v and no errors are present.
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I solved any possible warnings that pytest -v has generated that are related to my code to the best of my knowledge.
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

younik

Not change requests, just comments

younik · 2023-08-02T10:23:12Z

minari/dataset/minari_storage.py

-            file.attrs.modify("total_episodes", last_episode_id + len(buffer))
-            file.attrs.modify(
-                "total_steps", file.attrs["total_steps"] + additional_steps
-            )
+            self._total_steps = file.attrs["total_steps"] + additional_steps
+            self._total_episodes = last_episode_id + len(buffer)
+
+            file.attrs.modify("total_episodes", self._total_episodes)
+            file.attrs.modify("total_steps", self._total_steps)


Thanks for these fixes as well. I am wondering whether we need self._total_steps, or whether we can just make the total_steps @property read from the file (and the same for total_episodes).

balisujohn · Aug 3, 2023

I think keeping it the way it is might save marginally on file reads for some use cases, but it doesn't seem like it would make a big difference either way.
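To make the trade-off in this exchange concrete, here is a rough sketch of the two options. It assumes a dict-like object standing in for the file attributes; `CountingAttrs`, `PropertyStorage`, and `CachedStorage` are hypothetical names and not part of Minari or h5py.

```python
class CountingAttrs(dict):
    """Hypothetical stand-in for file attributes that counts item reads."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.reads = 0

    def __getitem__(self, key):
        self.reads += 1
        return super().__getitem__(key)


class PropertyStorage:
    """Option 1: the property reads from the (mock) file on every access."""

    def __init__(self, attrs):
        self._attrs = attrs

    @property
    def total_steps(self):
        return self._attrs["total_steps"]


class CachedStorage:
    """Option 2: read once, keep a copy, and update both on writes."""

    def __init__(self, attrs):
        self._attrs = attrs
        self._total_steps = attrs["total_steps"]

    @property
    def total_steps(self):
        return self._total_steps

    def add_steps(self, additional_steps):
        self._total_steps += additional_steps
        self._attrs["total_steps"] = self._total_steps


property_attrs = CountingAttrs(total_steps=100)
property_storage = PropertyStorage(property_attrs)
for _ in range(3):
    property_storage.total_steps  # one attribute read per access

cached_attrs = CountingAttrs(total_steps=100)
cached_storage = CachedStorage(cached_attrs)
for _ in range(3):
    cached_storage.total_steps  # served from the cached copy
cached_storage.add_steps(5)
```

The property-only variant has a single source of truth, while the cached variant saves reads at the cost of keeping two values in sync.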

younik · 2023-08-02T10:29:06Z

minari/dataset/minari_dataset.py

+            total_steps = sum(
+                self._data.apply(
+                    lambda episode: episode["total_timesteps"],
+                    episode_indices=episode_indices,
+                )
+            )


This also means that creating subsets of the dataset can be very slow.
Previously, I used lazy initialization for the total_steps attribute; we removed it to include total_steps in the MinariDatasetSpec.

balisujohn · Aug 3, 2023

Maybe in a future PR we can consider having metadata that can be loaded without loading the corresponding episode, though I'm not strongly opinionated either way at this point.
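For reference, the lazy initialization mentioned in this thread could look roughly like the sketch below. It is an assumption-laden illustration, not the earlier Minari implementation: `ToySubset` and its `scans` counter are hypothetical, and `functools.cached_property` is just one way to defer the computation.

```python
from functools import cached_property
from typing import List


class ToySubset:
    """Hypothetical dataset subset; not the MinariDataset API."""

    def __init__(self, episode_lengths: List[int], episode_indices: List[int]) -> None:
        self._episode_lengths = episode_lengths
        self._episode_indices = episode_indices
        # Counts how many times the episodes were scanned, for illustration only.
        self.scans = 0

    @cached_property
    def total_steps(self) -> int:
        # Computed on first access only, then stored on the instance.
        self.scans += 1
        return sum(self._episode_lengths[i] for i in self._episode_indices)


subset = ToySubset([10, 20, 30, 40], episode_indices=[1, 3])
scans_after_creation = subset.scans  # creating the subset scans nothing
first_total = subset.total_steps     # first access scans the selected episodes
second_total = subset.total_steps    # later accesses reuse the stored value
```

With this pattern, creating a subset stays cheap and the scan is paid only if total_steps is actually requested.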

balisujohn added 7 commits July 22, 2023 16:55

attempted to fix list_remote_datasets slowdown

15336ff

tester.py draft

0d07d90

Merge branch 'main' into test-profiling

a7d3f5f

patch to speed up sampling from a minari dataset, MinariStorage total_steps tests coverage and bugfix

4c0f466

some polish, removed tester file

2c9c0ab

removed print statements

72d379d

reverted hosting.py, change pending in seperate PR

a5d9e7b

balisujohn requested review from younik and rodrigodelazcano August 1, 2023 00:46

younik reviewed Aug 2, 2023

younik approved these changes Aug 3, 2023

balisujohn merged commit c0669fc into Farama-Foundation:main Aug 10, 2023
10 checks passed
