Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lazy_load parameter to open() #573

Merged
merged 8 commits into from
Oct 24, 2018
Merged

Conversation

vmarkovtsev
Copy link
Contributor

This allows loading ASDF files, then closing the file
and still retaining full access to the tree objects.

Signed-off-by: Vadim Markovtsev vadim@sourced.tech

This allows loading ASDF files, then closing the file
and still retaining full access to the tree objects.

Signed-off-by: Vadim Markovtsev <vadim@sourced.tech>
@stsci-bot
Copy link

stsci-bot bot commented Oct 24, 2018

Hi there @vmarkovtsev 👋 - thanks for the pull request! I'm just a friendly 🤖 that checks for issues related to the changelog. I help make sure that this pull request is milestoned and labeled correctly. This is mainly intended for the maintainers, so if you are not a maintainer you can ignore this, and a maintainer will let you know if any action is required on your part 😃.

Everything looks good from my point of view! 👍

If there are any issues with this message, please report them here.

@coveralls
Copy link

coveralls commented Oct 24, 2018

Coverage Status

Coverage decreased (-0.03%) to 93.647% when pulling 62561d3 on vmarkovtsev:master into c2fbf33 on spacetelescope:master.

@drdavella
Copy link
Contributor

drdavella commented Oct 24, 2018

Hi @vmarkovtsev thanks for the PR!

We explicitly disallowed access to the tree for a closed ASDF file handle here: #407.

But it seems like this is intended to apply to the case where you have opened an ASDF file using another open file handle, is that correct? The ASDF file handle itself remains open, but you want to be able to close the original file handle.

This PR does seem to fill a gap: you want to retain memory mapped access to a file that existed outside the scope of the ASDF file handle. I assume that using the existing copy_arrays option would be too expensive.

It's probably really none of my business, but what's wrong with keeping the file handle open? My concern here is that ASDF may be enabling resource leakage since the underlying file is not actually entirely closed: the ASDF object now holds potentially many memory mapped handles to the original file. I assume garbage collection can handle this properly, but I just want to make sure we're not introducing a problem.

I'm just thinking out loud and trying to make sure I understand the intent of the PR and any side effects it might present.

@vmarkovtsev
Copy link
Contributor Author

So this PR allows to load all the tree directly into memory so that it no longer requires the file backup at all.

The reason why we need this is that our arrays are relatively small and always fit into RAM on one side, and it is hard for us to maintain an open file handle throughout the life of the upstream object on the other. We do leverage mmap() with huge arrays sometimes though.

The team continues to hit the same problem over and over again: we load the asdf tree, pass the arrays from the tree into faraway parts of the codebase, and then everything breaks at random spots because the file is already closed and it is impossible to _make_array(). Folks are used to the semantics of with file: pickle.load and with file: pickle.dump. Bookkeeping of the open files is hard and greatly complicates the things. Sometimes, the loading code cannot borrow the file descriptor since it comes from the outside.

@vmarkovtsev
Copy link
Contributor Author

We store machine learning models in ASDF. We love ASDF because it is way more efficient than pickles, way more flexible than hdf5 and is not nailed to Python in theory.

@drdavella
Copy link
Contributor

So this PR allows to load all the tree directly into memory so that it no longer requires the file backup at all.

Is there a reason the copy_arrays argument doesn't meet your needs, then?

We love ASDF because it is way more efficient than pickles, way more flexible than hdf5 and is not nailed to Python in theory.

Maybe we should add this to the user-testimonial section of our documentation 😉

@vmarkovtsev
Copy link
Contributor Author

I can write a long success story about ASDF fwiw 😄

The copy_arrays argument does a different thing: it enables or disables mmap()-ing of the arrays. So if copy_arrays is False, we mmap() and never load the whole array into the memory (cool feature btw). If it is True, we read arrays from disk into memory as soon as it is required - that is, lazily.

@perrygreenfield
Copy link
Contributor

@vmarkovtsev Please do if you have the time. We are in the process of trying to spread its adoption within astronomy and any support would be quite useful in that regard. Thanks very much for the words of encouragement.

@drdavella
Copy link
Contributor

That makes sense. In light of that, this seems like a reasonable change. Just let me review a bit more and then we should be able to move ahead with it.

@drdavella drdavella added this to the v2.2.0 milestone Oct 24, 2018
asdf/asdf.py Outdated
When `True` and the read file is seekable, underlying data arrays
will be materialized when they are used for the first time.
This implies that the file should stay open during the lifetime
of the tree.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to nitpick, but I find this a little confusing. Instead I might say the following:

"When True and the underlying file handle is seekable, data arrays will only be loaded lazily: i.e. when they are accessed for the first time. In this case the underlying file must stay open during the lifetime of the tree. Setting to False causes all data arrays to be loaded up front, which means that they can be accessed even after the underlying file is closed."

I'm happy to make this change myself, BTW, but just want to make sure it makes sense to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So much better!

Copy link
Contributor

@drdavella drdavella left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a changelog entry to the 2.2.0 section for this as well?

@vmarkovtsev
Copy link
Contributor Author

@drdavella All the review comments have been addressed, waiting for the CI to pass.

@vmarkovtsev
Copy link
Contributor Author

Done

asdf/asdf.py Outdated
@@ -90,7 +90,18 @@ def __init__(self, tree=None, uri=None, extensions=None, version=None,

copy_arrays : bool, optional
When `False`, when reading files, attempt to memmap underlying data
arrays when possible.
arrays when possible. Please note that the file must be opened
in "rw" mode; "r" will result a SIGSEGV.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but it seems like we should have better error handling for this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely. I had to debug segmentation faults otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is really useful to know, but I think this comment opens a different can of worms. Not all use cases involve passing a file handle to asdf.open, so I think this comment needs to be a little more specific to clarify. Maybe we should leave it for another PR and put it elsewhere in the documentation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the current behavior:

asdf.open("/path", mode="r")  # SIGSEGV
asdf.open("/path", mode="rw")  # OK
with open("/path", mode="r") as file:
    asdf.open(file)  # SIGSEGV
with open("/path", mode="rw") as file:
    asdf.open(file)  # OK

So the comment is always effective as long as the option itself is effective (that is, no streaming and non-seekable files).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I am surprised that this has not been reported yet. In my case I always use compressed blocks so never had that SIGSEGV before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yeah, I should have added that the "rw" requirement is only for files with uncompressed arrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think I'm confused by this and will need to investigate further. Your example above suggests that opening any file without explicitly given a mode of 'rw' leads to segfault, which doesn't make sense since this is most common use case, both in the test suite, and in our internal code base.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with uncompressed arrays, that's the most common use case since it's the default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but can we leave this comment out of this PR? I'm tracking it in #574 but I don't want to hold this PR up while I figure out what the issue is here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

CHANGES.rst Outdated
numpy arrays are materialized. Thus it becomes safe to close the file
and continue using ``AsdfFile.tree``. However, `copy_arrays` parameter
is still effective and the active memory maps may still require the file
to stay open in case `copy_arrays` is ``False``. [#573]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for another nit, but we should have double backticks around lazy_load and copy_arrays here so they render properly on github.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@drdavella drdavella merged commit efca103 into asdf-format:master Oct 24, 2018
@drdavella
Copy link
Contributor

Thanks @vmarkovtsev!

@vmarkovtsev vmarkovtsev mentioned this pull request Nov 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants