
Converter block storage #1508

Merged: 34 commits into asdf-format:main from converter_block_storage on May 2, 2023

Conversation

@braingram (Contributor) commented Mar 28, 2023

This PR adds block storage support to Converters.

Relevant docs are here: https://asdf--1508.org.readthedocs.build/en/1508/asdf/extending/converters.html#block-storage

The key contribution of this PR is allowing Converters to read/write blocks during from_yaml_tree/to_yaml_tree. This is accomplished by (privately) providing the BlockManager to the SerializationContext (which is passed as an argument to Converter methods) and by introducing a few new public methods:

  • Converter.reserve_blocks
  • SerializationContext.load_block (EDIT: removed in 92a786a)
  • SerializationContext.get_block_data_callback (EDIT: added in 92a786a)
  • SerializationContext.assign_block_key
  • SerializationContext.find_block_index
  • asdf.util.BlockKey (EDIT: added in a7193a1)
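As a concrete illustration of the flow these methods enable, here is a minimal sketch of a converter that stores its payload in a block. The stub context, the key type, and the exact method signatures are assumptions for illustration only; the real SerializationContext is provided by asdf:

```python
import numpy as np

class StubSerializationContext:
    """Stand-in for asdf's SerializationContext; signatures here are
    illustrative, not the actual asdf API."""

    def __init__(self):
        self._blocks = []

    def find_block_index(self, key, data_callback):
        # register a block for 'key' and return its index in the file
        self._blocks.append((key, data_callback))
        return len(self._blocks) - 1

    def get_block_data_callback(self, index):
        # return a callable that lazily produces the block's data
        return self._blocks[index][1]


class BlockData:
    """Toy object whose payload lives in an ASDF block."""

    def __init__(self, payload):
        self.payload = payload  # uint8 ndarray


class BlockDataConverter:
    tags = ["asdf://example.com/tags/block_data-1.0.0"]  # hypothetical tag
    types = [BlockData]

    def to_yaml_tree(self, obj, tag, ctx):
        # hand the context a callback so data is only read when the
        # block is actually written
        index = ctx.find_block_index(obj, lambda: obj.payload)
        return {"block_index": index}

    def from_yaml_tree(self, node, tag, ctx):
        # defer loading: the callback reads the block on demand
        cb = ctx.get_block_data_callback(node["block_index"])
        return BlockData(cb())


ctx = StubSerializationContext()
conv = BlockDataConverter()
obj = BlockData(np.arange(4, dtype="uint8"))
tree = conv.to_yaml_tree(obj, conv.tags[0], ctx)
restored = conv.from_yaml_tree(tree, conv.tags[0], ctx)
```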

The docs update describes the public API, and several tests were added to cover the new methods and features. This is also mostly compatible with the asdf_zarr chunking prototype, with one exception (the storage settings, described below).

There was one major change required to get this to work. With this PR, when AsdfFile.write_to is called, a new AsdfFile object is created to be used during the write. This allows the write to modify blocks (in this case, overwriting data callbacks used to defer reading data until a block is ready to be written) without affecting blocks created during the initial read of a file. This change required updates to many tests that expected the blocks to change after write_to.

A few minor changes include:

  • Fixes Calling update after write_to fails #1505 (and adds regression test test_write_to_before_update)
  • Don't limit IntegerType _value_cache per ctx/AsdfFile
  • Ignore the ERA001 rule, which is supposed to remove commented code; it failed several times (removing some but not all of the commented code), so I disabled it

There are a few things missing from or related to this PR that should be discussed and likely included as follow-up PRs.

  • storage settings. I initially included these, modeled after the existing array settings (set_block_storage == set_array_storage, etc.). I removed them because I think a more flexible and simpler interface would be preferred. The asdf_zarr prototype used these settings to determine whether a zarray should be stored as internal blocks or left as an external zarray. However, this required adding additional settings to the block converter and functions to both AsdfFile and SerializationContext to cover all the places where these settings are used. That did not seem ideal, and it did not allow for settings beyond those used for arrays (setting storage to 'internal', 'external', 'inline', etc.). I have a vague idea that this might be a useful way to provide additional options to configure a converter, so discussing and deciding on an API seems worthwhile (even if it ends up being a copy of what's used for arrays).
  • untangle uses of converter methods and reuse of blocks. Currently (and with this PR) there are a lot of multi-use objects in the ASDF internals. These made the updates in this PR difficult and often brittle, as a small change to Block (for example) could change reading, writing, compression, etc. I spent some time trying to break apart some of the main offenders (mainly the block management), but none of that resulted in a simpler PR. A few other examples:
    • from_yaml_tree is used for reading, fill_defaults, and remove_defaults, and is unaware of the context, making it difficult (for example) to know whether blocks should be created
    • to_yaml_tree is used for writing, validate, fill_defaults, remove_defaults, and twice during update (once to judge tree size)
    • blocks are used for storing settings (compression, storage), sometimes data (sometimes a file descriptor), file state (offset, allocated size), and for both reading and writing
      I will open issues for the above if reviews do not lead to related changes or fixes.

@github-actions github-actions bot added this to the 3.0.0 milestone Mar 28, 2023
@braingram braingram marked this pull request as ready for review March 31, 2023 19:41
@braingram braingram requested a review from a team as a code owner March 31, 2023 19:41
@braingram (Contributor, Author) commented Mar 31, 2023

Based on the discussion during the ASDF tag-up, I replaced load_block (which required holding onto the SerializationContext) with get_block_data_callback to allow lazy loading without keeping a reference to the context, and removed the use of id(obj) in the examples and tests to avoid issues with memory reuse.

Thanks @eslavich and @perrygreenfield

@braingram braingram force-pushed the converter_block_storage branch 2 times, most recently from 08d4e8c to 0729620 Compare April 5, 2023 14:26
@braingram (Contributor, Author):
Apologies in advance for what will probably be a novel.

TL;DR: the asdf_zarr prototype is now up to date with this PR and is ready to be moved to asdf-format. Also, a follow-up PR might allow us to remove the need for reserved blocks.

I've spent a few days trying to sort out a reasonable set of functions to expose the storage options to converters. This would include changes to AsdfFile (where settings are often set by the user), SerializationContext (which is available to the converter), and probably BlockManager, as it currently serves as storage for settings. The settings are complicated by the key/id issue mentioned above. There are a few points I'd like to bring up after looking into this.

First, the way I was using the settings previously in the asdf_zarr prototype is likely not something we should aim to support. I was looking at the array storage to determine whether the zarr chunk store should be included as internal asdf blocks (if the array storage was 'internal') or left as an external store (like a DirectoryStore, if the array storage was 'external'). This conflates options for the Converter with options for the blocks. I rewrote asdf_zarr to instead require conversion of zarr arrays with a to_internal function (without copying data until the ASDF file is written). There might be a wider discussion of whether we want to provide a means of passing options to converters, but I don't think this should be required for chunked array support.

Second, the block management is complicated (as mentioned above), and the BlockKey introduced in this PR seems workable for our current needs. However, I think the following alternative strategy might be better, though it would require substantial rework of the block management. Currently the block keys are required because memory addresses might be reused, as in the following:

af = asdf.open(fn, mode="rw")  # open a file with a custom object 'a' at id(a) == 42
del af['a']                    # this does not delete the blocks associated with 'a'
b = Foo()                      # make a new object; assume python puts it at id(b) == 42
af['b'] = b                    # now the blocks for 'a' are incorrectly assigned to 'b'

The alternative strategy would be to have asdf internally manage keys for each block that hold weakrefs to objects (here is a partial example from some of my exploration). This would allow objects to be garbage collected when they are removed from the tree, and it would be resilient to memory address reuse ('matching' a key involves resolving the weakref, which returns None for a missing object). It also has the benefit of not requiring that converter developers generate their own BlockKey instances. Since Converter.from_yaml_tree returns the object that I think we would want to attach to the blocks, we might be able to have ASDF check which blocks were accessed during from_yaml_tree, take the returned object, and create and assign block keys outside the converter. This should mean that reserve_blocks is no longer needed, and possibly assign_block_key as well (we might still want some way of assigning keys within the converter because, as with the asdf_zarr prototype, a single object might use multiple blocks which may need to be retrieved during to_yaml_tree for an update). I've made a few attempts at getting this to work without rewriting the BlockManager, with no success.
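A minimal sketch of that weakref-based key idea (all names here are hypothetical, not asdf API):

```python
import weakref

class WeakBlockKey:
    """Hypothetical internal key: each block holds a weak reference to
    the object that owns it."""

    def __init__(self, obj):
        self._ref = weakref.ref(obj)

    def matches(self, obj):
        # resolving the weakref returns None once the owner is gone,
        # so a recycled memory address can never match a stale key
        target = self._ref()
        return target is not None and target is obj


class Owner:
    pass

a = Owner()
key_a = WeakBlockKey(a)

b = Owner()
matched_owner = key_a.matches(a)   # the owning object matches
matched_other = key_a.matches(b)   # a different object never matches

del a                              # owner removed from the tree
resolves_after = key_a._ref()      # key now resolves to nothing
```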

@braingram braingram force-pushed the converter_block_storage branch 2 times, most recently from e36a4d0 to 906072f Compare April 11, 2023 16:01
asdf/block.py Outdated
Comment on lines 564 to 566
sctx = ctx._create_serialization_context()
tag = converter.select_tag(node, sctx)
for key in converter.reserve_blocks(node, tag, sctx):
Reviewer (Contributor):
Neither the unit test converters nor the asdf_zarr converter use the serialization context or tag, maybe we ought to skip those? I'm not sure it's safe to hand over the serialization context anyway, since _find_used_blocks is called on the original AsdfFile's block manager, and we don't want to modify that one.

Author (@braingram):

The original thought was that different tag versions might require different block usage. I guess one option might be to pass in the tag but not the serialization context, but this felt at odds with the other parts of the Converter API. The serialization context would still need to be provided to select_tag, so it's possible a converter could mess with blocks at that point as well.

One solution to this might be a more aware serialization context: something that knows whether this is a validation, a write, a read, etc., and only allows modifications that are consistent with that operation (e.g. on validate, don't modify, allocate, or free blocks). I made a brief attempt at something like this but ran into issues with the block manager (it's difficult to interact with it without producing a change in its state).

So some options to address this might be:

  • drop the serialization context (and possibly tag) argument to reserve_blocks
  • leave in the argument(s) and add tests to cover their usage

As there is no current use of the serialization context in reserve_blocks, I'm not sure what test should be written to cover its usage. Are there examples of where this argument is used in other converters?

I'm good with either option at this point. I'm also not seeing SerializationContext listed in our user or developer API documentation, but attempting to add a fix for that in this PR seems like scope creep.

Reviewer (Contributor):

I vote to drop it :)

Author (@braingram):

Dropped the ctx arg here: 93c2eac

asdf/asdf.py Outdated
Comment on lines 1396 to 1399
for k in modified_keys:
if k in self._tree:
del self._tree[k]
self._tree.update(pre_write_tree)
Reviewer (Contributor):

It would be safer, I think, to shallow copy the tree before assigning it to naf._tree, and then reassign modified_keys to deep copies of themselves in that shallow-copied tree. That way the original tree stays unmodified even if there's an error and this code doesn't execute.
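That suggestion can be sketched with plain dicts and the copy module (the tree contents and names here are made up for illustration):

```python
import copy

# Shallow-copy the tree for the write-time AsdfFile, then deep-copy only
# the keys that will be modified. The original tree is never touched,
# even if an error interrupts the write.
original_tree = {"a": {"data": [1, 2, 3]}, "b": {"data": [4, 5, 6]}}
modified_keys = ["a"]

write_tree = copy.copy(original_tree)            # shallow copy
for k in modified_keys:
    write_tree[k] = copy.deepcopy(write_tree[k])  # isolate keys we will change

write_tree["a"]["data"].append(99)               # mutate only the write-time copy
```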

Author (@braingram):

Great idea! Added that here: 0c44742

def __hash__(self):
return self._key

def __eq__(self, other):
Reviewer (Contributor):

We're supposed to check the type of "other" and return NotImplemented if we don't know how to compare it, see https://docs.python.org/3/reference/datamodel.html#object.__eq__
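A minimal sketch of that pattern applied to a key class like the one above (this class is illustrative, not the PR's actual BlockKey):

```python
class BlockKey:
    """Illustrative key wrapping an integer."""

    def __init__(self, key):
        self._key = key

    def __hash__(self):
        return self._key

    def __eq__(self, other):
        # return NotImplemented for types we can't compare so Python
        # can try the reflected operation on the other operand
        if not isinstance(other, BlockKey):
            return NotImplemented
        return self._key == other._key

eq_same = BlockKey(1) == BlockKey(1)
eq_diff = BlockKey(1) == BlockKey(2)
# falls back to identity (False) after both sides return NotImplemented
eq_other_type = BlockKey(1) == 1
```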

Author (@braingram):

Thanks! Added that here: 11159ad

@@ -76,7 +73,7 @@ def to_tree(cls, node, ctx):

array = np.array(words, dtype=np.uint32)
if node._storage == "internal":
cls._value_cache[ctx][abs_value] = array
cls._value_cache[abs_value] = array
Reviewer (Contributor):

I can drop it in the replacement PR for #1475 too

@eslavich (Contributor) left a comment:

I don't know what roman_datamodels is complaining about, I think this looks great

@braingram (Contributor, Author) commented Apr 28, 2023

> I don't know what roman_datamodels is complaining about, I think this looks great

¯\_(ツ)_/¯

I think the roman failure was related to:
spacetelescope/roman_datamodels#160

Which means it should go away with this most recent rebase.

I expect the devdeps jobs to fail because of:
#1532

@CagtayFabry (Contributor):

If I understand correctly, this would solve #895 and possibly enable #1013 ?

I haven't looked too deep into it yet though

@braingram (Contributor, Author):

> If I understand correctly, this would solve #895 and possibly enable #1013?
>
> I haven't looked too deep into it yet though

@CagtayFabry Thanks for taking a look!

I'm not sure this PR solves #895, as the data callbacks for writing ASDF blocks are expected to return ndarrays (of dtype 'uint8'). However, the BlockConverter example in the tests does convert between ndarrays and bytes (using frombuffer and tobytes):
https://github.com/asdf-format/asdf/pull/1508/files#diff-8498843bd39a0657bc330de864a4ead415900e01f5bcdd15a17cb7af98bbee0fR25-R44
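For reference, a minimal sketch of that ndarray/bytes round trip (the payload is made up; frombuffer and tobytes are the numpy calls the test converter uses):

```python
import numpy as np

# A block data callback must return a uint8 ndarray, so a converter
# holding raw bytes can wrap them without copying on write and unwrap
# them again on read.
payload = b"hello asdf blocks"

as_array = np.frombuffer(payload, dtype="uint8")  # zero-copy view for writing
round_trip = as_array.tobytes()                   # back to bytes on read
```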

#1013 isn't addressed by this PR. There are some issues with the block management that make addressing it difficult (mainly that block settings, data, and state are all stored in the same objects). I've been working on updating the block management in ASDF to allow us to move ndarray to a converter (and drop the legacy type system internally). I think that #1013 should be addressable with the rewrite, as it more cleanly separates block options (like storage settings) from other aspects of block management. In addition to the internal changes, the SerializationContext will need to be updated with methods similar to get/set_array_storage, compression, etc.

@CagtayFabry (Contributor):

I see, thank you for explaining @braingram

@braingram (Contributor, Author):

Newly failing weldx CI appears related to a new version of pint:
BAMWelDX/weldx#872

@braingram braingram merged commit 353d213 into asdf-format:main May 2, 2023
@braingram braingram deleted the converter_block_storage branch May 2, 2023 18:49
Labels: development, No backport required, Downstream CI

Successfully merging this pull request may close these issues:

  • Calling update after write_to fails (#1505)

3 participants