
Improve validation performance by avoiding deep copy #784

Merged


eslavich
Contributor

When we fixed caching of schema objects in #682, we had to continue returning unique copies of those objects from load_schema, since a certain package (ahem, jwst) was mutating some schemas and thereby corrupting the cache. That copy isn't necessary within asdf, since it doesn't mess with schemas once they're loaded, but at the time I didn't think to skip the copy step internally.
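
The cache-corruption problem described above can be illustrated with a minimal sketch. This is not asdf's actual code: `_load_cached`, `load_with_copy`, and the example URL are hypothetical stand-ins, and the schema is a toy dict rather than a parsed file.

```python
import copy
import functools


@functools.lru_cache(maxsize=None)
def _load_cached(url):
    # Hypothetical stand-in for reading and parsing a schema file.
    return {"type": "object", "properties": {"name": {"type": "string"}}}


def load_with_copy(url):
    # Public-style entry point: every caller gets its own deep copy.
    return copy.deepcopy(_load_cached(url))


URL = "http://example.com/schema"  # assumed, for illustration only

# A caller that mutates the shared cached object changes it for everyone.
shared = _load_cached(URL)
shared["properties"]["injected"] = {"type": "number"}
corrupted = "injected" in _load_cached(URL)["properties"]

_load_cached.cache_clear()

# A caller that mutates its own copy leaves the cached object untouched.
mine = load_with_copy(URL)
mine["properties"]["injected"] = {"type": "number"}
protected = "injected" not in _load_cached(URL)["properties"]

print(corrupted, protected)
```

The deep copy is what keeps a mutating caller from affecting later loads, which is why the public function keeps it.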

It turns out that copy adds a significant performance penalty in the validator, where schemas are loaded by tag for every tagged object in an asdf file. For files with complex trees, making deep copies about doubles the read time! So, this PR changes internal calls to load_schema to _load_schema_cached, which dodges the copy.
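
A rough sketch of the split between the copying public function and the non-copying internal one follows. The names `load_schema` and `_load_schema_cached` come from this PR, but the bodies, the single-argument signatures, the `_SCHEMAS` table, the example tag, and `validate_tree` are simplified assumptions and not the real asdf implementation.

```python
import copy
import functools

# Assumed toy schema store keyed by tag; real schemas are loaded from files.
_SCHEMAS = {
    "tag:example.org/thing-1.0.0": {
        "type": "object",
        "properties": {"value": {"type": "number"}},
    }
}


@functools.lru_cache(maxsize=None)
def _load_schema_cached(url):
    # Internal path: returns the shared cached object, no copy.
    return _SCHEMAS[url]


def load_schema(url):
    # Public path: returns a unique copy so external callers may mutate it.
    return copy.deepcopy(_load_schema_cached(url))


def validate_tree(tagged_nodes):
    # Toy validator: looks up the schema by tag for every node and only
    # reads from it, so the shared object is safe to use here.
    checked = 0
    for tag, node in tagged_nodes:
        schema = _load_schema_cached(tag)
        if schema["type"] == "object" and isinstance(node, dict):
            checked += 1
    return checked


TAG = "tag:example.org/thing-1.0.0"
nodes = [(TAG, {"value": i}) for i in range(1000)]
checked = validate_tree(nodes)
same_object = _load_schema_cached(TAG) is _load_schema_cached(TAG)
distinct_copies = load_schema(TAG) is not load_schema(TAG)
```

Because the internal caller never writes to the schema, sharing one object across all lookups is safe there.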

I tested this with the asdf file described here, and on my laptop the read time decreased from an average of 11.6 seconds to 5.7 seconds.

@eslavich eslavich added this to the 2.6.1 milestone Apr 27, 2020
@codecov

codecov bot commented Apr 27, 2020

Codecov Report

Merging #784 into master will not change coverage.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #784   +/-   ##
=======================================
  Coverage   93.87%   93.87%           
=======================================
  Files          43       43           
  Lines        5041     5041           
=======================================
  Hits         4732     4732           
  Misses        309      309           
Impacted Files    Coverage Δ
asdf/asdf.py      92.79% <100.00%> (ø)
asdf/schema.py    94.05% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 37ab3d4...6aa1d96.

@eslavich eslavich added the bug label Apr 27, 2020
@jdavies-st
Contributor

🎉

Ed, can you file an issue over at jwst recommending that it not mutate the schema? I believe it was in exp_to_source, though it is possible it is doing it at a lower level in datamodels. I don't recall the details you found when investigating #682.

The fundamental difference between the two packages is that asdf uses schemas only to serialize and deserialize tagged objects, whereas jwst uses schemas to validate user data, so its purpose is somewhat different. That said, jwst should be doing things correctly by asdf, especially if there is a performance penalty for not doing so.

@perrygreenfield
Contributor

Can you clarify whether this requires JWST to make an explicit copy of the schema, or whether they are calling load_schema directly and thus still forcing the copy?

@eslavich
Contributor Author

JWST is calling load_schema directly. That happens here:

https://github.com/spacetelescope/jwst/blob/master/jwst/datamodels/model_base.py#L113

So I think this is a have-our-cake-and-eat-it-too situation. JWST shouldn't see any performance impact from the copy since it only calls load_schema once per file, whereas asdf calls the function 10,004 times when reading my complex test file.
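
To get a feel for why the call count matters, here is a small micro-benchmark sketch. The schema is an invented nested dict and the two helper functions are hypothetical; only the call count of 10,004 comes from the comment above. Absolute timings depend on the machine and on schema size, so no expected numbers are given.

```python
import copy
import timeit

# Invented stand-in schema; real asdf schemas may be larger or smaller.
schema = {
    "type": "object",
    "properties": {
        f"field_{i}": {"type": "array", "items": {"type": "number"}}
        for i in range(20)
    },
}

N_CALLS = 10_004  # number of load calls reported for the complex test file


def with_copy():
    # Models a load that deep-copies the cached schema on every call.
    return copy.deepcopy(schema)


def without_copy():
    # Models a load that returns the cached schema object directly.
    return schema


t_copy = timeit.timeit(with_copy, number=N_CALLS)
t_plain = timeit.timeit(without_copy, number=N_CALLS)

print(f"with deep copy:    {t_copy:.4f} s for {N_CALLS} calls")
print(f"without deep copy: {t_plain:.4f} s for {N_CALLS} calls")
```

A single copy per file is negligible, but the cost scales with the number of calls, which is where the internal validator path differs from an external caller.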

Contributor

@perrygreenfield perrygreenfield left a comment


Yes, it is right to make mutants feel more pain.

@eslavich eslavich force-pushed the eslavich-improve-validation-performance branch from 3f98289 to 6aa1d96 Compare May 5, 2020 19:17
@eslavich eslavich merged commit 67ea697 into asdf-format:master May 5, 2020
@eslavich eslavich deleted the eslavich-improve-validation-performance branch May 5, 2020 19:43
@eslavich eslavich modified the milestones: 2.6.1, 2.7.0 Jul 16, 2020