-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve validation performance by avoiding deep copy #784
Improve validation performance by avoiding deep copy #784
Conversation
Codecov Report
@@ Coverage Diff @@
## master #784 +/- ##
=======================================
Coverage 93.87% 93.87%
=======================================
Files 43 43
Lines 5041 5041
=======================================
Hits 4732 4732
Misses 309 309
Continue to review full report at Codecov.
|
🎉 Ed, can you file an issue over at The fundamental issue between the 2 packages is that |
Can you clarify if this requires JWST to do an explicit copy of the schema, or are they calling load_schema directly thus still forcing the copy? |
JWST is calling load_schema directly, that happens here: https://github.com/spacetelescope/jwst/blob/master/jwst/datamodels/model_base.py#L113 So I think this is a have-our-cake-and-eat-it-too situation. JWST shouldn't see any performance impact from the copy since it only calls load_schema once per file, whereas asdf calls the function 10,004 times when reading my complex test file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is right to make mutants feel more pain.
3f98289
to
6aa1d96
Compare
When we fixed caching of schema objects in #682, we had to continue returning unique copies of those objects from
load_schema
, since a certain package (ahem, jwst) was mutating some schemas and thereby corrupting the cache. That copy isn't necessary within asdf, since it doesn't mess with schemas once they're loaded, but at the time I didn't think to skip the copy step internally.It turns out that copy adds a significant performance penalty in the validator, where schemas are loaded by tag for every tagged object in an asdf file. For files with complex trees, making deep copies about doubles the read time! So, this PR changes internal calls to
load_schema
to_load_schema_cached
, which dodges the copy.I tested this with the asdf file described here and on my laptop, the read time decreases from an average of 11.6 seconds to 5.7 seconds.