Improve validation performance by avoiding deep copy #784

eslavich · 2020-04-27T20:52:19Z

When we fixed caching of schema objects in #682, we had to continue returning unique copies of those objects from load_schema, since a certain package (ahem, jwst) was mutating some schemas and thereby corrupting the cache. That copy isn't necessary within asdf, since it doesn't mess with schemas once they're loaded, but at the time I didn't think to skip the copy step internally.
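To illustrate the cache-corruption problem described above, here is a minimal sketch that does not use asdf at all. The `load_cached` function, the URL, and the schema contents are hypothetical stand-ins; the only assumption is that a cached loader hands the same mutable object to every caller.

```python
import functools


@functools.lru_cache(maxsize=None)
def load_cached(url):
    # Hypothetical stand-in for reading and parsing a schema file.
    return {"id": url, "properties": {"name": {"type": "string"}}}


url = "http://example.com/schemas/thing"  # made-up URL for illustration

# A caller mutates the schema it received, believing it owns it.
first = load_cached(url)
first["properties"]["extra"] = {"type": "integer"}

# A later caller gets the very same object, mutation included.
second = load_cached(url)
leaked = "extra" in second["properties"]
print(leaked)
```

Because both calls return the same dictionary, the second caller sees the first caller's change, which is the kind of corruption a defensive copy guards against.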

It turns out that copy adds a significant performance penalty in the validator, where schemas are loaded by tag for every tagged object in an asdf file. For files with complex trees, making deep copies about doubles the read time! So, this PR changes internal calls to load_schema to _load_schema_cached, which dodges the copy.

I tested this with the asdf file described here, and on my laptop the read time decreased from an average of 11.6 seconds to 5.7 seconds.

codecov · 2020-04-27T21:41:37Z

Codecov Report

Merging #784 into master will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #784   +/-   ##
=======================================
  Coverage   93.87%   93.87%           
=======================================
  Files          43       43           
  Lines        5041     5041           
=======================================
  Hits         4732     4732           
  Misses        309      309

Impacted Files	Coverage Δ
asdf/asdf.py	`92.79% <100.00%> (ø)`
asdf/schema.py	`94.05% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 37ab3d4...6aa1d96. Read the comment docs.

jdavies-st · 2020-04-28T14:43:51Z

🎉

Ed, can you file an issue over at jwst recommending not mutating the schema? I believe it was in exp_to_source, though it is possible it is doing it at a lower level in datamodels. I don't recall the details you found when investigating #682.

The fundamental issue between the 2 packages is that asdf uses schemas to serialize/deserialize tagged objects. That's the only purpose. But jwst uses schemas to validate user data, so there's a bit of a different purpose in some sense. That said, jwst should be doing things correctly by asdf, especially if it is a performance penalty not to be doing so.

perrygreenfield · 2020-04-28T15:10:50Z

Can you clarify if this requires JWST to do an explicit copy of the schema, or are they calling load_schema directly thus still forcing the copy?

eslavich · 2020-04-28T15:16:06Z

JWST is calling load_schema directly, that happens here:

https://github.com/spacetelescope/jwst/blob/master/jwst/datamodels/model_base.py#L113

So I think this is a have-our-cake-and-eat-it-too situation. JWST shouldn't see any performance impact from the copy since it only calls load_schema once per file, whereas asdf calls the function 10,004 times when reading my complex test file.

perrygreenfield

Yes, it is right to make mutants feel more pain.

eslavich added this to the 2.6.1 milestone Apr 27, 2020

eslavich requested review from perrygreenfield and nden April 27, 2020 20:52

eslavich added the bug label Apr 27, 2020

perrygreenfield approved these changes Apr 28, 2020

Ed Slavich added 2 commits May 5, 2020 15:16

Improve validation performance by avoiding deep copy when loading schemas

c293527

Add CHANGES.rst entry

6aa1d96

eslavich force-pushed the eslavich-improve-validation-performance branch from 3f98289 to 6aa1d96 Compare May 5, 2020 19:17

eslavich merged commit 67ea697 into asdf-format:master May 5, 2020

eslavich deleted the eslavich-improve-validation-performance branch May 5, 2020 19:43

eslavich modified the milestones: 2.6.1, 2.7.0 Jul 16, 2020
