
JSON Schema validation performance #159

Open
mdboom opened this issue Jul 27, 2015 · 5 comments
mdboom (Contributor) commented Jul 27, 2015

JSON schema validation currently takes 60% of load time on a benchmark with 10000 arrays.

Unlike the YAML parsing, where there was a lot of low-hanging fruit, things are trickier on the JSON schema side. It's hard to see how to improve the performance of jsonschema without obliterating its really clean architecture.

Relatedly, I experimented with adding a flag to turn off JSON schema validation. The problem is that many of the type converters then become brittle in interesting ways, because they don't do the validation that the JSON schema currently does for them. Duplicating that work seems like a way to only make things slower, so I'm not sure what to do there.
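For a sense of what that experiment looked like, here is a rough sketch (not the actual pyasdf code; the function and flag names are made up): thread a switch down to where jsonschema gets invoked and short-circuit it there.

import jsonschema

def validate_tree(tree, schema, validation_enabled=True):
    # Hypothetical helper, just to illustrate the idea of a validation flag.
    if not validation_enabled:
        # Skip jsonschema entirely; the type converters then have to cope
        # with whatever is in the tree on their own.
        return
    jsonschema.validate(tree, schema)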

embray (Contributor) commented Jul 28, 2015

Within the jsonschema module itself, have you been able to identify any specific problem areas?
What's a good benchmark I could use to try that out myself?

mdboom (Contributor, Author) commented Jul 28, 2015

I've been using this as a benchmark:

from pyasdf import AsdfFile
import numpy as np
import time

for i in (10, 100, 1000, 10000):
    # Build a tree containing i random 32x32 arrays
    arrays = [np.random.rand(32, 32) for _ in range(i)]
    tree = {'arrays': arrays}

    # Write the file out, then time how long it takes to open it again
    AsdfFile(tree).write_to('test.asdf')

    t = time.time()
    with AsdfFile.open('test.asdf') as ff:
        pass
    t0 = time.time() - t
    print(i, t0)

If you cProfile just the with AsdfFile.open('test.asdf') as ff: line, that's where the 60% figure for time spent in jsonschema validation comes from. I haven't really found any obvious problem areas yet...
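For reference, here is roughly how to reproduce that measurement with the stdlib cProfile/pstats (a minimal sketch; it assumes test.asdf from the benchmark above already exists):

import cProfile
import pstats

from pyasdf import AsdfFile

profiler = cProfile.Profile()
profiler.enable()
with AsdfFile.open('test.asdf') as ff:
    pass
profiler.disable()

# Sort by cumulative time to see how much is spent under jsonschema
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)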

embray (Contributor) commented Jul 28, 2015

I'll play around with it. I'm also curious about what code breaks when JSON schema isn't used for validation. I think some duplicate validation is probably okay: if jsonschema slows things down that much, a little extra checking won't make a big difference there, and it probably also won't add nearly as much overhead in the cases where jsonschema is turned off. But it's a question of how much redundancy we're talking about...

mdboom (Contributor, Author) commented Jul 28, 2015

Well, it's things like not checking whether the dtype is one of the acceptable values, or whether the source is an int or a string (but not a float). Lots of little stuff like that. But you're right, maybe I shouldn't assume it would be too bad.
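To illustrate the kind of checks a converter would have to duplicate (a sketch only, not pyasdf's actual converter code; the allowed dtype list below is just an example):

ALLOWED_DTYPES = {'int8', 'int16', 'int32', 'int64', 'float32', 'float64'}

def check_ndarray_node(node):
    # The schema normally guarantees these properties; without it, the
    # converter has to re-check them itself.
    datatype = node.get('datatype')
    if datatype not in ALLOWED_DTYPES:
        raise ValueError("unsupported datatype: {!r}".format(datatype))

    # source may be an int (internal block index) or a string (external
    # file), but not a float.
    source = node.get('source')
    if not isinstance(source, (int, str)):
        raise ValueError("source must be an int or str, got {!r}".format(source))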

embray (Contributor) commented Jul 28, 2015

I figured for the most part you're on your own if you disable validation, and would only do so on known-good files. But I agree it makes sense to catch some things in the software so we don't just end up with unhelpful uncaught exceptions.
