
JSON Schema validation performance #159

Open
mdboom opened this issue Jul 27, 2015 · 5 comments
mdboom (Contributor) commented Jul 27, 2015

JSON schema validation currently takes 60% of load time on a benchmark with 10000 arrays.

Unlike the YAML parsing, where there was a lot of low-hanging fruit, things are trickier on the JSON schema side. It's hard to see how to improve the performance of jsonschema without obliterating its really clean architecture.

Relatedly, I experimented with adding a flag to turn off JSON schema validation. The problem is that many of the type converters then become brittle in interesting ways, because they don't do the validation that the JSON schema currently does for them. Duplicating that work seems like a way to only make things slower, so I'm not sure what to do there.
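For a sense of what that experiment looked like, here is a rough sketch (not the actual pyasdf code; the function and flag names are made up): thread a switch down to where jsonschema gets invoked and short-circuit it there.

import jsonschema

def validate_tree(tree, schema, validation_enabled=True):
    # Hypothetical helper, just to illustrate the idea of a validation flag.
    if not validation_enabled:
        # Skip jsonschema entirely; the type converters then have to cope
        # with whatever is in the tree on their own.
        return
    jsonschema.validate(tree, schema)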

embray (Contributor) commented Jul 28, 2015

Within the jsonschema module itself, have you been able to identify any specific problem areas?
What's a good benchmark I could use to try that out myself?

mdboom (Contributor, Author) commented Jul 28, 2015

I've been using this as a benchmark:

from pyasdf import AsdfFile
import numpy as np
import time

for i in (10, 100, 1000, 10000):
    # Build a tree containing i random 32x32 arrays
    arrays = [np.random.rand(32, 32) for _ in range(i)]
    tree = {'arrays': arrays}

    # Write the file out, then time how long it takes to open it again
    AsdfFile(tree).write_to('test.asdf')

    t = time.time()
    with AsdfFile.open('test.asdf') as ff:
        pass
    t0 = time.time() - t
    print(i, t0)

If you cProfile just the with AsdfFile.open('test.asdf') as ff: line, that's where the 60% figure for time spent in jsonschema validation comes from. I haven't really found any obvious problem areas yet...
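For reference, here is roughly how to reproduce that measurement with the stdlib cProfile/pstats (a minimal sketch; it assumes test.asdf from the benchmark above already exists):

import cProfile
import pstats

from pyasdf import AsdfFile

profiler = cProfile.Profile()
profiler.enable()
with AsdfFile.open('test.asdf') as ff:
    pass
profiler.disable()

# Sort by cumulative time to see how much is spent under jsonschema
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)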

embray (Contributor) commented Jul 28, 2015

I'll play around with it. I'm also curious about what code breaks when JSON schema isn't used for validation. I think some duplicate validation is probably okay: if jsonschema slows things down that much, a little extra checking won't make a big difference there, and it probably also won't add nearly as much overhead in the cases where jsonschema is turned off. But it's a question of how much redundancy we're talking about...

mdboom (Contributor, Author) commented Jul 28, 2015

Well, it's things like not checking whether the dtype is one of the acceptable values, or whether the source is an int or a string (but not a float). Lots of little stuff like that. But you're right, maybe I shouldn't assume it would be too bad.
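To illustrate the kind of checks a converter would have to duplicate (a sketch only, not pyasdf's actual converter code; the allowed dtype list below is just an example):

ALLOWED_DTYPES = {'int8', 'int16', 'int32', 'int64', 'float32', 'float64'}

def check_ndarray_node(node):
    # The schema normally guarantees these properties; without it, the
    # converter has to re-check them itself.
    datatype = node.get('datatype')
    if datatype not in ALLOWED_DTYPES:
        raise ValueError("unsupported datatype: {!r}".format(datatype))

    # source may be an int (internal block index) or a string (external
    # file), but not a float.
    source = node.get('source')
    if not isinstance(source, (int, str)):
        raise ValueError("source must be an int or str, got {!r}".format(source))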

embray (Contributor) commented Jul 28, 2015

I figured for the most part you're on your own if you disable validation, and would only do so on known-good files. But I agree it makes sense to catch some things in the software so we don't just end up with unhelpful uncaught exceptions.
