Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch from hologram to mashumaro jsonschema #8132

Closed
wants to merge 8 commits into from

Conversation

gshank
Copy link
Contributor

@gshank gshank commented Jul 18, 2023

resolves # 6776

Problem

We want to remove the dependency on hologram, which we have been using for jsonschema generation and validation.

Solution

Switch to using mashumaro jsonschema generation

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

One of the main changes to functionality is that mashumaro uses the "alias" field metadata option to construct both "from_dict" and the jsonschema, so it automatically uses dashes for our many fields which use dashes in yaml and underscores in Python (mostly pre-hook and post-hook in configs and a bunch of fields in dbt_project). In addition, we are now setting a to_dict option to use the dash forms when serializing. This means that quite a bit of our kludges to handle the conversion to and from dashes are no longer necessary, but also means that in a couple of places we have to kludge in the other direction, i.e. from_dict and to_dict will no longer handle or produce the underscore versioned names.

Code to convert to and from pre-hook/post-hook has been removed from pre_serialize and post_serialize methods in model_config.py Code to convert to and from dashes/underscores has been removed from methods in dbtClassMixin. The HyphenatedDbtClassMixin and _hypeneated class variable have been removed since they are no longer required.

All fields that are hyphenated in yaml have add an "alias" field metadata added.

Some extra code has been added to convert pre_hook and post_hook in dbt_project to pre-hook and post-hook, because doing that was specifically enabled by an earlier pull request.

The "register_pattern" method used by hologram has been removed and fields that validated via pattern have been "Annotated" with a Pattern.

The "resource_type" in various node classes had a "restrict" field metadata which was used by hologram to construct jsonschemas. The fields have been changed to a Literal, which results in a "const" field definition in jsonschema, which seems to work fine.

New "json_schema" and "validate" methods have been added to dbtClassMixin to replace the ones that came from hologram.

The "_get_fields" and "_get_field_names" methods that were in hologram and were used by model_config.py have been copied and updated to use field aliases instead of the "field_mapping" dictionaries.

The "Port" field used in the Connection object has been changed back to the original NewType because mashumaro now supports NewType and it was causing problems in json_schema generation.

The PortEncoder class which was used by hologram has been replaced by field Annotations on the "port" field.

The TimeDeltaFieldEncoder and PathEncoder fields don't appear to have been used and have been removed.

We changed the dependency from hologram to jsonschema.

The manifest schema version was bumped to 11 and a schema generated. At this point the generated schema doesn't use field definitions because there is a bug when doing that so it's quite a bit larger, but hopefully that will be addressed before this pr is closed.

@gshank gshank requested review from a team as code owners July 18, 2023 18:56
@gshank gshank requested review from jzhu13, mikealfare, emmyoop and aranke and removed request for a team July 18, 2023 18:56
@cla-bot cla-bot bot added the cla:yes label Jul 18, 2023
@gshank gshank marked this pull request as draft July 18, 2023 18:56
@gshank gshank requested review from peterallenwebb and removed request for jzhu13 and mikealfare July 18, 2023 18:56
@gshank
Copy link
Contributor Author

gshank commented Jul 18, 2023

Note: this requires changes to mashumaro, so it will not work correctly until those changes are released and we can pull them in.

@gshank
Copy link
Contributor Author

gshank commented Jul 18, 2023

mashumaro tickets:

Fatal1ty/mashumaro#126
Fatal1ty/mashumaro#125
Fatal1ty/mashumaro#123

@jtcohen6
Copy link
Contributor

jtcohen6 commented Aug 2, 2023

Looks like all of the mashumaro tickets have been resolved! Are we now just waiting on a new release?

@Fatal1ty
Copy link

Fatal1ty commented Aug 2, 2023

Looks like all of the mashumaro tickets have been resolved! Are we now just waiting on a new release?

Hey @jtcohen6

I've been waiting for feedback from @gshank here but anyway I'm going to make a release tomorrow, so any other issues will be addressed after that.

@gshank
Copy link
Contributor Author

gshank commented Aug 2, 2023

@Fatal1ty I've been on vacation and on other work, so sorry for not getting back to you. Looking forward to the release!

@Fatal1ty
Copy link

Fatal1ty commented Aug 2, 2023

@jtcohen6 @gshank

New release 3.9 is ready!

@codecov
Copy link

codecov bot commented Aug 3, 2023

Codecov Report

Merging #8132 (21210bd) into main (991618d) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #8132      +/-   ##
==========================================
+ Coverage   86.23%   86.25%   +0.01%     
==========================================
  Files         174      174              
  Lines       25518    25475      -43     
==========================================
- Hits        22005    21973      -32     
+ Misses       3513     3502      -11     
Files Changed Coverage Δ
core/dbt/parser/base.py 93.45% <ø> (ø)
core/dbt/utils.py 81.38% <ø> (ø)
core/setup.py 0.00% <ø> (ø)
core/dbt/context/context_config.py 94.11% <100.00%> (+0.26%) ⬆️
core/dbt/contracts/connection.py 96.03% <100.00%> (ø)
core/dbt/contracts/graph/manifest.py 93.75% <100.00%> (ø)
core/dbt/contracts/graph/model_config.py 92.09% <100.00%> (-1.72%) ⬇️
core/dbt/contracts/graph/nodes.py 94.46% <100.00%> (ø)
core/dbt/contracts/graph/unparsed.py 93.09% <100.00%> (ø)
core/dbt/contracts/project.py 97.68% <100.00%> (+0.06%) ⬆️
... and 3 more

... and 1 file with indirect coverage changes

@gshank gshank marked this pull request as ready for review August 3, 2023 18:07
@gshank gshank closed this Aug 3, 2023
@gshank gshank reopened this Aug 3, 2023
@gshank
Copy link
Contributor Author

gshank commented Aug 3, 2023

It looks like this has much worse performance than the hologram json_schema generation. The main difference is that hologram cached the field lists in a class variable and the schemas in another class variable. There are a couple of things to try here: 1) caching the json_schemas in a class, 2) pre-generating the json_schemas and using them statically, 3) figuring out how to have mashumaro cache the fields.

Fatal1ty/mashumaro#137

@gshank
Copy link
Contributor Author

gshank commented Aug 16, 2023

Closing this for now, and will continue work on the performance/caching issue in another ticket. This branch will be used as the basis for the additional work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants