Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(db_engine_specs): improve Presto column type matching #10658

Merged
merged 4 commits into from
Aug 24, 2020

Conversation

villebro
Copy link
Member

@villebro villebro commented Aug 23, 2020

SUMMARY

Currently some column types aren't correctly identified on Presto, as the matching pattern doesn't allow for field length (for varchar it fails to correctly identify varchar(255)). Since similar but more robust regex matching is already implemented on MSSQL, this aligns that handling on Presto and removes duplicated code. As of this PR, unidentified column types on Presto will now raise an exception to avoid these types of bugs going unnoticed, as the type coverage seems very comprehensive and is easy to expand as needed.

TEST PLAN

CI + new tests

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

@villebro villebro changed the title [WIP] fix: improve Presto column type matching fix(db_engine_specs): improve Presto column type matching Aug 23, 2020
Comment on lines +497 to +501
sqla_type = PrestoEngineSpec.get_sqla_column_type("varchar(255)")
assert isinstance(sqla_type, types.VARCHAR)
assert sqla_type.length == 255

sqla_type = PrestoEngineSpec.get_sqla_column_type("varchar")
assert isinstance(sqla_type, types.String)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we can see that varchar(255) creates a 255 long VARCHAR object, whereas just varchar leaves the length undefined.

Comment on lines +337 to +340
(
re.compile(r"^varchar(\((\d+)\))*$", re.IGNORECASE),
lambda match: types.VARCHAR(int(match[2])) if match[2] else types.String(),
),
Copy link
Member Author

@villebro villebro Aug 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we're instantiating a types.VARCHAR object with a set length based on the regex match. This is usually unnecessary, but I thought I'd throw it in here to demonstrate how the matches from the regex can be used to customize the SQLA object.

Comment on lines 147 to 149
_column_type_mappings: Tuple[
Tuple[Pattern[str], Union[TypeEngine, Callable[[Match[str]], TypeEngine]]], ...,
] = ()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the type can either be a SQLA type instance, or a callback that receives the match object and in turn returns the SQLA type instance.

Copy link
Member

@dpgaspar dpgaspar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a comment

@@ -885,12 +890,18 @@ def get_sqla_column_type(cls, type_: str) -> Optional[TypeEngine]:
"""
Return a sqlalchemy native column type that corresponds to the column type
defined in the data source (return None to use default type inferred by
SQLAlchemy). Needs to be overridden if column requires special handling
SQLAlchemy). Override `_column_type_mappings` for specific needs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we expect subclasses to override _column_type_mappings should we make it public?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, the thought crossed my mind while writing this but there were other similar properties that were private so I went with the flow. I changed this one to public, and I'll update the other ones in a later PR to keep this PR as light as possible.

@codecov-commenter
Copy link

codecov-commenter commented Aug 24, 2020

Codecov Report

Merging #10658 into master will decrease coverage by 4.31%.
The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10658      +/-   ##
==========================================
- Coverage   64.33%   60.01%   -4.32%     
==========================================
  Files         784      784              
  Lines       36952    36952              
  Branches     3529     3529              
==========================================
- Hits        23772    22178    -1594     
- Misses      13071    14587    +1516     
- Partials      109      187      +78     
Flag Coverage Δ
#cypress ?
#javascript 60.83% <ø> (ø)
#python 59.53% <91.66%> (-0.26%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/models/sql_types/presto_sql_types.py 79.48% <ø> (-1.01%) ⬇️
superset/db_engine_specs/presto.py 70.91% <86.66%> (-11.78%) ⬇️
superset/db_engine_specs/base.py 87.87% <100.00%> (+0.23%) ⬆️
superset/db_engine_specs/mssql.py 92.30% <100.00%> (-1.03%) ⬇️
superset-frontend/src/SqlLab/App.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/explore/App.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/dashboard/App.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/explore/index.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/dashboard/index.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/setup/setupColors.js 0.00% <0.00%> (-100.00%) ⬇️
... and 151 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0177c2f...ca9f843. Read the comment docs.

Copy link
Member

@bkyryliuk bkyryliuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement, thank you!


@staticmethod
def test_get_sqla_column_type():
sqla_type = PrestoEngineSpec.get_sqla_column_type("varchar(255)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be nice to add some more test cases here especially the ones that are not supported by presto

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is blocking another fairly critical feature I'll defer the more comprehensive tests to a forthcoming PR and just cover the discovered bug here. I'm hoping I can carve out some time to build a template for testing this kind of functionality, and don't mind doing the initial work on the Presto spec.

@@ -490,3 +491,24 @@ def test_presto_expand_data_array(self):
self.assertEqual(actual_cols, expected_cols)
self.assertEqual(actual_data, expected_data)
self.assertEqual(actual_expanded_cols, expected_expanded_cols)

@staticmethod
def test_get_sqla_column_type():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to make it a class method, in pytest you can just do

def test_get_sqla_column_type():
  assert True

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, not sure what I was doing there! I'll just tack it on as a regular class method like the other tests to limit confusion. Do we already have a test module that follows pytest best practices that I can mimic going forward?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@villebro villebro merged commit 9461f9c into apache:master Aug 24, 2020
@villebro villebro deleted the villebro/fix-presto-types branch August 24, 2020 19:42
michellethomas pushed a commit to airbnb/superset-fork that referenced this pull request Aug 27, 2020
(see MSSQL for example of NCHAR/NVARCHAR handling).

:param type_: Column type returned by inspector
:return: SqlAlchemy column type
"""
for regex, sqla_type in cls.column_type_mappings:
match = regex.match(type_)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@villebro we tested this during a deploy and got errors here because sometimes type_ is None. get_sqla_column_type is used in connectors/sqla/models.py where sometimes the type on a column can be null.

serenajiang added a commit to airbnb/superset-fork that referenced this pull request Sep 2, 2020
etr2460 pushed a commit to airbnb/superset-fork that referenced this pull request Sep 11, 2020
ktmud added a commit to ktmud/superset that referenced this pull request Sep 15, 2020
ktmud added a commit to airbnb/superset-fork that referenced this pull request Sep 16, 2020
@dpgaspar dpgaspar added the v0.38 label Sep 22, 2020
dpgaspar pushed a commit that referenced this pull request Sep 22, 2020
* fix: improve Presto column type matching

* add optional callback to type map and add tests

* lint

* change private to public
Ofeknielsen pushed a commit to ofekisr/incubator-superset that referenced this pull request Oct 5, 2020
* fix: improve Presto column type matching

* add optional callback to type map and add tests

* lint

* change private to public
auxten pushed a commit to auxten/incubator-superset that referenced this pull request Nov 20, 2020
* fix: improve Presto column type matching

* add optional callback to type map and add tests

* lint

* change private to public
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.38.0 labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L v0.38 🚢 0.38.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants