Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bugfix] convert metrics to numeric in dataframe #4726

Merged
merged 3 commits into from
Apr 3, 2018

Conversation

mistercrunch
Copy link
Member

It appears sometimes the dbapi driver and pandas's read_sql fail at
returning the proper numeric types for metrics and they show up as
object in the dataframe. This results in "No numeric types to
aggregate" errors when trying to perform aggregations or pivoting in
pandas.

This PR looks for metrics in dataframes that are typed as "object"
and uses pandas' to_numeric to convert.

It appears sometimes the dbapi driver and pandas's read_sql fail at
returning the proper numeric types for metrics and they show up as
`object` in the dataframe. This results in "No numeric types to
aggregate" errors when trying to perform aggregations or pivoting in
pandas.

This PR looks for metrics in dataframes that are typed as "object"
and uses pandas' to_numeric to convert.
@codecov-io
Copy link

codecov-io commented Mar 30, 2018

Codecov Report

Merging #4726 into master will decrease coverage by <.01%.
The diff coverage is 87.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4726      +/-   ##
==========================================
- Coverage   72.22%   72.22%   -0.01%     
==========================================
  Files         204      204              
  Lines       15323    15329       +6     
  Branches     1180     1181       +1     
==========================================
+ Hits        11067    11071       +4     
- Misses       4253     4255       +2     
  Partials        3        3
Impacted Files Coverage Δ
superset/models/core.py 86.52% <100%> (ø) ⬆️
superset/viz.py 79.62% <85.71%> (ø) ⬆️
...cripts/explore/components/ExploreViewContainer.jsx 0% <0%> (ø) ⬆️
...set/assets/javascripts/explore/stores/controls.jsx 39.25% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 069d61c...32a03cb. Read the comment docs.

Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go, just a few comments.

@@ -170,11 +170,21 @@ def get_df(self, query_obj=None):
if self.datasource.offset:
df[DTTM_ALIAS] += timedelta(hours=self.datasource.offset)
df[DTTM_ALIAS] += self.time_shift

self.df_metrics_to_num(df, query_obj.get('metrics') or [])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: {}.get already takes a default value when the key is not present:

self.df_metrics_to_num(df, query_obj.get('metrics', []))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way is a bit safer if the key exists with a None value it will make it an empty dict. That situation shouldn't occur, but it's a tiny bit safer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point. I assumed keys were never None.

superset/viz.py Outdated
@staticmethod
def df_metrics_to_num(df, metrics):
"""Converting metrics to numeric when pandas.read_sql cannot"""
for col, dtype in df.dtypes.iteritems():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python 3 dicts have no iteritems(), better to just use items() (the cost in Python is building the object, but the dict is small so it shouldn't matter).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. I got confused as I copied the code over and though it must have been backported for compatibility reasons. I'll grep and remove other instances of it where I copied it from.

@mistercrunch mistercrunch merged commit f6fe11f into apache:master Apr 3, 2018
hughhhh pushed a commit to lyft/incubator-superset that referenced this pull request Apr 11, 2018
* [bugfix] convert metrics to numeric in dataframe

It appears sometimes the dbapi driver and pandas's read_sql fail at
returning the proper numeric types for metrics and they show up as
`object` in the dataframe. This results in "No numeric types to
aggregate" errors when trying to perform aggregations or pivoting in
pandas.

This PR looks for metrics in dataframes that are typed as "object"
and uses pandas' to_numeric to convert.

* Fix tests

* Remove all iteritems
michellethomas pushed a commit to michellethomas/panoramix that referenced this pull request May 24, 2018
* [bugfix] convert metrics to numeric in dataframe

It appears sometimes the dbapi driver and pandas's read_sql fail at
returning the proper numeric types for metrics and they show up as
`object` in the dataframe. This results in "No numeric types to
aggregate" errors when trying to perform aggregations or pivoting in
pandas.

This PR looks for metrics in dataframes that are typed as "object"
and uses pandas' to_numeric to convert.

* Fix tests

* Remove all iteritems
timifasubaa pushed a commit to timifasubaa/incubator-superset that referenced this pull request May 31, 2018
* [bugfix] convert metrics to numeric in dataframe

It appears sometimes the dbapi driver and pandas's read_sql fail at
returning the proper numeric types for metrics and they show up as
`object` in the dataframe. This results in "No numeric types to
aggregate" errors when trying to perform aggregations or pivoting in
pandas.

This PR looks for metrics in dataframes that are typed as "object"
and uses pandas' to_numeric to convert.

* Fix tests

* Remove all iteritems
"""Converting metrics to numeric when pandas.read_sql cannot"""
for col, dtype in df.dtypes.items():
if dtype.type == np.object_ and col in metrics:
df[col] = pd.to_numeric(df[col])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mistercrunch we've seen issues with this change because metrics are not always numeric. You could have a max of a string for example. I'm going to look into other options for doing this, but if you have any thoughts let me know.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really clarify what a metric is or isn't. My understanding it that metrics are numeric, but I can see how it may not have been enforced in the past, and that in some places it would be interpreted as [only] being the result of an aggregate function that may or may not be numerical (say MAX(my_string) in the context of a Table viz would just happen to work somehow in the past).

To be clear, the current conceptual model is dimensions are columns or non-aggregate-SQL-expressions, and metrics are always aggregate expressions and numeric. A more complete (but complex) model would be to have columns, sql-expression, and aggregate expressions, and each one may or may not be a metric and/or dimension. This model would require a fair amount of foundational redesign.

How common is this? My incline would be to handle non-numeric aggregate functions outside of the semantic layer, meaning in a SQL Lab subquery or upstream data pipeline.

Note that we could patch something to preserve backward compatibility here. Something like BaseViz.enforce_numeric_metrics = True that would be set to False for Table viz, and run the code referenced above conditionally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about thinking of a metric as an aggregation, and leaving it up to the visualizations to determine what data type is required?

The system can handle max(string) as it's a valid aggregation and gets processed in the system correctly with the exception of this post processing. So I'm not suggesting anything that would require a complex redesign. It seems valid for visualizations to only allow certain types to work correctly, but I think we should leave that up to the visualization instead of having this kind of thing break when returning the dataframe in base viz.

Max(date) is an example use case, which seems valid. A less frequent use case is having sums in a case statment.

cc: @john-bodley

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michellethomas what do you think of #5176 ?

wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
* [bugfix] convert metrics to numeric in dataframe

It appears sometimes the dbapi driver and pandas's read_sql fail at
returning the proper numeric types for metrics and they show up as
`object` in the dataframe. This results in "No numeric types to
aggregate" errors when trying to perform aggregations or pivoting in
pandas.

This PR looks for metrics in dataframes that are typed as "object"
and uses pandas' to_numeric to convert.

* Fix tests

* Remove all iteritems
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.25.0 labels Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.25.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants