[bugfix] convert metrics to numeric in dataframe #4726

mistercrunch · 2018-03-30T22:49:41Z

It appears sometimes the dbapi driver and pandas's read_sql fail at
returning the proper numeric types for metrics and they show up as
object in the dataframe. This results in "No numeric types to
aggregate" errors when trying to perform aggregations or pivoting in
pandas.

This PR looks for metrics in dataframes that are typed as "object"
and uses pandas' to_numeric to convert.
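A minimal sketch of the conversion described above. It assumes a small hand-built dataframe in place of a real `read_sql` result; the column names and values are made up for illustration and are not taken from the PR.

```python
import pandas as pd

# Stand-in for a read_sql result where the driver handed back the metric as
# strings, so pandas stored the column with dtype "object".
df = pd.DataFrame({
    "country": ["FR", "FR", "US"],
    "sum__sales": pd.Series(["10.5", "4.5", "7"], dtype=object),
})
assert df["sum__sales"].dtype == object

# Convert only the object-typed metric column, as the PR describes.
df["sum__sales"] = pd.to_numeric(df["sum__sales"])

# Aggregation now operates on real numbers instead of Python objects.
totals = df.groupby("country")["sum__sales"].sum()
print(totals.to_dict())
```

The dimension column (`country`) is left untouched; only columns named as metrics are candidates for conversion.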


codecov-io · 2018-03-30T23:41:23Z

Codecov Report

Merging #4726 into master will decrease coverage by <.01%.
The diff coverage is 87.5%.

@@            Coverage Diff             @@
##           master    #4726      +/-   ##
==========================================
- Coverage   72.22%   72.22%   -0.01%     
==========================================
  Files         204      204              
  Lines       15323    15329       +6     
  Branches     1180     1181       +1     
==========================================
+ Hits        11067    11071       +4     
- Misses       4253     4255       +2     
  Partials        3        3

| Impacted Files | Coverage Δ | |
|---|---|---|
| superset/models/core.py | `86.52% <100%> (ø)` | ⬆️ |
| superset/viz.py | `79.62% <85.71%> (ø)` | ⬆️ |
| ...cripts/explore/components/ExploreViewContainer.jsx | `0% <0%> (ø)` | ⬆️ |
| ...set/assets/javascripts/explore/stores/controls.jsx | `39.25% <0%> (ø)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 069d61c...32a03cb.

betodealmeida

Good to go, just a few comments.

betodealmeida · 2018-03-31T05:37:16Z

superset/viz.py

@@ -170,11 +170,21 @@ def get_df(self, query_obj=None):
                if self.datasource.offset:
                    df[DTTM_ALIAS] += timedelta(hours=self.datasource.offset)
                df[DTTM_ALIAS] += self.time_shift
+
+            self.df_metrics_to_num(df, query_obj.get('metrics') or [])


Nit: {}.get already takes a default value when the key is not present:

self.df_metrics_to_num(df, query_obj.get('metrics', []))

mistercrunch · 2018-04-02

The `or []` form is a bit safer: if the key exists with a None value, it turns it into an empty list. That situation shouldn't occur, but it's a tiny bit safer.

betodealmeida · 2018-04-02

Ah, good point. I assumed keys were never None.
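The difference between the two spellings discussed in this thread can be seen with a plain dict. This is a standalone illustration; the `query_obj` contents here are invented, not taken from Superset.

```python
# The key is present but explicitly set to None.
query_obj = {"metrics": None}

# The default only applies when the key is missing, so None comes through.
with_default = query_obj.get("metrics", [])

# `or []` also replaces a present-but-falsy value such as None.
with_or = query_obj.get("metrics") or []

print(with_default)  # None
print(with_or)       # []
```

When the key is absent altogether, both forms return an empty list.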

betodealmeida · 2018-03-31T05:40:24Z

superset/viz.py

+    @staticmethod
+    def df_metrics_to_num(df, metrics):
+        """Converting metrics to numeric when pandas.read_sql cannot"""
+        for col, dtype in df.dtypes.iteritems():


Python 3 dicts have no iteritems(); better to just use items() (the cost in Python 2 is building the list, but the dict is small so it shouldn't matter).

mistercrunch · 2018-04-02

That's right. I got confused as I copied the code over and thought it must have been backported for compatibility reasons. I'll grep and remove other instances of it where I copied it from.
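For context, `df.dtypes` is a pandas Series rather than a dict, and `Series.items()` yields `(column, dtype)` pairs. A small sketch with an invented dataframe (to my knowledge, pandas itself later dropped `Series.iteritems()` in its 2.0 release, so `items()` is the portable spelling there too):

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "count": [1, 2]})

# df.dtypes maps column name -> dtype; items() iterates over those pairs.
kinds = {col: dtype.kind for col, dtype in df.dtypes.items()}
print(kinds)
```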

* [bugfix] convert metrics to numeric in dataframe

  It appears sometimes the dbapi driver and pandas's read_sql fail at returning the proper numeric types for metrics and they show up as `object` in the dataframe. This results in "No numeric types to aggregate" errors when trying to perform aggregations or pivoting in pandas. This PR looks for metrics in dataframes that are typed as "object" and uses pandas' to_numeric to convert.
* Fix tests
* Remove all iteritems

michellethomas · 2018-06-11T17:27:51Z

superset/viz.py

+        """Converting metrics to numeric when pandas.read_sql cannot"""
+        for col, dtype in df.dtypes.items():
+            if dtype.type == np.object_ and col in metrics:
+                df[col] = pd.to_numeric(df[col])


@mistercrunch we've seen issues with this change because metrics are not always numeric. You could have a max of a string for example. I'm going to look into other options for doing this, but if you have any thoughts let me know.
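The failure mode reported here can be reproduced in isolation: with its default `errors="raise"`, `pd.to_numeric` raises `ValueError` on strings it cannot parse. The sketch below also shows one possible tolerant variant; `to_numeric_if_possible` is a hypothetical helper written for illustration, not the fix adopted in Superset.

```python
import pandas as pd

# MAX(my_string) is a valid aggregate, but its result is not numeric.
max_name = pd.Series(["alice", "zoe"], dtype=object)

try:
    pd.to_numeric(max_name)
    failed = False
except ValueError:
    # Unparseable strings raise under the default errors="raise".
    failed = True


def to_numeric_if_possible(series):
    """Hypothetical helper: convert when possible, else return the input as-is."""
    try:
        return pd.to_numeric(series)
    except (ValueError, TypeError):
        return series


kept = to_numeric_if_possible(max_name)
converted = to_numeric_if_possible(pd.Series(["1", "2"], dtype=object))
print(failed, kept.tolist(), converted.tolist())
```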

mistercrunch · 2018-06-11

We should really clarify what a metric is or isn't. My understanding is that metrics are numeric, but I can see how it may not have been enforced in the past, and that in some places it would be interpreted as [only] being the result of an aggregate function that may or may not be numerical (say MAX(my_string) in the context of a Table viz would just happen to work somehow in the past).

To be clear, the current conceptual model is that dimensions are columns or non-aggregate SQL expressions, and metrics are always aggregate expressions and numeric. A more complete (but complex) model would be to have columns, SQL expressions, and aggregate expressions, where each one may or may not be a metric and/or dimension. This model would require a fair amount of foundational redesign.

How common is this? My inclination would be to handle non-numeric aggregate functions outside of the semantic layer, meaning in a SQL Lab subquery or upstream data pipeline.

Note that we could patch something to preserve backward compatibility here. Something like BaseViz.enforce_numeric_metrics = True that would be set to False for Table viz, and run the code referenced above conditionally.
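A rough sketch of what such a flag could look like. Only the name `enforce_numeric_metrics` comes from the comment above; the class bodies are simplified stand-ins, and `df_metrics_to_num` is written here as an instance method (the PR has it as a static method) so that it can read the flag.

```python
import pandas as pd


class BaseViz:
    # Proposed flag: numeric coercion stays on by default.
    enforce_numeric_metrics = True

    def df_metrics_to_num(self, df, metrics):
        """Convert object-typed metric columns unless the viz opts out."""
        if not self.enforce_numeric_metrics:
            return
        for col, dtype in df.dtypes.items():
            if dtype == object and col in metrics:
                df[col] = pd.to_numeric(df[col])


class TableViz(BaseViz):
    # Tables can display non-numeric aggregates such as MAX(my_string).
    enforce_numeric_metrics = False
```

With this shape, a table would pass `MAX(my_string)` results through untouched, while other visualizations would keep the coercion and still fail on non-numeric metrics.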

michellethomas · 2018-06-11

What about thinking of a metric as an aggregation, and leaving it up to the visualizations to determine what data type is required?

The system can handle max(string) as it's a valid aggregation and gets processed in the system correctly with the exception of this post processing. So I'm not suggesting anything that would require a complex redesign. It seems valid for visualizations to only allow certain types to work correctly, but I think we should leave that up to the visualization instead of having this kind of thing break when returning the dataframe in base viz.

Max(date) is an example use case, which seems valid. A less frequent use case is having sums in a case statement.

cc: @john-bodley

mistercrunch · 2018-06-11

@michellethomas what do you think of #5176?


mistercrunch added 2 commits March 30, 2018 22:45

Fix tests

f6471a8

betodealmeida approved these changes Mar 31, 2018


Remove all iteritems

32a03cb

mistercrunch merged commit f6fe11f into apache:master Apr 3, 2018

zyclonite mentioned this pull request May 25, 2018

viz tries to convert to number for a calculated (db function) column which is string #5082

Closed

michellethomas reviewed Jun 11, 2018


mistercrunch added the 🏷️ bot and 🚢 0.25.0 labels Feb 27, 2024


