[bugfix] convert metrics to numeric in dataframe #4726

Merged · 3 commits · Apr 3, 2018
2 changes: 1 addition & 1 deletion superset/models/core.py
@@ -702,7 +702,7 @@ def needs_conversion(df_series):
return True
return False

-for k, v in df.dtypes.iteritems():
+for k, v in df.dtypes.items():
if v.type == numpy.object_ and needs_conversion(df[k]):
df[k] = df[k].apply(utils.json_dumps_w_dates)
return df
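The core.py change is a one-line rename: `Series.items()` and the older `iteritems()` both yield (label, value) pairs, and `items()` is the spelling that survives (`iteritems` was removed in pandas 2.0). A standalone sketch of what this loop iterates over (the DataFrame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# df.dtypes is a Series mapping column name -> dtype;
# .items() yields (column, dtype) pairs.
for col, dtype in df.dtypes.items():
    print(col, dtype)
```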
15 changes: 12 additions & 3 deletions superset/viz.py
@@ -170,11 +170,21 @@ def get_df(self, query_obj=None):
if self.datasource.offset:
df[DTTM_ALIAS] += timedelta(hours=self.datasource.offset)
df[DTTM_ALIAS] += self.time_shift

self.df_metrics_to_num(df, query_obj.get('metrics') or [])
Member:
Nit: {}.get already takes a default value when the key is not present:

self.df_metrics_to_num(df, query_obj.get('metrics', []))

Member Author:

This way is a bit safer: if the key exists with a None value, it will be replaced with an empty list. That situation shouldn't occur, but it's a tiny bit safer.

Member:
Ah, good point. I assumed keys were never None.
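The difference discussed in this thread fits in two lines (a standalone illustration, not Superset code):

```python
query_obj = {'metrics': None}  # key present, but value is None

# dict.get's default only applies when the key is absent:
print(query_obj.get('metrics', []))    # -> None
# `or []` also catches a present-but-falsy value such as None:
print(query_obj.get('metrics') or [])  # -> []
```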


df.replace([np.inf, -np.inf], np.nan)
fillna = self.get_fillna_for_columns(df.columns)
df = df.fillna(fillna)
return df

@staticmethod
def df_metrics_to_num(df, metrics):
    """Converting metrics to numeric when pandas.read_sql cannot"""
    for col, dtype in df.dtypes.items():
        if dtype.type == np.object_ and col in metrics:
            df[col] = pd.to_numeric(df[col])
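A self-contained sketch of what the new helper does when pandas.read_sql leaves numbers as strings (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

def df_metrics_to_num(df, metrics):
    """Convert object-typed metric columns to numeric in place."""
    for col, dtype in df.dtypes.items():
        if dtype.type == np.object_ and col in metrics:
            df[col] = pd.to_numeric(df[col])

df = pd.DataFrame({
    'name': pd.Series(['a', 'b'], dtype=object),         # dimension, untouched
    'sum_sales': pd.Series(['1.5', '2'], dtype=object),  # metric left as strings
})
df_metrics_to_num(df, metrics=['sum_sales'])
print(df.dtypes['sum_sales'])  # float64
```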
Contributor:
@mistercrunch we've seen issues with this change because metrics are not always numeric. You could have a max of a string for example. I'm going to look into other options for doing this, but if you have any thoughts let me know.
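The failure mode described here is easy to reproduce: pd.to_numeric raises on the kind of column a MAX(string) aggregate would produce (illustrative data):

```python
import pandas as pd

max_name = pd.Series(['alice', 'bob'], dtype=object)  # e.g. MAX(name)
try:
    pd.to_numeric(max_name)
except ValueError as exc:
    print('coercion failed:', exc)
```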

Member Author:
We should really clarify what a metric is or isn't. My understanding is that metrics are numeric, but I can see how that may not have been enforced in the past, and that in some places a metric was interpreted as [only] being the result of an aggregate function that may or may not be numerical (say MAX(my_string) in the context of a Table viz would just happen to work somehow in the past).

To be clear, the current conceptual model is that dimensions are columns or non-aggregate SQL expressions, and metrics are always aggregate expressions and numeric. A more complete (but complex) model would have columns, SQL expressions, and aggregate expressions, where each one may or may not be a metric and/or a dimension. That model would require a fair amount of foundational redesign.

How common is this? My inclination would be to handle non-numeric aggregate functions outside of the semantic layer, meaning in a SQL Lab subquery or an upstream data pipeline.

Note that we could patch something to preserve backward compatibility here. Something like BaseViz.enforce_numeric_metrics = True that would be set to False for Table viz, and run the code referenced above conditionally.
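The backward-compatibility patch suggested above could look roughly like this (a hypothetical sketch; the class bodies and the postprocess hook are illustrative, not actual Superset code):

```python
import numpy as np
import pandas as pd

class BaseViz:
    # Hypothetical class-level opt-out flag; subclasses that legitimately
    # produce non-numeric metrics can disable coercion.
    enforce_numeric_metrics = True

    @staticmethod
    def df_metrics_to_num(df, metrics):
        for col, dtype in df.dtypes.items():
            if dtype.type == np.object_ and col in metrics:
                df[col] = pd.to_numeric(df[col])

    def postprocess(self, df, metrics):
        if self.enforce_numeric_metrics:
            self.df_metrics_to_num(df, metrics)
        return df

class TableViz(BaseViz):
    # Table viz may legitimately display MAX(my_string), so skip coercion.
    enforce_numeric_metrics = False

df = pd.DataFrame({'m': pd.Series(['x', 'y'], dtype=object)})
TableViz().postprocess(df, ['m'])  # no coercion attempted, no error
```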

Contributor:
What about thinking of a metric as an aggregation, and leaving it up to the visualizations to determine what data type is required?

The system can handle max(string): it's a valid aggregation and gets processed correctly everywhere except this post-processing step, so I'm not suggesting anything that would require a complex redesign. It seems valid for visualizations to only support certain types, but I think we should leave that up to the visualization instead of having this break when returning the dataframe in BaseViz.

Max(date) is an example use case, which seems valid. A less frequent use case is having sums in a case statement.

cc: @john-bodley

Member Author:
@michellethomas what do you think of #5176 ?


def query_obj(self):
"""Building a query object"""
form_data = self.form_data
@@ -1060,7 +1070,6 @@ def process_data(self, df, aggregate=False):
df = df.fillna(0)
if fd.get('granularity') == 'all':
raise Exception(_('Pick a time granularity for your time series'))

if not aggregate:
df = df.pivot_table(
index=DTTM_ALIAS,
@@ -1384,7 +1393,7 @@ def get_data(self, df):
pt = (pt / pt.sum()).T
pt = pt.reindex(row.index)
chart_data = []
-for name, ys in pt.iteritems():
+for name, ys in pt.items():
if pt[name].dtype.kind not in 'biufc' or name in self.groupby:
continue
if isinstance(name, string_types):
@@ -1395,7 +1404,7 @@ def get_data(self, df):
l = [str(s) for s in name[1:]] # noqa: E741
series_title = ', '.join(l)
values = []
-for i, v in ys.iteritems():
+for i, v in ys.items():
x = i
if isinstance(x, (tuple, list)):
x = ', '.join([text_type(s) for s in x])
2 changes: 2 additions & 0 deletions tests/viz_tests.py
@@ -77,6 +77,8 @@ def test_get_df_handles_dttm_col(self):
results.df.empty = False
datasource.query = Mock(return_value=results)
test_viz = viz.BaseViz(datasource, form_data)

test_viz.df_metrics_to_num = Mock()
test_viz.get_fillna_for_columns = Mock(return_value=0)
test_viz.get_df(query_obj)
mock_call = df.__setitem__.mock_calls[0]
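The stubbing pattern in the added test can be shown standalone: assigning a unittest.mock.Mock over the instance method keeps the get_df-style logic under test while skipping the helper (the Viz class here is a simplified stand-in, not Superset's BaseViz):

```python
from unittest.mock import Mock

class Viz:
    def df_metrics_to_num(self, df, metrics):
        raise RuntimeError('real helper should not run in this test')

    def get_df(self, df, metrics):
        self.df_metrics_to_num(df, metrics)
        return df

viz_instance = Viz()
viz_instance.df_metrics_to_num = Mock()  # instance attribute shadows the method
result = viz_instance.get_df({'m': [1]}, ['m'])
viz_instance.df_metrics_to_num.assert_called_once_with({'m': [1]}, ['m'])
```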