Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pivot table] Query with ROW LIMIT #13562

Closed
3 tasks done
graceguo-supercat opened this issue Mar 11, 2021 · 5 comments
Closed
3 tasks done

[pivot table] Query with ROW LIMIT #13562

graceguo-supercat opened this issue Mar 11, 2021 · 5 comments
Labels
!deprecated-label:bug Deprecated label - Use #bug instead test:case viz:charts:pivot Related to the Pivot Table charts

Comments

@graceguo-supercat
Copy link

graceguo-supercat commented Mar 11, 2021

How to reproduce the bug

  1. Create a PivotTable chart,
  2. Add ROW LIMIT = 200
  3. Check generated query by click VIEW QUERY

Actual results

SELECT "dim_destination_geo" AS "dim_destination_geo",
       date_trunc('week', CAST("night_in_latest_year" AS TIMESTAMP)) AS "night_in_latest_year",
       ...
FROM "tmp"."forward_occupancy_forecast_superset_table"
WHERE "night_in_latest_year" >= '2021-03-11 00:00:00.000000'
  AND "night_in_latest_year" < '2021-06-15 00:00:00.000000'
ORDER BY "Nights Backlog" DESC
LIMIT 200;

Instead of filtering out the top K rows in the superset view, it filters out the top K rows in the input data; this results in the pivot table having lots of missing cells.

Expected results

Superset should generate a sub-query with limit like this:

SELECT "dim_destination_geo" AS "dim_destination_geo",
       date_trunc('week', CAST("night_in_latest_year" AS TIMESTAMP)) AS "night_in_latest_year",
       ...
FROM "tmp"."forward_occupancy_forecast_superset_table"
WHERE "night_in_latest_year" >= '2021-03-11 00:00:00.000000'
  AND "night_in_latest_year" < '2021-06-15 00:00:00.000000'
       AND (dim_destination_geo IN
              (SELECT dim_destination_geo
               FROM
                 (SELECT dim_destination_geo,
                         sum(current_forward_nights_occupied)
                  FROM tmp.forward_occupancy_forecast_superset_table
                  WHERE is_most_recent_view = True
                  GROUP BY 1
                  order by 2 DESC
                  LIMIT 200))))
ORDER BY "Nights Backlog" DESC
LIMIT 50000;

Environment

latest master branch

Checklist

Make sure to follow these steps before submitting your issue - thank you!

  • I have checked the superset logs for python stacktraces and included it here as text if there are any.
  • I have reproduced the issue with at least the latest released version of superset.
  • I have checked the issue tracker for the same issue and I haven't found one similar.

Additional context

Add any other context about the problem here.
cc @junlincc @zuzana-vej

@graceguo-supercat graceguo-supercat added the #bug Bug report label Mar 11, 2021
@graceguo-supercat graceguo-supercat changed the title [pivot table] Generated query with ROW LIMIT is [pivot table] Query with ROW LIMIT Mar 11, 2021
@junlincc
Copy link
Member

junlincc commented Mar 11, 2021

Hi @graceguo-supercat, can you attach a screenshot with all the control select. I assume you set filters and groud-by as well?

Couple PRs touched Pivot table lately. may need some help debugging. sorry about the regression.
apache-superset/superset-ui#954
#13057 (more likely?)

cc @villebro

@junlincc junlincc added viz:charts:pivot Related to the Pivot Table charts !deprecated-label:bug Deprecated label - Use #bug instead test:case and removed #bug Bug report labels Mar 11, 2021
@villebro
Copy link
Member

I don't believe this is a regression - this has most likely always been the case. I agree that the proposal is the correct way to filter the data. I believe we should be able to do this by leveraging the timeseries_limit (although I'd really like to refactor this so that the feature isn't related to timeseries):

if (
is_timeseries # pylint: disable=too-many-boolean-expressions
and timeseries_limit
and not time_groupby_inline
and groupby
):
if self.database.db_engine_spec.allows_joins:
# some sql dialects require for order by expressions
# to also be in the select clause -- others, e.g. vertica,
# require a unique inner alias
inner_main_metric_expr = self.make_sqla_column_compatible(
main_metric_expr, "mme_inner__"
)
inner_groupby_exprs = []
inner_select_exprs = []
for gby_name, gby_obj in groupby_exprs_sans_timestamp.items():
inner = self.make_sqla_column_compatible(gby_obj, gby_name + "__")
inner_groupby_exprs.append(inner)
inner_select_exprs.append(inner)
inner_select_exprs += [inner_main_metric_expr]
subq = select(inner_select_exprs).select_from(tbl)
inner_time_filter = dttm_col.get_time_filter(
inner_from_dttm or from_dttm,
inner_to_dttm or to_dttm,
time_range_endpoints,
)
subq = subq.where(and_(*(where_clause_and + [inner_time_filter])))
subq = subq.group_by(*inner_groupby_exprs)
ob = inner_main_metric_expr
if timeseries_limit_metric:
ob = self._get_timeseries_orderby(
timeseries_limit_metric, metrics_by_name, columns_by_name
)
direction = desc if order_desc else asc
subq = subq.order_by(direction(ob))
subq = subq.limit(timeseries_limit)
on_clause = []
for gby_name, gby_obj in groupby_exprs_sans_timestamp.items():
# in this case the column name, not the alias, needs to be
# conditionally mutated, as it refers to the column alias in
# the inner query
col_name = db_engine_spec.make_label_compatible(gby_name + "__")
on_clause.append(gby_obj == column(col_name))
tbl = tbl.join(subq.alias(), and_(*on_clause))

@graceguo-supercat
Copy link
Author

@junlincc this is controls list:

Screen Shot 2021-03-11 at 10 17 33 AM

@john-bodley
Copy link
Member

@villebro are there any concerns of merely adding a SERIES LIMIT control per your suggestion?

@john-bodley
Copy link
Member

john-bodley commented Jun 28, 2021

Actually @villebro adding the SERIES LIMIT might not be the right approach, i.e., when prototyping the change the limit adheres to the row limit of the groupings of both the GROUP BY and COLUMNS which is likely not what the user is after. Grepping through the code it seems non trivial to decouple this logic.

Screen Shot 2021-06-28 at 10 37 14 AM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
!deprecated-label:bug Deprecated label - Use #bug instead test:case viz:charts:pivot Related to the Pivot Table charts
Projects
None yet
Development

No branches or pull requests

5 participants