
[SQL Lab] Clarify SQL Lab query and display limits #7641

Merged

Conversation

etr2460
Member

@etr2460 etr2460 commented Jun 3, 2019

CATEGORY

Choose one

  • Bug Fix
  • Enhancement (new features, refinement)
  • Refactor
  • Add tests
  • Build / Development Environment
  • Documentation

SUMMARY

Builds off of comments in #6942. Sets three row limits for the following uses:

  • SQL_MAX_ROW: The upper limit for rows returned from any query. Prevents excessive load on the database.
  • DISPLAY_MAX_ROW: The upper limit for rows displayed in SQL Lab. Prevents browser crashes. Should be <= SQL_MAX_ROW.
  • DEFAULT_SQLLAB_LIMIT: The default limit for queries in SQL Lab. Can be overwritten by the user to any value between 1 and DISPLAY_MAX_ROW.

If anyone has other ideas to simplify this, I'm happy to take another stab at it. However, I think that with these three variables we can represent all levers required for admins and also prevent users from crashing their browsers.
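For illustration, the three limits described above could be set in `superset_config.py` along these lines (the names come from this PR; the values are made-up examples, not defaults):

```python
# Hypothetical superset_config.py fragment illustrating the three limits.

# Hard cap on rows returned from any query; protects the database.
SQL_MAX_ROW = 1000000

# Cap on rows rendered in the SQL Lab results grid; protects the browser.
# Should be <= SQL_MAX_ROW.
DISPLAY_MAX_ROW = 10000

# Default LIMIT applied to SQL Lab queries; the user can override it
# with any value between 1 and DISPLAY_MAX_ROW.
DEFAULT_SQLLAB_LIMIT = 1000

# The intended ordering of the three levers:
assert DEFAULT_SQLLAB_LIMIT <= DISPLAY_MAX_ROW <= SQL_MAX_ROW
```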

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Warning text + tooltip if DISPLAY_MAX_ROW cuts off displayed results:
Screen Shot 2019-06-03 at 1 14 12 PM

TEST PLAN

  1. Set the DISPLAY_MAX_ROW value to something smaller
  2. Set your query limit in the SQL Lab UI higher than the DISPLAY_MAX_ROW value
  3. See LIMIT warning label with the warning message

Also test:

  • Limits below the DISPLAY_MAX_ROW still work
  • CSV download isn't affected by DISPLAY_MAX_ROW
  • The limits are applied on both sync and async queries

ADDITIONAL INFORMATION

REVIEWERS

@john-bodley, @michellethomas, @mistercrunch

@etr2460 etr2460 force-pushed the erik-ritter--fix-sqllab-row-limit branch from 4de0480 to e5abb9a Compare June 3, 2019 21:58
Member

@john-bodley john-bodley left a comment


@etr2460 I think enabling displayLimitReached is a good idea from a readability standpoint, and it also ensures consistency between sync and async.

In the future I still think there would be merit in the UI being more explicit that what is rendered is really a preview of the data, i.e. that one is previewing the first n of m records.

superset/views/core.py Outdated Show resolved Hide resolved
@etr2460 etr2460 force-pushed the erik-ritter--fix-sqllab-row-limit branch from e5abb9a to 6f9c31c Compare June 4, 2019 01:04
@mistercrunch
Member

Happy to see the re-introduction of DISPLAY_MAX_ROW.

There's a question as to whether we want to allow the user to bump the limit above DISPLAY_MAX_ROW, with some sort of confirmation screen "are you sure you want to {...} this could result in 🔥 💻 🔥 ???".

But I think this is right as is. Allowing that could come in a future PR if users need it.

@@ -94,6 +94,8 @@ class Query(Model, ExtraJSONMixin):
sqla.Index('ti_user_id_changed_on', user_id, changed_on),
)

# TODO: Clean up/fix/refactor limit_reached and limit_used. limit_used defaults
Member


Agreed, looks like we can drop both now. I'm guessing they got orphaned by the back and forth around limits.

@@ -184,3 +184,10 @@ def get_datasource_info(datasource_id, datasource_type, form_data):
'The datasource associated with this chart no longer exists')
datasource_id = int(datasource_id)
return datasource_id, datasource_type

def apply_display_max_row_limit(data):
display_limit = app.config.get('DISPLAY_MAX_ROW', None)
Member


NIT [optional]: None is implicit here

if display_limit and display_limit < data['query']['rows']:
data['data'] = data['data'][:display_limit]
data['displayLimitReached'] = True
return data
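For context, a self-contained sketch of the truncation logic shown in the diff above (the limit is passed explicitly here instead of being read from `app.config`, purely so the snippet runs outside Superset; the docstring and typing reflect the review suggestions below):

```python
from typing import Any, Dict, Optional


def apply_display_max_row_limit(
    sql_results: Dict[str, Any],
    display_limit: Optional[int] = 10000,  # stand-in for app.config['DISPLAY_MAX_ROW']
) -> Dict[str, Any]:
    """Truncate the data set in a sql_results payload to display_limit rows.

    Mutates and returns sql_results, setting displayLimitReached so the
    frontend can warn that only a preview of the result set is shown.
    """
    if display_limit and display_limit < sql_results['query']['rows']:
        sql_results['data'] = sql_results['data'][:display_limit]
        sql_results['displayLimitReached'] = True
    return sql_results
```

Note the function returns its (mutated) input so it can be called inline, e.g. wrapped directly in a `json.dumps(...)` call.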
Member


NIT: CR

@@ -184,3 +184,10 @@ def get_datasource_info(datasource_id, datasource_type, form_data):
'The datasource associated with this chart no longer exists')
datasource_id = int(datasource_id)
return datasource_id, datasource_type

def apply_display_max_row_limit(data):
Member


NIT: data is vague here, maybe sql_results (given it's the complex data structure coming out of get_sql_results)

Member


Maybe add a docstring explaining why it's mutating this sql_results payload, something like:

"""
Given a `sql_results` nested structure, applies a limit to the number of rows

`sql_results` here is the nested structure coming out of sql_lab.get_sql_results, it contains
metadata about the query, as well as the data set returned by the query. This method
limits the number of rows and adds a `displayLimitReached: True` flag to the metadata
"""

  payload = json.dumps(
-     data,
+     apply_display_max_row_limit(data),
Member


In the case of a sync request, we may want to apply DISPLAY_MAX_ROW straight onto the SQL itself.

The reason we have 2 limits in the first place is because we want to store the 1M rows in the results backend (say S3) to allow for CSV download of the 1M row as a web request that reads from the s3 bucket, yet not crash the web browser.

Here's another related idea that's been discussed before around this to get data results faster to the user. It's the idea of doing a first dump after a cursor.fetch_many(10000), we'd probably dump that so a specific cache key, and introduce a new query state: SAMPLES_DELIVERED (or something like that) and then fetch the rest cursor.fetch_all() before reaching SUCCESS, and dump the whole result set into another cache key. I think ultimately that's where we want to go.
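The two-phase fetch idea can be sketched with a plain DB-API cursor (sqlite3 here for a runnable example; the SAMPLES_DELIVERED state and the cache keys are hypothetical names from the comment above, not existing Superset APIs):

```python
import sqlite3

SAMPLE_SIZE = 3  # stand-in for the 10000 mentioned above

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (x INTEGER)')
conn.executemany('INSERT INTO t VALUES (?)', [(i,) for i in range(10)])

cur = conn.execute('SELECT x FROM t')

# Phase 1: deliver a quick sample to the user; in Superset this would be
# dumped to its own cache key and the query moved to a new state such as
# SAMPLES_DELIVERED (hypothetical).
sample = cur.fetchmany(SAMPLE_SIZE)

# Phase 2: fetch the remaining rows and store the whole result set under
# another cache key before the query reaches SUCCESS.
rest = cur.fetchall()
full_result = sample + rest
```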

Member Author


If we were applying DISPLAY_MAX_ROW directly onto the query, how would we know to show the user that we limited displayed results on the server side? We wouldn't know whether 10k results were shown because there are only 10k rows in the db or because the query was actually limited.

Member


I'm not sure if that's a good idea, but what about limiting to 10k+1?
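The 10k+1 trick: push `DISPLAY_MAX_ROW + 1` into the SQL LIMIT, and if that extra sentinel row comes back you know the result was cut off rather than naturally that size. A minimal sketch (the helper name is made up; a list stands in for the database):

```python
from typing import List, Tuple


def fetch_for_display(rows: List, display_limit: int) -> Tuple[List, bool]:
    """Simulate querying with LIMIT display_limit + 1.

    Returns at most display_limit rows plus a flag saying whether the
    result was truncated, letting the UI distinguish "only this many rows
    exist" from "the server limited the display".
    """
    fetched = rows[:display_limit + 1]  # what LIMIT display_limit + 1 returns
    limit_reached = len(fetched) > display_limit
    return fetched[:display_limit], limit_reached
```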

Member Author


Hmm, so CSV download always makes an async request? If so then I'm happy to apply DISPLAY_MAX_ROW to the SQL, then make a follow up issue to track the samples_delivered enhancement

Member


The CSV button is only shown after the query was executed, and if it ran async (i.e. it has a results_key), the download reads from the results cache.
https://github.com/apache/incubator-superset/blob/5470d10155e28c96b54615a36464c6c4afc0a38b/superset/views/core.py#L2754

If not async, it reruns the query, but takes a total shortcut that seems like it could be wrong in some cases; that's not related to this PR though... https://github.com/apache/incubator-superset/blob/5470d10155e28c96b54615a36464c6c4afc0a38b/superset/views/core.py#L2770

Member Author


After digging into this further, I wasn't able to find a clean way to apply the query limit for the first call of the query, but then remove it when the query is called again to generate the CSV.

The logic between async vs sync queries and UI display vs CSV download seems like it could all be simplified in some way, but I'm not quite sure of the best way to refactor it at this point. Would it be alright to get this fix merged for now, and then revisit all these performance enhancements/refactors in a later issue + PR?

Member


SGTM!

@etr2460 etr2460 force-pushed the erik-ritter--fix-sqllab-row-limit branch 6 times, most recently from b353a88 to bd97d4f Compare June 5, 2019 04:49
@codecov-io

codecov-io commented Jun 5, 2019

Codecov Report

Merging #7641 into master will decrease coverage by 8.87%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7641      +/-   ##
==========================================
- Coverage   65.59%   56.72%   -8.88%     
==========================================
  Files         437      351      -86     
  Lines       21787    10409   -11378     
  Branches     2400     2396       -4     
==========================================
- Hits        14291     5904    -8387     
+ Misses       7375     4384    -2991     
  Partials      121      121
Impacted Files Coverage Δ
...uperset/assets/src/SqlLab/components/SqlEditor.jsx 55.11% <100%> (+1.51%) ⬆️
...uperset/assets/src/SqlLab/components/SouthPane.jsx 91.17% <0%> (-0.26%) ⬇️
...t/assets/src/dashboard/reducers/getInitialState.js 0% <0%> (ø) ⬆️
superset/assets/src/SqlLab/constants.js 100% <0%> (ø) ⬆️
superset/assets/src/chart/Chart.jsx 10.86% <0%> (ø) ⬆️
superset/assets/src/SqlLab/App.jsx 0% <0%> (ø) ⬆️
superset/data/birth_names.py
superset/data/long_lat.py
superset/connectors/druid/__init__.py
superset/views/datasource.py
... and 80 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f3091c5...4776db3. Read the comment docs.

@etr2460 etr2460 force-pushed the erik-ritter--fix-sqllab-row-limit branch 2 times, most recently from 2bb4fa5 to 26fc13f Compare June 7, 2019 19:54
Member

@mistercrunch mistercrunch left a comment


LGTM



def downgrade():
op.add_column('query', sa.Column('limit_used', sa.BOOLEAN(), nullable=True))
Member


I would suggest you use the batch_alter_table context manager. I know there are some alter commands which are problematic for SQLite.

Member Author


According to other migrations, add_column is safe; it's just drop_column that needs batch_alter_table.
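For reference, the batch_alter_table pattern being discussed looks roughly like this in an Alembic migration. This is a sketch of the convention, not the actual migration in this PR:

```python
# Sketch of an Alembic migration illustrating the split discussed above:
# batch_alter_table recreates the table under the hood, which makes
# drop_column work on SQLite, while add_column can run directly.
import sqlalchemy as sa
from alembic import op


def upgrade():
    with op.batch_alter_table('query') as batch_op:
        batch_op.drop_column('limit_used')


def downgrade():
    # add_column is safe without the batch context, even on SQLite
    op.add_column('query', sa.Column('limit_used', sa.BOOLEAN(), nullable=True))
```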

):
sql_results['data'] = sql_results['data'][:display_limit]
sql_results['displayLimitReached'] = True
return sql_results
Member


Any reason to return sql_results here, given you're mutating the input? Also, could you add Python typing for the inputs/outputs, as this helps with readability? See here for more detail.

Member Author


I'm returning it so I can call the function inline. And types added!

@etr2460 etr2460 force-pushed the erik-ritter--fix-sqllab-row-limit branch from 26fc13f to 4776db3 Compare June 7, 2019 22:56
@john-bodley john-bodley merged commit f7812a3 into apache:master Jun 8, 2019
@saurabh1-singh

Can I get a bit of clarification: what should the variables and their values be in my config if I want to display no more than 5k rows in any query result, but on export to CSV (whether from a chart, dashboard, or SQL Lab) I want to download all rows, which can be 1M+?

@saurabh1-singh

Also, what do ROW_LIMIT and VIZ_ROW_LIMIT do? There's a lot of confusion here.

@mistercrunch mistercrunch added the 🏷️ bot (a label used by `supersetbot` to keep track of which PRs were auto-tagged with release labels) and 🚢 0.34.0 labels Feb 28, 2024
Labels
🏷️ bot (a label used by `supersetbot` to keep track of which PRs were auto-tagged with release labels), size/S, 🚢 0.34.0
Development

Successfully merging this pull request may close these issues.

Sqllab row limit controls don't get applied
5 participants