
[SQL Lab] Clarify SQL Lab query and display limits #7641

Merged

Conversation

etr2460
Member

@etr2460 etr2460 commented Jun 3, 2019

CATEGORY

Choose one

  • Bug Fix
  • Enhancement (new features, refinement)
  • Refactor
  • Add tests
  • Build / Development Environment
  • Documentation

SUMMARY

Builds off of comments in #6942. Sets three row limits for the following uses:

  • SQL_MAX_ROW: The upper limit for rows returned from any query. Prevents excessive load on the database.
  • DISPLAY_MAX_ROW: The upper limit for rows displayed in SQL Lab. Prevents browser crashes. Should be <= SQL_MAX_ROW.
  • DEFAULT_SQLLAB_LIMIT: The default limit for queries in SQL Lab. Can be overwritten by the user to any value between 1 and DISPLAY_MAX_ROW.

If anyone has other ideas to simplify this, I'm happy to take another stab at it. However, I think that with these three variables we can represent all levers required for admins and also prevent users from crashing their browsers.
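For illustration, the three limits described above could be set in `superset_config.py` along these lines (the names come from this PR; the values are made-up examples, not defaults):

```python
# Hypothetical superset_config.py fragment illustrating the three limits.

# Hard cap on rows returned from any query; protects the database.
SQL_MAX_ROW = 1000000

# Cap on rows rendered in the SQL Lab results grid; protects the browser.
# Should be <= SQL_MAX_ROW.
DISPLAY_MAX_ROW = 10000

# Default LIMIT applied to SQL Lab queries; the user can override it
# with any value between 1 and DISPLAY_MAX_ROW.
DEFAULT_SQLLAB_LIMIT = 1000

# The intended ordering of the three levers:
assert DEFAULT_SQLLAB_LIMIT <= DISPLAY_MAX_ROW <= SQL_MAX_ROW
```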

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Warning text + tooltip if DISPLAY_MAX_ROW cuts off displayed results:
Screen Shot 2019-06-03 at 1 14 12 PM

TEST PLAN

  1. Set the DISPLAY_MAX_ROW value to something smaller
  2. Set your query limit in the SQL Lab UI higher than the DISPLAY_MAX_ROW value
  3. See LIMIT warning label with the warning message

Also test:

  • Limits below the DISPLAY_MAX_ROW still work
  • CSV download isn't affected by DISPLAY_MAX_ROW
  • The limits are applied on both sync and async queries

ADDITIONAL INFORMATION

REVIEWERS

@john-bodley, @michellethomas, @mistercrunch

@etr2460 etr2460 force-pushed the erik-ritter--fix-sqllab-row-limit branch from 4de0480 to e5abb9a Compare June 3, 2019 21:58
Member

@john-bodley john-bodley left a comment


@etr2460 I think enabling displayLimitReached is a good idea from a readability standpoint, and it also ensures consistency between sync and async.

In the future I still think there would be merit in the UI being more explicit that what is rendered is really a preview of the data, i.e. that one is previewing the first n of m records.

superset/views/core.py Outdated Show resolved Hide resolved
@etr2460 etr2460 force-pushed the erik-ritter--fix-sqllab-row-limit branch from e5abb9a to 6f9c31c Compare June 4, 2019 01:04
@mistercrunch
Member

Happy to see the re-introduction of DISPLAY_MAX_ROW.

There's a question as to whether we want to allow the user to bump the limit above DISPLAY_MAX_ROW, with some sort of confirmation screen "are you sure you want to {...} this could result in 🔥 💻 🔥 ???".

But I think this is right as is. Allowing that could come in a future PR if users need it.

@@ -94,6 +94,8 @@ class Query(Model, ExtraJSONMixin):
sqla.Index('ti_user_id_changed_on', user_id, changed_on),
)

# TODO: Clean up/fix/refactor limit_reached and limit_used. limit_used defaults
Member


Agreed, looks like we can drop both now. I'm guessing they got orphaned by the back and forth around limits.

@@ -184,3 +184,10 @@ def get_datasource_info(datasource_id, datasource_type, form_data):
'The datasource associated with this chart no longer exists')
datasource_id = int(datasource_id)
return datasource_id, datasource_type

def apply_display_max_row_limit(data):
display_limit = app.config.get('DISPLAY_MAX_ROW', None)
Member


NIT [optional]: None is implicit here

if display_limit and display_limit < data['query']['rows']:
data['data'] = data['data'][:display_limit]
data['displayLimitReached'] = True
return data
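For context, a self-contained sketch of the truncation logic shown in the diff above (the limit is passed explicitly here instead of being read from `app.config`, purely so the snippet runs outside Superset; the docstring and typing reflect the review suggestions below):

```python
from typing import Any, Dict, Optional


def apply_display_max_row_limit(
    sql_results: Dict[str, Any],
    display_limit: Optional[int] = 10000,  # stand-in for app.config['DISPLAY_MAX_ROW']
) -> Dict[str, Any]:
    """Truncate the data set in a sql_results payload to display_limit rows.

    Mutates and returns sql_results, setting displayLimitReached so the
    frontend can warn that only a preview of the result set is shown.
    """
    if display_limit and display_limit < sql_results['query']['rows']:
        sql_results['data'] = sql_results['data'][:display_limit]
        sql_results['displayLimitReached'] = True
    return sql_results
```

Note the function returns its (mutated) input so it can be called inline, e.g. wrapped directly in a `json.dumps(...)` call.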
Member


NIT: CR

@@ -184,3 +184,10 @@ def get_datasource_info(datasource_id, datasource_type, form_data):
'The datasource associated with this chart no longer exists')
datasource_id = int(datasource_id)
return datasource_id, datasource_type

def apply_display_max_row_limit(data):
Member


NIT: data is vague here, maybe sql_results (given it's the complex data structure coming out of get_sql_results)

Member


Maybe add a docstring explaining why it's mutating this sql_results payload, something like:

"""
Given a `sql_results` nested structure, applies a limit to the number of rows

`sql_results` here is the nested structure coming out of sql_lab.get_sql_results, it contains
metadata about the query, as well as the data set returned by the query. This method
limits the number of rows and adds a `displayLimitReached: True` flag to the metadata
"""

  payload = json.dumps(
-     data,
+     apply_display_max_row_limit(data),
Member


In the case of a sync request, we may want to apply DISPLAY_MAX_ROW straight onto the SQL itself.

The reason we have 2 limits in the first place is because we want to store the 1M rows in the results backend (say S3) to allow for CSV download of the 1M row as a web request that reads from the s3 bucket, yet not crash the web browser.

Here's another related idea that's been discussed before around this to get data results faster to the user. It's the idea of doing a first dump after a cursor.fetch_many(10000), we'd probably dump that so a specific cache key, and introduce a new query state: SAMPLES_DELIVERED (or something like that) and then fetch the rest cursor.fetch_all() before reaching SUCCESS, and dump the whole result set into another cache key. I think ultimately that's where we want to go.
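The two-phase fetch idea can be sketched with a plain DB-API cursor (sqlite3 here for a runnable example; the SAMPLES_DELIVERED state and the cache keys are hypothetical names from the comment above, not existing Superset APIs):

```python
import sqlite3

SAMPLE_SIZE = 3  # stand-in for the 10000 mentioned above

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (x INTEGER)')
conn.executemany('INSERT INTO t VALUES (?)', [(i,) for i in range(10)])

cur = conn.execute('SELECT x FROM t')

# Phase 1: deliver a quick sample to the user; in Superset this would be
# dumped to its own cache key and the query moved to a new state such as
# SAMPLES_DELIVERED (hypothetical).
sample = cur.fetchmany(SAMPLE_SIZE)

# Phase 2: fetch the remaining rows and store the whole result set under
# another cache key before the query reaches SUCCESS.
rest = cur.fetchall()
full_result = sample + rest
```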

Member Author


If we were applying DISPLAY_MAX_ROW directly onto the query, how would we know to show the user that we limited displayed results on the server side? We wouldn't know whether 10k results were shown because there are only 10k rows in the db or because the query was actually limited.

Member


I'm not sure if that's a good idea, but what about limiting to 10k+1?
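The 10k+1 trick: push `DISPLAY_MAX_ROW + 1` into the SQL LIMIT, and if that extra sentinel row comes back you know the result was cut off rather than naturally that size. A minimal sketch (the helper name is made up; a list stands in for the database):

```python
from typing import List, Tuple


def fetch_for_display(rows: List, display_limit: int) -> Tuple[List, bool]:
    """Simulate querying with LIMIT display_limit + 1.

    Returns at most display_limit rows plus a flag saying whether the
    result was truncated, letting the UI distinguish "only this many rows
    exist" from "the server limited the display".
    """
    fetched = rows[:display_limit + 1]  # what LIMIT display_limit + 1 returns
    limit_reached = len(fetched) > display_limit
    return fetched[:display_limit], limit_reached
```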

Member Author


Hmm, so CSV download always makes an async request? If so then I'm happy to apply DISPLAY_MAX_ROW to the SQL, then make a follow up issue to track the samples_delivered enhancement

Member


The CSV button is only shown after the query was executed, and if it ran async (i.e. it has a results_key), the download reads from the results cache.
https://github.com/apache/incubator-superset/blob/5470d10155e28c96b54615a36464c6c4afc0a38b/superset/views/core.py#L2754

If not async, it reruns the query, but takes a total shortcut that seems like it could be wrong in some cases; that's not related to this PR though... https://github.com/apache/incubator-superset/blob/5470d10155e28c96b54615a36464c6c4afc0a38b/superset/views/core.py#L2770

Member Author


After digging into this further, I wasn't able to find a clean way to apply the query limit for the first call of the query, but then remove it when the query is called again to generate the CSV.

The logic between async vs sync queries and UI display vs CSV download seems like it could all be simplified in some way, but I'm not quite sure of the best way to refactor it at this point. Would it be alright to get this fix merged for now, and then revisit all these performance enhancements/refactors in a later issue + PR?

Member


SGTM!

@etr2460 etr2460 force-pushed the erik-ritter--fix-sqllab-row-limit branch 6 times, most recently from b353a88 to bd97d4f Compare June 5, 2019 04:49
@codecov-io

codecov-io commented Jun 5, 2019

Codecov Report

Merging #7641 into master will decrease coverage by 8.87%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7641      +/-   ##
==========================================
- Coverage   65.59%   56.72%   -8.88%     
==========================================
  Files         437      351      -86     
  Lines       21787    10409   -11378     
  Branches     2400     2396       -4     
==========================================
- Hits        14291     5904    -8387     
+ Misses       7375     4384    -2991     
  Partials      121      121
Impacted Files Coverage Δ
...uperset/assets/src/SqlLab/components/SqlEditor.jsx 55.11% <100%> (+1.51%) ⬆️
...uperset/assets/src/SqlLab/components/SouthPane.jsx 91.17% <0%> (-0.26%) ⬇️
...t/assets/src/dashboard/reducers/getInitialState.js 0% <0%> (ø) ⬆️
superset/assets/src/SqlLab/constants.js 100% <0%> (ø) ⬆️
superset/assets/src/chart/Chart.jsx 10.86% <0%> (ø) ⬆️
superset/assets/src/SqlLab/App.jsx 0% <0%> (ø) ⬆️
superset/data/birth_names.py
superset/data/long_lat.py
superset/connectors/druid/__init__.py
superset/views/datasource.py
... and 80 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f3091c5...4776db3. Read the comment docs.

@etr2460 etr2460 force-pushed the erik-ritter--fix-sqllab-row-limit branch 2 times, most recently from 2bb4fa5 to 26fc13f Compare June 7, 2019 19:54
Member

@mistercrunch mistercrunch left a comment


LGTM



def downgrade():
op.add_column('query', sa.Column('limit_used', sa.BOOLEAN(), nullable=True))
Member


I would suggest you use the batch_alter_table context manager. I know there are some alter commands which are problematic for SQLite.

Member Author


According to other migrations, add_column is safe; it's just drop_column that needs batch_alter_table.
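For reference, the batch_alter_table pattern being discussed looks roughly like this in an Alembic migration. This is a sketch of the convention, not the actual migration in this PR:

```python
# Sketch of an Alembic migration illustrating the split discussed above:
# batch_alter_table recreates the table under the hood, which makes
# drop_column work on SQLite, while add_column can run directly.
import sqlalchemy as sa
from alembic import op


def upgrade():
    with op.batch_alter_table('query') as batch_op:
        batch_op.drop_column('limit_used')


def downgrade():
    # add_column is safe without the batch context, even on SQLite
    op.add_column('query', sa.Column('limit_used', sa.BOOLEAN(), nullable=True))
```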

):
sql_results['data'] = sql_results['data'][:display_limit]
sql_results['displayLimitReached'] = True
return sql_results
Member


Any reason to return sql_results here, given you're mutating the input? Also, could you add Python typing for the inputs/outputs, as this helps with readability? See here for more detail.

Member Author


I'm returning it so I can call the function inline. And types added!

@etr2460 etr2460 force-pushed the erik-ritter--fix-sqllab-row-limit branch from 26fc13f to 4776db3 Compare June 7, 2019 22:56
@john-bodley john-bodley merged commit f7812a3 into apache:master Jun 8, 2019
@saurabh1-singh

Can I get a bit of clarification: what should the variables and their values be in my config if I want to display no more than 5k rows in any query result, but on export to CSV (whether from a chart, dashboard, or SQL Lab) I want to download all rows, which can be 1M+?

@saurabh1-singh

Also, what do ROW_LIMIT and VIZ_ROW_LIMIT do? There's a lot of confusion here.

@mistercrunch mistercrunch added the 🏷️ bot (a label used by `supersetbot` to keep track of which PRs were auto-tagged with release labels) and 🚢 0.34.0 labels Feb 28, 2024
Labels
🏷️ bot (a label used by `supersetbot` to keep track of which PRs were auto-tagged with release labels), size/S, 🚢 0.34.0
Development

Successfully merging this pull request may close these issues.

Sqllab row limit controls don't get applied
5 participants