Skip Scan #3000

svenklemm · 2021-03-02T21:02:51Z

This patch implements Skip Scan, an optimization for SELECT DISTINCT ON.
Usually for SELECT DISTINCT ON, Postgres will plan either a UNIQUE over a
sorted path or some form of aggregate. In either case, it needs to scan the
entire table, even when there are only a few unique values.

A skip scan optimizes this case when we have an ordered index. Instead of
scanning the entire table and deduplicating afterwards, the scan remembers the
last value returned and searches the index for the next value after that one.
This means that for a table with k keys and u distinct values, a skip scan
runs in time u * log(k), as opposed to scanning and then deduplicating, which
takes time k. We can write the number of unique values u as a function of k by
dividing by the number of repeats r, i.e. u = k/r. This means that a skip scan
will be faster if each key is repeated more than a logarithmic number of
times, i.e. if r > log(k) then
u * log(k) = (k/r) * log(k) < (k/log(k)) * log(k) = k.
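
To make the idea concrete, here is a minimal Python sketch, not the code from this patch. It assumes a sorted in-memory list stands in for the ordered index and a binary search stands in for the index lookup. The names `skip_scan_distinct` and `estimated_costs` are made up for illustration.

```python
from bisect import bisect_right
import math


def skip_scan_distinct(sorted_keys):
    """Return the distinct values of an ascending list, skip-scan style.

    The sorted list is a stand-in for an ordered index. After emitting a
    value, we binary-search for the first entry strictly greater than it
    instead of stepping over every duplicate.
    """
    distinct = []
    pos = 0
    while pos < len(sorted_keys):
        value = sorted_keys[pos]
        distinct.append(value)
        # Jump past all remaining copies of `value` in O(log k) comparisons.
        pos = bisect_right(sorted_keys, value, lo=pos)
    return distinct


def estimated_costs(k, u):
    """Rough cost model from the description: (u * log2(k), k)."""
    return u * math.log2(k), k


keys = sorted([3, 1, 2, 1, 3, 3, 2, 1, 1, 3])
print(skip_scan_distinct(keys))              # distinct values in index order
print(estimated_costs(k=1_000_000, u=10))    # (skip-scan estimate, full scan)
```

With few distinct values and many repeats, the first number returned by `estimated_costs` is far below the second. When almost every key is unique, the plain scan wins.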

Disable-check: commit-count

codecov · 2021-03-05T00:10:48Z

Codecov Report

Merging #3000 (1601905) into master (eace5ea) will increase coverage by 0.00%.
The diff coverage is 94.24%.

❗ Current head 1601905 differs from pull request most recent head 128f84d. Consider uploading reports for the commit 128f84d to get more accurate results

@@           Coverage Diff            @@
##           master    #3000    +/-   ##
========================================
  Coverage   90.28%   90.28%            
========================================
  Files         213      215     +2     
  Lines       34885    35198   +313     
========================================
+ Hits        31495    31778   +283     
- Misses       3390     3420    +30

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/compat.h | `100.00% <ø> (ø)` | |
| tsl/src/nodes/skip_scan/exec.c | `94.01% <94.01%> (ø)` | |
| tsl/src/nodes/skip_scan/planner.c | `94.14% <94.14%> (ø)` | |
| src/constraint_aware_append.c | `92.50% <100.00%> (+0.11%)` | ⬆️ |
| src/guc.c | `97.36% <100.00%> (+0.07%)` | ⬆️ |
| tsl/src/init.c | `83.33% <100.00%> (+0.98%)` | ⬆️ |
| tsl/src/planner.c | `100.00% <100.00%> (ø)` | |
| src/loader/bgw_message_queue.c | `84.51% <0.00%> (-2.59%)` | ⬇️ |
| src/loader/bgw_launcher.c | `89.50% <0.00%> (-2.47%)` | ⬇️ |
| ... and 1 more | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

mkindahl

I do not see a check for a table with only null values in it, which is potentially a bad case for the state machine. You have tests for tables that contain a mix of null and non-null values, but I do not see a test where the column has only nulls.

I'm approving anyway, since this is a minor thing, but please add a test for that.

tsl/src/nodes/skip_scan/planner.c

tsl/test/sql/include/skip_scan_query.sql

This patch implements SkipScan, an optimization for SELECT DISTINCT ON. Usually for SELECT DISTINCT ON, Postgres will plan either a UNIQUE over a sorted path or some form of aggregate. In either case, it needs to scan the entire table, even when there are only a few unique values.

A skip scan optimizes this case when we have an ordered index. Instead of scanning the entire table and deduplicating afterwards, the scan remembers the last value returned and searches the index for the next value after that one. This means that for a table with k keys and u distinct values, a skip scan runs in time u * log(k), as opposed to scanning and then deduplicating, which takes time k. We can write the number of unique values u as a function of k by dividing by the number of repeats r, i.e. u = k/r. This means that a skip scan will be faster if each key is repeated more than a logarithmic number of times, i.e. if r > log(k) then u * log(k) = (k/r) * log(k) < (k/log(k)) * log(k) = k.

Co-authored-by: Joshua Lockerman <josh@timescale.com>

@BowenGG

This release adds major new features since the 2.1.1 release. We deem it moderate priority for upgrading.

This release adds the skip-scan optimization, which significantly improves performance of queries with DISTINCT ON. This optimization is not available for queries on distributed hypertables. For multinode, this release adds a function to create a distributed restore point, which allows the consistent restore of a multinode cluster from a backup. The release also includes improvements to distributed query performance and memory usage.

The bug fixes in this release address issues with size and stats functions, high memory usage in distributed inserts, slow distributed ORDER BY queries, indexes involving INCLUDE, and single chunk query planning.

**Major Features**
* timescale#2843 Add distributed restore point functionality
* timescale#3000 SkipScan to speed up SELECT DISTINCT

**Bugfixes**
* timescale#2989 Refactor and harden size and stats functions
* timescale#3058 Reduce memory usage for distributed inserts
* timescale#3067 Fix extremely slow multi-node order by queries
* timescale#3082 Fix chunk index column name mapping
* timescale#3083 Keep Append pathkeys in ChunkAppend

**Thanks**
* @BowenGG for reporting an issue with indexes with INCLUDE
* @fvannee for reporting an issue with ChunkAppend pathkeys
* @pedrokost and @RobAtticus for reporting an issue with size functions on empty hypertables
* @phemmer and @ryanbooz for reporting issues with slow multi-node order by queries
* @stephane-moreau for reporting an issue with high memory usage during single-transaction inserts on a distributed hypertable.

@BowenGG

This release adds major new features since the 2.1.1 release. We deem it moderate priority for upgrading.

This release adds the Skip Scan optimization, which significantly improves the performance of queries with DISTINCT ON. This optimization is not yet available for queries on distributed hypertables.

This release also adds a function to create a distributed restore point, which allows performing a consistent restore of a multi-node cluster from a backup.

The bug fixes in this release address issues with size and stats functions, high memory usage in distributed inserts, slow distributed ORDER BY queries, indexes involving INCLUDE, and single chunk query planning.

**PostgreSQL 11 deprecation announcement**

Timescale is working hard on our next exciting features. To make that possible, we require functionality that is unfortunately absent on PostgreSQL 11. For this reason, we will continue supporting PostgreSQL 11 until mid-June 2021. Closer to that time, we will announce the specific version of TimescaleDB in which PostgreSQL 11 support will not be included going forward.

**Major Features**
* timescale#2843 Add distributed restore point functionality
* timescale#3000 SkipScan to speed up SELECT DISTINCT

**Bugfixes**
* timescale#2989 Refactor and harden size and stats functions
* timescale#3058 Reduce memory usage for distributed inserts
* timescale#3067 Fix extremely slow multi-node order by queries
* timescale#3082 Fix chunk index column name mapping
* timescale#3083 Keep Append pathkeys in ChunkAppend

**Thanks**
* @BowenGG for reporting an issue with indexes with INCLUDE
* @fvannee for reporting an issue with ChunkAppend pathkeys
* @pedrokost and @RobAtticus for reporting an issue with size functions on empty hypertables
* @phemmer and @ryanbooz for reporting issues with slow multi-node order by queries
* @stephane-moreau for reporting an issue with high memory usage during single-transaction inserts on a distributed hypertable.

svenklemm force-pushed the skipscan branch 3 times, most recently from f6723a3 to 4a650aa Compare March 4, 2021 23:57

svenklemm force-pushed the skipscan branch 2 times, most recently from 304903a to 60eb779 Compare March 6, 2021 13:56

svenklemm force-pushed the skipscan branch 21 times, most recently from 6003360 to 4b9b272 Compare March 16, 2021 13:32

NunoFilipeSantos assigned svenklemm Mar 16, 2021

svenklemm marked this pull request as ready for review March 16, 2021 17:25

svenklemm requested a review from a team as a code owner March 16, 2021 17:25

svenklemm force-pushed the skipscan branch 3 times, most recently from bddcbba to e5cb0db Compare April 8, 2021 16:22

mkindahl approved these changes Apr 8, 2021

svenklemm force-pushed the skipscan branch from e5cb0db to 12b0105 Compare April 9, 2021 11:20

NunoFilipeSantos added this to the 2.2 milestone Apr 9, 2021

svenklemm force-pushed the skipscan branch 3 times, most recently from 05671a1 to 57fd98a Compare April 9, 2021 13:23

Add ts_is_chunk_append_path and ts_is_constraint_aware_append_path

258a0d3

Add helper functions to check whether a CustomPath is a ChunkAppendPath or a ConstraintAwareAppendPath.

svenklemm force-pushed the skipscan branch 4 times, most recently from 9460222 to 3e3c119 Compare April 9, 2021 20:44

svenklemm force-pushed the skipscan branch 2 times, most recently from 2af251c to 0b114ee Compare April 9, 2021 21:33

svenklemm merged commit 31e1a9c into timescale:master Apr 9, 2021

k-rus mentioned this pull request Apr 12, 2021

Release 2.2.0 #3103

Merged

svenklemm changed the title ~~SkipScan~~ Skip Scan Apr 12, 2021

k-rus mentioned this pull request Apr 13, 2021

Release 2.2.0 #3108

Merged

svenklemm deleted the skipscan branch April 18, 2021 14:00

jameswinegar mentioned this pull request May 19, 2021

Aggregates with DISTINCT are not supported #3247

Closed
