Skip Scan #3000

Merged
merged 2 commits into from
Apr 9, 2021

svenklemm
Member

@svenklemm svenklemm commented Mar 2, 2021

This patch implements Skip Scan, an optimization for SELECT DISTINCT ON.
Usually for SELECT DISTINCT ON, PostgreSQL will plan either a UNIQUE over a
sorted path, or some form of aggregate. In either case, it needs to scan the
entire table, even in cases where there are only a few unique values.

A skip-scan optimizes this case when we have an ordered index. Instead of
scanning the entire table and deduplicating afterwards, the scan remembers the
last value returned and searches the index for the next value after that one.
This means that for a table with k keys and u distinct values, a skip-scan
runs in time u * log(k), as opposed to scanning then deduplicating, which takes
time k. We can write the number of unique values u as a function of k by
dividing by the number of repeats r, i.e. u = k/r. This means that a skip-scan
will be faster if each key is repeated more than a logarithmic number of times,
i.e. if r > log(k) then u * log(k) < k/log(k) * log(k) = k.
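
The search strategy described above can be sketched in a few lines. This is
only an illustration on a sorted Python list standing in for the ordered
index, not the code in this patch. The function names are made up for the
example, and `bisect_right` plays the role of the index search for the next
value.

```python
from bisect import bisect_right


def full_scan_distinct(sorted_keys):
    """Baseline: visit every entry and drop adjacent duplicates (about k steps)."""
    out = []
    for key in sorted_keys:
        if not out or out[-1] != key:
            out.append(key)
    return out


def skip_scan_distinct(sorted_keys):
    """Skip scan: after returning a value, binary-search for the first
    entry greater than it (about u * log(k) steps)."""
    out = []
    pos = 0
    while pos < len(sorted_keys):
        key = sorted_keys[pos]
        out.append(key)
        # Jump past every remaining duplicate of `key` with one search.
        pos = bisect_right(sorted_keys, key, lo=pos)
    return out


# Hypothetical data: 3 distinct values, each repeated 1000 times.
keys = sorted([3] * 1000 + [1] * 1000 + [2] * 1000)
print(skip_scan_distinct(keys) == full_scan_distinct(keys))
```

Both functions return the same distinct values; the skip scan reaches them
with one binary search per distinct value instead of one step per entry.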

Disable-check: commit-count

@svenklemm svenklemm force-pushed the skipscan branch 3 times, most recently from f6723a3 to 4a650aa Compare March 4, 2021 23:57
@codecov

codecov bot commented Mar 5, 2021

Codecov Report

Merging #3000 (1601905) into master (eace5ea) will increase coverage by 0.00%.
The diff coverage is 94.24%.

❗ Current head 1601905 differs from pull request most recent head 128f84d. Consider uploading reports for the commit 128f84d to get more accurate results

@@           Coverage Diff            @@
##           master    #3000    +/-   ##
========================================
  Coverage   90.28%   90.28%            
========================================
  Files         213      215     +2     
  Lines       34885    35198   +313     
========================================
+ Hits        31495    31778   +283     
- Misses       3390     3420    +30     
| Impacted Files | Coverage Δ |
|---|---|
| src/compat.h | 100.00% <ø> (ø) |
| tsl/src/nodes/skip_scan/exec.c | 94.01% <94.01%> (ø) |
| tsl/src/nodes/skip_scan/planner.c | 94.14% <94.14%> (ø) |
| src/constraint_aware_append.c | 92.50% <100.00%> (+0.11%) ⬆️ |
| src/guc.c | 97.36% <100.00%> (+0.07%) ⬆️ |
| tsl/src/init.c | 83.33% <100.00%> (+0.98%) ⬆️ |
| tsl/src/planner.c | 100.00% <100.00%> (ø) |
| src/loader/bgw_message_queue.c | 84.51% <0.00%> (-2.59%) ⬇️ |
| src/loader/bgw_launcher.c | 89.50% <0.00%> (-2.47%) ⬇️ |

... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update eace5ea...128f84d.

@svenklemm svenklemm force-pushed the skipscan branch 2 times, most recently from 304903a to 60eb779 Compare March 6, 2021 13:56
@svenklemm svenklemm force-pushed the skipscan branch 21 times, most recently from 6003360 to 4b9b272 Compare March 16, 2021 13:32
@svenklemm svenklemm marked this pull request as ready for review March 16, 2021 17:25
@svenklemm svenklemm requested a review from a team as a code owner March 16, 2021 17:25
@svenklemm svenklemm force-pushed the skipscan branch 3 times, most recently from bddcbba to e5cb0db Compare April 8, 2021 16:22
Contributor

@mkindahl mkindahl left a comment

I do not see a check for a table with only null values in it, which could be a bad case for the state machine. You have tests for tables that contain a mix of null and non-null values, but I do not see a test where the column has only nulls.

I'm approving anyway, since this is a minor thing, but please add a test for that.
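
To illustrate the case in question, here is a minimal sketch of a skip scan over a column that may contain nulls, including the all-null input. It is a toy model and not the state machine in this patch: it assumes ascending order with nulls sorted last, represents NULL as Python's `None`, and the helper names are hypothetical.

```python
def first_index_after(keys, key, lo):
    """Binary search for the first position at or after `lo` whose value
    sorts after `key`. None (NULL) sorts after every non-null value."""
    hi = len(keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if keys[mid] is None or keys[mid] > key:
            hi = mid
        else:
            lo = mid + 1
    return lo


def skip_scan_distinct_nulls_last(keys):
    """Distinct values of `keys`, which must be sorted ascending with all
    None entries at the end. NULLs form a single group, as in DISTINCT."""
    out = []
    pos = 0
    while pos < len(keys):
        key = keys[pos]
        out.append(key)
        if key is None:
            # Everything from here on is NULL, so one NULL group ends the scan.
            break
        pos = first_index_after(keys, key, pos)
    return out


print(skip_scan_distinct_nulls_last([1, 1, 2, None, None]))
print(skip_scan_distinct_nulls_last([None, None, None]))
```

In this sketch the all-null column needs no special handling, because the first entry read is already NULL and the scan stops after emitting it once. A regression test for the real implementation would check the same property on a table whose column holds only nulls.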

tsl/src/nodes/skip_scan/planner.c (resolved)
tsl/test/sql/include/skip_scan_query.sql (resolved)
Add helper functions to check whether a CustomPath is a
ChunkAppendPath or a ConstraintAwareAppendPath.
@svenklemm svenklemm force-pushed the skipscan branch 4 times, most recently from 9460222 to 3e3c119 Compare April 9, 2021 20:44
This patch implements SkipScan, an optimization for SELECT DISTINCT ON.
Usually for SELECT DISTINCT ON, PostgreSQL will plan either a UNIQUE over a
sorted path, or some form of aggregate. In either case, it needs to scan the
entire table, even in cases where there are only a few unique values.

A skip scan optimizes this case when we have an ordered index. Instead of
scanning the entire table and deduplicating afterwards, the scan remembers the
last value returned and searches the index for the next value after that one.
This means that for a table with k keys and u distinct values, a skip scan
runs in time u * log(k), as opposed to scanning then deduplicating, which takes
time k. We can write the number of unique values u as a function of k by
dividing by the number of repeats r, i.e. u = k/r. This means that a skip scan
will be faster if each key is repeated more than a logarithmic number of times,
i.e. if r > log(k) then u * log(k) < k/log(k) * log(k) = k.
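
The break-even point in the cost argument above can be checked numerically.
The sketch below is a rough model only: it assumes a base-2 logarithm and
ignores constant factors, neither of which the commit message specifies, and
the function names are invented for the example.

```python
import math


def skip_scan_cost(k, r):
    """Estimated skip scan cost u * log2(k), with u = k / r distinct values."""
    u = k / r
    return u * math.log2(k)


def skip_scan_is_faster(k, r):
    """Compare against a full scan, whose estimated cost is k."""
    return skip_scan_cost(k, r) < k


k = 1_000_000  # log2(k) is a little under 20
for r in (2, 10, 20, 1000):
    print(f"r={r}: skip scan faster: {skip_scan_is_faster(k, r)}")
```

Under these assumptions the skip scan only wins once each value is repeated
more than log2(k) times, which for a million keys means roughly 20 repeats.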

Co-authored-by: Joshua Lockerman <josh@timescale.com>
@svenklemm svenklemm force-pushed the skipscan branch 2 times, most recently from 2af251c to 0b114ee Compare April 9, 2021 21:33
@svenklemm svenklemm merged commit 31e1a9c into timescale:master Apr 9, 2021
k-rus added a commit to k-rus/timescaledb that referenced this pull request Apr 12, 2021
This release adds major new features since the 2.1.1 release.
We deem it moderate priority for upgrading.

This release adds the skip-scan optimization, which significantly
improves performance of queries with DISTINCT ON. This optimization is
not available for queries on distributed hypertables.

For multinode, this release adds a function to create a distributed
restore point, which allows the consistent restore of a multinode
cluster from a backup. The release also includes improvements to
distributed query performance and memory usage.

The bug fixes in this release address issues with size and stats
functions, high memory usage in distributed inserts, slow distributed
ORDER BY queries, indexes involving INCLUDE, and single chunk query
planning.

**Major Features**
* timescale#2843 Add distributed restore point functionality
* timescale#3000 SkipScan to speed up SELECT DISTINCT

**Bugfixes**
* timescale#2989 Refactor and harden size and stats functions
* timescale#3058 Reduce memory usage for distributed inserts
* timescale#3067 Fix extremely slow multi-node order by queries
* timescale#3082 Fix chunk index column name mapping
* timescale#3083 Keep Append pathkeys in ChunkAppend

**Thanks**
* @BowenGG for reporting an issue with indexes with INCLUDE
* @fvannee for reporting an issue with ChunkAppend pathkeys
* @pedrokost and @RobAtticus for reporting an issue with size
  functions on empty hypertables
* @phemmer and @ryanbooz for reporting issues with slow
  multi-node order by queries
* @stephane-moreau for reporting an issue with high memory usage during
  single-transaction inserts on a distributed hypertable.
@k-rus k-rus mentioned this pull request Apr 12, 2021
@svenklemm svenklemm changed the title from "SkipScan" to "Skip Scan" Apr 12, 2021
k-rus added a commit to k-rus/timescaledb that referenced this pull request Apr 12, 2021
This release adds major new features since the 2.1.1 release.
We deem it moderate priority for upgrading.

This release adds the Skip Scan optimization, which significantly
improves the performance of queries with DISTINCT ON. This
optimization is not available for queries on distributed hypertables.

This release also adds a function to create a distributed
restore point, which allows performing a consistent restore of a
multi-node cluster from a backup.

The bug fixes in this release address issues with size and stats
functions, high memory usage in distributed inserts, slow distributed
ORDER BY queries, indexes involving INCLUDE, and single chunk query
planning.

**Major Features**
* timescale#2843 Add distributed restore point functionality
* timescale#3000 SkipScan to speed up SELECT DISTINCT

**Bugfixes**
* timescale#2989 Refactor and harden size and stats functions
* timescale#3058 Reduce memory usage for distributed inserts
* timescale#3067 Fix extremely slow multi-node order by queries
* timescale#3082 Fix chunk index column name mapping
* timescale#3083 Keep Append pathkeys in ChunkAppend

**Thanks**
* @BowenGG for reporting an issue with indexes with INCLUDE
* @fvannee for reporting an issue with ChunkAppend pathkeys
* @pedrokost and @RobAtticus for reporting an issue with size
  functions on empty hypertables
* @phemmer and @ryanbooz for reporting issues with slow
  multi-node order by queries
* @stephane-moreau for reporting an issue with high memory usage during
  single-transaction inserts on a distributed hypertable.
k-rus added a commit to k-rus/timescaledb that referenced this pull request Apr 12, 2021
This release adds major new features since the 2.1.1 release.
We deem it moderate priority for upgrading.

This release adds the Skip Scan optimization, which significantly
improves the performance of queries with DISTINCT ON. This
optimization is not yet available for queries on distributed
hypertables.

This release also adds a function to create a distributed
restore point, which allows performing a consistent restore of a
multi-node cluster from a backup.

The bug fixes in this release address issues with size and stats
functions, high memory usage in distributed inserts, slow distributed
ORDER BY queries, indexes involving INCLUDE, and single chunk query
planning.

**PostgreSQL 11 deprecation announcement**
Timescale is working hard on our next exciting features. To make that
possible, we require functionality that is unfortunately absent on
PostgreSQL 11. For this reason, we will continue supporting PostgreSQL
11 until mid-June 2021. Closer to that time, we will announce the
specific version of TimescaleDB in which PostgreSQL 11 support will
not be included going forward.

**Major Features**
* timescale#2843 Add distributed restore point functionality
* timescale#3000 SkipScan to speed up SELECT DISTINCT

**Bugfixes**
* timescale#2989 Refactor and harden size and stats functions
* timescale#3058 Reduce memory usage for distributed inserts
* timescale#3067 Fix extremely slow multi-node order by queries
* timescale#3082 Fix chunk index column name mapping
* timescale#3083 Keep Append pathkeys in ChunkAppend

**Thanks**
* @BowenGG for reporting an issue with indexes with INCLUDE
* @fvannee for reporting an issue with ChunkAppend pathkeys
* @pedrokost and @RobAtticus for reporting an issue with size
  functions on empty hypertables
* @phemmer and @ryanbooz for reporting issues with slow
  multi-node order by queries
* @stephane-moreau for reporting an issue with high memory usage during
  single-transaction inserts on a distributed hypertable.
k-rus added a commit to k-rus/timescaledb that referenced this pull request Apr 13, 2021
This release adds major new features since the 2.1.1 release.
We deem it moderate priority for upgrading.

This release adds the Skip Scan optimization, which significantly
improves the performance of queries with DISTINCT ON. This
optimization is not yet available for queries on distributed
hypertables.

This release also adds a function to create a distributed
restore point, which allows performing a consistent restore of a
multi-node cluster from a backup.

The bug fixes in this release address issues with size and stats
functions, high memory usage in distributed inserts, slow distributed
ORDER BY queries, indexes involving INCLUDE, and single chunk query
planning.

**PostgreSQL 11 deprecation announcement**

Timescale is working hard on our next exciting features. To make that
possible, we require functionality that is unfortunately absent on
PostgreSQL 11. For this reason, we will continue supporting PostgreSQL
11 until mid-June 2021. Closer to that time, we will announce the
specific version of TimescaleDB in which PostgreSQL 11 support will
not be included going forward.

**Major Features**
* timescale#2843 Add distributed restore point functionality
* timescale#3000 SkipScan to speed up SELECT DISTINCT

**Bugfixes**
* timescale#2989 Refactor and harden size and stats functions
* timescale#3058 Reduce memory usage for distributed inserts
* timescale#3067 Fix extremely slow multi-node order by queries
* timescale#3082 Fix chunk index column name mapping
* timescale#3083 Keep Append pathkeys in ChunkAppend

**Thanks**
* @BowenGG for reporting an issue with indexes with INCLUDE
* @fvannee for reporting an issue with ChunkAppend pathkeys
* @pedrokost and @RobAtticus for reporting an issue with size
  functions on empty hypertables
* @phemmer and @ryanbooz for reporting issues with slow
  multi-node order by queries
* @stephane-moreau for reporting an issue with high memory usage during
  single-transaction inserts on a distributed hypertable.
k-rus added a commit that referenced this pull request Apr 13, 2021
This release adds major new features since the 2.1.1 release.
We deem it moderate priority for upgrading.

This release adds the Skip Scan optimization, which significantly
improves the performance of queries with DISTINCT ON. This
optimization is not yet available for queries on distributed
hypertables.

This release also adds a function to create a distributed
restore point, which allows performing a consistent restore of a
multi-node cluster from a backup.

The bug fixes in this release address issues with size and stats
functions, high memory usage in distributed inserts, slow distributed
ORDER BY queries, indexes involving INCLUDE, and single chunk query
planning.

**PostgreSQL 11 deprecation announcement**

Timescale is working hard on our next exciting features. To make that
possible, we require functionality that is unfortunately absent on
PostgreSQL 11. For this reason, we will continue supporting PostgreSQL
11 until mid-June 2021. Closer to that time, we will announce the
specific version of TimescaleDB in which PostgreSQL 11 support will
not be included going forward.

**Major Features**
* #2843 Add distributed restore point functionality
* #3000 SkipScan to speed up SELECT DISTINCT

**Bugfixes**
* #2989 Refactor and harden size and stats functions
* #3058 Reduce memory usage for distributed inserts
* #3067 Fix extremely slow multi-node order by queries
* #3082 Fix chunk index column name mapping
* #3083 Keep Append pathkeys in ChunkAppend

**Thanks**
* @BowenGG for reporting an issue with indexes with INCLUDE
* @fvannee for reporting an issue with ChunkAppend pathkeys
* @pedrokost and @RobAtticus for reporting an issue with size
  functions on empty hypertables
* @phemmer and @ryanbooz for reporting issues with slow
  multi-node order by queries
* @stephane-moreau for reporting an issue with high memory usage during
  single-transaction inserts on a distributed hypertable.
k-rus added a commit to k-rus/timescaledb that referenced this pull request Apr 13, 2021
@k-rus k-rus mentioned this pull request Apr 13, 2021
k-rus added a commit that referenced this pull request Apr 13, 2021
@svenklemm svenklemm deleted the skipscan branch April 18, 2021 14:00

9 participants