Skip Scan #3000
Codecov Report
@@ Coverage Diff @@
## master #3000 +/- ##
========================================
Coverage 90.28% 90.28%
========================================
Files 213 215 +2
Lines 34885 35198 +313
========================================
+ Hits 31495 31778 +283
- Misses 3390 3420 +30
I do not see a check for a table with only null values in it, which could be a potentially bad case for the state machine. You have tests for tables that contain a mix of null and non-null values, but I do not see a test where the column has only nulls.
I'm approving anyway, since this is a minor thing, but please add a test for that.
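The all-NULL case mentioned above can be illustrated with a toy model. This is only a sketch, not the extension's state machine or its regression tests: the function name `distinct_nulls_last` is hypothetical, a sorted Python list stands in for the index, and it assumes an ascending index with NULLs ordered last (PostgreSQL's default for ascending order).

```python
from bisect import bisect_right


def distinct_nulls_last(column):
    """Distinct values of `column`, emulating an ascending index with NULLS LAST.

    Hypothetical helper for illustration only; None stands in for SQL NULL.
    """
    # Sort key: non-null values first (in order), then every None.
    keys = sorted((v is None, 0 if v is None else v) for v in column)
    out, pos = [], 0
    while pos < len(keys):
        is_null, value = keys[pos]
        out.append(None if is_null else value)
        # Jump past every entry equal to the current key,
        # including a trailing run of NULLs.
        pos = bisect_right(keys, keys[pos], pos)
    return out


# Mixed NULL / non-NULL column, then the all-NULL column the review asks about.
mixed = distinct_nulls_last([2, None, 1, 1, None])
only_nulls = distinct_nulls_last([None, None, None])
```

In this model the all-NULL column yields a single NULL group and the loop still terminates, which is the behaviour a dedicated test would pin down.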
Add helper functions to check whether a CustomPath is a ChunkAppendPath and ConstraintAwareAppendPath.
This patch implements SkipScan, an optimization for SELECT DISTINCT ON. Usually for SELECT DISTINCT ON, postgres will plan either a UNIQUE over a sorted path, or some form of aggregate. In either case, it needs to scan the entire table, even in cases where there are only a few unique values.

A skip scan optimizes this case when we have an ordered index. Instead of scanning the entire table and deduplicating after, the scan remembers the last value returned, and searches the index for the next value after that one. This means that for a table with k keys, with u distinct values, a skip scan runs in time u * log(k) as opposed to scanning then deduplicating, which takes time k. We can write the number of unique values u as a function of k by dividing by the number of repeats r, i.e. u = k/r. This means that a skip scan will be faster if each key is repeated more than a logarithmic number of times, i.e. if r > log(k) then u * log(k) < k/log(k) * log(k) = k.

Co-authored-by: Joshua Lockerman <josh@timescale.com>
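The cost argument in this commit message can be written out as a short derivation. The logarithm base and the sample numbers below are assumptions chosen for illustration (base 2, $k = 10^6$, $r = 1000$), and constant factors are ignored, as in the message itself.

$$
u \log k \;=\; \frac{k}{r}\,\log k \;<\; k
\quad\Longleftrightarrow\quad
r \;>\; \log k
$$

$$
k = 10^{6},\quad r = 10^{3}
\;\Rightarrow\;
u = \frac{k}{r} = 10^{3},\quad
\log_2 k \approx 19.93,\quad
u \log_2 k \approx 1.99 \times 10^{4} \;<\; 10^{6} = k
$$

Under these assumed numbers each key repeats far more than $\log_2 k$ times, so the skip scan does roughly fifty times less work than the full scan.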
This release adds major new features since the 2.1.1 release. We deem it moderate priority for upgrading.

This release adds the skip-scan optimization, which significantly improves performance of queries with DISTINCT ON. This optimization is not available for queries on distributed hypertables.

For multinode, this release adds a function to create a distributed restore point, which allows the consistent restore of a multinode cluster from a backup. The release also includes improvements to distributed query performance and memory usage.

The bug fixes in this release address issues with size and stats functions, high memory usage in distributed inserts, slow distributed ORDER BY queries, indexes involving INCLUDE, and single chunk query planning.

**Major Features**
* timescale#2843 Add distributed restore point functionality
* timescale#3000 SkipScan to speed up SELECT DISTINCT

**Bugfixes**
* timescale#2989 Refactor and harden size and stats functions
* timescale#3058 Reduce memory usage for distributed inserts
* timescale#3067 Fix extremely slow multi-node order by queries
* timescale#3082 Fix chunk index column name mapping
* timescale#3083 Keep Append pathkeys in ChunkAppend

**Thanks**
* @BowenGG for reporting an issue with indexes with INCLUDE
* @fvannee for reporting an issue with ChunkAppend pathkeys
* @pedrokost and @RobAtticus for reporting an issue with size functions on empty hypertables
* @phemmer and @ryanbooz for reporting issues with slow multi-node order by queries
* @stephane-moreau for reporting an issue with high memory usage during single-transaction inserts on a distributed hypertable.
This release adds major new features since the 2.1.1 release. We deem it moderate priority for upgrading.

This release adds the Skip Scan optimization, which significantly improves the performance of queries with DISTINCT ON. This optimization is not available for queries on distributed hypertables.

This release also adds a function to create a distributed restore point, which allows performing a consistent restore of a multi-node cluster from a backup.

The bug fixes in this release address issues with size and stats functions, high memory usage in distributed inserts, slow distributed ORDER BY queries, indexes involving INCLUDE, and single chunk query planning.

**Major Features**
* timescale#2843 Add distributed restore point functionality
* timescale#3000 SkipScan to speed up SELECT DISTINCT

**Bugfixes**
* timescale#2989 Refactor and harden size and stats functions
* timescale#3058 Reduce memory usage for distributed inserts
* timescale#3067 Fix extremely slow multi-node order by queries
* timescale#3082 Fix chunk index column name mapping
* timescale#3083 Keep Append pathkeys in ChunkAppend

**Thanks**
* @BowenGG for reporting an issue with indexes with INCLUDE
* @fvannee for reporting an issue with ChunkAppend pathkeys
* @pedrokost and @RobAtticus for reporting an issue with size functions on empty hypertables
* @phemmer and @ryanbooz for reporting issues with slow multi-node order by queries
* @stephane-moreau for reporting an issue with high memory usage during single-transaction inserts on a distributed hypertable.
This release adds major new features since the 2.1.1 release. We deem it moderate priority for upgrading.

This release adds the Skip Scan optimization, which significantly improves the performance of queries with DISTINCT ON. This optimization is not yet available for queries on distributed hypertables.

This release also adds a function to create a distributed restore point, which allows performing a consistent restore of a multi-node cluster from a backup.

The bug fixes in this release address issues with size and stats functions, high memory usage in distributed inserts, slow distributed ORDER BY queries, indexes involving INCLUDE, and single chunk query planning.

**PostgreSQL 11 deprecation announcement**

Timescale is working hard on our next exciting features. To make that possible, we require functionality that is unfortunately absent on PostgreSQL 11. For this reason, we will continue supporting PostgreSQL 11 until mid-June 2021. Closer to that time, we will announce the specific version of TimescaleDB in which PostgreSQL 11 support will not be included going forward.

**Major Features**
* timescale#2843 Add distributed restore point functionality
* timescale#3000 SkipScan to speed up SELECT DISTINCT

**Bugfixes**
* timescale#2989 Refactor and harden size and stats functions
* timescale#3058 Reduce memory usage for distributed inserts
* timescale#3067 Fix extremely slow multi-node order by queries
* timescale#3082 Fix chunk index column name mapping
* timescale#3083 Keep Append pathkeys in ChunkAppend

**Thanks**
* @BowenGG for reporting an issue with indexes with INCLUDE
* @fvannee for reporting an issue with ChunkAppend pathkeys
* @pedrokost and @RobAtticus for reporting an issue with size functions on empty hypertables
* @phemmer and @ryanbooz for reporting issues with slow multi-node order by queries
* @stephane-moreau for reporting an issue with high memory usage during single-transaction inserts on a distributed hypertable.
This release adds major new features since the 2.1.1 release. We deem it moderate priority for upgrading.

This release adds the Skip Scan optimization, which significantly improves the performance of queries with DISTINCT ON. This optimization is not yet available for queries on distributed hypertables.

This release also adds a function to create a distributed restore point, which allows performing a consistent restore of a multi-node cluster from a backup.

The bug fixes in this release address issues with size and stats functions, high memory usage in distributed inserts, slow distributed ORDER BY queries, indexes involving INCLUDE, and single chunk query planning.

**PostgreSQL 11 deprecation announcement**

Timescale is working hard on our next exciting features. To make that possible, we require functionality that is unfortunately absent on PostgreSQL 11. For this reason, we will continue supporting PostgreSQL 11 until mid-June 2021. Closer to that time, we will announce the specific version of TimescaleDB in which PostgreSQL 11 support will not be included going forward.

**Major Features**
* #2843 Add distributed restore point functionality
* #3000 SkipScan to speed up SELECT DISTINCT

**Bugfixes**
* #2989 Refactor and harden size and stats functions
* #3058 Reduce memory usage for distributed inserts
* #3067 Fix extremely slow multi-node order by queries
* #3082 Fix chunk index column name mapping
* #3083 Keep Append pathkeys in ChunkAppend

**Thanks**
* @BowenGG for reporting an issue with indexes with INCLUDE
* @fvannee for reporting an issue with ChunkAppend pathkeys
* @pedrokost and @RobAtticus for reporting an issue with size functions on empty hypertables
* @phemmer and @ryanbooz for reporting issues with slow multi-node order by queries
* @stephane-moreau for reporting an issue with high memory usage during single-transaction inserts on a distributed hypertable.
This patch implements a Skip Scan, an optimization for `SELECT DISTINCT ON`. Usually for `SELECT DISTINCT ON`, postgres will plan either a `UNIQUE` over a sorted path, or some form of aggregate. In either case, it needs to scan the entire table, even in cases where there are only a few unique values.

A skip-scan optimizes this case when we have an ordered index. Instead of scanning the entire table and deduplicating after, the scan remembers the last value returned, and searches the index for the next value after that one. This means that for a table with `k` keys, with `u` distinct values, a skip-scan runs in time `u * log(k)` as opposed to scanning then deduplicating, which takes time `k`. We can write the number of unique values `u` as a function of `k` by dividing by the number of repeats `r`, i.e. `u = k/r`. This means that a skip-scan will be faster if each key is repeated more than a logarithmic number of times, i.e. if `r > log(k)` then `u * log(k) < k/log(k) * log(k) = k`.

Disable-check: commit-count
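The idea described in this PR ("remember the last value returned, then search the index for the next value after it") can be sketched in a few lines. This is a simplified model, not the patch's C implementation: a sorted Python list stands in for the ordered index, `bisect_right` stands in for an index re-descent, and the function name `skip_scan_distinct` is hypothetical.

```python
from bisect import bisect_right


def skip_scan_distinct(sorted_keys):
    """Return (distinct values, number of index probes).

    `sorted_keys` must be in ascending order; it models an ordered index.
    Hypothetical helper for illustration only.
    """
    distinct = []
    probes = 0
    pos = 0
    n = len(sorted_keys)
    while pos < n:
        value = sorted_keys[pos]  # first entry of the next group
        distinct.append(value)
        # Binary-search for the first position whose key is greater
        # than the value just returned (one O(log k) probe).
        pos = bisect_right(sorted_keys, value, pos, n)
        probes += 1
    return distinct, probes


# k = 3000 keys, u = 3 distinct values, r = 1000 repeats each.
keys = [1] * 1000 + [2] * 1000 + [3] * 1000
values, probes = skip_scan_distinct(keys)
```

Here the loop performs one logarithmic probe per distinct value, so its work grows with `u * log(k)`, whereas deduplicating after a full scan would visit all `k` entries.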