Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use consistent snapshots when scanning metadata #5428

Merged

Conversation

erimatnor
Copy link
Contributor

Invalidate the catalog snapshot in the scanner to ensure that any lookups into pg_catalog uses a snapshot that is consistent with the snapshot used to scan TimescaleDB metadata.

This fixes an issue where a chunk could be looked up without having a proper relid filled in, causing an assertion failure (ASSERT_IS_VALID_CHUNK). When a chunk is scanned and found (in chunk_tuple_found()), the Oid of the chunk table is filled in using get_relname_relid(), which could return InvalidOid due to use of a different snapshot when scanning pg_class. Calling InvalidateCatalogSnapshot() before starting the metadata scan in Scanner ensures the pg_catalog snapshot used is refreshed.

Due to the difficulty of reproducing this MVCC issue, there is not yet a regression or isolation test for it, but is easy to hit when doing repeated concurrent COPY into a distributed hypertable.

@erimatnor erimatnor added the bug label Mar 10, 2023
@erimatnor erimatnor force-pushed the invalidate-catalog-snapshot-in-scanner branch from af95d2e to 413d12f Compare March 10, 2023 14:36
@codecov
Copy link

codecov bot commented Mar 10, 2023

Codecov Report

Merging #5428 (57dc6bd) into main (7d6cf90) will decrease coverage by 0.05%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #5428      +/-   ##
==========================================
- Coverage   90.72%   90.68%   -0.05%     
==========================================
  Files         228      228              
  Lines       53096    53092       -4     
==========================================
- Hits        48172    48146      -26     
- Misses       4924     4946      +22     
Impacted Files Coverage Δ
src/scanner.c 95.02% <100.00%> (+0.02%) ⬆️

... and 5 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@svenklemm
Copy link
Member

I think this might also be the real reason for #4900 which occasionally still happens in CI

@erimatnor erimatnor force-pushed the invalidate-catalog-snapshot-in-scanner branch from 413d12f to 6bba6c2 Compare March 17, 2023 08:37
@erimatnor erimatnor marked this pull request as ready for review March 17, 2023 08:37
@erimatnor erimatnor requested a review from akuzm March 17, 2023 08:39
@erimatnor erimatnor force-pushed the invalidate-catalog-snapshot-in-scanner branch 2 times, most recently from bc1ca48 to d63ad8d Compare March 20, 2023 15:44
Invalidate the catalog snapshot in the scanner to ensure that any
lookups into `pg_catalog` uses a snapshot that is consistent with the
snapshot used to scan TimescaleDB metadata.

This fixes an issue where a chunk could be looked up without having a
proper relid filled in, causing an assertion failure
(`ASSERT_IS_VALID_CHUNK`). When a chunk is scanned and found (in
`chunk_tuple_found()`), the Oid of the chunk table is filled in using
`get_relname_relid()`, which could return InvalidOid due to use of a
different snapshot when scanning `pg_class`. Calling
`InvalidateCatalogSnapshot()` before starting the metadata scan in
`Scanner` ensures the pg_catalog snapshot used is refreshed.

Due to the difficulty of reproducing this MVCC issue, no regression or
isolation test is provided, but it is easy to hit this bug when doing
highly concurrent COPY:s into a distributed hypertable.
@erimatnor erimatnor force-pushed the invalidate-catalog-snapshot-in-scanner branch from d63ad8d to 57dc6bd Compare March 21, 2023 07:53
@erimatnor erimatnor merged commit 63b416b into timescale:main Mar 21, 2023
@erimatnor erimatnor deleted the invalidate-catalog-snapshot-in-scanner branch March 21, 2023 09:34
akuzm added a commit to akuzm/timescaledb that referenced this pull request Apr 19, 2023
 ## 2.10.2 (2023-04-20)

**Bugfixes**
* timescale#5410 Fix file trailer handling in the COPY fetcher
* timescale#5446 Add checks for malloc failure in libpq calls
* timescale#5233 Out of on_proc_exit slots on guc license change
* timescale#5428 Use consistent snapshots when scanning metadata
* timescale#5499 Do not segfault on large histogram() parameters
* timescale#5470 Ensure superuser perms during copy/move chunk
* timescale#5500 Fix when no FROM clause in continuous aggregate definition
* timescale#5433 Fix join rte in CAggs with joins
* timescale#5556 Fix duplicated entries on timescaledb_experimental.policies view
* timescale#5462 Fix segfault after column drop on compressed table
* timescale#5543 Copy scheduled_jobs list before sorting it
* timescale#5497 Allow named time_bucket arguments in Cagg definition
* timescale#5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket
* timescale#5558 Use regrole for job owner
* timescale#5542 Enable indexscan on uncompressed part of partially compressed chunks

**Thanks**
* @nikolaps for reporting an issue with the COPY fetcher
* @S-imo-n for reporting the issue on Background Worker Scheduler crash
* @geezhu for reporting issue on segfault in historgram()
* @mwahlhuetter for reporting the issue with joins in CAggs
* @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view
* @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
akuzm added a commit that referenced this pull request Apr 19, 2023
 ## 2.10.2 (2023-04-20)

**Bugfixes**
* #5410 Fix file trailer handling in the COPY fetcher
* #5446 Add checks for malloc failure in libpq calls
* #5233 Out of on_proc_exit slots on guc license change
* #5428 Use consistent snapshots when scanning metadata
* #5499 Do not segfault on large histogram() parameters
* #5470 Ensure superuser perms during copy/move chunk
* #5500 Fix when no FROM clause in continuous aggregate definition
* #5433 Fix join rte in CAggs with joins
* #5556 Fix duplicated entries on timescaledb_experimental.policies view
* #5462 Fix segfault after column drop on compressed table
* #5543 Copy scheduled_jobs list before sorting it
* #5497 Allow named time_bucket arguments in Cagg definition
* #5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket
* #5558 Use regrole for job owner
* #5542 Enable indexscan on uncompressed part of partially compressed chunks

**Thanks**
* @nikolaps for reporting an issue with the COPY fetcher
* @S-imo-n for reporting the issue on Background Worker Scheduler crash
* @geezhu for reporting issue on segfault in historgram()
* @mwahlhuetter for reporting the issue with joins in CAggs
* @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view
* @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
akuzm added a commit that referenced this pull request Apr 19, 2023
 ## 2.10.2 (2023-04-20)

**Bugfixes**
* #5410 Fix file trailer handling in the COPY fetcher
* #5446 Add checks for malloc failure in libpq calls
* #5233 Out of on_proc_exit slots on guc license change
* #5428 Use consistent snapshots when scanning metadata
* #5499 Do not segfault on large histogram() parameters
* #5470 Ensure superuser perms during copy/move chunk
* #5500 Fix when no FROM clause in continuous aggregate definition
* #5433 Fix join rte in CAggs with joins
* #5556 Fix duplicated entries on timescaledb_experimental.policies view
* #5462 Fix segfault after column drop on compressed table
* #5543 Copy scheduled_jobs list before sorting it
* #5497 Allow named time_bucket arguments in Cagg definition
* #5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket
* #5558 Use regrole for job owner
* #5542 Enable indexscan on uncompressed part of partially compressed chunks

**Thanks**
* @nikolaps for reporting an issue with the COPY fetcher
* @S-imo-n for reporting the issue on Background Worker Scheduler crash
* @geezhu for reporting issue on segfault in historgram()
* @mwahlhuetter for reporting the issue with joins in CAggs
* @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view
* @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
akuzm added a commit that referenced this pull request Apr 19, 2023
 ## 2.10.2 (2023-04-20)

**Bugfixes**
* #5410 Fix file trailer handling in the COPY fetcher
* #5446 Add checks for malloc failure in libpq calls
* #5233 Out of on_proc_exit slots on guc license change
* #5428 Use consistent snapshots when scanning metadata
* #5499 Do not segfault on large histogram() parameters
* #5470 Ensure superuser perms during copy/move chunk
* #5500 Fix when no FROM clause in continuous aggregate definition
* #5433 Fix join rte in CAggs with joins
* #5556 Fix duplicated entries on timescaledb_experimental.policies view
* #5462 Fix segfault after column drop on compressed table
* #5543 Copy scheduled_jobs list before sorting it
* #5497 Allow named time_bucket arguments in Cagg definition
* #5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket
* #5558 Use regrole for job owner
* #5542 Enable indexscan on uncompressed part of partially compressed chunks

**Thanks**
* @nikolaps for reporting an issue with the COPY fetcher
* @S-imo-n for reporting the issue on Background Worker Scheduler crash
* @geezhu for reporting issue on segfault in historgram()
* @mwahlhuetter for reporting the issue with joins in CAggs
* @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view
* @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
akuzm added a commit to akuzm/timescaledb that referenced this pull request Apr 19, 2023
 ## 2.10.2 (2023-04-20)

**Bugfixes**
* timescale#5410 Fix file trailer handling in the COPY fetcher
* timescale#5446 Add checks for malloc failure in libpq calls
* timescale#5233 Out of on_proc_exit slots on guc license change
* timescale#5428 Use consistent snapshots when scanning metadata
* timescale#5499 Do not segfault on large histogram() parameters
* timescale#5470 Ensure superuser perms during copy/move chunk
* timescale#5500 Fix when no FROM clause in continuous aggregate definition
* timescale#5433 Fix join rte in CAggs with joins
* timescale#5556 Fix duplicated entries on timescaledb_experimental.policies view
* timescale#5462 Fix segfault after column drop on compressed table
* timescale#5543 Copy scheduled_jobs list before sorting it
* timescale#5497 Allow named time_bucket arguments in Cagg definition
* timescale#5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket
* timescale#5558 Use regrole for job owner
* timescale#5542 Enable indexscan on uncompressed part of partially compressed chunks

**Thanks**
* @nikolaps for reporting an issue with the COPY fetcher
* @S-imo-n for reporting the issue on Background Worker Scheduler crash
* @geezhu for reporting issue on segfault in historgram()
* @mwahlhuetter for reporting the issue with joins in CAggs
* @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view
* @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
akuzm added a commit that referenced this pull request Apr 20, 2023
 ## 2.10.2 (2023-04-20)

**Bugfixes**
* #5410 Fix file trailer handling in the COPY fetcher
* #5446 Add checks for malloc failure in libpq calls
* #5233 Out of on_proc_exit slots on guc license change
* #5428 Use consistent snapshots when scanning metadata
* #5499 Do not segfault on large histogram() parameters
* #5470 Ensure superuser perms during copy/move chunk
* #5500 Fix when no FROM clause in continuous aggregate definition
* #5433 Fix join rte in CAggs with joins
* #5556 Fix duplicated entries on timescaledb_experimental.policies view
* #5462 Fix segfault after column drop on compressed table
* #5543 Copy scheduled_jobs list before sorting it
* #5497 Allow named time_bucket arguments in Cagg definition
* #5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket
* #5558 Use regrole for job owner
* #5542 Enable indexscan on uncompressed part of partially compressed chunks

**Thanks**
* @nikolaps for reporting an issue with the COPY fetcher
* @S-imo-n for reporting the issue on Background Worker Scheduler crash
* @geezhu for reporting issue on segfault in historgram()
* @mwahlhuetter for reporting the issue with joins in CAggs
* @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view
* @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
akuzm added a commit that referenced this pull request Apr 20, 2023
 ## 2.10.2 (2023-04-20)

**Bugfixes**
* #5410 Fix file trailer handling in the COPY fetcher
* #5446 Add checks for malloc failure in libpq calls
* #5233 Out of on_proc_exit slots on guc license change
* #5428 Use consistent snapshots when scanning metadata
* #5499 Do not segfault on large histogram() parameters
* #5470 Ensure superuser perms during copy/move chunk
* #5500 Fix when no FROM clause in continuous aggregate definition
* #5433 Fix join rte in CAggs with joins
* #5556 Fix duplicated entries on timescaledb_experimental.policies view
* #5462 Fix segfault after column drop on compressed table
* #5543 Copy scheduled_jobs list before sorting it
* #5497 Allow named time_bucket arguments in Cagg definition
* #5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket
* #5558 Use regrole for job owner
* #5542 Enable indexscan on uncompressed part of partially compressed chunks

**Thanks**
* @nikolaps for reporting an issue with the COPY fetcher
* @S-imo-n for reporting the issue on Background Worker Scheduler crash
* @geezhu for reporting issue on segfault in historgram()
* @mwahlhuetter for reporting the issue with joins in CAggs
* @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view
* @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants