release-25.1: changefeedccl: fix race advancing frontier in schemafeed #149350

rharding6373 · 2025-07-01T16:00:03Z

Backport 1/1 commits from #149119 on behalf of @rharding6373.

In the schema feed, `updateTableHistory` first checks that the current frontier is less than the current time. However, because we release the mutex protecting the frontier while validating table descriptors, another goroutine can advance the frontier before the first goroutine tries to advance it. For example, another goroutine may call `pauseOrResumePolling`, which pauses polling and advances the frontier at the same time. As a consequence, the first goroutine can fail an assertion because the current frontier is greater than the current time when it tries to advance it.

This change fixes the race by checking again, immediately before trying to advance the frontier, that the frontier is still less than the current time (`endTS`). If the frontier is greater than or equal to the current time, it does not need to be advanced and the function returns. It's worth keeping the original check, since it avoids validating descriptors unnecessarily, and releasing the mutex before validation keeps the goroutine from holding it while validating.

Epic: none
Fixes: #148963

Release note (bug fix): Fixes a race condition when advancing a changefeed aggregator's frontier. When hit, the race condition could result in an internal error that would shut down the kvfeed and cause the changefeed to retry.

Release justification: Fixes a race condition. Change is protected by a flag, defaulted on, as an escape hatch. It's ok for the flag to be defaulted on since the fix itself is simple (<5 LOC) and easy to reason about.
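As a rough sketch of what "protected by a flag, defaulted on" could look like, the re-check can be gated on a boolean that operators may switch off as an escape hatch. The flag name `recheckEnabled` and the `advance` helper below are hypothetical; the PR does not name the actual setting, and the real code would use CockroachDB's settings machinery rather than a package-level atomic.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// recheckEnabled is a hypothetical escape-hatch flag. It defaults to on,
// matching the release justification.
var recheckEnabled atomic.Bool

func init() { recheckEnabled.Store(true) }

// advance returns the new frontier. With the flag on, a frontier that has
// already reached endTS is left alone; with the flag off, the old behavior
// (an assertion-style error) is preserved.
func advance(frontier, endTS int64) (int64, error) {
	if recheckEnabled.Load() && endTS <= frontier {
		return frontier, nil // nothing to do: frontier already caught up
	}
	if endTS < frontier {
		return frontier, errors.New("assertion failed: frontier is ahead of endTS")
	}
	return endTS, nil
}

func main() {
	f, err := advance(30, 20)
	fmt.Println(f, err) // flag on: frontier unchanged, no error

	recheckEnabled.Store(false)
	_, err = advance(30, 20)
	fmt.Println(err != nil) // flag off: old failure mode returns
}
```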


blathers-crl · 2025-07-01T16:00:06Z

Thanks for opening a backport.

Before merging, please confirm that it falls into one of the following categories (select one):

Non-production code changes. Includes test-only changes, build system changes, etc.
Fixes for serious issues. Defined in the policy as correctness, stability, or security issues, data corruption/loss, significant performance regressions, breaking working and widely used functionality, or an inability to detect and debug production issues.
Other approved changes. These changes must be gated behind a disabled-by-default feature flag unless there is a strong justification not to.

Add a brief release justification to the PR description explaining your selection.

Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy.

All backports must be reviewed by the TL and EM for the owning area.

blathers-crl · 2025-07-01T16:00:08Z

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl · 2025-07-01T16:00:16Z

❌ PR #149350 does not comply with backport policy

Confidence: high
Explanation: The analysis of the pull request shows that it includes changes to a production file: 'pkg/ccl/changefeedccl/schemafeed/schema_feed.go'. The PR is meant to fix a race condition in the schema feed, which qualifies under the stability issue criteria for a critical bug. The PR also describes using a feature flag to protect the changes, which is stated to be defaulted 'on'. However, the backport policy requires that any changes be gated behind a feature flag that is disabled by default unless it's a critical bug fix. Since this is categorized under a critical bug (stability issue), the use of a feature flag is not mandatory. The absence of such a feature flag is permissible in this scenario since the PR addresses a critical stability issue and thus can skip feature flag requirements. Nevertheless, stating that the flag is 'defaulted on' raises a concern, but it does not affect the allowance of skipping the feature flag step due to the critical nature of the fix. However, the release justification is slightly questionable because it implies the fix is both protected by a flag and critically needed. The release justification indicates a valid exemption, but the explanation about the flag being defaulted 'on' may not strictly align with typical cautious backporting practices, especially in a stable release branch.
Recommendation: Consider asking the author to clarify the role and default state of the feature flag, or adjust the default state of the flag to 'off' to align better with cautious practice.



pkg/ccl/changefeedccl/schemafeed/schema_feed.go

rharding6373 requested a review from a team as a code owner July 1, 2025 16:00

blathers-crl bot added the blathers-backport and O-robot labels Jul 1, 2025

rharding6373 requested review from asg0451 and removed request for a team July 1, 2025 16:00

blathers-crl bot requested review from andyyang890 and KeithCh July 1, 2025 16:00

blathers-crl bot added the backport label Jul 1, 2025

andyyang890 approved these changes Jul 1, 2025


asg0451 approved these changes Jul 1, 2025


rharding6373 merged commit a622927 into cockroachdb:release-25.1 Jul 1, 2025
15 of 16 checks passed

celeste-cockroachdb bot added the target-release-25.1.9 label Jul 1, 2025
