release-24.3: changefeedccl: fix race advancing frontier in schemafeed #149349

@rharding6373 rharding6373 commented Jul 1, 2025

Backport 1/1 commits from #149119 on behalf of @rharding6373.


In the schema feed, `updateTableHistory` first checks that the current frontier is less than the current time. However, since we release the mutex protecting the frontier while validating table descriptors, another routine can advance the frontier before the first routine tries to advance it. For example, another routine may call `pauseOrResumePolling` and pause polling at the same time it advances the frontier. As a consequence, the first routine can fail an assertion when it tries to advance the frontier, because the current frontier is by then greater than the current time.

This change fixes the race by checking again, before trying to advance the frontier, that the frontier is still less than the current time (endTS). If endTS is less than or equal to the frontier, the frontier does not need to be advanced and the function returns. It's worth keeping the original check, since it avoids the need to validate descriptors when there is nothing to advance. Releasing the mutex during validation also stays as is, so that the routine does not hold it while validating.
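
The check, unlock, validate, re-check flow described above can be sketched roughly as follows. This is a simplified illustration, not the actual schemafeed code: the `tableFeed` type, its fields, the `advanceFrontierLocked` helper, and the use of plain `int64` timestamps are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// tableFeed is a hypothetical, simplified stand-in for the schema feed.
// Timestamps are plain int64 values here instead of HLC timestamps.
type tableFeed struct {
	mu       sync.Mutex
	frontier int64
}

// advanceFrontierLocked mimics the assertion described in the PR: it fails
// if the frontier is already greater than ts. Callers must hold f.mu.
func (f *tableFeed) advanceFrontierLocked(ts int64) error {
	if f.frontier > ts {
		return errors.New("assertion failed: frontier is ahead of endTS")
	}
	f.frontier = ts
	return nil
}

// updateTableHistory sketches the flow: check, unlock, validate, re-check.
func (f *tableFeed) updateTableHistory(endTS int64, validate func() error) error {
	f.mu.Lock()
	frontier := f.frontier
	f.mu.Unlock()

	// Original check: nothing to advance, so skip descriptor validation.
	if endTS <= frontier {
		return nil
	}

	// Validation runs without the mutex held, so another routine may
	// advance the frontier in the meantime.
	if err := validate(); err != nil {
		return err
	}

	f.mu.Lock()
	defer f.mu.Unlock()
	// Re-check: if another routine already moved the frontier to or past
	// endTS, return instead of tripping the assertion.
	if endTS <= f.frontier {
		return nil
	}
	return f.advanceFrontierLocked(endTS)
}

func main() {
	f := &tableFeed{frontier: 10}
	// Simulate a concurrent routine advancing the frontier during validation.
	err := f.updateTableHistory(20, func() error {
		f.mu.Lock()
		defer f.mu.Unlock()
		f.frontier = 30
		return nil
	})
	fmt.Println(err, f.frontier)
}
```

Without the second check, the simulated concurrent advance in `main` would reach `advanceFrontierLocked` with a frontier of 30 and an endTS of 20, which is the assertion failure the PR describes.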

Epic: none
Fixes: #148963

Release note (bug fix): Fixes a race condition when advancing a changefeed aggregator's frontier. When hit, the race condition could result in an internal error that would shut down the kvfeed and cause the changefeed to retry.


Release justification: Fixes a race condition. Change is protected by a flag, defaulted on, as an escape hatch. It's ok for the flag to be defaulted on since the fix itself is simple (<5 LOC) and easy to reason about.
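
As a rough illustration of the escape hatch, the new check can be gated on a default-on flag so that turning the flag off restores the previous behavior. The flag name `frontierAdvanceCheckEnabled` is the one quoted in the compliance bot's comment on this PR; everything else here, including modelling the flag as an `atomic.Bool` rather than a cluster setting and the `shouldSkipAdvance` helper, is an assumption made for the sketch.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// frontierAdvanceCheckEnabled is a hypothetical stand-in for the flag that
// gates the fix. It is modelled as an atomic.Bool purely for illustration.
var frontierAdvanceCheckEnabled atomic.Bool

func init() {
	// Default the flag to on, as the release justification describes.
	frontierAdvanceCheckEnabled.Store(true)
}

// shouldSkipAdvance reports whether the re-check short-circuits the advance:
// only when endTS is at or behind the frontier and the flag is enabled.
func shouldSkipAdvance(endTS, frontier int64) bool {
	return endTS <= frontier && frontierAdvanceCheckEnabled.Load()
}

func main() {
	fmt.Println(shouldSkipAdvance(20, 30)) // flag on: the advance is skipped
	frontierAdvanceCheckEnabled.Store(false)
	fmt.Println(shouldSkipAdvance(20, 30)) // flag off: previous behavior
}
```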

@rharding6373 rharding6373 requested a review from a team as a code owner July 1, 2025 15:59
@rharding6373 rharding6373 removed the request for review from a team July 1, 2025 15:59
@blathers-crl blathers-crl bot added the blathers-backport This is a backport that Blathers created automatically. label Jul 1, 2025
@rharding6373 rharding6373 requested a review from andyyang890 July 1, 2025 15:59
@blathers-crl blathers-crl bot added the O-robot Originated from a bot. label Jul 1, 2025
@blathers-crl blathers-crl bot requested a review from KeithCh July 1, 2025 16:00

blathers-crl bot commented Jul 1, 2025

Thanks for opening a backport.

Before merging, please confirm that it falls into one of the following categories (select one):

  • Non-production code changes. Includes test-only changes, build system changes, etc.
  • Fixes for serious issues. Defined in the policy as correctness, stability, or security issues, data corruption/loss, significant performance regressions, breaking working and widely used functionality, or an inability to detect and debug production issues.
  • Other approved changes. These changes must be gated behind a disabled-by-default feature flag unless there is a strong justification not to.

Add a brief release justification to the PR description explaining your selection.

Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy.

All backports must be reviewed by the TL and EM for the owning area.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Jul 1, 2025

blathers-crl bot commented Jul 1, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl bot commented Jul 1, 2025

✅ PR #149349 is compliant with backport policy

Confidence: high
Critical bug criteria met: [Stability or security issues]
Feature flag detected: Yes
Backward compatible: true
Explanation: The pull request addresses a race condition in advancing the changefeed aggregator's frontier, which aligns with the criteria for critical bugs related to stability. Specifically, it addresses a condition that can lead to an internal error and disrupt the operation of the changefeed. The PR clearly implements changes related to a critical bug by preventing errors when attempting to advance the frontier beyond the current time, if not necessary. This is done by checking with the condition 'if endTS.LessEq(frontier) && frontierAdvanceCheckEnabled.Get(&tf.settings.SV)' which uses the feature flag 'frontierAdvanceCheckEnabled' that is enabled by default but offers an escape hatch if needed. Additionally, the PR body explicitly mentions that the change is protected by a feature flag, ensuring that even though the PR defaults the flag to on, it remains backward compatible because it does not remove any existing functionality or features, merely adds a check before action.

@cockroach-teamcity

This change is Reviewable

@rharding6373 rharding6373 merged commit 29683ed into cockroachdb:release-24.3 Jul 1, 2025
16 checks passed
Labels
backport (Label PR's that are backports to older release branches), blathers-backport (This is a backport that Blathers created automatically.), O-robot (Originated from a bot.), target-release-24.3.16
4 participants