-
Notifications
You must be signed in to change notification settings - Fork 3.9k
release-25.1: changefeedccl: fix race advancing frontier in schemafeed #149350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-25.1: changefeedccl: fix race advancing frontier in schemafeed #149350
Conversation
In the schema feed, when in `updateTableHistory`, we check that the current frontier is less than the current time. However, since we release the mutex protecting frontier while validating table descriptors, it's possible for another routine to advance the frontier before the first routine tries to advance it. For example, another routine may call pauseOrResumePolling and pause polling at the same time it advances the frontier. As a consequence, it's possible for the first routine to assert fail due to the current frontier being greater than the current time when it tries to advance it. This change fixes this race by checking that the frontier is greater than the current time (endTS) again before trying to advance the frontier. If the frontier is less than or equal to the current time, the frontier does not need to be advanced and it returns. It's worth keeping the original check in, since it avoids the need to validate descriptors, and releasing the mutex also prevents the routine from holding it while validating. Epic: none Fixes: cockroachdb#148963 Release note (bug fix): Fixes a race condition when advancing a changefeed aggregator's frontier. When hit, the race condition could result in an internal error that would shut down the kvfeed and cause the changefeed to retry.
Thanks for opening a backport. Before merging, please confirm that it falls into one of the following categories (select one):
Add a brief release justification to the PR description explaining your selection. Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy. All backports must be reviewed by the TL and EM for the owning area. |
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
❌ PR #149350 does not comply with backport policy Confidence: high 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Backport 1/1 commits from #149119 on behalf of @rharding6373.
In the schema feed, when in
updateTableHistory
, we check that the current frontier is less than the current time. However, since we release the mutex protecting frontier while validating table descriptors, it's possible for another routine to advance the frontier before the first routine tries to advance it. For example, another routine may call pauseOrResumePolling and pause polling at the same time it advances the frontier.As a consequence, it's possible for the first routine to assert fail due to the current frontier being greater than the current time when it tries to advance it.This change fixes this race by checking that the frontier is greater than the current time (endTS) again before trying to advance the frontier. If the frontier is less than or equal to the current time, the frontier does not need to be advanced and it returns. It's worth keeping the original check in, since it avoids the need to validate descriptors, and releasing the mutex also prevents the routine from holding it while validating.
Epic: none
Fixes: #148963
Release note (bug fix): Fixes a race condition when advancing a changefeed aggregator's frontier. When hit, the race condition could result in an internal error that would shut down the kvfeed and cause the changefeed to retry.
Release justification: Fixes a race condition. Change is protected by a flag, defaulted on, as an escape hatch. It's ok for the flag to be defaulted on since the fix itself is simple (<5 LOC) and easy to reason about.