Fix wait_for_schema_agreement deadlock #256
Conversation
Fixes scylladb#168

The fix works by extracting the part of on_down that marks the host as down out of the executor, so it does not need to wait for a free thread. Once the host is marked as down, wait_for_schema_agreement can finish, which in turn allows the rest of on_down (the part that still runs on the executor) to be executed.
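In other words, before the fix the whole on_down reaction was submitted to the executor, so even the cheap step of flipping the host's "down" flag had to wait for a free thread while wait_for_schema_agreement occupied one. Below is a minimal, self-contained sketch of the restructured flow; Host, the executor, and the polling loop are simplified stand-ins for the driver's internals, not its actual API.

```python
# Simplified sketch of the fix described above, not the driver's real code.
import time
from concurrent.futures import ThreadPoolExecutor


class Host:
    def __init__(self, address, schema_version):
        self.address = address
        self.schema_version = schema_version
        self.is_up = True


# Stand-in for the driver's background executor for cluster events.
executor = ThreadPoolExecutor(max_workers=1)


def wait_for_schema_agreement(hosts, timeout=5.0):
    """Poll until all *live* hosts report the same schema version."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        versions = {h.schema_version for h in hosts if h.is_up}
        if len(versions) <= 1:
            return True
        time.sleep(0.1)
    return False


def _on_down_rest(host):
    # The potentially slow part: close pools, notify listeners, schedule
    # reconnects. This still runs on the executor, as before the fix.
    pass


def on_down(host):
    # The fix: mark the host down synchronously on the calling thread, so the
    # schema-agreement poll stops considering it even if every executor
    # thread is busy ...
    host.is_up = False
    # ... and only submit the remaining work to the executor.
    executor.submit(_on_down_rest, host)


if __name__ == "__main__":
    hosts = [Host("10.0.0.1", "v2"), Host("10.0.0.2", "v1")]  # node 2 just died
    on_down(hosts[1])                       # marked down immediately
    print(wait_for_schema_agreement(hosts))  # True: only live hosts are compared
    executor.shutdown(wait=True)
```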
We need to check if it fixes #225.
I'll try it. @Lorak-mmk, during the whole time you spent trying to fix this issue, didn't you have a test reproducing it in this repo?
By "it" you mean you mean #168 ? There is no test in repo, I use a local reproducer:
|
@Lorak-mmk does this reproduce the issue 100% of the time? It seems like a quick, straightforward test to write with CCM.
I can confirm Scylla is passing 100/100 in
The failing test isn't related:
It has already been seen multiple times:
LGTM
I'll try to add the test to this PR.
Regression test for deadlock when performing schema change right after killing a node: scylladb#168
I think it's ready to merge.
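For context, a regression test like the one referenced above could look roughly like this. It is a hedged sketch only: it assumes a multi-node cluster already running under CCM, and the `kill_node` helper and node name are illustrative placeholders, not utilities taken from this repo.

```python
# Hypothetical regression-test sketch: kill a node, then immediately perform a
# schema change; before the fix this sequence could deadlock (scylladb#168).
import subprocess

from cassandra.cluster import Cluster


def kill_node(name):
    # Stop a node of an already-running CCM cluster (assumes the `ccm` CLI is
    # installed and a cluster with a node of this name is active).
    subprocess.run(["ccm", name, "stop"], check=True)


def test_schema_change_right_after_killing_node():
    cluster = Cluster()  # driver under test; contact points default to localhost
    session = cluster.connect()
    try:
        kill_node("node3")  # kill a node, then immediately change the schema
        # Before the fix this could hang: wait_for_schema_agreement kept waiting
        # on the dead node while on_down waited for a free executor thread.
        session.execute(
            "CREATE KEYSPACE IF NOT EXISTS deadlock_regression "
            "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
        )
    finally:
        cluster.shutdown()
```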
Please run the ScyllaDB test.py topology test suite a bunch of times and check that this fix doesn't cause a regression.
I ran it 100 times with this fix and didn't see any regression.
Fixes #168
The fix works by extracting the part of on_down that marks the host as down out of the executor, so it does not need to wait for a free thread. Once the host is marked as down, wait_for_schema_agreement can finish, which in turn allows the rest of on_down (the part that still runs on the executor) to be executed.
I initially rejected this approach because I think it could cause other issues, but given how hard this issue is to fix otherwise and how many other problems it has caused, this seems like the better option, especially since no tests show any new problems.