
Fix wait_for_schema_agreement deadlock #256

Merged (2 commits) on Sep 27, 2023

Conversation

Lorak-mmk

Fixes #168

The fix works by extracting the part of on_down that marks the host as down out of the executor, so it does not need to wait for a free thread. Once the host is marked as down, wait_for_schema_agreement can finish, which in turn allows the rest of on_down (the part that still runs on the executor) to be executed.

I initially rejected this approach because I think it could cause other issues - but given how hard this issue is to fix otherwise and how many other problems it caused, this seems like the better option, especially since no tests show any new problems.
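The shape of the deadlock and of the fix can be illustrated with a small, hedged sketch. This is not the driver's actual code: `Host`, `on_down_old`, and `on_down_new` are stand-ins, and a single-thread executor stands in for a saturated thread pool. Before the fix, marking the host down was itself queued on the executor behind the very call that was waiting for it.

```python
# Hedged sketch (not the driver's real implementation) of the on_down /
# wait_for_schema_agreement deadlock, using a 1-thread executor to mimic
# a saturated pool.
import threading
from concurrent.futures import ThreadPoolExecutor

class Host:
    def __init__(self):
        self.down_event = threading.Event()  # set when host is marked down

def wait_for_schema_agreement(host, timeout):
    # Simplified stand-in: agreement can only complete once the dead host
    # has been marked down and excluded from the check.
    return host.down_event.wait(timeout)

executor = ThreadPoolExecutor(max_workers=1)

# Before the fix: all of on_down ran on the executor.
host = Host()
def on_down_old(h):
    h.down_event.set()   # marking the host down needed a free thread
    # ... rest of on_down (teardown, reconnection handling) ...

fut = executor.submit(wait_for_schema_agreement, host, 1.0)
executor.submit(on_down_old, host)   # queued behind the wait -> deadlock
print("old path agreed:", fut.result())   # False: the wait timed out

# After the fix: the host is marked down synchronously on the caller's
# thread; only the remainder of on_down is submitted to the executor.
host2 = Host()
def on_down_new(h):
    h.down_event.set()                   # runs immediately, no executor
    executor.submit(lambda: None)        # rest of on_down still queued

fut2 = executor.submit(wait_for_schema_agreement, host2, 5.0)
on_down_new(host2)
print("new path agreed:", fut2.result())  # True: the wait can finish

executor.shutdown(wait=True)
```

In the driver the "wait" is polling of schema versions rather than an event, but the dependency cycle is the same: the wait holds the executor's thread while the task that would unblock it sits in the executor's queue.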

@Lorak-mmk
Author

We need to check if it fixes #225

@fruch

fruch commented Sep 26, 2023

We need to check if it fixes #225

I'll try it

@Lorak-mmk during the whole time you were trying to fix this issue, didn't you have a test reproducing it in this repo?

@Lorak-mmk
Author

@Lorak-mmk during the whole time you were trying to fix this issue, didn't you have a test reproducing it in this repo?

By "it" do you mean #168? There is no test in the repo; I use a local reproducer:

from cassandra.cluster import Cluster
import os
import logging
import sys
import time


def has_cql(addr, port):
    # True if something is listening on addr:port (netcat probe).
    return os.system(f'nc -z {addr} {port}') == 0


def wait_for_cql(addr, port):
    i = 0
    while True:
        if has_cql(addr, port):
            logging.info(f'CQL to {addr}:{port} found after {i} tries')
            return
        time.sleep(0.1)
        i += 1


logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, format='%(asctime)s.%(msecs)03d %(levelname)s [%(module)s:%(lineno)s]: %(message)s')

# Start a 3-node Scylla cluster in Docker. The hardcoded addresses assume
# Docker's default bridge network hands out 172.17.0.2-4 in start order.
os.system("docker run --rm -d --name some-scylla --hostname some-scylla -p9042 scylladb/scylla --smp 1")
wait_for_cql('172.17.0.2', 9042)

os.system('docker run --rm -d --name some-scylla2 --hostname some-scylla2 -p9042 scylladb/scylla --smp 1 --seeds="$(docker inspect --format=\'{{ .NetworkSettings.IPAddress }}\' some-scylla)"')
wait_for_cql('172.17.0.3', 9042)

os.system('docker run --rm -d --name some-scylla3 --hostname some-scylla3 -p9042 scylladb/scylla --smp 1 --seeds="$(docker inspect --format=\'{{ .NetworkSettings.IPAddress }}\' some-scylla)"')
wait_for_cql('172.17.0.4', 9042)


try:
    cluster = Cluster(['172.17.0.2'], max_schema_agreement_wait=120, protocol_version=4)
    session = cluster.connect()

    session.execute("CREATE KEYSPACE IF NOT EXISTS example WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2' };")
    session.execute("USE example;")
    session.execute("CREATE TABLE IF NOT EXISTS example.test(k int, c int, v int, PRIMARY KEY (k, v));")
    session.execute("INSERT INTO example.test (k, c, v) VALUES (1, 2, 3);")
    print(session.execute("SELECT * FROM example.test WHERE k = 1;").all())

    # Kill a node and immediately issue a schema change. Before the fix,
    # wait_for_schema_agreement deadlocked waiting for the killed host to
    # be marked down (#168).
    os.system("docker kill some-scylla2")
    print('killed scylla')

    print(session.execute("ALTER TABLE example.test ADD v2 int;", timeout=180).all())

    print(session.execute("SELECT * FROM example.test WHERE k = 1;").all())
finally:
    os.system("docker kill some-scylla")
    os.system("docker kill some-scylla2")
    os.system("docker kill some-scylla3")

@fruch

fruch commented Sep 26, 2023

@Lorak-mmk does this reproduce the issue 100% of the time? It seems like a quick, straightforward test to write with CCM.

@fruch

fruch commented Sep 26, 2023

We need to check if it fixes #225

I can confirm Scylla is passing 100/100 runs of the topology.test_concurrent_schema test

@fruch

fruch commented Sep 27, 2023

@Lorak-mmk

the failing test isn't related:

FAILED tests/integration/standard/test_udts.py::UDTTests::test_can_register_udt_before_connecting - AttributeError: 'user' object has no attribute 'state'

already seen multiple times:
#227


@fruch fruch left a comment


LGTM

@Lorak-mmk
Author

@Lorak-mmk does this reproduce the issue 100% of the time? It seems like a quick, straightforward test to write with CCM.

I'll try to add the test to this PR

Regression test for deadlock when performing schema change
right after killing a node: scylladb#168
@Lorak-mmk
Author

I think it's ready to merge

@Lorak-mmk Lorak-mmk merged commit 501640c into scylladb:master Sep 27, 2023
13 checks passed
@kbr-scylla

Please run the ScyllaDB test.py topology test suite a bunch of times and check that this fix doesn't cause a regression.

@fruch

fruch commented Sep 27, 2023

Please run the ScyllaDB test.py topology test suite a bunch of times and check that this fix doesn't cause a regression.

I ran it 100 times with this fix and didn't see any regression.

Linked issue closed by this pull request: Deadlock when performing a schema change right after killing a node