
Fix wait_for_schema_agreement deadlock #256

Merged (2 commits) on Sep 27, 2023

Conversation

Lorak-mmk

Fixes #168

The fix works by extracting the part of on_down that marks the host as down out of the executor, so it does not need to wait for a free thread. Once the host is marked as down, wait_for_schema_agreement can finish, which in turn allows the rest of on_down (the part that still runs on the executor) to be executed.

I initially rejected this approach because I think it could cause other issues - but given how hard this issue is to fix otherwise and how many other problems it caused, this seems like the better option, especially since no tests show any new problems.
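The shape of the deadlock and of the fix can be illustrated with a small, hedged sketch. This is not the driver's actual code: `Host`, `on_down_old`, and `on_down_new` are stand-ins, and a single-thread executor stands in for a saturated thread pool. Before the fix, marking the host down was itself queued on the executor behind the very call that was waiting for it.

```python
# Hedged sketch (not the driver's real implementation) of the on_down /
# wait_for_schema_agreement deadlock, using a 1-thread executor to mimic
# a saturated pool.
import threading
from concurrent.futures import ThreadPoolExecutor

class Host:
    def __init__(self):
        self.down_event = threading.Event()  # set when host is marked down

def wait_for_schema_agreement(host, timeout):
    # Simplified stand-in: agreement can only complete once the dead host
    # has been marked down and excluded from the check.
    return host.down_event.wait(timeout)

executor = ThreadPoolExecutor(max_workers=1)

# Before the fix: all of on_down ran on the executor.
host = Host()
def on_down_old(h):
    h.down_event.set()   # marking the host down needed a free thread
    # ... rest of on_down (teardown, reconnection handling) ...

fut = executor.submit(wait_for_schema_agreement, host, 1.0)
executor.submit(on_down_old, host)   # queued behind the wait -> deadlock
print("old path agreed:", fut.result())   # False: the wait timed out

# After the fix: the host is marked down synchronously on the caller's
# thread; only the remainder of on_down is submitted to the executor.
host2 = Host()
def on_down_new(h):
    h.down_event.set()                   # runs immediately, no executor
    executor.submit(lambda: None)        # rest of on_down still queued

fut2 = executor.submit(wait_for_schema_agreement, host2, 5.0)
on_down_new(host2)
print("new path agreed:", fut2.result())  # True: the wait can finish

executor.shutdown(wait=True)
```

In the driver the "wait" is polling of schema versions rather than an event, but the dependency cycle is the same: the wait holds the executor's thread while the task that would unblock it sits in the executor's queue.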

@Lorak-mmk
Author

We need to check if it fixes #225

@fruch

fruch commented Sep 26, 2023

We need to check if it fixes #225

I'll try it

@Lorak-mmk during the whole time you were trying to fix this issue, didn't you have a test reproducing it in this repo?

@Lorak-mmk
Author

@Lorak-mmk during the whole time you were trying to fix this issue, didn't you have a test reproducing it in this repo?

By "it" do you mean #168? There is no test in the repo; I use a local reproducer:

from cassandra.cluster import Cluster
import os
import logging
import sys
import time


def has_cql(addr, port):
    # True if something is listening on addr:port (netcat probe).
    return os.system(f'nc -z {addr} {port}') == 0


def wait_for_cql(addr, port):
    i = 0
    while True:
        if has_cql(addr, port):
            logging.info(f'CQL to {addr}:{port} found after {i} tries')
            return
        time.sleep(0.1)
        i += 1


logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, format='%(asctime)s.%(msecs)03d %(levelname)s [%(module)s:%(lineno)s]: %(message)s')

# Start a 3-node Scylla cluster in Docker. The hardcoded addresses assume
# Docker's default bridge network hands out 172.17.0.2-4 in start order.
os.system("docker run --rm -d --name some-scylla --hostname some-scylla -p9042 scylladb/scylla --smp 1")
wait_for_cql('172.17.0.2', 9042)

os.system('docker run --rm -d --name some-scylla2 --hostname some-scylla2 -p9042 scylladb/scylla --smp 1 --seeds="$(docker inspect --format=\'{{ .NetworkSettings.IPAddress }}\' some-scylla)"')
wait_for_cql('172.17.0.3', 9042)

os.system('docker run --rm -d --name some-scylla3 --hostname some-scylla3 -p9042 scylladb/scylla --smp 1 --seeds="$(docker inspect --format=\'{{ .NetworkSettings.IPAddress }}\' some-scylla)"')
wait_for_cql('172.17.0.4', 9042)


try:
    cluster = Cluster(['172.17.0.2'], max_schema_agreement_wait=120, protocol_version=4)
    session = cluster.connect()

    session.execute("CREATE KEYSPACE IF NOT EXISTS example WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2' };")
    session.execute("USE example;")
    session.execute("CREATE TABLE IF NOT EXISTS example.test(k int, c int, v int, PRIMARY KEY (k, v));")
    session.execute("INSERT INTO example.test (k, c, v) VALUES (1, 2, 3);")
    print(session.execute("SELECT * FROM example.test WHERE k = 1;").all())

    # Kill a node and immediately issue a schema change. Before the fix,
    # wait_for_schema_agreement deadlocked waiting for the killed host to
    # be marked down (#168).
    os.system("docker kill some-scylla2")
    print('killed scylla')

    print(session.execute("ALTER TABLE example.test ADD v2 int;", timeout=180).all())

    print(session.execute("SELECT * FROM example.test WHERE k = 1;").all())
finally:
    os.system("docker kill some-scylla")
    os.system("docker kill some-scylla2")
    os.system("docker kill some-scylla3")

@fruch

fruch commented Sep 26, 2023

@Lorak-mmk does this reproduce the issue 100% of the time? It seems like a quick, straightforward test to write with CCM.

@fruch

fruch commented Sep 26, 2023

We need to check if it fixes #225

I can confirm Scylla is passing 100/100 runs of the topology.test_concurrent_schema test

@fruch

fruch commented Sep 27, 2023

@Lorak-mmk

the failing test isn't related:

FAILED tests/integration/standard/test_udts.py::UDTTests::test_can_register_udt_before_connecting - AttributeError: 'user' object has no attribute 'state'

already seen multiple times:
#227


@fruch fruch left a comment


LGTM

@Lorak-mmk
Author

@Lorak-mmk does this reproduce the issue 100% of the time? It seems like a quick, straightforward test to write with CCM.

I'll try to add the test to this PR

Regression test for deadlock when performing schema change
right after killing a node: scylladb#168
@Lorak-mmk
Author

I think it's ready to merge

@Lorak-mmk Lorak-mmk merged commit 501640c into scylladb:master Sep 27, 2023
13 checks passed
@kbr-scylla

Please run the ScyllaDB test.py topology test suite a bunch of times and check that this fix doesn't cause a regression.

@fruch

fruch commented Sep 27, 2023

Please run the ScyllaDB test.py topology test suite a bunch of times and check that this fix doesn't cause a regression.

I ran it 100 times with this fix and didn't see any regression.

Linked issue closed by this pull request: Deadlock when performing a schema change right after killing a node