Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce support for tablets #249

Merged
merged 4 commits into from
Jan 16, 2024

Conversation

sylwiaszunejko
Copy link
Collaborator

@sylwiaszunejko sylwiaszunejko commented Aug 2, 2023

This PR introduces changes to the driver that are necessary for shard-awareness and token-awareness to work effectively with the tablets.

Now if driver send the request to the wrong node/shard it will get the correct tablet information from Scylla in custom_payload. It uses this information to obtain target replicas and shard numbers for tables managed by tablet replication.

I also added parsing TABLETS_ROUTING_V1 extension to ProtocolFeatures. In order for Scylla to send the tablet info, the driver must tell the database during connection handshake that it is able to interpret it. This negotiation is added as a part of ProtocolFeatures class.

To test this change I created integration and unit tests.

Fixes: #281

cassandra/cluster.py Outdated Show resolved Hide resolved
cassandra/cluster.py Outdated Show resolved Hide resolved
cassandra/cluster.py Outdated Show resolved Hide resolved
cassandra/cluster.py Outdated Show resolved Hide resolved
cassandra/cluster.py Outdated Show resolved Hide resolved
tests/integration/standard/test_tablets.py Outdated Show resolved Hide resolved
tests/integration/__init__.py Show resolved Hide resolved
cassandra/cluster.py Outdated Show resolved Hide resolved
cassandra/query.py Outdated Show resolved Hide resolved
cassandra/pool.py Outdated Show resolved Hide resolved
cassandra/cluster.py Outdated Show resolved Hide resolved
cassandra/metadata.py Outdated Show resolved Hide resolved
Copy link

@Lorak-mmk Lorak-mmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks much better now

cassandra/metadata.py Outdated Show resolved Hide resolved
cassandra/policies.py Outdated Show resolved Hide resolved
cassandra/pool.py Outdated Show resolved Hide resolved
cassandra/cluster.py Outdated Show resolved Hide resolved
cassandra/tablets.py Outdated Show resolved Hide resolved
cassandra/cluster.py Outdated Show resolved Hide resolved
cassandra/cluster.py Outdated Show resolved Hide resolved
cassandra/policies.py Outdated Show resolved Hide resolved
cassandra/policies.py Outdated Show resolved Hide resolved
cassandra/pool.py Outdated Show resolved Hide resolved
cassandra/tablets.py Outdated Show resolved Hide resolved
cassandra/policies.py Outdated Show resolved Hide resolved
cassandra/cluster.py Outdated Show resolved Hide resolved
cassandra/tablets.py Outdated Show resolved Hide resolved
cassandra/cluster.py Outdated Show resolved Hide resolved
cassandra/cluster.py Outdated Show resolved Hide resolved
cassandra/tablets.py Outdated Show resolved Hide resolved
cassandra/pool.py Outdated Show resolved Hide resolved
cassandra/cluster.py Outdated Show resolved Hide resolved
cassandra/tablets.py Outdated Show resolved Hide resolved
@Lorak-mmk
Copy link

oops, it was not supposed to be "Approve"

Copy link

@Lorak-mmk Lorak-mmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now. Please add the comment that I requested and I think we can merge this.

@sylwiaszunejko
Copy link
Collaborator Author

Looks good now. Please add the comment that I requested and I think we can merge this.

In gocql @avelanarius said we should wait with merging until the events are available, I don't know if this also applies to python-driver

tablet = self._session.cluster._load_balancing_policy._cluster_metadata._tablets.get_tablet_for_key(keyspace, table, t)

if tablet is not None:
shard_id = tablet.replicas[0][1]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you always take the shard of the first replica? This pool may be for a different replica.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how to know which shard to take? Isn't it redundant? I thought the choice didn't matter so I took the first one, maybe I misunderstood something

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to the current two-pass implementation (first we determine the node to send the request to, then we determine the shard to send the request to), I think this code has to go through all replicas, find the replica that matches the HostConnection's host and read its shard number.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sylwiaszunejko IIUC this function chooses shard-connection to a particular host for a given key. The owning shard of a given key is different on different hosts. It's specified in the <host, shard> tuple in the replica list. So the shard should depend on which host-replica you're going to talk to.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now I get it, hope now it looks good

@sylwiaszunejko
Copy link
Collaborator Author

@Lorak-mmk @avelanarius I removed experimental setting

@sylwiaszunejko sylwiaszunejko force-pushed the introduce_tablets branch 2 times, most recently from ff00514 to b0a571e Compare January 5, 2024 15:03
Copy link

@Lorak-mmk Lorak-mmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some final minor nits, otherwise looks good to me

cassandra/cluster.py Outdated Show resolved Hide resolved
cassandra/metadata.py Outdated Show resolved Hide resolved
cassandra/protocol_features.py Outdated Show resolved Hide resolved
In order for Scylla to send the tablet info, the driver must tell
the database during connection handshake that it is able to
interpret it. This negotation is added as a part of ProtocolFeatures
class.
@@ -1762,6 +1764,8 @@ def connect(self, keyspace=None, wait_for_all_pools=False):
try:
self.control_connection.connect()

self.load_balancing_policy.populate(weakref.proxy(self), self.metadata.all_hosts())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this code was here before most recent pushes, I don't remember seeing it.
Why do you need to pass weakref? Is the cluster saved somewhere?
Why do we need to call this here? I see that populate is called in other places already, so why another call here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we were using experimental setting, it was known from the beginning that we are using tablets. So, in every place where self.load_balancing_policy.populate() was called, it was properly informed to use tablets. But now, we have to get the information about using tablets from connection handshake. Therefore, when we are performing populate on lbp here: https://github.com/scylladb/python-driver/blob/master/cassandra/cluster.py#L1744, we do not know yet if tablets are being used or not, so there is a need to call populate again after calling connect() on control_connection.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but please put proper explanation in the comment, this code is really not obvious

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sylwiaszunejko It looks like this line specifically causes the CI failures. Since pretty much all integration tests don't use tablets (and your code shouldn't affect them), I tested progressively removing more and more of your changes and removing this line fixed the failing tests. I don't yet have an explanation why it is.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Branch on which I tested: https://github.com/avelanarius/python-driver/commits/tablets_disabled3/, "no populate" fixed the failing tests ("Build and upload to PyPi" are failing for some other reason)

@avelanarius
Copy link

I'll also look into test failures today and try to debug them.

@avelanarius
Copy link

avelanarius commented Jan 9, 2024

Some initial investigation (still waiting for some runs to finish): master (without this PR) reliably passes, with PR rebased on current master fails. Now looking into what part of this PR causes the test failure.

@sylwiaszunejko sylwiaszunejko force-pushed the introduce_tablets branch 6 times, most recently from d075881 to 3ffc22c Compare January 12, 2024 07:28
@sylwiaszunejko
Copy link
Collaborator Author

@avelanarius I added documentation

@@ -1775,6 +1777,9 @@ def connect(self, keyspace=None, wait_for_all_pools=False):
self.shutdown()
raise

# Update the information about tablet support after connection handshake.
self.load_balancing_policy._tablets_routing_v1 = self.control_connection._tablets_routing_v1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it necessary to set _tablets_routing_v1 also on child policies? This is what populate() also did. Maybe the code doesn't work properly if the set policy is some policy that has token aware policy as child - the child policy (token aware policy) won't have _tablets_routing_v1 set?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, do you think that creating another function only to populate _tablets_routing_v1 will be a good solution? Or should I investigate more why calling populate was causing issues and return to calling it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good idea to investigate why populate was causing issues.

sylwiaszunejko and others added 3 commits January 15, 2024 16:03
Add mechanism to parse system.tablets periodically. In TokenAwarePolicy
check if keyspace uses tablets if so try to use them to find replicas.

Make shard awareness work when using tablets.

Everything is wrapped in experimental setting, because tablets are still
experimental in ScyllaDB and changes in the tablets format are possible.
@avelanarius avelanarius merged commit dfccfff into scylladb:master Jan 16, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update driver to populate tablet metadata for request routing
7 participants