forked from apache/cassandra-gocql-driver
-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for gocql/gocql issue 1274 #22
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add option to fallback to other DC for given token
This test reproduces the race condition.
tokenAwareHostPolicy keeps track of hosts in the cluster, keeps the token ring and a map of token to replicas that keep the data (which is computed using a placement strategy). One issue that is present in old the code is that it updates the host list and token ring whenever a node goes up or down. While the replica map is only updated in AddHost/RemoveHost, it could be computed based on partial host list in case AddHost/RemoveHost is called whenever some host is down. The placement strategy would not see all nodes in the cluster in that case. In fact, HostUp is never called anywhere in the code now and AddHost is called even in case when a node goes up. So this commit changes tokenAwareHostPolicy to only update token ring and replica map whenever hosts change in AddHost/RemoveHost and not HostUp/HostDown. There is also a race condition in updateKeyspaceMetadata when computing replicas, as it accesses fields that are synchronized separately (tokenRing and hosts), but the placement strategy assumes the values it receives are consistent. This leads to panics triggered in replicaMap in case the race is hit when hosts are rapidly added/removed or getting up/down (such as when we lose connection to multiple nodes at the same time). One sequence of events that might lead to the race condition and panic in networkTopology.replicaMap could be like below: 1. A: calls tokenAwareHostPolicy.AddHost 2. A: calls tokenAwareHostPolicy.HostUp 3. A: calls tokenAwareHostPolicy.updateKeyspaceMetadata 4. A: calls tr, _ := t.tokenRing.Load().(*tokenRing) to read token Ring 5. B: calls tokenAwareHostPolicy.RemoveHost 6. B: calls tokenAwareHostPolicy.HostDown which calls t.hosts.remove(host.ConnectAddress()) 7. A: calls strat.replicaMap(t.hosts.get(), tr.tokens) To fix the inconsistencies, we need to perform the entire state change atomically. I think we can just lock (for writing) during the entire state update as host add/remove should be rare. The remaining reads in Pick still use lock-free access, but we now copy and replace all the metadata with a single atomic.Value store instead of multiple atomic.Value fields. I left the hosts field as cowHostList type, but it could be replaced by a simple unsynchronized type as all accesses to hosts are now protected by `tokenAwareHostPolicy.mu`. Fixes apache#1274
martin-sucha
pushed a commit
to kiwicom/gocql
that referenced
this pull request
Jan 31, 2020
This commit fixes the next error: ``` type *tokenAwareHostPolicy has no field or method keyspaces ``` which was introduced with 47e1472 after merging scylladb#22 Commit 9867ae9 removed field used in the test and later change was not reflected during ongoing merges.
martin-sucha
pushed a commit
to kiwicom/gocql
that referenced
this pull request
Feb 6, 2020
This commit fixes the next error: ``` type *tokenAwareHostPolicy has no field or method keyspaces ``` which was introduced with 47e1472 after merging scylladb#22 Commit 9867ae9 removed field used in the test and later change was not reflected during ongoing merges.
martin-sucha
pushed a commit
to kiwicom/gocql
that referenced
this pull request
Mar 2, 2020
This commit fixes the next error: ``` type *tokenAwareHostPolicy has no field or method keyspaces ``` which was introduced with 47e1472 after merging scylladb#22 Commit 9867ae9 removed field used in the test and later change was not reflected during ongoing merges.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The first commit adds a reproducer test for apache#1274 which runs for 100ms. I'll let you decide if you want to include it or not.
The second commit is almost identical to apache#1332