
[close #736] Select reachable store #737

Merged: 7 commits merged into tikv:master on Apr 27, 2023
Conversation

shiyuhang0 (Collaborator) commented Apr 10, 2023

What problem does this PR solve?

Issue Number: close #736

What is changed and how does it work?

Check the store's status before using it, and fall through to the next replica when the current replica is unreachable.
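In sketch form, the change scans the replica list in order and skips stores that are down, instead of always using the first replica. The following is a minimal illustration with stand-in types, not the exact patch; isReachable() is an assumed health check, not necessarily the real API.

import java.util.List;

// Stand-in for the client's store handle; the real code uses the
// region/store classes in org.tikv.common.region.
interface Store {
  boolean isReachable(); // assumed health check, named for illustration
}

final class SelectReachableExample {
  // Return the first reachable store in replica order, or null when every
  // replica is down (the caller can then back off or raise an error).
  static Store selectReachable(List<Store> replicas) {
    for (Store candidate : replicas) {
      if (candidate != null && candidate.isReachable()) {
        return candidate;
      }
    }
    return null;
  }
}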

Test

I tested TiSpark follower read with the first store killed; the client successfully requested the next store.

@shiyuhang0 changed the title from "[close #736]select reachable store" to "[close #736] Select reachable store" on Apr 10, 2023
codecov bot commented Apr 10, 2023

Codecov Report

Patch coverage: 61.53%; project coverage change: -0.22% ⚠️

Comparison is base (91b1439) 38.18% compared to head (ed84147) 37.96%.

Additional details and impacted files
@@             Coverage Diff              @@
##             master     #737      +/-   ##
============================================
- Coverage     38.18%   37.96%   -0.22%     
  Complexity     1612     1612              
============================================
  Files           278      278              
  Lines         17482    17493      +11     
  Branches       1986     1989       +3     
============================================
- Hits           6675     6641      -34     
- Misses        10140    10191      +51     
+ Partials        667      661       -6     
Impacted Files Coverage Δ
...ain/java/org/tikv/common/region/RegionManager.java 80.11% <50.00%> (-2.03%) ⬇️
src/main/java/org/tikv/common/region/TiRegion.java 77.50% <100.00%> (+0.57%) ⬆️

... and 9 files with indirect coverage changes


zhangyangyu previously approved these changes Apr 25, 2023

ti-srebot (Collaborator) commented:
@zhangyangyu, Thanks for your review. The bot only counts LGTMs from Reviewers and higher roles, but you're still welcome to leave your comments. get siglist: dial tcp 172.16.4.167:34000: connect: no route to host

zhangyangyu merged commit e8feb23 into tikv:master on Apr 27, 2023
ti-chi-bot (Member) commented:

In response to a cherrypick label: cannot checkout 3.3: error checking out 3.3: exit status 1. output: error: pathspec '3.3' did not match any file(s) known to git

shiyuhang0 (Collaborator, Author) commented:

/cherry-pick release-3.3

ti-chi-bot (Member) commented:

@shiyuhang0: new pull request created to branch release-3.3: #742.

In response to this:

/cherry-pick release-3.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ti-chi-bot pushed a commit to ti-chi-bot/client-java that referenced this pull request Apr 27, 2023
close tikv#736

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
shiyuhang0 added a commit to ti-chi-bot/client-java that referenced this pull request Apr 27, 2023
close tikv#736

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: shiyuhang <1136742008@qq.com>
zhangyangyu pushed a commit that referenced this pull request Apr 27, 2023
Signed-off-by: shiyuhang <1136742008@qq.com>
List<Peer> replicaList = region.getReplicaList();
for (int i = 0; i < replicaList.size(); i++) {
  Peer peer = replicaList.get(i);
  store = getStoreById(peer.getStoreId(), backOffer);
  // ...
Contributor:

Is it possible that this makes replica selection unbalanced? We always try replica 0 first.

How about starting from getCurrentReplica(), and if the current replica is not available, getting the next one with getNextReplica() and retrying?
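A rough sketch of that proposal, as a stand-in class (the real accessors live on the region object; the reachable predicate is a placeholder check, and a non-empty replica list is assumed):

import java.util.List;
import java.util.function.Predicate;

// Illustrative rotation over replicas: start from the current index and
// advance on failure, mimicking getCurrentReplica()/getNextReplica().
// Note: as written, this spins forever when every store is down -- see
// the reply below.
final class RotatingSelectorExample<R> {
  private int replicaIdx = 0; // stands in for the region's internal index

  R select(List<R> replicas, Predicate<R> reachable) {
    while (true) {
      R candidate = replicas.get(replicaIdx);          // "getCurrentReplica()"
      if (reachable.test(candidate)) {
        return candidate;
      }
      replicaIdx = (replicaIdx + 1) % replicas.size(); // "getNextReplica()"
    }
  }
}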

Collaborator (Author):

There are two problems with getNextReplica:

  1. The code will fall into an endless loop when no store is available.
  2. Assume we set follower,leader: the replica list will begin with the followers and then the leaders. If we use getNextReplica, we may miss a recovered follower and request the leader instead.

Collaborator (Author):

Besides, the original code always selects the first replica, so it also has the balance issue.

Contributor:

  1. The code will fall into an endless loop when no store is available.

This is not hard to solve: loop no more than replicaList.size() times, as in the sketch below.
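For illustration, the bound turns the rotation sketched earlier in this thread into something like the method below (same stand-in names, added to the same illustrative class; a sketch, not the actual fix):

// Bounded variant: try at most replicas.size() candidates, then give up
// instead of spinning forever when every store is down.
R selectBounded(List<R> replicas, Predicate<R> reachable) {
  for (int tried = 0; tried < replicas.size(); tried++) {
    R candidate = replicas.get(replicaIdx);          // current replica
    replicaIdx = (replicaIdx + 1) % replicas.size(); // advance for the next try
    if (reachable.test(candidate)) {
      return candidate;
    }
  }
  return null; // no reachable replica; let the caller back off or fail
}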

  2. Assume we set follower,leader: the replica list will begin with the followers and then the leaders. If we use getNextReplica, we may miss a recovered follower and request the leader instead.

Why shouldn't we "miss the recovered follower and request the leader"? If we don't want to read from the leader, we can set the replica selector to "followers only". Otherwise, if "leader and follower" is used, it should be fine to read from the leader.

  Besides, the original code always selects the first replica, so it also has the balance issue.

I also find it strange that getNextReplica is unused, and that replicaIdx is not changed anywhere else. Any idea @iosmanthus?

Contributor:

The customer wants a "fallback" strategy: read from the followers first, to isolate the dump traffic, but if all the followers are down, fall back to the leaders as a safety net.

So if "follower and leader" is specified, it means: use followers first, then leaders. The order matters (see the sketch below).
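A tiny sketch of that ordering with generic stand-in code (not the client's actual selector API): preferred replicas first, leaders last, then take the first reachable one.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative "fallback" ordering: scan the preferred replicas
// (followers, or learners for the dump traffic) first, and only reach
// the leaders when every preferred replica is unreachable.
final class FallbackOrderExample {
  static <R> R pick(List<R> preferred, List<R> leaders, Predicate<R> reachable) {
    List<R> ordered = new ArrayList<>(preferred); // followers/learners first
    ordered.addAll(leaders);                      // leaders as the safety net
    for (R replica : ordered) {
      if (reachable.test(replica)) {
        return replica; // first reachable replica in preference order
      }
    }
    return null; // everything is down
  }
}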

Contributor:

Is stale read enabled? If not, when all the followers are down, the leader is also unavailable (neither readable nor writable; see https://www.pingcap.com/blog/lease-read/).

Contributor:

Ah, I think we got some terms wrong: it's not a follower, it's a learner, which does not take part in voting. And yes, this is for stale read. The normal TiKV nodes still hold the leaders and followers and handle the online traffic, and there are extra learner nodes dedicated to the offline traffic. When the learners are down, it falls back to the normal nodes.

Successfully merging this pull request may close these issues:

Missing fallback mechanism for replica selection (#736)