Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

store/tikv: keepalive with pd #14118

Merged
merged 9 commits into from
Dec 24, 2019
Merged

store/tikv: keepalive with pd #14118

merged 9 commits into from
Dec 24, 2019

Conversation

nolouch
Copy link
Member

@nolouch nolouch commented Dec 18, 2019

Signed-off-by: nolouch nolouch@gmail.com

What problem does this PR solve?

After all 3 instances of PD are killed in AWS(k8s environment), it takes a long time (15 minutes) for TiDB server instances to reconnect to new PD instances. and we found the stale TCP connection after all pod IP is changed.

 we see the connection is still establish, and that ip does not exist after kill.
10.0.48.116 is not exist, but the tcp is establisted.

Tue Dec 17 12:54:36 UTC 2019
tcp        0      1 10.0.38.181:53424       172.20.62.78:2379       SYN_SENT    1/tidb-server
tcp        0      0 10.0.38.181:53302       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53304       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53300       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:37554       10.0.46.225:2379        ESTABLISHED 1/tidb-server
tcp        0   1140 10.0.38.181:35378       10.0.48.116:2379        ESTABLISHED 1/tidb-server
Tue Dec 17 12:54:37 UTC 2019
tcp        0      1 10.0.38.181:53424       172.20.62.78:2379       SYN_SENT    1/tidb-server
tcp        0      0 10.0.38.181:53302       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53304       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53300       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:37554       10.0.46.225:2379        ESTABLISHED 1/tidb-server
tcp        0   1153 10.0.38.181:35378       10.0.48.116:2379        ESTABLISHED 1/tidb-server
Tue Dec 17 12:54:37 UTC 2019
tcp        0      0 10.0.38.181:53302       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53304       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53488       172.20.62.78:2379       ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53300       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:37554       10.0.46.225:2379        ESTABLISHED 1/tidb-server
tcp        0   1153 10.0.38.181:35378       10.0.48.116:2379        ESTABLISHED 1/tidb-server

....

Tue Dec 17 13:07:15 UTC 2019
tcp        0      0 10.0.38.181:53302       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53304       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53488       172.20.62.78:2379       ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53300       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:37554       10.0.46.225:2379        ESTABLISHED 1/tidb-server
tcp        0  26868 10.0.38.181:35378       10.0.48.116:2379        ESTABLISHED 1/tidb-server
Tue Dec 17 13:07:16 UTC 2019
tcp        0      0 10.0.38.181:53302       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53304       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53488       172.20.62.78:2379       ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53300       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:37554       10.0.46.225:2379        ESTABLISHED 1/tidb-server
tcp        0  26868 10.0.38.181:35378       10.0.48.116:2379        ESTABLISHED 1/tidb-server
Tue Dec 17 13:07:16 UTC 2019
tcp        0      0 10.0.38.181:53302       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53304       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53488       172.20.62.78:2379       ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53300       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:37554       10.0.46.225:2379        ESTABLISHED 1/tidb-server
tcp        0  26868 10.0.38.181:35378       10.0.48.116:2379        ESTABLISHED 1/tidb-server
Tue Dec 17 13:07:17 UTC 2019
tcp        0      0 10.0.38.181:53302       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53304       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53488       172.20.62.78:2379       ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53300       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:37554       10.0.46.225:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:41754       10.0.55.94:2379         ESTABLISHED 1/tidb-server

This problem same as #7099. may k8s CNI dropping all packets send to the removed node(Indeterminate), that cause a stall conneciton, until kernel TCP retransmission times out and closes the connection.

What is changed and how it works?

update pd client and use keepalive in gRPC.

Check List

Tests

  • Manual test (add detailed scripts or steps below)
    no stale connection appear.

Signed-off-by: nolouch <nolouch@gmail.com>
@nolouch nolouch added the type/bugfix This PR fixes a bug. label Dec 18, 2019
@hicqu hicqu requested a review from lysu December 18, 2019 09:52
@nolouch
Copy link
Member Author

nolouch commented Dec 18, 2019

/rebuild

@lysu
Copy link
Contributor

lysu commented Dec 18, 2019

@nolouch plugin CI will be fixed after PD repo merge

@DanielZhangQD
Copy link

may k8s CNI dropping all packets send to the removed node(Indeterminate), that cause a stall conneciton, @nolouch as we have discussed, it's not the issue of k8s CNI, it's PD that does not close gracefully (no FIN sent to TiDB) that causes the stale connection.

@nolouch
Copy link
Member Author

nolouch commented Dec 20, 2019

@DanielZhangQD I have test kill -9 but not found the same issue in k8s. anyway, the keepalive is needed.

@nolouch nolouch requested a review from bb7133 December 20, 2019 05:51
Copy link
Contributor

@lysu lysu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@siddontang
Copy link
Member

@nolouch

Have we already used keepalive for TiKV?

@disksing
Copy link
Contributor

LGTM

Copy link
Member

@bb7133 bb7133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

go.sum Show resolved Hide resolved
@bb7133
Copy link
Member

bb7133 commented Dec 23, 2019

/run-all-tests

@disksing
Copy link
Contributor

LGTM

@nolouch
Copy link
Member Author

nolouch commented Dec 24, 2019

/merge

@sre-bot sre-bot added the status/can-merge Indicates a PR has been approved by a committer. label Dec 24, 2019
@sre-bot
Copy link
Contributor

sre-bot commented Dec 24, 2019

/run-all-tests

@sre-bot sre-bot merged commit 9e376cf into pingcap:master Dec 24, 2019
@sre-bot
Copy link
Contributor

sre-bot commented Dec 24, 2019

cherry pick to release-2.1 failed

@sre-bot
Copy link
Contributor

sre-bot commented Dec 24, 2019

cherry pick to release-3.0 failed

nolouch added a commit to nolouch/tidb that referenced this pull request Dec 25, 2019
nolouch added a commit to nolouch/tidb that referenced this pull request Dec 25, 2019
@nolouch nolouch deleted the keepalive-pd branch December 27, 2019 09:58
@sre-bot
Copy link
Contributor

sre-bot commented Apr 7, 2020

It seems that, not for sure, we failed to cherry-pick this commit to release-2.1. Please comment '/run-cherry-picker' to try to trigger the cherry-picker if we did fail to cherry-pick this commit before. @nolouch PTAL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status/can-merge Indicates a PR has been approved by a committer. type/bugfix This PR fixes a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants