K3s/Flannel? - Pods slow to establish TCP connections #8288

Closed
maxsargentdev opened this issue Sep 1, 2023 · 3 comments

@maxsargentdev

maxsargentdev commented Sep 1, 2023

Environmental Info:
K3s Version:
1.26.4

Node(s) CPU architecture, OS, and Version:
AMD x86_64, AWS EC2 m5.2xlarge, 5.10.167-147.601.amzn2.x86_64

Cluster Configuration:
Single node, cluster created using k3sup

Describe the bug:
TCP connections between pods in the cluster take a long time to establish, but once established they are fast. For example, a database connection (the database also runs in the cluster) takes several retries to connect, but once it is up, queries run quickly.

I have done some debugging. Running tcpdump on cni0 shows that almost all UDP and TCP packets coming into the interface are reported with incorrect checksums. I am not sure whether this is a symptom or the root cause. Searching online suggests Flannel has had issues in the past when checksum calculation is offloaded to the NIC. I tried turning off tx-checksum-ip-generic with ethtool, as suggested in those posts, but got nowhere.
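For reference, a minimal sketch of the ethtool step mentioned above, wrapped in Python so the command can be inspected before it is run. The helper names `offload_command` and `set_offload` are hypothetical, and the interface name is an assumption: the issue does not say which device was changed. It needs root and the `ethtool` binary when not in dry-run mode.

```python
import subprocess


def offload_command(iface, feature="tx-checksum-ip-generic", enable=False):
    """Build the ethtool argv that switches one offload feature on or off."""
    return ["ethtool", "-K", iface, feature, "on" if enable else "off"]


def set_offload(iface, feature="tx-checksum-ip-generic", enable=False, dry_run=True):
    """Return the command; execute it only when dry_run is False."""
    cmd = offload_command(iface, feature, enable)
    if not dry_run:
        # Raises CalledProcessError if ethtool rejects the change.
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `set_offload("flannel.1")` (interface name assumed, and only present with the VXLAN backend) returns the command without running it. `ethtool -k <iface>` lists the current offload settings, so the change can be confirmed afterwards.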

Steps To Reproduce:

  • Provision an EC2 instance as described above
  • Run k3sup to install K3s
  • Install K3s with the following flags: --disable traefik, (set of OIDC flags for kube apiserver), --secrets-encryption

Expected behavior:
TCP connections establish quickly between pods in the cluster.

Actual behavior:
TCP connections take a long time to establish requiring several retries.
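One way to put numbers on this is to time the TCP handshake from inside a pod. The following is a minimal sketch, not something taken from the original report. The function name `time_connect` and its defaults are hypothetical.

```python
import socket
import time


def time_connect(host, port, attempts=5, timeout=3.0):
    """Open a TCP connection several times; return (succeeded, seconds) per attempt."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection resolves the host and completes the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                ok = True
        except OSError:
            # Covers timeouts, refused connections and unreachable hosts.
            ok = False
        results.append((ok, time.monotonic() - start))
    return results
```

For example, `time_connect("10.43.0.50", 5432)` would probe a database service, where the address and port are placeholders. Passing an IP address rather than a service name keeps DNS lookup time out of the measurement, which helps separate slow name resolution from a slow handshake.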

Additional context / logs:
As mentioned, I have already looked through a lot of information about Flannel to try to debug this, but I can't work out why I am seeing this behavior.

Here is a screen capture of tcpdump output:

[image: tcpdump output]

I can do some hacky grepping and see that some checksums are correct:
[image: grep-filtered tcpdump output]
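The grepping can be made a little less hacky by counting the checksum annotations that tcpdump prints in verbose mode. This is a minimal sketch. It assumes the usual `-vv` annotations, `(correct)` and `(incorrect -> 0x...)` for TCP, `[udp sum ok]` and `[bad udp cksum ...]` for UDP. The function name `count_checksums` is hypothetical.

```python
import re
from collections import Counter

# Checksum annotations as printed by tcpdump in verbose mode (assumed format).
_PATTERNS = {
    "correct": re.compile(r"\(correct\)|\[udp sum ok\]"),
    "incorrect": re.compile(r"\(incorrect -> 0x[0-9a-f]+\)|\[bad udp cksum"),
}


def count_checksums(lines):
    """Count lines that tcpdump marked as having a correct or incorrect checksum."""
    counts = Counter()
    for line in lines:
        for label, pattern in _PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

Piping `tcpdump -l -nn -vv -i cni0` into a script that calls `count_checksums(sys.stdin)` would give a running tally once the capture is stopped.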

I have no idea whether this is the cause of the issue or a symptom of some other misconfiguration.

I have also tried the host-gw backend and see the same behavior.

Thanks!

@maxsargentdev
Author

I am going to try the new Amazon Linux 2023 when I get home, as it ships with Linux kernel 6.1 or later.

Will update.

@maxsargentdev
Author

maxsargentdev commented Sep 2, 2023

I have tried the newer operating system but got the same issue.

I just need someone to confirm that these checksum errors are expected on the cni0 interface. From further reading, I gather they are expected on veth devices, since computing a checksum makes little sense when nothing goes over the wire.

If this is the case I will move on to try some other fixes.
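For context on what tcpdump is checking: the checksum used by IP, TCP and UDP is the 16-bit one's-complement sum described in RFC 1071. With transmit checksum offload enabled, the kernel leaves the field to be filled in later, so a capture taken on the sending host can show a value that does not yet match. The sketch below only illustrates the arithmetic. It leaves out the pseudo-header that TCP and UDP also cover, and the function name `internet_checksum` is hypothetical.

```python
def internet_checksum(data: bytes) -> int:
    """Return the RFC 1071 Internet checksum of data as a 16-bit integer."""
    if len(data) % 2:
        # Pad odd-length input with a zero byte, as the RFC specifies.
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        # Add each big-endian 16-bit word.
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        # Fold any carry bits back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF
```

A receiver verifies by running the same sum over the data with the checksum field included. A result of zero means the checksum matches, and anything else is what tcpdump reports as incorrect.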

@maxsargentdev
Author

I have confirmed that the issue here is not with K3s or Flannel. Closing.
