K3s/Flannel? - Pods slow to establish TCP connections #8288

maxsargentdev · 2023-09-01T15:59:13Z

Environmental Info:
K3s Version:
1.26.4

Node(s) CPU architecture, OS, and Version:
AMD x86_64, AWS EC2 m5.2xlarge, 5.10.167-147.601.amzn2.x86_64

Cluster Configuration:
Single node, cluster created using k3sup

Describe the bug:
TCP connections between pods in the cluster take a long time to establish, but once established they are fast. For example, a database connection (the database also runs in the cluster) takes several retries to connect successfully, but once it is up the queries run quickly.

I have done some debugging: running tcpdump on cni0 shows that almost all UDP and TCP packets coming into the interface are flagged with incorrect checksums. I am not sure whether this is a symptom or the root cause. Searching online, flannel appears to have had issues in the past when checksum calculation is offloaded to the NIC. I tried turning off tx-checksum-ip-generic with ethtool, as suggested in those posts, but got nowhere.

Steps To Reproduce:

Provision an EC2 instance as described above.
Run k3sup to install K3s.
Install K3s with the following flags:
--disable traefik, (set of OIDC flags for kube apiserver), --secrets-encryption

Expected behavior:
TCP connections establish quickly between pods in the cluster.

Actual behavior:
TCP connections take a long time to establish and require several retries.
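
One way to put numbers on this behavior is to time the TCP handshake repeatedly from inside a pod. The following is a minimal Python sketch, not something taken from this report: `time_tcp_connect` is a hypothetical helper name, and the attempt count and timeout are arbitrary assumptions.

```python
import socket
import time


def time_tcp_connect(host, port, attempts=5, timeout=3.0):
    """Time several TCP connects to (host, port).

    Returns one entry per attempt: the handshake duration in seconds,
    or None if the attempt failed or timed out.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection returns once the three-way handshake completes.
            with socket.create_connection((host, port), timeout=timeout):
                results.append(time.monotonic() - start)
        except OSError:
            # Covers refused connections, timeouts, and unreachable hosts.
            results.append(None)
    return results
```

Calling it with the in-cluster database address, for example `time_tcp_connect("db.default.svc.cluster.local", 5432)` (a placeholder Service name and port), would show whether failures and long handshakes cluster on the first attempts or are spread evenly.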

Additional context / logs:
As mentioned, I have already looked through a lot of information about flannel to try to debug this, but I can't work out why I am seeing what I am seeing.

Here is a screen capture of tcpdump output:

I can do some hacky grepping and see that some checksums are correct:

I have no idea if this is the cause of the issue or a symptom of some other misconfig.

I have also tried the host-gw backend and see the same behavior.

Thanks!

maxsargentdev · 2023-09-01T16:49:32Z

I am going to try the new Amazon Linux 2023 when I get home, as it uses Linux kernel 6.1 or later.

Will update.

maxsargentdev · 2023-09-02T10:29:21Z

I have tried the newer operating system but got the same issue.

I just need someone to confirm that these checksum errors are expected on the cni0 interface. From what I have gathered from further reading, they are expected on veth devices, as it makes no sense to compute a checksum when nothing is going over the wire.
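
For context on what tcpdump is checking when it flags a packet: it recomputes the Internet checksum (RFC 1071) over the IPv4 pseudo-header plus the TCP segment and compares it with the checksum field. When the sender defers that computation (to the NIC, or skips it entirely on a virtual link), the field has not been filled in at capture time, so the check fails even though nothing is corrupted. Below is a minimal Python sketch of that verification. The function names are illustrative assumptions, and it covers only TCP over IPv4.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's-complement of the one's-complement sum
    of the data taken as 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def tcp_checksum_ok(src_ip: bytes, dst_ip: bytes, segment: bytes) -> bool:
    """Check a TCP segment (header + payload) against the IPv4 pseudo-header.

    src_ip and dst_ip are 4-byte packed addresses. With a correct checksum
    field in place, the checksum over everything comes out as zero.
    """
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 6, len(segment))
    return internet_checksum(pseudo + segment) == 0
```

A segment captured before its checksum field is filled in fails `tcp_checksum_ok`, which matches the "incorrect" flag in the capture and says nothing about whether the receiver actually dropped the packet.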

If this is the case I will move on to try some other fixes.

maxsargentdev · 2023-09-04T07:11:38Z

I have confirmed the issue here is not with k3s or flannel, closing.

maxsargentdev closed this as completed Sep 4, 2023
