
K3s regularly spikes in cpu usage and crashes in a loop #6095

Closed
Blackclaws opened this issue Sep 2, 2022 · 12 comments

Comments

@Blackclaws

Environmental Info:
K3s Version:
k3s version v1.24.3+k3s1 (990ba0e)
go version go1.18.1

Node(s) CPU architecture, OS, and Version:
AlmaLinux 8.6, x86_64, 8 cores

Cluster Configuration:
1 server in an HA-ready configuration

Describe the bug:
k3s regularly spikes to 700-800% CPU usage until it crashes because the leader election is lost (to itself). It then restarts and repeats this loop.

This happens about every 2-3 days.

Additional context / logs:

Manually stopping k3s with systemctl stop k3s and restarting it fixes the problem temporarily.

According to the logs, etcd apply sometimes takes up to 6 seconds.

Sep 02 15:13:22 miku k3s[2694852]: I0902 15:13:22.250924 2694852 trace.go:205] Trace[1434124888]: "List(recursive=true) etcd3" key:/events,resourceVersion:0,resourceVersionMatch:,limit:500,continue: (02-Sep-2022 15:13:14.773) (total time: 7477ms):
Sep 02 15:13:22 miku k3s[2694852]: Trace[1434124888]: [7.477022009s] [7.477022009s] END
Sep 02 15:13:22 miku k3s[2694852]: E0902 15:13:22.349266 2694852 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": context deadline exceeded
Sep 02 15:13:22 miku k3s[2694852]: I0902 15:13:22.349361 2694852 leaderelection.go:283] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
Sep 02 15:13:22 miku k3s[2694852]: E0902 15:13:22.349426 2694852 controllermanager.go:284] "leaderelection lost"
@brandond
Member

brandond commented Sep 2, 2022

That usually indicates that the storage isn't sufficient to keep up with the IO requirements of the datastore. By "HA ready configuration", I assume you mean a single-node etcd cluster? Etcd has much higher IO requirements than sqlite, and really won't work well on anything except for SSD. What sort of disk are you running k3s on? Are you using the same physical media for the datastore, your workload, and the containerd image store?
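For reference, the disk can be profiled against this specific pattern with a short fio run that mimics etcd's WAL behavior (small writes, each followed by an fdatasync). The paths and sizes below are only an illustrative sketch; the scratch directory is a hypothetical location on the same volume as k3s's default data directory:

# create a scratch directory on the same filesystem as the etcd datastore (hypothetical path)
mkdir -p /var/lib/rancher/k3s/server/db/fio-test
# small sequential writes with an fdatasync after each one, similar to etcd's WAL
fio --name=etcd-fsync-test --directory=/var/lib/rancher/k3s/server/db/fio-test --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300
# rule of thumb from the etcd docs: the reported 99th percentile fdatasync latency should stay well under 10ms
rm -rf /var/lib/rancher/k3s/server/db/fio-test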

@Blackclaws
Author

The thing is that etcd provides native snapshotting, and this is a server we might want to expand with worker nodes in the future.

The underlying storage is NVMe SSDs that should have sufficient headroom for fast access. The same physical medium is used for the datastore, the workload, and containerd images, as this is a VM that has only a single volume provisioned.

What's surprising to me is that during initial startup this isn't a problem at all; it only seems to become a problem once some time has passed. Is there any way to increase resilience to these effects?

@brandond
Member

brandond commented Sep 6, 2022

> The underlying storage is NVMe SSDs that should have sufficient headroom for fast access. The same physical medium is used for the datastore, the workload, and containerd images, as this is a VM that has only a single volume provisioned.

Have you profiled the disk for latency and throughput? Keep in mind that etcd makes a blocking fsync call for every write operation to the datastore. If your workload is performing a lot of write operations at the same time, all of those will need to be flushed to disk before etcd can commit the write. You may not be exceeding the rated IOPS, but simply running into problems with bulk IO blocking performance-critical cluster operations.
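A way to see whether those fsyncs are actually slow during an episode is to pull etcd's own latency histograms from its metrics endpoint. This is only a sketch; the certificate paths are assumptions based on k3s's default embedded-etcd layout:

# query the embedded etcd's metrics (TLS paths assume k3s defaults under /var/lib/rancher/k3s)
curl -s --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key https://127.0.0.1:2379/metrics | grep -E 'wal_fsync_duration|backend_commit_duration'
# high-percentile growth in these histograms points at slow disk syncs rather than CPU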

> What's surprising to me is that during initial startup this isn't a problem at all; it only seems to become a problem once some time has passed.

Is this perhaps occurring during some background operation, such as a scheduled TRIM operation, or some IO-intensive processing job on the part of the workload?

@Blackclaws
Author

Blackclaws commented Sep 8, 2022

> The underlying storage is NVMe SSDs that should have sufficient headroom for fast access. The same physical medium is used for the datastore, the workload, and containerd images, as this is a VM that has only a single volume provisioned.

> Have you profiled the disk for latency and throughput? Keep in mind that etcd makes a blocking fsync call for every write operation to the datastore. If your workload is performing a lot of write operations at the same time, all of those will need to be flushed to disk before etcd can commit the write. You may not be exceeding the rated IOPS, but simply running into problems with bulk IO blocking performance-critical cluster operations.

Yes. Throughput and latency should be more than sufficient.

> What's surprising to me is that during initial startup this isn't a problem at all; it only seems to become a problem once some time has passed.

> Is this perhaps occurring during some background operation, such as a scheduled TRIM operation, or some IO-intensive processing job on the part of the workload?

No, it happens seemingly at random. Manually running TRIM operations doesn't trigger it, and writing a lot of data doesn't trigger it either; I honestly don't know what does. iostat while it happens shows the storage not particularly under pressure, but CPU usage at full, all taken by a multitude of k3s processes.

Is there any sort of process I might follow to help debug the cause of this?

@brandond
Member

brandond commented Sep 8, 2022

> CPU usage at full, all taken by a multitude of k3s processes.

Can you clarify what you mean by that? There should normally only be 1 k3s process (either k3s server or k3s agent), occasionally 2 if k3s is started with the --log flag and has to re-exec itself to redirect output to a log file. Are you actually seeing multiple k3s processes, or just multiple OS threads?
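For what it's worth, the two are easy to tell apart from the host; a quick sketch, assuming pidof resolves to a single k3s server process:

# one line per k3s process
ps -C k3s -o pid,ppid,cmd
# one line per OS thread of that process (plus a header line)
ps -T -p "$(pidof k3s)" | wc -l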

@Blackclaws
Author

Multiple OS threads, I'd say. I've attached a view of htop and iostat while the problem is occurring. As I said, iostat looks rather unproblematic: iowait is low, yet for some reason k3s uses a ton of CPU and etcd requests time out:

(Screenshots attached: k3s-usage, k3s-iostat)

@brandond
Member

If you're using iostat, can you look at the f_await and aqu-sz fields? These are more likely to show what's causing etcd problems than the base IO rate alone.

https://manpages.debian.org/testing/sysstat/iostat.1.en.html#f_await
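An extended iostat sample shows those fields per device; a minimal sketch, where the device name is a placeholder for the VM's data disk and f_await requires a reasonably recent sysstat:

# extended device statistics once per second; watch aqu-sz and f_await during an episode
iostat -x -d 1 /dev/vda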

@Blackclaws
Author

Will do the next time this problem shows up. Since this is a production server, restoring service took priority over waiting around to debug this.

@Blackclaws
Author

So aqu-sz is really low, and f_await isn't tracked, which might be because the underlying disk is a virtual disk.

What I've also noticed is a spike from iptables-save -t filter, which saves the current filter rules; I think that's part of the startup/shutdown of k3s.
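A quick way to rule out rule bloat as the reason those iptables-save calls get expensive is to count what they dump; a one-liner sketch:

# a very large rule count makes every iptables-save/iptables-restore cycle proportionally slower
iptables-save -t filter | wc -l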

Just to give you an idea of the thread sprawl that k3s seems to produce in this situation:

(Screenshot attached: Screenshot_2022-11-07_09-17-12)

The machine basically becomes unusable at this point. IO latency also spikes (if I run a benchmark like bonnie++ at the same time), but it's hard to tell whether that is simply because of the high load on the machine. Shutting down k3s returns it to normal, after which k3s can be restarted and normal operation resumes.

I won't rule out that this is an issue that crops up due to the underlying machine. What I find weird is that it takes quite some time to happen the first time, even with the full load of cluster startup applied right at the beginning (which I assume is the point of highest load for most use cases). There seems to be some sort of trigger that puts k3s into this state, from which recovery requires:

k3s-killall.sh
systemctl start k3s

This then leads to a standard startup that works fine.

I do wonder why the CPU load spikes this way, though. It seems there is actual work being done, since the processes aren't IO-blocked; I just don't know what that work is.
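One way to see what that work actually is would be to look at the hottest threads and, if possible, grab a Go CPU profile. A sketch only: perf must be installed, and the --enable-pprof server flag plus the exact pprof endpoint and auth behavior should be verified against the docs for this k3s version:

# which k3s threads are burning CPU, and in which symbols
top -H -p "$(pidof k3s)"
perf top -g -p "$(pidof k3s)"
# if the server was started with --enable-pprof, a 30s Go CPU profile may be
# retrievable from the supervisor port (endpoint and auth are assumptions to verify)
curl -sk "https://127.0.0.1:6443/debug/pprof/profile?seconds=30" -o k3s-cpu.pprof
go tool pprof k3s-cpu.pprof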

@brandond
Member

I haven't seen any updates on this in a bit; were you able to identify a root cause of the CPU load?

@Blackclaws
Author

Blackclaws commented Jun 13, 2023

@brandond Unfortunately not. The issue persists and crops up every once in a while. I can't rule out that it has something to do with disk IO being too slow; however, we've switched VM hosts, etc. It might be related to other VMs running on the same host. Since we don't control the host directly, it's hard to tell whether the problems coincide with high IO load on the host system.

However, given that this crops up semi-regularly, that the only way to make it stop is to reboot the machine (after which the problem goes away), and that a simple restart of k3s without a reboot does nothing to alleviate the issue, it might be related to some sort of failure in the IO system that gets cleared by the reboot.

It's still extremely weird, and I'd love to go a second round on this if you have any further ideas on what to test, especially since the very high CPU load is so odd. It basically drives the whole system into the ground: k3s takes up so much of the resources that the existing containers start to stall (followed by k3s itself stalling, timing out, restarting, and repeating the cycle). Given that k3s is supposed to run even on low-end systems, I get the feeling this is not something that should be happening on this type of host, even with the etcd backend.

@caroline-suse-rancher
Contributor

Converting this to a discussion, as (at this point) it doesn't seem to be directly related to a k3s bug.

@k3s-io k3s-io locked and limited conversation to collaborators Jun 14, 2023
@caroline-suse-rancher caroline-suse-rancher converted this issue into discussion #7786 Jun 14, 2023
