
K3s regularly spikes in cpu usage and crashes in a loop #6095

Closed
Blackclaws opened this issue Sep 2, 2022 · 12 comments

Comments

@Blackclaws

Environmental Info:
K3s Version:
k3s version v1.24.3+k3s1 (990ba0e)
go version go1.18.1

Node(s) CPU architecture, OS, and Version:
AlmaLinux 8.6, x86_64, 8 cores

Cluster Configuration:
1 server in an HA-ready configuration

Describe the bug:
k3s regularly spikes to 700-800% CPU usage until it crashes because the leader election is lost (to itself). It then restarts and repeats this loop.

This happens about every 2-3 days.

Additional context / logs:

Manually stopping k3s with systemctl stop k3s and restarting it fixes the problem temporarily.

According to the logs, etcd apply sometimes takes up to 6 seconds.

Sep 02 15:13:22 miku k3s[2694852]: I0902 15:13:22.250924 2694852 trace.go:205] Trace[1434124888]: "List(recursive=true) etcd3" key:/events,resourceVersion:0,resourceVersionMatch:,limit:500,continue: (02-Sep-2022 15:13:14.773) (total time: 7477ms):
Sep 02 15:13:22 miku k3s[2694852]: Trace[1434124888]: [7.477022009s] [7.477022009s] END
Sep 02 15:13:22 miku k3s[2694852]: E0902 15:13:22.349266 2694852 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": context deadline exceeded
Sep 02 15:13:22 miku k3s[2694852]: I0902 15:13:22.349361 2694852 leaderelection.go:283] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
Sep 02 15:13:22 miku k3s[2694852]: E0902 15:13:22.349426 2694852 controllermanager.go:284] "leaderelection lost"
@brandond
Member

brandond commented Sep 2, 2022

That usually indicates that the storage isn't sufficient to keep up with the IO requirements of the datastore. By "HA ready configuration", I assume you mean a single-node etcd cluster? Etcd has much higher IO requirements than sqlite, and really won't work well on anything except for SSD. What sort of disk are you running k3s on? Are you using the same physical media for the datastore, your workload, and the containerd image store?
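For reference, the disk can be profiled against this specific pattern with a short fio run that mimics etcd's WAL behavior (small writes, each followed by an fdatasync). The paths and sizes below are only an illustrative sketch; the scratch directory is a hypothetical location on the same volume as k3s's default data directory:

# create a scratch directory on the same filesystem as the etcd datastore (hypothetical path)
mkdir -p /var/lib/rancher/k3s/server/db/fio-test
# small sequential writes with an fdatasync after each one, similar to etcd's WAL
fio --name=etcd-fsync-test --directory=/var/lib/rancher/k3s/server/db/fio-test --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300
# rule of thumb from the etcd docs: the reported 99th percentile fdatasync latency should stay well under 10ms
rm -rf /var/lib/rancher/k3s/server/db/fio-test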

@Blackclaws
Author

The thing is that etcd provides native snapshotting, and this is a server we might want to expand with worker nodes in the future.

The underlying storage is NVMe SSDs that should have sufficient headroom for fast access. The same physical medium is used for the datastore, the workload, and containerd images, as this is a VM that has only a single volume provisioned.

What's surprising to me is that during initial startup this isn't a problem at all; it only seems to become a problem once some time has passed. Is there any way to increase resilience to these effects?

@brandond
Member

brandond commented Sep 6, 2022

> The underlying storage is NVMe SSDs that should have sufficient headroom for fast access. The same physical medium is used for the datastore, the workload, and containerd images, as this is a VM that has only a single volume provisioned.

Have you profiled the disk for latency and throughput? Keep in mind that etcd makes a blocking fsync call for every write operation to the datastore. If your workload is performing a lot of write operations at the same time, all of those will need to be flushed to disk before etcd can commit the write. You may not be exceeding the rated IOPS, but simply running into problems with bulk IO blocking performance-critical cluster operations.
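A way to see whether those fsyncs are actually slow during an episode is to pull etcd's own latency histograms from its metrics endpoint. This is only a sketch; the certificate paths are assumptions based on k3s's default embedded-etcd layout:

# query the embedded etcd's metrics (TLS paths assume k3s defaults under /var/lib/rancher/k3s)
curl -s --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key https://127.0.0.1:2379/metrics | grep -E 'wal_fsync_duration|backend_commit_duration'
# high-percentile growth in these histograms points at slow disk syncs rather than CPU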

> What's surprising to me is that during initial startup this isn't a problem at all; it only seems to become a problem once some time has passed.

Is this perhaps occurring during some background operation, such as a scheduled TRIM operation, or some IO-intensive processing job on the part of the workload?

@Blackclaws
Author

Blackclaws commented Sep 8, 2022

> The underlying storage is NVMe SSDs that should have sufficient headroom for fast access. The same physical medium is used for the datastore, the workload, and containerd images, as this is a VM that has only a single volume provisioned.

> Have you profiled the disk for latency and throughput? Keep in mind that etcd makes a blocking fsync call for every write operation to the datastore. If your workload is performing a lot of write operations at the same time, all of those will need to be flushed to disk before etcd can commit the write. You may not be exceeding the rated IOPS, but simply running into problems with bulk IO blocking performance-critical cluster operations.

Yes. Throughput and latency should be more than sufficient.

> What's surprising to me is that during initial startup this isn't a problem at all; it only seems to become a problem once some time has passed.

> Is this perhaps occurring during some background operation, such as a scheduled TRIM operation, or some IO-intensive processing job on the part of the workload?

No, it happens seemingly at random. Manually running TRIM operations doesn't trigger it, and writing a lot of data doesn't trigger it either; I honestly don't know what does. iostat while it happens shows the storage not particularly under pressure, but CPU usage at full, all taken by a multitude of k3s processes.

Is there any sort of process I might follow to help debug the cause of this?

@brandond
Member

brandond commented Sep 8, 2022

> CPU usage at full, all taken by a multitude of k3s processes.

Can you clarify what you mean by that? There should normally only be 1 k3s process (either k3s server or k3s agent), occasionally 2 if k3s is started with the --log flag and has to re-exec itself to redirect output to a log file. Are you actually seeing multiple k3s processes, or just multiple OS threads?
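For what it's worth, the two are easy to tell apart from the host; a quick sketch, assuming pidof resolves to a single k3s server process:

# one line per k3s process
ps -C k3s -o pid,ppid,cmd
# one line per OS thread of that process (plus a header line)
ps -T -p "$(pidof k3s)" | wc -l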

@Blackclaws
Author

Multiple OS threads, I'd say. I've attached a view of htop and iostat while the problem is occurring. As I said, iostat looks rather unproblematic: iowait is low, yet for some reason k3s uses a ton of CPU and etcd requests time out:

(Screenshots attached: k3s-usage, k3s-iostat)

@brandond
Member

If you're using iostat, can you look at the f_await and aqu-sz fields? These are more likely to show what's causing etcd problems than the base IO rate alone.

https://manpages.debian.org/testing/sysstat/iostat.1.en.html#f_await
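An extended iostat sample shows those fields per device; a minimal sketch, where the device name is a placeholder for the VM's data disk and f_await requires a reasonably recent sysstat:

# extended device statistics once per second; watch aqu-sz and f_await during an episode
iostat -x -d 1 /dev/vda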

@Blackclaws
Author

Will do the next time this problem shows up. Since this is a production server, restoring service took priority over waiting around to debug this.

@Blackclaws
Author

So aqu-sz is really low, and f_await isn't tracked, which might be because the underlying disk is a virtual disk.

What I've also noticed is a spike from iptables-save -t filter, which saves the current filter rules; I think that's part of the startup/shutdown of k3s.
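A quick way to rule out rule bloat as the reason those iptables-save calls get expensive is to count what they dump; a one-liner sketch:

# a very large rule count makes every iptables-save/iptables-restore cycle proportionally slower
iptables-save -t filter | wc -l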

Just to give you an idea of the thread sprawl that k3s seems to produce in this situation:

(Screenshot attached: Screenshot_2022-11-07_09-17-12)

The machine basically becomes unusable at this point. IO latency also spikes (if I run a benchmark like bonnie++ at the same time), but it's hard to tell whether that is simply because of the high load on the machine. Shutting down k3s returns it to normal, after which k3s can be restarted and normal operation resumes.

I won't rule out that this is an issue that crops up due to the underlying machine. What I find weird is that it takes quite some time to happen the first time, even with the full load of cluster startup applied right at the beginning (which I assume is the point of highest load for most use cases). There seems to be some sort of trigger that puts k3s into this state, from which recovery requires:

k3s-killall.sh
systemctl start k3s

This then leads to a standard startup that works fine.

I do wonder why the CPU load spikes this way, though. It seems there is actual work being done, since the processes aren't IO-blocked; I just don't know what that work is.
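One way to see what that work actually is would be to look at the hottest threads and, if possible, grab a Go CPU profile. A sketch only: perf must be installed, and the --enable-pprof server flag plus the exact pprof endpoint and auth behavior should be verified against the docs for this k3s version:

# which k3s threads are burning CPU, and in which symbols
top -H -p "$(pidof k3s)"
perf top -g -p "$(pidof k3s)"
# if the server was started with --enable-pprof, a 30s Go CPU profile may be
# retrievable from the supervisor port (endpoint and auth are assumptions to verify)
curl -sk "https://127.0.0.1:6443/debug/pprof/profile?seconds=30" -o k3s-cpu.pprof
go tool pprof k3s-cpu.pprof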

@brandond
Member

I haven't seen any updates on this in a bit; were you able to identify a root cause of the CPU load?

@Blackclaws
Author

Blackclaws commented Jun 13, 2023

@brandond Unfortunately not. The issue persists and crops up every once in a while. I can't rule out that it has something to do with disk IO being too slow; however, we've switched VM hosts, etc. It might be related to other VMs running on the same host. Since we don't control the host directly, it's hard to tell whether the problems coincide with high IO load on the host system.

However, given that this crops up semi-regularly, that the only way to make it stop is to reboot the machine (after which the problem goes away), and that a simple restart of k3s without a reboot does nothing to alleviate the issue, it might be related to some sort of failure in the IO system that gets cleared by the reboot.

It's still extremely weird, and I'd love to go a second round on this if you have any further ideas on what to test, especially since the very high CPU load is so odd. It basically drives the whole system into the ground: k3s takes up so much of the resources that the existing containers start to stall (followed by k3s itself stalling, timing out, restarting, and repeating the cycle). Given that k3s is supposed to run even on low-end systems, I get the feeling this is not something that should be happening on this type of host, even with the etcd backend.

@caroline-suse-rancher
Contributor

Converting this to a discussion, as (at this point) it doesn't seem to be directly related to a k3s bug.

@k3s-io k3s-io locked and limited conversation to collaborators Jun 14, 2023
@caroline-suse-rancher caroline-suse-rancher converted this issue into discussion #7786 Jun 14, 2023
