Restart of rke2-server slow when using images tarball #773
Yes, all tarballs in the images directory are imported every time RKE2 is started. This is common behavior across RKE2 and K3s. If you are currently using an uncompressed or gzip-compressed tarball, you might try the zstd archive: it is optimized for fast decompression and reduces the I/O needed to import the images.
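A sketch of that conversion, assuming the gzip and zstd tools are installed (the helper name and tarball filename are illustrative):

```shell
# Recompress a gzip airgap tarball to zstd for faster decompression on import.
recompress_to_zstd() {
    # $1 is a .tar.gz file; the .tar.zst is written alongside it
    gunzip -c "$1" | zstd -q -o "${1%.gz}.zst"
}
```

Usage would look like `recompress_to_zstd rke2-images.linux-amd64.tar.gz`.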
Do you think we could improve this somehow, e.g. by remembering that the import for a given tarball was already done?
Even if you were to store, say, a hash or checksum of the file and skip importing it, there's no guarantee that the images it contains are still present in the containerd image store. Kubelet garbage collection or even an unknowing user may have deleted them. The only safe way to ensure that everything is available is to process the file every time.
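To make the tradeoff concrete, here is a minimal sketch of that checksum-skip idea. This is not how RKE2 behaves; the helper name and stamp-file path are hypothetical:

```shell
# Sketch of a "skip the import if the tarball hasn't changed" scheme.
# NOT actual RKE2 behavior; the stamp file is a made-up cache marker.
import_if_changed() {
    tarball="$1" stamp="$2"
    current="$(sha256sum "$tarball" | awk '{print $1}')"
    if [ -f "$stamp" ] && [ "$(cat "$stamp")" = "$current" ]; then
        # Unsafe in general: the kubelet (or a user) may have deleted the
        # images from containerd since the last import, which is exactly
        # why RKE2 re-imports the tarball on every start.
        echo "tarball unchanged, skipping import"
        return 0
    fi
    ctr --namespace k8s.io images import "$tarball" || return 1
    echo "$current" > "$stamp"
}
```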
OK - so I'm really hoping for the registry namespace mapping, so we don't have to replicate the large tarball to all the servers/agents and process it during every restart. Lots of work that could simply be handled by the on-premise registry we need anyway ;-)
How much longer are you seeing it take? On my dev nodes it adds less than a minute, and RKE2 is pretty slow to start (compared to K3s at least) regardless so it's not been terribly burdensome. I think most folks just distribute the rke2 image tarball alongside the binary, since they already have to solve installation for themselves in an airgap environment.
In the deployment we did yesterday it took maybe 2-3 minutes, but I did not measure it exactly. Do you know what will be down/unavailable during this "systemctl restart rke2-*"? If there is no downtime during that restart it might not matter much, although we also don't see a need to reinstall RPMs on every server boot, so I'm not sure why we have to do this for images ;-). In the architectures I have deployed so far I always had the required image source on an on-premise registry, similar to central RPM repositories.
Workload pods will continue running while RKE2 and the Kubelet are stopped. If the node is down for too long the cluster will mark it as NotReady and eventually try to reschedule pods away from it, but that usually takes much longer than simply restarting the service.

As for the RPM comparison: for starters, RPMs don't have an automated garbage collection system removing them from the host when disk space gets low. We've also chosen to bootstrap the host binaries and manifests from an image instead of building them into a self-extracting binary as we did for K3s, so an image is needed for that as well.
If disk space gets low and garbage collection kicks in, the extraction of images would also cause a problem under that same disk pressure. However, I might not have enough insight to understand the full picture and the reasoning behind re-extraction...
Kubelet image GC kicks in when the disk is 85% full, so on larger disks there can in fact be plenty of space left when it starts deleting images that it sees as not currently in use by a running container.
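That threshold is tunable via the kubelet's image GC flags, which RKE2 exposes through `kubelet-arg`; an illustrative (not recommended) config.yaml fragment:

```yaml
# /etc/rancher/rke2/config.yaml - values here are illustrative only
kubelet-arg:
  - "image-gc-high-threshold=95"  # start image GC at 95% disk usage (default 85)
  - "image-gc-low-threshold=90"   # GC frees space until usage drops below 90%
```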
Just as a datapoint, the tarball takes 31 seconds to import on one of my local test VMs; on server-class hardware this should be even faster.
The zstd tarball is marginally faster to load, but is 460M instead of 2.1G and therefore much easier to redistribute.
Thanks - if the customer's /var/lib/rancher is at 85%, that is effectively "out of disk space" and the disk needs to be sized much bigger anyway. Just think about a single image being "updated". And I am not sure why the GC deletes images that are "in use" - is this how the GC is designed?
Will check this out - hopefully today. Do you also have "VMs" on "spinning disks" or on NFS?
Checked - a customer VM took 42 seconds with "all flash storage", so I assume much longer in spinning-disk scenarios. Would that mean "downtime"? Since you think zstd would be better - could we release zstd on https://github.com/rancher/rke2/releases instead of tar.gz?
We want to pre-stage not just the rke2 tarball of images, but a lot of other images we need in these high-side environments, to reduce provisioning time. It would be exceedingly helpful if, upon rke2-server or rke2-agent install, the local containerd service and socket path were available outside of the rke2-server or agent services so this could be done. Currently this adds 15 minutes to node provisioning time in an auto-scaling group, since the import does not happen until we actually start the service. Pre-importing during the AMI creation phase would be ideal here.
We bundle containerd as part of the RKE2 runtime (which is extracted from the image tarball), so it literally does not exist on the host until RKE2 starts and unpacks the runtime image. This behavior is carried over from K3s, and is part of the value proposition: not having to manage docker or containerd versions on the nodes. If you want to install your own standalone containerd and preload images into that, you can always do so.

As mentioned above, we import all images from tarballs on every startup, since we can't necessarily rely on the Kubelet not to garbage-collect them between runs. Honestly, for larger deployments with multiple gigs of images, you're probably better off using a private registry instead of trying to preload everything directly onto the nodes.

Once RKE2 is up, though, you can interact with it directly:

[root@centos01 ~]# export CONTAINERD_ADDRESS=/run/k3s/containerd/containerd.sock CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin
[root@centos01 ~]# ctr --namespace k8s.io images ls -q
docker.io/rancher/hardened-calico:v3.13.3
docker.io/rancher/hardened-calico:v3.13.3-build20210223
docker.io/rancher/hardened-coredns:v1.6.9-build20210223
docker.io/rancher/hardened-etcd:v3.4.13-k3s1
docker.io/rancher/hardened-etcd:v3.4.13-k3s1-build20210223
docker.io/rancher/hardened-etcd@sha256:407a417f4dfe8311ceb661c186f95667f8b55fae2b7cff52aa8e062efd4f7f31
docker.io/rancher/hardened-flannel:v0.13.0-rancher1
docker.io/rancher/hardened-flannel:v0.13.0-rancher1-build20210223
docker.io/rancher/hardened-k8s-metrics-server:v0.3.6-build20210223
docker.io/rancher/hardened-kube-proxy:v1.20.2
docker.io/rancher/hardened-kube-proxy:v1.20.4-build20210302
docker.io/rancher/hardened-kube-proxy@sha256:d0600143c23c769d64d031a1f1cfe17bb8ca7b4b7112e1ab69e67f3c55bf2c21
docker.io/rancher/hardened-kubernetes:v1.20.2-beta1-rke2r1
docker.io/rancher/hardened-kubernetes:v1.20.2-dev-0722700c
docker.io/rancher/hardened-kubernetes:v1.20.4-dev-5166e107
docker.io/rancher/hardened-kubernetes:v1.20.4-dev-b2d628f2
docker.io/rancher/hardened-kubernetes@sha256:bd2de5571fb59376cfa872c91078ae1cec912cc7a7b665e81510b904e71b26de
docker.io/rancher/klipper-helm:v0.4.3-build20210225
docker.io/rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
docker.io/rancher/nginx-ingress-controller:nginx-0.30.0-rancher1
docker.io/rancher/pause:3.2
docker.io/rancher/pause@sha256:76d9f46511297dd38d760194c1afb38583c10a17e94849d0f8fb5217ce5e20ea
docker.io/rancher/rke2-runtime:v1.20.2-dev-0722700c
docker.io/rancher/rke2-runtime:v1.20.4-dev-5166e107
docker.io/rancher/rke2-runtime:v1.20.4-dev-b2d628f2
sha256:01074024bd5f1315a9a8ac6544f0d6c953b7b872ee3af91a0c064bb732a0e035
sha256:0930f4df54e168d53b57f41cb1a65703f96ab9ca2f3178f9de10b6662b3ee57c
sha256:2166188b0dba13e4be178190391536dd91ff8fd29944322bf3734e6f89856d45
sha256:21c345f3958937e9d1b877956c6c9d52889220633a8138cba3d2786ab4cdd013
sha256:271c0a695260e1caaa39782a55a1cfbd566d03b4897f7b44555dc71a371ee54b
sha256:366c64051af8567f4e741182a0c4e2e649fed70f5a91b92d12f6bce29a31c0e3
sha256:5d05c5a9b5533813a66c3602949532c1f1be011ee61c4213d23f72bfe688dc31
sha256:5e714ee56edaef586b9686e9b9275a7172264cdca82e3353767175d7a11b3499
sha256:6ab71f04f5474a4dbc0f19b5b166be86cceb9dba63349bc8d6a5ca68243ec957
sha256:736cae9d947ba51bc29b44535d72b3ca7f6b422e3b2da02d9602e9ed77dc2578
sha256:aee55f9c6784e238c2f62b7bdc12959d9286e6dded50a732072088b22494359c
sha256:b5af743e598496e8ebd7a6eb3fea76a6464041581520d1c2315c95f993287303
sha256:bd44cab9ef086b587be15c63e3599450b3ff28d660abef8b9c2903a4e0083454
sha256:c184ca8e7ce9bd6af5d43363054ed52b60e013b0d5b5a86a899bf55ba1d2df86
sha256:d1e554409702a3d4625e149214c9210bb6d70e519ad503df143c59a181320e7f
sha256:e004ddc1b078f7f6375b7294b9ebb9026cdda37a880dbd0c9da5e0885ef075cf
sha256:e7c578d94f6d982ee3f4b28fca765f4432c3ae3578c269c83f919fea8cd66fd0
sha256:e8dd7ce55cff31bc380de8b08717cbdf0d97dc85bd72748ba28a84833970bcbb
sha256:f24e72a2007f8f110aa5d63ec7a59a6c5dd687e0a090c9b0f6c7ccd2d270dc83
sha256:f96df637bbc7ed5acbc7ec51b69df4be60156c37b10e67a5cfcc076dcefdc62b
sha256:fe40618686a2fbf0472280f72e6881d1214a739420abf31bdb039920794404b1
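With those variables exported, additional tarballs can also be loaded by hand into the same k8s.io namespace; a sketch (the helper name and tarball path are hypothetical):

```shell
# Import an extra image tarball into RKE2's embedded containerd.
# Assumes CONTAINERD_ADDRESS and PATH are set as in the session above.
import_extra() {
    ctr --namespace k8s.io images import "$1"
}
```

For example: `import_extra /tmp/extra-images.tar.zst`.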
I believe we might have to discuss two different things here.
This is a multi-cluster environment. Usually both AMI image staging and private registry population are done: you take the images you need from Docker Hub, Quay.io, P1's Iron Bank, etc. and stage them to your registry. This would be one pipeline with two triggers, one timed and another on commit of new images you'd like staged. Another pipeline is AMI creation, fed by the requirements of the various cluster ASGs for specific workloads or apps. Absent any cluster node pre-warming automation you may have, this is sufficient for performance tuning.

This request is all about performance tuning. Because it doesn't look like the container runtime socket will be available until cluster provisioning time, we're instead going to design an automated node pre-warming system: a central image pull list on a per-ASG basis, so that images we know may be pulled from time to time are already pulled, unpacked, and ready to go on the node. We'll have two triggers here: one per day, staggered so as not to monopolize all nodes at once, to update which images each node should have staged from the private registry; and another that kicks off staggered updates when the staging list for a particular ASG is changed in code.

So in summary, we have the following pod provisioning performance tuning steps that can be taken:
Methods 2 and 3 will be sufficient to keep our cluster "snappy". What is the technical requirement? Snappy. ;) Chad
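The pre-warm loop described above could be sketched roughly like this (the function name and list-file format are hypothetical; crictl needs the CRI_CONFIG_FILE setup shown earlier in the thread):

```shell
# Pull every image named in a per-ASG list file (one reference per line,
# blank lines and '#' comments ignored) so pods start without a pull.
prewarm_images() {
    list="$1"
    while IFS= read -r image; do
        [ -z "$image" ] && continue
        case "$image" in '#'*) continue ;; esac
        crictl pull "$image" || echo "failed to pull $image" >&2
    done < "$list"
}
```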
FYI - just deployed another RKE2 with airgap and 1.20.4. Here I see these times during every server or agent restart:

Mar 26 10:41:22 rke-test-master-01 rke2[18977]: time="2021-03-26T10:41:22+01:00" level=info msg="Importing images from /var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64.tar.zst" (53 seconds)

Mar 26 11:25:54 rke-test-worker-01 rke2[13944]: time="2021-03-26T11:25:54+01:00" level=info msg="Importing images from /var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64.tar.zst" (76 seconds)

This is on NVMe, so I assume it will be much slower if someone uses spinning disks.
Thanks for all the details - this sounds a bit like "large images" without "shared layers"...
I'm not sure how server-class hardware with NVMe would be slower than a QEMU VM on my old Ryzen 5 2600X dev box with a SATA SSD. What does your storage configuration look like?
In my case it is ESXi with vSphere, with the local datastore on NVMe.
This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
Due to the missing registry namespace mapping, we switched to copying the images tarball to /var/lib/rancher/rke2/agent/images as a workaround. The initial deployment works this way, but we realized that a restart of rke2-server is now very slow.
Version: v1.20.4+rke2r1
Could it be that during every restart the tarball gets extracted/verified to see whether it includes new or changed images?
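For reference, the workaround described above amounts to staging the tarball in the directory RKE2 scans on startup; a sketch, with the helper name made up and the path parameterized so it is visible:

```shell
# Copy an airgap image tarball into the directory RKE2 imports from.
IMAGES_DIR="${IMAGES_DIR:-/var/lib/rancher/rke2/agent/images}"
stage_tarball() {
    mkdir -p "$IMAGES_DIR"
    cp "$1" "$IMAGES_DIR/"
}
```

Every tarball in that directory is then imported on each service start, which is the slowdown discussed in this issue.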