Restart of rke2-server slow when using images tarball #773

Closed
Martin-Weiss opened this issue Mar 12, 2021 · 22 comments

@Martin-Weiss

Due to the missing registry namespace mapping, we switched to copying the images tarball to /var/lib/rancher/agent/images as a workaround. With this approach the initial deployment works - but we had to realize that a restart of rke2-server is now very slow.

Version: v1.20.4+rke2r1

Could it be that during every restart the tarball gets extracted / verified to see whether it includes new / other / changed images?

@brandond
Member

brandond commented Mar 12, 2021

Yes, all tarballs in the images directory are imported every time RKE2 is started. This is common behavior across RKE2 and K3s. If you are currently using an uncompressed or gzip-compressed tarball, you might try using the zstd archive as it is optimized for faster decompression, and reduces the IO necessary to import the images.
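
For example (a hedged one-liner, assuming zstd is installed and using the release asset names that appear later in this thread), an existing gzip tarball can be recompressed to zstd before staging it:

# recompress the gzip airgap tarball to zstd (names are examples)
zcat rke2-images.linux-amd64.tar.gz | zstd > rke2-images.linux-amd64.tar.zst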

@Martin-Weiss
Author

Do you think we could improve this somehow, e.g. by remembering that the import for a given tarball has already been done?

@brandond
Member

Even if you were to, say, store a hash or checksum of the file and skip importing it, there's no guarantee that the images it contains are still present in the containerd image store. Kubelet garbage collection, or even an unknowing user, may have deleted them. The only safe way to ensure that everything is available is to process the file every time.

@Martin-Weiss
Author

OK - so I am really hoping for the registry namespace mapping, so that we do not have to replicate the large tarball to all servers/agents and process it during every restart.. - a lot of work that can simply be avoided by the on-premise registry we need anyway ;-)

@brandond
Member

brandond commented Mar 12, 2021

How much longer are you seeing it take? On my dev nodes it adds less than a minute, and RKE2 is pretty slow to start (compared to K3s at least) regardless, so it has not been terribly burdensome.

I think most folks just distribute the rke2 image tarball alongside the binary, since they already have to solve installation for themselves in an airgap environment.
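
A minimal sketch of that staging step (assuming the default data directory and the zstd release asset name seen later in this thread; adjust paths for your install):

# stage the airgap tarball where the agent imports images from on startup
mkdir -p /var/lib/rancher/rke2/agent/images/
cp rke2-images.linux-amd64.tar.zst /var/lib/rancher/rke2/agent/images/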

@Martin-Weiss
Author

In the deployment we did yesterday it took maybe 2-3 minutes - but I did not measure it exactly.

Do you know what will be "down / unavailable" during this "systemctl restart rke2-*"? If there is no downtime during that restart it might not matter much - though we also do not see a similar need to re-install RPMs during every server boot, so I am not sure why we have to do this for images ;-)..

In the architectures I have deployed so far I have always had the required image source on an on-premise registry.. similar to central RPM repositories..

@brandond
Member

brandond commented Mar 12, 2021

Workload pods will continue running while RKE2 and the Kubelet are stopped. If it's down for too long the cluster will mark the node as NotReady and eventually try to reschedule pods away from it, but that usually takes much longer than simply restarting the service.

Well for starters, RPMs don't have an automated garbage collection system removing them from the host when disk space gets low. We've also chosen to bootstrap the host binaries and manifests from an image instead of building them into a self-extracting binary as we did for K3s, so an image is also needed for that.

@Martin-Weiss
Author

If disk space gets low and garbage collection kicks in, extracting the images would also be a problem under disk pressure.. however - I might not have enough insight to understand the full picture and the point of re-extraction...

@brandond
Member

Kubelet image GC kicks in when the disk is 85% full, so on larger disks there can in fact be plenty of space left when it starts deleting images that it sees as not currently in use by a running container.
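
As a side note (a hedged sketch, not something prescribed in this thread): those thresholds are standard kubelet flags, which RKE2 can pass through via kubelet-arg in /etc/rancher/rke2/config.yaml; the values below are illustrative only:

# /etc/rancher/rke2/config.yaml (excerpt) - illustrative GC thresholds
kubelet-arg:
  - "image-gc-high-threshold=90"
  - "image-gc-low-threshold=80"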

@brandond
Member

brandond commented Mar 12, 2021

Just as a datapoint, the tarball takes 31 seconds to import on one of my local test VMs; on server-class hardware this should be even faster.

microos01:~ # journalctl -u rke2-server | grep Import
Mar 12 20:34:26 microos01 rke2[1353]: time="2021-03-12T20:34:26Z" level=info msg="Importing images from /var/lib/rancher/rke2/agent/images/rke2-airgap.tar"
Mar 12 20:34:57 microos01 rke2[1353]: time="2021-03-12T20:34:57Z" level=info msg="Imported docker.io/rancher/hardened-k8s-metrics-server:v0.3.6-build20210223"
Mar 12 20:34:57 microos01 rke2[1353]: time="2021-03-12T20:34:57Z" level=info msg="Imported docker.io/rancher/klipper-helm:v0.4.3-build20210225"
Mar 12 20:34:57 microos01 rke2[1353]: time="2021-03-12T20:34:57Z" level=info msg="Imported docker.io/rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1"
Mar 12 20:34:57 microos01 rke2[1353]: time="2021-03-12T20:34:57Z" level=info msg="Imported docker.io/rancher/nginx-ingress-controller:nginx-0.30.0-rancher1"
Mar 12 20:34:57 microos01 rke2[1353]: time="2021-03-12T20:34:57Z" level=info msg="Imported docker.io/rancher/hardened-calico:v3.13.3-build20210223"
Mar 12 20:34:57 microos01 rke2[1353]: time="2021-03-12T20:34:57Z" level=info msg="Imported docker.io/rancher/hardened-coredns:v1.6.9-build20210223"
Mar 12 20:34:57 microos01 rke2[1353]: time="2021-03-12T20:34:57Z" level=info msg="Imported docker.io/rancher/hardened-etcd:v3.4.13-k3s1-build20210223"
Mar 12 20:34:57 microos01 rke2[1353]: time="2021-03-12T20:34:57Z" level=info msg="Imported docker.io/rancher/hardened-flannel:v0.13.0-rancher1-build20210223"
Mar 12 20:34:57 microos01 rke2[1353]: time="2021-03-12T20:34:57Z" level=info msg="Imported docker.io/rancher/hardened-kube-proxy:v1.20.4-build20210302"
Mar 12 20:34:57 microos01 rke2[1353]: time="2021-03-12T20:34:57Z" level=info msg="Imported docker.io/rancher/pause:3.2"
Mar 12 20:34:57 microos01 rke2[1353]: time="2021-03-12T20:34:57Z" level=info msg="Imported docker.io/rancher/rke2-runtime:v1.20.4-dev-c29a66b6"
Mar 12 20:34:57 microos01 rke2[1353]: time="2021-03-12T20:34:57Z" level=info msg="Imported docker.io/rancher/hardened-kubernetes:v1.20.4-dev-c29a66b6"

The zstd tarball is marginally faster to load, but at 460M instead of 2.1G it is much easier to redistribute.

Mar 12 20:43:53 microos01 rke2[6122]: time="2021-03-12T20:43:53Z" level=info msg="Importing images from /var/lib/rancher/rke2/agent/images/rke2-airgap.tar.zst"
Mar 12 20:44:21 microos01 rke2[6122]: time="2021-03-12T20:44:21Z" level=info msg="Imported docker.io/rancher/hardened-k8s-metrics-server:v0.3.6-build20210223"
Mar 12 20:44:21 microos01 rke2[6122]: time="2021-03-12T20:44:21Z" level=info msg="Imported docker.io/rancher/klipper-helm:v0.4.3-build20210225"
Mar 12 20:44:21 microos01 rke2[6122]: time="2021-03-12T20:44:21Z" level=info msg="Imported docker.io/rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1"
Mar 12 20:44:21 microos01 rke2[6122]: time="2021-03-12T20:44:21Z" level=info msg="Imported docker.io/rancher/nginx-ingress-controller:nginx-0.30.0-rancher1"
Mar 12 20:44:21 microos01 rke2[6122]: time="2021-03-12T20:44:21Z" level=info msg="Imported docker.io/rancher/hardened-calico:v3.13.3-build20210223"
Mar 12 20:44:21 microos01 rke2[6122]: time="2021-03-12T20:44:21Z" level=info msg="Imported docker.io/rancher/hardened-coredns:v1.6.9-build20210223"
Mar 12 20:44:21 microos01 rke2[6122]: time="2021-03-12T20:44:21Z" level=info msg="Imported docker.io/rancher/hardened-etcd:v3.4.13-k3s1-build20210223"
Mar 12 20:44:21 microos01 rke2[6122]: time="2021-03-12T20:44:21Z" level=info msg="Imported docker.io/rancher/hardened-flannel:v0.13.0-rancher1-build20210223"
Mar 12 20:44:21 microos01 rke2[6122]: time="2021-03-12T20:44:21Z" level=info msg="Imported docker.io/rancher/hardened-kube-proxy:v1.20.4-build20210302"
Mar 12 20:44:21 microos01 rke2[6122]: time="2021-03-12T20:44:21Z" level=info msg="Imported docker.io/rancher/pause:3.2"
Mar 12 20:44:21 microos01 rke2[6122]: time="2021-03-12T20:44:21Z" level=info msg="Imported docker.io/rancher/rke2-runtime:v1.20.4-dev-c29a66b6"
Mar 12 20:44:21 microos01 rke2[6122]: time="2021-03-12T20:44:21Z" level=info msg="Imported docker.io/rancher/hardened-kubernetes:v1.20.4-dev-c29a66b6"

@Martin-Weiss
Author

> Kubelet image GC kicks in when the disk is 85% full, so on larger disks there can in fact be plenty of space left when it starts deleting images that it sees as not currently in use by a running container.

Thanks - if the customer's /var/lib/rancher is at 85%, that is effectively "out of disk space" and the disk needs to be sized much bigger anyway.. Just think about a single image being "updated".. And I am not sure why the GC deletes images that are "in use" - is that how the GC is designed?

@Martin-Weiss
Author

> Just as a datapoint, the tarball takes 31 seconds to import on one of my local test VMs; on server-class hardware this should be even faster.

Will check this out - hopefully today. Are your "VMs" on "spinning disks" or on NFS?

@Martin-Weiss
Author

Checked - customer VM with "all-flash storage": 42 seconds. So I assume it will be much slower in spinning-disk scenarios.

Would that mean "downtime"?

As you think zstd would be better - could the release on https://github.com/rancher/rke2/releases be published as zstd instead of tar.gz then?

@chadningle

chadningle commented Mar 17, 2021

We want to pre-stage not just the rke2 tarball of images but also a lot of other images we need in these high-side environments, to reduce provisioning time. It would be exceedingly helpful if, upon rke2-server or rke2-agent install, the local containerd service and socket path were available outside of the rke2-server or agent services so this could be done. Currently this adds 15 minutes to node provisioning time in an auto-scaling group, since the import does not happen until we actually start the service. Pre-importing during the AMI creation phase would be ideal here.

@brandond
Member

brandond commented Mar 17, 2021

We bundle containerd as part of the RKE2 runtime (which is extracted from the image tarball) so it literally does not exist on the host until RKE2 starts and unpacks the runtime image. This behavior is carried over from K3s, and is part of the value proposition - not having to manage docker or containerd versions on the nodes.

If you want to install your own standalone containerd and preload images into that, you can always use --container-runtime-endpoint to point RKE2 at an existing socket.
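
A minimal config sketch for that option (the socket path is just an example for a host-managed containerd, not something prescribed here):

# /etc/rancher/rke2/config.yaml - point RKE2 at an existing containerd socket
container-runtime-endpoint: unix:///run/containerd/containerd.sock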

As mentioned above, we import all images from tarballs on every startup, since we can't necessarily rely on the Kubelet to not garbage collect them between runs. Honestly for larger deployments with multiple gigs of images, you're probably better off using a private registry instead of trying to preload everything directly onto the nodes.

Once RKE2 is up though, you can interact with it directly:

[root@centos01 ~]# export CONTAINERD_ADDRESS=/run/k3s/containerd/containerd.sock CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin

[root@centos01 ~]# ctr --namespace k8s.io images ls -q
docker.io/rancher/hardened-calico:v3.13.3
docker.io/rancher/hardened-calico:v3.13.3-build20210223
docker.io/rancher/hardened-coredns:v1.6.9-build20210223
docker.io/rancher/hardened-etcd:v3.4.13-k3s1
docker.io/rancher/hardened-etcd:v3.4.13-k3s1-build20210223
docker.io/rancher/hardened-etcd@sha256:407a417f4dfe8311ceb661c186f95667f8b55fae2b7cff52aa8e062efd4f7f31
docker.io/rancher/hardened-flannel:v0.13.0-rancher1
docker.io/rancher/hardened-flannel:v0.13.0-rancher1-build20210223
docker.io/rancher/hardened-k8s-metrics-server:v0.3.6-build20210223
docker.io/rancher/hardened-kube-proxy:v1.20.2
docker.io/rancher/hardened-kube-proxy:v1.20.4-build20210302
docker.io/rancher/hardened-kube-proxy@sha256:d0600143c23c769d64d031a1f1cfe17bb8ca7b4b7112e1ab69e67f3c55bf2c21
docker.io/rancher/hardened-kubernetes:v1.20.2-beta1-rke2r1
docker.io/rancher/hardened-kubernetes:v1.20.2-dev-0722700c
docker.io/rancher/hardened-kubernetes:v1.20.4-dev-5166e107
docker.io/rancher/hardened-kubernetes:v1.20.4-dev-b2d628f2
docker.io/rancher/hardened-kubernetes@sha256:bd2de5571fb59376cfa872c91078ae1cec912cc7a7b665e81510b904e71b26de
docker.io/rancher/klipper-helm:v0.4.3-build20210225
docker.io/rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
docker.io/rancher/nginx-ingress-controller:nginx-0.30.0-rancher1
docker.io/rancher/pause:3.2
docker.io/rancher/pause@sha256:76d9f46511297dd38d760194c1afb38583c10a17e94849d0f8fb5217ce5e20ea
docker.io/rancher/rke2-runtime:v1.20.2-dev-0722700c
docker.io/rancher/rke2-runtime:v1.20.4-dev-5166e107
docker.io/rancher/rke2-runtime:v1.20.4-dev-b2d628f2
sha256:01074024bd5f1315a9a8ac6544f0d6c953b7b872ee3af91a0c064bb732a0e035
sha256:0930f4df54e168d53b57f41cb1a65703f96ab9ca2f3178f9de10b6662b3ee57c
sha256:2166188b0dba13e4be178190391536dd91ff8fd29944322bf3734e6f89856d45
sha256:21c345f3958937e9d1b877956c6c9d52889220633a8138cba3d2786ab4cdd013
sha256:271c0a695260e1caaa39782a55a1cfbd566d03b4897f7b44555dc71a371ee54b
sha256:366c64051af8567f4e741182a0c4e2e649fed70f5a91b92d12f6bce29a31c0e3
sha256:5d05c5a9b5533813a66c3602949532c1f1be011ee61c4213d23f72bfe688dc31
sha256:5e714ee56edaef586b9686e9b9275a7172264cdca82e3353767175d7a11b3499
sha256:6ab71f04f5474a4dbc0f19b5b166be86cceb9dba63349bc8d6a5ca68243ec957
sha256:736cae9d947ba51bc29b44535d72b3ca7f6b422e3b2da02d9602e9ed77dc2578
sha256:aee55f9c6784e238c2f62b7bdc12959d9286e6dded50a732072088b22494359c
sha256:b5af743e598496e8ebd7a6eb3fea76a6464041581520d1c2315c95f993287303
sha256:bd44cab9ef086b587be15c63e3599450b3ff28d660abef8b9c2903a4e0083454
sha256:c184ca8e7ce9bd6af5d43363054ed52b60e013b0d5b5a86a899bf55ba1d2df86
sha256:d1e554409702a3d4625e149214c9210bb6d70e519ad503df143c59a181320e7f
sha256:e004ddc1b078f7f6375b7294b9ebb9026cdda37a880dbd0c9da5e0885ef075cf
sha256:e7c578d94f6d982ee3f4b28fca765f4432c3ae3578c269c83f919fea8cd66fd0
sha256:e8dd7ce55cff31bc380de8b08717cbdf0d97dc85bd72748ba28a84833970bcbb
sha256:f24e72a2007f8f110aa5d63ec7a59a6c5dd687e0a090c9b0f6c7ccd2d270dc83
sha256:f96df637bbc7ed5acbc7ec51b69df4be60156c37b10e67a5cfcc076dcefdc62b
sha256:fe40618686a2fbf0472280f72e6881d1214a739420abf31bdb039920794404b1
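
For example (hedged; extra-images.tar is a hypothetical tarball of your own workload images), additional images could then be loaded into the same image store with:

[root@centos01 ~]# ctr --namespace k8s.io images import extra-images.tar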

@Martin-Weiss
Author

I believe we might have to discuss two different things here:

  1. Re-importing the tarball during each rke2-server restart -> this is what I believe should be "optimized", as the service restart directly causes downtime, e.g. during reboots or upgrades.

  2. The possibility to pre-import images required for "workload". Here I believe the on-premise registry question needs to be clarified.. why not have a parallel server that provides the images, similar to DNS/NTP etc.? And we might need to understand better why one would pre-load all the images rather than just pulling them on demand, as containers were designed to do... So what exactly are the requirements that lead to pre-loading the workload images on all nodes in a cluster?

@chadningle

chadningle commented Mar 19, 2021

This is a multi-cluster environment. Usually both AMI image staging and private-registry population are done. You take the images you need from Docker Hub, Quay.io, P1's Iron Bank, etc., and stage them to your registry. This would be one pipeline with two triggers... one timed, and another via commit of new images you'd like staged.

Another pipeline is the AMI creation. This is fed by various cluster ASG requirements for specific workloads or apps.

Absent any cluster node pre-warming automation you may have, this is sufficient for performance tuning. This request is all about performance tuning.

Because it doesn't look like the container runtime socket will be available until cluster provisioning time, we're instead going to design an automated node pre-warming system with a central image pull list on a per-ASG basis, so that images we know may be pulled from time to time are already pulled, unpacked, and ready to go on the node. We'll have two triggers here: one per day, staggered so as not to monopolize all nodes at once, to update which images should be staged from the private registry - and another that kicks off staggered updates if the staging list for that particular ASG gets updated in code.

So in summary, we have the following pod provisioning performance tuning steps that can be taken:

  1. Stage known images for AMIs for a given ASG targeting specific workloads. (Not viable since the node container runtime isn't available during the AMI pipeline phase).
  2. Mirror all used images to a local private registry for expedient pulls and more reliable continuity of service.
  3. Create an intelligent container image pre-warming automated system, similar to what AWS does with Lambda when requesting that your functions get pre-warmed. This will be done using S3, Lambda, and SSM and will be integrated into our terraform on a per-ASG basis.

Methods 2 and 3 will be sufficient to keep our cluster "snappy". What is the technical requirement? Snappy. ;)

Chad

@Martin-Weiss
Author

FYI - just deployed another RKE2 with airgap and 1.20.4 - here I see these times during every server or agent restart:

Mar 26 10:41:22 rke-test-master-01 rke2[18977]: time="2021-03-26T10:41:22+01:00" level=info msg="Importing images from /var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64.tar.zst"
Mar 26 10:42:15 rke-test-master-01 rke2[18977]: time="2021-03-26T10:42:15+01:00" level=info msg="Running kubelet

53 seconds

Mar 26 11:25:54 rke-test-worker-01 rke2[13944]: time="2021-03-26T11:25:54+01:00" level=info msg="Importing images from /var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64.tar.zst"
Mar 26 11:27:10 rke-test-worker-01 rke2[13944]: time="2021-03-26T11:27:10+01:00" level=info msg="Imported docker.io/rancher/hardened-kube-proxy:v1.20.4-build20210302"

76 seconds

This is on NVMe - so I assume this will be much slower if someone uses spinning disks..

@Martin-Weiss
Author

> 3. Create an intelligent container image pre-warming automated system, similar to what AWS does with Lambda when requesting that your functions get pre-warmed. This will be done using S3, Lambda, and SSM and will be integrated into our terraform on a per-ASG basis.
>
> Methods 2 and 3 will be sufficient to keep our cluster "snappy". What is the technical requirement? Snappy. ;)
>
> Chad

Thanks for all the details - this sounds a bit like "large images" without "shared layers"...
Could you share the number and size of your images?

@brandond
Member

I'm not sure how server-class hardware with NVMe would be slower than a QEMU VM on my old Ryzen 5 2600X dev box with a SATA SSD. What's your storage configuration look like?

@Martin-Weiss
Author

> I'm not sure how server-class hardware with NVMe would be slower than a QEMU VM on my old Ryzen 5 2600X dev box with a SATA SSD. What's your storage configuration look like?

In my case it is ESXi with vSphere, with the local datastore on NVMe.

@stale

stale bot commented Sep 25, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
