Pods stuck in terminating state after AMI amazon-eks-node-1.16.15-20201112 #563
Comments
Same problem occurs on 1.17 on EKS Platform version eks.4 |
Also on EKS 1.17 platform version eks.2 |
Same issue on EKS 1.18 platform version eks.1 |
We also tested side-by-side deployments, one with liveness and readiness probes as above, and one without. The one without was able to terminate correctly, while the one with the probes was stuck in the Terminating state. |
We are experiencing the same thing on EKS 1.17 (eks.2) with AMI version 1.17.12-20201112. |
Reverting to |
Not sure if it's similar or not, but we are experiencing an issue on EKS 1.15 (eks.4) with AMI version 1.15.12-20201112, where the aws-node pods are repeatedly producing k8s events with the following message; we do not see this on the
|
@paxos-cs We are experiencing the same thing on 1.18-v20201112. I think it's all related. We noticed the issue when we were using some automation to do |
Seems to be related to this moby/moby#41352 (comment). Can someone run this on their node (if it's not a production cluster) and let me know if this fixes the issue. I did try it on a couple of my worker nodes, and both upgrading and downgrading containerd seems to fix the issue. I'm just trying to narrow down what might have caused this.
or
|
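The exact commands were not preserved in this thread; below is a minimal sketch of what downgrading containerd on an Amazon Linux 2 worker node might look like, assuming 1.3.2 as the known-good target discussed above (drain the node first on anything that matters):

```bash
#!/usr/bin/env bash
# Sketch only: check the running versions, then downgrade containerd to a
# known-good build and restart the daemons so new pods use the fixed shim.
set -euo pipefail

rpm -q containerd docker

sudo yum downgrade -y containerd-1.3.2
sudo systemctl restart containerd docker
```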
We are working on releasing a new AMI with containerd 1.3.2. Until then, please roll back your worker nodes to the last AMI |
It's not possible to roll back to a previous AMI (or create a new nodegroup with an AMI that is not the latest) when using managed node groups. Do you have an ETA for the new AMI? |
We are rolling back Managed Nodegroups as well. The rollback should complete today. We will try to release the new AMI today as well, but I will keep this issue updated. Appreciate the patience. |
Thanks for the quick response. As a stopgap measure, is it possible to update the "latest marker" so that new managed nodegroups get created using the previous, working AMI? |
Even on nodes with the old AMI, we are seeing this happen because our userdata script runs yum update -y, which brings along containerd 1.4.0. We shall try 1.4.1 to see if that helps. |
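If the goal is to keep a blanket yum update -y in user data without pulling in the broken containerd build, one option is to exclude the container runtime packages from the update. A sketch, assuming the Amazon Linux 2 package names; adjust the exclude list to your environment:

```bash
#!/usr/bin/env bash
# User-data sketch: apply OS updates but hold back the container runtime so
# the versions shipped in the EKS optimized AMI are not replaced.
set -euo pipefail

yum update -y --exclude=docker --exclude=containerd --exclude=runc
```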
We've been battling with this ever since performing system updates last week. Our nodes essentially became time bombs after about a day and a half, when we started getting this dockerd error in our logs: |
We ran into the same issue. EKS cluster version: 1.18.9. We create a custom AMI with an upgraded kernel version from the EKS optimized AMI, but during bootup the instances seem to upgrade the docker and containerd versions to 19.3.13 and 1.4.0 respectively, and these versions seem to run into the following issue.
Some diagnosis: on the working cluster, docker events for the probes show three events. So we decided to pin the version of docker (to 19.3.6) and containerd (to 1.3.2) during AMI creation. We deployed this AMI and we are clear of all SyncErrors and pod terminating issues. |
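The comment above does not show the actual AMI build steps; a rough sketch of pinning those versions on Amazon Linux 2, assuming the yum versionlock plugin and the package version strings mentioned there:

```bash
#!/usr/bin/env bash
# AMI-build sketch: install specific docker/containerd builds and lock them so
# a later "yum update" cannot move them to the broken 1.4.0 release.
set -euo pipefail

sudo yum install -y yum-plugin-versionlock
sudo yum install -y docker-19.03.6ce containerd-1.3.2
sudo yum versionlock docker containerd
```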
Indeed. We have opened an issue for that (see #435), which resulted in an open request in containers-roadmap (see aws/containers-roadmap#810). Given the magnitude of the current problem, this missing feature becomes even more relevant now. |
It looks like a new AMI has been released, hopefully this will solve these issues. |
All managed nodegroups on release version |
Anyone still seeing pods stuck terminating? We're running fresh clusters with the 1.15.11 AMI v20201007 and still getting stuck pods. Could the issue be with the latest patch version running on the control plane?
We have other clusters still showing 1.15.11 for the server version that do not have the issue |
@jpke Can you check the containerd version running on your nodes? We're not aware of issues with the |
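For anyone checking their own nodes, these commands report which containerd and docker builds are actually in use (run on the worker node itself; nothing cluster-specific is assumed):

```bash
containerd --version                              # running containerd build
rpm -q containerd docker                          # installed package versions
docker version --format '{{.Server.Version}}'     # dockerd version
```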
We upgraded our cluster two hours after AWS acknowledged the issue #563 (comment), but we got upgraded to this already known broken version :/
If you want a quick workaround, you can set the amid in your parameter-set for the cluster, then deploy the worker node action again (in Windsor). The AMI you want depends on the region, but for us-east-1, ami-03f1bd4665a6fd084 is the latest patched one.
By setting the amid in the parameter-set you force the automation to use that exact AMI. That one is patched, and has the latest kernel and security patches as well.
The automation will replace your nodes one by one, or in batches of the number specified in the asgMaxBatchSize parameter in the parameter-set for the cluster. The default is 1, but it can be overridden if you create/change that parameter.
-Don
|
@rtripat I can also confirm that the issue has been resolved for us since upgrading to version However, given the magnitude of this, I think you should increase the priority of aws/containers-roadmap#810. It became apparent that users couldn't follow your proposed workaround of rolling back to version |
Last Saturday, we upgraded our clusters in 4 production regions (AP, AU, EU, US) from v1.14 to v1.18 and nightmares happened.
|
We are taking multiple steps to prevent recurrence of this issue. Specifically, we have added a regression test for this specific case which creates a container with a HEALTHCHECK, monitors its liveness for a period of time, and ensures cleanup on termination. We are also working on changes to allow creating EKS Managed Nodegroups at any AMI version and mark them as |
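As a rough illustration of that regression scenario (not AWS's actual test; the image, names, and timings here are made up), the check boils down to running a container whose HEALTHCHECK execs a probe, letting it run for a while, and verifying that stopping it completes instead of hanging:

```bash
#!/usr/bin/env bash
# Sketch of the failure scenario: an exec-style HEALTHCHECK plus a stop, the
# combination that hung on the affected containerd build.
set -euo pipefail

# Build a trivial image whose health check execs a command, like the probes above.
docker build -t healthcheck-regression - <<'EOF'
FROM public.ecr.aws/amazonlinux/amazonlinux:2
HEALTHCHECK --interval=5s --timeout=2s CMD echo ok
CMD ["sleep", "infinity"]
EOF

docker run -d --name hc-test healthcheck-regression
sleep 30                           # let several health checks fire
timeout 60 docker stop hc-test     # must return instead of hanging
docker rm hc-test
```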
That's great to hear, thank you @rtripat. I have also commented on the corresponding issue asking for an ETA. Do you want me to close this issue as resolved, or are you going to do it? |
The same problem for us. Imagine the nightmares with:
❯ kubectl version --short
...
Server Version: v1.18.9-eks-d1db3c
❯ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-62-52-32.eu-west-1.compute.internal Ready <none> 5d22h v1.18.9-eks-d1db3c
ip-10-62-53-209.eu-west-1.compute.internal Ready <none> 22h v1.18.9-eks-d1db3c
ip-10-62-54-119.eu-west-1.compute.internal Ready <none> 22h v1.18.9-eks-d1db3c
ip-10-62-59-119.eu-west-1.compute.internal Ready <none> 22h v1.18.9-eks-d1db3c
ip-10-62-59-182.eu-west-1.compute.internal Ready <none> 47h v1.18.9-eks-d1db3c
ip-10-62-65-227.eu-west-1.compute.internal Ready <none> 2d7h v1.18.9-eks-d1db3c
ip-10-62-71-134.eu-west-1.compute.internal Ready <none> 21h v1.18.9-eks-d1db3c
ip-10-62-72-42.eu-west-1.compute.internal Ready <none> 5d22h v1.18.9-eks-d1db3c
ip-10-62-77-213.eu-west-1.compute.internal Ready <none> 2d6h v1.18.9-eks-d1db3c
Thanks to @leokhoa for the command. And we are trying the |
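The command referenced above was not preserved in this thread; the usual stopgap for pods wedged in Terminating is a forced delete, roughly as below (pod and namespace names are placeholders, and this only removes the pod object from the API server, it does not fix the node):

```bash
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
```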
@christiangda A new AMI was released last week, so if you can upgrade your AMI to |
Hi @mmerkes, thank you for the information; I noticed this yesterday. Today we updated all our EKS clusters and reactivated the Pod Autoscaler. |
All managed nodegroups on release version |
@rtripat, can you confirm if this is fixed for self-managed nodes with "yum update" or "apt update" in the user data script? If not, I guess this is still open. |
Are you using the EKS optimized AMI? You can upgrade to AMI version |
@rtripat, yes, we were using version 20201117 for our self-managed nodes with "yum update -y" in our user data, which upgraded containerd to 1.4.1 and caused the same issue. Right now, to get around the issue, we are not running the update on instance startup. |
20201117 was pinned to containerd 1.3.2, which doesn't have the bug. You don't need to update Can you post the output of |
With "yum update -y" the output is containerd-1.4.1-2.amzn2.x86_64. |
I see. You are asking for an EKS optimized AMI where you get containerd-1.4.x? We just released one today which has containerd-1.4.1-2 and a patch for CVE-2020-15257: https://github.com/awslabs/amazon-eks-ami/releases/tag/v20201126 |
Hi @rtripat, we have used the new AMI (v20201126) which was released yesterday. It looks to be fine and we have noticed no issues. Thanks. |
FYI: You can create or update a Managed Nodegroup to any AMI release version. |
@rtripat is your reply because something has changed for managed node groups since this issue was active and resolved? |
Right. A corrective action item that came out of this AMI release was to allow customers to roll back to a previous AMI release version. So, I wanted to share that the EKS Managed Nodegroup API allows customers to create/upgrade a nodegroup to any AMI release version. Same feature request as in aws/containers-roadmap#810 |
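For reference, a sketch of how that looks with the AWS CLI; the cluster and nodegroup names are placeholders, and the release version shown is just the 1.18.9 / v20201126 combination discussed in this thread:

```bash
# Pin a managed nodegroup to an explicit AMI release version instead of "latest".
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --release-version 1.18.9-20201126
```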
Thanks @rtripat, that's really good to know. |
Thanks @rtripat, this is great news! |
What happened:
Since upgrading to AMI 1.16.15-20201112 (from 1.16.13-20201007), we see a lot of Pods get stuck in Terminating state. We have noticed that all of these Pods have readiness/liveness probes of type exec.

What you expected to happen:
The Pods should be deleted.

How to reproduce it (as minimally and precisely as possible):
Apply the following YAML to create a deployment with exec type probes for readiness/liveness, and once all Pods become ready, delete the Deployment.
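The original manifest and delete command were not preserved in this extraction; a representative sketch of a Deployment with exec-type readiness/liveness probes (image and names are illustrative) would be:

```bash
# Create a Deployment whose pods use exec-type readiness/liveness probes.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: exec-probe-test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: exec-probe-test
  template:
    metadata:
      labels:
        app: exec-probe-test
    spec:
      containers:
      - name: sleeper
        image: public.ecr.aws/amazonlinux/amazonlinux:2
        command: ["sleep", "infinity"]
        readinessProbe:
          exec:
            command: ["true"]
          periodSeconds: 5
        livenessProbe:
          exec:
            command: ["true"]
          periodSeconds: 5
EOF

# Once all pods are Ready, delete the Deployment and watch for pods that
# never leave the Terminating state.
kubectl delete deployment exec-probe-test
kubectl get pods -w
```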
Anything else we need to know?:
We also tried the above with a 1.17 EKS cluster (AMI release version 1.17.12-20201112) and it exhibits the same behavior.

Environment:
- Platform version (aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.4
- Kubernetes version (aws eks describe-cluster --name <name> --query cluster.version): 1.16