
Pods stuck in terminating state after AMI amazon-eks-node-1.16.15-20201112 #563

Closed
iliastsi opened this issue Nov 17, 2020 · 47 comments

@iliastsi

What happened:
Since upgrading to AMI 1.16.15-20201112 (from 1.16.13-20201007), we see a lot of Pods getting stuck in the Terminating state. We have noticed that all of these Pods have readiness/liveness probes of type exec.

What you expected to happen:
The Pods should be deleted.

How to reproduce it (as minimally and precisely as possible):
Apply the following YAML to create a deployment with exec type probes for readiness/liveness:

$ cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 20
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "true"
          failureThreshold: 5
          initialDelaySeconds: 1
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 1
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "true"
          failureThreshold: 5
          initialDelaySeconds: 1
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 1
EOF

and once all Pods become ready, delete the Deployment:

$ kubectl delete deployment nginx-deployment
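
To confirm the symptom, you can watch the pods after the delete (the label selector matches the manifest above); on an affected node the Pods remain in Terminating indefinitely instead of disappearing:

$ kubectl get pods -l app=nginx --watch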

Anything else we need to know?:
We also tried the above with a 1.17 EKS cluster (AMI release version 1.17.12-20201112) and it exhibits the same behavior.

Environment:

  • AWS Region: eu-central-1
  • Instance Type(s): m5d.xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.4
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.16
  • AMI Version: 1.16.15-20201112
@ugur-akkar

Same problem occurs on EKS 1.17, platform version eks.4.

@jhuntwork

Also on EKS 1.17 platform version eks.2

@alexbescond

Same issue on EKS 1.18 platform version eks.1

@jhuntwork

We also tested side-by-side deployments, one with liveness and readiness probes as above and one without. The one without was able to terminate correctly; the one with the probes was stuck in the Terminating state.
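
For anyone repeating this comparison, a minimal control deployment without probes can be created from the CLI (the name nginx-noprobe is just a placeholder):

$ kubectl create deployment nginx-noprobe --image=nginx:1.14.2
$ kubectl scale deployment nginx-noprobe --replicas=20
$ # once both sets of Pods are Ready, delete both and compare
$ kubectl delete deployment nginx-deployment nginx-noprobe
$ kubectl get pods --watch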

@dgarbus

dgarbus commented Nov 17, 2020

We are experiencing the same thing on EKS 1.17 (eks.2) with AMI version 1.17.12-20201112.

@webframp

Reverting to the amazon-eks-node-1.17-v20201007 AMI seems to resolve it for us.

@paxos-cs

Not sure if it's related, but we are experiencing an issue on EKS 1.15 (eks.4) with AMI version 1.15.12-20201112 where the aws-node pods repeatedly produce Kubernetes events with the following message. We do not see this on the v20201007 AMI:

Message:             Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
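
A quick way to spot these events cluster-wide is to filter on the Unhealthy reason (kubelet records probe errors and failures under that reason); this is just a convenience for confirming the symptom:

$ kubectl get events -n kube-system --field-selector reason=Unhealthy --sort-by=.lastTimestamp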

@jwesolowski-rms

jwesolowski-rms commented Nov 17, 2020

@paxos-cs We are experiencing the same thing on 1.18-v20201112. I think it's all related. We noticed the issue when our automation ran kubectl exec commands inside containers and they would hang and never return. Terminating the pod also seems to hang.
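
A simple check for whether a node is affected is to run a short exec with a timeout against a Pod scheduled on it (the Pod name is a placeholder); on a healthy node this returns immediately, on an affected node it hits the timeout:

$ timeout 30 kubectl exec <pod-on-suspect-node> -- /bin/true; echo "exit code: $?"
$ # exit code 0   -> the exec completed normally
$ # exit code 124 -> the exec hung and was killed by timeout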

@SaranBalaji90

SaranBalaji90 commented Nov 17, 2020

This seems to be related to moby/moby#41352 (comment). Can someone run this on their node (if it's not a production cluster) and let me know if it fixes the issue? I tried it on a couple of my worker nodes, and both upgrading and downgrading containerd seem to fix the issue. I'm just trying to narrow down what might have caused this.

cat << EOF > upgrade-containerd.sh
#!/bin/bash
set -eo pipefail
docker ps
systemctl stop docker
systemctl stop containerd
wget https://github.com/containerd/containerd/releases/download/v1.4.1/containerd-1.4.1-linux-amd64.tar.gz
tar xvf containerd-1.4.1-linux-amd64.tar.gz
cp -f bin/c* /bin/
systemctl start docker
systemctl start containerd
systemctl restart kubelet
systemctl status docker
systemctl status containerd
systemctl status kubelet
docker version
docker ps
EOF
chmod +x upgrade-containerd.sh
sudo ./upgrade-containerd.sh

or

cat << EOF > downgrade-containerd.sh
#!/bin/bash
set -eo pipefail
docker ps
sudo yum downgrade containerd-1.3.2-1.amzn2.x86_64
systemctl restart docker
systemctl restart kubelet
docker ps
EOF
chmod +x downgrade-containerd.sh
sudo ./downgrade-containerd.sh
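
Either way, it is worth verifying what actually ended up running afterwards. Note that the upgrade script copies binaries into /bin without going through yum, so rpm still reports the old package version; checking the binary itself avoids that confusion:

$ rpm -q containerd                              # package version known to rpm/yum
$ containerd --version                           # version of the binary actually on disk
$ docker version --format '{{.Server.Version}}'  # Docker engine version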

@rtripat rtripat self-assigned this Nov 17, 2020
@rtripat
Contributor

rtripat commented Nov 17, 2020

We are working on releasing a new AMI with containerd 1.3.2. Until then, please roll back your worker nodes to the previous AMI, v20201007.

@dgarbus

dgarbus commented Nov 17, 2020

We are working on releasing a new AMI with containerd 1.3.2. Until then, please roll back your worker nodes to the previous AMI, v20201007.

It's not possible to roll back to a previous AMI (or create a new nodegroup with an AMI that is not the latest) when using managed node groups. Do you have an ETA for the new AMI?

@rtripat
Contributor

rtripat commented Nov 17, 2020

We are working on releasing a new AMI with containerd 1.3.2. Until then, please roll back your worker nodes to the previous AMI, v20201007.

It's not possible to roll back to a previous AMI (or create a new nodegroup with an AMI that is not the latest) when using managed node groups. Do you have an ETA for the new AMI?

We are rolling back Managed Nodegroups as well. The rollback should complete today. We will try to release the new AMI today as well, and I will keep this issue updated. Appreciate the patience.

@dgarbus

dgarbus commented Nov 17, 2020

We are working on releasing a new AMI with containerd 1.3.2. Until then, please roll back your worker nodes to the previous AMI, v20201007.

It's not possible to roll back to a previous AMI (or create a new nodegroup with an AMI that is not the latest) when using managed node groups. Do you have an ETA for the new AMI?

We are rolling back Managed Nodegroups as well. The rollback should complete today. We will try to release the new AMI today as well, and I will keep this issue updated. Appreciate the patience.

Thanks for the quick response. As a stopgap measure, is it possible to update the "latest marker" so that new managed nodegroups get created using the previous, working AMI?

@harshal-shah

Even on nodes with the old AMI, we are seeing this happen because our user data script runs yum update -y, which brings in containerd 1.4.0. We shall try 1.4.1 to see if that helps.
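
If you want to keep a general yum update in user data while avoiding the affected runtime packages, excluding them is one option (a sketch, not an official recommendation):

yum update -y --exclude='containerd*' --exclude='docker*'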

@rabidsloth

We've been battling with this ever since performing system updates last week. Our nodes essentially became time bombs: after about a day and a half we started getting this dockerd error in our logs: error: write unix /var/run/docker.sock->@: write: broken pipe, and after a few more hours the nodes became unresponsive and workloads went haywire. Once a node is in this state, it no longer lets us terminate workloads until it has been restarted. We reverted to AMI ID ami-00651928502cc143d last night and removed yum updates from our user data script in hopes of getting back to a stable system.

@samof76

samof76 commented Nov 18, 2020

We ran into the same issue.

EKS cluster version: 1.18.9

We do create a custom AMI with an upgraded kernel version from the EKS optimized AMI. But during boot-up the instances upgrade the Docker and containerd versions to 19.3.13 and 1.4.0 respectively, and these versions run into the following issue:

  1. Deployments work as expected
  2. Readiness and liveness probes start failing on some of the containers
  3. No pod restarts
  4. Kubelet logs SyncErrors
  5. Trying to delete/evict the pods, they get stuck in the Terminating state

Some diagnosis: on the working cluster, docker events for the probes show three events per probe: exec_create, exec_start, and exec_die. On the cluster with the above-mentioned Docker and containerd versions, a while after deploying the pods we see only exec_create and exec_start; there are no exec_die events.
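
For reference, this comparison can be reproduced on a node by watching the Docker event stream; the event names are the ones mentioned in this comment, and the --since window is arbitrary:

$ docker events --since 15m | grep -E 'exec_(create|start|die)'   # Ctrl-C to stop streaming

On a healthy node every probe exec produces all three events; on an affected node the exec_die events stop appearing.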

So we decided to pin the versions of Docker (to 19.3.6) and containerd (to 1.3.2) during AMI creation. We deployed this AMI and we are clear of all SyncErrors and pod-terminating issues.
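
A rough sketch of what that pinning can look like on an Amazon Linux 2 build host; the containerd version string is the one used in the downgrade script earlier in this thread, and the sketch assumes the desired Docker version is already installed:

sudo yum install -y yum-plugin-versionlock
sudo yum downgrade -y containerd-1.3.2-1.amzn2.x86_64   # version string taken from this thread
sudo yum versionlock containerd docker                  # lock whatever versions are now installed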

@dimara

dimara commented Nov 18, 2020

@dgarbus

It's not possible to roll back to a previous AMI (or create a new nodegroup with an AMI that is not the latest) when using managed node groups

Indeed. We have opened an issue for that (see #435), which resulted in an open request in containers-roadmap (see aws/containers-roadmap#810). Given the magnitude of the current problem, this missing feature is even more relevant now.

@stevehipwell
Contributor

It looks like a new AMI has been released; hopefully this will solve these issues.

@rtripat
Contributor

rtripat commented Nov 18, 2020

It looks like a new AMI has been released; hopefully this will solve these issues.

All managed nodegroups on release version 20201112 can now be upgraded to 20201117. If you create new nodegroups, they will automatically get 20201117 release version. Please let us know if you see any issues.
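
For reference, an existing nodegroup can also be moved to the latest release from the CLI (cluster and nodegroup names are placeholders):

$ aws eks update-nodegroup-version --cluster-name <cluster> --nodegroup-name <nodegroup>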

@jpke

jpke commented Nov 18, 2020

Anyone still seeing pods stuck terminating? We're running fresh clusters with the 1.15.11 AMI v20201007 and are still getting stuck pods.

Could the issue be with the latest patch version running on the control plane?

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T23:41:55Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.12-eks-31566f", GitCommit:"31566f851673e809d4d667b7235ed87587d37722", GitTreeState:"clean", BuildDate:"2020-10-20T23:25:14Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}

We have other clusters, still showing 1.15.11 for the server version, that do not have the issue.

@mmerkes
Member

mmerkes commented Nov 18, 2020

@jpke Can you check the containerd version running on your nodes? We're not aware of issues with the v20201007 AMI around pods getting stuck terminating, so this could be unrelated to the most recent issue.
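
A quick way to see which container runtime version each node reports, without logging in, is the nodeInfo field; note that this shows the Docker engine version (e.g. docker://19.3.13), so confirming the containerd package itself still requires rpm -q containerd on the node:

$ kubectl get nodes -o custom-columns='NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion'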

@mtparet

mtparet commented Nov 19, 2020

We upgraded our cluster two hours after AWS acknowledged the issue (#563 (comment)), but we still got upgraded to this already known broken version :/

@dlaidlaw

dlaidlaw commented Nov 19, 2020 via email

@iliastsi
Author

All managed nodegroups on release version 20201112 can now be upgraded to 20201117. If you create new nodegroups, they will automatically get 20201117 release version. Please let us know if you see any issues.

@rtripat I can also confirm that the issue has been resolved for us since upgrading to version 20201117. Thanks for fixing this. I guess this issue can be closed now.

However, given the magnitude of this, I think you should increase the priority of aws/containers-roadmap#810. It became apparent that users couldn't follow your proposed workaround of rolling back to version 20201007 (#563 (comment)) because there is no way to choose the version of the AMI to deploy in managed nodegroups.

@leokhoa

leokhoa commented Nov 20, 2020

Last Saturday we upgraded our clusters in 4 production regions (AP, AU, EU, US) from v1.14 to v1.18, and nightmares happened.
The issue left many pods in our ZooKeeper clusters stuck in the "Terminating" state and affected other clusters (Kafka clusters, SolrCloud clusters). Running "kubectl delete pod --force --grace-period=0 xxx" sometimes causes filesystem corruption. We tried our best to keep our systems up and running, but it was a bad experience upgrading EKS clusters. Positive things:

  1. The issue is fixed
  2. With version 1.18, we have one more happy year without needing to upgrade our EKS clusters :)
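
For anyone in the same situation, the force delete mentioned above can be applied to all Terminating pods in a namespace at once; use it with the same caution noted in this comment, since forcing deletion only removes the API object and does not guarantee the container is actually gone:

$ kubectl get pods -n <namespace> --no-headers \
    | awk '$3 == "Terminating" {print $1}' \
    | xargs -r kubectl delete pod -n <namespace> --force --grace-period=0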

@rtripat
Contributor

rtripat commented Nov 20, 2020

All managed nodegroups on release version 20201112 can now be upgraded to 20201117. If you create new nodegroups, they will automatically get 20201117 release version. Please let us know if you see any issues.

@rtripat I can also confirm that the issue has been resolved for us since upgrading to version 20201117. Thanks for fixing this. I guess this issue can be closed now.

However, given the magnitude of this, I think you should increase the priority of aws/containers-roadmap#810. It became apparent that users couldn't follow your proposed workaround of rolling back to version 20201007 (#563 (comment)) because there is no way to choose the version of the AMI to deploy in managed nodegroups.

We are taking multiple steps to prevent a recurrence of this issue. Specifically, we have added a regression test for this case, which creates a container with HEALTHCHECK, monitors its liveness for a period of time, and ensures cleanup on termination. We are also working on changes to allow creating EKS Managed Nodegroups at any AMI version and to mark them as Degraded if they are on recalled AMI release versions.

@iliastsi
Author

We are taking multiple steps to prevent a recurrence of this issue. Specifically, we have added a regression test for this case, which creates a container with HEALTHCHECK, monitors its liveness for a period of time, and ensures cleanup on termination. We are also working on changes to allow creating EKS Managed Nodegroups at any AMI version and to mark them as Degraded if they are on recalled AMI release versions.

That's great to hear, thank you @rtripat. I have also commented on the corresponding issue asking for an ETA.

Do you want me to close this issue as resolved, or are you going to do it?

@christiangda

The same problem for us:

AWS EKS 1.18.9, since about two weeks ago.

Imagine the nightmare with the Horizontal Pod Autoscaler enabled! So we disabled it and over-sized our nodes.

❯ kubectl version --short
...
Server Version: v1.18.9-eks-d1db3c
❯ kubectl get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-62-52-32.eu-west-1.compute.internal    Ready    <none>   5d22h   v1.18.9-eks-d1db3c
ip-10-62-53-209.eu-west-1.compute.internal   Ready    <none>   22h     v1.18.9-eks-d1db3c
ip-10-62-54-119.eu-west-1.compute.internal   Ready    <none>   22h     v1.18.9-eks-d1db3c
ip-10-62-59-119.eu-west-1.compute.internal   Ready    <none>   22h     v1.18.9-eks-d1db3c
ip-10-62-59-182.eu-west-1.compute.internal   Ready    <none>   47h     v1.18.9-eks-d1db3c
ip-10-62-65-227.eu-west-1.compute.internal   Ready    <none>   2d7h    v1.18.9-eks-d1db3c
ip-10-62-71-134.eu-west-1.compute.internal   Ready    <none>   21h     v1.18.9-eks-d1db3c
ip-10-62-72-42.eu-west-1.compute.internal    Ready    <none>   5d22h   v1.18.9-eks-d1db3c
ip-10-62-77-213.eu-west-1.compute.internal   Ready    <none>   2d6h    v1.18.9-eks-d1db3c

Thanks to @leokhoa for the command kubectl delete pod --force --grace-period=0 xxx, with which we mitigated the issue a little.

We are also trying the workaround described by @SaranBalaji90 in our ASG user data to mitigate the issue until you release the new AMI.
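
For completeness, a minimal sketch of what that user data workaround can look like, based on the downgrade script posted by @SaranBalaji90 above; the cluster name is a placeholder:

#!/bin/bash
set -eo pipefail
# pin containerd back to the known-good version before the node joins the cluster
yum downgrade -y containerd-1.3.2-1.amzn2.x86_64
systemctl restart docker
/etc/eks/bootstrap.sh <cluster-name>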

@mmerkes
Member

mmerkes commented Nov 23, 2020

@christiangda A new AMI was released last week, so if you can upgrade your AMI to 20201117, you should see the issue resolved.

@christiangda

Hi @mmerkes, thank you for the information; I noticed this yesterday.

Today we updated all our EKS clusters and reactivated the Pod Autoscaler.

@rtripat
Contributor

rtripat commented Nov 30, 2020

All managed nodegroups on release version 20201112 can now be upgraded to 20201117 or higher. If you create new nodegroups, they will automatically get 20201117 release version. Please let us know if you see any issues.

@rtripat rtripat closed this as completed Nov 30, 2020
@fazith27

@rtripat, can you confirm whether this is fixed for self-managed nodes that run "yum update" or "apt update" in the user data script? If not, this is still open, I guess.

@rtripat
Contributor

rtripat commented Nov 30, 2020

@rtripat, can you confirm whether this is fixed for self-managed nodes that run "yum update" or "apt update" in the user data script? If not, this is still open, I guess.

Are you using the EKS optimized AMI? You can upgrade to AMI version 20201117 or higher for self-managed nodes too.
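
For self-managed nodes, the current recommended EKS optimized AMI ID for a given Kubernetes version can be looked up from SSM (version and region are placeholders):

$ aws ssm get-parameter \
    --name /aws/service/eks/optimized-ami/1.18/amazon-linux-2/recommended/image_id \
    --region eu-central-1 --query Parameter.Value --output text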

@fazith27

@rtripat, yes, we were using version 20201117 for our self-managed nodes and had "yum update -y" in our user data, which upgraded containerd to 1.4.1, and we hit the same issue. Right now, to avoid the issue, we are not running updates on instance startup.

@rtripat
Contributor

rtripat commented Nov 30, 2020

20201117 was pinned to containerd 1.3.2, which doesn't have the bug. You don't need to run yum update -y on startup.

Can you post the output of rpm -q containerd from an instance where you did run yum update -y?

@fazith27

fazith27 commented Dec 1, 2020

With "yum update -y" the output is containerd-1.4.1-2.amzn2.x86_64.
Without that the output is containerd-1.3.2-1.amzn2.x86_64.
But the actual issue is not fixed, right?
The solution given here is more of a workaround that downgrades the containerd version.

@rtripat
Contributor

rtripat commented Dec 1, 2020

I see. You are asking for an EKS optimized AMI that ships containerd 1.4.x? We just released one today, which has containerd-1.4.1-2 and a patch for CVE-2020-15257:

https://github.com/awslabs/amazon-eks-ami/releases/tag/v20201126

@fazith27

fazith27 commented Dec 2, 2020

Hi @rtripat, we have used the new AMI (v20201126), which was released yesterday. It looks fine and we have noticed no issues. Thanks.

@rtripat
Contributor

rtripat commented Mar 30, 2021

FYI: You can create or update a Managed Nodegroup to any AMI release version.

@stevehipwell
Contributor

@rtripat, is your reply because something has changed for managed node groups since this issue was active and resolved?

@rtripat
Contributor

rtripat commented Mar 31, 2021

@rtripat, is your reply because something has changed for managed node groups since this issue was active and resolved?

Right. A corrective action item that came out of this AMI release was to allow customers to roll back to a previous AMI release version, so I wanted to share that the EKS Managed Nodegroup API now allows customers to create or upgrade a nodegroup to any AMI release version.

Same feature request as in aws/containers-roadmap#810
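
For reference, a sketch of what that looks like with the CLI; the names and the release version shown are placeholders drawn from this thread:

$ aws eks describe-nodegroup --cluster-name <cluster> --nodegroup-name <nodegroup> \
    --query nodegroup.releaseVersion --output text
$ aws eks update-nodegroup-version --cluster-name <cluster> --nodegroup-name <nodegroup> \
    --release-version 1.18.9-20201117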

@stevehipwell
Contributor

Thanks @rtripat, that's really good to know.

@iliastsi
Author

iliastsi commented Apr 2, 2021

Thanks @rtripat, this is great news!
