
Pods stuck in terminating state after AMI amazon-eks-node-1.16.15-20201112 #563

Closed
iliastsi opened this issue Nov 17, 2020 · 47 comments

@iliastsi

What happened:
Since upgrading to AMI 1.16.15-20201112 (from 1.16.13-20201007), we see a lot of Pods getting stuck in the Terminating state. We have noticed that all of these Pods have readiness/liveness probes of type exec.

What you expected to happen:
The Pods should be deleted.

How to reproduce it (as minimally and precisely as possible):
Apply the following YAML to create a deployment with exec type probes for readiness/liveness:

$ cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 20
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "true"
          failureThreshold: 5
          initialDelaySeconds: 1
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 1
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "true"
          failureThreshold: 5
          initialDelaySeconds: 1
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 1
EOF

and once all Pods become ready, delete the Deployment:

$ kubectl delete deployment nginx-deployment
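
To confirm the symptom, you can watch the pods after the delete (the label selector matches the manifest above); on an affected node the Pods remain in Terminating indefinitely instead of disappearing:

$ kubectl get pods -l app=nginx --watch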

Anything else we need to know?:
We also tried the above with a 1.17 EKS cluster (AMI release version 1.17.12-20201112) and it exhibits the same behavior.

Environment:

  • AWS Region: eu-central-1
  • Instance Type(s): m5d.xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.4
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.16
  • AMI Version: 1.16.15-20201112
@ugur-akkar

Same problem occurs on EKS 1.17, platform version eks.4.

@jhuntwork

Also on EKS 1.17 platform version eks.2

@alexbescond

Same issue on EKS 1.18 platform version eks.1

@jhuntwork

We also tested side-by-side deployments, one with liveness and readiness probes as above and one without. The one without was able to terminate correctly; the one with the probes was stuck in the Terminating state.
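
For anyone repeating this comparison, a minimal control deployment without probes can be created from the CLI (the name nginx-noprobe is just a placeholder):

$ kubectl create deployment nginx-noprobe --image=nginx:1.14.2
$ kubectl scale deployment nginx-noprobe --replicas=20
$ # once both sets of Pods are Ready, delete both and compare
$ kubectl delete deployment nginx-deployment nginx-noprobe
$ kubectl get pods --watch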

@dgarbus

dgarbus commented Nov 17, 2020

We are experiencing the same thing on EKS 1.17 (eks.2) with AMI version 1.17.12-20201112.

@webframp

Reverting to the amazon-eks-node-1.17-v20201007 AMI seems to resolve it for us.

@paxos-cs

Not sure if it's related, but we are experiencing an issue on EKS 1.15 (eks.4) with AMI version 1.15.12-20201112 where the aws-node pods repeatedly produce Kubernetes events with the following message. We do not see this on the v20201007 AMI:

Message:             Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
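
A quick way to spot these events cluster-wide is to filter on the Unhealthy reason (kubelet records probe errors and failures under that reason); this is just a convenience for confirming the symptom:

$ kubectl get events -n kube-system --field-selector reason=Unhealthy --sort-by=.lastTimestamp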

@jwesolowski-rms

jwesolowski-rms commented Nov 17, 2020

@paxos-cs We are experiencing the same thing on 1.18-v20201112. I think it's all related. We noticed the issue when our automation ran kubectl exec commands inside containers and they would hang and never return. Terminating the pod also seems to hang.
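
A simple check for whether a node is affected is to run a short exec with a timeout against a Pod scheduled on it (the Pod name is a placeholder); on a healthy node this returns immediately, on an affected node it hits the timeout:

$ timeout 30 kubectl exec <pod-on-suspect-node> -- /bin/true; echo "exit code: $?"
$ # exit code 0   -> the exec completed normally
$ # exit code 124 -> the exec hung and was killed by timeout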

@SaranBalaji90

SaranBalaji90 commented Nov 17, 2020

This seems to be related to moby/moby#41352 (comment). Can someone run this on their node (if it's not a production cluster) and let me know if it fixes the issue? I tried it on a couple of my worker nodes, and both upgrading and downgrading containerd seem to fix the issue. I'm just trying to narrow down what might have caused this.

cat << EOF > upgrade-containerd.sh
#!/bin/bash
set -eo pipefail
docker ps
systemctl stop docker
systemctl stop containerd
wget https://github.com/containerd/containerd/releases/download/v1.4.1/containerd-1.4.1-linux-amd64.tar.gz
tar xvf containerd-1.4.1-linux-amd64.tar.gz
cp -f bin/c* /bin/
systemctl start docker
systemctl start containerd
systemctl restart kubelet
systemctl status docker
systemctl status containerd
systemctl status kubelet
docker version
docker ps
EOF
chmod +x upgrade-containerd.sh
sudo ./upgrade-containerd.sh

or

cat << EOF > downgrade-containerd.sh
#!/bin/bash
set -eo pipefail
docker ps
sudo yum downgrade containerd-1.3.2-1.amzn2.x86_64
systemctl restart docker
systemctl restart kubelet
docker ps
EOF
chmod +x downgrade-containerd.sh
sudo ./downgrade-containerd.sh
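
Either way, it is worth verifying what actually ended up running afterwards. Note that the upgrade script copies binaries into /bin without going through yum, so rpm still reports the old package version; checking the binary itself avoids that confusion:

$ rpm -q containerd                              # package version known to rpm/yum
$ containerd --version                           # version of the binary actually on disk
$ docker version --format '{{.Server.Version}}'  # Docker engine version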

@rtripat rtripat self-assigned this Nov 17, 2020
@rtripat
Contributor

rtripat commented Nov 17, 2020

We are working on releasing a new AMI with containerd 1.3.2. Until then, please roll back your worker nodes to the previous AMI, v20201007.

@dgarbus

dgarbus commented Nov 17, 2020

We are working on releasing a new AMI with containerd 1.3.2. Until then, please roll back your worker nodes to the previous AMI, v20201007.

It's not possible to roll back to a previous AMI (or create a new nodegroup with an AMI that is not the latest) when using managed node groups. Do you have an ETA for the new AMI?

@rtripat
Contributor

rtripat commented Nov 17, 2020

We are working on releasing a new AMI with containerd 1.3.2. Until then, please roll back your worker nodes to the previous AMI, v20201007.

It's not possible to roll back to a previous AMI (or create a new nodegroup with an AMI that is not the latest) when using managed node groups. Do you have an ETA for the new AMI?

We are rolling back Managed Nodegroups as well. The rollback should complete today. We will try to release the new AMI today as well, and I will keep this issue updated. Appreciate the patience.

@dgarbus

dgarbus commented Nov 17, 2020

We are working on releasing a new AMI with containerd 1.3.2. Until then, please roll back your worker nodes to the previous AMI, v20201007.

It's not possible to roll back to a previous AMI (or create a new nodegroup with an AMI that is not the latest) when using managed node groups. Do you have an ETA for the new AMI?

We are rolling back Managed Nodegroups as well. The rollback should complete today. We will try to release the new AMI today as well, and I will keep this issue updated. Appreciate the patience.

Thanks for the quick response. As a stopgap measure, is it possible to update the "latest marker" so that new managed nodegroups get created using the previous, working AMI?

@harshal-shah

Even on nodes with the old AMI, we are seeing this happen because our user data script runs yum update -y, which brings in containerd 1.4.0. We shall try 1.4.1 to see if that helps.
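
If you want to keep a general yum update in user data while avoiding the affected runtime packages, excluding them is one option (a sketch, not an official recommendation):

yum update -y --exclude='containerd*' --exclude='docker*'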

@rabidsloth

We've been battling with this ever since performing system updates last week. Our nodes essentially became time bombs: after about a day and a half we started getting this dockerd error in our logs: error: write unix /var/run/docker.sock->@: write: broken pipe, and after a few more hours the nodes became unresponsive and workloads went haywire. Once a node is in this state, it no longer lets us terminate workloads until it has been restarted. We reverted to AMI ID ami-00651928502cc143d last night and removed yum updates from our user data script in hopes of getting back to a stable system.

@samof76

samof76 commented Nov 18, 2020

We ran into the same issue.

EKS cluster version: 1.18.9

We do create a custom AMI with an upgraded kernel version from the EKS optimized AMI. But during boot-up the instances upgrade the Docker and containerd versions to 19.3.13 and 1.4.0 respectively, and these versions run into the following issue:

  1. Deployments work as expected
  2. Readiness and liveness probes start failing on some of the containers
  3. No pod restarts
  4. Kubelet logs SyncErrors
  5. Trying to delete/evict the pods, they get stuck in the Terminating state

Some diagnosis: on the working cluster, docker events for the probes show three events per probe: exec_create, exec_start, and exec_die. On the cluster with the above-mentioned Docker and containerd versions, a while after deploying the pods we see only exec_create and exec_start; there are no exec_die events.
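
For reference, this comparison can be reproduced on a node by watching the Docker event stream; the event names are the ones mentioned in this comment, and the --since window is arbitrary:

$ docker events --since 15m | grep -E 'exec_(create|start|die)'   # Ctrl-C to stop streaming

On a healthy node every probe exec produces all three events; on an affected node the exec_die events stop appearing.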

So we decided to pin the versions of Docker (to 19.3.6) and containerd (to 1.3.2) during AMI creation. We deployed this AMI and we are clear of all SyncErrors and pod-terminating issues.
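
A rough sketch of what that pinning can look like on an Amazon Linux 2 build host; the containerd version string is the one used in the downgrade script earlier in this thread, and the sketch assumes the desired Docker version is already installed:

sudo yum install -y yum-plugin-versionlock
sudo yum downgrade -y containerd-1.3.2-1.amzn2.x86_64   # version string taken from this thread
sudo yum versionlock containerd docker                  # lock whatever versions are now installed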

@dimara

dimara commented Nov 18, 2020

@dgarbus

It's not possible to roll back to a previous AMI (or create a new nodegroup with an AMI that is not the latest) when using managed node groups

Indeed. We have opened an issue for that (see #435), which resulted in an open request in containers-roadmap (see aws/containers-roadmap#810). Given the magnitude of the current problem, this missing feature is even more relevant now.

@stevehipwell
Contributor

It looks like a new AMI has been released; hopefully this will solve these issues.

@rtripat
Contributor

rtripat commented Nov 18, 2020

It looks like a new AMI has been released; hopefully this will solve these issues.

All managed nodegroups on release version 20201112 can now be upgraded to 20201117. If you create new nodegroups, they will automatically get 20201117 release version. Please let us know if you see any issues.
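
For reference, an existing nodegroup can also be moved to the latest release from the CLI (cluster and nodegroup names are placeholders):

$ aws eks update-nodegroup-version --cluster-name <cluster> --nodegroup-name <nodegroup>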

@jpke

jpke commented Nov 18, 2020

Anyone still seeing pods stuck terminating? We're running fresh clusters with the 1.15.11 AMI v20201007 and are still getting stuck pods.

Could the issue be with the latest patch version running on the control plane?

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T23:41:55Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.12-eks-31566f", GitCommit:"31566f851673e809d4d667b7235ed87587d37722", GitTreeState:"clean", BuildDate:"2020-10-20T23:25:14Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}

We have other clusters, still showing 1.15.11 for the server version, that do not have the issue.

@mmerkes
Member

mmerkes commented Nov 18, 2020

@jpke Can you check the containerd version running on your nodes? We're not aware of issues with the v20201007 AMI around pods getting stuck terminating, so this could be unrelated to the most recent issue.
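
A quick way to see which container runtime version each node reports, without logging in, is the nodeInfo field; note that this shows the Docker engine version (e.g. docker://19.3.13), so confirming the containerd package itself still requires rpm -q containerd on the node:

$ kubectl get nodes -o custom-columns='NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion'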

@mtparet

mtparet commented Nov 19, 2020

We upgraded our cluster two hours after AWS acknowledged the issue (#563 (comment)), but we still got upgraded to this already known broken version :/

@dlaidlaw

dlaidlaw commented Nov 19, 2020 via email

@iliastsi
Author

All managed nodegroups on release version 20201112 can now be upgraded to 20201117. If you create new nodegroups, they will automatically get 20201117 release version. Please let us know if you see any issues.

@rtripat I can also confirm that the issue has been resolved for us since upgrading to version 20201117. Thanks for fixing this. I guess this issue can be closed now.

However, given the magnitude of this, I think you should increase the priority of aws/containers-roadmap#810. It became apparent that users couldn't follow your proposed workaround of rolling back to version 20201007 (#563 (comment)) because there is no way to choose the version of the AMI to deploy in managed nodegroups.

@leokhoa

leokhoa commented Nov 20, 2020

Last Saturday we upgraded our clusters in 4 production regions (AP, AU, EU, US) from v1.14 to v1.18, and nightmares happened.
The issue left many pods in our ZooKeeper clusters stuck in the "Terminating" state and affected other clusters (Kafka clusters, SolrCloud clusters). Running "kubectl delete pod --force --grace-period=0 xxx" sometimes causes filesystem corruption. We tried our best to keep our systems up and running, but it was a bad experience upgrading EKS clusters. Positive things:

  1. The issue is fixed
  2. With version 1.18, we have one more happy year without needing to upgrade our EKS clusters :)
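
For anyone in the same situation, the force delete mentioned above can be applied to all Terminating pods in a namespace at once; use it with the same caution noted in this comment, since forcing deletion only removes the API object and does not guarantee the container is actually gone:

$ kubectl get pods -n <namespace> --no-headers \
    | awk '$3 == "Terminating" {print $1}' \
    | xargs -r kubectl delete pod -n <namespace> --force --grace-period=0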

@rtripat
Contributor

rtripat commented Nov 20, 2020

All managed nodegroups on release version 20201112 can now be upgraded to 20201117. If you create new nodegroups, they will automatically get 20201117 release version. Please let us know if you see any issues.

@rtripat I can also confirm that the issue has been resolved for us since upgrading to version 20201117. Thanks for fixing this. I guess this issue can be closed now.

However, given the magnitude of this, I think you should increase the priority of aws/containers-roadmap#810. It became apparent that users couldn't follow your proposed workaround of rolling back to version 20201007 (#563 (comment)) because there is no way to choose the version of the AMI to deploy in managed nodegroups.

We are taking multiple steps to prevent a recurrence of this issue. Specifically, we have added a regression test for this case, which creates a container with HEALTHCHECK, monitors its liveness for a period of time, and ensures cleanup on termination. We are also working on changes to allow creating EKS Managed Nodegroups at any AMI version and to mark them as Degraded if they are on recalled AMI release versions.

@iliastsi
Author

We are taking multiple steps to prevent a recurrence of this issue. Specifically, we have added a regression test for this case, which creates a container with HEALTHCHECK, monitors its liveness for a period of time, and ensures cleanup on termination. We are also working on changes to allow creating EKS Managed Nodegroups at any AMI version and to mark them as Degraded if they are on recalled AMI release versions.

That's great to hear, thank you @rtripat. I have also commented on the corresponding issue asking for an ETA.

Do you want me to close this issue as resolved, or are you going to do it?

@christiangda

The same problem for us:

AWS EKS 1.18.9, since about two weeks ago.

Imagine the nightmare with the Horizontal Pod Autoscaler enabled! So we disabled it and over-sized our nodes.

❯ kubectl version --short
...
Server Version: v1.18.9-eks-d1db3c
❯ kubectl get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-62-52-32.eu-west-1.compute.internal    Ready    <none>   5d22h   v1.18.9-eks-d1db3c
ip-10-62-53-209.eu-west-1.compute.internal   Ready    <none>   22h     v1.18.9-eks-d1db3c
ip-10-62-54-119.eu-west-1.compute.internal   Ready    <none>   22h     v1.18.9-eks-d1db3c
ip-10-62-59-119.eu-west-1.compute.internal   Ready    <none>   22h     v1.18.9-eks-d1db3c
ip-10-62-59-182.eu-west-1.compute.internal   Ready    <none>   47h     v1.18.9-eks-d1db3c
ip-10-62-65-227.eu-west-1.compute.internal   Ready    <none>   2d7h    v1.18.9-eks-d1db3c
ip-10-62-71-134.eu-west-1.compute.internal   Ready    <none>   21h     v1.18.9-eks-d1db3c
ip-10-62-72-42.eu-west-1.compute.internal    Ready    <none>   5d22h   v1.18.9-eks-d1db3c
ip-10-62-77-213.eu-west-1.compute.internal   Ready    <none>   2d6h    v1.18.9-eks-d1db3c

Thanks to @leokhoa for the command kubectl delete pod --force --grace-period=0 xxx, with which we mitigated the issue a little.

We are also trying the workaround described by @SaranBalaji90 in our ASG user data to mitigate the issue until you release the new AMI.
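
For completeness, a minimal sketch of what that user data workaround can look like, based on the downgrade script posted by @SaranBalaji90 above; the cluster name is a placeholder:

#!/bin/bash
set -eo pipefail
# pin containerd back to the known-good version before the node joins the cluster
yum downgrade -y containerd-1.3.2-1.amzn2.x86_64
systemctl restart docker
/etc/eks/bootstrap.sh <cluster-name>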

@mmerkes
Member

mmerkes commented Nov 23, 2020

@christiangda A new AMI was released last week, so if you can upgrade your AMI to 20201117, you should see the issue resolved.

@christiangda

Hi @mmerkes, thank you for the information; I noticed this yesterday.

Today we updated all our EKS clusters and reactivated the Pod Autoscaler.

@rtripat
Contributor

rtripat commented Nov 30, 2020

All managed nodegroups on release version 20201112 can now be upgraded to 20201117 or higher. If you create new nodegroups, they will automatically get 20201117 release version. Please let us know if you see any issues.

@rtripat rtripat closed this as completed Nov 30, 2020
@fazith27

@rtripat, can you confirm whether this is fixed for self-managed nodes that run "yum update" or "apt update" in the user data script? If not, this is still open, I guess.

@rtripat
Contributor

rtripat commented Nov 30, 2020

@rtripat, can you confirm whether this is fixed for self-managed nodes that run "yum update" or "apt update" in the user data script? If not, this is still open, I guess.

Are you using the EKS optimized AMI? You can upgrade to AMI version 20201117 or higher for self-managed nodes too.
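
For self-managed nodes, the current recommended EKS optimized AMI ID for a given Kubernetes version can be looked up from SSM (version and region are placeholders):

$ aws ssm get-parameter \
    --name /aws/service/eks/optimized-ami/1.18/amazon-linux-2/recommended/image_id \
    --region eu-central-1 --query Parameter.Value --output text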

@fazith27

@rtripat, yes, we were using version 20201117 for our self-managed nodes and had "yum update -y" in our user data, which upgraded containerd to 1.4.1, and we hit the same issue. Right now, to avoid the issue, we are not running updates on instance startup.

@rtripat
Contributor

rtripat commented Nov 30, 2020

20201117 was pinned to containerd 1.3.2, which doesn't have the bug. You don't need to run yum update -y on startup.

Can you post the output of rpm -q containerd from an instance where you did run yum update -y?

@fazith27

fazith27 commented Dec 1, 2020

With "yum update -y" the output is containerd-1.4.1-2.amzn2.x86_64.
Without that the output is containerd-1.3.2-1.amzn2.x86_64.
But the actual issue is not fixed, right?
The solution given here is more of a workaround that downgrades the containerd version.

@rtripat
Contributor

rtripat commented Dec 1, 2020

I see. You are asking for an EKS optimized AMI that ships containerd 1.4.x? We just released one today, which has containerd-1.4.1-2 and a patch for CVE-2020-15257:

https://github.com/awslabs/amazon-eks-ami/releases/tag/v20201126

@fazith27

fazith27 commented Dec 2, 2020

Hi @rtripat, we have used the new AMI (v20201126), which was released yesterday. It looks fine and we have noticed no issues. Thanks.

@rtripat
Contributor

rtripat commented Mar 30, 2021

FYI: You can create or update a Managed Nodegroup to any AMI release version.

@stevehipwell
Contributor

@rtripat, is your reply because something has changed for managed node groups since this issue was active and resolved?

@rtripat
Contributor

rtripat commented Mar 31, 2021

@rtripat, is your reply because something has changed for managed node groups since this issue was active and resolved?

Right. A corrective action item that came out of this AMI release was to allow customers to roll back to a previous AMI release version, so I wanted to share that the EKS Managed Nodegroup API now allows customers to create or upgrade a nodegroup to any AMI release version.

Same feature request as in aws/containers-roadmap#810
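
For reference, a sketch of what that looks like with the CLI; the names and the release version shown are placeholders drawn from this thread:

$ aws eks describe-nodegroup --cluster-name <cluster> --nodegroup-name <nodegroup> \
    --query nodegroup.releaseVersion --output text
$ aws eks update-nodegroup-version --cluster-name <cluster> --nodegroup-name <nodegroup> \
    --release-version 1.18.9-20201117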

@stevehipwell
Contributor

Thanks @rtripat, that's really good to know.

@iliastsi
Author

iliastsi commented Apr 2, 2021

Thanks @rtripat, this is great news!
