Failed to pull and unpack sandbox image (i/o timeout) #1633

Closed
soutar opened this issue Feb 7, 2024 · 12 comments

Comments


soutar commented Feb 7, 2024

What happened:
sandbox-image.service failed with an i/o timeout error when pulling the 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5 image, and the instance therefore failed to join the EKS cluster. We first observed this on Feb 3, 2024 at 02:50:13.167 UTC and have seen multiple instances fail this way each day since then. We have observed the same problem on Kubernetes 1.27, 1.28, and 1.29.

-- Logs begin at Wed 2024-02-07 16:50:22 UTC, end at Wed 2024-02-07 16:59:59 UTC. --
Feb 07 16:50:31 ip-10-34-46-213.ec2.internal systemd[1]: Starting pull sandbox image defined in containerd config.toml...
Feb 07 16:50:31 ip-10-34-46-213.ec2.internal sudo[4047]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/ctr#040--namespace#040k8s.io#040image#040ls
Feb 07 16:50:33 ip-10-34-46-213.ec2.internal sudo[4097]:     root : TTY=unknown ; PWD=/ ; USER=root ;
Feb 07 16:50:33 ip-10-34-46-213.ec2.internal sudo[4097]:     root : (command continued) COMMAND=/bin/crictl#040pull#040--creds#040AWS:<redacted>
Feb 07 16:50:34 ip-10-34-46-213.ec2.internal pull-sandbox-image.sh[4042]: time="2024-02-07T16:50:34Z" level=warning msg="image connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead."
Feb 07 16:51:04 ip-10-34-46-213.ec2.internal pull-sandbox-image.sh[4042]: E0207 16:51:04.434803    4098 remote_image.go:171] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5\": failed to copy: httpReadSeeker: failed open: failed to do request: Get \"https://602401143452.dkr.ecr.us-east-1.amazonaws.com/v2/eks/pause/manifests/sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2\": dial tcp 34.198.77.233:443: i/o timeout" image="602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"
Feb 07 16:51:04 ip-10-34-46-213.ec2.internal pull-sandbox-image.sh[4042]: time="2024-02-07T16:51:04Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = failed to pull and unpack image \"602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5\": failed to copy: httpReadSeeker: failed open: failed to do request: Get \"https://602401143452.dkr.ecr.us-east-1.amazonaws.com/v2/eks/pause/manifests/sha256:529cf6b1b6e5b76e901abc43aee825badbd93f9c5ee5f1e316d46a83abbce5a2\": dial tcp 34.198.77.233:443: i/o timeout"
Feb 07 16:51:04 ip-10-34-46-213.ec2.internal systemd[1]: sandbox-image.service: main process exited, code=exited, status=1/FAILURE
Feb 07 16:51:04 ip-10-34-46-213.ec2.internal systemd[1]: Failed to start pull sandbox image defined in containerd config.toml.
Feb 07 16:51:04 ip-10-34-46-213.ec2.internal systemd[1]: Unit sandbox-image.service entered failed state.
Feb 07 16:51:04 ip-10-34-46-213.ec2.internal systemd[1]: sandbox-image.service failed.

What you expected to happen:
sandbox-image.service should pull 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5 successfully and allow the instance to join the EKS cluster.

How to reproduce it (as minimally and precisely as possible):
This is not easily reproducible and has only affected 7 of the 271 instances we launched via Karpenter in the last 24 hours. The other 264 instances successfully joined our cluster. Because Karpenter automatically terminates the instances after 15 minutes, we can only collect debug information if we catch a failing instance before it is reclaimed.
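For anyone trying to catch one of these nodes, something like the following (run over SSH/SSM before the instance is reclaimed) is enough to grab the relevant state. This is only a sketch: the unit and image names come from the logs above, and the manual pull may need the same --creds the script passes.

# inspect the sandbox image pull unit on a suspect node
systemctl status sandbox-image.service
journalctl -u sandbox-image.service --no-pager

# optionally retry the pull by hand to check whether the regional ECR endpoint is reachable
sudo crictl pull 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5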

Anything else we need to know?:

Environment:

  • AWS Region: us-east-1 (observed in AZs: a, b, c, d, f)
  • Instance Type(s): c4.2xlarge, c5.xlarge, c5d.2xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.1
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.29
  • AMI Version: amazon-eks-node-1.29-v20240202
  • Kernel (e.g. uname -a): Linux ip-10-34-46-213.ec2.internal 5.10.205-195.807.amzn2.x86_64 #1 SMP Tue Jan 16 18:28:59 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):

lsowen commented Feb 7, 2024

I'm also seeing the same thing in us-east-1, on a 1.25 EKS cluster.


cartermckinnon commented Feb 7, 2024

Can you open a case with AWS Support so the ECR team can look into the timeouts? It seems like the retry logic in this script isn't working properly; fixing that should mitigate this in most cases. I'll get a PR out 👍
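For illustration only (this is not the actual pull-sandbox-image.sh), a bounded retry around the pull would look roughly like this; ecr_password is a placeholder for however the script obtains the registry credentials:

sandbox_image="602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"
for attempt in 1 2 3 4 5; do
  # ecr_password is a placeholder; the real script fetches an ECR auth token itself
  if /bin/crictl pull --creds "AWS:${ecr_password}" "${sandbox_image}"; then
    exit 0
  fi
  echo "pull attempt ${attempt} failed, retrying in 10 seconds..." >&2
  sleep 10
done
echo "failed to pull ${sandbox_image} after 5 attempts" >&2
exit 1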


soutar commented Feb 8, 2024

@cartermckinnon Will do. Any improvement to the retry logic here as a mitigation would be great, thank you!


covidium commented Feb 9, 2024

Cluster details:
EKS: 1.29
Region: us-east-1
AMI version: v20240117

I had a similar issue where pods were stuck in ContainerCreating with the following warning in the Pod description:
Warning FailedCreatePodSandBox 43s (x141 over 30m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": unexpected status from HEAD request to https://602401143452.dkr.ecr.us-east-1.amazonaws.com/v2/eks/pause/manifests/3.5: 401 Unauthorized

I checked the Amazon EKS AMI release notes and noticed that v20240202 included some changes around the sandbox image, so I upgraded to the latest version, which solved the issue for me.
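For reference, two quick checks on the node can confirm which sandbox image containerd is configured to use and whether the node can still pull it with fresh ECR credentials. This is only a sketch; the region and image tag are the ones from this thread.

grep sandbox_image /etc/containerd/config.toml

# pull manually with a fresh ECR token from the node's instance role
sudo crictl pull --creds "AWS:$(aws ecr get-login-password --region us-east-1)" \
  602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5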


soutar commented Feb 9, 2024

@covidium I believe the issue you were experiencing is described in #1597 and fixed in #1605. Unfortunately, even in v20240202, a failed pull on the sandbox image during node startup can result in the node not joining the cluster because the retry mechanism isn't working as expected.


hknerts commented Feb 9, 2024

@soutar @cartermckinnon We encountered the same type of errors on v20240202. I also saw "You must specify a region" errors from the sandbox-image service.
The commit below contains a fix, but I don't think it is included in v20240202:

107df3f#diff-57a6aadbbb1d3df65f4675ae80c562f7e406bcb11e41f6afb974043a2ede0aa0R32
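For anyone curious, the shape of that fix is roughly the following (an illustrative sketch, not the exact contents of the linked commit): derive the region from IMDS so the ECR credential call doesn't fail with "You must specify a region".

# IMDSv2: get a token, then read the instance's region
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
AWS_REGION=$(curl -s -H "X-aws-ec2-metadata-token: ${TOKEN}" \
  "http://169.254.169.254/latest/meta-data/placement/region")

# pass the region explicitly when fetching ECR credentials
aws ecr get-login-password --region "${AWS_REGION}"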


soutar commented Feb 13, 2024

@cartermckinnon FYI I built an AMI from 976fe67 and it seems to have fixed the retry behaviour. Once that commit makes it into a release I'd be happy to close this issue — what do you think?

@vitaly-dt

I can still see this happening with the latest release, amazon/amazon-eks-node-1.26-v20240209.
According to the release notes (v20240202...v20240209), this commit is included in that release, so it doesn't fix the problem just yet.


soutar commented Feb 13, 2024

@vitaly-dt 976fe67 is not in v20240209 as far as I can see, so that makes sense. I was able to deploy it in our infra by checking out the commit directly and using the build scripts in the repo to publish a private AMI.
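Rough outline of the build, in case it helps anyone else. The exact make invocation is an approximation from memory, so check the README/Makefile at that commit.

git clone https://github.com/awslabs/amazon-eks-ami.git
cd amazon-eks-ami
git checkout 976fe67

# build a private AMI with the repo's Packer-based tooling;
# the target/variables below are an assumption -- see the README at this commit
make k8s=1.29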

@vitaly-dt

You are correct, my bad.
Waiting for this fix to be released ASAP.


ryehowell commented Feb 14, 2024

Just wanted to confirm that the latest release, v20240209, fixed the issue for me.

@cartermckinnon

#1649 went out in yesterday's release 👍
