Some instance types using incorrect NVIDIA kernel module on amazon-eks-gpu-node-1.29-v20240227 #1697
Comments
This instance type is being incorrectly detected as supporting the open-source NVIDIA kernel module, and the wrong kmod is loaded as a result. I have a fix out for review and it will land in the next AMI release. After you've force-loaded the proprietary kmod, do you see any issues with your workloads? Feel free to open an AWS Support case if you can't share the details here, I'll track it down. 👍
Thanks, @cartermckinnon. After I force-load the NVIDIA kernel module, everything appears to behave normally. I'm going to roll back to the previous AMI though, so I won't have exhaustive insight into the stability of the modified image.
Same issue here.
This issue should be fixed in https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240307. What release are you using?
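For anyone unsure which release a node is actually running, a quick hedged sketch of two ways to check; the SSM parameter path below assumes the Amazon Linux 2 GPU variant for Kubernetes 1.29:

```bash
# On the node itself: the EKS AMI records its release metadata here.
cat /etc/eks/release

# From a workstation: look up the currently recommended GPU AMI for 1.29
# (parameter path assumes the Amazon Linux 2 GPU variant).
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.29/amazon-linux-2-gpu/recommended/image_id \
  --query 'Parameter.Value' --output text
```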
@cartermckinnon The error still occurs.
I've run into the same issue as @korjek. I'm on EKS 1.29. It appears the containerd config.toml is not being updated to use the nvidia runtime. I found that /etc/eks/configure-nvidia.sh fails with:

```
+ gpu-ami-util has-nvidia-devices
true
+ /etc/eks/nvidia-kmod-load.sh
true
0x2237 NVIDIA A10G
Disabling GSP for instance type: g5.xlarge
2024-03-15T21:59:42+0000 [kmod-util] unpacking: nvidia-open
Error! nvidia-open-535.161.07 is already added!
Aborting.
```

As a workaround, I patched the userData in my EC2NodeClass:

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: nvidia-a10g
spec:
  # ... bunch of other stuff
  userData: |
    cat <<EOF > /etc/eks/configure-nvidia.sh
    #!/usr/bin/env bash
    set -o errexit
    set -o nounset
    set -o xtrace
    if ! gpu-ami-util has-nvidia-devices; then
      echo >&2 "no NVIDIA devices are present, nothing to do!"
      exit 0
    fi
    # patched with "|| true" to avoid failing on startup
    /etc/eks/nvidia-kmod-load.sh || true
    # add 'nvidia' runtime to containerd config, and set it as the default
    # otherwise, all Pods need to specify the runtimeClassName
    nvidia-ctk runtime configure --runtime=containerd --set-as-default
    EOF
```
Can you grab the logs from the initial execution of the script?
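For anyone collecting those logs, a rough sketch of where to look; the unit name that wraps /etc/eks/configure-nvidia.sh is an assumption, so find it first:

```bash
# Find the unit that runs the GPU setup script (name varies by AMI release).
systemctl list-units --all | grep -i nvidia

# Pull its logs for the current boot (replace the placeholder unit name).
sudo journalctl -b -u <nvidia-setup-unit> --no-pager

# Or search the whole boot log for the script's output.
sudo journalctl -b --no-pager | grep -iE 'configure-nvidia|kmod-util|nvidia-kmod-load'
```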
I did more testing and found that my workaround accidentally and incorrectly fixes the issue. What I think is really happening is a race: the configure-nvidia service sets nvidia as the runtime in the containerd config, but that change gets overwritten when the containerd config is regenerated during bootstrap. So how does my workaround "fix" this? From what I can tell, it only works by a timing coincidence. I've stitched together logs from my observations; the "current.txt" log doesn't include my extra userdata.
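A small, hedged sketch of how one could check whether the nvidia-ctk change survived the boot sequence; the paths are the ones discussed in this thread and may differ on other AMI releases:

```bash
# Did the rendered containerd config keep nvidia as the default runtime?
grep -n 'default_runtime_name' /etc/containerd/config.toml

# Does the EKS template config (used at bootstrap) mention the nvidia runtime at all?
grep -n 'nvidia' /etc/eks/containerd/containerd-config.toml || echo "no nvidia runtime in the template config"

# When did containerd last (re)start, relative to the configure-nvidia run?
systemctl show containerd -p ActiveEnterTimestamp
```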
This is probably a better workaround for now. Basically I'm taking the would-be nvidia-ctk-generated containerd config (from the configure-nvidia service) and writing it to /etc/eks/containerd/containerd-config.toml. Note I'm setting nvidia as the default runtime, and the sandbox_image here is region-specific (us-east-1).

```yaml
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: nvidia-a10g
spec:
  # ...
  userData: |
    cat <<EOF > /etc/eks/containerd/containerd-config.toml
    imports = ["/etc/containerd/config.d/*.toml"]
    root = "/var/lib/containerd"
    state = "/run/containerd"
    version = 2
    [grpc]
    address = "/run/containerd/containerd.sock"
    [plugins]
    [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"
    [plugins."io.containerd.grpc.v1.cri".cni]
    bin_dir = "/opt/cni/bin"
    conf_dir = "/etc/cni/net.d"
    [plugins."io.containerd.grpc.v1.cri".containerd]
    default_runtime_name = "nvidia"
    discard_unpacked_layers = false
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
    SystemdCgroup = true
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true
    [plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"
    EOF
```

Ideally I'd be able to build a GPU AMI with a modified bootstrap.sh script, but I can't figure out where the GPU AMIs are coming from. Doesn't seem like they're open source?
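If you apply a config like the one above, a quick hedged sanity check on the node (command availability is an assumption):

```bash
# Dump containerd's merged config and confirm the default runtime is nvidia.
sudo containerd config dump | grep -A1 'default_runtime_name'

# Confirm the nvidia runtime section made it into the rendered config.
grep -n 'nvidia-container-runtime' /etc/containerd/config.toml

# Driver-level sanity check.
nvidia-smi
```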
We also seem to be experiencing the same issue.
They are not open source, though they are built off this repo and modified by AWS internally, as far as they have communicated.
The GPU AMI template is not open source at the moment, but you can always use an existing GPU AMI as a base image in a Packer template if you want to apply a patched bootstrap.sh.
Yep, this should work for now. I intend to have a proper fix out in the next AMI release.
I have hit the same issue with the latest EKS AMI, affecting the NVIDIA device plugin.
I have hit the same issue. I resolved the problem manually for now. Hopefully, a new patch of the EKS AMI will resolve the issue with the NVIDIA device plugin.
We're also seeing the same issue.
Both of the issues mentioned here (the incorrect NVIDIA kmod being loaded, and the race between the configure-nvidia service and the containerd configuration) should be addressed in the latest AMI release.
What happened:
I run a p3.2xlarge node group in my 1.29 EKS cluster. I updated the node group's AMI to AMI ID ami-07c8bc6b0bb890e9e (amazon-eks-gpu-node-1.29-v20240227). After the update I was unable to deploy my CUDA containers to the node. I ssh'd into the node and found that nvidia-smi couldn't communicate with the GPU.
What you expected to happen:
Should be able to communicate with the Tesla GPU without manual intervention.
How to reproduce it (as minimally and precisely as possible):
Deploy a p3.2xlarge node on a 1.29 cluster using the latest AMI image.
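When reproducing, a hedged way to confirm which NVIDIA kernel module flavor the node actually loaded (the license strings are the usual tell, but treat them as an assumption):

```bash
# Proprietary builds typically report license "NVIDIA"; the open kernel
# modules report "Dual MIT/GPL".
modinfo nvidia | grep -E '^(license|version)'

# Which nvidia modules are loaded, and which dkms package provided them.
lsmod | grep -E '^nvidia'
dkms status 2>/dev/null | grep -i nvidia
```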
Anything else we need to know?:
Environment:
- EKS Platform version (aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.1
- Kubernetes version (aws eks describe-cluster --name <name> --query cluster.version): 1.29
- Kernel (uname -a): Linux ip-10-20-40-96.us-east-2.compute.internal 5.10.209-198.858.amzn2.x86_64 #1 SMP Tue Feb 13 18:46:41 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- Release information (cat /etc/eks/release on a node):
Everything should work out of the box, but I can manually fix this by removing the default nvidia-dkms files and reinstalling the dkms module for the stated version of the nvidia driver this latest AMI version purportedly supports:
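The exact commands used for that manual fix aren't shown above; the following is only a rough sketch of what removing and reinstalling the dkms module might look like, with the driver version taken from the log earlier in this thread and the package names assumed:

```bash
# See which NVIDIA dkms module flavor is registered (names/versions are assumptions).
sudo dkms status | grep -i nvidia

# Remove the open-source flavor and install the proprietary one for the same version.
sudo dkms remove nvidia-open/535.161.07 --all || true
sudo dkms install nvidia/535.161.07

# Load the module and confirm the GPU is visible.
sudo modprobe nvidia
nvidia-smi
```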
Then if I run nvidia-smi, I get: