
nvidia-container-cli reports incorrect CUDA driver version on WSL2 #148

Open · danfairs opened this issue Nov 8, 2020 · 15 comments

@danfairs commented Nov 8, 2020

1. Issue or feature description

nvidia-container-cli on WSL2 is reporting CUDA 11.0 (and thus refusing to run containers that require cuda>=11.1) even though CUDA toolkit 11.1 is installed in Linux. Windows 10 is build 20251.fe_release.201030-1438. Everything is installed as per the install guide, and CUDA containers do actually work (for example, docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark successfully returns a benchmark).

Machine is a Dell XPS 15 9500 with an i9-10885H CPU, 64 GB RAM and an NVIDIA GeForce GTX 1650 Ti.

2. Steps to reproduce the issue

  1. Install Windows 10 via the Insider Program, at build 20251.fe_release.201030-1438 or later
  2. Install the Windows CUDA drivers from here (this is 460.20 for me)
  3. Install Ubuntu 20.04, the CUDA toolkit 11.1 and the container runtime as per the nvidia docs
  4. Run nvidia-smi on the host - it should give a CUDA version of 11.2.
  5. Check docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark correctly outputs benchmarks
  6. In Linux, run nvidia-container-cli info. It incorrectly outputs CUDA version 11.0.

This command will also fail:

$ docker run --gpus all --rm -it nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04 /bin/bash
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.1, please update your driver to a newer version, or use an earlier cuda container\\\\n\\\"\"": unknown.
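
To make the mismatch explicit, here are the two version checks side by side (the grep filters are only for readability; the 11.2 and 11.0 values are from the runs described above):

$ nvidia-smi | grep "CUDA Version"                  # driver-reported version: 11.2
$ nvidia-container-cli info | grep "CUDA version"   # container CLI reports: 11.0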

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info (attached: ncc.txt)

  • Kernel version from uname -a: Linux aphid 5.4.72-microsoft-standard-WSL2 #1 SMP Wed Oct 28 23:40:43 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

  • Any relevant kernel output lines from dmesg

  • Driver information from nvidia-smi -a (attached: nvidia-smi.txt)

  • Docker version from docker version: 19.03.13

  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*' (attached: packages.txt)

  • NVIDIA container library version from nvidia-container-cli -V (attached: ncc-version.txt)

  • NVIDIA container library logs (see troubleshooting)

  • Docker command, image and tag used

$ docker run --gpus all --rm -it nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04 /bin/bash 2>&1 (output attached: docker-run.txt)
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.1, please update your driver to a newer version, or use an earlier cuda container\\\\n\\\"\"": unknown.
@opptimus commented Nov 12, 2020

Same here:

Status: Downloaded newer image for nvidia/cuda:10.2-base
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\n\""": unknown.

@klueska (Contributor) commented Nov 12, 2020

@opptimus seems to have a different issue, but the original issue may be related to:
NVIDIA/libnvidia-container#117 (comment)

@danfairs (Author)

@klueska To be fair, @opptimus' issue is the one I actually bumped into to start with. It was only after further digging that I realised nvidia-container-cli was also reporting the wrong version. I may be putting the cart before the horse; I'm pretty new to this :)

@opptimus

@danfairs I solved my problem by upgrading my Win10 to version 20257.1, following the official WSL2 guidelines.
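
If it helps anyone else, a quick way to confirm what you are actually running (assuming a standard WSL2 setup) is:

# On the Windows side: run `winver` (or `cmd /c ver`) to check the build number.
# Inside the WSL2 distro:
uname -r       # should show a *-microsoft-standard-WSL2 kernel
nvidia-smi     # driver and CUDA version as exposed to WSL2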

@elezar (Member) commented Feb 12, 2021

Hey @danfairs. Thanks for reporting the issue. We have a fix in progress to address the fact that we report CUDA version 11.0 on WSL.

In the meantime you could use the NVIDIA_DISABLE_REQUIRE environment variable to skip the CUDA version check.

docker run --rm --gpus=all --env NVIDIA_DISABLE_REQUIRE=1 -it nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04 nvidia-smi

For reference: here is the merge request extending WSL support.

@archee8 commented May 4, 2021

> Hey @danfairs. Thanks for reporting the issue. We have a fix in progress to address the fact that we report CUDA version 11.0 on WSL.
>
> In the meantime you could use the NVIDIA_DISABLE_REQUIRE environment variable to skip the CUDA version check.
>
> docker run --rm --gpus=all --env NVIDIA_DISABLE_REQUIRE=1 -it nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04 nvidia-smi
>
> For reference: here is the merge request extending WSL support.

Hi. I have a problem with nvidia-container-cli. I run this:

archee8@DESKTOP-HR2MA0D:~$ docker run --rm --gpus=all --env NVIDIA_DISABLE_REQUIRE=1 -it nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04 nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown.

@elezar (Member) commented May 5, 2021

@archee8 which version of the NVIDIA container toolkit is this?

Version 1.4.0 of libnvidia-container should address this issue.
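
To check what is installed and pull in the latest packages, something along these lines should work (assuming the NVIDIA apt repository from the install guide is already configured):

nvidia-container-cli -V                       # library/CLI version
apt-cache policy libnvidia-container-tools    # installed vs. candidate package version
sudo apt-get update
sudo apt-get install --only-upgrade libnvidia-container1 libnvidia-container-tools nvidia-container-toolkit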

@archee8 commented May 5, 2021

> @archee8 which version of the NVIDIA container toolkit is this?
>
> Version 1.4.0 of libnvidia-container should address this issue.

archee8@DESKTOP-HR2MA0D:~$ sudo apt-cache policy libnvidia-container-tools
libnvidia-container-tools:
  Installed: 1.4.0-1

@klueska (Contributor) commented May 5, 2021

@archee8 Your issue appears to be related to this:
NVIDIA/nvidia-docker#1496 (comment)

@Keiku commented Mar 24, 2022

The following command works, but it doesn't work with docker-compose. Does anyone know the cause?

docker run --rm --gpus=all --env NVIDIA_DISABLE_REQUIRE=1 -it nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04 nvidia-smi

I have the following environment. The reason for Ubuntu 16.04 is that it cannot be upgraded due to company security issues.

⋊> ~ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.7 LTS
Release:        16.04
Codename:       xenial
⋊> ~ docker --version
Docker version 20.10.7, build f0df350
⋊> ~ docker-compose --version
docker-compose version 1.29.2, build unknown
⋊> ~ nvidia-container-cli info
NVRM version:   440.118.02
CUDA version:   10.2

Device Index:   0
Device Minor:   0
Model:          TITAN X (Pascal)
Brand:          GeForce
GPU UUID:       GPU-fcae2b3c-b6c0-c0c6-1eef-4f25809d16f9
Bus Location:   00000000:01:00.0
Architecture:   6.1
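
For reference, the docker-compose setup I would expect to be equivalent to the docker run command above looks roughly like this (service name and file layout are just an example; the deploy/devices GPU syntax needs docker-compose 1.28 or newer):

cat > docker-compose.yml <<'EOF'
version: "3.8"
services:
  cuda-test:
    image: nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04
    command: nvidia-smi
    environment:
      - NVIDIA_DISABLE_REQUIRE=1
    deploy:
      resources:
        reservations:
          devices:
            # request all NVIDIA GPUs for this service
            - driver: nvidia
              count: all
              capabilities: [gpu]
EOF
docker-compose up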

@andresgalaviz

This issue is still present when following the current instructions on the official nvidia documentation for this: https://docs.nvidia.com/cuda/wsl-user-guide/index.html#ch05-running-containers

@psychofisch

While trying to run https://github.com/borisdayma/dalle-mini in WSL2 I encountered the same error message as @danfairs:

root@DESKTOP-DEADBEEF:/mnt/g/github/dalle-mini# docker run --rm --name dallemini --gpus all -it -p 8888:8888 -v "${PWD}":/workspace dalle-mini:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.6, please update your driver to a newer version, or use an earlier cuda container: unknown.

When I check my currently installed version with nvidia-smi I see that I have CUDA 11.7 installed (the error message above requires 11.6):

root@DESKTOP-DEADBEEF:/mnt/g/github/dalle-mini# nvidia-smi
Mon Jun 13 23:34:16 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 516.01       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:26:00.0  On |                  N/A |
|  0%   38C    P8     8W / 175W |   1082MiB /  8192MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I'm kinda stuck right now. Any advice?

@elezar (Member) commented Jun 14, 2022

@psychofisch as a workaround please start the container with NVIDIA_DISABLE_REQUIRE=true:

docker run --rm --name dallemini --gpus all -it -p 8888:8888 -v "${PWD}":/workspace -e NVIDIA_DISABLE_REQUIRE=true dalle-mini:latest

@TheFrator

> @psychofisch as a workaround please start the container with NVIDIA_DISABLE_REQUIRE=true:
>
> docker run --rm --name dallemini --gpus all -it -p 8888:8888 -v "${PWD}":/workspace -e NVIDIA_DISABLE_REQUIRE=true dalle-mini:latest

I ran into this issue and this workaround worked. Thank you @elezar

@mirekphd commented Jan 7, 2023

Sorry, but I'm not at all convinced NVIDIA_DISABLE_REQUIRE should be used. The container will start, true, but ML algos will fail to train the model later on (if they are properly directed to use the GPU, without automatic failover to the CPU). In my experience the CUDA versions on the host and in the container must be in sync, just like glibc versions. IOW, CUDA Minor Version Compatibility (as described in the docs here) is a bit of wishful thinking...

The most precise error message resulting from the use of NVIDIA_DISABLE_REQUIRE is given by CatBoost:

CatBoostError: catboost/cuda/cuda_lib/cuda_base.h:281: CUDA error 803: system has unsupported display driver / cuda driver combination
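
In other words, the container starting is not proof that CUDA actually works; only a real GPU workload is. A quick sanity check along these lines (reusing the images already mentioned in this thread) makes the difference visible:

# The container starts fine with the requirement check disabled...
docker run --rm --gpus all -e NVIDIA_DISABLE_REQUIRE=1 nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04 nvidia-smi
# ...but only running a real CUDA kernel shows whether the driver/runtime combination is usable:
docker run --rm --gpus all -e NVIDIA_DISABLE_REQUIRE=1 nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark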
