
[ci] Update CUDA versions for CI #6539

Merged (5 commits, Oct 6, 2024)

Conversation

@StrikerRUS (Collaborator)

Fixed #6520.

@StrikerRUS (Collaborator, Author)

@shiyu1994 Hi! May I kindly ask you to update the NVIDIA driver on the host machine where the CUDA CI jobs are executed? This would allow us to run tests against the most recent CUDA version, 12.5. The currently installed driver is 525.147.05, which is insufficient to run CUDA 12.5 containers:

nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.5, please update your driver to a newer version, or use an earlier cuda container: unknown

Refer to #6520 for the context of this PR.
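The gate that produces this error is essentially a driver-version comparison. A minimal sketch, assuming the mapping discussed in this thread (R535 as the minimum branch for CUDA 12.5; the fallback value is a placeholder, not an official NVIDIA table):

```shell
#!/bin/sh
# Hypothetical check: is the installed NVIDIA driver branch new enough for a
# given CUDA container? Only the 12.5 -> R535 entry is confirmed in this
# thread; the fallback is a placeholder for illustration.
min_driver_for_cuda() {
  case "$1" in
    12.5) echo 535 ;;   # confirmed later in this thread
    *)    echo 525 ;;   # placeholder, not an official mapping
  esac
}

driver_ok() {
  installed="$1"   # e.g. "525.147.05", as reported by nvidia-smi
  cuda="$2"        # e.g. "12.5"
  # Compare the major driver branch against the required minimum.
  [ "${installed%%.*}" -ge "$(min_driver_for_cuda "$cuda")" ]
}

driver_ok 525.147.05 12.5 || echo "driver too old for CUDA 12.5 containers"
driver_ok 535.183.01 12.5 && echo "driver OK for CUDA 12.5 containers"
```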

Some related external links:

@jameslamb (Collaborator)

Based on https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-drivers, I think we want R535 (the latest long-term support release).

@StrikerRUS (Collaborator, Author)

I think we want R535 (the latest long-term support release).

Agree.

Based on my personal experience, the R530 driver doesn't support CUDA 12.5.

@StrikerRUS (Collaborator, Author)

Gently ping @shiyu1994 for fresh NVIDIA driver installation.

@StrikerRUS (Collaborator, Author)

Can confirm that R535 is enough to run containers with CUDA 12.5.
Host:

Tue Aug  6 22:16:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:8D:00.0 Off |                  Off |
| 30%   27C    P8              27W / 300W |  24893MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.5.1-cudnn-devel-ubuntu20.04 nvidia-smi
==========
== CUDA ==
==========

CUDA Version 12.5.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Tue Aug  6 22:14:16 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:8D:00.0 Off |                  Off |
| 30%   28C    P8              28W / 300W |  24893MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

@jameslamb (Collaborator)

I'll try to contact @shiyu1994 in the maintainer Slack.

@StrikerRUS (Collaborator, Author)

@jameslamb Did you succeed? 👼

@jameslamb (Collaborator)

@jameslamb Did you succeed? 👼

No, I haven't been able to reach @shiyu1994 in the last 2 months.

@shiyu1994 since I do see you're active here (#6623), could you please help us with this? I sent another message in the maintainer private chat as well on a separate topic.

@StrikerRUS (Collaborator, Author)

Just learned that the CUDA Forward Compatibility feature is available only for data-center cards (e.g., Tesla A100), not for consumer ones (e.g., RTX 4090):

Forward Compatibility is applicable only for systems with NVIDIA Data Center GPUs or select NGC Server Ready SKUs of RTX cards.

For example, on a consumer RTX 4090 with the R535 driver, you'll get cuda runtime error (804) : forward compatibility was attempted on non supported HW while trying to run a Docker image with CUDA 12.4.

@shiyu1994 (Collaborator)

@jameslamb Did you succeed? 👼

No, I haven't been able to reach @shiyu1994 in the last 2 months.

@shiyu1994 since I do see you're active here (#6623), could you please help us with this? I sent another message in the maintainer private chat as well on a separate topic.

Sorry, I cannot log in to my Slack account since it is registered with an @qq.com email. I will update the CUDA version of the CI agent.

@jameslamb (Collaborator)

Thank you!!

@StrikerRUS (Collaborator, Author)

@shiyu1994

I will update the CUDA version of the CI agent.

Thanks a lot!
Please ping me when the agent is ready.

@shiyu1994 (Collaborator)

@StrikerRUS Done with upgrading the NVIDIA driver to 535. Please ping me if there's anything else I need to do.
(screenshot: 2024-10-01 15:21:45)

@StrikerRUS (Collaborator, Author)

@shiyu1994 Thank you very much!
Right now all CUDA jobs are failing with

Checking docker version
  /usr/bin/docker version --format '{{.Server.APIVersion}}'
  permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/version": dial unix /var/run/docker.sock: connect: permission denied
  Error: Exit code 1 returned from process: file name '/usr/bin/docker', arguments 'version --format '{{.Server.APIVersion}}''.

error at the "Initialize containers" phase.

I'll try to trigger the "set up docker" job manually.
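A docker-socket permission failure like the one above is usually a group-membership or daemon-setup problem. A hedged sketch of a diagnostic (the probe command is injectable purely so the logic can be demonstrated without a Docker daemon; the `usermod` hint is the generic remediation, not something specific to this CI setup):

```shell
#!/bin/sh
# Sketch: diagnose "permission denied ... /var/run/docker.sock".
# In real use you'd call check_docker_access with no argument so it probes
# `docker version`; passing `true`/`false` here only simulates the outcomes.
check_docker_access() {
  probe="${1:-docker version}"
  if $probe >/dev/null 2>&1; then
    echo "docker: ok"
  else
    echo "docker: cannot reach the daemon socket"
    echo "hint: add the runner user to the docker group (sudo usermod -aG docker \$USER) and re-login,"
    echo "hint: or re-run the agent's Docker setup job, as was done in this thread"
  fi
}

check_docker_access true    # simulates a reachable daemon
check_docker_access false   # simulates the permission failure
```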

@StrikerRUS (Collaborator, Author)

I'll try to trigger set up docker job manually.

It helped! 🎉

@StrikerRUS StrikerRUS changed the title Update CUDA versions for CI [ci] Update CUDA versions for CI Oct 1, 2024
@StrikerRUS StrikerRUS marked this pull request as ready for review October 1, 2024 17:19
@StrikerRUS (Collaborator, Author)

I think this PR is ready for review.

@jameslamb (Collaborator) left a comment

Excellent! I checked the logs and everything looks good to me. I support merging this.

I'm glad the option to restart docker manually was helpful!


One thing that surprised me in the logs... the wheels are only 11.1 MB compressed?

checking './dist/lightgbm-4.5.0.99-py3-none-linux_x86_64.whl'
----- package inspection summary -----
file size
  * compressed size: 11.1M
  * uncompressed size: 23.0M
  * compression space saving: 51.8%

(build link)

If that's true, then maybe we should consider distributing wheels on PyPI with CUDA support compiled in. We could do something like XGBoost does, supporting just one major version of CUDA at a time on PyPI (ref: dmlc/xgboost#10807).

Anyway, just thinking out loud... it absolutely should not block this PR, just maybe something to think about for the future.

@StrikerRUS (Collaborator, Author)

If that's true, then maybe we should consider distributing wheels on PyPI with CUDA support compiled in.

I like your idea of publishing a CUDA version on PyPI! But maybe we should wait for #6138, where we'll get NCCL as a new dependency; there is the following diff in that PR so far:

- --max-allowed-size-uncompressed '100M' \
+ --max-allowed-size-uncompressed '500M' \

@jameslamb (Collaborator)

Yes good point!

We could also maybe explore what xgboost does, relying on the NCCL wheels that NVIDIA publishes:

@StrikerRUS (Collaborator, Author)

Hmmm...
Looks like the latest conda (mamba) is broken:

Traceback (most recent call last):
  File "/opt/miniforge//bin/mamba", line 7, in <module>
    from mamba.mamba import main
  File "/opt/miniforge/lib/python3.10/site-packages/mamba/mamba.py", line 18, in <module>
    from conda.cli.main import generate_parser, init_loggers
ImportError: cannot import name 'generate_parser' from 'conda.cli.main' (/opt/miniforge/lib/python3.10/site-packages/conda/cli/main.py)

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=17108&view=logs&j=c28dceab-947a-5848-c21f-eef3695e5f11&t=fa158246-17e2-53d4-5936-86070edbaacf&l=40

Same problem in another project 2 days ago: All-Hands-AI/OpenHands#4153.

@jameslamb (Collaborator)

Still not certain what the root cause is, but @StrikerRUS I found that switching from mamba to conda can fix it: #6663
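The workaround in #6663 amounts to preferring plain `conda` when the `mamba` entry point is broken. A minimal sketch (the fallback selection is the real pattern; the commented env-creation line and `env.yml` file name are illustrative placeholders):

```shell
#!/bin/sh
# Sketch: use mamba only if its entry point actually works, otherwise fall
# back to conda. This mirrors the mamba->conda switch from #6663; the
# commented command below is an illustrative placeholder, not a CI script.
if mamba --version >/dev/null 2>&1; then
  CONDA_CMD=mamba
else
  CONDA_CMD=conda
fi
echo "using: $CONDA_CMD"
# "$CONDA_CMD" env create --name test-env --file env.yml
```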

@jameslamb jameslamb reopened this Oct 6, 2024
@jameslamb (Collaborator)

Sorry, this was accidentally closed because of language I used in the description of #6663. I've reopened it and updated it to the latest master.

@StrikerRUS StrikerRUS merged commit 718da7d into master Oct 6, 2024
45 checks passed
@StrikerRUS StrikerRUS deleted the ci/cuda branch October 6, 2024 10:54

Successfully merging this pull request may close these issues.

[RFC] Sync supported CUDA versions with a new support policy for CUDA Container Images