
[ci] Update CUDA versions for CI #6539

Merged (5 commits, Oct 6, 2024)

Conversation

@StrikerRUS (Collaborator)

Fixed #6520.

@StrikerRUS (Collaborator, Author)

@shiyu1994 Hi! May I kindly ask you to update the NVIDIA driver on the host machine where the CUDA CI jobs are executed? This would allow us to run tests against the most recent CUDA version, 12.5. The currently installed driver is 525.147.05, which is insufficient to run CUDA 12.5 containers:

nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.5, please update your driver to a newer version, or use an earlier cuda container: unknown

Refer to #6520 for the context of this PR.
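The gate that produces this error is essentially a driver-version comparison. A minimal sketch, assuming the mapping discussed in this thread (R535 as the minimum branch for CUDA 12.5; the fallback value is a placeholder, not an official NVIDIA table):

```shell
#!/bin/sh
# Hypothetical check: is the installed NVIDIA driver branch new enough for a
# given CUDA container? Only the 12.5 -> R535 entry is confirmed in this
# thread; the fallback is a placeholder for illustration.
min_driver_for_cuda() {
  case "$1" in
    12.5) echo 535 ;;   # confirmed later in this thread
    *)    echo 525 ;;   # placeholder, not an official mapping
  esac
}

driver_ok() {
  installed="$1"   # e.g. "525.147.05", as reported by nvidia-smi
  cuda="$2"        # e.g. "12.5"
  # Compare the major driver branch against the required minimum.
  [ "${installed%%.*}" -ge "$(min_driver_for_cuda "$cuda")" ]
}

driver_ok 525.147.05 12.5 || echo "driver too old for CUDA 12.5 containers"
driver_ok 535.183.01 12.5 && echo "driver OK for CUDA 12.5 containers"
```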

Some related external links:

@jameslamb (Collaborator)

Based on https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-drivers, I think we want R535 (the latest long-term support release).

@StrikerRUS (Collaborator, Author)

I think we want R535 (the latest long-term support release).

Agree.

Based on my personal experience, the R530 driver doesn't support CUDA 12.5.

@StrikerRUS (Collaborator, Author)

Gently ping @shiyu1994 for fresh NVIDIA driver installation.

@StrikerRUS (Collaborator, Author)

Can confirm that R535 is enough to run containers with CUDA 12.5.
Host:

Tue Aug  6 22:16:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:8D:00.0 Off |                  Off |
| 30%   27C    P8              27W / 300W |  24893MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.5.1-cudnn-devel-ubuntu20.04 nvidia-smi
==========
== CUDA ==
==========

CUDA Version 12.5.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Tue Aug  6 22:14:16 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:8D:00.0 Off |                  Off |
| 30%   28C    P8              28W / 300W |  24893MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

@jameslamb (Collaborator)

I'll try to contact @shiyu1994 in the maintainer Slack.

@StrikerRUS (Collaborator, Author)

@jameslamb Did you succeed? 👼

@jameslamb (Collaborator)

@jameslamb Did you succeed? 👼

No, I haven't been able to reach @shiyu1994 in the last 2 months.

@shiyu1994 since I do see you're active here (#6623), could you please help us with this? I sent another message in the maintainer private chat as well on a separate topic.

@StrikerRUS (Collaborator, Author)

Just learned that the CUDA Forward Compatibility feature is available only for data-center cards (e.g., Tesla A100), not for consumer ones (e.g., RTX 4090):

Forward Compatibility is applicable only for systems with NVIDIA Data Center GPUs or select NGC Server Ready SKUs of RTX cards.

For example, on a consumer RTX 4090 with the R535 driver, you'll get cuda runtime error (804) : forward compatibility was attempted on non supported HW while trying to run a Docker image with CUDA 12.4.

@shiyu1994 (Collaborator)

@jameslamb Did you succeed? 👼

No, I haven't been able to reach @shiyu1994 in the last 2 months.

@shiyu1994 since I do see you're active here (#6623), could you please help us with this? I sent another message in the maintainer private chat as well on a separate topic.

Sorry, I cannot log in to my Slack account since it is registered with an @qq.com email. I will update the CUDA version of the CI agent.

@jameslamb (Collaborator)

Thank you!!

@StrikerRUS (Collaborator, Author)

@shiyu1994

I will update the CUDA version of the CI agent.

Thanks a lot!
Please ping me when the agent is ready.

@shiyu1994 (Collaborator)

@StrikerRUS Done with upgrading the NVIDIA driver to 535. Please ping me if there's anything else I need to do.
(screenshot: 2024-10-01 15:21:45)

@StrikerRUS (Collaborator, Author)

@shiyu1994 Thank you very much!
Right now all CUDA jobs are failing with

Checking docker version
  /usr/bin/docker version --format '{{.Server.APIVersion}}'
  permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/version": dial unix /var/run/docker.sock: connect: permission denied
  Error: Exit code 1 returned from process: file name '/usr/bin/docker', arguments 'version --format '{{.Server.APIVersion}}''.

error at the "Initialize containers" phase.

I'll try to trigger the "set up docker" job manually.
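A docker-socket permission failure like the one above is usually a group-membership or daemon-setup problem. A hedged sketch of a diagnostic (the probe command is injectable purely so the logic can be demonstrated without a Docker daemon; the `usermod` hint is the generic remediation, not something specific to this CI setup):

```shell
#!/bin/sh
# Sketch: diagnose "permission denied ... /var/run/docker.sock".
# In real use you'd call check_docker_access with no argument so it probes
# `docker version`; passing `true`/`false` here only simulates the outcomes.
check_docker_access() {
  probe="${1:-docker version}"
  if $probe >/dev/null 2>&1; then
    echo "docker: ok"
  else
    echo "docker: cannot reach the daemon socket"
    echo "hint: add the runner user to the docker group (sudo usermod -aG docker \$USER) and re-login,"
    echo "hint: or re-run the agent's Docker setup job, as was done in this thread"
  fi
}

check_docker_access true    # simulates a reachable daemon
check_docker_access false   # simulates the permission failure
```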

@StrikerRUS (Collaborator, Author)

I'll try to trigger set up docker job manually.

It helped! 🎉

@StrikerRUS StrikerRUS changed the title Update CUDA versions for CI [ci] Update CUDA versions for CI Oct 1, 2024
@StrikerRUS StrikerRUS marked this pull request as ready for review October 1, 2024 17:19
@StrikerRUS (Collaborator, Author)

I think this PR is ready for review.

@jameslamb (Collaborator) left a comment

Excellent! I checked the logs and everything looks good to me. I support merging this.

I'm glad the option to restart docker manually was helpful!


One thing that surprised me in the logs... the wheels are only 11.1 MB compressed?

checking './dist/lightgbm-4.5.0.99-py3-none-linux_x86_64.whl'
----- package inspection summary -----
file size
  * compressed size: 11.1M
  * uncompressed size: 23.0M
  * compression space saving: 51.8%

(build link)

If that's true, then maybe we should consider distributing wheels on PyPI with CUDA support compiled in. We could do something like XGBoost does, supporting just one major version of CUDA at a time on PyPI (ref: dmlc/xgboost#10807).

Anyway, just thinking out loud... it absolutely should not block this PR, just maybe something to think about for the future.

@StrikerRUS (Collaborator, Author)

If that's true, then maybe we should consider distributing wheels on PyPI with CUDA support compiled in.

I like your idea of publishing a CUDA version on PyPI! But maybe we should wait for #6138, where we'll get NCCL as a new dependency; there is the following diff in that PR so far:

- --max-allowed-size-uncompressed '100M' \
+ --max-allowed-size-uncompressed '500M' \

@jameslamb (Collaborator)

Yes good point!

We could also maybe explore what xgboost does, relying on the NCCL wheels that NVIDIA publishes:

@StrikerRUS (Collaborator, Author)

Hmmm...
Looks like the latest conda (mamba) is broken:

Traceback (most recent call last):
  File "/opt/miniforge//bin/mamba", line 7, in <module>
    from mamba.mamba import main
  File "/opt/miniforge/lib/python3.10/site-packages/mamba/mamba.py", line 18, in <module>
    from conda.cli.main import generate_parser, init_loggers
ImportError: cannot import name 'generate_parser' from 'conda.cli.main' (/opt/miniforge/lib/python3.10/site-packages/conda/cli/main.py)

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=17108&view=logs&j=c28dceab-947a-5848-c21f-eef3695e5f11&t=fa158246-17e2-53d4-5936-86070edbaacf&l=40

Same problem in another project 2 days ago: All-Hands-AI/OpenHands#4153.

@jameslamb (Collaborator)

Still not certain what the root cause is, but @StrikerRUS I found that switching from mamba to conda can fix it: #6663
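The workaround in #6663 amounts to preferring plain `conda` when the `mamba` entry point is broken. A minimal sketch (the fallback selection is the real pattern; the commented env-creation line and `env.yml` file name are illustrative placeholders):

```shell
#!/bin/sh
# Sketch: use mamba only if its entry point actually works, otherwise fall
# back to conda. This mirrors the mamba->conda switch from #6663; the
# commented command below is an illustrative placeholder, not a CI script.
if mamba --version >/dev/null 2>&1; then
  CONDA_CMD=mamba
else
  CONDA_CMD=conda
fi
echo "using: $CONDA_CMD"
# "$CONDA_CMD" env create --name test-env --file env.yml
```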

@jameslamb jameslamb reopened this Oct 6, 2024
@jameslamb (Collaborator)

Sorry, this was accidentally closed because of language I used in the description of #6663. I've reopened it and updated it to the latest master.

@StrikerRUS StrikerRUS merged commit 718da7d into master Oct 6, 2024
45 checks passed
@StrikerRUS StrikerRUS deleted the ci/cuda branch October 6, 2024 10:54

Successfully merging this pull request may close these issues.

[RFC] Sync supported CUDA versions with a new support policy for CUDA Container Images