
[Core][Distributed] fix _is_full_nvlink detection #4233

Merged
merged 2 commits into from
Apr 22, 2024

Conversation

youkaichao (Member)

After carefully reading the NVML documentation at https://developer.nvidia.com/nvidia-management-library-nvml , I realized that pynvml operates on physical device IDs, not on the relative indices inside CUDA_VISIBLE_DEVICES.

For example, prior to this PR, _is_full_nvlink(0, 4) would query NVLink information for device pairs 0-1, 0-2, and 0-3, regardless of the value of CUDA_VISIBLE_DEVICES.

Thus, for custom allreduce in vllm, we need to pass real physical device IDs to _is_full_nvlink.
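A minimal sketch of the mapping the PR relies on (the helper name is hypothetical, not taken from vLLM's code): under CUDA_VISIBLE_DEVICES, the index a process sees is relative, while NVML always wants the physical GPU ID.

```python
import os

def relative_to_physical(rank: int) -> int:
    """Map a CUDA_VISIBLE_DEVICES-relative index to the physical
    device ID that NVML/pynvml expects.

    Hypothetical helper for illustration only; it assumes the
    environment variable holds a comma-separated list of integer IDs
    (it can also hold GPU UUIDs, which this sketch does not handle).
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No masking: the relative index is already the physical ID.
        return rank
    physical_ids = [int(d) for d in visible.split(",") if d.strip()]
    return physical_ids[rank]

# With CUDA_VISIBLE_DEVICES="4,5,6,7", relative device 0 is physical GPU 4,
# so an NVLink query for relative devices (0, 3) must use physical (4, 7).
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
print(relative_to_physical(0))  # 4
print(relative_to_physical(3))  # 7
```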

FYI: @hanzhi713 @esmeetu

youkaichao (Member, Author) commented Apr 21, 2024

@hanzhi713 According to the documentation, I feel that as long as two GPUs have an NVLink connection, they can do p2p access. Do you think it is possible to remove the _can_p2p test?

Okay, forget about it. I find these capability levels are independent:

NVML_P2P_CAPS_INDEX_READ = 0
NVML_P2P_CAPS_INDEX_WRITE = 1
NVML_P2P_CAPS_INDEX_NVLINK = 2
NVML_P2P_CAPS_INDEX_ATOMICS = 3
NVML_P2P_CAPS_INDEX_PROP = 4
NVML_P2P_CAPS_INDEX_LOOPBACK = 5
NVML_P2P_CAPS_INDEX_UNKNOWN = 6

It is possible, although rare, that two GPUs are connected via NVLink but cannot do p2p access for various reasons (e.g., the CUDA runtime does not support it).
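The point above can be sketched as follows. The capability-index values are the ones listed in the comment (they match pynvml's NVML_P2P_CAPS_INDEX_* constants); the `query` callable and `full_p2p_ok` helper are hypothetical stand-ins for pynvml's nvmlDeviceGetP2PStatus, so the example runs without a GPU.

```python
# Capability indices from the comment above (same values as pynvml's
# NVML_P2P_CAPS_INDEX_* constants).
NVML_P2P_CAPS_INDEX_READ = 0
NVML_P2P_CAPS_INDEX_WRITE = 1
NVML_P2P_CAPS_INDEX_NVLINK = 2

NVML_P2P_STATUS_OK = 0

def full_p2p_ok(query, dev_a, dev_b):
    """True only if both READ and WRITE p2p report OK.

    `query(a, b, caps_index)` stands in for
    pynvml.nvmlDeviceGetP2PStatus(handle_a, handle_b, caps_index).
    Because each capability index is reported independently, an OK
    NVLink status does not imply read/write p2p is usable.
    """
    return all(
        query(dev_a, dev_b, idx) == NVML_P2P_STATUS_OK
        for idx in (NVML_P2P_CAPS_INDEX_READ, NVML_P2P_CAPS_INDEX_WRITE)
    )

# Simulate the rare case: NVLink is up, but p2p read/write is not
# (e.g. the CUDA runtime does not support it).
def stub_query(a, b, idx):
    return NVML_P2P_STATUS_OK if idx == NVML_P2P_CAPS_INDEX_NVLINK else 1

print(full_p2p_ok(stub_query, 0, 1))  # False
```

This is why keeping a separate _can_p2p check alongside the NVLink query is defensible.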

hanzhi713 (Contributor) commented Apr 22, 2024

@youkaichao good catch.

Also, I don't think there's a situation where two GPUs are connected by NVLink but can't do p2p. For safety, though, we can always check p2p via _can_p2p anyway, in case some driver version is broken again. Is there a particular reason you want to remove the _can_p2p test for NVLink GPUs?

youkaichao (Member, Author)

> Is there a particular reason you want to remove the _can_p2p test for NVLink GPUs?

No, let's just keep it.

@youkaichao youkaichao merged commit 747b1a7 into vllm-project:main Apr 22, 2024
47 checks passed
@youkaichao youkaichao deleted the fix_nvlink branch April 22, 2024 06:04
xjpang pushed a commit to xjpang/vllm that referenced this pull request Apr 25, 2024
alexeykondrat pushed a commit to alexeykondrat/ci-vllm that referenced this pull request May 1, 2024
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024