[Core][Distributed] fix _is_full_nvlink detection #4233
Conversation
Okay, forget about it. I find these levels are independent:

```
NVML_P2P_CAPS_INDEX_READ = 0
NVML_P2P_CAPS_INDEX_WRITE = 1
NVML_P2P_CAPS_INDEX_NVLINK = 2
NVML_P2P_CAPS_INDEX_ATOMICS = 3
NVML_P2P_CAPS_INDEX_PROP = 4
NVML_P2P_CAPS_INDEX_LOOPBACK = 5
NVML_P2P_CAPS_INDEX_UNKNOWN = 6
```

It is possible, although rare, that two GPUs are connected via NVLink but cannot do P2P access for various reasons (e.g. the CUDA runtime does not support it).
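The point above can be sketched in code: NVLink connectivity and P2P capability are separate NVML query indices, so a robust check must test both. This is an illustrative sketch, not the PR's actual code; the constants mirror NVML's `nvmlGpuP2PCapsIndex_t` enum, and the query function is injected so the example runs without a GPU (on real hardware it would wrap `pynvml.nvmlDeviceGetP2PStatus`).

```python
# Constants mirroring NVML's nvmlGpuP2PCapsIndex_t enum.
NVML_P2P_CAPS_INDEX_READ = 0
NVML_P2P_CAPS_INDEX_WRITE = 1
NVML_P2P_CAPS_INDEX_NVLINK = 2
NVML_P2P_CAPS_INDEX_ATOMICS = 3
NVML_P2P_CAPS_INDEX_PROP = 4
NVML_P2P_CAPS_INDEX_LOOPBACK = 5
NVML_P2P_CAPS_INDEX_UNKNOWN = 6

NVML_P2P_STATUS_OK = 0  # nvmlGpuP2PStatus_t success value


def nvlink_and_p2p_ok(p2p_status, dev_a, dev_b):
    """Return True only if dev_a and dev_b report both NVLink *and*
    P2P read capability.

    `p2p_status(a, b, caps_index)` is an injected query function
    (hypothetical, for testability); with real hardware it would call
    pynvml.nvmlDeviceGetP2PStatus(handle_a, handle_b, caps_index).
    """
    return (
        p2p_status(dev_a, dev_b, NVML_P2P_CAPS_INDEX_NVLINK) == NVML_P2P_STATUS_OK
        and p2p_status(dev_a, dev_b, NVML_P2P_CAPS_INDEX_READ) == NVML_P2P_STATUS_OK
    )


# Simulated topology: GPUs 0 and 1 are linked by NVLink, but the
# runtime reports P2P read as unsupported -- the rare case above.
def fake_status(a, b, idx):
    if idx == NVML_P2P_CAPS_INDEX_NVLINK:
        return NVML_P2P_STATUS_OK
    return 1  # anything other than OK means "not supported"


print(nvlink_and_p2p_ok(fake_status, 0, 1))  # False: NVLink alone is not enough
```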
@youkaichao good catch. Also, I don't think there's a situation where two GPUs are connected by NVLink but can't do P2P. For safety though, we can always check P2P via
No, let's just keep it.
After carefully reading the documentation of NVML at https://developer.nvidia.com/nvidia-management-library-nvml , I realized that `pynvml` works on physical device ids, rather than relative indices inside `CUDA_VISIBLE_DEVICES`.

For example, prior to this PR, `_is_full_nvlink(0, 4)` would query NVLink information on devices `0-1, 0-2, 0-3`, no matter what the value of `CUDA_VISIBLE_DEVICES` is.

Thus, in the case of custom allreduce in vLLM, we need to use real physical device ids for `_is_full_nvlink`.

FYI: @hanzhi713 @esmeetu
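The remapping described above can be sketched as follows. This is a hypothetical helper, not vLLM's actual implementation, and it assumes `CUDA_VISIBLE_DEVICES` holds comma-separated integer ids (the variable can also hold GPU UUIDs, which this sketch does not handle):

```python
import os


def physical_device_id(relative_index: int) -> int:
    """Map a CUDA-relative device index to the physical id that NVML
    expects. With CUDA_VISIBLE_DEVICES="4,5,6,7", CUDA device 0 is
    physical GPU 4, and that physical id is what pynvml queries
    must be given."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return relative_index  # no remapping in effect
    ids = [int(x) for x in visible.split(",") if x.strip()]
    return ids[relative_index]


os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
print(physical_device_id(0))  # 4: pynvml must query GPU 4, not GPU 0
```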