
[Core][Distributed] fix _is_full_nvlink detection #4233

Merged
merged 2 commits into from
Apr 22, 2024

Conversation

youkaichao (Member)

After carefully reading the NVML documentation at https://developer.nvidia.com/nvidia-management-library-nvml , I realized that pynvml operates on physical device IDs, not on the relative indices inside CUDA_VISIBLE_DEVICES.

For example, prior to this PR, _is_full_nvlink(0, 4) would query NVLink information for device pairs 0-1, 0-2, and 0-3, regardless of the value of CUDA_VISIBLE_DEVICES.

Thus, for custom allreduce in vllm, we need to pass real physical device IDs to _is_full_nvlink.
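A minimal sketch of the mapping the PR relies on (the helper name is hypothetical, not taken from vLLM's code): under CUDA_VISIBLE_DEVICES, the index a process sees is relative, while NVML always wants the physical GPU ID.

```python
import os

def relative_to_physical(rank: int) -> int:
    """Map a CUDA_VISIBLE_DEVICES-relative index to the physical
    device ID that NVML/pynvml expects.

    Hypothetical helper for illustration only; it assumes the
    environment variable holds a comma-separated list of integer IDs
    (it can also hold GPU UUIDs, which this sketch does not handle).
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No masking: the relative index is already the physical ID.
        return rank
    physical_ids = [int(d) for d in visible.split(",") if d.strip()]
    return physical_ids[rank]

# With CUDA_VISIBLE_DEVICES="4,5,6,7", relative device 0 is physical GPU 4,
# so an NVLink query for relative devices (0, 3) must use physical (4, 7).
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
print(relative_to_physical(0))  # 4
print(relative_to_physical(3))  # 7
```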

FYI: @hanzhi713 @esmeetu

youkaichao (Member, Author) commented Apr 21, 2024

@hanzhi713 According to the documentation, I feel that as long as two GPUs have an NVLink connection, they can do p2p access. Do you think it is possible to remove the _can_p2p test?

Okay, forget about it. I find these capability levels are independent:

NVML_P2P_CAPS_INDEX_READ = 0
NVML_P2P_CAPS_INDEX_WRITE = 1
NVML_P2P_CAPS_INDEX_NVLINK = 2
NVML_P2P_CAPS_INDEX_ATOMICS = 3
NVML_P2P_CAPS_INDEX_PROP = 4
NVML_P2P_CAPS_INDEX_LOOPBACK = 5
NVML_P2P_CAPS_INDEX_UNKNOWN = 6

It is possible, although rare, that two GPUs are connected via NVLink but cannot do p2p access for various reasons (e.g., the CUDA runtime does not support it).
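The point above can be sketched as follows. The capability-index values are the ones listed in the comment (they match pynvml's NVML_P2P_CAPS_INDEX_* constants); the `query` callable and `full_p2p_ok` helper are hypothetical stand-ins for pynvml's nvmlDeviceGetP2PStatus, so the example runs without a GPU.

```python
# Capability indices from the comment above (same values as pynvml's
# NVML_P2P_CAPS_INDEX_* constants).
NVML_P2P_CAPS_INDEX_READ = 0
NVML_P2P_CAPS_INDEX_WRITE = 1
NVML_P2P_CAPS_INDEX_NVLINK = 2

NVML_P2P_STATUS_OK = 0

def full_p2p_ok(query, dev_a, dev_b):
    """True only if both READ and WRITE p2p report OK.

    `query(a, b, caps_index)` stands in for
    pynvml.nvmlDeviceGetP2PStatus(handle_a, handle_b, caps_index).
    Because each capability index is reported independently, an OK
    NVLink status does not imply read/write p2p is usable.
    """
    return all(
        query(dev_a, dev_b, idx) == NVML_P2P_STATUS_OK
        for idx in (NVML_P2P_CAPS_INDEX_READ, NVML_P2P_CAPS_INDEX_WRITE)
    )

# Simulate the rare case: NVLink is up, but p2p read/write is not
# (e.g. the CUDA runtime does not support it).
def stub_query(a, b, idx):
    return NVML_P2P_STATUS_OK if idx == NVML_P2P_CAPS_INDEX_NVLINK else 1

print(full_p2p_ok(stub_query, 0, 1))  # False
```

This is why keeping a separate _can_p2p check alongside the NVLink query is defensible.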

hanzhi713 (Contributor) commented Apr 22, 2024

@youkaichao good catch.

Also, I don't think there's a situation where two GPUs are connected by NVLink but can't do p2p. For safety, though, we can always check p2p via _can_p2p anyway, in case some driver version is broken again. Is there a particular reason you want to remove the _can_p2p test for NVLink GPUs?

youkaichao (Member, Author)

> Is there a particular reason you want to remove the _can_p2p test for NVLink GPUs?

No, let's just keep it.

@youkaichao youkaichao merged commit 747b1a7 into vllm-project:main Apr 22, 2024
47 checks passed
@youkaichao youkaichao deleted the fix_nvlink branch April 22, 2024 06:04
xjpang pushed a commit to xjpang/vllm that referenced this pull request Apr 25, 2024
alexeykondrat pushed a commit to alexeykondrat/ci-vllm that referenced this pull request May 1, 2024
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024