Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not error on NCCL library load failure during CUDA driver creation #13403

Closed

Conversation

sogartar
Copy link
Contributor

@sogartar sogartar commented May 4, 2023

If a library symbol is not found or the NCCL library version does not match the requirements do not fail during CUDA driver creation.

If a library symbol is not found or the NCCL library version does not
match the requirements do not fail during CUDA driver creation.
@sogartar sogartar requested a review from ThomasRaoux as a code owner May 4, 2023 01:38
@sogartar
Copy link
Contributor Author

sogartar commented May 4, 2023

I started a discussion here #13400

@powderluv powderluv requested review from benvanik and okkwon May 4, 2023 01:42
@@ -153,7 +153,7 @@ iree_status_t iree_hal_cuda_nccl_dynamic_symbols_initialize(
if (nccl_version < NCCL_VERSION(NCCL_MAJOR, NCCL_MINOR, 0) ||
major != NCCL_MAJOR) {
status = iree_make_status(
IREE_STATUS_INTERNAL,
IREE_STATUS_NOT_FOUND,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's return UNAVAILABLE here instead so that we have a consistent status code to look for

@stellaraccident
Copy link
Collaborator

I think this needs more of a re-organization. See my notes here: #13422

@okkwon
Copy link
Member

okkwon commented May 5, 2023

I agree. I thought about making the nccl loading more compatible with the old version, which does not have ncclCommInitRankConfig, but I remembered that there were some perf improvement by tweaking the config. So let's keep the version as is.

@sogartar Do you want to make a stab? Let me know.

Note that 2.17 is not backward compatible to 2.15 because it extended the config structure, and also has a bug that makes the program hang. @trevor-m reported the issue to the NCCL team.

@trevor-m
Copy link
Contributor

trevor-m commented May 5, 2023

I agree. I thought about making the nccl loading more compatible with the old version, which does not have ncclCommInitRankConfig, but I remembered that there were some perf improvement by tweaking the config. So let's keep the version as is.

@sogartar Do you want to make a stab? Let me know.

Note that 2.17 is not backward compatible to 2.15 because it extended the config structure, and also has a bug that makes the program hang. @trevor-m reported the issue to the NCCL team.

I realized the config structure change is safe since nccl checks the version field before trying to read those new fields.
The hang issue is still remaining.

@okkwon
Copy link
Member

okkwon commented May 19, 2023

This is done at #13422. Closing.

@okkwon okkwon closed this May 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants