-
Notifications
You must be signed in to change notification settings - Fork 574
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not error on NCCL library load failure during CUDA driver creation #13403
Conversation
If a library symbol is not found or the NCCL library version does not match the requirements do not fail during CUDA driver creation.
I started a discussion here #13400 |
@@ -153,7 +153,7 @@ iree_status_t iree_hal_cuda_nccl_dynamic_symbols_initialize( | |||
if (nccl_version < NCCL_VERSION(NCCL_MAJOR, NCCL_MINOR, 0) || | |||
major != NCCL_MAJOR) { | |||
status = iree_make_status( | |||
IREE_STATUS_INTERNAL, | |||
IREE_STATUS_NOT_FOUND, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's return UNAVAILABLE here instead so that we have a consistent status code to look for
I think this needs more of a re-organization. See my notes here: #13422 |
I agree. I thought about making the nccl loading more compatible with the old version, which does not have @sogartar Do you want to make a stab? Let me know. Note that 2.17 is not backward compatible to 2.15 because it extended the config structure, and also has a bug that makes the program hang. @trevor-m reported the issue to the NCCL team. |
I realized the config structure change is safe since nccl checks the version field before trying to read those new fields. |
This is done at #13422. Closing. |
If a library symbol is not found or the NCCL library version does not match the requirements do not fail during CUDA driver creation.