Allowing dynamic linking of NCCL #4445

Closed
jakirkham opened this issue May 7, 2019 · 12 comments · Fixed by #4447

@jakirkham
Contributor

jakirkham commented May 7, 2019

It appears that NCCL's static library is used when building XGBoost. I'm curious whether it would be reasonable to let users link NCCL dynamically, as an alternative to static linking.

@hcho3
Collaborator

hcho3 commented May 7, 2019

@jakirkham Static linking is used here so that multi-GPU training works "out of the box". Many users will install XGBoost by running `pip install xgboost`, without necessarily installing NCCL first. Is there a scenario where dynamic linking would be beneficial?

@trivialfis
Member

@hcho3 Actually, I would like to do that. We stay with static linking by default, but add an option for dynamic linking. To me, the reason is the same as with dmlc-core; see #4360.

@hcho3
Collaborator

hcho3 commented May 7, 2019

@trivialfis I see that dynamic linking of NCCL may be desirable when XGBoost is used as a dependency of another application (via the CMake exported target). Right now, if an application that uses NCCL imports XGBoost, the NCCL library will be duplicated, causing code bloat.

@trivialfis
Member

trivialfis commented May 7, 2019

Code bloat is fine; it's nothing close to that of a DL library. The problem is that duplication leads to undefined behaviour when we have functions with the same signature but different implementations. The linker can fail to detect that, as illustrated in the issue above and its reproducible code.

@hcho3
Collaborator

hcho3 commented May 7, 2019

@trivialfis Meaning two versions of NCCL co-existing?

@trivialfis
Member

@hcho3
Collaborator

hcho3 commented May 7, 2019

@trivialfis I don't think that link is exactly applicable to the case of NCCL, since we are not including the NCCL source code. We are importing the static library pre-compiled by NVIDIA.

@trivialfis
Member

trivialfis commented May 7, 2019

The example above statically compiles lib.h into the middle user. Here lib.h would be NCCL, middleuser.cc would be XGBoost, and finaluser.cc would be an application that uses XGBoost. It's not exactly the same situation, since NCCL is more than just a header, but the general idea is that having a duplicated library is dangerous.

@hcho3
Collaborator

hcho3 commented May 7, 2019

Got it. So to prevent undefined behavior, we want to de-duplicate dependencies, including dmlc-core and NCCL.

@trivialfis
Member

@hcho3

@trivialfis
Member

Yes. That would be my reason.

@jakirkham
Contributor Author

Another use case for shared libraries is package management. Namely, we can create smaller packages and better track the relationships between them by building shared libraries and declaring explicit dependencies between the upstream and downstream packages that need these shared libraries.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 7, 2019