Quick question: is llama.cpp supporting model parallelism? #4014
Comments
Check out the
It should be parallel, not sequential, when using tensor split, though I haven't checked how synchronization is done in the latest code. The downside is that llama.cpp suffers noticeable slowdowns as soon as you use two GPUs, so currently it is only useful for loading large models. When a model fits into the VRAM of one card, you should use CUDA_VISIBLE_DEVICES to restrict use of the other GPU. In my case, using two GPUs comes with an almost 10x slowdown.
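A minimal command-line sketch of the two modes described above (model path, prompt, and split ratio are placeholders; `-ngl` and `--tensor-split` are existing options of the `main` example):

```sh
# Model does not fit on one card: offload all layers and split the model across both GPUs (~50/50).
./main -m ./models/model.gguf -ngl 99 --tensor-split 50,50 -p "Hello"

# Model fits on one card: hide the second GPU to avoid the multi-GPU synchronization overhead.
CUDA_VISIBLE_DEVICES=0 ./main -m ./models/model.gguf -ngl 99 -p "Hello"
```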
The MPI build might support model parallelism with GPUs, but it's currently broken (see #3334). I don't have any Nvidia GPUs to test with and my AMD setup is broken, but I can't think of anything off the top of my head that would preclude offloading layers to the GPU as well.
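For reference, a sketch of the MPI build and launch roughly following the MPI section of the llama.cpp README at the time (hostfile contents, rank count, and model path are placeholders, and given #3334 this may not currently work as-is):

```sh
# Build with the MPI backend enabled (assumption: the LLAMA_MPI=1 make flag from the README's MPI section)
make CC=mpicc CXX=mpicxx LLAMA_MPI=1

# Run across the machines listed in `hostfile`; each MPI rank evaluates a slice of the layers
mpirun -hostfile hostfile -n 2 ./main -m ./models/model.gguf -n 128 -p "Hello"
```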
Is MPI for multi-node deployment?
Same here. There might be communication overhead between multiple GPUs.
It seems
Hi all,

A quick question about the current llama.cpp project: does llama.cpp support model parallelism?

I have two V100 GPUs and want to specify how many layers run on cuda:0 and have the rest of the layers run on cuda:1. Does the current llama.cpp support this feature? Thanks in advance!
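For the layer-split use case asked about here, the comments above point at `--tensor-split`; a hedged sketch (ratios and model path are placeholders, and the flag takes per-GPU proportions of the model rather than exact layer counts):

```sh
# Offload all layers and assign roughly 40% of the model to cuda:0 and 60% to cuda:1
./main -m ./models/model.gguf -ngl 99 --tensor-split 40,60 -p "Hello"
```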