Quick question: is llama.cpp supporting model parallelism? #4014

Closed
jingyao-zhang opened this issue Nov 10, 2023 · 6 comments

@jingyao-zhang

Hi all,

A quick question about the current llama.cpp project.

Does llama.cpp support model parallelism?

I have two V100 GPUs and want to specify how many layers run on cuda:0 and how many run on cuda:1.
Does the current llama.cpp support this feature?

Thanks in advance!

@KerfuffleV2
Collaborator

Check out the -ts or --tensor-split option. Layers have to run sequentially though, so it's not exactly "parallel".
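For illustration only, a rough sketch of what that looks like on the command line; the model path, prompt, and 60/40 split are placeholders, and -ngl just controls how many layers are offloaded to the GPUs at all:

# Placeholder model path; offload all layers and split tensors 60/40 across GPU 0 and GPU 1.
./main -m model.gguf -ngl 99 -ts 60,40 -p "Hello"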

@cmp-nct
Contributor

cmp-nct commented Nov 10, 2023

> Check out the -ts or --tensor-split option. Layers have to run sequentially though, so it's not exactly "parallel".

It should be parallel, not sequential, when using tensor split, though I haven't checked how synchronization is done in the latest code.
Tensor split divides each tensor across the two cards at N percent, so they can be processed in parallel.

The downside is that llama.cpp slows down quite a bit as soon as you use two GPUs, so currently this is only useful for loading large models. When a model fits into the VRAM of one card, you should use CUDA_VISIBLE_DEVICES to hide the other GPU.
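As a concrete sketch of that last point (placeholder model path and prompt; assuming the card you want to keep is device 0), restricting llama.cpp to a single GPU looks like this:

# Expose only GPU 0 to llama.cpp so the model runs entirely on one card.
CUDA_VISIBLE_DEVICES=0 ./main -m model.gguf -ngl 99 -p "Hello"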

In my case, using two GPUs comes with an almost 10x slowdown:
2500 tokens/sec on the 4090 alone drops to 300 tokens/sec when adding a second 3090,
and generation speed drops from 80 tokens/sec down to 20 tokens/sec.

@AutonomicPerfectionist
Contributor

The MPI build might support model parallelism with GPUs, but it's currently broken (see #3334). I don't have any Nvidia GPUs to test with and my AMD setup is broken, but I can't think of anything off the top of my head that would preclude offloading layers to the GPU on top of it.
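For context, a hedged sketch of how the MPI build is built and launched, going by the project README around that time (the hostfile, process count, and model path are placeholders; whether GPU offload then works is exactly what #3334 is about):

# Build with MPI support (MPI compiler wrappers assumed to be installed).
make CC=mpicc CXX=mpicxx LLAMA_MPI=1
# Launch across the hosts listed in "hostfile"; on a single node the hostfile can simply list localhost.
mpirun -hostfile hostfile -n 2 ./main -m model.gguf -n 128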

@jingyao-zhang
Author

> The MPI build might support model parallelism with GPUs, but it's currently broken (see #3334). I don't have any Nvidia GPUs to test with and my AMD setup is broken, but I can't think of anything off the top of my head that would preclude offloading layers to the GPU on top of it.

Is MPI for multi-node deployment?
I am thinking of using a single node with multiple GPUs.
Would MPI work for that scenario?

@jingyao-zhang
Author

> Check out the -ts or --tensor-split option. Layers have to run sequentially though, so it's not exactly "parallel".
>
> It should be parallel, not sequential, when using tensor split, though I haven't checked how synchronization is done in the latest code. Tensor split divides each tensor across the two cards at N percent, so they can be processed in parallel.
>
> The downside is that llama.cpp slows down quite a bit as soon as you use two GPUs, so currently this is only useful for loading large models. When a model fits into the VRAM of one card, you should use CUDA_VISIBLE_DEVICES to hide the other GPU.
>
> In my case, using two GPUs comes with an almost 10x slowdown: 2500 tokens/sec on the 4090 alone drops to 300 tokens/sec when adding a second 3090, and generation speed drops from 80 tokens/sec down to 20 tokens/sec.

Same here. There might be communication overhead between the multiple GPUs.

@jingyao-zhang
Author

It seems -ts can partition the model across multiple GPUs. If there are no further comments, I will close this issue.
