Quick question: is llama.cpp supporting model parallelism? #4014

Closed
jingyao-zhang opened this issue Nov 10, 2023 · 6 comments

@jingyao-zhang

Hi all,

A quick question about the current llama.cpp project.

Does llama.cpp support model parallelism?

I have two V100 GPUs and want to specify how many layers run on cuda:0 and how many run on cuda:1.
Does the current llama.cpp support this feature?

Thanks in advance!

@KerfuffleV2
Collaborator

Check out the -ts or --tensor-split option. Layers have to run sequentially though, so it's not exactly "parallel".
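For illustration only, a rough sketch of what that looks like on the command line; the model path, prompt, and 60/40 split are placeholders, and -ngl just controls how many layers are offloaded to the GPUs at all:

# Placeholder model path; offload all layers and split tensors 60/40 across GPU 0 and GPU 1.
./main -m model.gguf -ngl 99 -ts 60,40 -p "Hello"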

@cmp-nct
Contributor

cmp-nct commented Nov 10, 2023

> Check out the -ts or --tensor-split option. Layers have to run sequentially though, so it's not exactly "parallel".

It should be parallel, not sequential, when using tensor split, though I haven't checked how synchronization is done in the latest code.
Tensor split divides each tensor across the two cards at N percent, so they can be processed in parallel.

The downside is that llama.cpp slows down quite a bit as soon as you use two GPUs, so currently this is only useful for loading large models. When a model fits into the VRAM of one card, you should use CUDA_VISIBLE_DEVICES to hide the other GPU.
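As a concrete sketch of that last point (placeholder model path and prompt; assuming the card you want to keep is device 0), restricting llama.cpp to a single GPU looks like this:

# Expose only GPU 0 to llama.cpp so the model runs entirely on one card.
CUDA_VISIBLE_DEVICES=0 ./main -m model.gguf -ngl 99 -p "Hello"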

In my case, using two GPUs comes with an almost 10x slowdown:
2500 tokens/sec on the 4090 alone drops to 300 tokens/sec when adding a second 3090,
and generation speed drops from 80 tokens/sec down to 20 tokens/sec.

@AutonomicPerfectionist
Contributor

The MPI build might support model parallelism with GPUs, but it's currently broken (see #3334). I don't have any Nvidia GPUs to test with and my AMD setup is broken, but I can't think of anything off the top of my head that would preclude offloading layers to the GPU on top of it.
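For context, a hedged sketch of how the MPI build is built and launched, going by the project README around that time (the hostfile, process count, and model path are placeholders; whether GPU offload then works is exactly what #3334 is about):

# Build with MPI support (MPI compiler wrappers assumed to be installed).
make CC=mpicc CXX=mpicxx LLAMA_MPI=1
# Launch across the hosts listed in "hostfile"; on a single node the hostfile can simply list localhost.
mpirun -hostfile hostfile -n 2 ./main -m model.gguf -n 128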

@jingyao-zhang
Author

> The MPI build might support model parallelism with GPUs, but it's currently broken (see #3334). I don't have any Nvidia GPUs to test with and my AMD setup is broken, but I can't think of anything off the top of my head that would preclude offloading layers to the GPU on top of it.

Is MPI for multi-node deployment?
I am thinking of using a single node with multiple GPUs.
Would MPI work for that scenario?

@jingyao-zhang
Author

> Check out the -ts or --tensor-split option. Layers have to run sequentially though, so it's not exactly "parallel".
>
> It should be parallel, not sequential, when using tensor split, though I haven't checked how synchronization is done in the latest code. Tensor split divides each tensor across the two cards at N percent, so they can be processed in parallel.
>
> The downside is that llama.cpp slows down quite a bit as soon as you use two GPUs, so currently this is only useful for loading large models. When a model fits into the VRAM of one card, you should use CUDA_VISIBLE_DEVICES to hide the other GPU.
>
> In my case, using two GPUs comes with an almost 10x slowdown: 2500 tokens/sec on the 4090 alone drops to 300 tokens/sec when adding a second 3090, and generation speed drops from 80 tokens/sec down to 20 tokens/sec.

Same here. There might be communication overhead between the multiple GPUs.

@jingyao-zhang
Author

It seems -ts can partition the model across multiple GPUs. If there are no further comments, I will close this issue.
