Allow using multiple GPUs without tensor parallelism #1031

Closed
gjurdzinski-deepsense opened this issue Sep 15, 2023 · 9 comments

gjurdzinski-deepsense commented Sep 15, 2023

Feature request

Currently, to use multiple GPUs one must set --num-shards to a value greater than 1. This enables tensor parallelism, but using multiple GPUs can be done in other ways as well.

In fact, in the code, from_pretrained already has its device_map argument set to "auto", which would use multiple GPUs if a single shard had them available. This suggests it's most likely not much work to rework TGI to allow that.
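For reference, a minimal sketch of what that transformers device_map="auto" path does when run on its own. The model id is just the one discussed in this thread, and float16 is an assumption made to fit the weights across two T4-class GPUs:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"  # example model from this thread

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # let accelerate spread layers over all visible GPUs
    torch_dtype=torch.float16,  # assumption: fp16 so ~14 GB of weights fits on 2x 16 GB T4s
)

# Shows which device each module ended up on, e.g. layers split between cuda:0 and cuda:1.
print(model.hf_device_map)
```

This is naive model parallelism (whole layers placed on different devices), not tensor parallelism, which is exactly the distinction this issue is about.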

Motivation

This would allow more customization of the LLM deployment.

Also, some models don't work with tensor parallelism. E.g. falcon-7b-instruct has 71 attention heads, which means it can only run on 1 or 71 shards (71 is prime, so the heads can't be split evenly any other way). With, say, two NVIDIA Tesla T4s available, Falcon 7B won't fit on a single one; it would fit across two, but TGI currently can't do that.

Your contribution

I'm happy to test the solution.

Narsil (Collaborator) commented Sep 20, 2023

Pipeline parallelism is trash for latency (for throughput it's probably the best).

You can try it relatively easily by removing the capture of model_type here: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/__init__.py#L205
This will gracefully degrade to the transformers + device_map implementation.

Expect bad latency imho (but I'd be happy to revisit my opinion if it's actually ok).

Also, there might be ways to actually allow splitting on non-divisible heads using some zero padding (you can check out TensorParallelEmbeddings and TensorParallelHead for ideas).
falcon-7b is relatively odd in being non-divisible, so we're not sure it's worth the effort to support those models (there's also some overhead to doing it, since you get extra checks + RAM during load, and the padded weights are still "wasted compute").
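The general pad-to-a-multiple-and-slice idea can be sketched in a few lines. This is a simplified illustration, not TGI's actual TensorParallelEmbedding/TensorParallelHead code; Falcon in particular fuses QKV and uses multi-query attention, so the "q_proj" below is only a stand-in for the head-splitting problem:

```python
import torch

def pad_and_shard(weight: torch.Tensor, world_size: int, rank: int, block: int = 1) -> torch.Tensor:
    """Zero-pad dim 0 to a multiple of world_size * block, then return this rank's slice.

    block=1 corresponds to padding a vocab dimension; block=head_dim corresponds to
    padding whole attention heads. The padded rows are zeros, so they add nothing to
    the result, at the cost of the "wasted compute" mentioned above.
    """
    rows, cols = weight.shape
    multiple = world_size * block
    padded_rows = ((rows + multiple - 1) // multiple) * multiple
    if padded_rows != rows:
        pad = torch.zeros(padded_rows - rows, cols, dtype=weight.dtype, device=weight.device)
        weight = torch.cat([weight, pad], dim=0)
    shard_rows = padded_rows // world_size
    return weight[rank * shard_rows : (rank + 1) * shard_rows]

# Falcon-7B-like numbers: 71 heads x head_dim 64 = 4544 rows, hidden size 4544.
# With 2 shards and block=64, the 71 heads are padded to 72, i.e. 36 heads per shard.
q_proj = torch.randn(71 * 64, 4544)
print(pad_and_shard(q_proj, world_size=2, rank=0, block=64).shape)  # torch.Size([2304, 4544])
```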

If you're trying something like that we'd be happy to review!

@Hannibal046

Hi, I am wondering if it is possible to do data parallelism with TGI? For example, I have 8 GPUs and I want 8 separate LLMs, one loaded on each of them. Would it be possible for TGI to handle this? Of course I could run 8 Docker containers, but I need a central controller to balance the load across the GPUs.

Thank you in advance

lapp0 commented Nov 25, 2023

It would be great to see pipeline parallelism in TGI for applications that require high throughput but don't care about latency.

Here is my intuition on why a cluster of 4080s / 4090s combined with pipeline parallel would achieve the best possible cost per token for larger models. Please correct me if I'm wrong:

  1. The most cost-effective hardware for inference is the 4080 / 4090 (chart via source article).
  2. The 4080 / 4090 have limited VRAM and cannot individually run larger (e.g. 34B, 70B) models, meaning we need to split the model between GPUs.
  3. The 4080 / 4090 do not support NVLink, so with tensor parallelism PCIe communication becomes a bottleneck.
  4. Per the blog post quoted below, pipeline parallelism would incur negligible PCIe overhead.

Next, let's consider network latency. The advantage of pipeline parallelism compared to tensor parallelism is that it requires less network transmission. Between pipeline stages, only batch size * embedding size data needs to be transmitted. For example, with a batch size of 8 and an embedding size of 8192, only 128 KB of data needs to be transmitted. On a PCIe Gen4 x16 with a transfer rate of 32 GB/s, it only takes 4 microseconds to complete the transmission. However, we need to consider the overhead of the communication library and the fact that the 4090 does not support direct GPU-to-GPU peer-to-peer transfers, requiring CPU intermediation. In practice, it may take tens of microseconds. Compared to the token latency of tens of milliseconds in the computation part, this can be neglected.

Even with a batch size of 330, the transmission of this 5.28 MB data over PCIe only takes 0.16 milliseconds, which is still negligible compared to the 17.5 milliseconds of computation time.
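The arithmetic in that quote is easy to sanity-check (assuming fp16 activations and the quoted 32 GB/s PCIe Gen4 x16 bandwidth):

```python
def stage_handoff_ms(batch_size: int, hidden_size: int,
                     bandwidth_gb_s: float = 32.0, bytes_per_elem: int = 2) -> float:
    """Time to ship one pipeline-stage activation tensor over PCIe, in milliseconds."""
    payload_bytes = batch_size * hidden_size * bytes_per_elem
    return payload_bytes / (bandwidth_gb_s * 1e9) * 1e3

print(stage_handoff_ms(8, 8192))    # ~0.004 ms, i.e. the ~4 microseconds quoted above
print(stage_handoff_ms(330, 8192))  # ~0.17 ms, close to the quoted 0.16 ms
```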

If anyone has experimented with the method laid out by @Narsil, please share your results. Otherwise I'll be experimenting soon.

Narsil (Collaborator) commented Dec 5, 2023

@Hannibal046 Indeed that's what you need to do.

@lapp0 For point 2, you can always use a GPTQ/AWQ version of those models on a single 4090; I think that's probably the best solution.

Also, don't read too much into theoretical numbers; they usually end up quite far from reality very quickly.

If you manage to experiment, I'd be glad to hear whether you pull something nice off. Godspeed!
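For the data-parallel setup @Hannibal046 asked about (one model replica per GPU plus a thin central balancing layer), a minimal sketch could look like the following. The ports, the model, and the assumption of 8 already-running TGI instances (each pinned to one GPU, e.g. via Docker's --gpus "device=N") are all illustrative:

```python
import itertools
import requests

# Assumption: 8 TGI instances are already running, one per GPU, on ports 8080-8087.
ENDPOINTS = [f"http://localhost:{8080 + i}" for i in range(8)]
_next_endpoint = itertools.cycle(ENDPOINTS)

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Round-robin one request to the next TGI instance via its /generate endpoint."""
    url = f"{next(_next_endpoint)}/generate"
    resp = requests.post(
        url,
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

if __name__ == "__main__":
    print(generate("What is pipeline parallelism?"))
```

In practice a proper load balancer (nginx, HAProxy, or a Kubernetes service) in front of the instances would do the same job more robustly; the point is that nothing inside TGI itself is required for this mode.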

BrightXiaoHan commented Dec 20, 2023


Does this mean that, if I have 8 GPUs and deploy a model that doesn't support tensor parallelism, I should start 8 TGI instances and then do another layer of load balancing myself?


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Feb 24, 2024
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 29, 2024
scse-l commented Mar 22, 2024

@lapp0 Hello, may I know whether you have experimented with pipeline parallelism in TGI?


lapp0 commented Mar 22, 2024

@scse-l Unfortunately, it appears that TGI doesn't fall back to pipeline parallelism under the conditions Narsil described. In my review of the code and documentation a few months ago, I found that TGI cannot support "true" pipeline parallelism.

I didn't take good notes. Here are some resources, though.

scse-l commented Mar 22, 2024

@lapp0 Got it. I'll check the refs. Thanks a lot.

