Allow using multiple GPUs without tensor parallelism #1031

Closed
gjurdzinski-deepsense opened this issue Sep 15, 2023 · 9 comments

gjurdzinski-deepsense commented Sep 15, 2023

Feature request

Currently, to use multiple GPUs one must set --num-shards to a value greater than 1. This enables tensor parallelism, but using multiple GPUs can be done in other ways as well.

In fact, in the code, from_pretrained already has its device_map argument set to "auto", which would use multiple GPUs if a single shard had them available. This suggests it's most likely not much work to rework TGI to allow that.
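For reference, a minimal sketch of what that transformers device_map="auto" path does when run on its own. The model id is just the one discussed in this thread, and float16 is an assumption made to fit the weights across two T4-class GPUs:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"  # example model from this thread

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # let accelerate spread layers over all visible GPUs
    torch_dtype=torch.float16,  # assumption: fp16 so ~14 GB of weights fits on 2x 16 GB T4s
)

# Shows which device each module ended up on, e.g. layers split between cuda:0 and cuda:1.
print(model.hf_device_map)
```

This is naive model parallelism (whole layers placed on different devices), not tensor parallelism, which is exactly the distinction this issue is about.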

Motivation

This would allow more customization of the LLM deployment.

Also, some models don't work with tensor parallelism. E.g. falcon-7b-instruct has 71 attention heads, which means it can only run on 1 or 71 shards (71 is prime, so the heads can't be split evenly any other way). With, say, two NVIDIA Tesla T4s available, Falcon 7B won't fit on a single one; it would fit across two, but TGI currently can't do that.

Your contribution

I'm happy to test the solution.

Narsil (Collaborator) commented Sep 20, 2023

Pipeline parallelism is trash for latency (for throughput it's probably the best).

You can try it relatively easily by removing the capture of model_type here: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/__init__.py#L205
This will gracefully degrade to the transformers + device_map implementation.

Expect bad latency imho (but I'd be happy to revisit my opinion if it's actually ok).

Also, there might be ways to actually allow splitting on non-divisible heads using some zero padding (you can check out TensorParallelEmbeddings and TensorParallelHead for ideas).
falcon-7b is relatively odd in being non-divisible, so we're not sure it's worth the effort to support those models (there's also some overhead to doing it, since you get extra checks + RAM during load, and the padded weights are still "wasted compute").
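The general pad-to-a-multiple-and-slice idea can be sketched in a few lines. This is a simplified illustration, not TGI's actual TensorParallelEmbedding/TensorParallelHead code; Falcon in particular fuses QKV and uses multi-query attention, so the "q_proj" below is only a stand-in for the head-splitting problem:

```python
import torch

def pad_and_shard(weight: torch.Tensor, world_size: int, rank: int, block: int = 1) -> torch.Tensor:
    """Zero-pad dim 0 to a multiple of world_size * block, then return this rank's slice.

    block=1 corresponds to padding a vocab dimension; block=head_dim corresponds to
    padding whole attention heads. The padded rows are zeros, so they add nothing to
    the result, at the cost of the "wasted compute" mentioned above.
    """
    rows, cols = weight.shape
    multiple = world_size * block
    padded_rows = ((rows + multiple - 1) // multiple) * multiple
    if padded_rows != rows:
        pad = torch.zeros(padded_rows - rows, cols, dtype=weight.dtype, device=weight.device)
        weight = torch.cat([weight, pad], dim=0)
    shard_rows = padded_rows // world_size
    return weight[rank * shard_rows : (rank + 1) * shard_rows]

# Falcon-7B-like numbers: 71 heads x head_dim 64 = 4544 rows, hidden size 4544.
# With 2 shards and block=64, the 71 heads are padded to 72, i.e. 36 heads per shard.
q_proj = torch.randn(71 * 64, 4544)
print(pad_and_shard(q_proj, world_size=2, rank=0, block=64).shape)  # torch.Size([2304, 4544])
```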

If you're trying something like that we'd be happy to review!

@Hannibal046

Hi, I am wondering if it is possible to do data parallelism with TGI? For example, I have 8 GPUs and I want 8 separate LLMs, one loaded on each of them. Would it be possible for TGI to handle this? Of course I could run 8 Docker containers, but I need a central controller to balance the load across the GPUs.

Thank you in advance

lapp0 commented Nov 25, 2023

It would be great to see pipeline parallelism in TGI for applications that require high throughput but don't care about latency.

Here is my intuition on why a cluster of 4080s / 4090s combined with pipeline parallel would achieve the best possible cost per token for larger models. Please correct me if I'm wrong:

  1. The most cost-effective hardware for inference is the 4080 / 4090 (chart via source article).
  2. The 4080 / 4090 have limited VRAM and cannot individually run larger (e.g. 34B, 70B) models, meaning we need to split the model between GPUs.
  3. The 4080 / 4090 do not support NVLink, so with tensor parallelism PCIe communication becomes a bottleneck.
  4. Per the blog post quoted below, pipeline parallelism would incur negligible PCIe overhead.

Next, let's consider network latency. The advantage of pipeline parallelism compared to tensor parallelism is that it requires less network transmission. Between pipeline stages, only batch size * embedding size data needs to be transmitted. For example, with a batch size of 8 and an embedding size of 8192, only 128 KB of data needs to be transmitted. On a PCIe Gen4 x16 with a transfer rate of 32 GB/s, it only takes 4 microseconds to complete the transmission. However, we need to consider the overhead of the communication library and the fact that the 4090 does not support direct GPU-to-GPU peer-to-peer transfers, requiring CPU intermediation. In practice, it may take tens of microseconds. Compared to the token latency of tens of milliseconds in the computation part, this can be neglected.

Even with a batch size of 330, the transmission of this 5.28 MB data over PCIe only takes 0.16 milliseconds, which is still negligible compared to the 17.5 milliseconds of computation time.
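The arithmetic in that quote is easy to sanity-check (assuming fp16 activations and the quoted 32 GB/s PCIe Gen4 x16 bandwidth):

```python
def stage_handoff_ms(batch_size: int, hidden_size: int,
                     bandwidth_gb_s: float = 32.0, bytes_per_elem: int = 2) -> float:
    """Time to ship one pipeline-stage activation tensor over PCIe, in milliseconds."""
    payload_bytes = batch_size * hidden_size * bytes_per_elem
    return payload_bytes / (bandwidth_gb_s * 1e9) * 1e3

print(stage_handoff_ms(8, 8192))    # ~0.004 ms, i.e. the ~4 microseconds quoted above
print(stage_handoff_ms(330, 8192))  # ~0.17 ms, close to the quoted 0.16 ms
```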

If anyone has experimented with the method laid out by @Narsil, please share your results. Otherwise I'll be experimenting soon.

Narsil (Collaborator) commented Dec 5, 2023

@Hannibal046 Indeed that's what you need to do.

@lapp0 For point 2, you can always use a GPTQ/AWQ version of those models on a single 4090; I think that's probably the best solution.

Also, don't read too much into theoretical numbers; they usually end up quite far from reality very quickly.

If you manage to experiment, I'd be glad to hear whether you pull something nice off. Godspeed!
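For the data-parallel setup @Hannibal046 asked about (one model replica per GPU plus a thin central balancing layer), a minimal sketch could look like the following. The ports, the model, and the assumption of 8 already-running TGI instances (each pinned to one GPU, e.g. via Docker's --gpus "device=N") are all illustrative:

```python
import itertools
import requests

# Assumption: 8 TGI instances are already running, one per GPU, on ports 8080-8087.
ENDPOINTS = [f"http://localhost:{8080 + i}" for i in range(8)]
_next_endpoint = itertools.cycle(ENDPOINTS)

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Round-robin one request to the next TGI instance via its /generate endpoint."""
    url = f"{next(_next_endpoint)}/generate"
    resp = requests.post(
        url,
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

if __name__ == "__main__":
    print(generate("What is pipeline parallelism?"))
```

In practice a proper load balancer (nginx, HAProxy, or a Kubernetes service) in front of the instances would do the same job more robustly; the point is that nothing inside TGI itself is required for this mode.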

BrightXiaoHan commented Dec 20, 2023


Does this mean that, if I have 8 GPUs and deploy a model that doesn't support tensor parallelism, I should start 8 TGI instances and then do another layer of load balancing myself?


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Feb 24, 2024
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 29, 2024
scse-l commented Mar 22, 2024

@lapp0 Hello, may I know whether you have experimented with pipeline parallelism in TGI?


lapp0 commented Mar 22, 2024

@scse-l Unfortunately, it appears that TGI doesn't fall back to pipeline parallelism under the conditions Narsil described. In my review of the code and documentation a few months ago, I found that TGI cannot support "true" pipeline parallelism.

I didn't take good notes. Here are some resources, though.

scse-l commented Mar 22, 2024

@lapp0 Got it. I'll check the refs. Thanks a lot.

