[Feature request] Dynamic splitfuse from Deepspeed (2x throughput) #317

Open
0xymoro opened this issue Nov 8, 2023 · 4 comments
Labels
feature request New feature or request triaged Issue has been triaged by maintainers

Comments

@0xymoro
Contributor

0xymoro commented Nov 8, 2023

Hi, putting this here:
https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen

The latency and throughput gains are significant, though the comparisons are against vLLM. It seems like TRT does batching a bit differently, so I'm unsure whether this applies equally here.
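
For readers unfamiliar with the technique: Dynamic SplitFuse splits long prompts into fixed-size chunks and fuses those chunks with ongoing decode tokens, so every engine step processes a roughly constant token budget. Below is a minimal, self-contained sketch of that scheduling idea; this is not DeepSpeed's or TensorRT-LLM's actual scheduler, and all names are illustrative.

```python
# Toy illustration of the Dynamic SplitFuse idea: long prompts are split
# into fixed-size chunks so each engine step processes at most
# `token_budget` tokens, mixing prefill chunks with decode tokens.
# Not DeepSpeed's or TensorRT-LLM's scheduler; all names are made up.

from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_len: int          # total prompt tokens to prefill
    prefilled: int = 0       # prompt tokens already prefilled
    decoding: bool = False   # True once prefill is complete

def schedule_step(requests, token_budget=512, chunk_size=256):
    """Compose one engine step: decode tokens first, then prefill chunks."""
    batch, used = [], 0
    # Each decoding request contributes exactly one token per step.
    for r in requests:
        if r.decoding and used < token_budget:
            batch.append((r.rid, "decode", 1))
            used += 1
    # Fill the remaining budget with prefill chunks.
    for r in requests:
        if r.decoding:
            continue
        remaining = r.prompt_len - r.prefilled
        take = min(chunk_size, remaining, token_budget - used)
        if take <= 0:
            continue
        batch.append((r.rid, "prefill", take))
        r.prefilled += take
        used += take
        if r.prefilled == r.prompt_len:
            r.decoding = True   # switches to decode on the next step
    return batch

reqs = [Request(0, 1000), Request(1, 300), Request(2, 0, decoding=True)]
print(schedule_step(reqs))
# [(2, 'decode', 1), (0, 'prefill', 256), (1, 'prefill', 255)]
```

The key property is that each step's token count stays near the budget, which keeps forward-pass latency uniform and lets decode tokens piggyback on prefill compute instead of stalling behind long prompts.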

@byshiue byshiue added feature request New feature or request triaged Issue has been triaged by maintainers labels Nov 9, 2023
@Shixiaowei02 Shixiaowei02 self-assigned this Nov 14, 2023
@ncomly-nvidia ncomly-nvidia removed the triaged Issue has been triaged by maintainers label Nov 14, 2023
@ncomly-nvidia
Collaborator

Hey @0xymoro, thanks for sharing!!

Yes, SplitFuse is an impressive advancement from DeepSpeed, and one we have been working on as well! Our implementation will be a little different due to differences in batching strategies, but the idea of chunking the prefill is the same. It will likely land in the next few releases.

@ncomly-nvidia ncomly-nvidia added the triaged Issue has been triaged by maintainers label Nov 14, 2023
@ncomly-nvidia
Collaborator

ncomly-nvidia commented Jan 3, 2024

Hey @0xymoro, chunked attention is now part of v0.7.1! We're still working on an example, so I'll leave this open until that's done.

Edit: the kernels were added in v0.7.1; the full feature will be in v0.8!

@Shixiaowei02
Collaborator

Shixiaowei02 commented Jan 8, 2024

Hi @0xymoro! Chunked context will be part of TensorRT-LLM v0.8. Thank you for your support!

@littletomatodonkey

Hi @Shixiaowei02, thanks for your great work! How can I use chunked context in TensorRT-LLM? Are there any docs?
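
For anyone else landing here with the same question, here is a hedged sketch of how chunked context is enabled, based on the flags documented for the v0.8/v0.9-era releases. The names below are assumptions taken from that era's docs and may differ across versions, so verify against the documentation for your installed release.

```python
# Hedged sketch, not official guidance: enabling chunked context with the
# Executor API that appeared around v0.9 (v0.8 exposed the same knob via
# GptManager's optional params). Names are assumptions from that era's
# docs; check your installed version.
#
# Build-time prerequisite (shell): chunked context requires paged context
# FMHA, e.g.
#   trtllm-build --checkpoint_dir ./ckpt --output_dir ./engine \
#                --use_paged_context_fmha enable
# The chunk size is tied to the engine's max_num_tokens setting.

import tensorrt_llm.bindings.executor as trtllm

config = trtllm.ExecutorConfig(enable_chunked_context=True)
executor = trtllm.Executor("./engine", trtllm.ModelType.DECODER_ONLY, config)
```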
