[Feature request] Dynamic splitfuse from Deepspeed (2x throughput) #317

Open
0xymoro opened this issue Nov 8, 2023 · 4 comments
Labels
feature request New feature or request triaged Issue has been triaged by maintainers

Comments

@0xymoro
Contributor

0xymoro commented Nov 8, 2023

Hi, putting this here:
https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen

The latency and throughput gains are significant, though the comparisons are against vLLM. It seems like TRT does batching a bit differently, so I'm unsure whether this applies equally here.
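
For readers unfamiliar with the technique: Dynamic SplitFuse splits long prompts into fixed-size chunks and fuses those chunks with ongoing decode tokens, so every engine step processes a roughly constant token budget. Below is a minimal, self-contained sketch of that scheduling idea; this is not DeepSpeed's or TensorRT-LLM's actual scheduler, and all names are illustrative.

```python
# Toy illustration of the Dynamic SplitFuse idea: long prompts are split
# into fixed-size chunks so each engine step processes at most
# `token_budget` tokens, mixing prefill chunks with decode tokens.
# Not DeepSpeed's or TensorRT-LLM's scheduler; all names are made up.

from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_len: int          # total prompt tokens to prefill
    prefilled: int = 0       # prompt tokens already prefilled
    decoding: bool = False   # True once prefill is complete

def schedule_step(requests, token_budget=512, chunk_size=256):
    """Compose one engine step: decode tokens first, then prefill chunks."""
    batch, used = [], 0
    # Each decoding request contributes exactly one token per step.
    for r in requests:
        if r.decoding and used < token_budget:
            batch.append((r.rid, "decode", 1))
            used += 1
    # Fill the remaining budget with prefill chunks.
    for r in requests:
        if r.decoding:
            continue
        remaining = r.prompt_len - r.prefilled
        take = min(chunk_size, remaining, token_budget - used)
        if take <= 0:
            continue
        batch.append((r.rid, "prefill", take))
        r.prefilled += take
        used += take
        if r.prefilled == r.prompt_len:
            r.decoding = True   # switches to decode on the next step
    return batch

reqs = [Request(0, 1000), Request(1, 300), Request(2, 0, decoding=True)]
print(schedule_step(reqs))
# [(2, 'decode', 1), (0, 'prefill', 256), (1, 'prefill', 255)]
```

The key property is that each step's token count stays near the budget, which keeps forward-pass latency uniform and lets decode tokens piggyback on prefill compute instead of stalling behind long prompts.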

@byshiue byshiue added feature request New feature or request triaged Issue has been triaged by maintainers labels Nov 9, 2023
@Shixiaowei02 Shixiaowei02 self-assigned this Nov 14, 2023
@ncomly-nvidia ncomly-nvidia removed the triaged Issue has been triaged by maintainers label Nov 14, 2023
@ncomly-nvidia
Collaborator

Hey @0xymoro, thanks for sharing!!

Yes, SplitFuse is an impressive advancement from DeepSpeed, and one we have been working on as well! Our implementation will be a little different due to differences in batching strategies, but the idea of chunking the prefill is the same. It will likely land in the next few releases.

@ncomly-nvidia ncomly-nvidia added the triaged Issue has been triaged by maintainers label Nov 14, 2023
@ncomly-nvidia
Collaborator

ncomly-nvidia commented Jan 3, 2024

Hey @0xymoro, chunked attention is now part of v0.7.1! We're still working on an example, so I'll leave this open until that's done.

Edit: the kernels were added in v0.7.1; the full feature will be in v0.8!

@Shixiaowei02
Collaborator

Shixiaowei02 commented Jan 8, 2024

Hi @0xymoro! Chunked context will be part of TensorRT-LLM v0.8. Thank you for your support!

@littletomatodonkey

Hi @Shixiaowei02, thanks for your great work! How can I use chunked context in TensorRT-LLM? Are there any docs?
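
For anyone else landing here with the same question, here is a hedged sketch of how chunked context is enabled, based on the flags documented for the v0.8/v0.9-era releases. The names below are assumptions taken from that era's docs and may differ across versions, so verify against the documentation for your installed release.

```python
# Hedged sketch, not official guidance: enabling chunked context with the
# Executor API that appeared around v0.9 (v0.8 exposed the same knob via
# GptManager's optional params). Names are assumptions from that era's
# docs; check your installed version.
#
# Build-time prerequisite (shell): chunked context requires paged context
# FMHA, e.g.
#   trtllm-build --checkpoint_dir ./ckpt --output_dir ./engine \
#                --use_paged_context_fmha enable
# The chunk size is tied to the engine's max_num_tokens setting.

import tensorrt_llm.bindings.executor as trtllm

config = trtllm.ExecutorConfig(enable_chunked_context=True)
executor = trtllm.Executor("./engine", trtllm.ModelType.DECODER_ONLY, config)
```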
