FA3 kvcache + split kv + gqa parallelization #1236

Draft · jayhshah wants to merge 41 commits into main

Conversation

jayhshah (Collaborator)

This PR adds split KV ("Flash decoding") and GQA parallelization improvements for FA3. Some essential parts of the KV cache API are added as well, including the cache_seqlens and cache_batch_idx arguments.
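
For context, here is a minimal usage sketch of how these arguments might be driven from Python, assuming an FA2-style flash_attn_with_kvcache entry point carries over to FA3 (the import path and exact signature below are assumptions, not something this PR pins down):

```python
# Sketch only: assumes an FA2-style flash_attn_with_kvcache entry point for FA3.
import torch
from flash_attn_interface import flash_attn_with_kvcache  # assumed import path

batch, qlen, qheads, kvheads, headdim, max_ctx = 4, 4, 16, 1, 128, 16384

q = torch.randn(batch, qlen, qheads, headdim, dtype=torch.float16, device="cuda")
k_cache = torch.zeros(batch, max_ctx, kvheads, headdim, dtype=torch.float16, device="cuda")
v_cache = torch.zeros_like(k_cache)

# cache_seqlens: number of tokens already present in each sequence's cache.
cache_seqlens = torch.full((batch,), 8192, dtype=torch.int32, device="cuda")
# cache_batch_idx: which cache row each batch entry reads from.
cache_batch_idx = torch.arange(batch, dtype=torch.int32, device="cuda")

out = flash_attn_with_kvcache(
    q, k_cache, v_cache,
    cache_seqlens=cache_seqlens,
    cache_batch_idx=cache_batch_idx,
    causal=True,
)
```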

Up to a 15x improvement over FA2 measured on my H100 PCIe in exceptional cases, e.g.:

DTYPE: FLOAT16, CAUSAL, QHEADS:16, KVHEADS:1, HEADDIM:128
CONTEXT:16384, BSZ:4, QLEN:4, FA2:402.86, FA3:26.93, NUM SPLITS:22, RATIO:14.96, GB/s:1245.77

Times are given in microseconds. GB/s is measured in terms of loading the KV cache; note that the theoretical max bandwidth for an H100 PCIe is 2 TB/s.
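
As a sanity check, the reported bandwidth follows from the benchmark line above if GB/s counts only the K and V bytes read from the cache:

```python
# Back-of-envelope check of the GB/s figure (fp16 = 2 bytes per element).
context, bsz, kvheads, headdim = 16384, 4, 1, 128
kv_bytes = 2 * context * bsz * kvheads * headdim * 2   # K and V, fp16
time_s = 26.93e-6                                      # FA3 time above
print(kv_bytes / time_s / 1e9)  # ~1246 GB/s, matching 1245.77 up to rounding
```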

TODO on this PR before merge: add a split-KV heuristic (one possible shape is sketched below) and implement for FP8.
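
One plausible shape for that heuristic is FA2's num_splits_heuristic from flash_api.cpp, simplified here in Python (a sketch of that existing FA2 logic, not necessarily what this PR will land): split until the SMs are nearly full, preferring the smallest split count whose occupancy is close to the best achievable.

```python
import math

# Simplified port of FA2's num_splits_heuristic (csrc/.../flash_api.cpp).
# total_blocks: batch * kv_heads * m_blocks scheduled without splitting.
def num_splits_heuristic(total_blocks, num_sms, num_n_blocks, max_splits=128):
    if total_blocks >= 0.8 * num_sms:          # already near-full occupancy
        return 1
    max_splits = min(max_splits, num_sms, num_n_blocks)

    def efficiency(n_splits):
        n_waves = total_blocks * n_splits / num_sms
        return n_waves / math.ceil(n_waves)    # fraction of the last wave used

    best = max(efficiency(s) for s in range(1, max_splits + 1))
    for s in range(1, max_splits + 1):
        if efficiency(s) >= 0.85 * best:       # smallest "good enough" split
            return s
    return 1

# e.g. 4 blocks (BSZ:4, KVHEADS:1, decode) on a 114-SM H100 PCIe:
print(num_splits_heuristic(4, 114, num_n_blocks=128))  # -> 25
```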

fa3-decoding-times-091724.log

@@ -174,7 +175,8 @@ def forward(
     causal,
     descale_q=descale_q,
     descale_k=descale_k,
     descale_v=descale_v,
+    gqa_decoding=False,

Collaborator:

I wonder whether it makes sense to give the user an option to enable the GQA optimization for general use cases outside of decoding?

E.g., it's generally useful for small-seq_len prefill. In that case we don't really need split-KV, but we do want each threadblock to handle multiple Q heads that share the same KV head (see the sketch below).
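
To illustrate that grouping (the mapping below is only an assumption about one reasonable scheduling, not this PR's actual kernel logic):

```python
# Illustrative only: GQA maps groups of q heads onto one kv head by
# integer division, so a threadblock owning a kv head can reuse the
# same K/V tiles for every q head in its group.
qheads, kvheads = 16, 1
group_size = qheads // kvheads                      # q heads per kv head
kv_head_of = [qh // group_size for qh in range(qheads)]
print(kv_head_of)  # [0]*16: all 16 q heads share kv head 0's cache
```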

Collaborator:

Furthermore, does it make sense to just enable the GQA optimization by default when the input is GQA? I feel it won't cause perf regressions even for long sequence lengths.

Contributor:

I feel it might slow things down a bit, but I haven't tried.
