
llama : add Falcon LLM support #1602

Closed · someone13574 opened this issue May 26, 2023 · 210 comments
Labels: help wanted (Extra attention is needed), model (Model specific)

Comments

@someone13574 commented May 26, 2023

Falcon LLM 40B and 7B were just open-sourced under a license which allows commercial use (with royalties once revenue exceeds $1 million per year), and they are topping the Hugging Face Open LLM leaderboard. It seems to be based on a modified GPT-3 architecture. I'm wondering if support in llama.cpp would be considered.

https://huggingface.co/tiiuae/falcon-40b

@cmp-nct (Contributor) commented May 28, 2023

First we need to implement ggml

Mind elaborating on that? It does not seem to make sense in context.

From what I've read (I haven't tested it myself), the model seems significantly better than LLaMA. While it has a kind of shitty license for commercial growth (free until $1M/year revenue, then 10%), it's better than outright illegal.

It's using flash attention and multi-query attention. gg already has branches with flash attention.
I don't see that "implementation barrier"?

@cmp-nct (Contributor) commented May 28, 2023

I've just invested almost an hour of prompting into Instruct Falcon 40B, and it's significantly smarter than OpenAssistant 30B, despite being less well tuned.
It is smarter than Turbo in some of the tests I ran; not as good as Turbo overall, but I need to develop new tests now, as Falcon-40B can beat all of those I currently had in the "Legacy/GPT-4 only" section.

@dseddah commented May 28, 2023

There's a guy who provided a q4b (4-bit quantized) version of Falcon-7B; would it be of some use for llama.cpp?

https://github.com/Birch-san/falcon-play

@cmp-nct (Contributor) commented May 28, 2023

There's a guy who provided a q4b (4-bit quantized) version of Falcon-7B; would it be of some use for llama.cpp?

https://github.com/Birch-san/falcon-play

Falcon has the full precision binaries available here:
https://huggingface.co/tiiuae/falcon-40b/tree/main
https://huggingface.co/tiiuae/falcon-40b-instruct
https://huggingface.co/tiiuae/falcon-7b
https://huggingface.co/tiiuae/falcon-7b-instruct
https://huggingface.co/tiiuae/falcon-rw-1b

Development should start from there; the pre-quantized versions are not useful, imho.

I'm not 100% sure yet, but from my tests I believe we have a superior successor to LLaMA on our hands that covers all our use cases (from small to large).
I also tried some bias tests (given its origin); Falcon 40B Instruct is surprisingly unbiased, and it felt like a bit of Turbo- or GPT-4-style "As an AI model" tuning went into it.
It remains to be tested and compared in detail, of course.

It solved riddles that Turbo, Alpaca and OpenAssistant 30B cannot solve.

Put carefully: it looks like the 40B Falcon might outperform the largest 65B LLaMA (it does so in the benchmarks).

@danmaxis

I don't know why I'm not able to convert it to .ggml, like other models.

Loading model file /mnt/m/llama_model/falcon-40b/pytorch_model-00009-of-00009.bin
Traceback (most recent call last):
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1168, in <module>
    main()
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1148, in main
    model_plus = load_some_model(args.model)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1076, in load_some_model
    model_plus = merge_multifile_models(models_plus)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 583, in merge_multifile_models
    model = merge_sharded([mp.model for mp in models_plus])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 562, in merge_sharded
    return {name: convert(name) for name in names}
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 562, in <dictcomp>
    return {name: convert(name) for name in names}
                  ^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 537, in convert
    lazy_tensors: List[LazyTensor] = [model[name] for model in models]
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 537, in <listcomp>
    lazy_tensors: List[LazyTensor] = [model[name] for model in models]
                                      ~~~~~^^^^^^
KeyError: 'transformer.word_embeddings.weight'

@KerfuffleV2 (Collaborator)

@danmaxis

I don't know why I'm not able to convert it to .ggml, like other models.

Because it is a different type of model. LLaMA-based models have a certain structure. Falcon is not based on LLaMA: it has a different set of tensors, the tensors have different names, and so on.

The conversion app can't handle Falcon models yet.
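To illustrate the difference, here is a small sketch of how a tool could sniff a checkpoint's tensor names to tell a LLaMA-style model from a Falcon-style one. This helper is hypothetical and not part of convert.py; the Falcon name is taken from the traceback above, and the LLaMA names are the common Meta/HuggingFace ones.

```python
# Hypothetical sketch (not part of convert.py): inspect a checkpoint's tensor
# names to tell a LLaMA-style model from a Falcon-style one before converting.
import torch

def detect_architecture(checkpoint_path: str) -> str:
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    names = set(state_dict.keys())
    if any(n.startswith("transformer.word_embeddings") for n in names):
        return "falcon-like (BLOOM-style tensor names)"
    if any("tok_embeddings" in n or "embed_tokens" in n for n in names):
        return "llama-like"
    return "unknown"

# print(detect_architecture("pytorch_model-00001-of-00009.bin"))
```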

@jessejohnson (Contributor)

@danmaxis

I don't know why I'm not able to convert it to .ggml, like other models.

Because it is a different type of model. LLaMA-based models have a certain structure. Falcon is not based on LLaMA: it has a different set of tensors, the tensors have different names, and so on.

The conversion app can't handle Falcon models yet.

@KerfuffleV2 can you give me (us, really) an ELI5 of the LLaMA architecture and how it differs from, say GPT-3? Will be super grateful!

@klosax (Collaborator) commented May 30, 2023

How much of all the work done in this repo could easily be transferred to future models and architectures?

It looks like the happy days of the original LLaMA models may soon be over, as it starts to get beaten by models with different architectures and more attractive licensing. Open LLM Leaderboard

As the flora of LLM architectures continues to grow and new ones replace the old, I think this repo and the LLM examples in the ggml repo should be merged into something like ggml_llm.

ggml_llm would contain all the common LLM code (main inference / perplexity / file handling / quantization / sampling ...) and the code for each architecture could be added like a plugin at compile time (see the sketch below). The gpt4all-backend may be a good starting point for how such a structure could be built.

ggerganov/ggml#185
ggerganov/ggml#145 (comment)
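For illustration only, a minimal Python sketch of the kind of plugin registry being proposed. All names here (LlmArchitecture, build_graph, register) are invented for the example; nothing like this exists in either repo yet.

```python
# Hypothetical sketch of the proposed "ggml_llm" plugin idea: the common code
# (loading, sampling, quantization) calls into a per-architecture object that
# only knows how to map tensor names and build the compute graph.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class LlmArchitecture:
    name: str
    tensor_names: Dict[str, str]        # canonical name -> name in the checkpoint
    build_graph: Callable[..., object]  # builds the compute graph for one eval

ARCHITECTURES: Dict[str, LlmArchitecture] = {}

def register(arch: LlmArchitecture) -> None:
    ARCHITECTURES[arch.name] = arch

# Each architecture would be registered at startup, e.g.:
# register(LlmArchitecture("llama", {...}, build_llama_graph))
# register(LlmArchitecture("falcon", {...}, build_falcon_graph))
```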

@KerfuffleV2 (Collaborator)

@jessejohnson

can you give me (us, really) an ELI5 of the LLaMA architecture and how it differs from, say GPT-3?

I don't want to get too off-topic here, so if you want detailed information you'd probably be better off creating a discussion. I also don't really know the specific architecture of GPT-3, etc., so I can't tell you the exact way two specific types of model differ, just provide some general information.

This is a bit simplified, but a model consists of a bunch of tensors (just big arrays of numbers in various dimensions). The tensors generally have names, like transformer.word_embeddings.weight. Models are also usually set up with some top-level tensors and then a set of tensors that are repeated in a number of layers. So you might have main_tensor and then layer.0.tensor1, layer.0.tensor2, layer.1.tensor1, etc. How the tensors are named depends on both the model architecture and the file format: GGML might call the same tensor a different thing than the HuggingFace format does.

Anyway, to actually run a model one performs a bunch of math operations on those tensors. Some of the operations are simple like addition, multiplication, some are more complex and can have complicated logic internally like rope, alibi, matrix multiplication, etc.

Which tensors exist in a model and what sequence of those math operations is used to evaluate the model depend on the model architecture. While a LLaMA-based model might have main_tensor + layer.0.tensor2 * layer.0.tensor1 * 1.321, a Falcon model might have layer.0.first.weight / (main_bias * 0.5) + layer.0.second.bias or whatever. I just made up completely random names there; they don't actually relate to anything.

The code in a project like this one which evaluates a type of model it supports (say, LLaMA) is set up to look for tensors with specific names, grab that data, and perform the various operations in the correct order, and it also expects the result of those operations to be in a specific format.

Hopefully this makes it clearer why specific support needs to be added to ML tools for models that actually have a different architecture.
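To make that concrete, here is a toy illustration using the made-up tensor names from the comment above: the evaluator hard-codes both the tensor names and the operation sequence, so handing it a checkpoint from a different architecture fails immediately.

```python
import numpy as np

# Toy "checkpoint": just a dict of named arrays.
llama_like = {
    "main_tensor": np.ones((4, 4)),
    "layer.0.tensor1": np.eye(4),
    "layer.0.tensor2": np.full((4, 4), 0.5),
}

def eval_llama_like(model, x):
    # Hard-coded tensor names and op sequence, as in a real loader/evaluator.
    h = x @ model["layer.0.tensor1"]
    h = h * model["layer.0.tensor2"]
    return h + model["main_tensor"]

falcon_like = {"layer.0.first.weight": np.eye(4), "main_bias": np.ones((4, 4))}

eval_llama_like(llama_like, np.ones((1, 4)))     # works
# eval_llama_like(falcon_like, np.ones((1, 4)))  # KeyError: 'layer.0.tensor1'
```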

@jessejohnson (Contributor)

Thanks @KerfuffleV2, this is exactly what I was looking for!

@cmp-nct (Contributor) commented May 30, 2023

I took a look: Falcon is BLOOM-based and uses GPT-NeoX-style rotary embeddings and GELU activation.
https://huggingface.co/tiiuae/falcon-40b/commit/e7950c40d6bc9caca678af160de9c79f33f93699
It looks like most of it is already covered in https://github.com/NouamaneTazi/bloomz.cpp.

Though it looks like a bit of a nightmare to adapt everything :(
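Since GPT-NeoX-style rotary embeddings come up here, a minimal sketch of that flavour of RoPE as it is commonly implemented (the usual rotate-half convention); this is illustrative only and not taken from Falcon's modelling code.

```python
import torch

def rope_cache(seq_len, head_dim, base=10000.0):
    # Precompute cos/sin tables for rotary position embeddings.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)          # (seq_len, head_dim / 2)
    emb = torch.cat((freqs, freqs), dim=-1)   # (seq_len, head_dim)
    return emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    # q, k: (..., seq_len, head_dim); cos/sin broadcast over the leading dims.
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin
```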

@iHaagcom commented May 30, 2023

I took a look: Falcon is BLOOM-based and uses GPT-NeoX-style rotary embeddings and GELU activation. https://huggingface.co/tiiuae/falcon-40b/commit/e7950c40d6bc9caca678af160de9c79f33f93699 It looks like most of it is already covered in https://github.com/NouamaneTazi/bloomz.cpp.

Though it looks like a bit of a nightmare to adapt everything :(

Can bloomz.cpp run this model?

@cmp-nct (Contributor) commented May 30, 2023

Not without adaptation; I've not looked into the differences (aside from the parameter and layer counts), but there certainly are some.
Also, bloomz.cpp is bare-bones: no GPU support, etc.
It would be a nice first step to get it running there, but llama.cpp is the platform with all the features.

@real-andrew

While it has a kind of shitty license for commercial growth (free until $1M/year revenue, then 10%), it's better than outright illegal.

As of 3 hours ago, they tweeted that they will forgo any royalties for commercial and research uses. I don't know what this means in practice, but Falcon might become the first capable, genuinely open-source model we get.

@logicchains (Contributor)

They've just updated their Huggingface to confirm that the models are now available under Apache 2.0: https://huggingface.co/tiiuae .

@jessejohnson (Contributor)

According to their announcement on the official site, it's Falcon 40B that is now under Apache 2.0. Not sure if they intend to do the same for the smaller models, or if they plan an even larger, license-restricted one.

https://www.tii.ae/news/uaes-falcon-40b-worlds-top-ranked-ai-model-technology-innovation-institute-now-royalty-free

@cmp-nct (Contributor) commented May 31, 2023

They updated the main page, but not the model pages yet. They are just a bit slow to follow up, but it looks like we're getting a fully open-source model. Best thing ever exported from Abu Dhabi?

@Googulator commented May 31, 2023

All models and datasets from them are now confirmed to be Apache 2.0. The model repositories still contain the old license.txt, but the models themselves are tagged Apache.

@JohnAlcatraz

With Falcon-40B being significantly better than LLaMA-65B, and actually being fully open source under Apache 2.0, it's definitely the new king of open source LLMs. It would be great to see support for it in llama.cpp!


@nikisalli

I was actually able to convert, quantize and load the model, but there is some tensor math to debug and modify, and I have no 40 GB GPU to debug the tensor values at each layer, so it produces garbage for now.

I can give you the quantized model if you want to continue my work.

https://github.com/nikisalli/falcon.cpp


@klosax (Collaborator) commented May 31, 2023

I was actually able to convert, quantize and load the model, but there is some tensor math to debug and modify, and I have no 40 GB GPU to debug the tensor values at each layer, so it produces garbage for now.

Great work!
Why don't you start with the 7B model instead? It should require less memory.

@nikisalli

@klosax It is still too big! To debug the weights, the model needs to be loaded in fp16 on the GPU. This means that a 24 GB GPU is needed in the case of the 7B model, and I do not possess one.

@ghost commented May 31, 2023

Truthfully though the initial Falcon work should be done on 7B to ease development; I think the architecture is the same regardless of model size. If it gets traction I'm sure someone with a big GPU will hop in and help with the 40B 🤗

Like it or not, Llama is limited by its legality and truly open models like Falcon are the way forwards for llama.cpp.

@klosax (Collaborator) commented May 31, 2023

@nikisalli : On the model card it says "head_dim 64 Reduced to optimise for FlashAttention" but in the config.json the number is 128. Maybe try reducing it to 64?

@Green-Sky (Collaborator)

@nikisalli what do you need the GPU for? Why not the CPU? ggml/llama.cpp is known for its ability to run on the CPU, after all...

@nikisalli

I find it useful to run the PyTorch model with many print statements here and there to check that ggml is giving me the same numbers, so that I know which operations to touch.
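One way to do that without scattering print statements (just a suggestion; not necessarily how it was done here) is to register forward hooks and record every module's output, then compare those tensors against the values the ggml port produces. The module name used in the usage comment is hypothetical.

```python
import torch

def add_debug_hooks(model, store):
    # Record every submodule's output so it can be compared tensor-by-tensor
    # against the values produced by the ggml implementation.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                store[name] = output.detach().float().cpu()
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage (hypothetical): store = {}; add_debug_hooks(model, store); model(input_ids)
# then inspect e.g. store["transformer.h.0.self_attention"] and diff against ggml.
```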

@cmp-nct (Contributor) commented Jun 26, 2023

I synced up my original GGML fork created for Falcon40B (https://github.com/jploski/ggml/tree/falcon40b) to the current ggerganov/ggml origin master and merged in the fix-mul-mat branch from ggerganov/ggml#224 into it. On top of that I created a new branch https://github.com/jploski/ggml/tree/falcon40b-norepeat, which demonstrates the change to the broadcast logic which would be necessary to get rid of the ggml_repeat2 invocations:

ggerganov/ggml@2e30a2b

I did not commit it into https://github.com/cmp-nct/ggllm.cpp because at this stage I consider it as an early proof-of-concept:

  1. Not sure if that is the only place of PR/224 logic which would need to be modified in this way (e.g. does not cover BLAS, non-F32 and what not)
  2. Not sure about any undesired side effects on other potential users of PR/224 (but since it is still unmerged into master, maybe there are not so many users yet)
  3. Not sure about how elegant or general it is (but I suspect not very)
  4. I find it hard to elaborate why this exact change works. The short story is that I evaluated which K-Q vectors are multiplied together in the original ggml_repeat2 version and hammered on it long enough to obtain the same pairing up of the vectors for each attention head as in the original (and tested that the outputs match with two different falcon40b mini-model configs so far).

Awesome to see! I was halfway there; glad I stopped given you were successful!
I'll try to integrate it into the falcon repo today and see how well it behaves.

Just to be sure I get it all: you modified the broadcast PR to source from the kv head, right?
If that's the case, maybe a "broadcast_mode" flag in ggml_tensor could switch the behavior? (Not broadcast_mode_2 :) )
Same with repeat; I believe both modes are useful for GGML.
repeat() could be changed to support that mode instead of a separate repeat2 function.
The modes could be block_repeat / interleaving_repeat and block_broadcast / interleaving_broadcast.

Tensor math is not my strong point; no idea if those terms are fitting?

@jploski (Contributor) commented Jun 26, 2023

I synced up my original GGML fork created for Falcon40B (https://github.com/jploski/ggml/tree/falcon40b) to the current ggerganov/ggml origin master and merged in the fix-mul-mat branch from ggerganov/ggml#224 into it. On top of that I created a new branch https://github.com/jploski/ggml/tree/falcon40b-norepeat, which demonstrates the change to the broadcast logic which would be necessary to get rid of the ggml_repeat2 invocations:
ggerganov/ggml@2e30a2b
I did not commit it into https://github.com/cmp-nct/ggllm.cpp because at this stage I consider it as an early proof-of-concept:

  1. Not sure if that is the only place of PR/224 logic which would need to be modified in this way (e.g. does not cover BLAS, non-F32 and what not)
  2. Not sure about any undesired side effects on other potential users of PR/224 (but since it is still unmerged into master, maybe there are not so many users yet)
  3. Not sure about how elegant or general it is (but I suspect not very)
  4. I find it hard to elaborate why this exact change works. The short story is that I evaluated which K-Q vectors are multiplied together in the original ggml_repeat2 version and hammered on it long enough to obtain the same pairing up of the vectors for each attention head as in the original (and tested that the outputs match with two different falcon40b mini-model configs so far).

Awesome to see! I was halfway there; glad I stopped given you were successful! I'll try to integrate it into the falcon repo today and see how well it behaves.

Just to be sure I get it all: you modified the broadcast PR to source from the kv head, right?

I guess so. Consider a case of n_head=128 and n_head_kv=8. There are 8 kv head groups, each with 16 queries and 1 kv pair reused by all 16 queries within the same kv head group. So what this modification achieves is that, for any fixed head group index, the same key row index is picked from the src0 matrix to multiply with any queries that belong to the same kv head group.
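Restating that grouping as code (a tiny illustration using the numbers above, assuming consecutive query heads form each kv group, as described):

```python
n_head, n_head_kv = 128, 8
group_size = n_head // n_head_kv      # 16 query heads share one kv pair

def kv_head_for_query(q_head: int) -> int:
    # Every query head in the same group reads the same key/value row.
    return q_head // group_size

assert [kv_head_for_query(q) for q in (0, 15, 16, 127)] == [0, 0, 1, 7]
```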

If that's the case, maybe a "broadcast_mode" flag in ggml_tensor could switch the behavior? (Not broadcast_mode_2 :) ) Same with repeat; I believe both modes are useful for GGML. repeat() could be changed to support that mode instead of a separate repeat2 function. The modes could be block_repeat / interleaving_repeat and block_broadcast / interleaving_broadcast.

Tensor math is not my strong point; no idea if those terms are fitting?

I'm not sure at this point whether two different modes are really needed or if we could accept the "falcon hack" (maybe generalizing it further) as the "right" way of broadcasting. I think we should aim to reproduce the behavior of torch.broadcast_to, which my repeat2 function was intended to imitate (but we do have to maintain backward compatibility with repeat as well, so maybe you're right about the modes). But I can't claim I am much familiar with the exact semantics of torch's implementation of broadcasting either.
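For readers following along, the two repeat "modes" under discussion correspond roughly to the following PyTorch operations (a toy illustration with 2 kv heads; none of this is ggml code):

```python
import torch

kv = torch.tensor([[1], [2]])                 # toy: 2 kv heads, one value each

block = kv.repeat(2, 1)                       # tile the whole tensor: 1, 2, 1, 2
interleaved = kv.repeat_interleave(2, dim=0)  # repeat each row:       1, 1, 2, 2

# torch.broadcast_to avoids copying entirely, but only expands size-1 dims,
# e.g. viewing the 2 kv heads against 3 query heads per group:
expanded = torch.broadcast_to(kv.unsqueeze(1), (2, 3, 1))
```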

Perhaps @ggerganov can comment as well - the fix-mul-mat branch was originally intended for another use case, and it would be good to find out whether the "falcon hack" would negatively affect it, be neutral, or maybe even positive.

@ggerganov (Owner)

@jploski I'll test if the modified broadcast works for SAM - I think it might.
Just remind me if I don't respond in 1-2 days, as I might forget about this

@linuxmagic-mp

@ggerganov Just some thoughts: working with @cmp-nct's branch, I got Falcon-40B running on local hardware, and I'm following this branch as well, but it is getting confusing to keep up with the latest. Isn't it maybe time to create a single ggmlLLM.cpp (vs. llama.cpp) that all the contributors can make pull requests against? Time to start centralizing all this work? I'm running converts from 3 repos every other day to keep up ;) I want to move on to the 40b-instruct models and start fine-tuning the ggml versions, but I was hoping to see the consolidation first. Great work from all parties. It might help to bring more good people into the effort if the llama.cpp project were a little more generic across models.

@slaren (Collaborator) commented Jun 27, 2023

@linuxmagic-mp that's the plan already, check #1991 and ggerganov/ggml#220.

@ggerganov (Owner)

@linuxmagic-mp

Yes, as @slaren mentioned - llama.cpp will eventually support most of the LLMs out there after we complete these issues.

Regarding ggml changes - I still don't know what to do. Git submodule will not work - if you want to make a change in llama.cpp that involves updating ggml then you will have to push in the ggml repo and wait for the submodule to get synced - too complicated.

Anyway, I am thinking about this and eventually will figure out how to improve it

@linuxmagic-mp

@ggerganov Understood, of course. I guess my actual suggestion was that your work, and the contributors' work, has now gone far beyond "LLaMA", so maybe a name change is in order to bring even more contributors to the main branch ;) It'll get confusing to the masses when llama.cpp can/will be used for so many other models ;) Hey, it's a work in progress, and I think everyone is already amazed at the leaps and bounds almost daily.

@apage43 (Contributor) commented Jun 28, 2023

git submodule will not work - if you want to make a change in llama.cpp that involves updating ggml then you will have to push in the ggml repo and wait for the submodule to get synced - too complicated.

git subtree might be a better fit here

@Green-Sky (Collaborator)

Regarding ggml changes - I still don't know what to do. Git submodule will not work - if you want to make a change in llama.cpp that involves updating ggml then you will have to push in the ggml repo and wait for the submodule to get synced - too complicated.

Also impossible for downstream projects. Imagine doing llava.cpp: now you need clip.cpp and llama.cpp, both not having ggml as a submodule.

git subtree might be a better fit here

Yeah, the more we talk about it, the more git subtree and a monorepo crystallize as the only viable solutions.

@howard0su (Collaborator)

Sounds like a monorepo would make everything easier if we can create the right folder structure, even integrating SAM and Whisper. We can have multiple header files and multiple libs for downstream applications to pick from.

@philpax commented Jul 5, 2023

Not sure if this is the right issue for it or if it should be a separate issue, but I'd also like to +1 a monorepo. ggml is almost a monorepo itself - it'd be a good idea to bring whisper.cpp and llama.cpp into the fold.

Subtrees are very troublesome and prone to breaking, and the semantics are hard to understand, while submodules are annoying for developers who are working on both repositories at the same time.

@linuxmagic-mp

Just curious: I see this is on the roadmap and @ggerganov has flagged it for more help. It might be nice to get a head start by creating a list of what's needed for the "more help" part. @cmp-nct, what do you think?

@ggerganov (Owner)

We need to finish the gguf spec (ggerganov/ggml#302) and implement the new format in llama.cpp so that all existing functionality works correctly with it. When this is ready, we can start integrating Falcon.

In the meantime, we should simplify the convert / load implementation in convert.py and llama.cpp by removing obsolete formats / features, and in general prepare to integrate gguf into the project (see #1991)

@ggerganov (Owner)

@jploski I'll test if the modified broadcast works for SAM - I think it might. Just remind me if I don't respond in 1-2 days, as I might forget about this

Following up on this: tested with the current SAM inference and it still works, so I think the change is good and I will upstream it.

@skirodev commented Jul 13, 2023

@jploski It appears that the Falcon model utilizes the FlashAttention technique as far as my understanding goes, as mentioned in FlashAttention. I was wondering why your code in this context does not incorporate ggml_flash_attn() when performing the QKV calculation?

ggml/examples/falcon/main.cpp#539

@jploski (Contributor) commented Jul 13, 2023

@jploski It appears that the Falcon model utilizes the FlashAttention technique as far as my understanding goes, as mentioned in FlashAttention. I was wondering why your code in this context does not incorporate ggml_flash_attn() when performing the QKV calculation?

ggml/examples/falcon/main.cpp#539

Simply, I was unaware of FlashAttention.

The Python version appears to only utilize it through the scaled_dot_product_attention function, which can be backed by FlashAttention depending on PyTorch version. Anyhow, I suspect that in order to use FlashAttention with Multi-Query Attention (for n_head_kv > 1, i.e. in 40B model) the key vector would need to be explicitly ggml_repeat2-ed again, which is something we managed to get rid of through a fused broadcast-matrix-multiplication kernel. Note that the most current version of the Falcon implementation is in here:

https://github.com/cmp-nct/ggllm.cpp/blob/master/libfalcon.cpp#L2153

It would be interesting to see if ggml_repeat2 + ggml_flash_attn would work and make it perform better or worse, but I cannot examine it myself at present.

@cmp-nct (Contributor) commented Jul 13, 2023

@jploski It appears that the Falcon model utilizes the FlashAttention technique as far as my understanding goes, as mentioned in FlashAttention. I was wondering why your code in this context does not incorporate ggml_flash_attn() when performing the QKV calculation?
ggml/examples/falcon/main.cpp#539

Simply, I was unaware of FlashAttention.

The Python version appears to only utilize it through the scaled_dot_product_attention function, which can be backed by FlashAttention depending on PyTorch version. Anyhow, I suspect that in order to use FlashAttention with Multi-Query Attention (for n_head_kv > 1, i.e. in 40B model) the key vector would need to be explicitly ggml_repeat2-ed again, which is something we managed to get rid of through a fused broadcast-matrix-multiplication kernel. Note that the most current version of the Falcon implementation is in here:

https://github.com/cmp-nct/ggllm.cpp/blob/master/libfalcon.cpp#L2153

It would be interesting to see if ggml_repeat2 + ggml_flash_attn would work and make it perform better or worse, but I cannot examine it myself at present.

I'll need to dig into the original modelling.py again to verify, but I believe you did not miss any flash attention code; it wasn't in there, afaik.
I've not looked at the FlashAttention paper in detail; I thought it was only training-related. Falcon did mention it, but they also mentioned ALiBi.

@jploski (Contributor) commented Jul 13, 2023

@jploski It appears that the Falcon model utilizes the FlashAttention technique as far as my understanding goes, as mentioned in FlashAttention. I was wondering why your code in this context does not incorporate ggml_flash_attn() when performing the QKV calculation?
ggml/examples/falcon/main.cpp#539

Simply, I was unaware of FlashAttention.
The Python version appears to only utilize it through the scaled_dot_product_attention function, which can be backed by FlashAttention depending on PyTorch version. Anyhow, I suspect that in order to use FlashAttention with Multi-Query Attention (for n_head_kv > 1, i.e. in 40B model) the key vector would need to be explicitly ggml_repeat2-ed again, which is something we managed to get rid of through a fused broadcast-matrix-multiplication kernel. Note that the most current version of the Falcon implementation is in here:
https://github.com/cmp-nct/ggllm.cpp/blob/master/libfalcon.cpp#L2153
It would be interesting to see if ggml_repeat2 + ggml_flash_attn would work and make it perform better or worse, but I cannot examine it myself at present.

I'll need to dig into the original modelling.py again to verify, but I believe you did not miss any flash attention code; it wasn't in there, afaik. I've not looked at the FlashAttention paper in detail; I thought it was only training-related. Falcon did mention it, but they also mentioned ALiBi.

What I mean is that modelling_RW.py uses torch.nn.functional.scaled_dot_product_attention, and this in turn uses the FlashAttention algorithm by default in newer torch versions (which can be disabled - https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.enable_flash_sdp)
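For reference, this is the PyTorch mechanism being described (the tensors here are random placeholders, and a CUDA build of PyTorch 2.x is assumed): scaled_dot_product_attention dispatches to a fused FlashAttention kernel when it is eligible, and the flash path can be forced or disabled via the sdp_kernel context manager.

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# Let PyTorch pick a backend (FlashAttention is preferred when eligible).
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Restrict to the flash backend only, e.g. to check whether it is actually used.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out_flash_only = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```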

@cmp-nct (Contributor) commented Jul 13, 2023

What I mean is that modelling_RW.py uses torch.nn.functional.scaled_dot_product_attention, and this in turn uses the FlashAttention algorithm by default in newer torch versions (which can be disabled - https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.enable_flash_sdp)

darn. I didn't know

@jploski (Contributor) commented Jul 15, 2023

It would be interesting to see if ggml_repeat2 + ggml_flash_attn would work and make it perform better or worse, but I cannot examine it myself at present.

I implemented it here: jploski/ggml@fac72a28 (new branch falcon40b-flash, based off falcon40b in my original ggml fork). The ggml_repeat2 + ggml_flash_attn version turns out to be 16% slower than the falcon40b-norepeat branch on CPU for 2048-token generation using a falcon40b mini-model. I can't easily test on GPU right now, but I suspect it won't be much better due to the ggml_repeat2 overhead.

@dseddah commented Jul 20, 2023

Hi,
does anyone know if llama.cpp supports FalconLM 40B out of the box, or do I need to apply a patch?

@maddes8cht (Contributor)

llama.cpp only supports LLaMA-based models - hence the fancy name ;)
You need to use
https://github.com/cmp-nct/ggllm.cpp/ for Falcon models.

@dseddah commented Jul 20, 2023

llama.cpp only supports LLaMA-based models - hence the fancy name ;)
You need to use
https://github.com/cmp-nct/ggllm.cpp/ for Falcon models.

Oh, thanks :) I hope they support Metal already!

@ggerganov (Owner)

Closed via #2717
