I think I made the Triton-GPTQ model work with the GPTQ-for-LLaMa (Triton branch) repository. #734
Comments
@BadisG This is fantastic! If you happen to have the CUDA version too, can you possibly tell us how your inference times compare between CUDA and Triton? I had to look up |
@tensiondriven @oobabooga @qwopqwop200 Here's my CUDA vs TRITON comparison.

1) Context
The TRITON model has

2) Size of the models

3) RAM load size (how much RAM you need to load the model)
Conclusion: they behave similarly on RAM load size.

4) VRAM load size (the minimum amount of VRAM needed to run the model)
Conclusion: they behave similarly on VRAM load size.

5) Inference speed
Note: the very first inference will always be slower than the following ones, especially for the TRITON model, so I'll treat those as outliers and won't count them in the comparison.

6) Conclusion?
My comparison isn't really apples to apples (as the CUDA model doesn't have the
If that's the case, we should focus on CUDA, as it gives us the output significantly faster than TRITON. |
Aren't --act-order + --groupsize compatible in the latest GPTQ CUDA branch? I managed to run llama 7B quantized with both arguments using the latest GPTQ and by modifying GPTQ_loader.py. Also, @oobabooga, I don't know if you're aware of this so I'll just ping you. (Sorry btw) |
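For reference, combining both options happens at conversion time in GPTQ-for-LLaMa. The command below is a sketch based on that repository's README from around this period, not the exact command used in this comment; treat the model path, calibration dataset name, and output filename as placeholders:

```bash
# Quantize LLaMA-7B to 4-bit with act-order and groupsize enabled at the same time (illustrative values)
CUDA_VISIBLE_DEVICES=0 python llama.py /path/to/llama-7b-hf c4 \
    --wbits 4 --true-sequential --act-order --groupsize 128 \
    --save llama7b-4bit-128g.pt
```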
I have just tried the new CUDA branch and the performance seems to be significantly slower for the models that I currently have. I am not sure if I am doing something wrong. I have tested these models:
For the second one, these were the results:
This branch contains the necessary changes to run the upstream cuda branch: https://github.com/oobabooga/text-generation-webui/tree/new-qwop
I'll also tag @qwopqwop200 |
Currently, cuda has been changed to implement act_order and groupsize at the same time, and these changes make cuda inefficient. Therefore, triton is currently recommended, and it is normal for triton to be approximately twice as fast. |
@oobabooga How do you get a cuda model that has all the implementations to work on the webui? I get errors when I try to load a cuda model that has "act_order" in it.

@qwopqwop200 To be honest, if you believe the Triton branch is the superior version, I don't understand why you maintain the CUDA branch. Users will inevitably convert high-quality models using the CUDA branch, and we'll be forced to rely on its performance. By discontinuing the CUDA branch, you would encourage future users to adopt the Triton version exclusively, ensuring that everyone benefits from the enhanced version. Even if you choose not to remove the CUDA branch, I suggest updating your README to emphasize the advantages of converting models using the Triton branch, including a performance comparison. That would give users an incentive to favor the Triton branch for their conversions. |
(Aimed at the #785 side of the conversation, not CUDA vs triton) While performance isn't that much lower at small context sizes on the new CUDA branch, it scales extremely poorly with large context sizes. It's not exactly a 1:1 comparison, but on the same 3090, same prompt, same text-generation-webui settings, etc., with two setups:
~1848 context size is what --cai-chat mode grows to with default settings after not all that long, so it's by no means an unusual use case. Users are not going to tolerate first having to redownload newly requantized models, then finding out that performance has dropped by a factor of 80, in exchange for a marginal improvement in output quality. It would lead to a ton of complaints, probably with instructions to roll everything back to some specific commit where performance is acceptable. I have no idea what the cause is; someone with more PyTorch knowledge would have to poke around at it. I don't think it's just my setup: I've fiddled with it and some GPTQ-for-LLaMA code a lot and never managed to make any improvement myself. |
@EyeDeck It looks like the repo got updated to make triton faster; you should do a git pull on the GPTQ-for-LLaMa repository and try it again, maybe that'll fix your problem. Edit: I tried the new commit, and it makes triton run really fast regardless of the number of tokens (only the very first inference is very slow, and I don't know if that can be fixed). |
I have little doubt that triton is an improvement over the current CUDA branch, but I'm commenting on the proposal to support the latest GPTQ-for-LLaMA CUDA branch, which on my machine is up to almost 2 orders of magnitude slower than the older code that's supported here now. Unless someone works out some optimizations for the new CUDA branch, I don't think it's viable to update. Switching over to triton as recommended by qwopqwop200 would be great, but it means entirely dropping native Windows compatibility until someone makes the necessary tweaks to get triton compiling on Windows (edit: and cards older than GeForce RTX 20-series). Which, who knows, might not be that hard to do; I've read a few comments mentioning custom Windows builds, but at best they list some 6-month+ out of date instructions for the necessary tweaks required to get it compiling, never with a proper git repo or pull request to look at. Also, running tests with a tiny context size is about as interesting to me as how many FPS a GPU can render with the monitor turned off. |
@EyeDeck I'm also on Windows, but I can run triton through WSL2; you should do that as well, it's not that hard to set up. |
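For anyone who hasn't used WSL2 before, the route suggested here looks roughly like the sketch below. This is not from the thread: it assumes a recent NVIDIA Windows driver (which exposes CUDA inside WSL2) and a conda install inside the Linux environment, and the package versions are illustrative:

```bash
# One-time, from an elevated PowerShell on the Windows side:
#   wsl --install -d Ubuntu

# Then, inside the Ubuntu (WSL2) shell:
conda create -n textgen python=3.10 -y
conda activate textgen
pip install torch --index-url https://download.pytorch.org/whl/cu118   # CUDA-enabled PyTorch build
pip install triton                                                     # the Triton compiler package
```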
Making the GPTQ kernel triton-exclusive and abandoning CUDA would be bad for people whose cards have compute capability 7.0 and below, since anything below that is unsupported... Unless I'm misunderstanding something? |
Triton is only 1/10th the speed of CUDA, or slower, on AMD GPUs. I went from 18 tokens/s to 1.5 tokens/s on a 13b model at 4 bits on a 6800xt. Getting rid of the cuda branch is bad for AMD users. |
And if we can't use it, due to an old GPU or Windows? I suppose we should just forget about 4-bit then? I say make branches for older cuda, newer cuda, and triton; textgen can be made compatible with any of them. Heck, I don't even understand why we abandoned the old model format. All it takes is running an older version of GPTQ. These things are not all mutually exclusive, where the code MUST change for the sake of novelty and everyone else is on their own. https://github.com/johnsmith0031/alpaca_lora_4bit can also be used for inference, btw, not just for loading and training loras. I have not touched the v2 version yet, but I'm guessing it will be the same. And it supported parallelism and offload too. Its model loading can be made generic like GPTQ_loader.py was. I am still using it with v1 models (llama/opt/gpt-j) and it's fast with no problems at all, even a tad faster than the new(er) cuda implementation, with less delay on the initial generation. In fact, my only benefit from gptq-v2 is act order and true sequential raising the "smartness" score ever so slightly. |
Ugh, Python people and their willingness to make breaking changes. Here's an idea: don't lock things to outdated branches. Make the setup script/instructions just change the relevant commands for installing the right GPTQ-for-LLaMA version. You can detect and/or store the version and read it back later to decide whether to run the old code or the new code at inference time. |
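A minimal sketch of that idea, purely as an illustration (the stored file name is hypothetical, not an existing webui mechanism): record which GPTQ-for-LLaMa commit the install was set up against, and check it out again before loading models quantized with it.

```bash
# At setup time: remember which GPTQ-for-LLaMa commit this install uses (hypothetical file name)
git -C repositories/GPTQ-for-LLaMa rev-parse HEAD > gptq_commit.txt

# Later, before running inference on models quantized with that revision,
# restore the exact commit so old and new formats each get matching loader code
git -C repositories/GPTQ-for-LLaMa fetch
git -C repositories/GPTQ-for-LLaMa checkout "$(cat gptq_commit.txt)"
```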
@da3dsoul This is mainly why I hate pip, but maybe I'm being too hard on pip and the complaint should really be about the practices of the developers. |
It's a problem that plagues the entire Python ecosystem. Supposedly, Python was about democratizing programming by being easy, but all that did was invite inexperience and poor planning. A programmer is 10% someone who can write code and 90% someone who can plan and problem-solve. If you can't do those, then you have no business writing code for anyone but yourself. What I would like is a system that checks first and then runs, or tries and fails gracefully, so that the first-time setup experience is as easy as possible. I am still having issues just downloading various models and getting them to run, because oh, that's the wrong version or whatever. OK, then tell me. There should be a way to tell. I don't know if there's a way to run multiple versions of GPTQ and GPTQ-for-LLaMA side by side and swap them out, but that would be a game changer. |
No drama, please. |
Yes, easily. And you can still load the old models with a few simple changes and the correct GPTQ. You can even run several textgen-webui's side by side, sharing the same environment. You may have to do a

Edit: From the updated alpaca_lora_4bit repo it appears possible to load both v1 and v2 models and perform inference; I haven't tried how well or how fast it is, or whether there are caveats. This would mean one fork of GPTQ. Additionally, it supports parallelism, triton, and offload. With genericized loading functions it will do opt, llama, pygmalion, etc., and probably train loras for them too. This would eliminate even having to recompile the kernel.

Edit 2: the new cuda implementation is 1/3 the speed for me.

Edit 3: and just like that, V1 and V2 both work in the same repo: https://github.com/Ph0rk0z/text-generation-webui-testing/tree/DualModel/ I still have to compare v1/v2/new/old cuda, and if I ever get triton fixed, that too. |
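As a rough illustration of the "several webuis side by side" point (not from the thread; the directory names and port numbers are assumptions), two checkouts can share one conda environment as long as they listen on different ports:

```bash
conda activate textgen

# First checkout (kept on the old GPTQ code), default port
cd ~/text-generation-webui-old
python server.py --listen-port 7860 &

# Second checkout (on the new GPTQ code), a different port
cd ~/text-generation-webui-new
python server.py --listen-port 7861 &
```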
Can confirm that it's way slower on AMD. It's weird, because the triton fork is directly maintained by AMD and seems mature, to the point that it's being upstreamed. |
Hello!
I have a model that has all the GPTQ implementations and it's called "gpt-x-alpaca-13b-native-true_sequential-act_order-128g-TRITON"
This model was made using Triton and it can be run on the webui with the current commit from GPTQ-for-LLaMa.
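The installation commands themselves aren't reproduced above; presumably the step amounts to putting the Triton branch of GPTQ-for-LLaMa under the webui's repositories folder and installing its requirements, roughly as sketched below (the repository URL, branch name, and paths are assumptions, not quoted from this post):

```bash
cd text-generation-webui/repositories
git clone -b triton https://github.com/qwopqwop200/GPTQ-for-LLaMa.git
cd GPTQ-for-LLaMa
pip install -r requirements.txt   # pulls in the triton package, among other dependencies
```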
That will install the Triton package on your textgen environment
to this:
And there you have it! You can now run a model that has all the GPTQ implementations on the webui!
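The launch command used for the run below isn't shown here; for a GPTQ model like this one, text-generation-webui is typically started with something along these lines (the --pre_layer value and the other flag values are illustrative assumptions, not the author's exact settings):

```bash
python server.py --model gpt-x-alpaca-13b-native-true_sequential-act_order-128g-TRITON \
    --wbits 4 --groupsize 128 --model_type llama --pre_layer 30 --chat
```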
Here's an example of a successful run:

With

Output generated in 79.70 seconds (1.07 tokens/s, 85 tokens, context 59)

giving something like this:

PS: If you're not using the --pre_layer flag, you'll get this error:

To fix this, open the GPTQ_loader.py file at text-generation-webui\modules\GPTQ_loader.py and replace line 36:

with this: