
I think I made the Triton-GPTQ model work with the GPTQ-for-LLaMa (Triton branch) repository. #734

Closed
BadisG opened this issue Apr 2, 2023 · 19 comments
Labels
enhancement New feature or request

Comments

@BadisG
Contributor

BadisG commented Apr 2, 2023

Hello!

I have a model that has all the GPTQ implementations; it's called "gpt-x-alpaca-13b-native-true_sequential-act_order-128g-TRITON".
This model was made using Triton, and it can be run on the webui with the current commit of GPTQ-for-LLaMa. Here's how:

  1. Do a regular clone of the GPTQ-for-LLaMa repository (triton is the main branch):
conda activate textgen
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt

That will install the Triton package in your textgen environment.

  2. Go to your anaconda/miniconda/mamba installation directory and find the autotuner.py script:
miniconda3\envs\textgen\lib\python3.10\site-packages\triton\runtime\autotuner.py 
  3. Once you've opened autotuner.py, change line 81 from this:
self.cache[key] = builtins.min(timings, key=timings.get)

to this:

self.cache[key] = min(timings, key=lambda x: (timings[x] if isinstance(timings[x], float) else float('inf')))
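
For context, the patch appears to guard against autotune configurations that fail to benchmark and leave a non-float entry in the timings dict; builtins.min(timings, key=timings.get) then has to compare a float with that object and raises a TypeError. The patched key function pushes any non-float timing to infinity, so failed configurations simply never win. Here's a minimal, self-contained Python sketch of that behaviour (the dict contents are invented for illustration; the real autotuner may store something else for failed configs):

# Hypothetical timings: one config failed and stored an exception instead of a float
timings = {"config_a": 0.8, "config_b": RuntimeError("out of resources"), "config_c": 0.5}

# The original line, builtins.min(timings, key=timings.get), would raise:
# TypeError: '<' not supported between instances of 'RuntimeError' and 'float'

# The patched key treats non-float timings as infinitely slow, so they never win
best = min(timings, key=lambda x: (timings[x] if isinstance(timings[x], float) else float('inf')))
print(best)  # config_c

If you're not sure where Triton is installed, running python -c "import triton.runtime.autotuner as m; print(m.__file__)" inside the textgen environment should print the path to autotuner.py.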

And there you have it! You can now run a model that has all the GPTQ implementations on the webui!

Here's an example of a successful run:

conda activate textgen 
python server.py --wbits 4 --auto-devices --pre_layer 30 --disk --groupsize 128
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /home/adduser/miniconda3/envs/triton/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/adduser/miniconda3/envs/triton/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
The following models are available:

1. alpaca-native-7b
2. gpt-x-alpaca-13b-native-true_sequential-128g-CUDA
3. gpt-x-alpaca-13b-native-true_sequential-act_order-128g-TRITON
4. gpt-x-alpaca-13b-native-true_sequential-act_order-CUDA
5. llama-13b-128g
6. llama-7b-128g
7. llamacpp-AlpacaXgpt4-q4_1

Which one do you want to load? 1-7

3

Loading gpt-x-alpaca-13b-native-true_sequential-act_order-128g-TRITON...
Loading model ...
Done.
Loaded the model in 46.58 seconds.
/home/adduser/miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

Output generated in 79.70 seconds (1.07 tokens/s, 85 tokens, context 59)

That "Output generated in 79.70 seconds (1.07 tokens/s, 85 tokens, context 59)" run produced output like this:

(screenshot of the generated text)

PS: If you're not using the --pre_layer flag, you'll get this error:

Loading gpt-x-alpaca-13b-native-true_sequential-act_order-128g-TRITON...
Traceback (most recent call last):
  File "/mnt/d/Large Language Models/text-generation-webui/server.py", line 276, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/mnt/d/Large Language Models/text-generation-webui/modules/models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "/mnt/d/Large Language Models/text-generation-webui/modules/GPTQ_loader.py", line 114, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
  File "/mnt/d/Large Language Models/text-generation-webui/modules/GPTQ_loader.py", line 36, in _load_quant
    make_quant(model, layers, wbits, groupsize, faster=faster_kernel, kernel_switch_threshold=kernel_switch_threshold)
TypeError: make_quant() got an unexpected keyword argument 'faster'

To fix this, open text-generation-webui\modules\GPTQ_loader.py and replace line 36:

make_quant(model, layers, wbits, groupsize, faster=faster_kernel, kernel_switch_threshold=kernel_switch_threshold)

with this:

make_quant(model, layers, wbits, groupsize)
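
If you switch between GPTQ-for-LLaMa branches often, a slightly more tolerant variant (purely illustrative, not part of the webui) is to pass only the keyword arguments that the installed make_quant actually accepts. The wrapper name and defaults below are hypothetical:

import inspect

def call_make_quant(make_quant, model, layers, wbits, groupsize,
                    faster_kernel=False, kernel_switch_threshold=128):
    # Inspect the installed make_quant and drop keyword arguments it doesn't know about
    params = inspect.signature(make_quant).parameters
    kwargs = {}
    if 'faster' in params:
        kwargs['faster'] = faster_kernel
    if 'kernel_switch_threshold' in params:
        kwargs['kernel_switch_threshold'] = kernel_switch_threshold
    return make_quant(model, layers, wbits, groupsize, **kwargs)

That way the same GPTQ_loader.py keeps working whether or not the branch you have checked out supports those arguments.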
@BadisG BadisG added the enhancement New feature or request label Apr 2, 2023
@tensiondriven
Contributor

@BadisG This is fantastic!

If you happen to have the CUDA version too, could you possibly tell us how your inference times compare between CUDA and Triton?

I had to look up --pre_layer; it looks like this allows you to send n layers to the CPU instead of the GPU. I assume you're using --pre_layer with no number, or with --pre_layer 0. I can see how that might be needed; hopefully, when we integrate this, we can work around it by providing a value if the user doesn't.

@BadisG
Contributor Author

BadisG commented Apr 4, 2023

@tensiondriven @oobabooga @qwopqwop200 Here's my CUDA vs TRITON comparison.

1) Context

  • I'm using an RTX 3060 (12 GB of VRAM)
  • I'm on Windows 10 but I'm using WSL2 to run the inferences
  • The TRITON model is this one -> gpt-x-alpaca-13b-native-true_sequential-act_order-128g-TRITON
  • The CUDA model is this one -> gpt-x-alpaca-13b-native-true_sequential-128g-CUDA-4bit

The TRITON model has act_order and the CUDA one doesn't, because this functionality currently only works on the webui for TRITON models.

2) Size of the models

CUDA -> 7 921 243 kb
TRITON -> 7 726 283 kb
Their sizes will play an important role in the analysis, as we'll measure the RAM and VRAM load sizes and look at the increase relative to the model size.

3) RAM load size (How much RAM you need to load the model)

CUDA -> 16 982 100 kb -> 2.14 times bigger than the model size
TRITON -> 16 214 000 kb -> 2.10 times bigger than the model size

Conclusion: they behave similarly in terms of RAM load size.

4) VRAM load size (What is the minimum amount of VRAM needed to run the model)

CUDA -> 8 186 796 kb -> 1.03 times bigger than the model size
(screenshot: CUDA VRAM usage)

TRITON -> 7 861 160 kb -> 1.01 times bigger than the model size
(screenshot: TRITON VRAM usage)

Conclusion: they behave similarly in terms of VRAM load size.
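
As a side note, here's a tiny throwaway Python snippet (not part of the webui) that computes the same load-size ratios from the raw numbers above, in case anyone wants to redo the math with their own models:

# Sizes in KB, copied from the measurements above
model_size = {"CUDA": 7_921_243, "TRITON": 7_726_283}
ram_load   = {"CUDA": 16_982_100, "TRITON": 16_214_000}
vram_load  = {"CUDA": 8_186_796, "TRITON": 7_861_160}

for name in model_size:
    print(f"{name}: RAM x{ram_load[name] / model_size[name]:.2f}, "
          f"VRAM x{vram_load[name] / model_size[name]:.2f}")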

5) Inference speed

CUDA: (screenshot of CUDA inference times)

TRITON: (screenshot of TRITON inference times)

Note: the very first inference will always be slower than the others, especially for the TRITON model, so I'll treat those as outliers and leave them out of the comparison.

(summary chart comparing CUDA and TRITON inference speeds)

  • CUDA wins by a big margin
  • TRITON seems to get the same speed no matter how verbose the inference is.
  • For the CUDA model, the more verbose it is, the faster it will be.

6) Conclusion?

My comparison isn't really apples to apples (the CUDA model doesn't have the act_order implementation in it).
But the results strongly suggest that the CUDA models are the superior versions, at least for speed.
In my opinion, once we get both kinds of models (with all the GPTQ implementations) running on the webui, we should also look at perplexity to make sure their quality is the same either way.

If that's the case, we should focus on CUDA, as it gives us the output significantly faster than TRITON.

@MetaIX
Contributor

MetaIX commented Apr 5, 2023

> My comparison isn't really apples to apples (the CUDA model doesn't have the act_order implementation in it).

Aren't --act-order + --groupsize compatible in the latest GPTQ CUDA branch? I managed to run llama 7B quantized with both arguments using the latest GPTQ and modifying GPTQ_loader.py

Also, @oobabooga I don't know if you're aware of this so I'll just ping you. (Sorry btw)

@oobabooga
Owner

I have just tried the new CUDA branch and the performance seems to be significantly slower for the models that I currently have. I am not sure if I am doing something wrong. I have tested these models:

  • MetaIX/Alpaca-30B-Int4-128G-Safetensors
  • mayaeary/pygmalion-6b-4bit-128g by @mayaeary

For the second one, these were the results:

Branch | Performance
Upstream cuda branch | Output generated in 17.35 seconds (11.47 tokens/s, 199 tokens, context 15)
https://github.com/oobabooga/GPTQ-for-LLaMa | Output generated in 3.16 seconds (29.08 tokens/s, 92 tokens, context 15)

This branch contains the necessary changes to run the upstream cuda branch: https://github.com/oobabooga/text-generation-webui/tree/new-qwop

I'll also tag @qwopqwop200

@qwopqwop200

> I have just tried the new CUDA branch and the performance seems to be significantly slower for the models that I currently have. I am not sure if I am doing something wrong. I have tested these models:
>
>   • MetaIX/Alpaca-30B-Int4-128G-Safetensors
>   • mayaeary/pygmalion-6b-4bit-128g by @mayaeary
>
> For the second one, these were the results:
>
> Branch | Performance
> Upstream cuda branch | Output generated in 17.35 seconds (11.47 tokens/s, 199 tokens, context 15)
> https://github.com/oobabooga/GPTQ-for-LLaMa | Output generated in 3.16 seconds (29.08 tokens/s, 92 tokens, context 15)
>
> This branch contains the necessary changes to run the upstream cuda branch: https://github.com/oobabooga/text-generation-webui/tree/new-qwop
>
> I'll also tag @qwopqwop200

Currently, cuda has been changed to implement act_order and groupsize at the same time. These changes make cuda inefficient. Therefore, triton is currently recommended, and it is normal for triton to be approximately twice as fast.

@BadisG
Contributor Author

BadisG commented Apr 5, 2023

@oobabooga how do you get a cuda model that has all the implementations working on the webui? I get errors when I try to load a cuda model that has "act_order" in it.

@qwopqwop200 To be honest, if you believe the Triton branch is the superior version, I don't understand why you maintain the CUDA branch. Users will inevitably convert high-quality models using the CUDA branch, and we'll be forced to rely on its performance.

By discontinuing the CUDA branch, you would encourage future users to adopt the Triton version exclusively, ensuring that everyone benefits from the enhanced version.

Even if you choose not to remove the CUDA branch, I suggest updating your README to emphasize the advantages of converting models using the Triton branch, including a performance comparison. This would incentivize users to favor the Triton model for their conversions.

@EyeDeck
Contributor

EyeDeck commented Apr 5, 2023

(Aimed at the #785 side of the conversation, not CUDA vs triton)


While performance isn't that much lower at small context sizes on the new CUDA branch, it scales extremely poorly with large context size.

It's not exactly a 1:1 comparison, but on the same 3090, same prompt, same text-generation-webui settings, etc., with two setups:

  • ooba's GPTQ-for-LLaMA fork; USBhost's LLAMA 30B --wbits 4 --act-order --true-sequential
  • Output generated in 35.77 seconds (5.56 tokens/s, 199 tokens, context 1848)

  • upstream CUDA branch; LLaMA 30B --wbits 4 --act-order --true-sequential --groupsize 1024
  • Output generated in 2810.99 seconds (0.07 tokens/s, 199 tokens, context 1848)

~1848 context size is what --cai-chat mode grows to with default settings after not all that long, so it's by no means an unusual use case. Users are not going to tolerate first having to redownload newly requantized models, then finding out that performance has dropped by a factor of 80, in exchange for a marginal improvement to output quality. It would lead to a ton of complaints, probably with instructions to roll everything back to some specific commit where performance is acceptable.

I have no idea what the cause is, someone with more pytorch knowledge would have to poke around at it. I don't think it's just my setup, I've fiddled with it + some GPTQ-for-LLaMA code a lot and never managed to make any improvement myself.

@BadisG
Contributor Author

BadisG commented Apr 5, 2023

@EyeDeck it looks like the repository got updated to make triton faster; you should do a git pull on the GPTQ-for-LLaMa repository and try again, maybe that'll fix your problem.

Edit: I tried the new commit, and it makes triton run really fast regardless of the number of tokens (only the very first inference is very slow; I don't know if that can be fixed).


@EyeDeck
Contributor

EyeDeck commented Apr 5, 2023

I have little doubt that triton is an improvement over the current CUDA branch, but I'm commenting on the proposal to support the latest GPTQ-for-LLaMA CUDA branch, which on my machine is up to almost 2 orders of magnitude slower than the older code that's supported here now. Unless someone works out some optimizations for the new CUDA branch, I don't think it's viable to update.

Switching over to triton as recommended by qwopqwop200 would be great, but it means entirely dropping native Windows compatibility until someone makes the necessary tweaks to get triton compiling on Windows (edit: and cards older than GeForce RTX 20-series). Which, who knows, might not be that hard to do; I've read a few comments mentioning custom Windows builds, but at best they list some 6-month+ out of date instructions for the necessary tweaks required to get it compiling, never with a proper git repo or pull request to look at.

Also, running tests with a tiny context size is about as interesting to me as how many FPS a GPU can render with the monitor turned off.

@BadisG
Contributor Author

BadisG commented Apr 5, 2023

@EyeDeck I'm also on Windows, but I can run triton through WSL2; you could do that as well, it's not that hard to set up.
But I agree with you, it would be good to get triton running natively on Windows.

@lolxdmainkaisemaanlu

Making the GPTQ kernel triton-exclusive and abandoning CUDA would be bad for people with cards at compute capability 7.0 and below, since anything below that is unsupported... Unless I'm misunderstanding something?

@YellowRoseCx

YellowRoseCx commented Apr 5, 2023

> I have just tried the new CUDA branch and the performance seems to be significantly slower for the models that I currently have. I am not sure if I am doing something wrong. I have tested these models:
>
>   • MetaIX/Alpaca-30B-Int4-128G-Safetensors
>   • mayaeary/pygmalion-6b-4bit-128g by @mayaeary
>
> For the second one, these were the results:
>
> Branch | Performance
> Upstream cuda branch | Output generated in 17.35 seconds (11.47 tokens/s, 199 tokens, context 15)
> https://github.com/oobabooga/GPTQ-for-LLaMa | Output generated in 3.16 seconds (29.08 tokens/s, 92 tokens, context 15)
>
> This branch contains the necessary changes to run the upstream cuda branch: https://github.com/oobabooga/text-generation-webui/tree/new-qwop
>
> I'll also tag @qwopqwop200

> Currently, cuda has been changed to implement act_order and groupsize at the same time. These changes make cuda inefficient. Therefore, triton is currently recommended, and it is normal for triton to be approximately twice as fast.

Triton is only 1/10th the speed or slower than CUDA for AMD GPUs

I went from 18 tokens/second to 1.5 tokens/s on a 13b model at 4 bits on a 6800xt

Getting rid of the cuda branch is bad for AMD users

@Ph0rk0z
Contributor

Ph0rk0z commented Apr 5, 2023

> By discontinuing the CUDA branch, you would encourage future users to adopt the Triton version exclusively, ensuring that everyone benefits from the enhanced version.

And if we can't use it, due to old GPU or windows? I suppose we should just forget about 4-bit then?

I say make branches for older cuda, newer cuda, and triton. Textgen can be made compatible with any of them. Heck, I don't even understand why we abandoned the old model format; all it takes is running an older version of GPTQ.

These things are not all mutually exclusive where the code MUST change for the sake of novelty and everyone else is on their own.

https://github.com/johnsmith0031/alpaca_lora_4bit can also be used for inference, btw, not just for loading and training loras. I haven't touched the v2 version yet but I'm guessing it will be the same. It supports parallelism and offload too, and its model loading can be made generic like gptq_loader.py was.

I am still using it with v1 models (llama/opt/gpt-j) and it's fast, with no problems at all. It's even a tad faster than the new(er) cuda implementation, with less delay on the initial generation. In fact, the only benefit of gptq-v2 for me is act_order and true_sequential raising the "smartness" score ever so slightly.

@da3dsoul
Contributor

da3dsoul commented Apr 5, 2023

Ugh, Python people and their willingness to make breaking changes. Here's an idea: don't lock things to outdated branches. Make the setup script/instructions just change the relevant commands to install the right GPTQ-for-LLaMa version. You can detect and/or store the version and read it back later to decide whether to run the old or the new code at inference time.
This is the kind of thing Python devs have been doing for years to work around the Python 2 vs 3 API changes.
Ideally the models would have version info embedded so that they could run on any version, or at least tell you which versions you need to install.
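
For what it's worth, here's a rough Python sketch of the "version info embedded in the model" idea: a small sidecar file written at quantization time and read back before loading. The file name, field names, and functions below are hypothetical, just to illustrate the shape of it:

import json
from pathlib import Path

def write_quant_meta(model_dir, gptq_version, wbits, groupsize):
    # Written once, when the model is quantized/converted
    meta = {"gptq_version": gptq_version, "wbits": wbits, "groupsize": groupsize}
    Path(model_dir, "quantize_config.json").write_text(json.dumps(meta, indent=2))

def read_quant_meta(model_dir):
    # Read by the loader to pick the matching GPTQ code path (or warn the user)
    path = Path(model_dir, "quantize_config.json")
    if path.exists():
        return json.loads(path.read_text())
    return {}  # older models without metadata: fall back to the CLI flags

The loader could then dispatch on gptq_version (v1 vs v2, cuda vs triton) instead of guessing from the file name.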

@ye7iaserag
Contributor

@da3dsoul this is mainly why I hate pip, though maybe I'm being too hard on pip and it's really about the practices of the developers.
There should be some sort of locking mechanism, like every other respectable package manager has.

@da3dsoul
Contributor

da3dsoul commented Apr 5, 2023

> this is mainly why I hate pip, though maybe I'm being too hard on pip and it's really about the practices of the developers. There should be some sort of locking mechanism, like every other respectable package manager has.

It's a problem that plagues the entire Python ecosystem. Supposedly, Python was about democratizing programming by being easy, but all that did was invite inexperience and poor planning. A programmer is 10% someone who can write code and 90% someone who can plan and problem-solve. If you can't do those, then you have no business writing code for anyone but yourself.
I'm from C# land. I'm not going to preach that I'm better than other people because of it, but I will absolutely preach that 1 hour of planning saves 30 hours of technical debt and support, and that support should last for as long as people can still get the version.
I'm trying not to be mean, but the mindset of Python's founders and supporters is almost universally wrong in how they handle these things. It hurts no one to keep old APIs functional. I don't even care if it requires a non-recommended setting, triggers warnings, etc.

What I would like is a system that can check first and then run, or try and fail gracefully, so the first-time setup experience is as easy as possible. I am still having issues just downloading various models and getting them to run, because "oh, that's the wrong version" or whatever. OK, then tell me; there should be a way to tell. I don't know if there's a way to run multiple versions of GPTQ and GPTQ-for-LLaMa side by side and swap them out, but that would be a game changer.

@oobabooga
Owner

No drama, please.

@Ph0rk0z
Contributor

Ph0rk0z commented Apr 6, 2023

> I don't know if there's a way to run multiple versions of GPTQ and GPTQ-for-LLaMa side by side and swap them out, but that would be a game changer.

Yes, easily. And you can still load the old models with a few simple changes and the correct GPTQ. You can even run several textgen-webui's side by side sharing the same environment. You may have to do a python setup_cuda.py install when switching between kernels but it takes all of 5 seconds after you compile it the first time.

edit: From the updated alpaca_lora_4bit repo, it appears possible to load both v1 and v2 models and perform inference. I haven't checked how well or fast it works, or whether there are caveats. This would mean one fork of GPTQ. Additionally, it supports parallelism, triton, and offload. With genericized loading functions it will do opt, llama, pygmalion, etc., and probably train loras for them too. This would eliminate even having to recompile the kernel.

edit 2: The new cuda implementation is 1/3 the speed for me.

edit 3: And just like that, V1 and V2 both work in the same repo: https://github.com/Ph0rk0z/text-generation-webui-testing/tree/DualModel/. I still have to compare v1/v2/new/old cuda, and triton too if I ever get it fixed.

@agrocylo

agrocylo commented Apr 9, 2023

> Triton is only 1/10th the speed or slower than CUDA for AMD GPUs
>
> I went from 18 tokens/second to 1.5 tokens/s on a 13b model at 4 bits on a 6800xt
>
> Getting rid of the cuda branch is bad for AMD users

Can confirm that it's way slower on AMD. It's weird because the triton fork is directly maintained by AMD and seems mature, to the point that it's being upstreamed.
