
GGUF compatible quantization (2, 3, 4 bit / any bit) #285

Merged: 5 commits merged from the gguf branch into main on Jan 7, 2024
Conversation

@casper-hansen (Owner) commented Jan 1, 2024

AWQ has only ever been able to run 4-bit quantization. However, with this integration, we can run any-bit quantization and export to llama.cpp for inference. This results in lower perplexity while ensuring compatibility with the GGUF ecosystem.
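
For reference, here is a minimal sketch of the intended workflow. The argument and script names are assumptions based on this PR's description and common llama.cpp tooling, not a spec; check the examples folder for the exact API.

```python
# Sketch: apply AWQ scaling/clipping only, keep FP16, then quantize with llama.cpp.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Search for and apply the AWQ scales + clipping, but skip the INT packing step so
# the saved checkpoint stays in FP16 (assumed flag name: export_compatible).
model.quantize(tokenizer, quant_config=quant_config, export_compatible=True)
model.save_quantized("mistral-7b-awq-scaled")

# The scaled FP16 checkpoint is then converted and quantized by llama.cpp, e.g.:
#   python convert.py mistral-7b-awq-scaled --outfile mistral-awq-f16.gguf
#   ./quantize mistral-awq-f16.gguf mistral-awq-q4_k_m.gguf q4_k_m
```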

The difference between plain GGUF and AWQ-scaled GGUF is most pronounced on the Q*_0 and Q*_1 quant types, but I mostly include perplexity numbers for llama.cpp's K-quant methods, since they reach the lowest perplexity.

Perplexity

Perplexity measured with: `./perplexity -m <gguf_model> -f wikitext-2-raw/wiki.test.raw -ngl 33`

Base Model: Mistral 7B (mistralai/Mistral-7B-v0.1)

FP16: 5.6934

| Method | Perplexity (duo_scaling=True) | Perplexity (duo_scaling=False) |
| --- | --- | --- |
| GGUF Q2_K | 6.1640 +/- 0.03474 | |
| GGUF Q3_K_M | 5.8881 +/- 0.03320 | |
| GGUF Q4_0 | 5.8189 +/- 0.03257 | |
| GGUF Q4_K_M | 5.7518 +/- 0.03231 | |
| AWQ 2-bit + GGUF Q2_K | 6.7290 +/- 0.04008 | |
| AWQ 3-bit + GGUF Q2_K | 6.1079 +/- 0.03482 | 6.1147 +/- 0.03474 |
| AWQ 3-bit + GGUF Q3_K_M | 5.9123 +/- 0.03376 | 5.9072 +/- 0.03362 |
| AWQ 4-bit + GGUF Q3_K_M | 5.8528 +/- 0.03304 | 5.8570 +/- 0.03306 |
| AWQ 4-bit + GGUF Q4_0 | 5.8127 +/- 0.03289 | 5.8018 +/- 0.03272 |
| AWQ 4-bit + GGUF Q4_K_M | 5.7415 +/- 0.03237 | 5.7396 +/- 0.03230 |
| AWQ 6-bit + GGUF Q4_0 | 5.8064 +/- 0.03261 | 5.8030 +/- 0.03259 |
| AWQ 6-bit + GGUF Q4_K_M | 5.7442 +/- 0.03226 | 5.7425 +/- 0.03226 |

Mixture of Experts Model: Mixtral 8x7B (mistralai/Mixtral-8x7B-v0.1)

| Method | Perplexity (duo_scaling=True) |
| --- | --- |
| GGUF Q2_K | 7.8406 +/- 0.04688 |
| GGUF Q3_K_M | 4.4192 +/- 0.02298 |
| GGUF Q4_0 | 4.2242 +/- 0.02167 |
| GGUF Q4_K_M | 4.2499 +/- 0.02188 |
| AWQ 2-bit + GGUF Q2_K | |
| AWQ 3-bit + GGUF Q2_K | |
| AWQ 3-bit + GGUF Q3_K_M | |
| AWQ 4-bit + GGUF Q3_K_M | 4.4301 +/- 0.02294 |
| AWQ 4-bit + GGUF Q4_0 | 4.2696 +/- 0.02182 |
| AWQ 4-bit + GGUF Q4_K_M | 4.2239 +/- 0.02158 |
| AWQ 6-bit + GGUF Q4_0 | |
| AWQ 6-bit + GGUF Q4_K_M | |

Chat Model: Llama 2 7B Chat (TheBloke/Llama-2-7B-Chat-fp16)

| Method | Perplexity |
| --- | --- |
| GGUF Q2_K | 8.5820 +/- 0.05855 |
| GGUF Q3_K_M | 7.8605 +/- 0.05264 |
| GGUF Q4_0 | 7.8797 +/- 0.05373 |
| GGUF Q4_K_M | 7.7172 +/- 0.05209 |
| AWQ 3-bit + GGUF Q2_K | 8.6528 +/- 0.05887 |
| AWQ 3-bit + GGUF Q3_K_M | 7.9620 +/- 0.05289 |
| AWQ 4-bit + GGUF Q3_K_M | 7.8312 +/- 0.05226 |
| AWQ 4-bit + GGUF Q4_0 | 7.8115 +/- 0.05293 |
| AWQ 4-bit + GGUF Q4_K_M | 7.7438 +/- 0.05220 |
| AWQ 6-bit + GGUF Q4_K_M | 7.7372 +/- 0.05234 |

casper-hansen changed the title from "GGUF compatible quantization (2, 3, 4 bit)" to "GGUF compatible quantization (2, 3, 4 bit / any bit)" on Jan 1, 2024
@vince62s commented Jan 2, 2024

I must be missing something, but what exactly does "AWQ 3-bit + GGUF Q2_K" mean? What is the exact pipeline?

@casper-hansen (Owner)

> I must be missing something, but what exactly does "AWQ 3-bit + GGUF Q2_K" mean? What is the exact pipeline?

It means you first apply the scales and clipping from AWQ, computed for 3 bits. The weights are kept in FP16. Then you quantize to the specified GGUF format.

@vince62s commented Jan 2, 2024

OK, I get it, so the only impact is here: https://github.com/casper-hansen/AutoAWQ/blob/main/awq/quantize/quantizer.py#L48
which means you could use the exact bit width of the GGUF quant (e.g. 3.25 or whatever) for the AWQ scale/clip computation. It probably wouldn't make a difference, but I get it. Thanks.
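
For context, the pseudo-quantization used during the scale search looks roughly like the following. This is a simplified sketch of group-wise asymmetric min/max quantization with a configurable `w_bit`, not the exact AutoAWQ code:

```python
import torch

def pseudo_quantize(w: torch.Tensor, w_bit: int = 4, group_size: int = 128) -> torch.Tensor:
    """Quantize and immediately dequantize weights group-wise; the output stays in float."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    max_val = w.amax(dim=1, keepdim=True)
    min_val = w.amin(dim=1, keepdim=True)
    max_int = 2 ** w_bit - 1                    # e.g. 15 for 4-bit (16 levels)
    scales = (max_val - min_val).clamp(min=1e-5) / max_int
    zeros = (-min_val / scales).round()
    w_q = torch.clamp((w / scales).round() + zeros, 0, max_int)
    return ((w_q - zeros) * scales).reshape(orig_shape)
```

Only `w_bit` (and the clipping search) differs between the "AWQ n-bit" rows in the tables above, and in principle `max_int` could be chosen to mimic a fractional effective bit width such as 3.25, which is what the comment above alludes to.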

@sorasoras

> It means you first apply the scales and clipping from AWQ, computed for 3 bits. The weights are kept in FP16. Then you quantize to the specified GGUF format.

So you could use AWQ to quantize in a way that fits exactly how GGUF works, for full compatibility. Am I understanding this correctly?

@casper-hansen (Owner)

> So you could use AWQ to quantize in a way that fits exactly how GGUF works, for full compatibility. Am I understanding this correctly?

This does not quantize the weights with AWQ; it only applies the AWQ scaling to the weights and keeps them in FP16. Since they are just FP16 weights, we can then apply GGUF quantization.

@sorasoras

> This does not quantize the weights with AWQ; it only applies the AWQ scaling to the weights and keeps them in FP16. Since they are just FP16 weights, we can then apply GGUF quantization.

AWQ 2-bit + GGUF Q2_K (6.7290 +/- 0.04008) has higher perplexity than GGUF Q2_K (6.1640 +/- 0.03474).
Does that mean you "quantize twice", i.e. apply GGUF quantization on top of the AWQ weights to further reduce size?
It would be interesting to show the file size of each method, to give an idea about the combinations beyond perplexity.

@casper-hansen (Owner) commented Jan 2, 2024

> AWQ 2-bit + GGUF Q2_K (6.7290 +/- 0.04008) has higher perplexity than GGUF Q2_K (6.1640 +/- 0.03474). Does that mean you "quantize twice", i.e. apply GGUF quantization on top of the AWQ weights to further reduce size? It would be interesting to show the file size of each method, to give an idea about the combinations beyond perplexity.

We are not actually doing AWQ quantization. As I referenced earlier, we only scale the weights, which is different from quantizing. The model's weights are adjusted according to the scales but not quantized; that is a separate process that we let llama.cpp run. This means the BPW and file size are the same as if you were to just use GGUF.

@JianbangZ commented Jan 3, 2024

> We are not actually doing AWQ quantization. As I referenced earlier, we only scale the weights, which is different from quantizing. The model's weights are adjusted according to the scales but not quantized; that is a separate process that we let llama.cpp run. This means the BPW and file size are the same as if you were to just use GGUF.

If AutoAWQ here is only used for applying scales, what's the benefit of using lower-bit AWQ if the final file size depends solely on the GGUF quantization? Isn't it better to just use 8-bit AWQ for the sake of better scaling factors? Please help elaborate.

@casper-hansen (Owner)

> If AutoAWQ here is only used for applying scales, what's the benefit of using lower-bit AWQ if the final file size depends solely on the GGUF quantization?

Isn't the benefit of AWQ limited in this case? No, and here is why. Quantization is just about packing weights into INT4; nothing special happens during that process that is related to minimizing the impact of quantization. In other words, for the FP16 -> INT4 conversion to have the least quantization error, we must first compute optimal scales and then apply them before converting to a quantized model.

  1. Scaling: We search for each weight's most optimal scaling factor. We do this with a loss function that uses pseudo-quantization to measure the difference between the FP16 and quantized outputs of every layer. We apply these scales (plus some weight clipping) after finding the most optimal scaling factor; see the sketch after this comment.
  2. Quantization: This part just converts the scaled weights from FP16 -> INT4. It is a practical step to make sure we can execute in a quantized format. Nothing fancy happens here other than packing the weights in a specific format that is compatible with the implemented CUDA kernel.

The practical step of quantizing to a specific format is handled by llama.cpp, while we apply the AWQ scales beforehand. I would highly recommend reading the paper; the concepts are quite different from other methods.

https://arxiv.org/pdf/2306.00978.pdf
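
Here is a simplified sketch of the scale search described in step 1, reusing the `pseudo_quantize` helper sketched earlier. It illustrates the idea (a grid search over a per-input-channel scaling exponent, scored by the output error under pseudo-quantization) rather than the exact AutoAWQ implementation:

```python
import torch

@torch.no_grad()
def search_best_scale(layer, x, w_bit=4, n_grid=20):
    """layer: an nn.Linear; x: calibration activations of shape (n_tokens, in_features)."""
    w_orig = layer.weight.data.clone()
    fp16_out = layer(x)                          # reference FP16 output
    x_mean = x.abs().mean(dim=0)                 # per-input-channel activation magnitude
    best_loss, best_scales = float("inf"), None
    for i in range(n_grid):
        ratio = i / n_grid
        # Activation-aware candidate: salient input channels get larger scales.
        scales = x_mean.pow(ratio).clamp(min=1e-4)
        scales = scales / (scales.max() * scales.min()).sqrt()
        # Scale weights up, pseudo-quantize, then fold the inverse scale back in,
        # so only the quantization error changes, not the layer's function.
        layer.weight.data = pseudo_quantize(w_orig * scales, w_bit=w_bit) / scales
        loss = (fp16_out - layer(x)).float().pow(2).mean().item()
        if loss < best_loss:
            best_loss, best_scales = loss, scales.clone()
    layer.weight.data = w_orig                   # restore the original weights
    return best_scales
```

In AutoAWQ the winning scales are then folded into the preceding operation (and a similar search is run for per-channel clipping thresholds) before the FP16 checkpoint is handed off for GGUF quantization.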

@casper-hansen (Owner)

The main benefit is a higher-quality model, as can be seen in most of the perplexity numbers above, except for Mixtral, which I am still working on better quantization for.

@JianbangZ

> The practical step of quantizing to a specific format is handled by llama.cpp, while we apply the AWQ scales beforehand. I would highly recommend reading the paper; the concepts are quite different from other methods.

Thank you for the elaboration. I am curious: if AutoAWQ here is only used for calculating scales, what is the point of mixing different bit widths between AWQ and GGUF? My understanding is that if the scales are calculated for 3-bit, for example, the GGUF quantization target should also be 3-bit to maintain consistency. Your experiment data, however, shows AWQ 4-bit + GGUF Q3_K_M > AWQ 3-bit + GGUF Q3_K_M. Is it because 3-bit AWQ is in general inaccurate/broken?

@casper-hansen (Owner)

> My understanding is that if the scales are calculated for 3-bit, for example, the GGUF quantization target should also be 3-bit to maintain consistency. Your experiment data, however, shows AWQ 4-bit + GGUF Q3_K_M > AWQ 3-bit + GGUF Q3_K_M. Is it because 3-bit AWQ is in general inaccurate/broken?

The reason is that Q3_K_M is a mixed-bit quantization in GGUF. That means the Q3_K_M format is not just INT3; it also has INT4 weights. We observe that INT4 is more effective for scaling in this case, likely because scaling for INT3 makes the quantization error much larger when INT3 scales are applied to INT4 weights. That is likely why AWQ 4-bit works better for the Q3_K_M format.

A future optimization in AutoAWQ could be the ability to do mixed-bit scaling. That could likely even improve AWQ quantization if applied thoughtfully, i.e. maybe some losses are higher than others and you could adjust the w_bit and retry to find a better scale.
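
To make the mixed-bit idea concrete, here is a purely hypothetical sketch of what per-tensor bit selection for the scale search could look like. Nothing like this exists in AutoAWQ today, and the layer-name heuristic below is invented for illustration only:

```python
# Hypothetical: run the AWQ scale search per tensor with the bit width that the
# target GGUF K-quant mix is expected to assign to that tensor.
GGUF_DEFAULT_BITS = {"Q2_K": 2, "Q3_K_M": 3, "Q4_K_M": 4}

def choose_scale_bits(layer_name: str, gguf_type: str = "Q3_K_M") -> int:
    # Assumption for illustration: some projections get a higher bit width in the
    # mixed K-quant formats, so the scale search mirrors that.
    upgraded = ("self_attn.v_proj", "mlp.down_proj")
    if gguf_type == "Q3_K_M" and any(name in layer_name for name in upgraded):
        return 4
    return GGUF_DEFAULT_BITS.get(gguf_type, 4)

# Each layer's scale/clip search would then run with
# w_bit = choose_scale_bits(layer_name, gguf_type) instead of a single global w_bit.
```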

@sorasoras

Some results from a Qwen-14B model:

  • Q3_K_M: PPL = 9.6685 +/- 0.06744
  • Q4_K_M: PPL = 9.5139 +/- 0.06592
  • Q5_K_M: PPL = 9.4058 +/- 0.06490
  • Q2_K: PPL = 10.8593 +/- 0.07482
  • Q8_0: PPL = 9.4008 +/- 0.06471
  • AWQ 4-bit + Q4_K_M: PPL = 9.4109 +/- 0.06500
  • AWQ 6-bit + Q4_K_M: PPL = 9.5216 +/- 0.06568
  • AWQ 6-bit + Q5_K_M: PPL = 9.4202 +/- 0.06487
  • AWQ 4-bit + Q3_K_M: PPL = 9.6123 +/- 0.06660
  • AWQ 4-bit + Q2_K: PPL = 9.8321 +/- 0.06761
  • AWQ 3-bit + Q2_K: PPL = 9.9867 +/- 0.06874

Looking forward to future mixed-bit scaling for further improvement.

Tensor types of the Q4_K_M Qwen-14B model:
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q5_0: 20 tensors
llama_model_loader: - type q8_0: 20 tensors
llama_model_loader: - type q4_K: 121 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 1 tensors

@casper-hansen (Owner)

@sorasoras Thanks for these numbers! They look particularly good to me. Great improvements; Q2 in particular is a large improvement. I have outlined the combinations below, and they look good!

Q2:

  • Q2_K: PPL = 10.8593 +/- 0.07482
  • AWQ 4-bit + Q2_K: PPL = 9.8321 +/- 0.06761

Q3:

  • Q3_K_M: PPL = 9.6685 +/- 0.06744
  • AWQ 4-bit + Q3_K_M: PPL = 9.6123 +/- 0.06660

Q4:

  • Q4_K_M: PPL = 9.5139 +/- 0.06592
  • AWQ 4-bit + Q4_K_M: PPL = 9.4109 +/- 0.06500

@sorasoras

Side note: have you looked into the SOTA 2-bit quants?

ggerganov/llama.cpp#4773 (comment)
This looks super interesting.
https://huggingface.co/ikawrakow/various-2bit-sota-gguf/tree/main

Perhaps AWQ could do optimization for these new quants? I am not so sure, though.

@casper-hansen (Owner)

> Side note: have you looked into the SOTA 2-bit quants? Perhaps AWQ could do optimization for these new quants?

I checked the reference code for the new SOTA 2-bit (kNN/QuIP-style) method. Many elements are similar to AWQ, but there are many unique aspects of this new method that are taken directly from QuIP#. You could certainly try to implement the unique aspects of QuIP# in AutoAWQ, like the importance matrix and the modifications for the E8 lattice search.

However, I don't think it is feasible for me to do these things alone, as AutoAWQ is already a large project to maintain mostly by myself. llama.cpp has a large community of open-source developers, so this is better suited to be implemented over there, since they also have a whole framework with specialized formats, CUDA kernels, and more that are constantly updated.

@ikawrakow

https://huggingface.co/ikawrakow/mistral-7b-quantized-gguf/blob/main/README.md has Mistral-7B quants in GGUF format where the perplexity seems lower throughout than what I see for AWQ in the above table. For convenience, here is a copy of the table that you will find there:

| Quantization | Model file | PPL (llama.cpp) | Quantization error | PPL (new quants) | Quantization error |
| --- | --- | --- | --- | --- | --- |
| Q3_K_S | mistral-7b-q3ks.gguf | 6.0692 | 6.62% | 6.0021 | 5.44% |
| Q3_K_M | mistral-7b-q3km.gguf | 5.8894 | 3.46% | 5.8489 | 2.75% |
| Q4_K_S | mistral-7b-q4ks.gguf | 5.7764 | 1.48% | 5.7349 | 0.75% |
| Q4_K_M | mistral-7b-q4km.gguf | 5.7539 | 1.08% | 5.7259 | 0.59% |
| Q5_K_S | mistral-7b-q5ks.gguf | 5.7258 | 0.59% | 5.7100 | 0.31% |
| Q4_0 | mistral-7b-q40.gguf | 5.8189 | 2.23% | 5.7924 | 1.76% |
| Q4_1 | mistral-7b-q41.gguf | 5.8244 | 2.32% | 5.7455 | 0.94% |
| Q5_0 | mistral-7b-q50.gguf | 5.7180 | 0.45% | 5.7070 | 0.26% |
| Q5_1 | mistral-7b-q51.gguf | 5.7128 | 0.36% | 5.7057 | 0.24% |

@casper-hansen (Owner)

@ikawrakow These are certainly large improvements. I will need to implement the llama.cpp perplexity computation, like I did for AutoGPTQ, to see whether this beats what I get in AWQ if I pack the weights and run native 4-bit inference.

Do you know why your numbers from the main branch are different from my results? E.g. for Q4_K_M yours is 5.7539 and mine is 5.7518.

@ikawrakow

Isn't a difference of 5.7539 vs 5.7518 negligible? I run these calculations using CUDA on an RTX 4080, and I do get differences of that order when I occasionally use my M2 laptop with the Metal backend.

@JianbangZ

Yes, I remember we talked about this issue half a year ago regarding the PPL calculation methods. They should be aligned:
ggerganov/llama.cpp#1877

@casper-hansen (Owner)

> Yes, I remember we talked about this issue half a year ago regarding the PPL calculation methods. They should be aligned: ggerganov/llama.cpp#1877

Yeah, llama.cpp should probably update its computation to match the original perplexity, but for now, I implemented it in AutoGPTQ and can just import that code into AutoAWQ. Either way, another good measure is the quantization error in %, like what was provided above. The relative % is easier to read and understand anyway.

https://github.com/PanQiWei/AutoGPTQ/blob/main/auto_gptq/utils/perplexity_utils.py
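
For reference, that computation boils down to something like the following. This is a simplified, non-overlapping-window sketch of the linked utility; the real code adds batching, progress reporting, and a configurable context length:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, tokenizer, text: str, ctx_len: int = 2048) -> float:
    device = next(model.parameters()).device
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    nll_sum, n_tokens = 0.0, 0
    for start in range(0, ids.shape[1] - 1, ctx_len):
        chunk = ids[:, start : start + ctx_len + 1]   # ctx_len inputs plus one shifted target
        if chunk.shape[1] < 2:
            break
        logits = model(chunk[:, :-1]).logits
        nll = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),      # (tokens, vocab)
            chunk[:, 1:].reshape(-1),                 # next-token targets
            reduction="sum",
        )
        nll_sum += nll.item()
        n_tokens += chunk.shape[1] - 1
    return float(torch.exp(torch.tensor(nll_sum / n_tokens)))
```

llama.cpp's ./perplexity windows and averages the text somewhat differently, which is one reason the absolute numbers in this thread do not match exactly across tools.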

@JianbangZ

Is there a good way to measure chat/fine-tuned models? Perplexity doesn't seem to make sense for fine-tuned models.

@casper-hansen (Owner)

> Is there a good way to measure chat/fine-tuned models? Perplexity doesn't seem to make sense for fine-tuned models.

There is no single golden measurement we can use. Even perplexity has its faults, but it's good enough as a proxy for quantization error. I always look at MMLU values for instruct/chat models, but that takes a long time to evaluate. You can also evaluate perplexity, but you would need special handling and a special dataset for this.

@JianbangZ

> There is no single golden measurement we can use. Even perplexity has its faults, but it's good enough as a proxy for quantization error.

I think using MT-Bench is better. MT-Bench should be able to load AWQ models with some mild changes, but I'm not sure how to load GGUF models with their framework:
https://github.com/lm-sys/FastChat/blob/722ab0299fd10221fa4686267fe068a688bacd4c/fastchat/model/model_adapter.py#L1644

@casper-hansen (Owner)

> I think using MT-Bench is better. MT-Bench should be able to load AWQ models with some mild changes, but I'm not sure how to load GGUF models with their framework.

MT-Bench uses GPT-4 to judge the model. I'm not a particular fan of this for many reasons; it can be very misleading.

@JianbangZ

> https://huggingface.co/ikawrakow/mistral-7b-quantized-gguf/blob/main/README.md has Mistral-7B quants in GGUF format where the perplexity seems lower throughout than what I see for AWQ in the above table.

Looks on par with or better than AWQ. Are you ready to make your private repo publicly available? Are these changes in ggerganov/llama.cpp#4773, or is it just built on top of the master branch?

@ikawrakow

> Looks on par with or better than AWQ. Are you ready to make your private repo publicly available? Are these changes in ggerganov/llama.cpp#4773, or is it just built on top of the master branch?

My repo, where I play with various quantization approaches (but also semi-regularly update with mainline llama.cpp), is a giant pile of spaghetti, so I wouldn't make it public in its current state (there are 23 quantization types in addition to what is available in mainline llama.cpp, plus a lot of exploration spaghetti). I'm contemplating whether to clean it up and make it public, or to pick the best pieces of it and contribute them to llama.cpp. PR ggerganov/llama.cpp#4773 is kind of a test of how adding stuff to llama.cpp will go. Note that the PR does not contain the quantization code; it just adds the kernels necessary for inference (but I have provided a copy of the quantization function for reference).

@sorasoras

> Note that the PR does not contain the quantization code; it just adds the kernels necessary for inference (but I have provided a copy of the quantization function for reference).

It would be nice if I could play around with the new SOTA 2-bit quants on other models.

casper-hansen merged commit a3db809 into main on Jan 7, 2024
@DD-DuDa commented Jan 12, 2024

Is the `q_group_size` in AutoAWQ consistent with the super-blocks of the k-quants in llama.cpp?

If so, should `q_group_size` be set to 16 when using Q2_K, where each block in Q2_K has 16 weights?

casper-hansen deleted the gguf branch on January 21, 2024