
GGUF compatible quantization (2, 3, 4 bit / any bit) #285

Merged: 5 commits merged from the gguf branch into main on Jan 7, 2024
Conversation

@casper-hansen (Owner) commented Jan 1, 2024

AWQ has only ever been able to run 4-bit quantization. However, with this integration, we can run any-bit quantization and export to llama.cpp for inference. This results in lower perplexity while ensuring compatibility with the GGUF ecosystem.
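
For reference, here is a minimal sketch of the intended workflow. The argument and script names are assumptions based on this PR's description and common llama.cpp tooling, not a spec; check the examples folder for the exact API.

```python
# Sketch: apply AWQ scaling/clipping only, keep FP16, then quantize with llama.cpp.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Search for and apply the AWQ scales + clipping, but skip the INT packing step so
# the saved checkpoint stays in FP16 (assumed flag name: export_compatible).
model.quantize(tokenizer, quant_config=quant_config, export_compatible=True)
model.save_quantized("mistral-7b-awq-scaled")

# The scaled FP16 checkpoint is then converted and quantized by llama.cpp, e.g.:
#   python convert.py mistral-7b-awq-scaled --outfile mistral-awq-f16.gguf
#   ./quantize mistral-awq-f16.gguf mistral-awq-q4_k_m.gguf q4_k_m
```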

The difference between plain GGUF and AWQ-scaled GGUF is most pronounced on the Q*_0 and Q*_1 quant types, but I mostly include perplexity numbers for llama.cpp's K-quant methods, since they reach the lowest perplexity.

Perplexity

Perplexity measured with: `./perplexity -m <gguf_model> -f wikitext-2-raw/wiki.test.raw -ngl 33`

Base Model: Mistral 7B (mistralai/Mistral-7B-v0.1)

FP16: 5.6934

| Method | Perplexity (duo_scaling=True) | Perplexity (duo_scaling=False) |
| --- | --- | --- |
| GGUF Q2_K | 6.1640 +/- 0.03474 | |
| GGUF Q3_K_M | 5.8881 +/- 0.03320 | |
| GGUF Q4_0 | 5.8189 +/- 0.03257 | |
| GGUF Q4_K_M | 5.7518 +/- 0.03231 | |
| AWQ 2-bit + GGUF Q2_K | 6.7290 +/- 0.04008 | |
| AWQ 3-bit + GGUF Q2_K | 6.1079 +/- 0.03482 | 6.1147 +/- 0.03474 |
| AWQ 3-bit + GGUF Q3_K_M | 5.9123 +/- 0.03376 | 5.9072 +/- 0.03362 |
| AWQ 4-bit + GGUF Q3_K_M | 5.8528 +/- 0.03304 | 5.8570 +/- 0.03306 |
| AWQ 4-bit + GGUF Q4_0 | 5.8127 +/- 0.03289 | 5.8018 +/- 0.03272 |
| AWQ 4-bit + GGUF Q4_K_M | 5.7415 +/- 0.03237 | 5.7396 +/- 0.03230 |
| AWQ 6-bit + GGUF Q4_0 | 5.8064 +/- 0.03261 | 5.8030 +/- 0.03259 |
| AWQ 6-bit + GGUF Q4_K_M | 5.7442 +/- 0.03226 | 5.7425 +/- 0.03226 |

Mixture of Experts Model: Mixtral 8x7B (mistralai/Mixtral-8x7B-v0.1)

| Method | Perplexity (duo_scaling=True) |
| --- | --- |
| GGUF Q2_K | 7.8406 +/- 0.04688 |
| GGUF Q3_K_M | 4.4192 +/- 0.02298 |
| GGUF Q4_0 | 4.2242 +/- 0.02167 |
| GGUF Q4_K_M | 4.2499 +/- 0.02188 |
| AWQ 2-bit + GGUF Q2_K | |
| AWQ 3-bit + GGUF Q2_K | |
| AWQ 3-bit + GGUF Q3_K_M | |
| AWQ 4-bit + GGUF Q3_K_M | 4.4301 +/- 0.02294 |
| AWQ 4-bit + GGUF Q4_0 | 4.2696 +/- 0.02182 |
| AWQ 4-bit + GGUF Q4_K_M | 4.2239 +/- 0.02158 |
| AWQ 6-bit + GGUF Q4_0 | |
| AWQ 6-bit + GGUF Q4_K_M | |

Chat Model: Llama 2 7B Chat (TheBloke/Llama-2-7B-Chat-fp16)

| Method | Perplexity |
| --- | --- |
| GGUF Q2_K | 8.5820 +/- 0.05855 |
| GGUF Q3_K_M | 7.8605 +/- 0.05264 |
| GGUF Q4_0 | 7.8797 +/- 0.05373 |
| GGUF Q4_K_M | 7.7172 +/- 0.05209 |
| AWQ 3-bit + GGUF Q2_K | 8.6528 +/- 0.05887 |
| AWQ 3-bit + GGUF Q3_K_M | 7.9620 +/- 0.05289 |
| AWQ 4-bit + GGUF Q3_K_M | 7.8312 +/- 0.05226 |
| AWQ 4-bit + GGUF Q4_0 | 7.8115 +/- 0.05293 |
| AWQ 4-bit + GGUF Q4_K_M | 7.7438 +/- 0.05220 |
| AWQ 6-bit + GGUF Q4_K_M | 7.7372 +/- 0.05234 |

casper-hansen changed the title from "GGUF compatible quantization (2, 3, 4 bit)" to "GGUF compatible quantization (2, 3, 4 bit / any bit)" on Jan 1, 2024
@vince62s commented Jan 2, 2024

I must be missing something, but what exactly does "AWQ 3-bit + GGUF Q2_K" mean? What is the exact pipeline?

@casper-hansen (Owner)

> I must be missing something, but what exactly does "AWQ 3-bit + GGUF Q2_K" mean? What is the exact pipeline?

It means you first apply the scales and clipping from AWQ, computed for 3 bits. The weights are kept in FP16. Then you quantize to the specified GGUF format.

@vince62s commented Jan 2, 2024

OK, I get it, so the only impact is here: https://github.com/casper-hansen/AutoAWQ/blob/main/awq/quantize/quantizer.py#L48
which means you could use the exact bit width of the GGUF quant (e.g. 3.25 or whatever) for the AWQ scale/clip computation. It probably wouldn't make a difference, but I get it. Thanks.
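
For context, the pseudo-quantization used during the scale search looks roughly like the following. This is a simplified sketch of group-wise asymmetric min/max quantization with a configurable `w_bit`, not the exact AutoAWQ code:

```python
import torch

def pseudo_quantize(w: torch.Tensor, w_bit: int = 4, group_size: int = 128) -> torch.Tensor:
    """Quantize and immediately dequantize weights group-wise; the output stays in float."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    max_val = w.amax(dim=1, keepdim=True)
    min_val = w.amin(dim=1, keepdim=True)
    max_int = 2 ** w_bit - 1                    # e.g. 15 for 4-bit (16 levels)
    scales = (max_val - min_val).clamp(min=1e-5) / max_int
    zeros = (-min_val / scales).round()
    w_q = torch.clamp((w / scales).round() + zeros, 0, max_int)
    return ((w_q - zeros) * scales).reshape(orig_shape)
```

Only `w_bit` (and the clipping search) differs between the "AWQ n-bit" rows in the tables above, and in principle `max_int` could be chosen to mimic a fractional effective bit width such as 3.25, which is what the comment above alludes to.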

@sorasoras

> It means you first apply the scales and clipping from AWQ, computed for 3 bits. The weights are kept in FP16. Then you quantize to the specified GGUF format.

So you could use AWQ to quantize in a way that fits exactly how GGUF works, for full compatibility. Am I understanding this correctly?

@casper-hansen (Owner)

> So you could use AWQ to quantize in a way that fits exactly how GGUF works, for full compatibility. Am I understanding this correctly?

This does not quantize the weights with AWQ; it only applies the AWQ scaling to the weights and keeps them in FP16. Since they are just FP16 weights, we can then apply GGUF quantization.

@sorasoras

> This does not quantize the weights with AWQ; it only applies the AWQ scaling to the weights and keeps them in FP16. Since they are just FP16 weights, we can then apply GGUF quantization.

AWQ 2-bit + GGUF Q2_K (6.7290 +/- 0.04008) has higher perplexity than GGUF Q2_K (6.1640 +/- 0.03474).
Does that mean you "quantize twice", i.e. apply GGUF quantization on top of the AWQ weights to further reduce size?
It would be interesting to show the file size of each method, to give an idea about the combinations beyond perplexity.

@casper-hansen (Owner) commented Jan 2, 2024

> AWQ 2-bit + GGUF Q2_K (6.7290 +/- 0.04008) has higher perplexity than GGUF Q2_K (6.1640 +/- 0.03474). Does that mean you "quantize twice", i.e. apply GGUF quantization on top of the AWQ weights to further reduce size? It would be interesting to show the file size of each method, to give an idea about the combinations beyond perplexity.

We are not actually doing AWQ quantization. As I referenced earlier, we only scale the weights, which is different from quantizing. The model's weights are adjusted according to the scales but not quantized; that is a separate process that we let llama.cpp run. This means the BPW and file size are the same as if you were to just use GGUF.

@JianbangZ commented Jan 3, 2024

> We are not actually doing AWQ quantization. As I referenced earlier, we only scale the weights, which is different from quantizing. The model's weights are adjusted according to the scales but not quantized; that is a separate process that we let llama.cpp run. This means the BPW and file size are the same as if you were to just use GGUF.

If AutoAWQ here is only used for applying scales, what's the benefit of using lower-bit AWQ if the final file size depends solely on the GGUF quantization? Isn't it better to just use 8-bit AWQ for the sake of better scaling factors? Please help elaborate.

@casper-hansen (Owner)

> If AutoAWQ here is only used for applying scales, what's the benefit of using lower-bit AWQ if the final file size depends solely on the GGUF quantization?

Isn't the benefit of AWQ limited in this case? No, and here is why. Quantization is just about packing weights into INT4; nothing special happens during that process that is related to minimizing the impact of quantization. In other words, for the FP16 -> INT4 conversion to have the least quantization error, we must first compute optimal scales and then apply them before converting to a quantized model.

  1. Scaling: We search for each weight's most optimal scaling factor. We do this with a loss function that uses pseudo-quantization to measure the difference between the FP16 and quantized outputs of every layer. We apply these scales (plus some weight clipping) after finding the most optimal scaling factor; see the sketch after this comment.
  2. Quantization: This part just converts the scaled weights from FP16 -> INT4. It is a practical step to make sure we can execute in a quantized format. Nothing fancy happens here other than packing the weights in a specific format that is compatible with the implemented CUDA kernel.

The practical step of quantizing to a specific format is handled by llama.cpp, while we apply the AWQ scales beforehand. I would highly recommend reading the paper; the concepts are quite different from other methods.

https://arxiv.org/pdf/2306.00978.pdf
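
Here is a simplified sketch of the scale search described in step 1, reusing the `pseudo_quantize` helper sketched earlier. It illustrates the idea (a grid search over a per-input-channel scaling exponent, scored by the output error under pseudo-quantization) rather than the exact AutoAWQ implementation:

```python
import torch

@torch.no_grad()
def search_best_scale(layer, x, w_bit=4, n_grid=20):
    """layer: an nn.Linear; x: calibration activations of shape (n_tokens, in_features)."""
    w_orig = layer.weight.data.clone()
    fp16_out = layer(x)                          # reference FP16 output
    x_mean = x.abs().mean(dim=0)                 # per-input-channel activation magnitude
    best_loss, best_scales = float("inf"), None
    for i in range(n_grid):
        ratio = i / n_grid
        # Activation-aware candidate: salient input channels get larger scales.
        scales = x_mean.pow(ratio).clamp(min=1e-4)
        scales = scales / (scales.max() * scales.min()).sqrt()
        # Scale weights up, pseudo-quantize, then fold the inverse scale back in,
        # so only the quantization error changes, not the layer's function.
        layer.weight.data = pseudo_quantize(w_orig * scales, w_bit=w_bit) / scales
        loss = (fp16_out - layer(x)).float().pow(2).mean().item()
        if loss < best_loss:
            best_loss, best_scales = loss, scales.clone()
    layer.weight.data = w_orig                   # restore the original weights
    return best_scales
```

In AutoAWQ the winning scales are then folded into the preceding operation (and a similar search is run for per-channel clipping thresholds) before the FP16 checkpoint is handed off for GGUF quantization.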

@casper-hansen (Owner)

The main benefit is a higher-quality model, as can be seen in most of the perplexity numbers above, except for Mixtral, which I am still working on better quantization for.

@JianbangZ

> The practical step of quantizing to a specific format is handled by llama.cpp, while we apply the AWQ scales beforehand. I would highly recommend reading the paper; the concepts are quite different from other methods.

Thank you for the elaboration. I am curious: if AutoAWQ here is only used for calculating scales, what is the point of mixing different bit widths between AWQ and GGUF? My understanding is that if the scales are calculated for 3-bit, for example, the GGUF quantization target should also be 3-bit to maintain consistency. Your experiment data, however, shows AWQ 4-bit + GGUF Q3_K_M > AWQ 3-bit + GGUF Q3_K_M. Is it because 3-bit AWQ is in general inaccurate/broken?

@casper-hansen (Owner)

> My understanding is that if the scales are calculated for 3-bit, for example, the GGUF quantization target should also be 3-bit to maintain consistency. Your experiment data, however, shows AWQ 4-bit + GGUF Q3_K_M > AWQ 3-bit + GGUF Q3_K_M. Is it because 3-bit AWQ is in general inaccurate/broken?

The reason is that Q3_K_M is a mixed-bit quantization in GGUF. That means the Q3_K_M format is not just INT3; it also has INT4 weights. We observe that INT4 is more effective for scaling in this case, likely because scaling for INT3 makes the quantization error much larger when INT3 scales are applied to INT4 weights. That is likely why AWQ 4-bit works better for the Q3_K_M format.

A future optimization in AutoAWQ could be the ability to do mixed-bit scaling. That could likely even improve AWQ quantization if applied thoughtfully, i.e. maybe some losses are higher than others and you could adjust the w_bit and retry to find a better scale.
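
To make the mixed-bit idea concrete, here is a purely hypothetical sketch of what per-tensor bit selection for the scale search could look like. Nothing like this exists in AutoAWQ today, and the layer-name heuristic below is invented for illustration only:

```python
# Hypothetical: run the AWQ scale search per tensor with the bit width that the
# target GGUF K-quant mix is expected to assign to that tensor.
GGUF_DEFAULT_BITS = {"Q2_K": 2, "Q3_K_M": 3, "Q4_K_M": 4}

def choose_scale_bits(layer_name: str, gguf_type: str = "Q3_K_M") -> int:
    # Assumption for illustration: some projections get a higher bit width in the
    # mixed K-quant formats, so the scale search mirrors that.
    upgraded = ("self_attn.v_proj", "mlp.down_proj")
    if gguf_type == "Q3_K_M" and any(name in layer_name for name in upgraded):
        return 4
    return GGUF_DEFAULT_BITS.get(gguf_type, 4)

# Each layer's scale/clip search would then run with
# w_bit = choose_scale_bits(layer_name, gguf_type) instead of a single global w_bit.
```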

@sorasoras

Some results from a Qwen-14B model:

  • Q3_K_M: PPL = 9.6685 +/- 0.06744
  • Q4_K_M: PPL = 9.5139 +/- 0.06592
  • Q5_K_M: PPL = 9.4058 +/- 0.06490
  • Q2_K: PPL = 10.8593 +/- 0.07482
  • Q8_0: PPL = 9.4008 +/- 0.06471
  • AWQ 4-bit + Q4_K_M: PPL = 9.4109 +/- 0.06500
  • AWQ 6-bit + Q4_K_M: PPL = 9.5216 +/- 0.06568
  • AWQ 6-bit + Q5_K_M: PPL = 9.4202 +/- 0.06487
  • AWQ 4-bit + Q3_K_M: PPL = 9.6123 +/- 0.06660
  • AWQ 4-bit + Q2_K: PPL = 9.8321 +/- 0.06761
  • AWQ 3-bit + Q2_K: PPL = 9.9867 +/- 0.06874

Looking forward to future mixed-bit scaling for further improvement.

Tensor types of the Q4_K_M Qwen-14B model:
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q5_0: 20 tensors
llama_model_loader: - type q8_0: 20 tensors
llama_model_loader: - type q4_K: 121 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 1 tensors

@casper-hansen (Owner)

@sorasoras Thanks for these numbers! They look particularly good to me. Great improvements; Q2 in particular is a large improvement. I have outlined the combinations below, and they look good!

Q2:

  • Q2_K: PPL = 10.8593 +/- 0.07482
  • AWQ 4-bit + Q2_K: PPL = 9.8321 +/- 0.06761

Q3:

  • Q3_K_M: PPL = 9.6685 +/- 0.06744
  • AWQ 4-bit + Q3_K_M: PPL = 9.6123 +/- 0.06660

Q4:

  • Q4_K_M: PPL = 9.5139 +/- 0.06592
  • AWQ 4-bit + Q4_K_M: PPL = 9.4109 +/- 0.06500

@sorasoras

Side note: have you looked into the SOTA 2-bit quants?

ggerganov/llama.cpp#4773 (comment)
This looks super interesting.
https://huggingface.co/ikawrakow/various-2bit-sota-gguf/tree/main

Perhaps AWQ could do optimization for these new quants? I am not so sure, though.

@casper-hansen (Owner)

> Side note: have you looked into the SOTA 2-bit quants? Perhaps AWQ could do optimization for these new quants?

I checked the reference code for the new SOTA 2-bit (kNN/QuIP-style) method. Many elements are similar to AWQ, but there are many unique aspects of this new method that are taken directly from QuIP#. You could certainly try to implement the unique aspects of QuIP# in AutoAWQ, like the importance matrix and the modifications for the E8 lattice search.

However, I don't think it is feasible for me to do these things alone, as AutoAWQ is already a large project to maintain mostly by myself. llama.cpp has a large community of open-source developers, so this is better suited to be implemented over there, since they also have a whole framework with specialized formats, CUDA kernels, and more that are constantly updated.

@ikawrakow

https://huggingface.co/ikawrakow/mistral-7b-quantized-gguf/blob/main/README.md has Mistral-7B quants in GGUF format where the perplexity seems lower throughout than what I see for AWQ in the above table. For convenience, here is a copy of the table that you will find there:

| Quantization | Model file | PPL (llama.cpp) | Quantization error | PPL (new quants) | Quantization error |
| --- | --- | --- | --- | --- | --- |
| Q3_K_S | mistral-7b-q3ks.gguf | 6.0692 | 6.62% | 6.0021 | 5.44% |
| Q3_K_M | mistral-7b-q3km.gguf | 5.8894 | 3.46% | 5.8489 | 2.75% |
| Q4_K_S | mistral-7b-q4ks.gguf | 5.7764 | 1.48% | 5.7349 | 0.75% |
| Q4_K_M | mistral-7b-q4km.gguf | 5.7539 | 1.08% | 5.7259 | 0.59% |
| Q5_K_S | mistral-7b-q5ks.gguf | 5.7258 | 0.59% | 5.7100 | 0.31% |
| Q4_0 | mistral-7b-q40.gguf | 5.8189 | 2.23% | 5.7924 | 1.76% |
| Q4_1 | mistral-7b-q41.gguf | 5.8244 | 2.32% | 5.7455 | 0.94% |
| Q5_0 | mistral-7b-q50.gguf | 5.7180 | 0.45% | 5.7070 | 0.26% |
| Q5_1 | mistral-7b-q51.gguf | 5.7128 | 0.36% | 5.7057 | 0.24% |

@casper-hansen (Owner)

@ikawrakow These are certainly large improvements. I will need to implement the llama.cpp perplexity computation, like I did for AutoGPTQ, to see whether this beats what I get in AWQ if I pack the weights and run native 4-bit inference.

Do you know why your numbers from the main branch are different from my results? E.g. for Q4_K_M yours is 5.7539 and mine is 5.7518.

@ikawrakow

Isn't a difference of 5.7539 vs 5.7518 negligible? I run these calculations using CUDA on an RTX 4080, and I do get differences of that order when I occasionally use my M2 laptop with the Metal backend.

@JianbangZ

Yes, I remember we talked about this issue half a year ago regarding the PPL calculation methods. They should be aligned:
ggerganov/llama.cpp#1877

@casper-hansen (Owner)

> Yes, I remember we talked about this issue half a year ago regarding the PPL calculation methods. They should be aligned: ggerganov/llama.cpp#1877

Yeah, llama.cpp should probably update its computation to match the original perplexity, but for now, I implemented it in AutoGPTQ and can just import that code into AutoAWQ. Either way, another good measure is the quantization error in %, like what was provided above. The relative % is easier to read and understand anyway.

https://github.com/PanQiWei/AutoGPTQ/blob/main/auto_gptq/utils/perplexity_utils.py
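
For reference, that computation boils down to something like the following. This is a simplified, non-overlapping-window sketch of the linked utility; the real code adds batching, progress reporting, and a configurable context length:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, tokenizer, text: str, ctx_len: int = 2048) -> float:
    device = next(model.parameters()).device
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    nll_sum, n_tokens = 0.0, 0
    for start in range(0, ids.shape[1] - 1, ctx_len):
        chunk = ids[:, start : start + ctx_len + 1]   # ctx_len inputs plus one shifted target
        if chunk.shape[1] < 2:
            break
        logits = model(chunk[:, :-1]).logits
        nll = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),      # (tokens, vocab)
            chunk[:, 1:].reshape(-1),                 # next-token targets
            reduction="sum",
        )
        nll_sum += nll.item()
        n_tokens += chunk.shape[1] - 1
    return float(torch.exp(torch.tensor(nll_sum / n_tokens)))
```

llama.cpp's ./perplexity windows and averages the text somewhat differently, which is one reason the absolute numbers in this thread do not match exactly across tools.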

@JianbangZ

Is there a good way to measure chat/fine-tuned models? Perplexity doesn't seem to make sense for fine-tuned models.

@casper-hansen (Owner)

> Is there a good way to measure chat/fine-tuned models? Perplexity doesn't seem to make sense for fine-tuned models.

There is no single golden measurement we can use. Even perplexity has its faults, but it's good enough as a proxy for quantization error. I always look at MMLU values for instruct/chat models, but that takes a long time to evaluate. You can also evaluate perplexity, but you would need special handling and a special dataset for this.

@JianbangZ

> There is no single golden measurement we can use. Even perplexity has its faults, but it's good enough as a proxy for quantization error.

I think using MT-Bench is better. MT-Bench should be able to load AWQ models with some mild changes, but I'm not sure how to load GGUF models with their framework:
https://github.com/lm-sys/FastChat/blob/722ab0299fd10221fa4686267fe068a688bacd4c/fastchat/model/model_adapter.py#L1644

@casper-hansen (Owner)

> I think using MT-Bench is better. MT-Bench should be able to load AWQ models with some mild changes, but I'm not sure how to load GGUF models with their framework.

MT-Bench uses GPT-4 to judge the model. I'm not a particular fan of this for many reasons; it can be very misleading.

@JianbangZ

> https://huggingface.co/ikawrakow/mistral-7b-quantized-gguf/blob/main/README.md has Mistral-7B quants in GGUF format where the perplexity seems lower throughout than what I see for AWQ in the above table.

Looks on par with or better than AWQ. Are you ready to make your private repo publicly available? Are these changes in ggerganov/llama.cpp#4773, or is it just built on top of the master branch?

@ikawrakow

> Looks on par with or better than AWQ. Are you ready to make your private repo publicly available? Are these changes in ggerganov/llama.cpp#4773, or is it just built on top of the master branch?

My repo, where I play with various quantization approaches (but also semi-regularly update with mainline llama.cpp), is a giant pile of spaghetti, so I wouldn't make it public in its current state (there are 23 quantization types in addition to what is available in mainline llama.cpp, plus a lot of exploration spaghetti). I'm contemplating whether to clean it up and make it public, or to pick the best pieces of it and contribute them to llama.cpp. PR ggerganov/llama.cpp#4773 is kind of a test of how adding stuff to llama.cpp will go. Note that the PR does not contain the quantization code; it just adds the kernels necessary for inference (but I have provided a copy of the quantization function for reference).

@sorasoras

> Note that the PR does not contain the quantization code; it just adds the kernels necessary for inference (but I have provided a copy of the quantization function for reference).

It would be nice if I could play around with the new SOTA 2-bit quants on other models.

casper-hansen merged commit a3db809 into main on Jan 7, 2024
@DD-DuDa commented Jan 12, 2024

Is the `q_group_size` in AutoAWQ consistent with the super-blocks of the k-quants in llama.cpp?

If so, should `q_group_size` be set to 16 when using Q2_K, where each block in Q2_K has 16 weights?

casper-hansen deleted the gguf branch on January 21, 2024