New SOTA 2-Bit Quant released: QuIP-Sharp #4327
-
Just my few bits: given that the quantization scheme is totally different, the better question might be: is the Q2_K quant worse than the new one on the same model? Comparing to another model just by binary size is not very useful; models with more parameters are known to perform better even at heavy quantization. How the quantization works also matters: the k-quants are super nice to work with, since you only need to load a small block at a time.
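
For readers unfamiliar with the k-quants, here is a minimal sketch of what block-wise dequantization looks like. The layout is simplified for illustration and the names are made up; the real ggml `Q2_K` format packs 16-element sub-blocks with 4-bit scales and mins inside 256-element super-blocks:

```python
# Illustrative block-wise dequantization in the k-quants style:
# each sub-block has its own scale/min, so a kernel can dequantize
# just the small block it needs. Layout simplified for clarity.
import numpy as np

def dequant_super_block(q, sub_scales, sub_mins, d, dmin, sub=16):
    # q: 2-bit codes (values 0..3) for one 256-element super-block
    # d, dmin: super-block scales applied to the 4-bit sub-scales/mins
    out = np.empty(q.size, dtype=np.float32)
    for j in range(q.size // sub):
        sl = slice(j * sub, (j + 1) * sub)
        # per-sub-block affine dequant: x = (d * scale) * q - (dmin * min)
        out[sl] = d * sub_scales[j] * q[sl] - dmin * sub_mins[j]
    return out
```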
-
Did you see the perplexity? That 2-bit model is drunk as hell :D
-
@Dampfinchen Where do the perplexities you posted above come from? If there is still interest in better quantization approaches, I can publish k-quants models with lower perplexities, including (almost) pure 2-bit models.

It is possible to do better than this (slightly smaller model sizes and lower perplexities), but that requires a significant change in `llama.cpp`.
-
OK, here is a more apples-to-apples comparison to the results published in the QuIP# paper. They computed perplexities using a context window of 2048 for LLaMA-1 and 4096 for LLaMA-2 (see the table near the end of the paper), so the table below does the same.

Note that there isn't a one-to-one correspondence in model sizes between QuIP# and the k-quants.
-
Hi, one of the QuIP# authors here. Thanks for your interest in our work and for putting in the effort to run this comparison! QuIP# has two core components: incoherence processing and a lattice codebook.
During inference, we first run a Hadamard transform, do a matmul with the quantized weight matrix, and then do a reverse Hadamard transform on the output. Our implementation for E8P can be found here. We have CUDA kernels for doing these operations efficiently that can be used to guide integration into llama.cpp. QuIP# should be relatively straightforward to implement since it uses the same compression scheme for every weight, vs mixed-precision methods that use different precisions for different weights.

QuIP# achieves true 2-bit models, whereas other "2 bit" methods with grouping usually end up with significantly more than 2 bits per weight. Our experiments show that our method achieves state-of-the-art results at true 2 bits. Regarding your comparison with Q2K (2.6 bits) and Q2K* (2.3 bits), this is not really an apples-to-apples comparison because it compares approaches that quantize the embedding and output tensors with those that don't. If you want to compare methods that quantize all the tensors, we'll need to produce some QuIP# models that do that. One easy way to test the performance of QuIP# with a quantized embedding and output tensor is to copy your Q2K*-quantized embedding and output tensors into QuIP# and calculate perplexity with that.

@ikawrakow Can you share instructions on how to reproduce your numbers? Finally, this project is in active development, so we expect our method to improve in the coming months.
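
To make the inference path above concrete, here is a minimal sketch under some simplifying assumptions: the 2-bit codes are taken to be already decoded back into an fp matrix `W_hat` (the actual E8P lattice decode and the randomized sign flips live in the QuIP# repo), and the layer dimensions are powers of two so `scipy.linalg.hadamard` applies. This is not the real kernel code:

```python
# Sketch of a QuIP#-style forward pass: rotate the input with a
# normalized Hadamard transform, multiply by the quantized weights,
# then undo the rotation on the output side.
import torch
from scipy.linalg import hadamard

def quip_style_linear(x, W_hat):
    n_out, n_in = W_hat.shape
    H_in = torch.tensor(hadamard(n_in), dtype=x.dtype) / n_in ** 0.5
    H_out = torch.tensor(hadamard(n_out), dtype=x.dtype) / n_out ** 0.5
    xr = x @ H_in          # incoherence transform on the activations
    yr = xr @ W_hat.T      # matmul with the (decoded) quantized weights
    return yr @ H_out      # reverse transform on the output
```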
-
I have posted the LLaMA-v2 models quantized with the improved quantization method.
-
Interesting! This is just experimental right now, is that correct? Not committed to `llama.cpp` yet.
-
Yes, I have not contributed the improved quantization method to `llama.cpp`.
-
I have posted the 2-bit quantized LLaMA-v1 models on Huggingface in this repository. For some reason the 65B model is being rejected (yes, I have set up the repository to accept files larger than 5 GB, and have successfully pushed the 34B model, which is 10.8 GB).
-
Hi @ikawrakow, we ran your Q2K (not *, since those were not released) models from https://huggingface.co/ikawrakow/llama-v1-2bit-gguf/tree/main on the evaluation pipeline that we used to generate the QuIP# numbers. The numbers we are getting do not match the numbers you report: the Q2K perplexities we get are higher than what you report, and our model size on disk for QuIP# is also slightly smaller than what you report. These are the numbers we are getting:
The trend here is basically "you get what you pay for": the larger the model, the better the results, which is not surprising. We also ran a simple experiment with quantizing the embedding and output layers using a very naive algorithm: a Hadamard transform followed by round-to-nearest.

This is possibly the "dumbest" thing one can do with a Hadamard transform, as there is no Hessian information, adaptive rounding, or groupwise scaling. Thus, the results here for quantizing the QuIP# embedding/output layers with this algorithm should be taken as an upper bound on the perplexities achievable with QuIP# and quantized embedding/output layers. These are indicated in the table as QuIP# quant emb (k1) + output (k2), where k1 is the number of bits for the embedding and k2 is the number of bits for the output. QuIP# can achieve a significant reduction in size without sacrificing performance by quantizing these two layers, which I suspect is true of Q2K as well. In this setting, the difference in size between a 1.77G QuIP# model (QuIP# quant emb (4) + output (6)) and Q2K* (2.22G) is more than 20%.

Our code that generated these numbers is available at https://github.com/Cornell-RelaxML/quip-sharp/tree/q2k_test and the commands to run the code are in q2k.sh. The regular QuIP# numbers can be obtained from the code in the main branch. The embedding/output quantized models (which we saved as fp16 to avoid writing unnecessary code for this test) were generated by hack_emb.py and use the eval scripts in the main branch. We would highly appreciate it if you could tell us how you got your numbers in case we are misunderstanding something here.
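
A minimal sketch of what such a naive Hadamard + round-to-nearest scheme could look like, assuming a power-of-two matrix width (illustrative only; this is not the code in the q2k_test branch):

```python
# Naive k-bit quantization of a weight matrix: rotate with a normalized
# Hadamard transform, round to nearest with a single per-tensor scale
# (no Hessian, no adaptive rounding, no groupwise scaling), rotate back.
import torch
from scipy.linalg import hadamard

def naive_hadamard_rtn(W, nbits):
    n = W.shape[1]                                  # must be a power of two
    H = torch.tensor(hadamard(n), dtype=W.dtype) / n ** 0.5
    Wr = W @ H                                      # spread outliers out
    qmax = 2 ** (nbits - 1) - 1
    s = Wr.abs().max() / qmax                       # one scale per tensor
    q = torch.clamp(torch.round(Wr / s), -qmax - 1, qmax)
    return (q * s) @ H.T                            # dequantize, rotate back
```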
-
I have spent some more time looking into this, and have created 3 new quantizations: "pure" 2-bit, 2.156 bits per weight (bpw), and 2.28 bpw. The results for these, along with the original Q2_K, are in the table below.

Is there interest to integrate one or more of these into `llama.cpp`? If there is interest but 4 different 2-bit variants are considered too much, which should I pick?
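
As a side note on where fractional figures such as 2.156 or 2.28 bpw can come from: the effective bits per weight of a block format is the payload bits plus the per-block metadata amortized over the block size. The layouts in this sketch are made up for illustration; they are not the actual new formats:

```python
# Back-of-the-envelope effective bits-per-weight for a block format:
# code bits per weight plus amortized per-block metadata bits.
def effective_bpw(code_bits, block_size, metadata_bits):
    return code_bits + metadata_bits / block_size

# 2-bit codes + one fp16 scale per 256-weight super-block -> 2.0625 bpw
print(effective_bpw(2, 256, 16))
# add a 4-bit sub-scale per 16 weights on top             -> 2.3125 bpw
print(effective_bpw(2, 256, 16) + effective_bpw(0, 16, 4))
```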
-
Do you have the models from your latest table available online? And do you have fp16 numbers for your table? Thanks
Quoting @ikawrakow's earlier reply:
> How are you able to produce Q2_K LLaMA-v2-70B with a model size of 22.90 GiB and perplexity of 3.671?
>
> By using an "importance matrix". The quantization approach that is currently available in llama.cpp minimizes some similarity measure (e.g., RMSE, mean absolute difference, etc.) between the fp16 weights and the quantized weights without accounting for differences in weight importance. Because in practice some weights really are more important than others, one cannot go for full minimization of the difference, as this sometimes leads to very undesirable results. The ffn_down tensors are particularly sensitive, especially in the first few layers.
>
> Getting the infrastructure to generate and then use the importance matrix during quantization requires quite a bit of changes to llama.cpp/ggml. This is why, at least for now, I do the quantization using my forked llama.cpp version and publish the resulting models on HF.
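
To illustrate the idea of an importance-weighted objective, here is a minimal sketch (all names hypothetical; this is not llama.cpp's actual implementation): instead of minimizing the plain squared error, the scale is chosen to minimize an error weighted by a per-weight importance.

```python
# Hypothetical sketch: pick a per-block scale d and 2-bit codes q that
# minimize sum_i imp[i] * (w[i] - dequant(d, q[i]))^2 rather than the
# unweighted RMSE. imp could come from accumulated activation statistics.
import numpy as np

def quantize_block_weighted(w, imp, nbits=2):
    qmax = 2 ** nbits - 1                            # 2-bit codes: 0..3
    d0 = (np.max(np.abs(w)) + 1e-12) / (qmax / 2)    # naive max-abs scale
    best_d, best_q, best_err = d0, None, np.inf
    # scan candidate scales around the naive one
    for f in np.linspace(0.7, 1.3, 25):
        d = f * d0
        q = np.clip(np.round(w / d + qmax / 2), 0, qmax)
        err = np.sum(imp * (w - d * (q - qmax / 2)) ** 2)
        if err < best_err:
            best_d, best_q, best_err = d, q.astype(np.uint8), err
    return best_d, best_q
```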
-
Oobabooga implemented this in the webui, and in terms of memory it certainly seems a lot better than the current Q2_K, by a landslide. A Q2_K 13B model needs around 5.4 GB, while a 2-bit QuIP model only needs around 3.8 GB: https://huggingface.co/relaxml/Llama-2-13b-E8P-2Bit/tree/main. This means a 13B model can be fully offloaded on a 6 GB GPU.

Likewise, according to Oobabooga, a 70B model now fits entirely within 24 GB of VRAM at a context of 3072.

This is likely because it is true 2-bit and not a mixture of different bit widths, as is the case with the k-quants in llama.cpp.
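
The rough arithmetic behind those figures (my own back-of-the-envelope numbers, not from either project):

```python
# Weight storage is roughly params * bits_per_weight / 8. Q2_K averages
# well above 2 bpw because of per-block scales and some higher-precision
# tensors; QuIP# stores its matrices at a true 2 bpw but keeps the
# embedding and output tensors in fp16, which adds to the file size.
def weights_gib(params_billion, bpw):
    return params_billion * 1e9 * bpw / 8 / 2**30

print(f"13B @ 2.0 bpw: {weights_gib(13, 2.0):.1f} GiB")  # ~3.0 GiB
print(f"13B @ 2.6 bpw: {weights_gib(13, 2.6):.1f} GiB")  # ~3.9 GiB
```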
Based on the table provided by Oobabooga, perplexity looks promising.

I think answers to the following questions need to be pursued to check whether an implementation would make sense:

1. How does a Llama 2 7B model at Q4_K_S (which is 3.8 GB in size) compare, perplexity-wise, to a Llama 2 13B QuIP-Sharp model which is also 3.8 GB? If the 13B is better, then it would absolutely make sense to implement it.
2. How does Q2_K compare to 2-bit QuIP-Sharp, perplexity-wise? Even if it is worse than Q2_K, which is pretty likely, the massive memory savings can't be ignored. A 13B model measuring just around 3.8 GB is truly unprecedented.
According to Ooba's data, it's pretty interesting how pure 2-bit QuIP outperforms Exllama's 2.5 bpw quantization.

Note: you may have heard about QuIP in the past. QuIP-Sharp is a new version that is drastically improved.
Here's the link to the repo: https://github.com/Cornell-RelaxML/quip-sharp
@slaren @ggerganov @ikawrakow Curious to hear your opinions.