k-quants #1684
Commits on Jun 3, 2023
Starting to add k-quantization to ggml
I think it is better to have quantization separate from ggml. For now just adding the k-quants there, but it would be better to also factor out the existing ggml quantizations.
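The general idea behind the k-quants can be sketched in a few lines: weights are grouped into super-blocks, each sub-block gets its own scale, and the sub-block scales are themselves quantized relative to a single float scale per super-block. The sketch below is an illustration of that two-level scheme with simplified shapes and a symmetric rounding rule — not the actual ggml block layout or rounding logic.

```python
import numpy as np

def quantize_super_block(w, wbits=4, sbits=6, sub=32):
    """Toy two-level k-quant-style scheme: each sub-block of `sub` weights
    gets its own scale, and those scales are quantized to `sbits` bits
    relative to one float scale per super-block. Illustrative only."""
    blocks = w.reshape(-1, sub)
    qmax = (1 << (wbits - 1)) - 1          # e.g. 7 for 4-bit signed weights
    smax = (1 << sbits) - 1                # e.g. 63 for 6-bit scales
    scales = np.abs(blocks).max(axis=1) / qmax   # one scale per sub-block
    d = scales.max() / smax                      # float super-block scale
    qscales = np.round(scales / d).astype(np.uint8)  # quantized sub-block scales
    eff = qscales[:, None] * d                   # reconstructed scales
    eff = np.where(eff > 0, eff, 1.0)            # guard against zero blocks
    q = np.clip(np.round(blocks / eff), -qmax, qmax).astype(np.int8)
    return q, qscales, d

def dequantize(q, qscales, d):
    return (q * (qscales[:, None] * d)).reshape(-1)

w = np.random.default_rng(0).standard_normal(256).astype(np.float32)
q, qs, d = quantize_super_block(w)
err = np.sqrt(np.mean((dequantize(q, qs, d) - w) ** 2))
```

The point of the second level is that storing one 6-bit scale per sub-block plus one fp16 scale per super-block is much cheaper than a full-precision scale per sub-block, while keeping per-sub-block adaptivity.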
Commit 8673a41
Commit b4f7134
Q3_K now working on CUDA and AVX2/scalar
CUDA is not ideal - ~50% slower than Q4_0 for single token prediction, about the same in batch mode (perplexity). CPU single token is ~55 ms (on Ryzen 7950X).
Commit c93cce3
Some improvement for Q3_K on CUDA
It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0.
Commit a3c0673
Some more CUDA optimizations for Q3_K
Single token is now 20.5 ms/token (~20% slower than Q4_0). Perplexity is on par with Q4_0.
Commit 3d8b1de
Adding Q4_K - scalar, AVX2, CUDA
Performance is the same or perhaps very slightly better than Q4_0 on the CPU. On the GPU, single token prediction is ~10% better than Q4_0; batch mode (perplexity) is about the same.
Commit a0b8e9f
Adding Q6_K - scalar, AVX2, CUDA
Performance is ~40% lower compared to Q4_K on the CPU. This is to be expected, considering that we are memory bound on the CPU and the 6-bit model is ~44% larger than the 4-bit. On the GPU, single token prediction is ~6% lower than Q4_0, batch mode (perplexity) is even closer (but still slower).
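The size deltas quoted for the 5- and 6-bit models follow from the effective bits per weight of the k-quant formats. If I read the block layouts right, Q4_K, Q5_K, and Q6_K come out to 4.5, 5.5, and 6.5625 bits per weight including scale overhead; a quick check (treat the bpw figures as my assumption):

```python
# Effective bits per weight for the k-quant super-blocks (256 weights),
# assuming the per-sub-block scales and the fp16 super-block scale are
# counted in. These figures are my reading of the block layouts.
bpw = {"Q4_K": 4.5, "Q5_K": 5.5, "Q6_K": 6.5625}
for name in ("Q5_K", "Q6_K"):
    delta = bpw[name] / bpw["Q4_K"] - 1
    print(f"{name} is {delta:+.1%} larger than Q4_K")
```

Q5_K comes out ~22% larger, matching the quoted number; Q6_K is ~46% by pure bpw, close to the quoted ~44% once the model's non-quantized tensors dilute the ratio.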
Commit cf221af
Adding Q5_K - scalar, AVX2, CUDA
Performance is ~20% lower compared to Q4_K on the CPU. This is to be expected, considering that we are memory bound on the CPU and the 5-bit model is ~22% larger than the 4-bit. On the GPU, performance is about the same as Q4_0 for both single token and batch prediction.
Commit b835d0f
Commit 5c5191a
Commit d537b97
Commit 54f808d
Commit a2533a7
Commit 5ca15ce
Commit a197eb5
It is 22% slower than Q4_K, despite the smaller model size. On x86_64, where we are memory bound, the Q3_K model is quite a bit faster than Q4_K.
Commit 13264fa
Commit 4faa040
Adding Q2_K - just CUDA for now
Token prediction is pretty good - about 15.5 ms on an RTX 4080. Perplexity is about the same as Q4_K.
Commit b439efb
Commit 8516fdf
Commit 6ec7057
A slightly faster ARM_NEON Q2_K dot
Single token prediction is now ~36 ms on M2 Max. The code is much simpler too.
Commit 7bcc376
Fixed bug in Q2_K CUDA dot product kernel
Strangely enough, for the few prompts I tried with the 7B model the responses looked perfectly reasonable. I only realized something was not quite right when I tried the larger models and started getting nonsense back. In any case, Q2_K single token evaluation times on an RTX 4080 in a Ryzen 7950X box, using CUDA with the model fully loaded on the GPU, are ~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B. The max number of layers that fit in VRAM for the 65B is 32. With that, we get ~330 ms per token, which is not that much faster than just running on the CPU (~470 ms per token).
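For reference, converting the per-token timings above into throughput, and the 65B partial-offload figure into a speedup over pure CPU:

```python
# ms/token figures from the measurements above
timings = {"7B": 15.5, "13B": 25.4, "30B": 55.8, "65B (32 layers on GPU)": 330.0}
for model, ms in timings.items():
    print(f"{model}: {1000 / ms:.1f} tok/s")

# 65B: 32 layers offloaded to GPU vs running entirely on the CPU
print(f"65B offload speedup: {470 / 330:.2f}x")
```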
Commit e51ce72
Commit c5959d5
A 10% faster CUDA vector dot kernel for Q3_K
Q3_K is now running at ~18.5 ms / token on CUDA, so the gap to Q4_0 is only 10%. It seems the memory access pattern is more important for performance than the amount of computation the kernel does.
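To put a number on that: if I have the block layouts right, Q3_K stores about 3.44 bits per weight versus 4.5 for Q4_0, so the kernel reads roughly a quarter less weight data per token. Being only ~10% slower despite that is what points at the access pattern rather than raw bandwidth (the byte counts below are my reading of the layouts, not something verified against the source here):

```python
# Approximate bits per weight, assuming 110-byte super-blocks of 256
# weights for Q3_K and 18-byte blocks of 32 weights for Q4_0.
q3_k = 110 * 8 / 256   # ~3.44 bpw
q4_0 = 18 * 8 / 32     # 4.5 bpw
print(f"Q3_K reads {1 - q3_k / q4_0:.0%} less weight data than Q4_0")
```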
Commit 9a9c5a0
A slightly faster Q4_K AVX2 dot product
For perplexity, where we are less memory bound, time per pass drops by ~5%. Barely measurable difference for single token prediction.
Commit 894210a
Commit abd99a8
Commit 8f5d42d
We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit quantization variants.
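A quick simulation shows why: even with per-block scales, the rounding error of 2- and 3-bit quantization sits orders of magnitude above such a threshold. This is a toy check with symmetric per-sub-block quantization of Gaussian weights, not the actual k-quant test harness:

```python
import numpy as np

def quant_rmse(bits, n=1 << 16, sub=32, seed=0):
    # symmetric per-sub-block quantization of Gaussian weights
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n).reshape(-1, sub)
    qmax = (1 << (bits - 1)) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return float(np.sqrt(np.mean((q * scale - w) ** 2)))

for b in (2, 3, 4):
    print(f"{b}-bit rmse: {quant_rmse(b):.4f}")
```

The error roughly halves per extra bit, so a threshold calibrated for 4-bit quants cannot be reused unchanged for the 2- and 3-bit variants.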
Commit 6ef1382
I have been sloppy with vector reinterpret casts on ARM_NEON. It seems clang is very forgiving in that regard.
Commit 0a71a4e
Commits on Jun 4, 2023
Commit 431693c
Commit 32a5f3a
Commits on Jun 5, 2023
Commit 12d4344
Commit af275fa