k-quants #1684

Merged · 32 commits · Jun 5, 2023

Commits on Jun 3, 2023

  1. Starting to add k-quantization to ggml

    I think it is better to have quantization separate from
    ggml. For now just adding the k-quants there, but it would be
    better to also factor out the existing ggml quantizations.
    Kawrakow committed Jun 3, 2023 · 8673a41
  2. b4f7134
  3. Q3_K now working on CUDA and AVX2/scalar

    CUDA is not ideal - ~50% slower than Q4_0 for
    single token prediction, about the same in batch
    mode (perplexity). CPU single token is ~55 ms
    (on Ryzen 7950X).
    Kawrakow committed Jun 3, 2023 · c93cce3
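
For orientation, a minimal sketch of the super-block layout a 3-bit k-quant uses. The field sizes below match the final block_q3_K in ggml (256 weights per super-block, 16 sub-blocks of 16), but the format was revised during this PR, so treat the exact packing as illustrative:

```c
#include <stdint.h>
#include <stdio.h>

#define QK_K 256

typedef uint16_t fp16_t; /* stand-in for ggml_fp16_t */

typedef struct {
    uint8_t hmask[QK_K / 8]; /* high bit of each 3-bit quant   (32 B) */
    uint8_t qs[QK_K / 4];    /* low 2 bits of each quant       (64 B) */
    uint8_t scales[12];      /* 16 sub-block scales, 6 bits each      */
    fp16_t  d;               /* fp16 super-block scale          (2 B) */
} block_q3_K;

int main(void) {
    /* 110 bytes per 256 weights -> 3.4375 bits per weight */
    printf("bytes per block: %zu\n", sizeof(block_q3_K));
    printf("bits per weight: %.4f\n", 8.0 * sizeof(block_q3_K) / QK_K);
    return 0;
}
```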
  4. Some improvement for Q3_K on CUDA

    It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0.
    Kawrakow committed Jun 3, 2023 · a3c0673
  5. Some more CUDA optimizations for Q3_K

    Single token is now 20.5 ms/token (~20% slower than Q4_0).
    Perplexity is on par with Q4_0.
    Kawrakow committed Jun 3, 2023 · 3d8b1de
  6. Adding Q4_K - scalar, AVX2, CUDA

    Performance is the same or perhaps very slightly better than Q4_0 on the CPU.
    On the GPU, single token prediction is ~10% better than Q4_0;
    batch mode (perplexity) is about the same.
    Kawrakow committed Jun 3, 2023 · a0b8e9f
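
All the k-quants share the same two-level idea: an fp16 scale and min per super-block, plus small integer scales and mins per sub-block. A minimal sketch of the resulting dequantization of one Q4_K-style value, assuming that structure (names and packing are illustrative, not the ggml code):

```c
#include <stdint.h>
#include <stdio.h>

/* d/dmin: fp16 super-block scale/min (as floats here);
   sc/m:   6-bit sub-block scale/min;
   q:      stored 4-bit quant, 0..15 */
static inline float dequant_q4k(float d, float dmin,
                                uint8_t sc, uint8_t m, uint8_t q) {
    return d * (float)sc * (float)q - dmin * (float)m;
}

int main(void) {
    printf("%f\n", dequant_q4k(0.01f, 0.005f, 20, 10, 7)); /* 1.35 */
    return 0;
}
```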
  7. Adding Q6_K - scalar, AVX2, CUDA

    Performance is ~40% lower than Q4_K on the CPU.
    This is to be expected, considering that we are memory bound
    on the CPU and the 6-bit model is ~44% larger than the 4-bit.
    On the GPU, single token prediction is ~6% slower than Q4_0;
    batch mode (perplexity) is even closer (but still slower).
    Kawrakow committed Jun 3, 2023 · cf221af
  8. Adding Q5_K - scalar, AVX2, CUDA

    Performance is ~20% lower than Q4_K on the CPU.
    This is to be expected, considering that we are memory bound
    on the CPU and the 5-bit model is ~22% larger than the 4-bit.
    On the GPU, performance is about the same as Q4_0 for both
    single token and batch prediction.
    Kawrakow committed Jun 3, 2023 · b835d0f
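
The memory-bound argument in the last two commits is plain arithmetic on block sizes. Assuming the super-block sizes the final formats use (144, 176, and 210 bytes per 256 weights for Q4_K, Q5_K, Q6_K), the size ratios line up with the quoted slowdowns:

```c
#include <stdio.h>

int main(void) {
    /* assumed bytes per 256-weight super-block in the final formats */
    const double q4k = 144.0, q5k = 176.0, q6k = 210.0;

    /* if CPU inference is memory-bandwidth bound, time per token
       should scale roughly with bytes read per weight */
    printf("Q4_K: %.4f bpw\n", 8.0 * q4k / 256.0);        /* 4.5000 */
    printf("Q5_K: %.4f bpw, +%.0f%% vs Q4_K\n",
           8.0 * q5k / 256.0, 100.0 * (q5k / q4k - 1.0)); /* ~+22% */
    printf("Q6_K: %.4f bpw, +%.0f%% vs Q4_K\n",
           8.0 * q6k / 256.0, 100.0 * (q6k / q4k - 1.0)); /* ~+46% */
    return 0;
}
```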
  9. 5c5191a
  10. Adding quantization mixes

    Kawrakow committed Jun 3, 2023 · d537b97
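
"Mixes" means choosing a different k-quant type per tensor rather than one type for the whole model. The commit doesn't show the recipe, so the following is a hypothetical sketch; the function name, tensor names, and choices are invented for illustration:

```c
#include <stdio.h>
#include <string.h>

enum qtype { Q2_K, Q3_K, Q4_K, Q5_K, Q6_K };

/* Hypothetical per-tensor selection for a "mostly Q4_K" mix:
   spend more bits where quantization error hurts most. */
static enum qtype pick_qtype(const char *tensor_name) {
    if (strcmp(tensor_name, "output.weight") == 0)   return Q6_K;
    if (strstr(tensor_name, "attention.wv") != NULL) return Q5_K;
    return Q4_K; /* bulk of the model */
}

int main(void) {
    printf("%d %d\n", pick_qtype("output.weight"),
           pick_qtype("layers.0.feed_forward.w1.weight"));
    return 0;
}
```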
  11. 54f808d
  12. a2533a7
  13. 5ca15ce
  14. a197eb5
  15. Adding Q3_K dot for ARM_NEON

    It is 22% slower than Q4_K, despite the smaller model size.
    On x86_64, where we are memory bound, the Q3_K model is
    quite a bit faster than Q4_K.
    Kawrakow committed Jun 3, 2023 · 13264fa
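
Not the Q3_K kernel itself, but a minimal sketch of the NEON building block these dot products reduce to: a widening int8 multiply-accumulate. The actual kernels also unpack the quants and fold in the scales.

```c
#include <arm_neon.h>
#include <stdint.h>

/* int8 dot product: widen to 16 bits with vmull_s8, then
   pairwise-accumulate into a 32-bit vector accumulator */
static int32_t dot_i8(const int8_t *a, const int8_t *b, int n /* %16 == 0 */) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        acc = vpadalq_s16(acc, vmull_s8(vget_low_s8(va),  vget_low_s8(vb)));
        acc = vpadalq_s16(acc, vmull_s8(vget_high_s8(va), vget_high_s8(vb)));
    }
    return vaddvq_s32(acc);
}

int main(void) {
    int8_t a[16], b[16];
    for (int i = 0; i < 16; i++) { a[i] = (int8_t)i; b[i] = 1; }
    return dot_i8(a, b, 16) == 120 ? 0 : 1; /* 0+1+...+15 = 120 */
}
```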
  16. 4faa040
  17. Adding Q2_K - just CUDA for now

    Token prediction is pretty good - about 15.5 ms on an RTX 4080.
    Perplexity is about the same as Q4_K.
    Kawrakow committed Jun 3, 2023 · b439efb
  18. 8516fdf
  19. Adding ARM_NEON Q2_K dot

    About the same performance as Q4_K.
    Kawrakow committed Jun 3, 2023 · 6ec7057
  20. A slightly faster ARM_NEON Q2_K dot

    Single token prediction is now ~36 ms on M2 Max.
    The code is much simpler too.
    Kawrakow committed Jun 3, 2023 · 7bcc376
  21. Fixed bug in Q2_K CUDA dot product kernel

    Strangely enough, for the few prompts I tried with the 7B model
    the responses looked perfectly reasonable. I only realized something
    was not quite right when I tried the larger models and started getting
    nonsense back.

    In any case, Q2_K single token evaluation times on an RTX 4080 in a
    Ryzen 7950X box, using CUDA with the model fully loaded on the GPU, are
      ~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B.
    The max number of layers that fit in VRAM for the 65B is 32.
    With that, we get ~330 ms per token, which is not that much faster
    than just running on the CPU (~470 ms per token).
    Kawrakow committed Jun 3, 2023 · e51ce72
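
The 65B constraint is simple size arithmetic: at Q2_K's ~2.6 bits per weight (assumed from the final format), a 65B model still needs ~21 GB for the weights alone, more than an RTX 4080's 16 GB, hence only a partial offload:

```c
#include <stdio.h>

int main(void) {
    const double n_params = 65e9;
    const double bpw      = 2.625; /* assumed Q2_K bits per weight */
    const double gb       = n_params * bpw / 8.0 / 1e9;
    printf("65B @ Q2_K: ~%.1f GB of weights vs 16 GB of VRAM\n", gb);
    return 0;
}
```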
  22. c5959d5
  23. A 10% faster CUDA vector dot kernel for Q3_K

    Q3_K is now running at ~18.5 ms / token on CUDA,
    so the gap to Q4_0 is only 10%.
    It seems the memory access pattern matters more for
    performance than the amount of computation the kernel does.
    Kawrakow committed Jun 3, 2023 · 9a9c5a0
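
That observation is easy to reproduce even on the CPU: identical arithmetic over identical bytes can differ severalfold in speed depending only on traversal order. A small CPU-side illustration of the effect (an analogy, not the CUDA kernel):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Same arithmetic, two traversal orders; the strided walk defeats the
   cache and prefetcher, so it is typically several times slower. */
enum { N = 1 << 24, STRIDE = 4096 };

static double sum_seq(const float *x) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += x[i];
    return s;
}

static double sum_strided(const float *x) {
    double s = 0.0;
    for (int j = 0; j < STRIDE; j++)
        for (int i = j; i < N; i += STRIDE) s += x[i];
    return s;
}

int main(void) {
    float *x = malloc(N * sizeof *x);
    for (int i = 0; i < N; i++) x[i] = 1.0f;

    clock_t t0 = clock(); double s1 = sum_seq(x);     clock_t t1 = clock();
    double s2 = sum_strided(x);                       clock_t t2 = clock();

    printf("sequential: %.0f in %ld ticks\n", s1, (long)(t1 - t0));
    printf("strided:    %.0f in %ld ticks\n", s2, (long)(t2 - t1));
    free(x);
    return 0;
}
```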
  24. A slightly faster Q4_K AVX2 dot product

    For perplexity, where we are less memory bound, time per
    pass drops by ~5%. Barely measurable difference for single
    token prediction.
    Kawrakow committed Jun 3, 2023 · 894210a
  25. abd99a8
  26. Minor

    Kawrakow committed Jun 3, 2023 · 8f5d42d
  27. Fix quantization error test

    We cannot possibly expect rmse < 0.002 for the 2- and 3-bit
    quantization variants.
    Kawrakow committed Jun 3, 2023 · 6ef1382
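
For scale: round-to-nearest b-bit quantization of roughly unit-variance weights has an rms error on the order of the step size, hundreds of times larger than 0.002 at 2-3 bits. A quick Monte Carlo check with a plain uniform quantizer (a simplification, not the k-quant scheme):

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const double TWO_PI = 6.283185307179586;
    srand(12345);
    for (int bits = 2; bits <= 4; bits++) {
        const double lo = -3.0, hi = 3.0;
        const int    levels = (1 << bits) - 1;
        const double step = (hi - lo) / levels;
        double se = 0.0;
        const int n = 1000000;
        for (int i = 0; i < n; i++) {
            /* Box-Muller normal sample */
            double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
            double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
            double w  = sqrt(-2.0 * log(u1)) * cos(TWO_PI * u2);
            double k  = round((w - lo) / step); /* nearest level */
            if (k < 0) k = 0;
            if (k > levels) k = levels;
            double q = lo + step * k;
            se += (w - q) * (w - q);
        }
        printf("%d bits: rmse ~ %.3f\n", bits, sqrt(se / n)); /* >> 0.002 */
    }
    return 0;
}
```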
  28. Fix docker build

    I have been sloppy with vector reinterpret casts on ARM_NEON.
    It seems clang is very forgiving in that regard.
    Kawrakow committed Jun 3, 2023 · 0a71a4e
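
For reference, the kind of cast involved: clang accepts implicit conversions between NEON vector types of the same size, while gcc (presumably what the docker image builds with) requires the explicit vreinterpret. A minimal example of the strict form:

```c
#include <arm_neon.h>

/* gcc rejects passing an int8x16_t where a uint8x16_t is expected;
   the explicit reinterpret makes it compile everywhere */
uint8x16_t mask_low_bits(int8x16_t x) {
    const uint8x16_t mask = vdupq_n_u8(0x03);
    return vandq_u8(vreinterpretq_u8_s8(x), mask);
}

int main(void) {
    uint8x16_t r = mask_low_bits(vdupq_n_s8(-1)); /* 0xFF & 0x03 */
    return vgetq_lane_u8(r, 0) == 0x03 ? 0 : 1;
}
```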

Commits on Jun 4, 2023

  1. 431693c
  2. 32a5f3a

Commits on Jun 5, 2023

  1. 12d4344
  2. af275fa