This is a port of BlinkDL/RWKV-LM to ggerganov/ggml.
Besides the usual FP32, it supports FP16, quantized INT4, INT5 and INT8 inference. This project is focused on CPU, but cuBLAS is also supported.
This project provides a C library rwkv.h and a convenient Python wrapper for it.
RWKV is a novel large language model architecture, with the largest model in the family having 14B parameters. In contrast to Transformer with O(n^2)
attention, RWKV requires only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly on large context lengths.
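To make the "only the state from the previous step" point concrete, here is a minimal sketch of a recurrent step with a fixed-size state. It is a toy decayed running average, not the actual RWKV time-mixing formula; the names `step`, `run` and `decay` are made up for illustration and are not part of rwkv.h or the Python wrapper.

```python
def step(x, state, decay=0.9):
    """Consume one input and return (output, new_state).

    The state is a fixed-size pair (num, den) of decayed sums over
    everything seen so far. Its size does not depend on sequence length.
    """
    num, den = state
    num = decay * num + x      # decayed sum of past inputs plus the new one
    den = decay * den + 1.0    # decayed count, used for normalization
    return num / den, (num, den)


def run(sequence, state=(0.0, 0.0), decay=0.9):
    """Process a sequence one element at a time, carrying only the state."""
    outputs = []
    for x in sequence:
        y, state = step(x, state, decay)
        outputs.append(y)
    return outputs, state
```

Because each step reads only `state`, the cost per token stays constant as the context grows, and a long input can be processed in chunks by passing the returned state into the next `run` call.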
Loading LoRA checkpoints in Blealtan's format is supported through the merge_lora_into_ggml.py script.
If you use rwkv.cpp for anything serious, please test all available formats for perplexity and latency on a representative dataset, and decide which trade-off is best for you.
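As a rough sketch of what such a test could compute, the helpers below derive perplexity from per-token logits and measure mean latency per token. They are generic and do not use the rwkv.cpp API: `eval_fn` is a hypothetical callable standing in for whatever call evaluates one token with your model, and all function names here are illustrative.

```python
import math
import time


def log_softmax_at(logits, target):
    """Natural-log probability of index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[target] - log_sum


def perplexity(token_log_probs):
    """exp of the mean negative log-probability of the reference tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def mean_latency_ms(eval_fn, tokens):
    """Average wall-clock time per call of eval_fn(token), in milliseconds."""
    if not tokens:
        raise ValueError("need at least one token")
    start = time.perf_counter()
    for token in tokens:
        eval_fn(token)  # hypothetical: evaluates one token with the model
    return (time.perf_counter() - start) * 1000.0 / len(tokens)
```

Running the same token sequence through each format and comparing the two numbers gives a like-for-like view of the quality and speed trade-off on your own data.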
The table below is for reference only. Measurements were made on a 4C/8T x86 CPU with AVX2, using 4 threads.
Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
---|---|---|---|
Q4_0 | 17.507 | 76 | 1.53 |
Q4_1 | 17.187 | 72 | 1.68 |
Q5_0 | 16.194 | 78 | 1.60 |
Q5_1 | 15.851 | 81 | 1.68 |
Q8_0 | 15.652 | 89 | 2.13 |
FP16 | 15.623 | 117 | 2.82 |
FP32 | 15.623 | 198 | 5.64 |
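One way to read the table is to fix a latency and file-size budget and take the lowest-perplexity format that fits. The sketch below does that with the numbers copied from the table; `best_format` is an illustrative helper, not part of this project. Keep in mind that perplexity was measured on the 169M model while latency and file size come from the 1.5B model, so the result is only a starting point for your own measurements.

```python
# (format, perplexity on 169M, latency in ms on 1.5B, file size in GB on 1.5B)
# Values copied from the reference table above.
MEASUREMENTS = [
    ("Q4_0", 17.507, 76, 1.53),
    ("Q4_1", 17.187, 72, 1.68),
    ("Q5_0", 16.194, 78, 1.60),
    ("Q5_1", 15.851, 81, 1.68),
    ("Q8_0", 15.652, 89, 2.13),
    ("FP16", 15.623, 117, 2.82),
    ("FP32", 15.623, 198, 5.64),
]


def best_format(max_latency_ms, max_size_gb):
    """Return the lowest-perplexity format within both budgets, or None."""
    candidates = [
        row for row in MEASUREMENTS
        if row[2] <= max_latency_ms and row[3] <= max_size_gb
    ]
    if not candidates:
        return None
    # min() keeps the first row on ties, so FP16 wins over FP32.
    return min(candidates, key=lambda row: row[1])[0]


print(best_format(max_latency_ms=80, max_size_gb=2.0))
```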
Measurements were made on Intel i7 13700K & NVIDIA 3060 Ti 8G. Latency per token shown.
Model | Layers on GPU | Format | 24 Threads | 8 Threads | 4 Threads | 2 Threads | 1 Thread |
---|---|---|---|---|---|---|---|
RWKV-4-Pile-169M | 12 | Q4_0 | 20.6 ms | 8.6 ms | 6.9 ms | 6.2 ms | 7.9 ms |
RWKV-4-Pile-169M | 12 | Q4_1 | 21.4 ms | 8.6 ms | 6.9 ms | 6.7 ms | 7.8 ms |
RWKV-4-Pile-169M | 12 | Q5_1 | 22.2 ms | 9.0 ms | 6.9 ms | 6.7 ms | 8.1 ms |
RWKV-4-Raven-7B-v11 | 32 | Q4_0 | 94.9 ms | 54.3 ms | 50.2 ms | 51.6 ms | 59.2 ms |
RWKV-4-Raven-7B-v11 | 32 | Q4_1 | 94.5 ms | 54.3 ms | 49.7 ms | 51.8 ms | 59.2 ms |
RWKV-4-Raven-7B-v11 | 32 | Q5_1 | 101.6 ms | 72.3 ms | 67.2 ms | 69.3 ms | 77.0 ms |
Note: since cuBLAS is supported only for ggml_mul_mat(), we still need to use some CPU resources to execute the remaining operations.
Requirements: git.
git clone --recursive https://github.com/saharNooby/rwkv.cpp.git
cd rwkv.cpp