forked from RWKV/rwkv.cpp

INT4 and FP16 inference on CPU for RWKV language model

ArEnSc/rwkv.cpp

 
 


rwkv.cpp

This is a port of BlinkDL/RWKV-LM to ggerganov/ggml.

Besides the usual FP32, it supports FP16, quantized INT4, INT5 and INT8 inference. This project is focused on CPU, but cuBLAS is also supported.

This project provides a C library, rwkv.h, and a convenient Python wrapper for it.

RWKV is a novel large language model architecture, with the largest model in the family having 14B parameters. In contrast to a Transformer with O(n^2) attention, RWKV requires only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly at large context lengths.
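To illustrate why a fixed-size state helps at long context lengths, here is a minimal sketch in Python. It is a toy recurrence, not RWKV's actual formulas: toy_step, run_sequence and attention_scores_total are hypothetical names made up for this example. The point is only that the state keeps the same size however many tokens have been processed, while full causal attention compares each new token against all earlier ones.

```python
import math


def toy_step(token_id, state, decay=0.9):
    """Hypothetical recurrent update (NOT the real RWKV math).

    Takes one token and the previous state, returns (logits, new_state).
    Nothing else about earlier tokens is needed.
    """
    new_state = [
        decay * s + (1.0 - decay) * math.sin(token_id + i)
        for i, s in enumerate(state)
    ]
    logits = list(new_state)  # stand-in for a real output projection
    return logits, new_state


def run_sequence(token_ids, state_size=4):
    """Process tokens one at a time, carrying only a fixed-size state."""
    state = [0.0] * state_size
    logits = None
    for token_id in token_ids:
        logits, state = toy_step(token_id, state)
    return logits, state


def attention_scores_total(n):
    """Score computations in full causal attention over n tokens: n(n+1)/2."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    _, short_state = run_sequence(range(10))
    _, long_state = run_sequence(range(1000))
    print(len(short_state), len(long_state))  # same state size for both
    print(attention_scores_total(10), attention_scores_total(1000))
```

In the recurrent case the per-token cost and memory do not depend on the position in the sequence; in the attention case the total work grows quadratically with sequence length.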

Loading LoRA checkpoints in Blealtan's format is supported through the merge_lora_into_ggml.py script.

Quality and performance

If you use rwkv.cpp for anything serious, please test all available formats for perplexity and latency on a representative dataset, and decide which trade-off is best for you.
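As a rough sketch of what such a test can look like, the helpers below compute perplexity from per-token log-probabilities and the mean latency of a step function. They are not part of rwkv.cpp: perplexity and mean_latency_ms are hypothetical names, and it is an assumption that you already have natural-log probabilities of the correct tokens and a zero-argument callable that evaluates one token with your model.

```python
import math
import time


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative natural-log probability of the correct tokens)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def mean_latency_ms(step_fn, n_steps):
    """Average wall-clock time of step_fn() in milliseconds over n_steps calls."""
    if n_steps <= 0:
        raise ValueError("n_steps must be positive")
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / n_steps


if __name__ == "__main__":
    # A model that gives every correct token probability 0.5 has perplexity of about 2.
    print(perplexity([math.log(0.5)] * 4))
    # Replace the lambda with a call that evaluates one token on your model.
    print(mean_latency_ms(lambda: None, 100))
```

Run both measurements for each format on the same dataset and hardware, so that the numbers are comparable.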

The table below is for reference only. Measurements were made on a 4C/8T x86 CPU with AVX2, using 4 threads.

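| Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
|--------|-------------------|--------------------|----------------------|
| Q4_0   | 17.507            | 76                 | 1.53                 |
| Q4_1   | 17.187            | 72                 | 1.68                 |
| Q5_0   | 16.194            | 78                 | 1.60                 |
| Q5_1   | 15.851            | 81                 | 1.68                 |
| Q8_0   | 15.652            | 89                 | 2.13                 |
| FP16   | 15.623            | 117                | 2.82                 |
| FP32   | 15.623            | 198                | 5.64                 |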
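One way to read these numbers is to fix how much perplexity you are willing to give up and then pick the fastest format within that budget. The sketch below does this with the values from the table; fastest_within is a hypothetical helper written for this example, and because perplexity was measured on the 169M model while latency was measured on the 1.5B model, the result is only a rough guide.

```python
# (format, perplexity on 169M, latency in ms on 1.5B, file size in GB on 1.5B)
MEASUREMENTS = [
    ("Q4_0", 17.507, 76, 1.53),
    ("Q4_1", 17.187, 72, 1.68),
    ("Q5_0", 16.194, 78, 1.60),
    ("Q5_1", 15.851, 81, 1.68),
    ("Q8_0", 15.652, 89, 2.13),
    ("FP16", 15.623, 117, 2.82),
    ("FP32", 15.623, 198, 5.64),
]


def fastest_within(max_perplexity_increase):
    """Return the lowest-latency format whose perplexity is within the
    given increase over the best (lowest) perplexity in the table."""
    baseline = min(row[1] for row in MEASUREMENTS)
    candidates = [
        row for row in MEASUREMENTS
        if row[1] - baseline <= max_perplexity_increase
    ]
    return min(candidates, key=lambda row: row[2])[0]


if __name__ == "__main__":
    for budget in (0.0, 0.1, 2.0):
        print(budget, fastest_within(budget))
```

The same selection can be done on file size instead of latency by changing the key to row[3].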

With cuBLAS

Measurements were made on an Intel i7 13700K and an NVIDIA 3060 Ti 8G. Latency per token is shown.

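| Model               | Layers on GPU | Format | 24 Threads | 8 Threads | 4 Threads | 2 Threads | 1 Thread |
|---------------------|---------------|--------|------------|-----------|-----------|-----------|----------|
| RWKV-4-Pile-169M    | 12            | Q4_0   | 20.6 ms    | 8.6 ms    | 6.9 ms    | 6.2 ms    | 7.9 ms   |
| RWKV-4-Pile-169M    | 12            | Q4_1   | 21.4 ms    | 8.6 ms    | 6.9 ms    | 6.7 ms    | 7.8 ms   |
| RWKV-4-Pile-169M    | 12            | Q5_1   | 22.2 ms    | 9.0 ms    | 6.9 ms    | 6.7 ms    | 8.1 ms   |
| RWKV-4-Raven-7B-v11 | 32            | Q4_0   | 94.9 ms    | 54.3 ms   | 50.2 ms   | 51.6 ms   | 59.2 ms  |
| RWKV-4-Raven-7B-v11 | 32            | Q4_1   | 94.5 ms    | 54.3 ms   | 49.7 ms   | 51.8 ms   | 59.2 ms  |
| RWKV-4-Raven-7B-v11 | 32            | Q5_1   | 101.6 ms   | 72.3 ms   | 67.2 ms   | 69.3 ms   | 77.0 ms  |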

Note: since cuBLAS is supported only for ggml_mul_mat(), some CPU resources are still needed to execute the remaining operations.

How to use

1. Clone the repo

Requirements: git.

git clone --recursive https://github.com/saharNooby/rwkv.cpp.git
cd rwkv.cpp

2. Get the rwkv.cpp library

Option 2.1. Download a pre-compiled library