More accurate Q4_0 and Q4_1 quantizations #896

Closed
wants to merge 12 commits

Conversation

ikawrakow (Contributor) commented on Apr 11, 2023

Update

After seeing PR #835, I pushed some more changes that only affect the Q4_0 results. I now get

rmse = 0.00185228

for the 7B model. Perplexity becomes 6.2644. This is the result on my MacBook with an M2 Max. Running the same quantization on a Ryzen 7950X gives completely different results. That test is still running, but so far it looks like it will end up with a perplexity that is ~0.3 higher. I initially guessed there was a problem with the AVX2 version used there, but @ggerganov tells me that the difference I'm observing is simply due to BLAS being used on the Mac and not on the Ryzen 7950X.

Update 2

OK, it looks like the low perplexities I'm getting are simply because I'm running on the Mac, where BLAS is enabled by default. So most of the reduction in perplexity I'm observing comes from the full-precision matrix multiplications, not from the better quantization. I will rerun perplexity without BLAS (or with BLAS using the reference Q4_0 quantization) and will post the results. That will give a better picture of how much one can gain from improving the quantization.

Update 3

Perplexity of the 7B model with the reference Q4_0 quantization and BLAS enabled is 6.2838 after 655 chunks. So the ~25% reduction in MSE of the quantized weights results in a 0.02 improvement in perplexity, while full precision in the matrix multiplications via BLAS improves perplexity by ~0.3. This basically means that this PR is pretty pointless.

Update 4

Perplexity results for 7B and 13B with Q4_0 and Q4_1 are available here.

I also added a POC for 5-bit quantization. Memory/disk usage is the same as the current Q4_1: the two fp32 scaling factors are stored as fp16 instead. For each quantized value in a block of 32 weights, 4 of the 5 bits are stored as in Q4, and the 32 bits freed by switching to fp16 are used as per-weight flags indicating whether the 5th bit is set. The encoding/decoding ends up being not too bad (see the sketch below).
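Just to make the layout concrete, here is a minimal sketch of how such a block could look and be decoded. The names (`block_q5_poc`, `qs`, `qh`, `dequantize_q5_one`) and the exact field order are my own illustration, not the actual code in this PR, and fp16-to-float conversion of the two scalars is left out for brevity.

```c
#include <stdint.h>

#define QK 32  /* weights per block, as in Q4_0/Q4_1 */

/* Hypothetical 5-bit block: 24 bytes per 32 weights, i.e. the same size
 * as the current Q4_1 block (two fp32 scalars + 16 nibble bytes), because
 * the two scalars are stored as fp16 and the 32 bits saved that way hold
 * the 5th bit of each weight. */
typedef struct {
    uint16_t d;          /* scale (fp16)                       */
    uint16_t m;          /* min   (fp16)                       */
    uint32_t qh;         /* 5th bit of each of the 32 weights  */
    uint8_t  qs[QK / 2]; /* lower 4 bits, two weights per byte */
} block_q5_poc;

/* Reconstruct weight i of a block as x = d*q + m with q in [0, 31].
 * d and m are the two fp16 scalars already converted to float. */
static float dequantize_q5_one(const block_q5_poc * b, int i, float d, float m) {
    const uint8_t nib = (i & 1) ? (b->qs[i / 2] >> 4) : (b->qs[i / 2] & 0x0F);
    const int     q   = nib | (((b->qh >> i) & 1) << 4); /* add the 5th bit */
    return d * q + m;
}
```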

The improvement in rmse compared to Q4_1 is dramatic. I get

rmse 0.00076131, maxerr 0.05273438, 95pct<0.0016, median<0.0006 

after a full round trip of quantization and dequantization.

iwan@MacBook-Pro:~/other/llama.cpp/build$ ./bin/quantize-stats -m ../../quant/models/7B/ggml-model-f16-new.bin -nq -p Loading model llama.cpp: loading model from ../../quant/models/7B/ggml-model-f16-new.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 256 llama_model_load_internal: n_embd = 4096 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 32 llama_model_load_internal: n_layer = 32 llama_model_load_internal: n_rot = 128 llama_model_load_internal: f16 = 1 llama_model_load_internal: n_ff = 11008 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 7B llama_model_load_internal: ggml ctx size = 59.11 KB llama_model_load_internal: mem required = 14645.07 MB (+ 2052.00 MB per state) llama_init_from_file: kv self size = 256.00 MB note: source model is f16 testing 291 layers with max size 131072000 q4_1::tok_embeddings.weight : rmse 0.00067479, maxerr 0.00573730, 95pct<0.0014, median<0.0006 q4_1::norm.weight : rmse 0.00464176, maxerr 0.02522278, 95pct<0.0098, median<0.0028 q4_1::output.weight : rmse 0.00067622, maxerr 0.00727844, 95pct<0.0014, median<0.0006 q4_1::layers.0.attention.wq.weight : rmse 0.00113705, maxerr 0.01367188, 95pct<0.0024, median<0.0008 q4_1::layers.0.attention.wk.weight : rmse 0.00112141, maxerr 0.02249146, 95pct<0.0024, median<0.0008 q4_1::layers.0.attention.wv.weight : rmse 0.00045496, maxerr 0.00343323, 95pct<0.0010, median<0.0004 q4_1::layers.0.attention.wo.weight : rmse 0.00040908, maxerr 0.01490784, 95pct<0.0008, median<0.0004 q4_1::layers.0.feed_forward.w1.weight : rmse 0.00056531, maxerr 0.02049255, 95pct<0.0012, median<0.0006 q4_1::layers.0.feed_forward.w2.weight : rmse 0.00068808, maxerr 0.01811218, 95pct<0.0014, median<0.0006 q4_1::layers.0.feed_forward.w3.weight : rmse 0.00054793, maxerr 0.00539398, 95pct<0.0012, median<0.0006 q4_1::layers.0.attention_norm.weight : rmse 0.00184961, maxerr 0.01128960, 95pct<0.0046, median<0.0004 q4_1::layers.0.ffn_norm.weight : rmse 0.00057309, maxerr 0.00354767, 95pct<0.0014, median<0.0004 q4_1::layers.1.attention.wq.weight : rmse 0.00113066, maxerr 0.01373291, 95pct<0.0024, median<0.0008 q4_1::layers.1.attention.wk.weight : rmse 0.00115275, maxerr 0.01394653, 95pct<0.0026, median<0.0008 q4_1::layers.1.attention.wv.weight : rmse 0.00038954, maxerr 0.00264549, 95pct<0.0008, median<0.0004 q4_1::layers.1.attention.wo.weight : rmse 0.00039675, maxerr 0.01858521, 95pct<0.0008, median<0.0004 q4_1::layers.1.feed_forward.w1.weight : rmse 0.00071841, maxerr 0.01058960, 95pct<0.0014, median<0.0006 q4_1::layers.1.feed_forward.w2.weight : rmse 0.00070427, maxerr 0.01712418, 95pct<0.0014, median<0.0006 q4_1::layers.1.feed_forward.w3.weight : rmse 0.00068200, maxerr 0.00634003, 95pct<0.0014, median<0.0006 q4_1::layers.1.attention_norm.weight : rmse 0.00113834, maxerr 0.00441742, 95pct<0.0022, median<0.0010 q4_1::layers.1.ffn_norm.weight : rmse 0.00045691, maxerr 0.00189972, 95pct<0.0012, median<0.0004 q4_1::layers.2.attention.wq.weight : rmse 0.00122553, maxerr 0.01284790, 95pct<0.0026, median<0.0008 q4_1::layers.2.attention.wk.weight : rmse 0.00126909, maxerr 0.01083374, 95pct<0.0026, median<0.0008 q4_1::layers.2.attention.wv.weight : rmse 0.00046610, maxerr 0.00373840, 95pct<0.0010, median<0.0004 q4_1::layers.2.attention.wo.weight : rmse 0.00047293, maxerr 0.02255249, 95pct<0.0010, median<0.0004 q4_1::layers.2.feed_forward.w1.weight : rmse 0.00075789, maxerr 0.02072144, 95pct<0.0014, median<0.0006 
q4_1::layers.2.feed_forward.w2.weight : rmse 0.00070288, maxerr 0.02935791, 95pct<0.0014, median<0.0006 q4_1::layers.2.feed_forward.w3.weight : rmse 0.00069113, maxerr 0.01350403, 95pct<0.0014, median<0.0006 q4_1::layers.2.attention_norm.weight : rmse 0.00066977, maxerr 0.00357437, 95pct<0.0016, median<0.0004 q4_1::layers.2.ffn_norm.weight : rmse 0.00057539, maxerr 0.00602722, 95pct<0.0014, median<0.0002 q4_1::layers.3.attention.wq.weight : rmse 0.00100399, maxerr 0.01467896, 95pct<0.0020, median<0.0008 q4_1::layers.3.attention.wk.weight : rmse 0.00105630, maxerr 0.00891876, 95pct<0.0022, median<0.0008 q4_1::layers.3.attention.wv.weight : rmse 0.00055252, maxerr 0.00350380, 95pct<0.0012, median<0.0006 q4_1::layers.3.attention.wo.weight : rmse 0.00055156, maxerr 0.02120972, 95pct<0.0012, median<0.0006 q4_1::layers.3.feed_forward.w1.weight : rmse 0.00076589, maxerr 0.00913620, 95pct<0.0014, median<0.0006 q4_1::layers.3.feed_forward.w2.weight : rmse 0.00070594, maxerr 0.01667786, 95pct<0.0014, median<0.0006 q4_1::layers.3.feed_forward.w3.weight : rmse 0.00070419, maxerr 0.00675297, 95pct<0.0014, median<0.0006 q4_1::layers.3.attention_norm.weight : rmse 0.00064717, maxerr 0.00405121, 95pct<0.0016, median<0.0004 q4_1::layers.3.ffn_norm.weight : rmse 0.00052969, maxerr 0.00325775, 95pct<0.0014, median<0.0002 q4_1::layers.4.attention.wq.weight : rmse 0.00102266, maxerr 0.01240540, 95pct<0.0020, median<0.0008 q4_1::layers.4.attention.wk.weight : rmse 0.00103666, maxerr 0.00794220, 95pct<0.0022, median<0.0008 q4_1::layers.4.attention.wv.weight : rmse 0.00055190, maxerr 0.00349617, 95pct<0.0012, median<0.0006 q4_1::layers.4.attention.wo.weight : rmse 0.00055145, maxerr 0.01269531, 95pct<0.0012, median<0.0006 q4_1::layers.4.feed_forward.w1.weight : rmse 0.00077576, maxerr 0.01290894, 95pct<0.0016, median<0.0006 q4_1::layers.4.feed_forward.w2.weight : rmse 0.00070362, maxerr 0.02110291, 95pct<0.0014, median<0.0006 q4_1::layers.4.feed_forward.w3.weight : rmse 0.00070692, maxerr 0.00906372, 95pct<0.0014, median<0.0006 q4_1::layers.4.attention_norm.weight : rmse 0.00073026, maxerr 0.00402832, 95pct<0.0018, median<0.0004 q4_1::layers.4.ffn_norm.weight : rmse 0.00051424, maxerr 0.00222015, 95pct<0.0014, median<0.0002 q4_1::layers.5.attention.wq.weight : rmse 0.00097000, maxerr 0.01222229, 95pct<0.0020, median<0.0008 q4_1::layers.5.attention.wk.weight : rmse 0.00097998, maxerr 0.00923157, 95pct<0.0020, median<0.0008 q4_1::layers.5.attention.wv.weight : rmse 0.00056170, maxerr 0.00393677, 95pct<0.0012, median<0.0006 q4_1::layers.5.attention.wo.weight : rmse 0.00055888, maxerr 0.01965332, 95pct<0.0012, median<0.0006 q4_1::layers.5.feed_forward.w1.weight : rmse 0.00079171, maxerr 0.00949097, 95pct<0.0016, median<0.0006 q4_1::layers.5.feed_forward.w2.weight : rmse 0.00069678, maxerr 0.01489258, 95pct<0.0014, median<0.0006 q4_1::layers.5.feed_forward.w3.weight : rmse 0.00070391, maxerr 0.00760412, 95pct<0.0014, median<0.0006 q4_1::layers.5.attention_norm.weight : rmse 0.00064811, maxerr 0.00418091, 95pct<0.0014, median<0.0004 q4_1::layers.5.ffn_norm.weight : rmse 0.00048299, maxerr 0.00224304, 95pct<0.0014, median<0.0002 q4_1::layers.6.attention.wq.weight : rmse 0.00097928, maxerr 0.01423645, 95pct<0.0020, median<0.0008 q4_1::layers.6.attention.wk.weight : rmse 0.00100344, maxerr 0.00769806, 95pct<0.0020, median<0.0008 q4_1::layers.6.attention.wv.weight : rmse 0.00056611, maxerr 0.00344467, 95pct<0.0012, median<0.0006 q4_1::layers.6.attention.wo.weight : rmse 0.00056490, maxerr 0.01663208, 95pct<0.0012, 
median<0.0006 q4_1::layers.6.feed_forward.w1.weight : rmse 0.00078077, maxerr 0.01182556, 95pct<0.0016, median<0.0006 q4_1::layers.6.feed_forward.w2.weight : rmse 0.00070368, maxerr 0.01443481, 95pct<0.0014, median<0.0006 q4_1::layers.6.feed_forward.w3.weight : rmse 0.00071282, maxerr 0.00653839, 95pct<0.0014, median<0.0006 q4_1::layers.6.attention_norm.weight : rmse 0.00065040, maxerr 0.00478363, 95pct<0.0014, median<0.0004 q4_1::layers.6.ffn_norm.weight : rmse 0.00045547, maxerr 0.00253296, 95pct<0.0012, median<0.0002 q4_1::layers.7.attention.wq.weight : rmse 0.00096305, maxerr 0.01339722, 95pct<0.0020, median<0.0008 q4_1::layers.7.attention.wk.weight : rmse 0.00097201, maxerr 0.00742340, 95pct<0.0020, median<0.0008 q4_1::layers.7.attention.wv.weight : rmse 0.00058344, maxerr 0.00369263, 95pct<0.0012, median<0.0006 q4_1::layers.7.attention.wo.weight : rmse 0.00057667, maxerr 0.01287842, 95pct<0.0012, median<0.0006 q4_1::layers.7.feed_forward.w1.weight : rmse 0.00077346, maxerr 0.00889587, 95pct<0.0016, median<0.0006 q4_1::layers.7.feed_forward.w2.weight : rmse 0.00070649, maxerr 0.01315308, 95pct<0.0014, median<0.0006 q4_1::layers.7.feed_forward.w3.weight : rmse 0.00071581, maxerr 0.00903320, 95pct<0.0014, median<0.0006 q4_1::layers.7.attention_norm.weight : rmse 0.00078184, maxerr 0.00570679, 95pct<0.0018, median<0.0004 q4_1::layers.7.ffn_norm.weight : rmse 0.00041102, maxerr 0.00209045, 95pct<0.0010, median<0.0002 q4_1::layers.8.attention.wq.weight : rmse 0.00094744, maxerr 0.01050568, 95pct<0.0020, median<0.0008 q4_1::layers.8.attention.wk.weight : rmse 0.00094850, maxerr 0.00811768, 95pct<0.0020, median<0.0008 q4_1::layers.8.attention.wv.weight : rmse 0.00057753, maxerr 0.00346375, 95pct<0.0012, median<0.0006 q4_1::layers.8.attention.wo.weight : rmse 0.00057410, maxerr 0.01193237, 95pct<0.0012, median<0.0006 q4_1::layers.8.feed_forward.w1.weight : rmse 0.00077391, maxerr 0.00781250, 95pct<0.0016, median<0.0006 q4_1::layers.8.feed_forward.w2.weight : rmse 0.00070675, maxerr 0.01203918, 95pct<0.0014, median<0.0006 q4_1::layers.8.feed_forward.w3.weight : rmse 0.00071790, maxerr 0.00637674, 95pct<0.0014, median<0.0006 q4_1::layers.8.attention_norm.weight : rmse 0.00078496, maxerr 0.00555420, 95pct<0.0018, median<0.0004 q4_1::layers.8.ffn_norm.weight : rmse 0.00039499, maxerr 0.00217056, 95pct<0.0010, median<0.0002 q4_1::layers.9.attention.wq.weight : rmse 0.00091703, maxerr 0.01177216, 95pct<0.0018, median<0.0008 q4_1::layers.9.attention.wk.weight : rmse 0.00092338, maxerr 0.00704193, 95pct<0.0020, median<0.0006 q4_1::layers.9.attention.wv.weight : rmse 0.00057349, maxerr 0.00370789, 95pct<0.0012, median<0.0006 q4_1::layers.9.attention.wo.weight : rmse 0.00056955, maxerr 0.01493835, 95pct<0.0012, median<0.0006 q4_1::layers.9.feed_forward.w1.weight : rmse 0.00076262, maxerr 0.01126099, 95pct<0.0016, median<0.0006 q4_1::layers.9.feed_forward.w2.weight : rmse 0.00071218, maxerr 0.01254272, 95pct<0.0014, median<0.0006 q4_1::layers.9.feed_forward.w3.weight : rmse 0.00072266, maxerr 0.01207066, 95pct<0.0014, median<0.0006 q4_1::layers.9.attention_norm.weight : rmse 0.00083655, maxerr 0.00610352, 95pct<0.0020, median<0.0006 q4_1::layers.9.ffn_norm.weight : rmse 0.00036572, maxerr 0.00198364, 95pct<0.0010, median<0.0002 q4_1::layers.10.attention.wq.weight : rmse 0.00091798, maxerr 0.00944328, 95pct<0.0018, median<0.0008 q4_1::layers.10.attention.wk.weight : rmse 0.00092796, maxerr 0.00902557, 95pct<0.0020, median<0.0008 q4_1::layers.10.attention.wv.weight : rmse 0.00059499, maxerr 0.00408936, 
95pct<0.0012, median<0.0006 q4_1::layers.10.attention.wo.weight : rmse 0.00059391, maxerr 0.01306152, 95pct<0.0012, median<0.0006 q4_1::layers.10.feed_forward.w1.weight : rmse 0.00075719, maxerr 0.00927734, 95pct<0.0014, median<0.0006 q4_1::layers.10.feed_forward.w2.weight : rmse 0.00071889, maxerr 0.01138306, 95pct<0.0014, median<0.0006 q4_1::layers.10.feed_forward.w3.weight : rmse 0.00073179, maxerr 0.00698090, 95pct<0.0014, median<0.0006 q4_1::layers.10.attention_norm.weight : rmse 0.00081416, maxerr 0.00582886, 95pct<0.0018, median<0.0004 q4_1::layers.10.ffn_norm.weight : rmse 0.00037315, maxerr 0.00198364, 95pct<0.0010, median<0.0002 q4_1::layers.11.attention.wq.weight : rmse 0.00094457, maxerr 0.01239014, 95pct<0.0020, median<0.0008 q4_1::layers.11.attention.wk.weight : rmse 0.00095987, maxerr 0.00829315, 95pct<0.0020, median<0.0008 q4_1::layers.11.attention.wv.weight : rmse 0.00062563, maxerr 0.00389481, 95pct<0.0012, median<0.0006 q4_1::layers.11.attention.wo.weight : rmse 0.00062213, maxerr 0.01336670, 95pct<0.0012, median<0.0006 q4_1::layers.11.feed_forward.w1.weight : rmse 0.00075509, maxerr 0.00779724, 95pct<0.0014, median<0.0006 q4_1::layers.11.feed_forward.w2.weight : rmse 0.00072435, maxerr 0.01480103, 95pct<0.0014, median<0.0006 q4_1::layers.11.feed_forward.w3.weight : rmse 0.00073620, maxerr 0.00790024, 95pct<0.0014, median<0.0006 q4_1::layers.11.attention_norm.weight : rmse 0.00081532, maxerr 0.00553894, 95pct<0.0020, median<0.0004 q4_1::layers.11.ffn_norm.weight : rmse 0.00034090, maxerr 0.00186920, 95pct<0.0008, median<0.0002 q4_1::layers.12.attention.wq.weight : rmse 0.00090180, maxerr 0.01110840, 95pct<0.0018, median<0.0008 q4_1::layers.12.attention.wk.weight : rmse 0.00091626, maxerr 0.00747681, 95pct<0.0020, median<0.0008 q4_1::layers.12.attention.wv.weight : rmse 0.00060456, maxerr 0.00358963, 95pct<0.0012, median<0.0006 q4_1::layers.12.attention.wo.weight : rmse 0.00061119, maxerr 0.00909424, 95pct<0.0012, median<0.0006 q4_1::layers.12.feed_forward.w1.weight : rmse 0.00075938, maxerr 0.01080322, 95pct<0.0016, median<0.0006 q4_1::layers.12.feed_forward.w2.weight : rmse 0.00072589, maxerr 0.01767731, 95pct<0.0014, median<0.0006 q4_1::layers.12.feed_forward.w3.weight : rmse 0.00074074, maxerr 0.00480270, 95pct<0.0014, median<0.0006 q4_1::layers.12.attention_norm.weight : rmse 0.00079187, maxerr 0.00598145, 95pct<0.0018, median<0.0006 q4_1::layers.12.ffn_norm.weight : rmse 0.00039970, maxerr 0.00243759, 95pct<0.0010, median<0.0002 q4_1::layers.13.attention.wq.weight : rmse 0.00088012, maxerr 0.01113892, 95pct<0.0018, median<0.0006 q4_1::layers.13.attention.wk.weight : rmse 0.00089455, maxerr 0.00920868, 95pct<0.0018, median<0.0006 q4_1::layers.13.attention.wv.weight : rmse 0.00063239, maxerr 0.00388336, 95pct<0.0012, median<0.0006 q4_1::layers.13.attention.wo.weight : rmse 0.00063344, maxerr 0.01577759, 95pct<0.0012, median<0.0006 q4_1::layers.13.feed_forward.w1.weight : rmse 0.00075605, maxerr 0.00798798, 95pct<0.0014, median<0.0006 q4_1::layers.13.feed_forward.w2.weight : rmse 0.00073209, maxerr 0.01325989, 95pct<0.0014, median<0.0006 q4_1::layers.13.feed_forward.w3.weight : rmse 0.00074790, maxerr 0.00488091, 95pct<0.0014, median<0.0006 q4_1::layers.13.attention_norm.weight : rmse 0.00073977, maxerr 0.00601959, 95pct<0.0014, median<0.0004 q4_1::layers.13.ffn_norm.weight : rmse 0.00043181, maxerr 0.00242615, 95pct<0.0010, median<0.0002 q4_1::layers.14.attention.wq.weight : rmse 0.00088644, maxerr 0.00962925, 95pct<0.0018, median<0.0008 
q4_1::layers.14.attention.wk.weight : rmse 0.00089396, maxerr 0.00738525, 95pct<0.0018, median<0.0008 q4_1::layers.14.attention.wv.weight : rmse 0.00063632, maxerr 0.00393677, 95pct<0.0012, median<0.0006 q4_1::layers.14.attention.wo.weight : rmse 0.00063472, maxerr 0.01136780, 95pct<0.0012, median<0.0006 q4_1::layers.14.feed_forward.w1.weight : rmse 0.00075503, maxerr 0.00747681, 95pct<0.0014, median<0.0006 q4_1::layers.14.feed_forward.w2.weight : rmse 0.00073546, maxerr 0.01710510, 95pct<0.0014, median<0.0006 q4_1::layers.14.feed_forward.w3.weight : rmse 0.00075066, maxerr 0.00717163, 95pct<0.0014, median<0.0006 q4_1::layers.14.attention_norm.weight : rmse 0.00069518, maxerr 0.00553131, 95pct<0.0014, median<0.0004 q4_1::layers.14.ffn_norm.weight : rmse 0.00042802, maxerr 0.00244141, 95pct<0.0010, median<0.0002 q4_1::layers.15.attention.wq.weight : rmse 0.00088594, maxerr 0.01150513, 95pct<0.0018, median<0.0006 q4_1::layers.15.attention.wk.weight : rmse 0.00090233, maxerr 0.00720215, 95pct<0.0018, median<0.0006 q4_1::layers.15.attention.wv.weight : rmse 0.00063824, maxerr 0.00362396, 95pct<0.0012, median<0.0006 q4_1::layers.15.attention.wo.weight : rmse 0.00063629, maxerr 0.01213074, 95pct<0.0012, median<0.0006 q4_1::layers.15.feed_forward.w1.weight : rmse 0.00075536, maxerr 0.00686646, 95pct<0.0014, median<0.0006 q4_1::layers.15.feed_forward.w2.weight : rmse 0.00073590, maxerr 0.02175903, 95pct<0.0014, median<0.0006 q4_1::layers.15.feed_forward.w3.weight : rmse 0.00075102, maxerr 0.00634384, 95pct<0.0014, median<0.0006 q4_1::layers.15.attention_norm.weight : rmse 0.00072914, maxerr 0.00546265, 95pct<0.0016, median<0.0004 q4_1::layers.15.ffn_norm.weight : rmse 0.00046546, maxerr 0.00308990, 95pct<0.0012, median<0.0002 q4_1::layers.16.attention.wq.weight : rmse 0.00087387, maxerr 0.01208496, 95pct<0.0018, median<0.0008 q4_1::layers.16.attention.wk.weight : rmse 0.00089641, maxerr 0.00708771, 95pct<0.0018, median<0.0008 q4_1::layers.16.attention.wv.weight : rmse 0.00067898, maxerr 0.00372696, 95pct<0.0014, median<0.0006 q4_1::layers.16.attention.wo.weight : rmse 0.00067446, maxerr 0.01918030, 95pct<0.0014, median<0.0006 q4_1::layers.16.feed_forward.w1.weight : rmse 0.00076073, maxerr 0.00684357, 95pct<0.0016, median<0.0006 q4_1::layers.16.feed_forward.w2.weight : rmse 0.00073517, maxerr 0.01637268, 95pct<0.0014, median<0.0006 q4_1::layers.16.feed_forward.w3.weight : rmse 0.00074853, maxerr 0.00637054, 95pct<0.0014, median<0.0006 q4_1::layers.16.attention_norm.weight : rmse 0.00073561, maxerr 0.00589752, 95pct<0.0016, median<0.0004 q4_1::layers.16.ffn_norm.weight : rmse 0.00048578, maxerr 0.00316620, 95pct<0.0012, median<0.0002 q4_1::layers.17.attention.wq.weight : rmse 0.00085523, maxerr 0.01348877, 95pct<0.0018, median<0.0006 q4_1::layers.17.attention.wk.weight : rmse 0.00087332, maxerr 0.00699615, 95pct<0.0018, median<0.0006 q4_1::layers.17.attention.wv.weight : rmse 0.00068333, maxerr 0.00382996, 95pct<0.0014, median<0.0006 q4_1::layers.17.attention.wo.weight : rmse 0.00068313, maxerr 0.01434326, 95pct<0.0014, median<0.0006 q4_1::layers.17.feed_forward.w1.weight : rmse 0.00076232, maxerr 0.00566864, 95pct<0.0016, median<0.0006 q4_1::layers.17.feed_forward.w2.weight : rmse 0.00073823, maxerr 0.01303101, 95pct<0.0014, median<0.0006 q4_1::layers.17.feed_forward.w3.weight : rmse 0.00075083, maxerr 0.00892830, 95pct<0.0014, median<0.0006 q4_1::layers.17.attention_norm.weight : rmse 0.00066482, maxerr 0.00503540, 95pct<0.0014, median<0.0004 q4_1::layers.17.ffn_norm.weight : rmse 0.00051143, 
maxerr 0.00348663, 95pct<0.0012, median<0.0002 q4_1::layers.18.attention.wq.weight : rmse 0.00084565, maxerr 0.01126099, 95pct<0.0018, median<0.0006 q4_1::layers.18.attention.wk.weight : rmse 0.00085699, maxerr 0.00658417, 95pct<0.0018, median<0.0006 q4_1::layers.18.attention.wv.weight : rmse 0.00068148, maxerr 0.00400162, 95pct<0.0014, median<0.0006 q4_1::layers.18.attention.wo.weight : rmse 0.00068093, maxerr 0.01626587, 95pct<0.0014, median<0.0006 q4_1::layers.18.feed_forward.w1.weight : rmse 0.00076836, maxerr 0.00660706, 95pct<0.0016, median<0.0006 q4_1::layers.18.feed_forward.w2.weight : rmse 0.00073702, maxerr 0.01779175, 95pct<0.0014, median<0.0006 q4_1::layers.18.feed_forward.w3.weight : rmse 0.00074808, maxerr 0.00534439, 95pct<0.0014, median<0.0006 q4_1::layers.18.attention_norm.weight : rmse 0.00075162, maxerr 0.00507355, 95pct<0.0016, median<0.0004 q4_1::layers.18.ffn_norm.weight : rmse 0.00060263, maxerr 0.00396729, 95pct<0.0014, median<0.0002 q4_1::layers.19.attention.wq.weight : rmse 0.00083017, maxerr 0.01272583, 95pct<0.0018, median<0.0006 q4_1::layers.19.attention.wk.weight : rmse 0.00084088, maxerr 0.00719452, 95pct<0.0018, median<0.0006 q4_1::layers.19.attention.wv.weight : rmse 0.00071467, maxerr 0.00441360, 95pct<0.0014, median<0.0006 q4_1::layers.19.attention.wo.weight : rmse 0.00070940, maxerr 0.01687622, 95pct<0.0014, median<0.0006 q4_1::layers.19.feed_forward.w1.weight : rmse 0.00077282, maxerr 0.00881958, 95pct<0.0016, median<0.0006 q4_1::layers.19.feed_forward.w2.weight : rmse 0.00073796, maxerr 0.01264954, 95pct<0.0014, median<0.0006 q4_1::layers.19.feed_forward.w3.weight : rmse 0.00074646, maxerr 0.00630951, 95pct<0.0014, median<0.0006 q4_1::layers.19.attention_norm.weight : rmse 0.00077662, maxerr 0.00615692, 95pct<0.0016, median<0.0004 q4_1::layers.19.ffn_norm.weight : rmse 0.00062620, maxerr 0.00425720, 95pct<0.0016, median<0.0004 q4_1::layers.20.attention.wq.weight : rmse 0.00084113, maxerr 0.01691055, 95pct<0.0018, median<0.0006 q4_1::layers.20.attention.wk.weight : rmse 0.00085405, maxerr 0.00733185, 95pct<0.0018, median<0.0006 q4_1::layers.20.attention.wv.weight : rmse 0.00073687, maxerr 0.00400925, 95pct<0.0014, median<0.0006 q4_1::layers.20.attention.wo.weight : rmse 0.00072742, maxerr 0.01225281, 95pct<0.0014, median<0.0006 q4_1::layers.20.feed_forward.w1.weight : rmse 0.00077668, maxerr 0.00712967, 95pct<0.0016, median<0.0006 q4_1::layers.20.feed_forward.w2.weight : rmse 0.00073884, maxerr 0.01961899, 95pct<0.0014, median<0.0006 q4_1::layers.20.feed_forward.w3.weight : rmse 0.00074688, maxerr 0.00490570, 95pct<0.0014, median<0.0006 q4_1::layers.20.attention_norm.weight : rmse 0.00078771, maxerr 0.00583649, 95pct<0.0018, median<0.0004 q4_1::layers.20.ffn_norm.weight : rmse 0.00064492, maxerr 0.00514984, 95pct<0.0016, median<0.0002 q4_1::layers.21.attention.wq.weight : rmse 0.00081337, maxerr 0.01411438, 95pct<0.0016, median<0.0006 q4_1::layers.21.attention.wk.weight : rmse 0.00082242, maxerr 0.00918579, 95pct<0.0018, median<0.0006 q4_1::layers.21.attention.wv.weight : rmse 0.00074255, maxerr 0.00424957, 95pct<0.0014, median<0.0006 q4_1::layers.21.attention.wo.weight : rmse 0.00073171, maxerr 0.02182007, 95pct<0.0014, median<0.0006 q4_1::layers.21.feed_forward.w1.weight : rmse 0.00078057, maxerr 0.00685120, 95pct<0.0016, median<0.0006 q4_1::layers.21.feed_forward.w2.weight : rmse 0.00073872, maxerr 0.01074219, 95pct<0.0014, median<0.0006 q4_1::layers.21.feed_forward.w3.weight : rmse 0.00074573, maxerr 0.00585175, 95pct<0.0014, median<0.0006 
q4_1::layers.21.attention_norm.weight : rmse 0.00085819, maxerr 0.00559998, 95pct<0.0018, median<0.0004 q4_1::layers.21.ffn_norm.weight : rmse 0.00070441, maxerr 0.00471497, 95pct<0.0018, median<0.0004 q4_1::layers.22.attention.wq.weight : rmse 0.00082720, maxerr 0.01268435, 95pct<0.0016, median<0.0006 q4_1::layers.22.attention.wk.weight : rmse 0.00083526, maxerr 0.00759888, 95pct<0.0018, median<0.0006 q4_1::layers.22.attention.wv.weight : rmse 0.00073750, maxerr 0.00417137, 95pct<0.0014, median<0.0006 q4_1::layers.22.attention.wo.weight : rmse 0.00073417, maxerr 0.02563477, 95pct<0.0014, median<0.0006 q4_1::layers.22.feed_forward.w1.weight : rmse 0.00078155, maxerr 0.00765991, 95pct<0.0016, median<0.0006 q4_1::layers.22.feed_forward.w2.weight : rmse 0.00074211, maxerr 0.01165771, 95pct<0.0014, median<0.0006 q4_1::layers.22.feed_forward.w3.weight : rmse 0.00074905, maxerr 0.00787354, 95pct<0.0014, median<0.0006 q4_1::layers.22.attention_norm.weight : rmse 0.00080074, maxerr 0.00627136, 95pct<0.0018, median<0.0004 q4_1::layers.22.ffn_norm.weight : rmse 0.00072412, maxerr 0.00478363, 95pct<0.0018, median<0.0004 q4_1::layers.23.attention.wq.weight : rmse 0.00080186, maxerr 0.01232910, 95pct<0.0016, median<0.0006 q4_1::layers.23.attention.wk.weight : rmse 0.00080485, maxerr 0.00798798, 95pct<0.0016, median<0.0006 q4_1::layers.23.attention.wv.weight : rmse 0.00076730, maxerr 0.00468445, 95pct<0.0016, median<0.0006 q4_1::layers.23.attention.wo.weight : rmse 0.00075430, maxerr 0.02246094, 95pct<0.0014, median<0.0006 q4_1::layers.23.feed_forward.w1.weight : rmse 0.00078317, maxerr 0.00979614, 95pct<0.0016, median<0.0006 q4_1::layers.23.feed_forward.w2.weight : rmse 0.00074451, maxerr 0.01216125, 95pct<0.0014, median<0.0006 q4_1::layers.23.feed_forward.w3.weight : rmse 0.00075047, maxerr 0.00660706, 95pct<0.0014, median<0.0006 q4_1::layers.23.attention_norm.weight : rmse 0.00099103, maxerr 0.00698853, 95pct<0.0018, median<0.0006 q4_1::layers.23.ffn_norm.weight : rmse 0.00077689, maxerr 0.00527191, 95pct<0.0020, median<0.0004 q4_1::layers.24.attention.wq.weight : rmse 0.00080229, maxerr 0.01250458, 95pct<0.0016, median<0.0006 q4_1::layers.24.attention.wk.weight : rmse 0.00080784, maxerr 0.00827789, 95pct<0.0016, median<0.0006 q4_1::layers.24.attention.wv.weight : rmse 0.00077617, maxerr 0.00516891, 95pct<0.0016, median<0.0006 q4_1::layers.24.attention.wo.weight : rmse 0.00076378, maxerr 0.01571655, 95pct<0.0014, median<0.0006 q4_1::layers.24.feed_forward.w1.weight : rmse 0.00078416, maxerr 0.00661850, 95pct<0.0016, median<0.0006 q4_1::layers.24.feed_forward.w2.weight : rmse 0.00074811, maxerr 0.01764297, 95pct<0.0014, median<0.0006 q4_1::layers.24.feed_forward.w3.weight : rmse 0.00075464, maxerr 0.00608063, 95pct<0.0014, median<0.0006 q4_1::layers.24.attention_norm.weight : rmse 0.00103406, maxerr 0.00711060, 95pct<0.0022, median<0.0004 q4_1::layers.24.ffn_norm.weight : rmse 0.00080069, maxerr 0.00546265, 95pct<0.0020, median<0.0004 q4_1::layers.25.attention.wq.weight : rmse 0.00082454, maxerr 0.01164246, 95pct<0.0016, median<0.0006 q4_1::layers.25.attention.wk.weight : rmse 0.00083484, maxerr 0.00672913, 95pct<0.0016, median<0.0006 q4_1::layers.25.attention.wv.weight : rmse 0.00077909, maxerr 0.00453949, 95pct<0.0016, median<0.0006 q4_1::layers.25.attention.wo.weight : rmse 0.00077046, maxerr 0.01708984, 95pct<0.0016, median<0.0006 q4_1::layers.25.feed_forward.w1.weight : rmse 0.00078604, maxerr 0.00747681, 95pct<0.0016, median<0.0006 q4_1::layers.25.feed_forward.w2.weight : rmse 0.00075059, maxerr 
0.01087952, 95pct<0.0014, median<0.0006 q4_1::layers.25.feed_forward.w3.weight : rmse 0.00075747, maxerr 0.00479889, 95pct<0.0014, median<0.0006 q4_1::layers.25.attention_norm.weight : rmse 0.00097833, maxerr 0.00801086, 95pct<0.0020, median<0.0004 q4_1::layers.25.ffn_norm.weight : rmse 0.00076633, maxerr 0.00572205, 95pct<0.0018, median<0.0004 q4_1::layers.26.attention.wq.weight : rmse 0.00081107, maxerr 0.01138306, 95pct<0.0016, median<0.0006 q4_1::layers.26.attention.wk.weight : rmse 0.00082083, maxerr 0.00728607, 95pct<0.0016, median<0.0006 q4_1::layers.26.attention.wv.weight : rmse 0.00080224, maxerr 0.00533295, 95pct<0.0016, median<0.0008 q4_1::layers.26.attention.wo.weight : rmse 0.00079334, maxerr 0.01106262, 95pct<0.0016, median<0.0006 q4_1::layers.26.feed_forward.w1.weight : rmse 0.00078547, maxerr 0.00971985, 95pct<0.0016, median<0.0006 q4_1::layers.26.feed_forward.w2.weight : rmse 0.00075533, maxerr 0.01620483, 95pct<0.0014, median<0.0006 q4_1::layers.26.feed_forward.w3.weight : rmse 0.00076290, maxerr 0.01182556, 95pct<0.0014, median<0.0006 q4_1::layers.26.attention_norm.weight : rmse 0.00098486, maxerr 0.00781250, 95pct<0.0022, median<0.0004 q4_1::layers.26.ffn_norm.weight : rmse 0.00075499, maxerr 0.00534821, 95pct<0.0018, median<0.0002 q4_1::layers.27.attention.wq.weight : rmse 0.00081180, maxerr 0.01181030, 95pct<0.0016, median<0.0006 q4_1::layers.27.attention.wk.weight : rmse 0.00081854, maxerr 0.00698853, 95pct<0.0016, median<0.0006 q4_1::layers.27.attention.wv.weight : rmse 0.00081606, maxerr 0.00493622, 95pct<0.0016, median<0.0008 q4_1::layers.27.attention.wo.weight : rmse 0.00081198, maxerr 0.02349854, 95pct<0.0016, median<0.0008 q4_1::layers.27.feed_forward.w1.weight : rmse 0.00078540, maxerr 0.01211548, 95pct<0.0016, median<0.0006 q4_1::layers.27.feed_forward.w2.weight : rmse 0.00075951, maxerr 0.01426315, 95pct<0.0014, median<0.0006 q4_1::layers.27.feed_forward.w3.weight : rmse 0.00076620, maxerr 0.01175487, 95pct<0.0014, median<0.0006 q4_1::layers.27.attention_norm.weight : rmse 0.00097933, maxerr 0.00705719, 95pct<0.0022, median<0.0004 q4_1::layers.27.ffn_norm.weight : rmse 0.00077453, maxerr 0.00610352, 95pct<0.0018, median<0.0004 q4_1::layers.28.attention.wq.weight : rmse 0.00079531, maxerr 0.01141357, 95pct<0.0016, median<0.0006 q4_1::layers.28.attention.wk.weight : rmse 0.00079968, maxerr 0.00871277, 95pct<0.0016, median<0.0006 q4_1::layers.28.attention.wv.weight : rmse 0.00082040, maxerr 0.00494003, 95pct<0.0016, median<0.0008 q4_1::layers.28.attention.wo.weight : rmse 0.00082139, maxerr 0.01425171, 95pct<0.0016, median<0.0008 q4_1::layers.28.feed_forward.w1.weight : rmse 0.00078189, maxerr 0.01318359, 95pct<0.0016, median<0.0006 q4_1::layers.28.feed_forward.w2.weight : rmse 0.00076300, maxerr 0.01531982, 95pct<0.0014, median<0.0006 q4_1::layers.28.feed_forward.w3.weight : rmse 0.00076984, maxerr 0.01034546, 95pct<0.0016, median<0.0006 q4_1::layers.28.attention_norm.weight : rmse 0.00110201, maxerr 0.00842285, 95pct<0.0022, median<0.0004 q4_1::layers.28.ffn_norm.weight : rmse 0.00073544, maxerr 0.00648499, 95pct<0.0016, median<0.0004 q4_1::layers.29.attention.wq.weight : rmse 0.00078770, maxerr 0.01138306, 95pct<0.0016, median<0.0006 q4_1::layers.29.attention.wk.weight : rmse 0.00079442, maxerr 0.00746536, 95pct<0.0016, median<0.0006 q4_1::layers.29.attention.wv.weight : rmse 0.00084494, maxerr 0.00521088, 95pct<0.0016, median<0.0008 q4_1::layers.29.attention.wo.weight : rmse 0.00084553, maxerr 0.01846313, 95pct<0.0016, median<0.0008 
q4_1::layers.29.feed_forward.w1.weight : rmse 0.00078347, maxerr 0.00897980, 95pct<0.0016, median<0.0006 q4_1::layers.29.feed_forward.w2.weight : rmse 0.00076542, maxerr 0.02725220, 95pct<0.0014, median<0.0006 q4_1::layers.29.feed_forward.w3.weight : rmse 0.00077436, maxerr 0.00727081, 95pct<0.0016, median<0.0006 q4_1::layers.29.attention_norm.weight : rmse 0.00114305, maxerr 0.00759888, 95pct<0.0024, median<0.0004 q4_1::layers.29.ffn_norm.weight : rmse 0.00069918, maxerr 0.00627136, 95pct<0.0014, median<0.0004 q4_1::layers.30.attention.wq.weight : rmse 0.00079605, maxerr 0.01003265, 95pct<0.0016, median<0.0006 q4_1::layers.30.attention.wk.weight : rmse 0.00080232, maxerr 0.00825500, 95pct<0.0016, median<0.0006 q4_1::layers.30.attention.wv.weight : rmse 0.00083372, maxerr 0.00512314, 95pct<0.0016, median<0.0008 q4_1::layers.30.attention.wo.weight : rmse 0.00084424, maxerr 0.02038574, 95pct<0.0016, median<0.0008 q4_1::layers.30.feed_forward.w1.weight : rmse 0.00078815, maxerr 0.00737762, 95pct<0.0016, median<0.0006 q4_1::layers.30.feed_forward.w2.weight : rmse 0.00078960, maxerr 0.05273438, 95pct<0.0014, median<0.0006 q4_1::layers.30.feed_forward.w3.weight : rmse 0.00078154, maxerr 0.01219177, 95pct<0.0016, median<0.0006 q4_1::layers.30.attention_norm.weight : rmse 0.00124695, maxerr 0.00775146, 95pct<0.0028, median<0.0006 q4_1::layers.30.ffn_norm.weight : rmse 0.00072293, maxerr 0.00625610, 95pct<0.0016, median<0.0004 q4_1::layers.31.attention.wq.weight : rmse 0.00080764, maxerr 0.00848770, 95pct<0.0016, median<0.0006 q4_1::layers.31.attention.wk.weight : rmse 0.00082763, maxerr 0.00734711, 95pct<0.0016, median<0.0006 q4_1::layers.31.attention.wv.weight : rmse 0.00075143, maxerr 0.00494766, 95pct<0.0014, median<0.0006 q4_1::layers.31.attention.wo.weight : rmse 0.00076344, maxerr 0.05163574, 95pct<0.0014, median<0.0006 q4_1::layers.31.feed_forward.w1.weight : rmse 0.00082211, maxerr 0.01235199, 95pct<0.0016, median<0.0008 q4_1::layers.31.feed_forward.w2.weight : rmse 0.00078318, maxerr 0.04278564, 95pct<0.0016, median<0.0006 q4_1::layers.31.feed_forward.w3.weight : rmse 0.00081328, maxerr 0.01448441, 95pct<0.0016, median<0.0008 q4_1::layers.31.attention_norm.weight : rmse 0.00141233, maxerr 0.00952148, 95pct<0.0028, median<0.0008 q4_1::layers.31.ffn_norm.weight : rmse 0.00107585, maxerr 0.00598907, 95pct<0.0024, median<0.0008 q4_1 : rmse 0.00076131, maxerr 0.05273438, 95pct<0.0016, median<0.0006

main: total time = 391533.50 ms

Update 5

Added a Q4_0-like quantization scheme that ends up with an RMSE of ~0.00159 (so basically the same as the best Q4_1). Basically, we split each group of 32 weights into two groups of 16 and quantize these separately. We store the two scaling factors as fp16, so we end up using exactly the same amount of memory as the current Q4_0 (5 bits per weight); see the sketch below. A round trip of quantization and dequantization results in

rmse 0.00159265, maxerr 0.17480469, 95pct<0.0030, median<0.0012
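As above, a minimal sketch of what this split-scale layout could look like; the names (`block_q4_0_split`, `dequantize_q4s_one`) are hypothetical, the packing order of the two groups is an assumption, and fp16-to-float conversion of the scales is omitted.

```c
#include <stdint.h>

#define QK 32  /* weights per block */

/* Hypothetical split-scale Q4_0 block: 20 bytes per 32 weights, i.e. the
 * same size as the current Q4_0 block (one fp32 scale + 16 nibble bytes),
 * but with an independent fp16 scale for each group of 16 weights. */
typedef struct {
    uint16_t d[2];       /* fp16 scale of the first/second group of 16 */
    uint8_t  qs[QK / 2]; /* 4-bit quants, two weights per byte         */
} block_q4_0_split;

/* Reconstruct weight i as x = d * (q - 8), Q4_0 style, with the scale
 * chosen by the group the weight belongs to. d0 and d1 are the two fp16
 * scales already converted to float. */
static float dequantize_q4s_one(const block_q4_0_split * b, int i, float d0, float d1) {
    const uint8_t nib = (i & 1) ? (b->qs[i / 2] >> 4) : (b->qs[i / 2] & 0x0F);
    return ((i < QK / 2) ? d0 : d1) * ((int)nib - 8);
}
```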
iwan@MacBook-Pro:~/other/llama.cpp/build$ ./bin/quantize-stats -m ../../quant/models/7B/ggml-model-f16-new.bin -nq -p Loading model llama.cpp: loading model from ../../quant/models/7B/ggml-model-f16-new.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 256 llama_model_load_internal: n_embd = 4096 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 32 llama_model_load_internal: n_layer = 32 llama_model_load_internal: n_rot = 128 llama_model_load_internal: f16 = 1 llama_model_load_internal: n_ff = 11008 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 7B llama_model_load_internal: ggml ctx size = 59.11 KB llama_model_load_internal: mem required = 14645.07 MB (+ 2052.00 MB per state) llama_init_from_file: kv self size = 256.00 MB note: source model is f16 testing 291 layers with max size 131072000 q4_0::tok_embeddings.weight : rmse 0.00140924, maxerr 0.01599121, 95pct<0.0028, median<0.0012 q4_0::norm.weight : rmse 0.05220819, maxerr 0.17480469, 95pct<0.0300, median<0.0300 q4_0::output.weight : rmse 0.00141680, maxerr 0.02381897, 95pct<0.0028, median<0.0012 q4_0::layers.0.attention.wq.weight : rmse 0.00245018, maxerr 0.04733276, 95pct<0.0052, median<0.0014 q4_0::layers.0.attention.wk.weight : rmse 0.00240798, maxerr 0.06744385, 95pct<0.0050, median<0.0014 q4_0::layers.0.attention.wv.weight : rmse 0.00095200, maxerr 0.00719070, 95pct<0.0020, median<0.0008 q4_0::layers.0.attention.wo.weight : rmse 0.00085718, maxerr 0.03448486, 95pct<0.0018, median<0.0006 q4_0::layers.0.feed_forward.w1.weight : rmse 0.00118121, maxerr 0.06396484, 95pct<0.0022, median<0.0010 q4_0::layers.0.feed_forward.w2.weight : rmse 0.00143914, maxerr 0.04794312, 95pct<0.0028, median<0.0012 q4_0::layers.0.feed_forward.w3.weight : rmse 0.00114520, maxerr 0.01708984, 95pct<0.0022, median<0.0010 q4_0::layers.0.attention_norm.weight : rmse 0.00726788, maxerr 0.04315186, 95pct<0.0202, median<0.0014 q4_0::layers.0.ffn_norm.weight : rmse 0.00297167, maxerr 0.01489258, 95pct<0.0052, median<0.0024 q4_0::layers.1.attention.wq.weight : rmse 0.00236875, maxerr 0.03283691, 95pct<0.0050, median<0.0014 q4_0::layers.1.attention.wk.weight : rmse 0.00241936, maxerr 0.03659058, 95pct<0.0052, median<0.0014 q4_0::layers.1.attention.wv.weight : rmse 0.00081523, maxerr 0.00643539, 95pct<0.0018, median<0.0006 q4_0::layers.1.attention.wo.weight : rmse 0.00083170, maxerr 0.03112793, 95pct<0.0018, median<0.0006 q4_0::layers.1.feed_forward.w1.weight : rmse 0.00150114, maxerr 0.03445435, 95pct<0.0028, median<0.0012 q4_0::layers.1.feed_forward.w2.weight : rmse 0.00147360, maxerr 0.06106567, 95pct<0.0028, median<0.0012 q4_0::layers.1.feed_forward.w3.weight : rmse 0.00142470, maxerr 0.02111816, 95pct<0.0028, median<0.0012 q4_0::layers.1.attention_norm.weight : rmse 0.00535747, maxerr 0.01843262, 95pct<0.0100, median<0.0040 q4_0::layers.1.ffn_norm.weight : rmse 0.00342054, maxerr 0.00836182, 95pct<0.0062, median<0.0028 q4_0::layers.2.attention.wq.weight : rmse 0.00256614, maxerr 0.03411865, 95pct<0.0052, median<0.0018 q4_0::layers.2.attention.wk.weight : rmse 0.00266028, maxerr 0.03179932, 95pct<0.0056, median<0.0016 q4_0::layers.2.attention.wv.weight : rmse 0.00097341, maxerr 0.00904083, 95pct<0.0020, median<0.0008 q4_0::layers.2.attention.wo.weight : rmse 0.00099014, maxerr 0.03747559, 95pct<0.0020, median<0.0008 q4_0::layers.2.feed_forward.w1.weight : rmse 0.00158309, maxerr 0.03842163, 95pct<0.0030, median<0.0012 
q4_0::layers.2.feed_forward.w2.weight : rmse 0.00146792, maxerr 0.06268311, 95pct<0.0028, median<0.0012 q4_0::layers.2.feed_forward.w3.weight : rmse 0.00144345, maxerr 0.02966309, 95pct<0.0028, median<0.0012 q4_0::layers.2.attention_norm.weight : rmse 0.00487176, maxerr 0.01904297, 95pct<0.0086, median<0.0038 q4_0::layers.2.ffn_norm.weight : rmse 0.00370989, maxerr 0.01013184, 95pct<0.0072, median<0.0028 q4_0::layers.3.attention.wq.weight : rmse 0.00210723, maxerr 0.04635620, 95pct<0.0042, median<0.0014 q4_0::layers.3.attention.wk.weight : rmse 0.00221153, maxerr 0.02807617, 95pct<0.0046, median<0.0014 q4_0::layers.3.attention.wv.weight : rmse 0.00115453, maxerr 0.00730133, 95pct<0.0022, median<0.0010 q4_0::layers.3.attention.wo.weight : rmse 0.00115204, maxerr 0.03686523, 95pct<0.0022, median<0.0010 q4_0::layers.3.feed_forward.w1.weight : rmse 0.00160014, maxerr 0.02883911, 95pct<0.0030, median<0.0012 q4_0::layers.3.feed_forward.w2.weight : rmse 0.00147493, maxerr 0.05181885, 95pct<0.0028, median<0.0012 q4_0::layers.3.feed_forward.w3.weight : rmse 0.00147109, maxerr 0.02159119, 95pct<0.0028, median<0.0012 q4_0::layers.3.attention_norm.weight : rmse 0.00675907, maxerr 0.02307129, 95pct<0.0122, median<0.0052 q4_0::layers.3.ffn_norm.weight : rmse 0.00368094, maxerr 0.01193237, 95pct<0.0074, median<0.0026 q4_0::layers.4.attention.wq.weight : rmse 0.00214453, maxerr 0.04370117, 95pct<0.0042, median<0.0016 q4_0::layers.4.attention.wk.weight : rmse 0.00216904, maxerr 0.02233887, 95pct<0.0044, median<0.0014 q4_0::layers.4.attention.wv.weight : rmse 0.00115284, maxerr 0.00891113, 95pct<0.0022, median<0.0010 q4_0::layers.4.attention.wo.weight : rmse 0.00115254, maxerr 0.03320312, 95pct<0.0022, median<0.0010 q4_0::layers.4.feed_forward.w1.weight : rmse 0.00162055, maxerr 0.03872681, 95pct<0.0032, median<0.0012 q4_0::layers.4.feed_forward.w2.weight : rmse 0.00146983, maxerr 0.04882812, 95pct<0.0028, median<0.0012 q4_0::layers.4.feed_forward.w3.weight : rmse 0.00147666, maxerr 0.03454590, 95pct<0.0028, median<0.0012 q4_0::layers.4.attention_norm.weight : rmse 0.00759098, maxerr 0.02697754, 95pct<0.0138, median<0.0060 q4_0::layers.4.ffn_norm.weight : rmse 0.00411936, maxerr 0.01220703, 95pct<0.0086, median<0.0030 q4_0::layers.5.attention.wq.weight : rmse 0.00203095, maxerr 0.04281616, 95pct<0.0040, median<0.0014 q4_0::layers.5.attention.wk.weight : rmse 0.00204840, maxerr 0.03198242, 95pct<0.0042, median<0.0014 q4_0::layers.5.attention.wv.weight : rmse 0.00117341, maxerr 0.01441193, 95pct<0.0024, median<0.0010 q4_0::layers.5.attention.wo.weight : rmse 0.00116686, maxerr 0.03796387, 95pct<0.0022, median<0.0010 q4_0::layers.5.feed_forward.w1.weight : rmse 0.00165397, maxerr 0.03155518, 95pct<0.0032, median<0.0014 q4_0::layers.5.feed_forward.w2.weight : rmse 0.00145557, maxerr 0.03903198, 95pct<0.0028, median<0.0012 q4_0::layers.5.feed_forward.w3.weight : rmse 0.00147068, maxerr 0.02502441, 95pct<0.0028, median<0.0012 q4_0::layers.5.attention_norm.weight : rmse 0.00806252, maxerr 0.03417969, 95pct<0.0154, median<0.0058 q4_0::layers.5.ffn_norm.weight : rmse 0.00421656, maxerr 0.01330566, 95pct<0.0086, median<0.0030 q4_0::layers.6.attention.wq.weight : rmse 0.00204956, maxerr 0.04840088, 95pct<0.0042, median<0.0014 q4_0::layers.6.attention.wk.weight : rmse 0.00209686, maxerr 0.01791382, 95pct<0.0042, median<0.0014 q4_0::layers.6.attention.wv.weight : rmse 0.00118219, maxerr 0.00815582, 95pct<0.0024, median<0.0010 q4_0::layers.6.attention.wo.weight : rmse 0.00117915, maxerr 0.03329468, 95pct<0.0024, 
median<0.0010 q4_0::layers.6.feed_forward.w1.weight : rmse 0.00163119, maxerr 0.03842163, 95pct<0.0032, median<0.0012 q4_0::layers.6.feed_forward.w2.weight : rmse 0.00147030, maxerr 0.04553223, 95pct<0.0028, median<0.0012 q4_0::layers.6.feed_forward.w3.weight : rmse 0.00148939, maxerr 0.02365112, 95pct<0.0028, median<0.0012 q4_0::layers.6.attention_norm.weight : rmse 0.00813445, maxerr 0.03381348, 95pct<0.0154, median<0.0058 q4_0::layers.6.ffn_norm.weight : rmse 0.00451214, maxerr 0.01354980, 95pct<0.0090, median<0.0032 q4_0::layers.7.attention.wq.weight : rmse 0.00201399, maxerr 0.04580688, 95pct<0.0040, median<0.0014 q4_0::layers.7.attention.wk.weight : rmse 0.00203167, maxerr 0.02160645, 95pct<0.0042, median<0.0014 q4_0::layers.7.attention.wv.weight : rmse 0.00121886, maxerr 0.00846863, 95pct<0.0024, median<0.0010 q4_0::layers.7.attention.wo.weight : rmse 0.00120427, maxerr 0.02935791, 95pct<0.0024, median<0.0010 q4_0::layers.7.feed_forward.w1.weight : rmse 0.00161576, maxerr 0.02743530, 95pct<0.0032, median<0.0012 q4_0::layers.7.feed_forward.w2.weight : rmse 0.00147621, maxerr 0.04199219, 95pct<0.0028, median<0.0012 q4_0::layers.7.feed_forward.w3.weight : rmse 0.00149551, maxerr 0.02548218, 95pct<0.0028, median<0.0012 q4_0::layers.7.attention_norm.weight : rmse 0.00890861, maxerr 0.02636719, 95pct<0.0164, median<0.0066 q4_0::layers.7.ffn_norm.weight : rmse 0.00473217, maxerr 0.01367188, 95pct<0.0094, median<0.0032 q4_0::layers.8.attention.wq.weight : rmse 0.00198212, maxerr 0.04116821, 95pct<0.0040, median<0.0014 q4_0::layers.8.attention.wk.weight : rmse 0.00198218, maxerr 0.02270508, 95pct<0.0040, median<0.0014 q4_0::layers.8.attention.wv.weight : rmse 0.00120661, maxerr 0.00962830, 95pct<0.0024, median<0.0010 q4_0::layers.8.attention.wo.weight : rmse 0.00119881, maxerr 0.02880859, 95pct<0.0024, median<0.0010 q4_0::layers.8.feed_forward.w1.weight : rmse 0.00161643, maxerr 0.02868652, 95pct<0.0032, median<0.0012 q4_0::layers.8.feed_forward.w2.weight : rmse 0.00147686, maxerr 0.03305054, 95pct<0.0028, median<0.0012 q4_0::layers.8.feed_forward.w3.weight : rmse 0.00149971, maxerr 0.02107239, 95pct<0.0028, median<0.0012 q4_0::layers.8.attention_norm.weight : rmse 0.00945429, maxerr 0.03265381, 95pct<0.0172, median<0.0072 q4_0::layers.8.ffn_norm.weight : rmse 0.00499407, maxerr 0.01416016, 95pct<0.0100, median<0.0036 q4_0::layers.9.attention.wq.weight : rmse 0.00191681, maxerr 0.03723145, 95pct<0.0038, median<0.0014 q4_0::layers.9.attention.wk.weight : rmse 0.00193005, maxerr 0.01721191, 95pct<0.0040, median<0.0014 q4_0::layers.9.attention.wv.weight : rmse 0.00119795, maxerr 0.00854492, 95pct<0.0024, median<0.0010 q4_0::layers.9.attention.wo.weight : rmse 0.00118937, maxerr 0.03378296, 95pct<0.0024, median<0.0010 q4_0::layers.9.feed_forward.w1.weight : rmse 0.00159322, maxerr 0.03854370, 95pct<0.0030, median<0.0012 q4_0::layers.9.feed_forward.w2.weight : rmse 0.00148821, maxerr 0.04580688, 95pct<0.0028, median<0.0012 q4_0::layers.9.feed_forward.w3.weight : rmse 0.00150940, maxerr 0.04849243, 95pct<0.0030, median<0.0012 q4_0::layers.9.attention_norm.weight : rmse 0.01032341, maxerr 0.03320312, 95pct<0.0186, median<0.0080 q4_0::layers.9.ffn_norm.weight : rmse 0.00516424, maxerr 0.01367188, 95pct<0.0104, median<0.0036 q4_0::layers.10.attention.wq.weight : rmse 0.00191860, maxerr 0.03503418, 95pct<0.0038, median<0.0014 q4_0::layers.10.attention.wk.weight : rmse 0.00194014, maxerr 0.01864624, 95pct<0.0040, median<0.0014 q4_0::layers.10.attention.wv.weight : rmse 0.00124297, maxerr 0.01520538, 
95pct<0.0024, median<0.0010 q4_0::layers.10.attention.wo.weight : rmse 0.00123960, maxerr 0.02691650, 95pct<0.0024, median<0.0010 q4_0::layers.10.feed_forward.w1.weight : rmse 0.00158127, maxerr 0.02433777, 95pct<0.0030, median<0.0012 q4_0::layers.10.feed_forward.w2.weight : rmse 0.00150232, maxerr 0.04223633, 95pct<0.0028, median<0.0012 q4_0::layers.10.feed_forward.w3.weight : rmse 0.00152846, maxerr 0.02325439, 95pct<0.0030, median<0.0012 q4_0::layers.10.attention_norm.weight : rmse 0.01016605, maxerr 0.04028320, 95pct<0.0188, median<0.0076 q4_0::layers.10.ffn_norm.weight : rmse 0.00525648, maxerr 0.01490784, 95pct<0.0104, median<0.0038 q4_0::layers.11.attention.wq.weight : rmse 0.00197494, maxerr 0.04412842, 95pct<0.0040, median<0.0014 q4_0::layers.11.attention.wk.weight : rmse 0.00200654, maxerr 0.02586365, 95pct<0.0040, median<0.0014 q4_0::layers.11.attention.wv.weight : rmse 0.00130740, maxerr 0.01310730, 95pct<0.0026, median<0.0010 q4_0::layers.11.attention.wo.weight : rmse 0.00129915, maxerr 0.03100586, 95pct<0.0026, median<0.0010 q4_0::layers.11.feed_forward.w1.weight : rmse 0.00157701, maxerr 0.02528381, 95pct<0.0030, median<0.0012 q4_0::layers.11.feed_forward.w2.weight : rmse 0.00151437, maxerr 0.05520630, 95pct<0.0030, median<0.0012 q4_0::layers.11.feed_forward.w3.weight : rmse 0.00153778, maxerr 0.02384949, 95pct<0.0030, median<0.0012 q4_0::layers.11.attention_norm.weight : rmse 0.00937717, maxerr 0.03637695, 95pct<0.0172, median<0.0070 q4_0::layers.11.ffn_norm.weight : rmse 0.00517565, maxerr 0.01538086, 95pct<0.0104, median<0.0036 q4_0::layers.12.attention.wq.weight : rmse 0.00188511, maxerr 0.03573608, 95pct<0.0038, median<0.0014 q4_0::layers.12.attention.wk.weight : rmse 0.00191483, maxerr 0.02087402, 95pct<0.0040, median<0.0014 q4_0::layers.12.attention.wv.weight : rmse 0.00126328, maxerr 0.00842285, 95pct<0.0024, median<0.0010 q4_0::layers.12.attention.wo.weight : rmse 0.00127600, maxerr 0.02276611, 95pct<0.0024, median<0.0010 q4_0::layers.12.feed_forward.w1.weight : rmse 0.00158607, maxerr 0.03125000, 95pct<0.0030, median<0.0012 q4_0::layers.12.feed_forward.w2.weight : rmse 0.00151711, maxerr 0.05499268, 95pct<0.0030, median<0.0012 q4_0::layers.12.feed_forward.w3.weight : rmse 0.00154733, maxerr 0.01757812, 95pct<0.0030, median<0.0012 q4_0::layers.12.attention_norm.weight : rmse 0.01066279, maxerr 0.03515625, 95pct<0.0196, median<0.0082 q4_0::layers.12.ffn_norm.weight : rmse 0.00526677, maxerr 0.01623535, 95pct<0.0106, median<0.0036 q4_0::layers.13.attention.wq.weight : rmse 0.00183954, maxerr 0.03576660, 95pct<0.0038, median<0.0014 q4_0::layers.13.attention.wk.weight : rmse 0.00186986, maxerr 0.01885986, 95pct<0.0038, median<0.0014 q4_0::layers.13.attention.wv.weight : rmse 0.00132139, maxerr 0.00842285, 95pct<0.0026, median<0.0010 q4_0::layers.13.attention.wo.weight : rmse 0.00132272, maxerr 0.02880859, 95pct<0.0026, median<0.0010 q4_0::layers.13.feed_forward.w1.weight : rmse 0.00157917, maxerr 0.02076721, 95pct<0.0030, median<0.0012 q4_0::layers.13.feed_forward.w2.weight : rmse 0.00152975, maxerr 0.03314209, 95pct<0.0030, median<0.0012 q4_0::layers.13.feed_forward.w3.weight : rmse 0.00156214, maxerr 0.01795959, 95pct<0.0030, median<0.0012 q4_0::layers.13.attention_norm.weight : rmse 0.01080014, maxerr 0.03906250, 95pct<0.0202, median<0.0080 q4_0::layers.13.ffn_norm.weight : rmse 0.00527759, maxerr 0.01580811, 95pct<0.0106, median<0.0036 q4_0::layers.14.attention.wq.weight : rmse 0.00185303, maxerr 0.03887939, 95pct<0.0038, median<0.0014 
q4_0::layers.14.attention.wk.weight : rmse 0.00186891, maxerr 0.01937866, 95pct<0.0038, median<0.0014 q4_0::layers.14.attention.wv.weight : rmse 0.00132941, maxerr 0.00946045, 95pct<0.0026, median<0.0010 q4_0::layers.14.attention.wo.weight : rmse 0.00132513, maxerr 0.02856445, 95pct<0.0026, median<0.0010 q4_0::layers.14.feed_forward.w1.weight : rmse 0.00157694, maxerr 0.02264404, 95pct<0.0030, median<0.0012 q4_0::layers.14.feed_forward.w2.weight : rmse 0.00153703, maxerr 0.05963135, 95pct<0.0030, median<0.0012 q4_0::layers.14.feed_forward.w3.weight : rmse 0.00156792, maxerr 0.02243042, 95pct<0.0030, median<0.0012 q4_0::layers.14.attention_norm.weight : rmse 0.00944150, maxerr 0.03662109, 95pct<0.0186, median<0.0066 q4_0::layers.14.ffn_norm.weight : rmse 0.00526855, maxerr 0.01635742, 95pct<0.0106, median<0.0036 q4_0::layers.15.attention.wq.weight : rmse 0.00185226, maxerr 0.03628540, 95pct<0.0038, median<0.0014 q4_0::layers.15.attention.wk.weight : rmse 0.00188623, maxerr 0.01896667, 95pct<0.0038, median<0.0014 q4_0::layers.15.attention.wv.weight : rmse 0.00133306, maxerr 0.01000977, 95pct<0.0026, median<0.0010 q4_0::layers.15.attention.wo.weight : rmse 0.00132875, maxerr 0.02636719, 95pct<0.0026, median<0.0010 q4_0::layers.15.feed_forward.w1.weight : rmse 0.00157719, maxerr 0.02148438, 95pct<0.0030, median<0.0012 q4_0::layers.15.feed_forward.w2.weight : rmse 0.00153784, maxerr 0.05371094, 95pct<0.0030, median<0.0012 q4_0::layers.15.feed_forward.w3.weight : rmse 0.00156844, maxerr 0.01736450, 95pct<0.0030, median<0.0012 q4_0::layers.15.attention_norm.weight : rmse 0.00926184, maxerr 0.04168701, 95pct<0.0186, median<0.0062 q4_0::layers.15.ffn_norm.weight : rmse 0.00522685, maxerr 0.01654053, 95pct<0.0104, median<0.0036 q4_0::layers.16.attention.wq.weight : rmse 0.00182710, maxerr 0.04269409, 95pct<0.0036, median<0.0014 q4_0::layers.16.attention.wk.weight : rmse 0.00187375, maxerr 0.02206421, 95pct<0.0038, median<0.0014 q4_0::layers.16.attention.wv.weight : rmse 0.00141846, maxerr 0.00961304, 95pct<0.0028, median<0.0012 q4_0::layers.16.attention.wo.weight : rmse 0.00140776, maxerr 0.04193115, 95pct<0.0028, median<0.0012 q4_0::layers.16.feed_forward.w1.weight : rmse 0.00158873, maxerr 0.02014160, 95pct<0.0030, median<0.0012 q4_0::layers.16.feed_forward.w2.weight : rmse 0.00153625, maxerr 0.05413818, 95pct<0.0030, median<0.0012 q4_0::layers.16.feed_forward.w3.weight : rmse 0.00156343, maxerr 0.02308655, 95pct<0.0030, median<0.0012 q4_0::layers.16.attention_norm.weight : rmse 0.00882078, maxerr 0.04125977, 95pct<0.0176, median<0.0062 q4_0::layers.16.ffn_norm.weight : rmse 0.00501514, maxerr 0.01635742, 95pct<0.0100, median<0.0034 q4_0::layers.17.attention.wq.weight : rmse 0.00178805, maxerr 0.04727173, 95pct<0.0036, median<0.0014 q4_0::layers.17.attention.wk.weight : rmse 0.00182524, maxerr 0.01977539, 95pct<0.0036, median<0.0014 q4_0::layers.17.attention.wv.weight : rmse 0.00142757, maxerr 0.01419830, 95pct<0.0028, median<0.0012 q4_0::layers.17.attention.wo.weight : rmse 0.00142669, maxerr 0.03002930, 95pct<0.0028, median<0.0012 q4_0::layers.17.feed_forward.w1.weight : rmse 0.00159202, maxerr 0.01837158, 95pct<0.0030, median<0.0012 q4_0::layers.17.feed_forward.w2.weight : rmse 0.00154283, maxerr 0.04699707, 95pct<0.0030, median<0.0012 q4_0::layers.17.feed_forward.w3.weight : rmse 0.00156794, maxerr 0.02423096, 95pct<0.0030, median<0.0012 q4_0::layers.17.attention_norm.weight : rmse 0.00863434, maxerr 0.03540039, 95pct<0.0174, median<0.0058 q4_0::layers.17.ffn_norm.weight : rmse 0.00526889, 
maxerr 0.01773071, 95pct<0.0106, median<0.0036 q4_0::layers.18.attention.wq.weight : rmse 0.00176889, maxerr 0.03970337, 95pct<0.0036, median<0.0012 q4_0::layers.18.attention.wk.weight : rmse 0.00179129, maxerr 0.01916504, 95pct<0.0036, median<0.0012 q4_0::layers.18.attention.wv.weight : rmse 0.00142390, maxerr 0.00970459, 95pct<0.0028, median<0.0012 q4_0::layers.18.attention.wo.weight : rmse 0.00142140, maxerr 0.03387451, 95pct<0.0028, median<0.0012 q4_0::layers.18.feed_forward.w1.weight : rmse 0.00160467, maxerr 0.02200317, 95pct<0.0030, median<0.0012 q4_0::layers.18.feed_forward.w2.weight : rmse 0.00153983, maxerr 0.06341553, 95pct<0.0030, median<0.0012 q4_0::layers.18.feed_forward.w3.weight : rmse 0.00156218, maxerr 0.01829529, 95pct<0.0030, median<0.0012 q4_0::layers.18.attention_norm.weight : rmse 0.00912104, maxerr 0.03564453, 95pct<0.0184, median<0.0060 q4_0::layers.18.ffn_norm.weight : rmse 0.00556628, maxerr 0.01904297, 95pct<0.0112, median<0.0038 q4_0::layers.19.attention.wq.weight : rmse 0.00173669, maxerr 0.04415894, 95pct<0.0034, median<0.0012 q4_0::layers.19.attention.wk.weight : rmse 0.00175806, maxerr 0.01998901, 95pct<0.0036, median<0.0012 q4_0::layers.19.attention.wv.weight : rmse 0.00149304, maxerr 0.00927734, 95pct<0.0028, median<0.0012 q4_0::layers.19.attention.wo.weight : rmse 0.00148160, maxerr 0.03442383, 95pct<0.0028, median<0.0012 q4_0::layers.19.feed_forward.w1.weight : rmse 0.00161370, maxerr 0.03111267, 95pct<0.0032, median<0.0012 q4_0::layers.19.feed_forward.w2.weight : rmse 0.00154201, maxerr 0.04092407, 95pct<0.0030, median<0.0012 q4_0::layers.19.feed_forward.w3.weight : rmse 0.00155880, maxerr 0.02091980, 95pct<0.0030, median<0.0012 q4_0::layers.19.attention_norm.weight : rmse 0.00947609, maxerr 0.04248047, 95pct<0.0192, median<0.0064 q4_0::layers.19.ffn_norm.weight : rmse 0.00561705, maxerr 0.02087402, 95pct<0.0110, median<0.0038 q4_0::layers.20.attention.wq.weight : rmse 0.00176062, maxerr 0.05462646, 95pct<0.0034, median<0.0014 q4_0::layers.20.attention.wk.weight : rmse 0.00178543, maxerr 0.02259827, 95pct<0.0036, median<0.0014 q4_0::layers.20.attention.wv.weight : rmse 0.00153912, maxerr 0.01049805, 95pct<0.0030, median<0.0012 q4_0::layers.20.attention.wo.weight : rmse 0.00151891, maxerr 0.02714539, 95pct<0.0030, median<0.0012 q4_0::layers.20.feed_forward.w1.weight : rmse 0.00162212, maxerr 0.02268982, 95pct<0.0032, median<0.0012 q4_0::layers.20.feed_forward.w2.weight : rmse 0.00154374, maxerr 0.06872559, 95pct<0.0030, median<0.0012 q4_0::layers.20.feed_forward.w3.weight : rmse 0.00155945, maxerr 0.01570129, 95pct<0.0030, median<0.0012 q4_0::layers.20.attention_norm.weight : rmse 0.00912985, maxerr 0.03491211, 95pct<0.0184, median<0.0060 q4_0::layers.20.ffn_norm.weight : rmse 0.00545571, maxerr 0.02185059, 95pct<0.0108, median<0.0036 q4_0::layers.21.attention.wq.weight : rmse 0.00170287, maxerr 0.04907227, 95pct<0.0034, median<0.0012 q4_0::layers.21.attention.wk.weight : rmse 0.00171981, maxerr 0.02088928, 95pct<0.0034, median<0.0012 q4_0::layers.21.attention.wv.weight : rmse 0.00155078, maxerr 0.00872803, 95pct<0.0030, median<0.0012 q4_0::layers.21.attention.wo.weight : rmse 0.00152792, maxerr 0.05682373, 95pct<0.0030, median<0.0012 q4_0::layers.21.feed_forward.w1.weight : rmse 0.00163029, maxerr 0.02185059, 95pct<0.0032, median<0.0014 q4_0::layers.21.feed_forward.w2.weight : rmse 0.00154296, maxerr 0.03689575, 95pct<0.0030, median<0.0012 q4_0::layers.21.feed_forward.w3.weight : rmse 0.00155747, maxerr 0.01473236, 95pct<0.0030, median<0.0012 
q4_0::layers.21.attention_norm.weight : rmse 0.01051225, maxerr 0.03753662, 95pct<0.0210, median<0.0072 q4_0::layers.21.ffn_norm.weight : rmse 0.00564722, maxerr 0.02294922, 95pct<0.0110, median<0.0038 q4_0::layers.22.attention.wq.weight : rmse 0.00173143, maxerr 0.04519653, 95pct<0.0034, median<0.0012 q4_0::layers.22.attention.wk.weight : rmse 0.00174586, maxerr 0.01945496, 95pct<0.0034, median<0.0012 q4_0::layers.22.attention.wv.weight : rmse 0.00154078, maxerr 0.00958252, 95pct<0.0030, median<0.0012 q4_0::layers.22.attention.wo.weight : rmse 0.00153304, maxerr 0.06982422, 95pct<0.0030, median<0.0012 q4_0::layers.22.feed_forward.w1.weight : rmse 0.00163221, maxerr 0.02517700, 95pct<0.0032, median<0.0014 q4_0::layers.22.feed_forward.w2.weight : rmse 0.00154983, maxerr 0.04281616, 95pct<0.0030, median<0.0012 q4_0::layers.22.feed_forward.w3.weight : rmse 0.00156437, maxerr 0.03030396, 95pct<0.0030, median<0.0012 q4_0::layers.22.attention_norm.weight : rmse 0.01016588, maxerr 0.03649902, 95pct<0.0204, median<0.0070 q4_0::layers.22.ffn_norm.weight : rmse 0.00574030, maxerr 0.02175903, 95pct<0.0114, median<0.0038 q4_0::layers.23.attention.wq.weight : rmse 0.00167802, maxerr 0.04333496, 95pct<0.0034, median<0.0012 q4_0::layers.23.attention.wk.weight : rmse 0.00168389, maxerr 0.02124023, 95pct<0.0034, median<0.0012 q4_0::layers.23.attention.wv.weight : rmse 0.00160218, maxerr 0.01093292, 95pct<0.0030, median<0.0012 q4_0::layers.23.attention.wo.weight : rmse 0.00157519, maxerr 0.06591797, 95pct<0.0030, median<0.0012 q4_0::layers.23.feed_forward.w1.weight : rmse 0.00163544, maxerr 0.03022766, 95pct<0.0032, median<0.0014 q4_0::layers.23.feed_forward.w2.weight : rmse 0.00155510, maxerr 0.04754639, 95pct<0.0030, median<0.0012 q4_0::layers.23.feed_forward.w3.weight : rmse 0.00156715, maxerr 0.02468872, 95pct<0.0030, median<0.0012 q4_0::layers.23.attention_norm.weight : rmse 0.01194653, maxerr 0.04150391, 95pct<0.0236, median<0.0084 q4_0::layers.23.ffn_norm.weight : rmse 0.00592925, maxerr 0.02349854, 95pct<0.0120, median<0.0038 q4_0::layers.24.attention.wq.weight : rmse 0.00167990, maxerr 0.04760742, 95pct<0.0034, median<0.0012 q4_0::layers.24.attention.wk.weight : rmse 0.00169040, maxerr 0.02137756, 95pct<0.0034, median<0.0012 q4_0::layers.24.attention.wv.weight : rmse 0.00162101, maxerr 0.00909424, 95pct<0.0032, median<0.0012 q4_0::layers.24.attention.wo.weight : rmse 0.00159527, maxerr 0.04144287, 95pct<0.0030, median<0.0012 q4_0::layers.24.feed_forward.w1.weight : rmse 0.00163766, maxerr 0.02067566, 95pct<0.0032, median<0.0014 q4_0::layers.24.feed_forward.w2.weight : rmse 0.00156213, maxerr 0.05670166, 95pct<0.0030, median<0.0012 q4_0::layers.24.feed_forward.w3.weight : rmse 0.00157601, maxerr 0.02043152, 95pct<0.0030, median<0.0012 q4_0::layers.24.attention_norm.weight : rmse 0.01140493, maxerr 0.04980469, 95pct<0.0230, median<0.0076 q4_0::layers.24.ffn_norm.weight : rmse 0.00600982, maxerr 0.02331543, 95pct<0.0118, median<0.0040 q4_0::layers.25.attention.wq.weight : rmse 0.00172559, maxerr 0.03808594, 95pct<0.0034, median<0.0012 q4_0::layers.25.attention.wk.weight : rmse 0.00174477, maxerr 0.01940918, 95pct<0.0034, median<0.0014 q4_0::layers.25.attention.wv.weight : rmse 0.00162673, maxerr 0.00921631, 95pct<0.0032, median<0.0012 q4_0::layers.25.attention.wo.weight : rmse 0.00160901, maxerr 0.04428101, 95pct<0.0032, median<0.0012 q4_0::layers.25.feed_forward.w1.weight : rmse 0.00164163, maxerr 0.01789856, 95pct<0.0032, median<0.0014 q4_0::layers.25.feed_forward.w2.weight : rmse 0.00156754, maxerr 
0.03286743, 95pct<0.0030, median<0.0012 q4_0::layers.25.feed_forward.w3.weight : rmse 0.00158208, maxerr 0.01832581, 95pct<0.0030, median<0.0012 q4_0::layers.25.attention_norm.weight : rmse 0.01000364, maxerr 0.04394531, 95pct<0.0204, median<0.0066 q4_0::layers.25.ffn_norm.weight : rmse 0.00582826, maxerr 0.02410889, 95pct<0.0114, median<0.0038 q4_0::layers.26.attention.wq.weight : rmse 0.00169759, maxerr 0.03735352, 95pct<0.0034, median<0.0012 q4_0::layers.26.attention.wk.weight : rmse 0.00171683, maxerr 0.02526855, 95pct<0.0034, median<0.0012 q4_0::layers.26.attention.wv.weight : rmse 0.00167550, maxerr 0.01040649, 95pct<0.0032, median<0.0014 q4_0::layers.26.attention.wo.weight : rmse 0.00165650, maxerr 0.02258301, 95pct<0.0032, median<0.0014 q4_0::layers.26.feed_forward.w1.weight : rmse 0.00164058, maxerr 0.03396606, 95pct<0.0032, median<0.0014 q4_0::layers.26.feed_forward.w2.weight : rmse 0.00157755, maxerr 0.04171753, 95pct<0.0030, median<0.0012 q4_0::layers.26.feed_forward.w3.weight : rmse 0.00159313, maxerr 0.03059387, 95pct<0.0030, median<0.0012 q4_0::layers.26.attention_norm.weight : rmse 0.01028290, maxerr 0.04541016, 95pct<0.0206, median<0.0070 q4_0::layers.26.ffn_norm.weight : rmse 0.00588573, maxerr 0.02441406, 95pct<0.0116, median<0.0038 q4_0::layers.27.attention.wq.weight : rmse 0.00170066, maxerr 0.04006958, 95pct<0.0034, median<0.0012 q4_0::layers.27.attention.wk.weight : rmse 0.00171133, maxerr 0.02406311, 95pct<0.0034, median<0.0012 q4_0::layers.27.attention.wv.weight : rmse 0.00170459, maxerr 0.01264954, 95pct<0.0032, median<0.0014 q4_0::layers.27.attention.wo.weight : rmse 0.00169575, maxerr 0.05517578, 95pct<0.0032, median<0.0014 q4_0::layers.27.feed_forward.w1.weight : rmse 0.00164005, maxerr 0.02774048, 95pct<0.0032, median<0.0014 q4_0::layers.27.feed_forward.w2.weight : rmse 0.00158663, maxerr 0.04428101, 95pct<0.0030, median<0.0012 q4_0::layers.27.feed_forward.w3.weight : rmse 0.00160017, maxerr 0.04153442, 95pct<0.0030, median<0.0012 q4_0::layers.27.attention_norm.weight : rmse 0.00993009, maxerr 0.03540039, 95pct<0.0202, median<0.0064 q4_0::layers.27.ffn_norm.weight : rmse 0.00599262, maxerr 0.02731323, 95pct<0.0118, median<0.0040 q4_0::layers.28.attention.wq.weight : rmse 0.00166564, maxerr 0.04092407, 95pct<0.0034, median<0.0012 q4_0::layers.28.attention.wk.weight : rmse 0.00167293, maxerr 0.02145386, 95pct<0.0034, median<0.0012 q4_0::layers.28.attention.wv.weight : rmse 0.00171405, maxerr 0.01080322, 95pct<0.0034, median<0.0014 q4_0::layers.28.attention.wo.weight : rmse 0.00171473, maxerr 0.02978516, 95pct<0.0034, median<0.0014 q4_0::layers.28.feed_forward.w1.weight : rmse 0.00163294, maxerr 0.02809143, 95pct<0.0032, median<0.0012 q4_0::layers.28.feed_forward.w2.weight : rmse 0.00159430, maxerr 0.05215454, 95pct<0.0030, median<0.0012 q4_0::layers.28.feed_forward.w3.weight : rmse 0.00160784, maxerr 0.02539062, 95pct<0.0030, median<0.0012 q4_0::layers.28.attention_norm.weight : rmse 0.01141754, maxerr 0.05895996, 95pct<0.0236, median<0.0074 q4_0::layers.28.ffn_norm.weight : rmse 0.00626150, maxerr 0.02587891, 95pct<0.0126, median<0.0040 q4_0::layers.29.attention.wq.weight : rmse 0.00165014, maxerr 0.04244995, 95pct<0.0032, median<0.0012 q4_0::layers.29.attention.wk.weight : rmse 0.00166139, maxerr 0.01962280, 95pct<0.0034, median<0.0012 q4_0::layers.29.attention.wv.weight : rmse 0.00176460, maxerr 0.01135254, 95pct<0.0034, median<0.0014 q4_0::layers.29.attention.wo.weight : rmse 0.00176555, maxerr 0.04040527, 95pct<0.0034, median<0.0014 
q4_0::layers.29.feed_forward.w1.weight : rmse 0.00163643, maxerr 0.02803040, 95pct<0.0032, median<0.0012 q4_0::layers.29.feed_forward.w2.weight : rmse 0.00160070, maxerr 0.09802246, 95pct<0.0030, median<0.0012 q4_0::layers.29.feed_forward.w3.weight : rmse 0.00161745, maxerr 0.02696228, 95pct<0.0032, median<0.0014 q4_0::layers.29.attention_norm.weight : rmse 0.01053851, maxerr 0.04125977, 95pct<0.0208, median<0.0070 q4_0::layers.29.ffn_norm.weight : rmse 0.00697287, maxerr 0.02880859, 95pct<0.0138, median<0.0044 q4_0::layers.30.attention.wq.weight : rmse 0.00166892, maxerr 0.04101562, 95pct<0.0032, median<0.0012 q4_0::layers.30.attention.wk.weight : rmse 0.00167814, maxerr 0.01953125, 95pct<0.0034, median<0.0012 q4_0::layers.30.attention.wv.weight : rmse 0.00174131, maxerr 0.01042175, 95pct<0.0034, median<0.0014 q4_0::layers.30.attention.wo.weight : rmse 0.00176365, maxerr 0.04565430, 95pct<0.0034, median<0.0014 q4_0::layers.30.feed_forward.w1.weight : rmse 0.00164651, maxerr 0.02940369, 95pct<0.0032, median<0.0014 q4_0::layers.30.feed_forward.w2.weight : rmse 0.00163353, maxerr 0.11157227, 95pct<0.0030, median<0.0012 q4_0::layers.30.feed_forward.w3.weight : rmse 0.00163264, maxerr 0.03038025, 95pct<0.0032, median<0.0014 q4_0::layers.30.attention_norm.weight : rmse 0.01055359, maxerr 0.03906250, 95pct<0.0220, median<0.0070 q4_0::layers.30.ffn_norm.weight : rmse 0.00798581, maxerr 0.03295898, 95pct<0.0162, median<0.0052 q4_0::layers.31.attention.wq.weight : rmse 0.00169262, maxerr 0.02526855, 95pct<0.0034, median<0.0012 q4_0::layers.31.attention.wk.weight : rmse 0.00173245, maxerr 0.02009583, 95pct<0.0034, median<0.0012 q4_0::layers.31.attention.wv.weight : rmse 0.00157056, maxerr 0.01557922, 95pct<0.0030, median<0.0012 q4_0::layers.31.attention.wo.weight : rmse 0.00159227, maxerr 0.10278320, 95pct<0.0030, median<0.0012 q4_0::layers.31.feed_forward.w1.weight : rmse 0.00171869, maxerr 0.02217102, 95pct<0.0034, median<0.0014 q4_0::layers.31.feed_forward.w2.weight : rmse 0.00164236, maxerr 0.11260986, 95pct<0.0032, median<0.0012 q4_0::layers.31.feed_forward.w3.weight : rmse 0.00170012, maxerr 0.03759766, 95pct<0.0032, median<0.0014 q4_0::layers.31.attention_norm.weight : rmse 0.01264741, maxerr 0.04638672, 95pct<0.0230, median<0.0096 q4_0::layers.31.ffn_norm.weight : rmse 0.01198714, maxerr 0.02905273, 95pct<0.0224, median<0.0088 q4_0 : rmse 0.00159265, maxerr 0.17480469, 95pct<0.0030, median<0.0012

Description

This PR adds new methods for Q4_0 and Q4_1 quantization that (almost) exactly solve the mixed integer minimization problem

Minimize Sum (x_i - b l_i)^2   subject to -7 <= l_i <= 7     (Q4_0)
or
Minimize Sum (x_i - a - b l_i)^2   subject to 0 <= l_i <= 15   (Q4_1)

where the x_i are the original weights, the l_i are the quantized weights, and a, b are the conversion coefficients. It is almost exact because in some very rare degenerate cases the method may not find the global minimum. Guaranteeing that the global minimum is obtained is not worth it, because the difference in mean-square-error (MSE) from the guaranteed minimum is less than 0.01%, while it costs at least a 2-fold increase in computation time.
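To illustrate what is being minimized, here is a deliberately naive C++ sketch for the Q4_0 case: scan a set of candidate scales and, for each candidate, compute the least-squares optimal scale for the resulting integer assignment. This is only an illustration of the objective above, not the (much smarter and almost exact) algorithm implemented in this PR.

```cpp
// Naive illustration of the Q4_0 objective: minimize Sum (x_i - b*l_i)^2
// with l_i restricted to [-7, 7]. Not the PR's algorithm.
#include <algorithm>
#include <cmath>
#include <vector>

static float q4_0_scale_bruteforce(const std::vector<float> & x) {
    float amax = 0.f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    if (amax == 0.f) return 0.f;

    float  best_b   = amax / 7.f;
    double best_err = INFINITY;
    for (int k = 0; k < 200; ++k) {
        const float b_try = amax / 7.f * (0.5f + 0.005f * k);
        // Quantize with the trial scale, then take the least-squares scale
        // b = <x,l>/<l,l> for that integer assignment.
        double sxl = 0., sll = 0.;
        std::vector<int> l(x.size());
        for (size_t i = 0; i < x.size(); ++i) {
            const int q = (int) std::lround(x[i] / b_try);
            l[i] = std::max(-7, std::min(7, q));
            sxl += (double) x[i] * l[i];
            sll += (double) l[i] * l[i];
        }
        const float b = sll > 0. ? (float) (sxl / sll) : 0.f;
        double err = 0.;
        for (size_t i = 0; i < x.size(); ++i) {
            const double d = x[i] - b * l[i];
            err += d * d;
        }
        if (err < best_err) { best_err = err; best_b = b; }
    }
    return best_b;
}
```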

On the 7B model, the improved Q4_0 quantization achieves ~14% reduction in MSE compared to the existing implementation. The improved Q4_1 quantization is even better, achieving ~25% reduction in MSE compared to the existing Q4_1.

So far I have only measured perplexity for the 7B model. I get 6.3539 for Q4_0 and 6.0863 for Q4_1 with the default context size.

For the sake of compatibility, I have kept the format of the existing Q4_0 and Q4_1 quantizations (i.e., one or two 32-bit floats followed by 16 uint8_t containing the quants), so that a model quantized with these new methods can be used without any other changes to the code. This is quite wasteful, as the same 6 bits per weight used by Q4_1 would give a massive reduction in MSE if one switched to 5-bit quantization and fp16 coefficients.
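For reference, the kept layouts look roughly like this (a sketch with 32 weights per block; field names follow what ggml.c used at the time and may not match exactly):

```cpp
// Sketch of the existing block layouts this PR keeps (32 weights per block).
#include <cstdint>

struct block_q4_0 {
    float   d;       // scale
    uint8_t qs[16];  // 32 x 4-bit quants, two per byte
};  // 20 bytes per 32 weights = 5 bits/weight

struct block_q4_1 {
    float   d;       // scale
    float   m;       // minimum
    uint8_t qs[16];  // 32 x 4-bit quants
};  // 24 bytes per 32 weights = 6 bits/weight
```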

The new quantization methods are meant to be used for quantizing the original model only. They are not nearly fast enough for quantization of intermediate results (in single-threaded mode the new Q4_0 is ~25 times slower and the new Q4_1 ~50 times slower than the corresponding existing implementations).

The quantization function will automatically use multi-threading if the chunk of weights given for quantization is large enough. Plugged into the quantize example, it gets the job done in about 49 seconds for Q4_0 on my MacBook (M2 Max) for the 7B model. Q4_1 quantization of the 7B model takes ~190 seconds.
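Roughly, the threading can be pictured like this (a sketch with hypothetical names, not the code in this PR): split the rows into chunks, spawn workers only when there is enough work, and let the main thread take the first chunk.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Sketch: quantize nrows rows in parallel when the tensor is large enough.
// quantize_row stands in for the actual rmse-optimizing per-row routine.
void quantize_rows_mt(const float * src, uint8_t * dst,
                      int nrows, int row_size, size_t bytes_per_row,
                      void (*quantize_row)(const float *, uint8_t *, int)) {
    const int min_rows_per_thread = 32;  // do not spawn threads for tiny tensors
    const int nthread = std::min<int>((int) std::thread::hardware_concurrency(),
                                      std::max(1, nrows / min_rows_per_thread));
    auto work = [&](int first, int last) {
        for (int r = first; r < last; ++r) {
            quantize_row(src + (size_t) r * row_size, dst + (size_t) r * bytes_per_row, row_size);
        }
    };
    if (nthread <= 1) { work(0, nrows); return; }
    const int per = (nrows + nthread - 1) / nthread;
    std::vector<std::thread> workers;
    for (int t = 1; t < nthread; ++t) {
        const int first = t * per, last = std::min(nrows, first + per);
        if (first < last) workers.emplace_back(work, first, last);
    }
    work(0, std::min(per, nrows));  // main thread handles the first chunk
    for (auto & w : workers) w.join();
}
```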

I have also added a change to the reference (i.e., scalar) versions of the Q4_0 and Q4_1 quantization implementations: replacing the roundf() function with a better conversion to int speeds up the scalar implementation quite a bit, especially on X86_64 (and x86), where the slowness of round is legendary. After this change, the reference implementation is only ~10% slower than the vectorized quantization on the two CPUs I have tried (M2 Max and Ryzen 7950X).

in quantize_row_q4_0_reference and quantize_row_q4_1_reference.
This reduces the difference to the vectorized versions to
~10% for quantize_row_q4_0 and <15% for quantize_row_q4_1 on
the two CPUs I have tried (Ryzen 7950X and M2 Max).
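For context, the usual way to avoid roundf() is the well-known "magic constant" trick: add 1.5*2^23 so the FPU rounds to the nearest integer, then read the result out of the mantissa bits. A sketch (only valid for inputs that fit comfortably in the mantissa):

```cpp
#include <cassert>
#include <cmath>
#include <cstring>

// Fast round-to-nearest-int for |x| <= ~4e6, avoiding the slow roundf().
static inline int nearest_int(float x) {
    assert(std::fabs(x) <= 4194303.f);   // must fit in the 23-bit mantissa
    float v = x + 12582912.f;            // 1.5 * 2^23 forces rounding to an integer
    int i;
    std::memcpy(&i, &v, sizeof(i));      // read back the bits
    return (i & 0x007fffff) - 0x00400000;
}
```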
@sw
Collaborator

sw commented Apr 11, 2023

As for ggml_extra.cpp, is that the same approach as @unbounded tried here: #397 (comment) ? Anyway I'll look into it...

(edit: looks like your RMS errors are higher than those posted by @unbounded here: #835 (comment). Maybe it's because you don't seem to be using the value -8. But why is your perplexity so much lower?)

I have also added a change to the reference (i.e., scalar) versions of the Q4_0 and Q4_1 quantization implementations

I can confirm that it's faster; however, it changes the output, somewhat subverting the meaning of "reference". I would find it better to make this a separate PR.

Checksums for the 7B model compared to master (3e6e70d):

$ sha256sum ggml-model-q4_*
2dad53e70ca521fedcf9f9be5c26c15df602487a9c008bdafbb2bf8f946b6bf0  ggml-model-q4_0.bin-master
16aa14a8865af282466c3e9440f59e6fe2c1f547e3c1c1b858f34d2160022f10  ggml-model-q4_0.bin-896
4f4603bb53a194dfe6b471c2fe0864094d124c8c03744c1c18bee5e09de89c83  ggml-model-q4_1.bin-master
cf59e1f29bf56db6ea1d8d21b891a427d38337262f80c21d853bb116ba32b4e6  ggml-model-q4_1.bin-896

The quantization function will automatically use multi-threading if the chunk of weights given for quantization is large enough.

Great job on this; however, again, this should probably be a separate PR. It could also be made to benefit the existing formats (ftypes 2, 3).

But we should eventually switch back to nearestInt() and adapt the test.
if (df0 > 0) {
    kmin = nmax-2; kmax = nmax + 1;
} else {
    kmin = nmax/2; kmax = nmax+1;
Collaborator
@sw sw Apr 11, 2023

kmax is the same in both cases, move outside the if/else or eliminate entirely in favor of nmax+1 (if that's what you intended).

All in all, this function would benefit from some explanatory comments.

Contributor Author

df0 is the negative of the derivative of the cost function with respect to the scale at the point where we started. If it is greater than 0, we expect the search range to extend beyond nmax, and indeed it does. On occasion one can get a better solution by going to nmax+2 or even nmax+3. In practice, the gain in MSE is so marginal that the added extra computation time is just not worth it. But I have left kmax defined explicitly in both branches to remind us that we may want to explore ways to find such marginal improvements more efficiently (and then we would change kmax in the df0 > 0 branch correspondingly).

@ivanstepanovftw
Collaborator

Could you please share the full perplexity results?

Somehow I had it hard-wired in my brain that quants need to be
in -7...7 to be comparable to the original Q4_0.

But this is clearly not the case, and if we relax this requirement
this simple change brings the rmse down to 0.001966 at the expense of
a somewhat longer computation (~67 seconds vs 49 seconds for the 7B
model on M2 Max).

Perplexity test is still running, but it looks like the improvement
compared to the previous version will be quite modest (~0.03) despite
the significant improvement in MSE.

The change does not affect Q4_1 as there we already use the full
range of 16 possible int values.
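In terms of the naive sketch further up, the change amounts to clamping to the full signed 4-bit range instead of the symmetric one (and, presumably, searching a somewhat wider set of candidate scales, hence the longer runtime); a minimal sketch:

```cpp
#include <algorithm>

// Clamp a rounded quant q to the allowed integer range.
inline int clamp_q4_symmetric(int q) { return std::max(-7, std::min(7, q)); } // old assumption
inline int clamp_q4_full(int q)      { return std::max(-8, std::min(7, q)); } // full int4 range
```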
@ikawrakow
Contributor Author

For completeness, here are the perplexity runs:

Q4_0, 7B

iwan@MacBook-Pro:~/other/llama.cpp$ ./bin/perplexity -m ../quant/models/7B/ggml-model-q40k.bin -f tests/wikitext-2-raw/wiki.test.raw 
main: seed = 1681219811
llama.cpp: loading model from ../quant/models/7B/ggml-model-q40k.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512 
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256 
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128 
llama_model_load_internal: f16        = 4 
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1 
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks
9.64 seconds per pass - ETA 1.75 hours
[1]4.5124,[2]5.0602,[3]5.9593,[4]6.5845,[5]6.6641,[6]6.6320,[7]6.8217,[8]6.9154,[9]7.2817,[10]7.5396,[11]7.7750,[12]7.8101,[13]7.7434,[14]7.8365,[15]8.0953,[16]7.6838,[17]7.5515,[18]7.5028,[19]7.1215,[20]7.1135,[21]7.0180,[22]6.8499,[23]6.8126,[24]6.7075,[25]6.7003,[26]6.5246,[27]6.3262,[28]6.2173,[29]6.1251,[30]5.9657,[31]5.9297,[32]5.9513,[33]5.8931,[34]5.9323,[35]5.9561,[36]5.9960,[37]5.9964,[38]6.0126,[39]6.0489,[40]6.1105,[41]6.1195,[42]6.1617,[43]6.1170,[44]6.1729,[45]6.1724,[46]6.1461,[47]6.1724,[48]6.1460,[49]6.1509,[50]6.1075,[51]6.1005,[52]6.0874,[53]6.1320,[54]6.1121,[55]6.0883,[56]6.1187,[57]6.1394,[58]6.1616,[59]6.1774,[60]6.2215,[61]6.2078,[62]6.2662,[63]6.2997,[64]6.3128,[65]6.3599,[66]6.3700,[67]6.3849,[68]6.3977,[69]6.4246,[70]6.4576,[71]6.4802,[72]6.5102,[73]6.5760,[74]6.5815,[75]6.5966,[76]6.6099,[77]6.6193,[78]6.6047,[79]6.6334,[80]6.6258,[81]6.6364,[82]6.6425,[83]6.5881,[84]6.5727,[85]6.5618,[86]6.5406,[87]6.4771,[88]6.4482,[89]6.4314,[90]6.4158,[91]6.4422,[92]6.4381,[93]6.4414,[94]6.4393,[95]6.4682,[96]6.4652,[97]6.4623,[98]6.4552,[99]6.4399,[100]6.4398,[101]6.4657,[102]6.4591,[103]6.4787,[104]6.4857,[105]6.4880,[106]6.5052,[107]6.5039,[108]6.5142,[109]6.5071,[110]6.5034,[111]6.5254,[112]6.5461,[113]6.5503,[114]6.5472,[115]6.5567,[116]6.5497,[117]6.5551,[118]6.5846,[119]6.6062,[120]6.6448,[121]6.6620,[122]6.6866,[123]6.7253,[124]6.7443,[125]6.7357,[126]6.7768,[127]6.8140,[128]6.8455,[129]6.8299,[130]6.8417,[131]6.8373,[132]6.8287,[133]6.8153,[134]6.8266,[135]6.8245,[136]6.8128,[137]6.8052,[138]6.7909,[139]6.7798,[140]6.7755,[141]6.7446,[142]6.7419,[143]6.7113,[144]6.6905,[145]6.6827,[146]6.6686,[147]6.6747,[148]6.6758,[149]6.6699,[150]6.6641,[151]6.6657,[152]6.6541,[153]6.6364,[154]6.6272,[155]6.6340,[156]6.6291,[157]6.6476,[158]6.6503,[159]6.6544,[160]6.6558,[161]6.6690,[162]6.6388,[163]6.6265,[164]6.6006,[165]6.5684,[166]6.5387,[167]6.5012,[168]6.4686,[169]6.4558,[170]6.4445,[171]6.4159,[172]6.3979,[173]6.3787,[174]6.3472,[175]6.3256,[176]6.3154,[177]6.2928,[178]6.2699,[179]6.2520,[180]6.2424,[181]6.2201,[182]6.2009,[183]6.1868,[184]6.1866,[185]6.1789,[186]6.1808,[187]6.1871,[188]6.1838,[189]6.2024,[190]6.2042,[191]6.2255,[192]6.2422,[193]6.2613,[194]6.2726,[195]6.2942,[196]6.3106,[197]6.3336,[198]6.3494,[199]6.3537,[200]6.3593,[201]6.3540,[202]6.3755,[203]6.3838,[204]6.3832,[205]6.3945,[206]6.4029,[207]6.3987,[208]6.4074,[209]6.4127,[210]6.4181,[211]6.4272,[212]6.4341,[213]6.4445,[214]6.4468,[215]6.4506,[216]6.4651,[217]6.4826,[218]6.4966,[219]6.4970,[220]6.4931,[221]6.4880,[222]6.4856,[223]6.4753,[224]6.4679,[225]6.4635,[226]6.4850,[227]6.4941,[228]6.4995,[229]6.5056,[230]6.5011,[231]6.5182,[232]6.5064,[233]6.4893,[234]6.4757,[235]6.4587,[236]6.4512,[237]6.4409,[238]6.4441,[239]6.4283,[240]6.4172,[241]6.4195,[242]6.4239,[243]6.4224,[244]6.4100,[245]6.4078,[246]6.3957,[247]6.3830,[248]6.3752,[249]6.3722,[250]6.3759,[251]6.3678,[252]6.3641,[253]6.3531,[254]6.3488,[255]6.3372,[256]6.3183,[257]6.3068,[258]6.2976,[259]6.2948,[260]6.2863,[261]6.2818,[262]6.2758,[263]6.2712,[264]6.2517,[265]6.2504,[266]6.2490,[267]6.2425,[268]6.2519,[269]6.2497,[270]6.2505,[271]6.2579,[272]6.2618,[273]6.2614,[274]6.2634,[275]6.2719,[276]6.2781,[277]6.2942,[278]6.3051,[279]6.3138,[280]6.3164,[281]6.3257,[282]6.3309,[283]6.3459,[284]6.3543,[285]6.3629,[286]6.3773,[287]6.3768,[288]6.3836,[289]6.3739,[290]6.3584,[291]6.3430,[292]6.3275,[293]6.3132,[294]6.3148,[295]6.3143,[296]6.3180,[297]6.3170,[298]6.3198,[299]6.3167,[300]6.3058,[301]6.3062,[302]6.2985,[303]6.2906,[304]6.2824,[305]6.2791,[30
6]6.2655,[307]6.2673,[308]6.2710,[309]6.2543,[310]6.2483,[311]6.2414,[312]6.2445,[313]6.2392,[314]6.2375,[315]6.2206,[316]6.2157,[317]6.1991,[318]6.1770,[319]6.1897,[320]6.2026,[321]6.2069,[322]6.2021,[323]6.1952,[324]6.1927,[325]6.2032,[326]6.2035,[327]6.2062,[328]6.2106,[329]6.2171,[330]6.2205,[331]6.2325,[332]6.2298,[333]6.2370,[334]6.2312,[335]6.2245,[336]6.2275,[337]6.2244,[338]6.2237,[339]6.2180,[340]6.2137,[341]6.2214,[342]6.2238,[343]6.2294,[344]6.2293,[345]6.2287,[346]6.2257,[347]6.2306,[348]6.2343,[349]6.2361,[350]6.2330,[351]6.2337,[352]6.2339,[353]6.2279,[354]6.2286,[355]6.2340,[356]6.2369,[357]6.2332,[358]6.2424,[359]6.2452,[360]6.2409,[361]6.2411,[362]6.2479,[363]6.2589,[364]6.2651,[365]6.2710,[366]6.2715,[367]6.2806,[368]6.2782,[369]6.2782,[370]6.2795,[371]6.2734,[372]6.2781,[373]6.2830,[374]6.2814,[375]6.2807,[376]6.2882,[377]6.2830,[378]6.2852,[379]6.2912,[380]6.2829,[381]6.2791,[382]6.2739,[383]6.2727,[384]6.2718,[385]6.2712,[386]6.2713,[387]6.2703,[388]6.2659,[389]6.2606,[390]6.2539,[391]6.2455,[392]6.2411,[393]6.2391,[394]6.2419,[395]6.2402,[396]6.2325,[397]6.2407,[398]6.2444,[399]6.2531,[400]6.2529,[401]6.2542,[402]6.2550,[403]6.2566,[404]6.2632,[405]6.2532,[406]6.2498,[407]6.2494,[408]6.2508,[409]6.2633,[410]6.2743,[411]6.2867,[412]6.3031,[413]6.3154,[414]6.3232,[415]6.3286,[416]6.3365,[417]6.3502,[418]6.3538,[419]6.3614,[420]6.3702,[421]6.3823,[422]6.3878,[423]6.3947,[424]6.4068,[425]6.4158,[426]6.4228,[427]6.4275,[428]6.4359,[429]6.4415,[430]6.4502,[431]6.4650,[432]6.4697,[433]6.4684,[434]6.4641,[435]6.4648,[436]6.4666,[437]6.4761,[438]6.4836,[439]6.4804,[440]6.4799,[441]6.4748,[442]6.4740,[443]6.4758,[444]6.4767,[445]6.4749,[446]6.4774,[447]6.4806,[448]6.4852,[449]6.4826,[450]6.4832,[451]6.4785,[452]6.4667,[453]6.4580,[454]6.4521,[455]6.4533,[456]6.4579,[457]6.4604,[458]6.4581,[459]6.4586,[460]6.4669,[461]6.4636,[462]6.4619,[463]6.4666,[464]6.4658,[465]6.4628,[466]6.4549,[467]6.4555,[468]6.4554,[469]6.4572,[470]6.4581,[471]6.4532,[472]6.4579,[473]6.4519,[474]6.4530,[475]6.4469,[476]6.4499,[477]6.4426,[478]6.4416,[479]6.4480,[480]6.4532,[481]6.4550,[482]6.4507,[483]6.4463,[484]6.4490,[485]6.4474,[486]6.4417,[487]6.4418,[488]6.4400,[489]6.4352,[490]6.4325,[491]6.4296,[492]6.4237,[493]6.4209,[494]6.4195,[495]6.4198,[496]6.4161,[497]6.4105,[498]6.4091,[499]6.4040,[500]6.3941,[501]6.3873,[502]6.3871,[503]6.3865,[504]6.3773,[505]6.3794,[506]6.3804,[507]6.3746,[508]6.3711,[509]6.3702,[510]6.3744,[511]6.3792,[512]6.3827,[513]6.3849,[514]6.3916,[515]6.3864,[516]6.3859,[517]6.3871,[518]6.3870,[519]6.3904,[520]6.3931,[521]6.3946,[522]6.3975,[523]6.3983,[524]6.4041,[525]6.4082,[526]6.4095,[527]6.4111,[528]6.4060,[529]6.4067,[530]6.4019,[531]6.4011,[532]6.4059,[533]6.4083,[534]6.4067,[535]6.4089,[536]6.4031,[537]6.4011,[538]6.4060,[539]6.4071,[540]6.4113,[541]6.4122,[542]6.4130,[543]6.4143,[544]6.4159,[545]6.4137,[546]6.4149,[547]6.4105,[548]6.4053,[549]6.4054,[550]6.4025,[551]6.3986,[552]6.3967,[553]6.3924,[554]6.3900,[555]6.3869,[556]6.3868,[557]6.3892,[558]6.3851,[559]6.3847,[560]6.3845,[561]6.3845,[562]6.3828,[563]6.3825,[564]6.3868,[565]6.3890,[566]6.3890,[567]6.3866,[568]6.3872,[569]6.3856,[570]6.3883,[571]6.3887,[572]6.3899,[573]6.3899,[574]6.3864,[575]6.3864,[576]6.3863,[577]6.3848,[578]6.3827,[579]6.3836,[580]6.3768,[581]6.3729,[582]6.3718,[583]6.3723,[584]6.3725,[585]6.3650,[586]6.3580,[587]6.3585,[588]6.3633,[589]6.3691,[590]6.3724,[591]6.3743,[592]6.3727,[593]6.3691,[594]6.3701,[595]6.3675,[596]6.3711,[597]6.3685,[598]6.3649,[599]6.3671,[600]6.3666,[601]6.3651,[602]6
.3669,[603]6.3702,[604]6.3711,[605]6.3744,[606]6.3765,[607]6.3747,[608]6.3711,[609]6.3714,[610]6.3753,[611]6.3735,[612]6.3761,[613]6.3724,[614]6.3671,[615]6.3593,[616]6.3622,[617]6.3563,[618]6.3513,[619]6.3455,[620]6.3311,[621]6.3238,[622]6.3218,[623]6.3236,[624]6.3239,[625]6.3237,[626]6.3223,[627]6.3242,[628]6.3243,[629]6.3237,[630]6.3270,[631]6.3333,[632]6.3390,[633]6.3371,[634]6.3406,[635]6.3411,[636]6.3382,[637]6.3350,[638]6.3377,[639]6.3345,[640]6.3354,[641]6.3358,[642]6.3428,[643]6.3447,[644]6.3462,[645]6.3441,[646]6.3487,[647]6.3448,[648]6.3457,[649]6.3460,[650]6.3504,[651]6.3562,[652]6.3569,[653]6.3611,[654]6.3546,[655]6.3539,

llama_print_timings:        load time = 10097.21 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 6179950.08 ms / 334705 tokens (   18.46 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 6207602.93 ms

Q4_1, 7B

iwan@MacBook-Pro:~/other/llama.cpp$ ./bin/perplexity -m ../quant/models/7B/ggml-model-q41k.bin -f tests/wikitext-2-raw/wiki.test.raw 
main: seed = 1681226371
llama.cpp: loading model from ../quant/models/7B/ggml-model-q41k.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512 
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256 
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128 
llama_model_load_internal: f16        = 5 
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1 
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 6612.57 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks
17.66 seconds per pass - ETA 3.21 hours
[1]4.3994,[2]4.8500,[3]5.7479,[4]6.3533,[5]6.4709,[6]6.4153,[7]6.6099,[8]6.7147,[9]7.0359,[10]7.2721,[11]7.4840,[12]7.5222,[13]7.4411,[14]7.5150,[15]7.7825,[16]7.3834,[17]7.2603,[18]7.2035,[19]6.8471,[20]6.8273,[21]6.7340,[22]6.5625,[23]6.5263,[24]6.4312,[25]6.4291,[26]6.2679,[27]6.0891,[28]5.9861,[29]5.9011,[30]5.7428,[31]5.7120,[32]5.7306,[33]5.6747,[34]5.7078,[35]5.7275,[36]5.7632,[37]5.7655,[38]5.7722,[39]5.8059,[40]5.8530,[41]5.8619,[42]5.9022,[43]5.8613,[44]5.9188,[45]5.9210,[46]5.8923,[47]5.9125,[48]5.8878,[49]5.8851,[50]5.8445,[51]5.8413,[52]5.8315,[53]5.8760,[54]5.8596,[55]5.8376,[56]5.8673,[57]5.8858,[58]5.9072,[59]5.9268,[60]5.9702,[61]5.9606,[62]6.0171,[63]6.0474,[64]6.0604,[65]6.1021,[66]6.1104,[67]6.1290,[68]6.1475,[69]6.1712,[70]6.2030,[71]6.2269,[72]6.2600,[73]6.3179,[74]6.3205,[75]6.3342,[76]6.3468,[77]6.3581,[78]6.3437,[79]6.3709,[80]6.3642,[81]6.3768,[82]6.3802,[83]6.3294,[84]6.3128,[85]6.2998,[86]6.2777,[87]6.2145,[88]6.1906,[89]6.1715,[90]6.1570,[91]6.1809,[92]6.1757,[93]6.1747,[94]6.1717,[95]6.1988,[96]6.1975,[97]6.1928,[98]6.1867,[99]6.1727,[100]6.1714,[101]6.1956,[102]6.1911,[103]6.2118,[104]6.2201,[105]6.2191,[106]6.2350,[107]6.2345,[108]6.2498,[109]6.2447,[110]6.2408,[111]6.2624,[112]6.2833,[113]6.2861,[114]6.2813,[115]6.2871,[116]6.2784,[117]6.2840,[118]6.3121,[119]6.3330,[120]6.3675,[121]6.3836,[122]6.4087,[123]6.4457,[124]6.4628,[125]6.4532,[126]6.4927,[127]6.5271,[128]6.5576,[129]6.5433,[130]6.5505,[131]6.5468,[132]6.5384,[133]6.5253,[134]6.5337,[135]6.5292,[136]6.5198,[137]6.5126,[138]6.4963,[139]6.4845,[140]6.4803,[141]6.4535,[142]6.4509,[143]6.4226,[144]6.4025,[145]6.3941,[146]6.3822,[147]6.3852,[148]6.3847,[149]6.3804,[150]6.3772,[151]6.3799,[152]6.3700,[153]6.3542,[154]6.3465,[155]6.3531,[156]6.3484,[157]6.3644,[158]6.3687,[159]6.3742,[160]6.3765,[161]6.3884,[162]6.3606,[163]6.3491,[164]6.3255,[165]6.2953,[166]6.2687,[167]6.2319,[168]6.2015,[169]6.1886,[170]6.1785,[171]6.1525,[172]6.1364,[173]6.1203,[174]6.0907,[175]6.0688,[176]6.0573,[177]6.0376,[178]6.0157,[179]5.9996,[180]5.9903,[181]5.9695,[182]5.9514,[183]5.9376,[184]5.9372,[185]5.9300,[186]5.9312,[187]5.9373,[188]5.9340,[189]5.9509,[190]5.9518,[191]5.9734,[192]5.9894,[193]6.0062,[194]6.0168,[195]6.0371,[196]6.0523,[197]6.0734,[198]6.0885,[199]6.0915,[200]6.0968,[201]6.0927,[202]6.1110,[203]6.1185,[204]6.1187,[205]6.1287,[206]6.1353,[207]6.1313,[208]6.1397,[209]6.1434,[210]6.1483,[211]6.1595,[212]6.1664,[213]6.1769,[214]6.1806,[215]6.1832,[216]6.1965,[217]6.2140,[218]6.2269,[219]6.2269,[220]6.2227,[221]6.2175,[222]6.2153,[223]6.2065,[224]6.2003,[225]6.1961,[226]6.2166,[227]6.2251,[228]6.2303,[229]6.2367,[230]6.2333,[231]6.2497,[232]6.2381,[233]6.2216,[234]6.2070,[235]6.1889,[236]6.1823,[237]6.1724,[238]6.1749,[239]6.1599,[240]6.1503,[241]6.1518,[242]6.1549,[243]6.1539,[244]6.1431,[245]6.1394,[246]6.1286,[247]6.1167,[248]6.1094,[249]6.1074,[250]6.1121,[251]6.1056,[252]6.1017,[253]6.0926,[254]6.0876,[255]6.0760,[256]6.0580,[257]6.0454,[258]6.0373,[259]6.0355,[260]6.0272,[261]6.0229,[262]6.0178,[263]6.0116,[264]5.9902,[265]5.9896,[266]5.9882,[267]5.9816,[268]5.9904,[269]5.9888,[270]5.9899,[271]5.9976,[272]6.0016,[273]6.0018,[274]6.0046,[275]6.0130,[276]6.0182,[277]6.0340,[278]6.0442,[279]6.0531,[280]6.0559,[281]6.0658,[282]6.0718,[283]6.0861,[284]6.0935,[285]6.1020,[286]6.1152,[287]6.1144,[288]6.1208,[289]6.1121,[290]6.0959,[291]6.0811,[292]6.0663,[293]6.0532,[294]6.0551,[295]6.0534,[296]6.0578,[297]6.0566,[298]6.0596,[299]6.0569,[300]6.0463,[301]6.0458,[302]6.0383,[303]6.0296,[304]6.0210,[305]6.0179,[30
6]6.0053,[307]6.0074,[308]6.0104,[309]5.9945,[310]5.9887,[311]5.9825,[312]5.9848,[313]5.9793,[314]5.9780,[315]5.9625,[316]5.9579,[317]5.9418,[318]5.9219,[319]5.9331,[320]5.9455,[321]5.9499,[322]5.9460,[323]5.9394,[324]5.9363,[325]5.9471,[326]5.9472,[327]5.9491,[328]5.9528,[329]5.9583,[330]5.9611,[331]5.9729,[332]5.9704,[333]5.9772,[334]5.9722,[335]5.9664,[336]5.9698,[337]5.9676,[338]5.9667,[339]5.9616,[340]5.9575,[341]5.9657,[342]5.9682,[343]5.9730,[344]5.9735,[345]5.9738,[346]5.9712,[347]5.9748,[348]5.9782,[349]5.9802,[350]5.9775,[351]5.9778,[352]5.9782,[353]5.9721,[354]5.9729,[355]5.9779,[356]5.9812,[357]5.9778,[358]5.9868,[359]5.9897,[360]5.9860,[361]5.9856,[362]5.9925,[363]6.0031,[364]6.0093,[365]6.0142,[366]6.0156,[367]6.0237,[368]6.0208,[369]6.0218,[370]6.0233,[371]6.0179,[372]6.0230,[373]6.0277,[374]6.0262,[375]6.0263,[376]6.0330,[377]6.0284,[378]6.0310,[379]6.0369,[380]6.0292,[381]6.0257,[382]6.0202,[383]6.0194,[384]6.0190,[385]6.0179,[386]6.0178,[387]6.0174,[388]6.0135,[389]6.0084,[390]6.0013,[391]5.9937,[392]5.9895,[393]5.9883,[394]5.9908,[395]5.9896,[396]5.9823,[397]5.9887,[398]5.9923,[399]6.0000,[400]5.9998,[401]6.0008,[402]6.0019,[403]6.0037,[404]6.0098,[405]6.0009,[406]5.9981,[407]5.9977,[408]5.9991,[409]6.0108,[410]6.0219,[411]6.0329,[412]6.0486,[413]6.0599,[414]6.0681,[415]6.0734,[416]6.0810,[417]6.0932,[418]6.0972,[419]6.1042,[420]6.1135,[421]6.1247,[422]6.1288,[423]6.1359,[424]6.1462,[425]6.1546,[426]6.1609,[427]6.1653,[428]6.1735,[429]6.1786,[430]6.1866,[431]6.2003,[432]6.2038,[433]6.2030,[434]6.1990,[435]6.1998,[436]6.2023,[437]6.2119,[438]6.2196,[439]6.2165,[440]6.2153,[441]6.2102,[442]6.2085,[443]6.2096,[444]6.2101,[445]6.2082,[446]6.2104,[447]6.2135,[448]6.2176,[449]6.2151,[450]6.2157,[451]6.2116,[452]6.1991,[453]6.1911,[454]6.1856,[455]6.1866,[456]6.1918,[457]6.1938,[458]6.1914,[459]6.1920,[460]6.2007,[461]6.1979,[462]6.1966,[463]6.2010,[464]6.1999,[465]6.1972,[466]6.1896,[467]6.1900,[468]6.1902,[469]6.1923,[470]6.1927,[471]6.1878,[472]6.1924,[473]6.1872,[474]6.1887,[475]6.1829,[476]6.1847,[477]6.1777,[478]6.1767,[479]6.1823,[480]6.1869,[481]6.1886,[482]6.1844,[483]6.1800,[484]6.1820,[485]6.1806,[486]6.1749,[487]6.1750,[488]6.1727,[489]6.1679,[490]6.1656,[491]6.1628,[492]6.1572,[493]6.1544,[494]6.1529,[495]6.1526,[496]6.1488,[497]6.1434,[498]6.1418,[499]6.1377,[500]6.1286,[501]6.1222,[502]6.1223,[503]6.1217,[504]6.1130,[505]6.1149,[506]6.1156,[507]6.1097,[508]6.1055,[509]6.1049,[510]6.1085,[511]6.1131,[512]6.1167,[513]6.1188,[514]6.1250,[515]6.1198,[516]6.1187,[517]6.1199,[518]6.1195,[519]6.1224,[520]6.1247,[521]6.1262,[522]6.1290,[523]6.1298,[524]6.1355,[525]6.1389,[526]6.1398,[527]6.1415,[528]6.1367,[529]6.1375,[530]6.1324,[531]6.1313,[532]6.1360,[533]6.1387,[534]6.1373,[535]6.1394,[536]6.1342,[537]6.1321,[538]6.1371,[539]6.1381,[540]6.1419,[541]6.1423,[542]6.1434,[543]6.1450,[544]6.1461,[545]6.1443,[546]6.1449,[547]6.1410,[548]6.1361,[549]6.1361,[550]6.1333,[551]6.1298,[552]6.1276,[553]6.1241,[554]6.1221,[555]6.1191,[556]6.1188,[557]6.1215,[558]6.1177,[559]6.1171,[560]6.1169,[561]6.1174,[562]6.1151,[563]6.1146,[564]6.1188,[565]6.1207,[566]6.1206,[567]6.1186,[568]6.1191,[569]6.1176,[570]6.1206,[571]6.1211,[572]6.1218,[573]6.1215,[574]6.1179,[575]6.1175,[576]6.1177,[577]6.1162,[578]6.1144,[579]6.1148,[580]6.1083,[581]6.1048,[582]6.1037,[583]6.1043,[584]6.1046,[585]6.0970,[586]6.0902,[587]6.0906,[588]6.0954,[589]6.1005,[590]6.1035,[591]6.1054,[592]6.1041,[593]6.1006,[594]6.1013,[595]6.0989,[596]6.1021,[597]6.1001,[598]6.0972,[599]6.0994,[600]6.0989,[601]6.0977,[602]6
.0990,[603]6.1021,[604]6.1029,[605]6.1066,[606]6.1086,[607]6.1071,[608]6.1035,[609]6.1040,[610]6.1073,[611]6.1055,[612]6.1081,[613]6.1045,[614]6.0996,[615]6.0923,[616]6.0950,[617]6.0891,[618]6.0841,[619]6.0787,[620]6.0650,[621]6.0582,[622]6.0563,[623]6.0577,[624]6.0583,[625]6.0584,[626]6.0571,[627]6.0594,[628]6.0593,[629]6.0590,[630]6.0622,[631]6.0675,[632]6.0733,[633]6.0718,[634]6.0752,[635]6.0757,[636]6.0727,[637]6.0692,[638]6.0717,[639]6.0686,[640]6.0695,[641]6.0697,[642]6.0764,[643]6.0784,[644]6.0796,[645]6.0776,[646]6.0816,[647]6.0779,[648]6.0789,[649]6.0792,[650]6.0832,[651]6.0884,[652]6.0893,[653]6.0931,[654]6.0868,[655]6.0863,

llama_print_timings:        load time = 18133.61 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 6308946.77 ms / 334705 tokens (   18.85 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 6336925.73 ms

The RMSE of the 7B model becomes 0.00185228.
It looks like the perplexity will end up being around 6.27-6.28.
@ikawrakow
Contributor Author

Here are the Q4_0 perplexity results on M2 Max after the latest changes:

iwan@MacBook-Pro:~/other/llama.cpp/build$ ./bin/perplexity -m ../../quant/models/7B/ggml-model-q40k3.bin -f ../../old.llama.cpp/tests/wikitext-2-raw/wiki.test.raw 
main: seed = 1681273426
llama.cpp: loading model from ../../quant/models/7B/ggml-model-q40k3.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512 
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256 
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128 
llama_model_load_internal: f16        = 4 
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1 
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks
16.03 seconds per pass - ETA 2.92 hours
[1]4.4960,[2]5.0261,[3]5.8937,[4]6.4992,[5]6.6023,[6]6.5737,[7]6.7600,[8]6.8564,[9]7.2026,[10]7.4481,[11]7.6913,[12]7.7175,[13]7.6476,[14]7.7365,[15]7.9846,[16]7.5791,[17]7.4505,[18]7.3976,[19]7.0253,[20]7.0190,[21]6.9215,[22]6.7542,[23]6.7166,[24]6.6139,[25]6.6136,[26]6.4427,[27]6.2527,[28]6.1519,[29]6.0554,[30]5.8980,[31]5.8601,[32]5.8828,[33]5.8231,[34]5.8571,[35]5.8807,[36]5.9232,[37]5.9249,[38]5.9414,[39]5.9768,[40]6.0415,[41]6.0526,[42]6.0951,[43]6.0500,[44]6.1053,[45]6.1094,[46]6.0841,[47]6.1066,[48]6.0791,[49]6.0828,[50]6.0399,[51]6.0328,[52]6.0200,[53]6.0644,[54]6.0475,[55]6.0232,[56]6.0542,[57]6.0738,[58]6.0943,[59]6.1129,[60]6.1579,[61]6.1488,[62]6.2084,[63]6.2455,[64]6.2596,[65]6.3064,[66]6.3149,[67]6.3325,[68]6.3464,[69]6.3749,[70]6.4098,[71]6.4314,[72]6.4632,[73]6.5271,[74]6.5328,[75]6.5494,[76]6.5593,[77]6.5695,[78]6.5548,[79]6.5837,[80]6.5761,[81]6.5843,[82]6.5880,[83]6.5325,[84]6.5172,[85]6.5055,[86]6.4835,[87]6.4196,[88]6.3897,[89]6.3712,[90]6.3567,[91]6.3796,[92]6.3748,[93]6.3767,[94]6.3719,[95]6.4007,[96]6.3984,[97]6.3933,[98]6.3846,[99]6.3691,[100]6.3702,[101]6.3971,[102]6.3905,[103]6.4106,[104]6.4182,[105]6.4189,[106]6.4348,[107]6.4336,[108]6.4445,[109]6.4379,[110]6.4344,[111]6.4568,[112]6.4773,[113]6.4811,[114]6.4771,[115]6.4850,[116]6.4759,[117]6.4805,[118]6.5101,[119]6.5317,[120]6.5689,[121]6.5854,[122]6.6102,[123]6.6473,[124]6.6653,[125]6.6550,[126]6.6960,[127]6.7334,[128]6.7637,[129]6.7481,[130]6.7591,[131]6.7538,[132]6.7451,[133]6.7319,[134]6.7431,[135]6.7391,[136]6.7270,[137]6.7193,[138]6.7033,[139]6.6918,[140]6.6873,[141]6.6566,[142]6.6521,[143]6.6222,[144]6.6016,[145]6.5938,[146]6.5809,[147]6.5873,[148]6.5881,[149]6.5825,[150]6.5781,[151]6.5795,[152]6.5678,[153]6.5504,[154]6.5415,[155]6.5485,[156]6.5437,[157]6.5622,[158]6.5654,[159]6.5704,[160]6.5719,[161]6.5846,[162]6.5545,[163]6.5418,[164]6.5164,[165]6.4846,[166]6.4559,[167]6.4179,[168]6.3859,[169]6.3726,[170]6.3613,[171]6.3322,[172]6.3142,[173]6.2957,[174]6.2651,[175]6.2427,[176]6.2321,[177]6.2114,[178]6.1885,[179]6.1709,[180]6.1612,[181]6.1394,[182]6.1218,[183]6.1078,[184]6.1081,[185]6.0999,[186]6.1008,[187]6.1070,[188]6.1031,[189]6.1202,[190]6.1215,[191]6.1440,[192]6.1607,[193]6.1779,[194]6.1891,[195]6.2108,[196]6.2268,[197]6.2489,[198]6.2640,[199]6.2683,[200]6.2732,[201]6.2686,[202]6.2891,[203]6.2976,[204]6.2964,[205]6.3082,[206]6.3159,[207]6.3127,[208]6.3209,[209]6.3250,[210]6.3299,[211]6.3399,[212]6.3471,[213]6.3577,[214]6.3598,[215]6.3636,[216]6.3787,[217]6.3973,[218]6.4118,[219]6.4121,[220]6.4084,[221]6.4032,[222]6.4004,[223]6.3898,[224]6.3824,[225]6.3775,[226]6.3983,[227]6.4070,[228]6.4121,[229]6.4167,[230]6.4132,[231]6.4298,[232]6.4173,[233]6.4003,[234]6.3858,[235]6.3687,[236]6.3614,[237]6.3511,[238]6.3545,[239]6.3386,[240]6.3283,[241]6.3309,[242]6.3354,[243]6.3334,[244]6.3216,[245]6.3187,[246]6.3067,[247]6.2937,[248]6.2856,[249]6.2836,[250]6.2874,[251]6.2799,[252]6.2760,[253]6.2659,[254]6.2627,[255]6.2509,[256]6.2324,[257]6.2213,[258]6.2129,[259]6.2113,[260]6.2036,[261]6.1998,[262]6.1941,[263]6.1888,[264]6.1682,[265]6.1672,[266]6.1656,[267]6.1586,[268]6.1679,[269]6.1663,[270]6.1674,[271]6.1748,[272]6.1781,[273]6.1781,[274]6.1806,[275]6.1889,[276]6.1950,[277]6.2109,[278]6.2216,[279]6.2307,[280]6.2338,[281]6.2425,[282]6.2483,[283]6.2630,[284]6.2709,[285]6.2796,[286]6.2929,[287]6.2928,[288]6.2991,[289]6.2898,[290]6.2740,[291]6.2590,[292]6.2435,[293]6.2295,[294]6.2314,[295]6.2309,[296]6.2355,[297]6.2343,[298]6.2368,[299]6.2342,[300]6.2228,[301]6.2234,[302]6.2158,[303]6.2081,[304]6.2007,[305]6.1973,[30
6]6.1847,[307]6.1866,[308]6.1901,[309]6.1738,[310]6.1680,[311]6.1617,[312]6.1646,[313]6.1589,[314]6.1569,[315]6.1401,[316]6.1352,[317]6.1185,[318]6.0969,[319]6.1093,[320]6.1222,[321]6.1264,[322]6.1218,[323]6.1148,[324]6.1120,[325]6.1224,[326]6.1226,[327]6.1248,[328]6.1292,[329]6.1354,[330]6.1387,[331]6.1512,[332]6.1483,[333]6.1552,[334]6.1499,[335]6.1432,[336]6.1461,[337]6.1434,[338]6.1431,[339]6.1378,[340]6.1330,[341]6.1410,[342]6.1436,[343]6.1490,[344]6.1490,[345]6.1487,[346]6.1459,[347]6.1510,[348]6.1549,[349]6.1569,[350]6.1538,[351]6.1547,[352]6.1554,[353]6.1496,[354]6.1501,[355]6.1550,[356]6.1576,[357]6.1538,[358]6.1630,[359]6.1659,[360]6.1619,[361]6.1619,[362]6.1686,[363]6.1799,[364]6.1863,[365]6.1919,[366]6.1929,[367]6.2020,[368]6.1994,[369]6.1995,[370]6.2007,[371]6.1947,[372]6.1996,[373]6.2051,[374]6.2036,[375]6.2033,[376]6.2104,[377]6.2056,[378]6.2082,[379]6.2143,[380]6.2066,[381]6.2027,[382]6.1973,[383]6.1966,[384]6.1960,[385]6.1953,[386]6.1950,[387]6.1946,[388]6.1905,[389]6.1852,[390]6.1785,[391]6.1706,[392]6.1663,[393]6.1643,[394]6.1670,[395]6.1652,[396]6.1573,[397]6.1653,[398]6.1688,[399]6.1770,[400]6.1769,[401]6.1781,[402]6.1790,[403]6.1805,[404]6.1869,[405]6.1770,[406]6.1737,[407]6.1728,[408]6.1744,[409]6.1862,[410]6.1972,[411]6.2090,[412]6.2253,[413]6.2374,[414]6.2447,[415]6.2501,[416]6.2577,[417]6.2703,[418]6.2741,[419]6.2819,[420]6.2906,[421]6.3026,[422]6.3078,[423]6.3149,[424]6.3273,[425]6.3363,[426]6.3430,[427]6.3473,[428]6.3557,[429]6.3610,[430]6.3692,[431]6.3835,[432]6.3878,[433]6.3869,[434]6.3824,[435]6.3830,[436]6.3852,[437]6.3948,[438]6.4025,[439]6.3994,[440]6.3989,[441]6.3936,[442]6.3925,[443]6.3938,[444]6.3940,[445]6.3924,[446]6.3948,[447]6.3979,[448]6.4023,[449]6.3996,[450]6.4005,[451]6.3960,[452]6.3840,[453]6.3752,[454]6.3696,[455]6.3710,[456]6.3759,[457]6.3780,[458]6.3758,[459]6.3762,[460]6.3846,[461]6.3815,[462]6.3798,[463]6.3849,[464]6.3839,[465]6.3809,[466]6.3731,[467]6.3731,[468]6.3731,[469]6.3750,[470]6.3756,[471]6.3709,[472]6.3757,[473]6.3701,[474]6.3713,[475]6.3654,[476]6.3678,[477]6.3607,[478]6.3593,[479]6.3654,[480]6.3699,[481]6.3716,[482]6.3670,[483]6.3627,[484]6.3650,[485]6.3628,[486]6.3573,[487]6.3575,[488]6.3553,[489]6.3504,[490]6.3480,[491]6.3452,[492]6.3390,[493]6.3359,[494]6.3344,[495]6.3346,[496]6.3309,[497]6.3254,[498]6.3238,[499]6.3190,[500]6.3089,[501]6.3020,[502]6.3019,[503]6.3012,[504]6.2919,[505]6.2942,[506]6.2951,[507]6.2897,[508]6.2860,[509]6.2853,[510]6.2893,[511]6.2940,[512]6.2975,[513]6.2995,[514]6.3062,[515]6.3008,[516]6.3000,[517]6.3011,[518]6.3014,[519]6.3048,[520]6.3076,[521]6.3091,[522]6.3120,[523]6.3127,[524]6.3183,[525]6.3223,[526]6.3237,[527]6.3254,[528]6.3204,[529]6.3209,[530]6.3161,[531]6.3149,[532]6.3196,[533]6.3219,[534]6.3200,[535]6.3221,[536]6.3164,[537]6.3141,[538]6.3188,[539]6.3196,[540]6.3235,[541]6.3241,[542]6.3252,[543]6.3267,[544]6.3282,[545]6.3258,[546]6.3268,[547]6.3225,[548]6.3175,[549]6.3173,[550]6.3142,[551]6.3103,[552]6.3082,[553]6.3042,[554]6.3018,[555]6.2988,[556]6.2986,[557]6.3010,[558]6.2971,[559]6.2964,[560]6.2964,[561]6.2963,[562]6.2944,[563]6.2940,[564]6.2981,[565]6.3002,[566]6.2999,[567]6.2977,[568]6.2980,[569]6.2966,[570]6.2991,[571]6.2994,[572]6.3003,[573]6.3004,[574]6.2968,[575]6.2967,[576]6.2969,[577]6.2956,[578]6.2936,[579]6.2943,[580]6.2875,[581]6.2834,[582]6.2822,[583]6.2829,[584]6.2830,[585]6.2754,[586]6.2686,[587]6.2690,[588]6.2736,[589]6.2792,[590]6.2821,[591]6.2841,[592]6.2825,[593]6.2790,[594]6.2799,[595]6.2775,[596]6.2812,[597]6.2789,[598]6.2757,[599]6.2778,[600]6.2773,[601]6.2759,[602]6
.2774,[603]6.2805,[604]6.2815,[605]6.2849,[606]6.2868,[607]6.2852,[608]6.2816,[609]6.2822,[610]6.2859,[611]6.2838,[612]6.2865,[613]6.2828,[614]6.2774,[615]6.2698,[616]6.2727,[617]6.2668,[618]6.2617,[619]6.2562,[620]6.2418,[621]6.2345,[622]6.2327,[623]6.2344,[624]6.2347,[625]6.2347,[626]6.2333,[627]6.2351,[628]6.2353,[629]6.2346,[630]6.2379,[631]6.2441,[632]6.2496,[633]6.2478,[634]6.2513,[635]6.2518,[636]6.2490,[637]6.2459,[638]6.2487,[639]6.2456,[640]6.2465,[641]6.2470,[642]6.2538,[643]6.2558,[644]6.2569,[645]6.2549,[646]6.2592,[647]6.2554,[648]6.2565,[649]6.2566,[650]6.2607,[651]6.2665,[652]6.2674,[653]6.2715,[654]6.2649,[655]6.2644,

llama_print_timings:        load time = 16520.56 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 6079851.63 ms / 334705 tokens (   18.16 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 6107834.08 ms

Basically, we use two Q4_0 quantizations, each having 16 weights,
to quantize a set of 32 weights. We get two separate scaling
factors, which we store as fp16, ending up using the exact same
5 bits per weight as the current Q4_0.

We end up with an rmse of ~0.00159, so basically the same as
the improved Q4_1. But this should run faster than `Q4_1`
(unless fp16 -> fp32 conversion is somehow very slow).
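A sketch of what such a block could look like (hypothetical struct and field names): two fp16 scales for two groups of 16 cost the same 4 bytes as the single fp32 scale of the current Q4_0, so the total stays at 5 bits per weight.

```cpp
#include <cstdint>

using ggml_fp16_t = uint16_t;  // fp16 stored as raw bits for this sketch

// Hypothetical layout: 32 weights in two groups of 16, each with its own scale.
struct block_q4_0_g16 {
    ggml_fp16_t d[2];    // one fp16 scale per group of 16 weights
    uint8_t     qs[16];  // 32 x 4-bit quants
};
static_assert(sizeof(block_q4_0_g16) == 20, "same 20 bytes / 5 bits per weight as Q4_0");
```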
As in the last commit, but with the Q4_1 type, using the same memory as
existing Q4_1 via fp16.

We end up with
rmse 0.00125125, maxerr 0.11657715, 95pct<0.0024, median<0.0010
after a quantize - dequantize roundtrip.

This is quite a bit better than Q4_1 with groups of 32 weights,
but not nearly as good as the 5-bit quantization that uses the same
amount of memory, where we had
rmse 0.00076131, maxerr 0.05273438, 95pct<0.0016, median<0.0006
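And the Q4_1-type analogue (again hypothetical names): per group of 16 weights an fp16 scale and an fp16 minimum, i.e. 12 bytes per 16 weights, which adds up to the same 24 bytes per 32 weights / 6 bits per weight as the existing Q4_1.

```cpp
#include <cstdint>

using ggml_fp16_t = uint16_t;  // fp16 stored as raw bits for this sketch

// Hypothetical layout: 16 weights per block with fp16 scale and minimum.
struct block_q4_1_g16 {
    ggml_fp16_t d;      // scale
    ggml_fp16_t m;      // minimum
    uint8_t     qs[8];  // 16 x 4-bit quants
};
static_assert(sizeof(block_q4_1_g16) == 12, "24 bytes per 32 weights, as Q4_1");
```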
@ikawrakow
Contributor Author

@ggerganov I'm going on vacation today and it is unlikely I will have time to work on this when I come back. There is some interesting stuff in here, but I leave it up to you to decide if you want to close/merge/cherry-pick bits and pieces of it. From what I have seen in these few days, improvements to "classic" 4-bit quantization (i.e., Q4_0 and Q4_1 on groups of 32 weights) do not make a real difference to the quality of the results (at least not as measured by perplexity). Q4_0 and Q4_1 on groups of 16 weights look more promising, and even more so Q5_1. But in the end, avoiding quantization of intermediate results will always be more important than any improvements to model weight quantization.

@ggerganov
Owner

ggerganov commented Apr 13, 2023

@ikawrakow

Thank you for the analysis - there are definitely interesting results and techniques here.
The marginal gains in perplexity via better quantization are indeed a little disappointing to see (I was hoping for more), but I think it's still better than nothing. As long as we are able to make an efficient implementation of any of the proposed approaches, we should merge them.

But in the end, avoiding quantization of intermediate results will always be more important than any improvements to model weight quantization.

Btw, I just realized something that might address your second point.
When multiplying z = x * y, where x is 4-bit and y is 32-bit, we currently quantize y to 4-bit, only to unpack it back to 8 bits in the dot-product call. This is obviously a big waste. We should quantize y straight to 8 bits instead, which would preserve a lot of precision.

I will add this idea to #909
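To make the idea concrete, here is a scalar sketch of such a mixed Q4/Q8 dot product (hypothetical block types and nibble layout; the real thing would of course be vectorized): the activations y are quantized once to 8 bits per value, and the inner loop works on integers until the per-block scales are applied.

```cpp
#include <cstdint>

// Hypothetical blocks of 32 values each.
struct blk_q4 { float d; uint8_t qs[16]; };  // 4-bit weights, stored with an offset of 8
struct blk_q8 { float d; int8_t  qs[32]; };  // 8-bit activations

float dot_q4_q8(const blk_q4 * x, const blk_q8 * y, int nblocks) {
    float sum = 0.f;
    for (int b = 0; b < nblocks; ++b) {
        int32_t isum = 0;
        for (int j = 0; j < 16; ++j) {
            const int v0 = (x[b].qs[j] & 0x0F) - 8;  // low nibble
            const int v1 = (x[b].qs[j] >>   4) - 8;  // high nibble
            isum += v0 * y[b].qs[2*j + 0] + v1 * y[b].qs[2*j + 1];
        }
        sum += x[b].d * y[b].d * (float) isum;
    }
    return sum;
}
```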

q8_0 : rmse 0.00010729, maxerr 0.01030385, 95pct<0.0002, median<0.0002
@ikawrakow
Contributor Author

Btw, I just realized something that might address your second point. When multiplying z = x * y, where x is 4-bit and y is 32-bit, we currently quantize y to 4-bit, only to unpack it back to 8 bits in the dot-product call. This is obviously a big waste. We should quantize y straight to 8 bits instead, which would preserve a lot of precision.

I will add this idea to #909

Great point. I had a few minutes before leaving for the airport and tried 8-bit quantization. Just the simplest possible (and very fast) variant 127/max. We get

rmse 0.00010729, maxerr 0.01030385, 95pct<0.0002, median<0.0002

which shows that you will indeed get a massive gain in accuracy if you quantize directly to 8 bits.
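For reference, that simplest 127/max variant is just absmax scaling per group (a sketch with a hypothetical block type): take d = max|x|/127 and round each value to x/d.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct blk_q8 { float d; int8_t qs[32]; };  // hypothetical 8-bit block

// Simplest absmax 8-bit quantization of one group of 32 floats.
void quantize_q8_absmax(const float * x, blk_q8 * out) {
    float amax = 0.f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    const float d  = amax / 127.f;
    const float id = d != 0.f ? 1.f / d : 0.f;
    out->d = d;
    for (int i = 0; i < 32; ++i) {
        out->qs[i] = (int8_t) std::lround(x[i] * id);  // in [-127, 127]
    }
}
```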

@ggerganov ggerganov linked an issue Apr 14, 2023 that may be closed by this pull request
}
};
int nthread = std::min(nchunk, int(std::thread::hardware_concurrency()));
std::vector<std::thread> workers(nthread-1);
Contributor

use std::jthread so you don't have to join them in a loop

Collaborator

Currently we are targeting C++11, and std::jthread seems to be C++20.
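For illustration, the C++11 pattern the current code has to use (with a hypothetical worker function), versus what std::jthread would buy in C++20:

```cpp
#include <thread>
#include <vector>

void do_chunk(int /*chunk*/) { /* quantize one chunk of rows */ }  // hypothetical worker

int main() {
    const int nthread = 4;
    // C++11: join the workers explicitly before they go out of scope.
    std::vector<std::thread> workers;
    for (int t = 1; t < nthread; ++t) workers.emplace_back(do_chunk, t);
    do_chunk(0);  // main thread takes the first chunk
    for (auto & w : workers) w.join();
    // With C++20 one could use std::vector<std::jthread> and drop the join loop,
    // since std::jthread joins in its destructor.
}
```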

Collaborator

Actually, no.

CXXFLAGS += -std=c++23 -DGGML_BIG_ENDIAN

ikawrakow pushed a commit that referenced this pull request Apr 19, 2023
For quantize-stats we get
q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct<0.0030, median<0.0012

For 7B perplexity with BLAS enabled we get 6.2038 after 655 chunks.

Quantization is slow (~90 seconds on my Mac for 7B) as it is not
multi-threaded as in PR #896.
ikawrakow added a commit that referenced this pull request Apr 19, 2023
* Q4_2 quantization with rmse-optimized scale and quants

For quantize-stats we get
q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct<0.0030, median<0.0012

For 7B perplexity with BLAS enabled we get 6.2038 after 655 chunks.

Quantization is slow (~90 seconds on my Mac for 7B) as it is not
multi-threaded as in PR #896.

* ggml : satisfy the sanitizer builds

Not sure why this makes them fail

* Better follow ggml conventions for function names

* Fixed type as per reviewer comment

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@ikawrakow
Contributor Author

I think we can close this now. Most of what was here is now in PR #1106, implemented in C rather than in C++ as was done here.

@ikawrakow ikawrakow closed this Apr 21, 2023
@ikawrakow ikawrakow deleted the quantize_experiments branch April 21, 2023 15:33
@fernando-neto-ai

Hi! I am having the following errors while trying to build it:

I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:
I CC: cc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX: g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

cc -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -c ggml.c -o ggml.o
ggml.c: In function 'bytes_from_nibbles_16':
ggml.c:439:19: warning: implicit declaration of function '_mm_loadu_si64'; did you mean '_mm_loadl_epi64'? [-Wimplicit-function-declaration]
__m128i tmp = _mm_loadu_si64( ( const __m128i* )rsi );
^~~~~~~~~~~~~~
_mm_loadl_epi64
ggml.c:439:19: error: incompatible types when initializing type '__m128i {aka __vector(2) long long int}' using type 'int'
ggml.c: In function 'ggml_vec_dot_q4_2_q8_0':
ggml.c:2826:40: warning: implicit declaration of function '_mm256_set_m128'; did you mean '_mm256_set_epi8'? [-Wimplicit-function-declaration]
const __m256 d = _mm256_mul_ps(_mm256_set_m128(d1, d0), _mm256_broadcast_ss(&y[i].d));
^~~~~~~~~~~~~~~
_mm256_set_epi8
ggml.c:2826:40: error: incompatible type for argument 1 of '_mm256_mul_ps'
In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:41:0,
from ggml.c:189:
/usr/lib/gcc/x86_64-linux-gnu/7/include/avxintrin.h:318:1: note: expected '__m256 {aka __vector(8) float}' but argument is of type 'int'
_mm256_mul_ps (__m256 __A, __m256 __B)
^~~~~~~~~~~~~
ggml.c:2830:22: warning: implicit declaration of function '_mm256_set_m128i'; did you mean '_mm256_set_epi8'? [-Wimplicit-function-declaration]
__m256i bx = _mm256_set_m128i(bx1, bx0);
^~~~~~~~~~~~~~~~
_mm256_set_epi8
ggml.c:2830:22: error: incompatible types when initializing type '__m256i {aka __vector(4) long long int}' using type 'int'
ggml.c: In function 'ggml_vec_dot_q4_3_q8_0':
ggml.c:2956:27: error: incompatible types when initializing type '__m256 {aka const __vector(8) float}' using type 'int'
const __m256 dx = _mm256_set_m128(d1, d0);
^~~~~~~~~~~~~~~
ggml.c:2963:28: error: incompatible types when initializing type '__m256i {aka const __vector(4) long long int}' using type 'int'
const __m256i bx = _mm256_set_m128i(bx1, bx0);
^~~~~~~~~~~~~~~~
At top level:
ggml.c:1139:13: warning: 'quantize_row_q4_2_reference' defined but not used [-Wunused-function]
static void quantize_row_q4_2_reference(const float * restrict x, block_q4_2 * restrict y, int k) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~
Makefile:161: recipe for target 'ggml.o' failed
make: *** [ggml.o] Error 1

Any idea?

@slaren
Collaborator

slaren commented Apr 23, 2023

Your version of gcc is too old, check #1120.

@fernando-neto-ai

I hadn't noticed that the gcc in the Docker image I was using was old. I'm sorry! Thank you very much!

jeroen-mostert pushed a commit to jeroen-mostert/llama.cpp that referenced this pull request Aug 30, 2024
This allows local build options (like LLAMA_*) to be set in the local
file instead of having to edit Makefile, or provide a long gmake command
line on every build.

Using '-include' avoids generating a warning if Makefile.local doesn't
exist.
Development

Successfully merging this pull request may close these issues.

Investigate alternative approach for Q4 quantization