New SOTA 2-Bit Quant released: QuIP-Sharp #4327
-
Just my few bits: given that the quantization scheme is totally different, the better question might be: is the Q2_K quant worse than the new one on the same model? Comparing to another model just by binary size is not very useful; models with more parameters are known to perform better even at heavy quantization. How the quantization works also matters: the k-quants are super nice to work with, since you only need to load a small block at a time.
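
For readers unfamiliar with the k-quants, here is a minimal sketch of what block-wise dequantization looks like. The layout is simplified for illustration and the names are made up; the real ggml `Q2_K` format packs 16-element sub-blocks with 4-bit scales and mins inside 256-element super-blocks:

```python
# Illustrative block-wise dequantization in the k-quants style:
# each sub-block has its own scale/min, so a kernel can dequantize
# just the small block it needs. Layout simplified for clarity.
import numpy as np

def dequant_super_block(q, sub_scales, sub_mins, d, dmin, sub=16):
    # q: 2-bit codes (values 0..3) for one 256-element super-block
    # d, dmin: super-block scales applied to the 4-bit sub-scales/mins
    out = np.empty(q.size, dtype=np.float32)
    for j in range(q.size // sub):
        sl = slice(j * sub, (j + 1) * sub)
        # per-sub-block affine dequant: x = (d * scale) * q - (dmin * min)
        out[sl] = d * sub_scales[j] * q[sl] - dmin * sub_mins[j]
    return out
```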
-
Did you see the perplexity? That 2-bit model is drunk as hell :D
-
@Dampfinchen Where do the perplexities you posted above come from? If there is still interest in better quantization approaches, I can publish k-quants models with lower perplexities, including (almost) pure 2-bit models.

It is possible to do better than this (slightly smaller model sizes and lower perplexities), but that requires a significant change in `llama.cpp`.
-
OK, here is a more apples-to-apples comparison to the results published in the QuIP# paper. They computed perplexities using a context window of 2048 for LLaMA-1 and 4096 for LLaMA-2 (see the table near the end of the paper), so the table below does the same.

Note that there isn't a one-to-one correspondence in model sizes between QuIP# and the k-quants.
-
Hi, one of the QuIP# authors here. Thanks for your interest in our work and for putting in the effort to run this comparison! QuIP# has two core components: incoherence processing and a lattice codebook.
During inference, we first run a Hadamard transform, do a matmul with the quantized weight matrix, and then do a reverse Hadamard transform on the output. Our implementation for E8P can be found here. We have CUDA kernels for doing these operations efficiently that can be used to guide integration into llama.cpp. QuIP# should be relatively straightforward to implement since it uses the same compression scheme for every weight, vs mixed-precision methods that use different precisions for different weights.

QuIP# achieves true 2-bit models, whereas other "2 bit" methods with grouping usually end up with significantly more than 2 bits per weight. Our experiments show that our method achieves state-of-the-art results at true 2 bits. Regarding your comparison with Q2K (2.6 bits) and Q2K* (2.3 bits), this is not really an apples-to-apples comparison because it compares approaches that quantize the embedding and output tensors with those that don't. If you want to compare methods that quantize all the tensors, we'll need to produce some QuIP# models that do that. One easy way to test the performance of QuIP# with a quantized embedding and output tensor is to copy your Q2K*-quantized embedding and output tensors into QuIP# and calculate perplexity with that.

@ikawrakow Can you share instructions on how to reproduce your numbers? Finally, this project is in active development, so we expect our method to improve in the coming months.
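
To make the inference path above concrete, here is a minimal sketch under some simplifying assumptions: the 2-bit codes are taken to be already decoded back into an fp matrix `W_hat` (the actual E8P lattice decode and the randomized sign flips live in the QuIP# repo), and the layer dimensions are powers of two so `scipy.linalg.hadamard` applies. This is not the real kernel code:

```python
# Sketch of a QuIP#-style forward pass: rotate the input with a
# normalized Hadamard transform, multiply by the quantized weights,
# then undo the rotation on the output side.
import torch
from scipy.linalg import hadamard

def quip_style_linear(x, W_hat):
    n_out, n_in = W_hat.shape
    H_in = torch.tensor(hadamard(n_in), dtype=x.dtype) / n_in ** 0.5
    H_out = torch.tensor(hadamard(n_out), dtype=x.dtype) / n_out ** 0.5
    xr = x @ H_in          # incoherence transform on the activations
    yr = xr @ W_hat.T      # matmul with the (decoded) quantized weights
    return yr @ H_out      # reverse transform on the output
```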
-
I have posted the LLaMA-v2 models quantized with the improved quantization method.
-
Interesting! This is just experimental right now, is that correct? Not committed to `llama.cpp` yet.
-
Yes, I have not contributed the improved quantization method to `llama.cpp`.
-
I have posted the 2-bit quantized LLaMA-v1 models on Huggingface in this repository. For some reason the 65B model is being rejected (yes, I have set up the repository to accept files larger than 5 GB, and have successfully pushed the 34B model, which is 10.8 GB).
-
Hi @ikawrakow, we ran your Q2K (not *, since those were not released) models from https://huggingface.co/ikawrakow/llama-v1-2bit-gguf/tree/main on the evaluation pipeline that we used to generate the QuIP# numbers. The numbers we are getting do not match the numbers you report: the Q2K perplexities we get are higher than what you report, and our model size on disk for QuIP# is also slightly smaller than what you report. These are the numbers we are getting:
The trend here is basically "you get what you pay for": the larger the model, the better the results, which is not surprising. We also ran a simple experiment with quantizing the embedding and output layers using a very naive algorithm: a Hadamard transform followed by round-to-nearest.

This is possibly the "dumbest" thing one can do with a Hadamard transform, as there is no Hessian information, adaptive rounding, or groupwise scaling. Thus, the results here for quantizing the QuIP# embedding/output layers with this algorithm should be taken as an upper bound on the perplexities achievable with QuIP# and quantized embedding/output layers. These are indicated in the table as QuIP# quant emb (k1) + output (k2), where k1 is the number of bits for the embedding and k2 is the number of bits for the output. QuIP# can achieve a significant reduction in size without sacrificing performance by quantizing these two layers, which I suspect is true of Q2K as well. In this setting, the difference in size between a 1.77G QuIP# model (QuIP# quant emb (4) + output (6)) and Q2K* (2.22G) is more than 20%.

Our code that generated these numbers is available at https://github.com/Cornell-RelaxML/quip-sharp/tree/q2k_test and the commands to run the code are in q2k.sh. The regular QuIP# numbers can be obtained from the code in the main branch. The embedding/output quantized models (which we saved as fp16 to avoid writing unnecessary code for this test) were generated by hack_emb.py and use the eval scripts in the main branch. We would highly appreciate it if you could tell us how you got your numbers in case we are misunderstanding something here.
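
A minimal sketch of what such a naive Hadamard + round-to-nearest scheme could look like, assuming a power-of-two matrix width (illustrative only; this is not the code in the q2k_test branch):

```python
# Naive k-bit quantization of a weight matrix: rotate with a normalized
# Hadamard transform, round to nearest with a single per-tensor scale
# (no Hessian, no adaptive rounding, no groupwise scaling), rotate back.
import torch
from scipy.linalg import hadamard

def naive_hadamard_rtn(W, nbits):
    n = W.shape[1]                                  # must be a power of two
    H = torch.tensor(hadamard(n), dtype=W.dtype) / n ** 0.5
    Wr = W @ H                                      # spread outliers out
    qmax = 2 ** (nbits - 1) - 1
    s = Wr.abs().max() / qmax                       # one scale per tensor
    q = torch.clamp(torch.round(Wr / s), -qmax - 1, qmax)
    return (q * s) @ H.T                            # dequantize, rotate back
```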
-
I have spent some more time looking into this, and have created 3 new quantizations: "pure" 2-bit, 2.156 bits per weight (bpw), and 2.28 bpw. The results for these, along with the original Q2_K, are in the table below.

Is there interest to integrate one or more of these into `llama.cpp`? If there is interest but 4 different 2-bit variants are considered too much, which should I pick?
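
As a side note on where fractional figures such as 2.156 or 2.28 bpw can come from: the effective bits per weight of a block format is the payload bits plus the per-block metadata amortized over the block size. The layouts in this sketch are made up for illustration; they are not the actual new formats:

```python
# Back-of-the-envelope effective bits-per-weight for a block format:
# code bits per weight plus amortized per-block metadata bits.
def effective_bpw(code_bits, block_size, metadata_bits):
    return code_bits + metadata_bits / block_size

# 2-bit codes + one fp16 scale per 256-weight super-block -> 2.0625 bpw
print(effective_bpw(2, 256, 16))
# add a 4-bit sub-scale per 16 weights on top             -> 2.3125 bpw
print(effective_bpw(2, 256, 16) + effective_bpw(0, 16, 4))
```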
-
Do you have the models from your latest table available online? And do you have fp16 numbers for your table? Thanks
Quoting @ikawrakow's earlier reply:
> How are you able to produce Q2_K LLaMA-v2-70B with a model size of 22.90 GiB and perplexity of 3.671?
>
> By using an "importance matrix". The quantization approach that is currently available in llama.cpp minimizes some similarity measure (e.g., RMSE, mean absolute difference, etc.) between the fp16 weights and the quantized weights without accounting for differences in weight importance. Because in practice some weights really are more important than others, one cannot go for full minimization of the difference, as this sometimes leads to very undesirable results. The ffn_down tensors are particularly sensitive, especially in the first few layers.
>
> Getting the infrastructure to generate and then use the importance matrix during quantization requires quite a bit of changes to llama.cpp/ggml. This is why, at least for now, I do the quantization using my forked llama.cpp version and publish the resulting models on HF.
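
To illustrate the idea of an importance-weighted objective, here is a minimal sketch (all names hypothetical; this is not llama.cpp's actual implementation): instead of minimizing the plain squared error, the scale is chosen to minimize an error weighted by a per-weight importance.

```python
# Hypothetical sketch: pick a per-block scale d and 2-bit codes q that
# minimize sum_i imp[i] * (w[i] - dequant(d, q[i]))^2 rather than the
# unweighted RMSE. imp could come from accumulated activation statistics.
import numpy as np

def quantize_block_weighted(w, imp, nbits=2):
    qmax = 2 ** nbits - 1                            # 2-bit codes: 0..3
    d0 = (np.max(np.abs(w)) + 1e-12) / (qmax / 2)    # naive max-abs scale
    best_d, best_q, best_err = d0, None, np.inf
    # scan candidate scales around the naive one
    for f in np.linspace(0.7, 1.3, 25):
        d = f * d0
        q = np.clip(np.round(w / d + qmax / 2), 0, qmax)
        err = np.sum(imp * (w - d * (q - qmax / 2)) ** 2)
        if err < best_err:
            best_d, best_q, best_err = d, q.astype(np.uint8), err
    return best_d, best_q
```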
-
Oobabooga implemented this in the webui, and in terms of memory it certainly seems a lot better than the current Q2_K, by a landslide. A Q2_K 13B model needs around 5.4 GB, while a 2-bit QuIP model only needs around 3.8 GB: https://huggingface.co/relaxml/Llama-2-13b-E8P-2Bit/tree/main. This means a 13B model can be fully offloaded on a 6 GB GPU.

Likewise, according to Oobabooga, a 70B model now fits entirely within 24 GB of VRAM at a context of 3072.

This is likely because it is true 2-bit and not a mixture of different bit widths, as is the case with the k-quants in llama.cpp.
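
The rough arithmetic behind those figures (my own back-of-the-envelope numbers, not from either project):

```python
# Weight storage is roughly params * bits_per_weight / 8. Q2_K averages
# well above 2 bpw because of per-block scales and some higher-precision
# tensors; QuIP# stores its matrices at a true 2 bpw but keeps the
# embedding and output tensors in fp16, which adds to the file size.
def weights_gib(params_billion, bpw):
    return params_billion * 1e9 * bpw / 8 / 2**30

print(f"13B @ 2.0 bpw: {weights_gib(13, 2.0):.1f} GiB")  # ~3.0 GiB
print(f"13B @ 2.6 bpw: {weights_gib(13, 2.6):.1f} GiB")  # ~3.9 GiB
```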
Based on the table provided by Oobabooga, perplexity looks promising.

I think answers to the following questions need to be pursued to check whether an implementation would make sense:

1. How does a Llama 2 7B model at Q4_K_S (which is 3.8 GB in size) compare, perplexity-wise, to a Llama 2 13B QuIP-Sharp model which is also 3.8 GB? If the 13B is better, then it would absolutely make sense to implement it.
2. How does Q2_K compare to 2-bit QuIP-Sharp, perplexity-wise? Even if it is worse than Q2_K, which is pretty likely, the massive memory savings can't be ignored. A 13B model measuring just around 3.8 GB is truly unprecedented.
According to Ooba's data, it's pretty interesting how pure 2-bit QuIP outperforms Exllama's 2.5 bpw quantization.

Note: you may have heard about QuIP in the past. QuIP-Sharp is a new version that is drastically improved.
Here's the link to the repo: https://github.com/Cornell-RelaxML/quip-sharp
@slaren @ggerganov @ikawrakow Curious to hear your opinions.