
int8 quantization attempt #2 #364

Merged: 7 commits into master on Oct 9, 2023

Conversation

karpathy
Owner

Ok, attempt number two:

  • separate, new runq.c file
  • the Makefile currently compiles both; I'm not sure I'm super happy with that yet
  • only did very quick testing and the results look good: they are similar to fp32 but diverge after a few dozen tokens. I'm not sure that divergence is avoidable in principle.

There could be bugs; I hastily ported this from my previous int8 version (attempt 1) into the new code. TODO: I want to gain more confidence that this is correct before merging. The Makefile needs some thought.

@karpathy changed the title from "draft of int8 attempt number two" to "int8 quantization attempt #2" on Aug 26, 2023
@atamurad
Contributor

I'd suggest making more use of the QuantizedTensor struct/abstraction since it's already there: make its internals opaque inside forward() and change the function signatures to take it as an argument:

matmul(float *xout, QuantizedTensor *x, QuantizedTensor *w, int n, int d)
dequantize(QuantizedTensor *qx, float *x)
quantize(QuantizedTensor *qx, float *x)

This will help with 1) code readability/simplicity: when I'm reading the forward() function I'm already (and only) thinking about vectors, not about how they are quantized, and 2) making it easier to experiment with and add other quantization techniques, since we wouldn't have to touch the transformer code at all.

These two lines could be shorter, IMHO:

matmul(s->hb, s->xq.q, s->xq.s, w->w1.q + l*dim*hidden_dim, w->w1.s + l*dim*hidden_dim/GS, dim, hidden_dim);
matmul(s->hb2, s->xq.q, s->xq.s, w->w3.q + l*dim*hidden_dim, w->w3.s + l*dim*hidden_dim/GS, dim, hidden_dim);
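For illustration, a minimal sketch of roughly what I have in mind (the struct layout and the per-layer indexing here are assumptions, not final code):

typedef struct {
    int8_t* q;   // quantized values
    float* s;    // scale factors, one per group of GS values
} QuantizedTensor;

// if w->w1 / w->w3 become arrays with one QuantizedTensor per layer,
// the two calls above collapse to:
matmul(s->hb,  &s->xq, w->w1 + l, dim, hidden_dim);
matmul(s->hb2, &s->xq, w->w3 + l, dim, hidden_dim);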

I can draft a PR if you want to see how it would all look.

@karpathy
Owner Author

Quite agree, good idea.

@karpathy mentioned this pull request on Aug 27, 2023
@atamurad
Contributor

atamurad commented Aug 27, 2023

First draft refactor is here:
atamurad@f850a97

  • matmul(), quantize() and dequantize() all take QuantizedTensor
  • Cleaned up memory_map_weights()
  • Code compiles and runs the same as before the refactor

EDIT:
In this abstraction, a QuantizedTensor is atomic: we do not index into it or slice from it; we can only matmul with it or dequantize all of its elements. This model breaks for the token embedding (which gets sliced per token) and for the shared classifier weights (which get matmul-ed). So I split the token embedding into separate rows and exported wcls as its own QuantizedTensor as well (as if it were not shared). I need to think about what to do with this.
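One option would be to dequantize the whole embedding table back to float once at load time, so forward() only ever slices a plain float array and wcls can stay a single QuantizedTensor. A rough sketch (field names like q_tokens and token_embedding_table are placeholders, not necessarily what the code will end up with):

// at load time: one whole-tensor dequantize of the embedding table
w->token_embedding_table = malloc(p->vocab_size * p->dim * sizeof(float));
dequantize(w->q_tokens, w->token_embedding_table);
// if the classifier is shared, it just reuses the same QuantizedTensor
if (shared_weights) { w->wcls = w->q_tokens; }

// in forward(): a plain float slice, no indexing into a QuantizedTensor
float* content_row = w->token_embedding_table + token * dim;
memcpy(x, content_row, dim * sizeof(float));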

@karpathy
Owner Author

Actually, on a quick skim other than the slight weirdness with wcls and wemb this looks very nice! I'll stare at this more tomorrow.

@atamurad
Contributor

Actually, on a quick skim other than the slight weirdness with wcls and wemb this looks very nice! I'll stare at this more tomorrow.

Sounds good! Please review this branch instead: https://github.com/atamurad/llama2.c/tree/int8_refactor
I added another commit to properly handle token embedding / wcls shared weights. Hopefully this one looks less weird.

@rdentato
Contributor

Maybe if we find the proper abstraction for tensors, we can push all the differences there and have a clean run.c that operates on tensors (which would make how it works extremely clear), plus some messier implementations of those operations, one for each version (float32, int8, int4-cuda, ...).
Probably something to check after the initial merge? At that point we will already have two "official" versions (float32 and int8) that should coexist, and if that can be made "transparent" (from the inference point of view), it will set the "standard" for the other versions.

@atamurad
Contributor

I did some tests with the llama2-7b-chat model. Quantization, v2 export, and inference with runq all worked.

On my laptop I'm getting about 0.22 tokens/second. For comparison, llama.cpp q8_0 does about 0.25 tokens/sec.
(4 threads, AVX2, no AVX-512)

This int8 implementation vectorizes well. I tried some AVX2 intrinsics for matmul and got basically the same result/speed as plain C when testing with the 7b-chat model.
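For reference, the group-wise inner loop we're talking about looks roughly like this (a sketch assuming the QuantizedTensor layout above and a compile-time-visible group size GS, not the PR's exact code): integer multiply-accumulates within each group, then a single float multiply per group by the two scale factors, a pattern compilers auto-vectorize well.

#include <stdint.h>

void matmul(float* xout, QuantizedTensor* x, QuantizedTensor* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,); n is assumed to be a multiple of GS
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        int in = i * n;
        for (int j = 0; j < n; j += GS) {
            int32_t ival = 0;
            for (int k = 0; k < GS; k++) {
                ival += ((int32_t) x->q[j + k]) * ((int32_t) w->q[in + j + k]);
            }
            val += ((float) ival) * w->s[(in + j) / GS] * x->s[j / GS];
        }
        xout[i] = val;
    }
}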

@karpathy
Owner Author

karpathy commented Sep 5, 2023

@atamurad sorry I'm slow btw, I'm traveling for the next ~2 weeks. I will probably have chunks of time available to dig into your PR, but I'm not sure when, as my schedule is chaotic. From a quick skim I do think your approach is better than my first draft, and it's much simpler to lay out the scale factors right next to the quantized values.

@atamurad
Contributor

atamurad commented Sep 5, 2023

@karpathy safe travels! It might be too early to merge to master yet, but I'll submit my refactor as a PR against your feature/int8_try2 branch for now.

I don't have input on the Makefile, or on whether, as @rdentato suggested, the two versions can live in a single run.c file.

@karpathy
Owner Author

karpathy commented Sep 5, 2023

sounds good

@atamurad mentioned this pull request on Sep 5, 2023
@karpathy
Owner Author

karpathy commented Sep 5, 2023

On the 110M model with make runfast I'm currently seeing 29 tok/s on legacy and 38 tok/s on int8 quantized.
(i.e. a ~30% speedup, less than I'd originally hoped for)

@atamurad
Contributor

atamurad commented Sep 5, 2023

Specifying GS at compile time (const int GS = 64;) gets me another ~25% speedup on the 110M model.
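In other words, roughly this change (a sketch; I'm assuming GS is currently a global filled in while reading the model file):

// before: group size only known at runtime
//     int GS;    /* filled in while reading the model file */
// after: a compile-time constant, so the GS-wide inner loop in matmul has a
// fixed trip count the compiler can fully unroll and vectorize
const int GS = 64;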

@karpathy
Owner Author

karpathy commented Sep 5, 2023

Ugh, that's annoying. I tried it but I only go up to 40 tok/s. This isn't exactly a thing we can merge, though.

@karpathy
Owner Author

karpathy commented Sep 5, 2023

There is another interesting baseline where we could in principle make all the hyperparameters #defines at compile time, basically fixed and specific to a single model. This might make things even faster.

@karpathy
Owner Author

karpathy commented Sep 5, 2023

The idea would be to make ./run binary be very specific to a single model architecture, and if you want to run a different model you'd recompile for that model specifically. It's not totally out of the question 🤔

@karpathy
Owner Author

karpathy commented Sep 5, 2023

Another reason I like that is that you could erase the need for malloc. Everything would be statically pre-allocated and known at compile time. Kind of appealing! :)
Sorry for comment spam

@rdentato
Contributor

rdentato commented Sep 7, 2023

To push this forward, why don't we create a "run-generator" that takes a model and produces an optimized run.c specific to that model? One could ask for the type of quantization, whether it is for CPU, Cuda, Metal, WASM, ... and get a run_xxxx.c for that model.
It may seem strange now, when we are all playing with LLMs and interested in running different models, but consider a larger system that has a fine-tuned LLM as a component: there, I would only be interested in having that one model run as fast as possible.

@Majdoddin
Contributor

The run-generator would just need to generate a header file, say model.h, and add #include "model.h" to run.c.
This header file would contain all the #defines for the hyperparameters and model shapes.

@rdentato
Contributor

rdentato commented Sep 7, 2023

I've just done that but the benefits are minimal (at least on my machine):

// Model: ../models/stories110M.bin
#define CONFIG_dim 768
#define CONFIG_hidden_dim 2048
#define CONFIG_n_layers 12
#define CONFIG_n_heads 12
#define CONFIG_n_kv_heads 12
#define CONFIG_vocab_size 32000
#define CONFIG_seq_len 1024
#define CONFIG_shared_weights 1

All the references to the model parameters are now constants in the code.

I got some benefits on my machine (AMD Ryzen 5, 16GB RAM) but didn't take any accurate measurements.

I tried eliminating the calloc for RunState, but performance decreased.
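What I mean is roughly along these lines (a sketch; the buffer names follow RunState, and CONFIG_kv_dim is a convenience macro I'm adding here):

#include "model.h"

#define CONFIG_kv_dim ((CONFIG_dim * CONFIG_n_kv_heads) / CONFIG_n_heads)

// RunState buffers become static arrays instead of a calloc'd struct
static float x[CONFIG_dim];
static float xb[CONFIG_dim];
static float xb2[CONFIG_dim];
static float hb[CONFIG_hidden_dim];
static float hb2[CONFIG_hidden_dim];
static float q[CONFIG_dim];
static float att[CONFIG_n_heads * CONFIG_seq_len];
static float logits[CONFIG_vocab_size];
static float key_cache[CONFIG_n_layers * CONFIG_seq_len * CONFIG_kv_dim];
static float value_cache[CONFIG_n_layers * CONFIG_seq_len * CONFIG_kv_dim];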

@karpathy marked this pull request as ready for review on October 9, 2023 at 15:32
@karpathy
Owner Author

karpathy commented Oct 9, 2023

I'm not sure what happened, but I'm seeing a 3X speedup on Llama 2 7B, which is surprisingly nice, up from the ~30% before. Maybe it's a model-size dependence.

I am threatening to merge this PR for real now.

@karpathy merged commit d986206 into master on Oct 9, 2023
6 checks passed
@karpathy
Owner Author

karpathy commented Oct 9, 2023

Alright let's go.

vinhtran2611 pushed a commit to vinhtran2611/llama2.c that referenced this pull request Jan 20, 2024