int8 quantization attempt #2 #364
Conversation
I'd suggest making more use of the QuantizedTensor struct/abstraction since it's already there; make its internals opaque inside forward() and change the function signatures to take it as an argument:

matmul(float *xout, QuantizedTensor *x, QuantizedTensor *w, int n, int d)
dequantize(QuantizedTensor *qx, float *x)
quantize(QuantizedTensor *qx, float *x)

This will help with 1) code readability/simplicity: when I'm reading the forward() function, I'm thinking only about vectors, not about how they're quantized, and 2) it will make it easier to experiment with or add other quantization techniques, since we wouldn't have to touch the transformer code at all. These two lines could be shorter, IMHO:

matmul(s->hb, s->xq.q, s->xq.s, w->w1.q + l*dim*hidden_dim, w->w1.s + l*dim*hidden_dim/GS, dim, hidden_dim);
matmul(s->hb2, s->xq.q, s->xq.s, w->w3.q + l*dim*hidden_dim, w->w3.s + l*dim*hidden_dim/GS, dim, hidden_dim);

I can draft a PR if you want to see how it would all look.
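For illustration, a minimal sketch of the shape I have in mind (the field names q/s and the group size GS mirror the snippets above; the GS value and the naive inner loop are placeholders, not a final implementation):

```c
// Sketch only: one possible shape for the QuantizedTensor abstraction.
// GS value and loop structure are illustrative placeholders.
#include <stdint.h>

#define GS 64  // quantization group size (assumed value)

typedef struct {
    int8_t *q;   // quantized values
    float  *s;   // one scale factor per group of GS values
} QuantizedTensor;

// W (d,n) @ x (n,) -> xout (d,): accumulate int8 products per group,
// then rescale with the two per-group scale factors.
void matmul(float *xout, QuantizedTensor *x, QuantizedTensor *w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int g = 0; g < n / GS; g++) {
            int32_t acc = 0;
            for (int k = 0; k < GS; k++) {
                acc += (int32_t) x->q[g * GS + k] * (int32_t) w->q[i * n + g * GS + k];
            }
            val += (float) acc * x->s[g] * w->s[i * n / GS + g];
        }
        xout[i] = val;
    }
}
```

The calls in forward() would then shrink to something like matmul(s->hb, &s->xq, w->w1 + l, dim, hidden_dim).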
Quite agree, good idea.
First draft refactor is here:
EDIT:
Actually, on a quick skim, other than the slight weirdness with wcls and wemb this looks very nice! I'll stare at this more tomorrow.
Sounds good! Please review this branch instead: https://github.com/atamurad/llama2.c/tree/int8_refactor
Maybe if we find the proper abstraction on tensors, we might push all the differences there and have a clean
I did some tests with the llama2-7b-chat model. Quantization, v2 export, and inference with runq all worked. On my laptop I'm getting ~0.22 tokens/second; for comparison, llama.cpp q8_0 does ~0.25 tokens/sec. This int8 implementation vectorizes well. I tried some AVX2 intrinsics for matmul and got basically the same result/speed as plain C when testing with the 7b-chat model.
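For a sense of what the intrinsics version looks like, here's a simplified sketch of one way to write the int8 dot product with AVX2 (not the exact code I benchmarked; it assumes the group length is a multiple of 16 and needs -mavx2):

```c
// Sketch only: int8 dot product for one quantization group using AVX2.
// Assumes n is a multiple of 16; compile with -mavx2.
#include <immintrin.h>
#include <stdint.h>

static inline int32_t dot_i8_avx2(const int8_t *a, const int8_t *b, int n) {
    __m256i acc = _mm256_setzero_si256();
    for (int k = 0; k < n; k += 16) {
        // load 16 int8 values from each side and widen to int16
        __m256i va = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(a + k)));
        __m256i vb = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(b + k)));
        // multiply int16 pairs and add adjacent products into int32 lanes
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    // horizontal sum of the 8 int32 lanes
    __m128i lo = _mm256_castsi256_si128(acc);
    __m128i hi = _mm256_extracti128_si256(acc, 1);
    __m128i sum = _mm_add_epi32(lo, hi);
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    return _mm_cvtsi128_si32(sum);
}
```

The compiler's auto-vectorization of the plain C loop ends up doing essentially the same thing, which is probably why the speeds match.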
@atamurad sorry I'm slow btw, I'm traveling right now for the next ~2 weeks. I will probably have chunks of time available to dig into your PR but I'm not sure when, as my schedule is chaotic. I do think that from a quick skim your approach is better than my first draft, and it's much simpler to lay out the scale factors right next to the quantized values.
Sounds good.
int8 refactor
On the 110M model with
Specifying GS at compile time (
ugh that's annoying. i tried it but i only go up to 40 tok/s. This isn't exactly a thing we can merge though.
There is another interesting baseline where we could in principle let all the hyperparameters be #DEFINEs at compile time, basically fixed and specific to a single model. This might make things even faster.
The idea would be to make the ./run binary very specific to a single model architecture, and if you want to run a different model you'd recompile for that model specifically. It's not totally out of the question 🤔
Another reason I like that is that you could erase the need for malloc. Everything would be statically pre-allocated and known at compile time. Kind of appealing! :)
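As a rough sketch of what I mean (all the numbers below are hypothetical placeholders, not any particular model's config):

```c
// Sketch only: hyperparameters baked in at compile time for one specific model.
// Values are placeholders; a real build would use the checkpoint's config.
#define DIM        768
#define HIDDEN_DIM 2048
#define N_LAYERS   12
#define N_HEADS    12
#define SEQ_LEN    1024

// With every size known at compile time, the run state can be statically
// allocated instead of malloc'd:
static float x[DIM];
static float hb[HIDDEN_DIM];
static float key_cache[N_LAYERS * SEQ_LEN * DIM];
static float value_cache[N_LAYERS * SEQ_LEN * DIM];
```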
To push this forward, why don't we create a "run-generator" that takes a model and produces an optimized
run-generator just needs to generate a header file, say,
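Hypothetically, the generated header could look something like this (the file name, macro names, and values are placeholders; the generator would read the real ones from the checkpoint):

```c
/* model_config.h -- hypothetical output of the run-generator.
 * Name and values are placeholders for illustration only. */
#ifndef MODEL_CONFIG_H
#define MODEL_CONFIG_H

#define MODEL_DIM         4096
#define MODEL_HIDDEN_DIM  11008
#define MODEL_N_LAYERS    32
#define MODEL_N_HEADS     32
#define MODEL_N_KV_HEADS  32
#define MODEL_VOCAB_SIZE  32000
#define MODEL_SEQ_LEN     2048
#define MODEL_GS          64   /* quantization group size */

#endif /* MODEL_CONFIG_H */
```

The run binary would then just #include this header and be recompiled per model.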
I've just done that but the benefits are minimal (at least on my machine):
All the references to the model parameters are now constants in the code. I got some benefits on my machine (AMD Ryzen 5, 16GB RAM) but didn't take any accurate measurements. I tried eliminating the
I'm not sure what happened but I'm seeing a 3X speedup on Llama 2 7B, which is surprisingly nice, up from 30% before. Maybe it's a model size dependence. I am threatening to merge this PR for reals now.
Alright let's go.
int8 quantization attempt karpathy#2
Ok attempt number two
There could be bugs; I hastily ported this from my previous int8 version (attempt 1) into the new code. TODO: I think I want to gain more confidence that this is good before merging. The Makefile needs some thought.