
int8 quantization attempt #2 #364

Merged: 7 commits into master on Oct 9, 2023

Conversation

karpathy
Owner

Ok, attempt number two:

  • separate, new runq.c file
  • the Makefile currently compiles both; I'm not sure I'm super happy with that yet
  • only did very quick testing and the results look good: they are similar to fp32 but diverge after a few dozen tokens. I'm not sure that divergence is avoidable in principle.

There could be bugs; I hastily ported this from my previous int8 version (attempt 1) into the new code. TODO: I want to gain more confidence that this is correct before merging. The Makefile needs some thought.

@karpathy changed the title from "draft of int8 attempt number two" to "int8 quantization attempt #2" on Aug 26, 2023
@atamurad
Contributor

I'd suggest making more use of the QuantizedTensor struct/abstraction since it's already there: make its internals opaque inside forward() and change the function signatures to take it as an argument:

matmul(float *xout, QuantizedTensor *x, QuantizedTensor *w, int n, int d)
dequantize(QuantizedTensor *qx, float *x)
quantize(QuantizedTensor *qx, float *x)

This will help with 1) code readability/simplicity: when I'm reading the forward() function I'm already (and only) thinking about vectors, not about how they are quantized, and 2) making it easier to experiment with and add other quantization techniques, since we wouldn't have to touch the transformer code at all.

These two lines could be shorter, IMHO:

matmul(s->hb, s->xq.q, s->xq.s, w->w1.q + l*dim*hidden_dim, w->w1.s + l*dim*hidden_dim/GS, dim, hidden_dim);
matmul(s->hb2, s->xq.q, s->xq.s, w->w3.q + l*dim*hidden_dim, w->w3.s + l*dim*hidden_dim/GS, dim, hidden_dim);
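For illustration, a minimal sketch of roughly what I have in mind (the struct layout and the per-layer indexing here are assumptions, not final code):

typedef struct {
    int8_t* q;   // quantized values
    float* s;    // scale factors, one per group of GS values
} QuantizedTensor;

// if w->w1 / w->w3 become arrays with one QuantizedTensor per layer,
// the two calls above collapse to:
matmul(s->hb,  &s->xq, w->w1 + l, dim, hidden_dim);
matmul(s->hb2, &s->xq, w->w3 + l, dim, hidden_dim);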

I can draft a PR if you want to see how it would all look.

@karpathy
Owner Author

Quite agree, good idea.

@karpathy mentioned this pull request on Aug 27, 2023
@atamurad
Contributor

atamurad commented Aug 27, 2023

First draft refactor is here:
atamurad@f850a97

  • matmul(), quantize() and dequantize() all take QuantizedTensor
  • Cleaned up memory_map_weights()
  • Code compiles and runs the same as before the refactor

EDIT:
In this abstraction, a QuantizedTensor is atomic: we do not index into it or slice from it; we can only matmul with it or dequantize all of its elements. This model breaks for the token embedding (which gets sliced per token) and for the shared classifier weights (which get matmul-ed). So I split the token embedding into separate rows and exported wcls as its own QuantizedTensor as well (as if it were not shared). I need to think about what to do with this.
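One option would be to dequantize the whole embedding table back to float once at load time, so forward() only ever slices a plain float array and wcls can stay a single QuantizedTensor. A rough sketch (field names like q_tokens and token_embedding_table are placeholders, not necessarily what the code will end up with):

// at load time: one whole-tensor dequantize of the embedding table
w->token_embedding_table = malloc(p->vocab_size * p->dim * sizeof(float));
dequantize(w->q_tokens, w->token_embedding_table);
// if the classifier is shared, it just reuses the same QuantizedTensor
if (shared_weights) { w->wcls = w->q_tokens; }

// in forward(): a plain float slice, no indexing into a QuantizedTensor
float* content_row = w->token_embedding_table + token * dim;
memcpy(x, content_row, dim * sizeof(float));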

@karpathy
Owner Author

Actually, on a quick skim other than the slight weirdness with wcls and wemb this looks very nice! I'll stare at this more tomorrow.

@atamurad
Contributor

Actually, on a quick skim other than the slight weirdness with wcls and wemb this looks very nice! I'll stare at this more tomorrow.

Sounds good! Please review this branch instead: https://github.com/atamurad/llama2.c/tree/int8_refactor
I added another commit to properly handle token embedding / wcls shared weights. Hopefully this one looks less weird.

@rdentato
Contributor

Maybe if we find the proper abstraction for tensors, we can push all the differences there and have a clean run.c that operates on tensors (which would make how it works extremely clear), plus some messier implementations of those operations, one for each version (float32, int8, int4-cuda, ...).
Probably something to check after the initial merge? At that point we will already have two "official" versions (float32 and int8) that should coexist, and if that can be made "transparent" (from the inference point of view), it will set the "standard" for the other versions.

@atamurad
Contributor

I did some tests with the llama2-7b-chat model. Quantization, v2 export, and inference with runq all worked.

On my laptop I'm getting about 0.22 tokens/second. For comparison, llama.cpp q8_0 does about 0.25 tokens/sec.
(4 threads, AVX2, no AVX-512)

This int8 implementation vectorizes well. I tried some AVX2 intrinsics for matmul and got basically the same result/speed as plain C when testing with the 7b-chat model.
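For reference, the group-wise inner loop we're talking about looks roughly like this (a sketch assuming the QuantizedTensor layout above and a compile-time-visible group size GS, not the PR's exact code): integer multiply-accumulates within each group, then a single float multiply per group by the two scale factors, a pattern compilers auto-vectorize well.

#include <stdint.h>

void matmul(float* xout, QuantizedTensor* x, QuantizedTensor* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,); n is assumed to be a multiple of GS
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        int in = i * n;
        for (int j = 0; j < n; j += GS) {
            int32_t ival = 0;
            for (int k = 0; k < GS; k++) {
                ival += ((int32_t) x->q[j + k]) * ((int32_t) w->q[in + j + k]);
            }
            val += ((float) ival) * w->s[(in + j) / GS] * x->s[j / GS];
        }
        xout[i] = val;
    }
}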

@karpathy
Owner Author

karpathy commented Sep 5, 2023

@atamurad sorry I'm slow btw, I'm traveling for the next ~2 weeks. I will probably have chunks of time available to dig into your PR, but I'm not sure when, as my schedule is chaotic. From a quick skim I do think your approach is better than my first draft, and it's much simpler to lay out the scale factors right next to the quantized values.

@atamurad
Contributor

atamurad commented Sep 5, 2023

@karpathy safe travels! It might be too early to merge to master yet, but I'll submit my refactor as a PR against your feature/int8_try2 branch for now.

I don't have input on the Makefile, or on whether, as @rdentato suggested, the two versions can live in a single run.c file.

@karpathy
Owner Author

karpathy commented Sep 5, 2023

sounds good

@atamurad mentioned this pull request on Sep 5, 2023
@karpathy
Owner Author

karpathy commented Sep 5, 2023

On the 110M model with make runfast I'm currently seeing 29 tok/s on legacy and 38 tok/s on int8 quantized.
(i.e. a ~30% speedup, less than I'd originally hoped for)

@atamurad
Contributor

atamurad commented Sep 5, 2023

Specifying GS at compile time (const int GS = 64;) gets me another ~25% speedup on the 110M model.
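In other words, roughly this change (a sketch; I'm assuming GS is currently a global filled in while reading the model file):

// before: group size only known at runtime
//     int GS;    /* filled in while reading the model file */
// after: a compile-time constant, so the GS-wide inner loop in matmul has a
// fixed trip count the compiler can fully unroll and vectorize
const int GS = 64;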

@karpathy
Owner Author

karpathy commented Sep 5, 2023

Ugh, that's annoying. I tried it but I only go up to 40 tok/s. This isn't exactly a thing we can merge, though.

@karpathy
Owner Author

karpathy commented Sep 5, 2023

There is another interesting baseline where we could in principle make all the hyperparameters #defines at compile time, basically fixed and specific to a single model. This might make things even faster.

@karpathy
Owner Author

karpathy commented Sep 5, 2023

The idea would be to make ./run binary be very specific to a single model architecture, and if you want to run a different model you'd recompile for that model specifically. It's not totally out of the question 🤔

@karpathy
Owner Author

karpathy commented Sep 5, 2023

Another reason I like that is that you could erase the need for malloc. Everything would be statically pre-allocated and known at compile time. Kind of appealing! :)
Sorry for comment spam

@rdentato
Contributor

rdentato commented Sep 7, 2023

To push this forward, why don't we create a "run-generator" that takes a model and produces an optimized run.c specific to that model? One could ask for the type of quantization, whether it is for CPU, Cuda, Metal, WASM, ... and get a run_xxxx.c for that model.
It may seem strange now, when we are all playing with LLMs and interested in running different models, but consider a larger system that has a fine-tuned LLM as a component: there, I would only be interested in having that one model run as fast as possible.

@Majdoddin
Contributor

The run-generator would just need to generate a header file, say model.h, and add #include "model.h" to run.c.
This header file would contain all the #defines for the hyperparameters and model shapes.

@rdentato
Contributor

rdentato commented Sep 7, 2023

I've just done that but the benefits are minimal (at least on my machine):

// Model: ../models/stories110M.bin
#define CONFIG_dim 768
#define CONFIG_hidden_dim 2048
#define CONFIG_n_layers 12
#define CONFIG_n_heads 12
#define CONFIG_n_kv_heads 12
#define CONFIG_vocab_size 32000
#define CONFIG_seq_len 1024
#define CONFIG_shared_weights 1

All the references to the model parameters are now constants in the code.

I got some benefits on my machine (AMD Ryzen 5, 16GB RAM) but didn't take any accurate measurements.

I tried eliminating the calloc for RunState, but performance decreased.
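What I mean is roughly along these lines (a sketch; the buffer names follow RunState, and CONFIG_kv_dim is a convenience macro I'm adding here):

#include "model.h"

#define CONFIG_kv_dim ((CONFIG_dim * CONFIG_n_kv_heads) / CONFIG_n_heads)

// RunState buffers become static arrays instead of a calloc'd struct
static float x[CONFIG_dim];
static float xb[CONFIG_dim];
static float xb2[CONFIG_dim];
static float hb[CONFIG_hidden_dim];
static float hb2[CONFIG_hidden_dim];
static float q[CONFIG_dim];
static float att[CONFIG_n_heads * CONFIG_seq_len];
static float logits[CONFIG_vocab_size];
static float key_cache[CONFIG_n_layers * CONFIG_seq_len * CONFIG_kv_dim];
static float value_cache[CONFIG_n_layers * CONFIG_seq_len * CONFIG_kv_dim];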

@karpathy marked this pull request as ready for review on October 9, 2023 at 15:32
@karpathy
Owner Author

karpathy commented Oct 9, 2023

I'm not sure what happened, but I'm seeing a 3X speedup on Llama 2 7B, which is surprisingly nice, up from the ~30% before. Maybe it's a model-size dependence.

I am threatening to merge this PR for real now.

@karpathy merged commit d986206 into master on Oct 9, 2023
6 checks passed
@karpathy
Owner Author

karpathy commented Oct 9, 2023

Alright let's go.

vinhtran2611 pushed a commit to vinhtran2611/llama2.c that referenced this pull request Jan 20, 2024