
sync : ggml (backend v2) #3912

Merged: 19 commits into master from the sync branch, Nov 13, 2023

Conversation

ggerganov (Owner) commented Nov 2, 2023:

This is a first step towards bringing the new ggml backend interface to llama.cpp. There should be no functional change at this point; we are merely transitioning to the new API in the places where it is necessary. This PR will likely remain open until we confirm that everything works correctly, so help with testing will be much appreciated.
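
For readers who have not followed the ggml backend work, the gist of the API change is that compute graphs are now allocated dynamically from a context, with an explicit node capacity and an explicit choice of whether gradient slots are reserved. Below is a minimal, hedged sketch of the pattern; the buffer size, the 2048 node capacity, and the toy tensors are illustrative only, and the signatures follow the calls that appear in the diffs further down this thread:

#include "ggml.h"

// Minimal sketch: build and evaluate a tiny forward graph with the dynamic graph API.
int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,   // illustrative buffer size
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_set_f32(ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4), 1.0f);
    struct ggml_tensor * b = ggml_set_f32(ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4), 2.0f);
    struct ggml_tensor * c = ggml_add(ctx, a, b);

    // graphs are now allocated from the context with an explicit node capacity (2048 here
    // is arbitrary) and an explicit choice of whether gradient slots are reserved
    struct ggml_cgraph * gf = ggml_new_graph_custom(ctx, 2048, /*grads=*/false);
    ggml_build_forward_expand(gf, c);

    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    ggml_free(ctx);
    return 0;
}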

The main parts of the code where we expect issues are the training examples:

  • finetune
  • train-from-scratch
  • baby-llama

I'll put a notice in the readme to direct people here. In general, if you care about some specific functionality in llama.cpp, please check out this branch, make sure that it works as expected, and post a comment below. This will help ensure that it does not break when this PR is merged.

For more detailed information about this change:

Comment on lines 1772 to 1776:

      gf = ggml_new_graph_custom(ctx_compute, LLAMA_TRAIN_MAX_NODES, true);
      gf->order = (enum ggml_cgraph_eval_order) order;
-     gb = ggml_new_graph(ctx_compute);
+     gb = ggml_new_graph_custom(ctx_compute, LLAMA_TRAIN_MAX_NODES, false);
      gb_tmp = params.common.use_checkpointing
-         ? ggml_new_graph(ctx_compute)
+         ? ggml_new_graph_custom(ctx_compute, LLAMA_TRAIN_MAX_NODES, false)

ggerganov (Owner, Author):

@slaren Does this look OK?

slaren (Collaborator):

I think so; I don't know if gb here needs grads or not.

xaedes (Collaborator) commented Nov 6, 2023:

gb needs grads, because gb also contains the gf nodes, which have grads.

Changing the bool grads argument from false to true resolves an assert triggered in ggml_graph_cpy in ggml.c:

GGML_ASSERT(dst->grads != NULL);

With this change finetune runs; I will report back on whether the results are good as well.
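
To make the constraint concrete, here is a hedged sketch of the allocation pattern being settled on in this thread. The function name build_train_graphs, the loss tensor, and max_nodes are placeholders rather than the actual finetune code; the ggml calls are the ones referenced in the diffs above.

#include "ggml.h"

// Hedged sketch: allocate the forward graph gf and the backward graph gb, both with
// gradient slots. ggml_build_backward_expand populates gb starting from the gf nodes;
// since those nodes carry grads, a grads-less gb trips GGML_ASSERT(dst->grads != NULL)
// inside ggml_graph_cpy.
static void build_train_graphs(struct ggml_context * ctx_compute, struct ggml_tensor * loss,
                               size_t max_nodes, struct ggml_cgraph ** out_gf, struct ggml_cgraph ** out_gb) {
    struct ggml_cgraph * gf = ggml_new_graph_custom(ctx_compute, max_nodes, /*grads=*/true);
    ggml_build_forward_expand(gf, loss);

    struct ggml_cgraph * gb = ggml_new_graph_custom(ctx_compute, max_nodes, /*grads=*/true);
    ggml_build_backward_expand(ctx_compute, gf, gb, /*keep=*/true);

    *out_gf = gf;
    *out_gb = gb;
}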

slaren (Collaborator) commented Nov 6, 2023:

Should ggml_graph_cpy be changed to allow skipping the grads if the src has them but not the dst?
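
Purely as an illustration of the alternative being floated here, a hypothetical relaxed copy might look like the following. This is not the actual ggml_graph_cpy implementation (which handles more than nodes and grads); it only shows the shape of the proposed check.

#include "ggml.h"

// Hypothetical: copy grads only when both graphs have slots for them,
// otherwise skip the grads instead of asserting.
static void graph_cpy_skip_missing_grads(struct ggml_cgraph * src, struct ggml_cgraph * dst) {
    for (int i = 0; i < src->n_nodes; ++i) {
        dst->nodes[i] = src->nodes[i];
        if (src->grads != NULL && dst->grads != NULL) {
            dst->grads[i] = src->grads[i];
        }
    }
    dst->n_nodes = src->n_nodes;
}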

Collaborator:

There was an additional regression in finetune and train-text-from-scratch, unrelated to this PR, due to the new yarn rope implementation.

With the bool grads argument changed to true and #3974 applied to fix the backward process of rope, the output of finetune is correct.

CoruNethron commented Nov 7, 2023:

Just to note, lines 1805 and 1807 below need that change as well; I missed them on my first attempt to copy this fix.

Also, the mentioned regression and the #3974 fix seem to be critical, because otherwise finetune produces LoRAs without any progress from one checkpoint to another.

ggerganov added the help wanted and refactoring labels Nov 2, 2023
ggerganov marked this pull request as ready for review November 2, 2023 18:41
ggerganov added the need feedback label Nov 2, 2023
ggerganov added a commit that referenced this pull request Nov 2, 2023

ggerganov (Owner, Author):

Pinging @xaedes in case you get the chance to take a look and see if the training examples work as expected.

CoruNethron:

@ggerganov, is finetune on this branch expected to produce a file with the same hash as on the master branch, given all the same RNG states at the beginning? Or should testing be more manual, like querying the resulting LoRAs? I'll run a few short tests on a 3B model to check today.

ggerganov (Owner, Author):

@CoruNethron The results should be identical for the same RNG state.

KerfuffleV2 (Collaborator):

Hmm, this doesn't seem to calculate the correct context size for a 70B model I tried (dolphin-2.1-70b.q4_k_s.gguf):

ggml_new_object: not enough space in the context's memory pool (needed 852464, available 852112)

Other models I tried seemed to work (Mistral, Orca 3B, CausalLM 14B). It doesn't seem related to GPU support; I tried compiling for CPU only and it made no difference.

ggerganov (Owner, Author):

@KerfuffleV2 And this does not fail on master, correct?

KerfuffleV2 (Collaborator), replying to @ggerganov:

"And this does not fail on master, correct?"

Yes, that's correct. I also just tried with a Q2_K 70B and it failed the same way; the numbers also didn't change. So quantization and GPU vs. no GPU don't seem to matter.
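
For background on this class of error: the ggml context that holds graph and tensor metadata is sized up front, and with the new dynamically sized graphs that size depends on the requested node capacity, so underestimating it surfaces as exactly this "not enough space in the context's memory pool" message. A hedged sketch of the usual sizing arithmetic follows; it assumes a ggml_graph_overhead_custom helper is available alongside ggml_new_graph_custom, and max_nodes and the function name are illustrative (llama.cpp uses its own constants and accounting, and the real fix landed as the "fix context size calculations" / "increase inference graph size" commits listed further down).

#include "ggml.h"

// Hedged sketch: size a metadata-only context so the graph and its tensors fit.
static struct ggml_context * new_meta_ctx(size_t max_nodes) {
    const size_t buf_size =
        ggml_tensor_overhead()*max_nodes +                       // per-tensor metadata
        ggml_graph_overhead_custom(max_nodes, /*grads=*/false);  // the graph structure itself

    struct ggml_init_params params = {
        /*.mem_size   =*/ buf_size,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,   // metadata only; tensor data is placed by the allocator
    };
    return ggml_init(params);
}

With no_alloc set to true the context stores only metadata, so the overhead terms alone determine mem_size; larger models reach higher node counts, which is how a 70B model can overflow a pool that smaller models fit into.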

xaedes (Collaborator) commented Nov 3, 2023:

I will look into the training examples.

ggerganov (Owner, Author):

@KerfuffleV2 Should be fixed now

@xaedes Thanks

KerfuffleV2 (Collaborator), replying to @ggerganov:

"Should be fixed now"

Thanks. I can confirm it seems good now.

CoruNethron commented Nov 7, 2023:

Sorry about the delay. Trying to compare master (381efbf) vs. sync (081a86d), I hit the assertion at ggml.c:16209 (dst->grads != NULL) when running finetune on the sync branch.
Oh, I see @xaedes has already resolved this.

@@ -1769,7 +1769,7 @@ int main(int argc, char ** argv) {
      alloc = ggml_allocr_new_measure(tensor_alignment);
      gf = ggml_new_graph_custom(ctx_compute, LLAMA_TRAIN_MAX_NODES, true);
      gf->order = (enum ggml_cgraph_eval_order) order;
-     gb = ggml_new_graph_custom(ctx_compute, LLAMA_TRAIN_MAX_NODES, false);
+     gb = ggml_new_graph_custom(ctx_compute, LLAMA_TRAIN_MAX_NODES, true);
      gb_tmp = params.common.use_checkpointing
          ? ggml_new_graph_custom(ctx_compute, LLAMA_TRAIN_MAX_NODES, false)

Collaborator:

I think gb_tmp also needs the grads=true argument.

ggerganov (Owner, Author):

This merge mostly applies 875fb42

ggerganov (Owner, Author):

We should probably merge this soon. Has anybody found any issues with the latest version of this branch?

KerfuffleV2 (Collaborator) commented Nov 13, 2023:

I did some testing with ROCm, mainly just loading the models and running a few blocks of perplexity. I didn't notice a difference in performance, and the models I tested returned identical perplexity results compared to master.

Tested:

  1. Orca 3B
  2. Mistral 7B
  3. CausalLM 14B
  4. Yi 34B
  5. LLaMA2 70B

I couldn't test Persimmon due to missing CUDA ops (#4041): it's not possible to offload layers or even run perplexity when compiled with CUDA/ROCm. I was able to load the model and do a little text generation though, and it seemed fine.

Mistral and LLaMA2 70B were the ones that had issues with this pull previously. Currently everything seems to work as well as master though.

(Not that my approval really means anything. Also, I didn't test anything esoteric like training models.)

ggerganov merged commit 4760e7c into master Nov 13, 2023 (39 checks passed)
ggerganov deleted the sync branch November 13, 2023 12:38

olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023:

* sync : ggml (backend v2) (wip)

* sync : migrate examples and llama.cpp to dynamic graphs (wip)

* sync : update tests + fix max op params to 64

ggml-ci

* sync : ggml-cuda

ggml-ci

* llama : fix save/load state context size

ggml-ci

* sync : try to fix build on tvOS

* sync : pass custom graph sizes in training examples

* sync : update graph copies to new ggml API

* sync : update sync-ggml.sh with new files

* scripts : fix header in sync script

* train : fix context size calculations

* llama : increase inference graph size up to 4096 nodes

* train : allocate grads for backward graphs

* train : allocate grads for gb_tmp
cebtenzzre (Collaborator) commented Nov 27, 2023:

With this change, you no longer get a coredump when you hit a GGML_ASSERT, and you can't even catch the assertion with gdb without e.g. catch syscall exit_group:

GGML_ASSERT: /home/jared/src/forks/llama.cpp/ggml-cuda.cu:5644: false
[Detaching after fork from child process 509162]
stack module disabled
warning: process 509151 is already traced by process 509126
ptrace: Operation not permitted.
No stack.
The program is not being run.
[Thread 0x7fffbcfde000 (LWP 509160) exited]
[Thread 0x7fffc9162000 (LWP 509159) exited]
[Thread 0x7fffcbbff000 (LWP 509155) exited]
[Thread 0x7ffff7f36000 (LWP 509151) exited]
[Thread 0x7fffb4fde000 (LWP 509161) exited]
[New process 509151]
[Inferior 1 (process 509151) exited with code 01]
>>>

Could we please change the exit(1) back to an abort()?

slaren (Collaborator) commented Nov 28, 2023:

Yes, I changed it to exit(1) because abort ends without flushing the buffers, and sometimes you don't get all the output after a crash. But the fflush should already address that.
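
For illustration, a hedged sketch of an assert macro in the spirit of what is being asked for here, flushing before aborting rather than calling exit(1); this is not necessarily the exact definition ggml ended up with.

#include <stdio.h>
#include <stdlib.h>

// Flush stdout/stderr so no output is lost, then abort() so debuggers and core dumps
// still see the failure as SIGABRT rather than a normal process exit.
#define GGML_ASSERT(x)                                                              \
    do {                                                                            \
        if (!(x)) {                                                                 \
            fflush(stdout);                                                         \
            fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x);    \
            fflush(stderr);                                                         \
            abort();                                                                \
        }                                                                           \
    } while (0)

With abort() restored, gdb stops on SIGABRT by default, so no catch syscall exit_group workaround is needed, while the explicit flushes keep the crash output intact.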
