CUDA error: invalid device function when compiling and running for amd gfx 1032 #4762

Closed
nasawyer7 opened this issue Jan 3, 2024 · 4 comments

Comments

@nasawyer7

Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.
I have an AMD 6700S GPU with 8 GB of VRAM. I got oobabooga (text-generation-webui) working on this machine, but I can't get llama.cpp to work. I compiled with
make clean && make -j16 LLAMA_HIPBLAS=1 AMDGPU_TARGETS=gfx1032
And everything went fine. However, when I try to run, I first do export HSA_OVERRIDE_GFX_VERSION=10.3.0
and then HIP_VISIBLE_DEVICES=0 ./main -ngl 50 -m /home/lenovoubuntu/Downloads/text-generation-webui-main/models/dolphin-2.6-mistral-7b-dpo.Q4_K_M.gguf -p "Write a function in TypeScript that sums numbers".
(I set HIP_VISIBLE_DEVICES because the machine also has an iGPU.)
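
To be explicit, the full sequence I'm running is roughly the following (same flags, paths, and layer count as above):

# build with ROCm/HIP support, targeting the RX 6700S (gfx1032)
make clean && make -j16 LLAMA_HIPBLAS=1 AMDGPU_TARGETS=gfx1032
# make the HSA runtime report the GPU as gfx1030 (10.3.0), which ROCm's prebuilt libraries support
export HSA_OVERRIDE_GFX_VERSION=10.3.0
# restrict HIP to the dedicated GPU (device 0) and offload 50 layers
HIP_VISIBLE_DEVICES=0 ./main -ngl 50 -m /home/lenovoubuntu/Downloads/text-generation-webui-main/models/dolphin-2.6-mistral-7b-dpo.Q4_K_M.gguf -p "Write a function in TypeScript that sums numbers"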

It returns .................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 76.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MiB
llama_new_context_with_model: total VRAM used: 4232.06 MiB (model: 4095.06 MiB, context: 137.00 MiB)
CUDA error: invalid device function
current device: 0, in function ggml_cuda_op_flatten at ggml-cuda.cu:7971
hipGetLastError()
GGML_ASSERT: ggml-cuda.cu:226: !"CUDA error"
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
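
For reference, the ptrace restriction mentioned in that message can also be relaxed without running everything as root (the setting resets on reboot):

cat /proc/sys/kernel/yama/ptrace_scope      # show the current Yama setting (1 = restricted)
sudo sysctl -w kernel.yama.ptrace_scope=0   # temporarily allow attaching so the built-in backtrace can run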

So I ran it as sudo, as the error output suggested, using this command: sudo LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH HSA_OVERRIDE_GFX_VERSION=10.3.0 HIP_VISIBLE_DEVICES=0 ./main -ngl 50 -m /home/lenovoubuntu/Downloads/text-generation-webui-main/models/dolphin-2.6-mistral-7b-dpo.Q4_K_M.gguf -p "Write a function in TypeScript that sums numbers"
I used all of those environment variables since oobabooga required them, and I was hoping they would fix things here too.

However, that just returns the following after the model appears to load.

CUDA error: invalid device function
current device: 0, in function ggml_cuda_op_flatten at ggml-cuda.cu:7971
hipGetLastError()
GGML_ASSERT: ggml-cuda.cu:226: !"CUDA error"
[New LWP 23593]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f34398ea42f in __GI___wait4 (pid=23599, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007f34398ea42f in __GI___wait4 (pid=23599, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x000055fb56cca7fb in ggml_print_backtrace ()
#2 0x000055fb56d90f95 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
#3 0x000055fb56d9da1e in ggml_cuda_op_flatten(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, float const*, float const*, float*, ihipStream_t*)) ()
#4 0x000055fb56d92df3 in ggml_cuda_compute_forward ()
#5 0x000055fb56cf8898 in ggml_graph_compute_thread ()
#6 0x000055fb56cfca98 in ggml_graph_compute ()
#7 0x000055fb56dbc41e in ggml_backend_cpu_graph_compute ()
#8 0x000055fb56dbcf0b in ggml_backend_graph_compute ()
#9 0x000055fb56d2b046 in llama_decode_internal(llama_context&, llama_batch) ()
#10 0x000055fb56d2bb63 in llama_decode ()
#11 0x000055fb56d66316 in llama_init_from_gpt_params(gpt_params&) ()
#12 0x000055fb56cbc31a in main ()
[Inferior 1 (process 23582) detached]
Aborted

@dariox1337

dariox1337 commented Jan 4, 2024

I get a similar error (CUDA error: invalid device function, current device: 0, in function ggml_cuda_op_flatten at ggml-cuda.cu:7971) on an AMD 780M (iGPU) while trying to run any model.
llama.cpp was compiled with LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1100 and run with HSA_OVERRIDE_GFX_VERSION=gfx1100.
ROCm version 5.7.1.

@TheAceBlock

I also had a similar error when running on my gfx90c device (which needs to be overridden to gfx900).

What solved the problem for me was also setting the environment variable HSA_OVERRIDE_GFX_VERSION when running make (together with AMDGPU_TARGETS, although I'm not sure whether that value actually changes anything).

So for me, the make command would look like this:

HSA_OVERRIDE_GFX_VERSION=9.0.0 make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx900

I honestly didn't think this would work at all, but it certainly did! In my case, though, since my iGPU lacks INT8 operations, performance was worse than just using the CPU, but the model did run on the iGPU (verified with nvtop).
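
For the gfx1032 card in this issue, I'd expect the equivalent build command to be something like this (untested on that hardware; 10.3.0 matches the override already being used at runtime):

HSA_OVERRIDE_GFX_VERSION=10.3.0 make -j16 LLAMA_HIPBLAS=1 AMDGPU_TARGETS=gfx1032

with the same HSA_OVERRIDE_GFX_VERSION=10.3.0 still set in the environment when running ./main.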

Hope that this works for you too!


My guess on why this hasn't been reported much

I would guess that quite a few people have already run export HSA_OVERRIDE_GFX_VERSION=xxx beforehand, which makes the variable available to every program started from that shell (including make), so an explicit declaration at build time isn't needed.
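
In other words, many people probably had something like this in their shell already, so make inherited the override without it ever being set explicitly (values below are the ones from this issue, for illustration):

export HSA_OVERRIDE_GFX_VERSION=10.3.0               # exported once, earlier in the session
make -j16 LLAMA_HIPBLAS=1 AMDGPU_TARGETS=gfx1032     # make and the compilers it invokes inherit the override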

@dariox1337

> What solved the problem for me was also setting the environment variable HSA_OVERRIDE_GFX_VERSION when running make (together with AMDGPU_TARGETS, although I'm not sure whether that value actually changes anything).

Thank you! This hint finally allowed me to run all 33 layers of Mixtral Q5_K_M on the iGPU. Since it's an APU with shared RAM, it can't compete with dGPUs, but the speedup is close to 70% nonetheless.
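
Presumably the build that worked here was along these lines (my assumption, using the numeric form HSA_OVERRIDE_GFX_VERSION expects, 11.0.0 for gfx1100):

HSA_OVERRIDE_GFX_VERSION=11.0.0 make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1100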

CPU (7840u):

llama_print_timings:        load time =    2052.23 ms
llama_print_timings:      sample time =     111.57 ms /   727 runs   (    0.15 ms per token,  6515.97 tokens per second)
llama_print_timings: prompt eval time =   34619.23 ms /   538 tokens (   64.35 ms per token,    15.54 tokens per second)
llama_print_timings:        eval time =  248061.72 ms /   726 runs   (  341.68 ms per token,     2.93 tokens per second)
llama_print_timings:       total time =  283023.52 ms

GPU (780m):

llama_print_timings:        load time =   39038.83 ms
llama_print_timings:      sample time =     132.02 ms /   867 runs   (    0.15 ms per token,  6567.34 tokens per second)
llama_print_timings: prompt eval time =   44011.30 ms /   538 tokens (   81.81 ms per token,    12.22 tokens per second)
llama_print_timings:        eval time =  181460.51 ms /   866 runs   (  209.54 ms per token,     4.77 tokens per second)
llama_print_timings:       total time =  225876.68 ms

Strangely, prompt processing is slower on GPU.


github-actions bot commented May 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed May 9, 2024