Crash on AMD graphics card on Windows #202

Open
tempstudio opened this issue Aug 3, 2024 · 28 comments

Labels
bug Something isn't working llama.cpp

Comments

@tempstudio

Describe the bug

Crash with abort when trying to use AMD graphics card in editor
Model is mistral-7b-instruct-v0.2.Q4_K_M.gguf

ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.30 MiB
d3d12: upload buffer was full! Waited for COPY queue for 1.118 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.902 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.897 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.896 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.901 ms.
[Licensing::Client] Successfully resolved entitlement details
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4095.05 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..............................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.24 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 296.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
[1722650470] warming up the model with an empty run
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at D:/a/LlamaLib/LlamaLib/llama.cpp/ggml-cuda.cu:13061
err
Asset Pipeline Refresh (id=5fe1348313ec9e4439edb8aa2e9d608c): Total: 0.010 seconds - Initiated by RefreshV2(NoUpdateAssetOptions)
Asset Pipeline Refresh (id=a398558039bd1ba4a8f2fc04f6154810): Total: 0.007 seconds - Initiated by RefreshV2(NoUpdateAssetOptions)

Steps to reproduce

No response

LLMUnity version

2.0.3

Operating System

Windows

tempstudio added the bug (Something isn't working) label Aug 3, 2024
@amakropoulos
Collaborator

It seems to be an open llama.cpp issue (issue 1, issue 2)

@amakropoulos
Collaborator

@tempstudio could you check if the issue remains with the latest release (v2.2.0)?

@tempstudio
Author

tempstudio commented Sep 5, 2024

I see the same issue with 2.2.1

(Filename: ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs Line: 137)

INFO [ init] build info | tid="27560" timestamp=1725497899 build=3623 commit="436787f1"
INFO [ init] system info | tid="27560" timestamp=1725497899 n_threads=12 n_threads_batch=-1 total_threads=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from E:/.../Assets/StreamingAssets/mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
Loaded scene 'Temp/__Backupscenes/0.backup'
Deserialize: 5.726 ms
Integration: 341.064 ms
Integration of assets: 0.002 ms
Thread Wait Time: 0.004 ms
Total Operation Time: 346.796 ms
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens cache size = 3
INFO [ init] build info | tid="27560" timestamp=1725497899 build=3623 commit="436787f1"
INFO [ init] system info | tid="27560" timestamp=1725497899 n_threads=12 n_threads_batch=-1 total_threads=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
UnityEngine.StackTraceUtility:ExtractStackTrace ()
UnityEngine.DebugLogHandler:LogFormat (UnityEngine.LogType,UnityEngine.Object,string,object[])
UnityEngine.Logger:Log (UnityEngine.LogType,object)
UnityEngine.Debug:LogWarning (object)
LLMUnity.LLMUnitySetup:LogWarning (string) (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs:143)
LLMUnity.StreamWrapper:Update () (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMLib.cs:66)
LLMUnity.LLM:Update () (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLM.cs:483)

(Filename: ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs Line: 143)

llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.27 MiB
d3d12: upload buffer was full! Waited for COPY queue for 1.133 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.901 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.895 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.905 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.899 ms.
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4095.05 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
...........[Licensing::Client] Successfully resolved entitlement details
...................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.24 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 296.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:16369
err
D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:14155: CUDA error
Asset Pipeline Refresh (id=2eaefcb7421ebc541b64109c390c5c15): Total: 0.008 seconds - Initiated by RefreshV2(NoUpdateAssetOptions)

@amakropoulos
Collaborator

Thank you for testing!
I can't implement support for this card myself because the issue lies in llama.cpp.
I'll see if I can wrap around the error, however, so that Unity doesn't crash and you can use the GPU with Vulkan instead.
I'll send you a build to try later 🙏

@amakropoulos
Collaborator

Could you try the new build by changing the LlamaLib version here from v1.1.10 to v1.1.10-dev?
You will also need to delete the undreamai-v1.1.10-llamacpp folder from Assets/StreamingAssets.

With this build it should skip the HIP build and use Vulkan instead 🤞
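
For reference, the version change above amounts to editing a single constant; a minimal sketch, assuming a string along the lines of LlamaLibVersion in LLMUnitySetup.cs (the actual name and file are whatever the linked line points to):

public static class LLMUnitySetupSketch
{
    // Hypothetical sketch only: edit the real constant at the line linked above.
    // public static string LlamaLibVersion = "v1.1.10";     // original release build
    public static string LlamaLibVersion = "v1.1.10-dev";    // dev build with the Vulkan fallback
}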

@tempstudio
Author

tempstudio commented Sep 6, 2024

Apologies:
I was using the wrong binaries yesterday, so even though the C# code was 2.2.1, the native code in StreamingAssets was probably still the old version. I deleted the "StreamingAssets" directory and tried again.

It didn't crash this time, after I deleted things from StreamingAssets and reinstalled the package, but I'm pretty sure it's using the CPU: the speed is very slow and the CPU usage is high.

Server command: -m "C:/Users/.../AppData/Roaming/LLMUnity/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" -c 4096 -b 512 --log-disable -np 1 -ngl -1
UnityEngine.StackTraceUtility:ExtractStackTrace ()
UnityEngine.DebugLogHandler:LogFormat (UnityEngine.LogType,UnityEngine.Object,string,object[])
UnityEngine.Logger:Log (UnityEngine.LogType,object)
UnityEngine.Debug:Log (object)
LLMUnity.LLMUnitySetup:Log (string) (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs:137)
LLMUnity.LLM:StartLLMServer (string) (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLM.cs:373)
LLMUnity.LLM/<>c__DisplayClass45_0:b__0 () (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLM.cs:119)
System.Threading.Tasks.Task:InnerInvoke ()
System.Threading.Tasks.Task:Execute ()
System.Threading.Tasks.Task:ExecutionContextCallback (object)
System.Threading.ExecutionContext:RunInternal (System.Threading.ExecutionContext,System.Threading.ContextCallback,object,bool)
System.Threading.ExecutionContext:Run (System.Threading.ExecutionContext,System.Threading.ContextCallback,object,bool)
System.Threading.Tasks.Task:ExecuteWithThreadLocal (System.Threading.Tasks.Task&)
System.Threading.Tasks.Task:ExecuteEntry (bool)
System.Threading.Tasks.Task:System.Threading.IThreadPoolWorkItem.ExecuteWorkItem ()
System.Threading.ThreadPoolWorkQueue:Dispatch ()
System.Threading._ThreadPoolWaitCallback:PerformWaitCallback ()

(Filename: ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs Line: 137)

warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support

...

llm_load_tensors: CPU buffer size = 4685.30 MiB
........................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.98 MiB
llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1

Giving v1.1.10-dev a try now.

@tempstudio
Author

The behavior is the same with v1.1.10-dev.

@amakropoulos
Collaborator

You are using -1 for num GPU layers, which will not use the GPU. Could you try e.g. 10?
There should be debug messages that start with "Tried architecture"; can you post those as well?

@tempstudio
Author

I thought -1 would mean all / max?
With 9999 GPU layers it crashed with the same error, even on v1.1.10-dev :/
I think it's been the same issue all along.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4403.50 MiB
llm_load_tensors: CPU buffer size = 281.81 MiB
........................................Asset Pipeline Refresh (id=2020b226d14d319468ddb810101aa4ca): Total: 0.008 seconds - Initiated by RefreshV2(NoUpdateAssetOptions)
...............................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.98 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 258.50 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 903
llama_new_context_with_model: graph splits = 2
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:16369
err
D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:14155: CUDA error
Asset Pipeline Refresh (id=7f5d46cd6ec704f4ba373546e19f8732): Total: 0.006 seconds - Initiated by RefreshV2(NoUpdateAssetOptions)

@tempstudio
Author

Tried it with flash attention OFF and it's the same:
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4403.50 MiB
llm_load_tensors: CPU buffer size = 281.81 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.98 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 296.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:16369
err
D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:14155: CUDA error
Unable to find style 'TemplatesPromo' in skin 'DarkSkin' Layout

@amakropoulos
Collaborator

Thanks a lot!
Could you do one more test with v1.1.10-dev2?

@tempstudio
Author

A couple of problems I encountered with v1.1.10-dev2:
First, the install didn't work; it just installed an empty folder. I manually downloaded the entire zip and unzipped it into the StreamingAssets folder.
After that, the same error happened.
Third, I deleted the two "windows-cuda" folders from the directory. It crashed again.
Finally, I deleted the "windows-hip" folder from the directory. It doesn't crash anymore, but it doesn't use the GPU either. It seems it's not even going to try Vulkan.

@amakropoulos
Collaborator

Thanks a lot.
I have fixed the issue with the empty folder in v2.2.2.
It seems I can't do much at the moment for the specific GPU unfortunately.
I'll keep an eye on the llama.cpp updates and let you know once I find a solution.

@amakropoulos
Collaborator

I'm going through some issues and I have an idea.
I may have to specify your GPU architecture in the HIP build.

@amakropoulos
Collaborator

Could you try the v1.1.11 build?
I have specifically set the AMD architectures, including the one of your GPU (gfx1030).

@tempstudio
Author

The good news is that it doesn't crash anymore.
The bad news is that the performance is much worse than CPU-only: running the chat pegs GPU usage at 100%, it stutters, and it takes extremely long to generate anything.
I recall running this with llamafile and it was at least 20x faster than this (this is with only 1 layer on the GPU; using all layers makes the OS unresponsive):

INFO [ print_timings] prompt eval time = 192189.92 ms / 399 tokens ( 481.68 ms per token, 2.08 tokens per second) | tid="5292" timestamp=1725926634 id_slot=0 id_task=1 t_prompt_processing=192189.92 n_prompt_tokens_processed=399 t_token=481.67899749373436 n_tokens_second=2.0760714193543555
INFO [ print_timings] generation eval time = 24258.31 ms / 41 runs ( 591.67 ms per token, 1.69 tokens per second) | tid="5292" timestamp=1725926634 id_slot=0 id_task=1 t_token_generation=24258.305 n_decoded=41 t_token=591.6659756097561 n_tokens_second=1.6901428191293661
INFO [ print_timings] total time = 216448.23 ms | tid="5292" timestamp=1725926634 id_slot=0 id_task=1 t_prompt_processing=192189.92 t_token_generation=24258.305 t_total=216448.225
INFO [ update_slots] slot released | tid="5292" timestamp=1725926634 id_slot=0 id_task=1 n_ctx=2048 n_past=439 n_system_tokens=0 n_cache_tokens=439 truncated=false
INFO [ update_slots] all slots are idle | tid="5292" timestamp=1725926634
INFO [ update_slots] all slots are idle | tid="5292" timestamp=1725926634

I have updated to the latest drivers and also just restarted my system.

@amakropoulos
Collaborator

Yes! That works!
What happens if you use more layers, but not extreme values, e.g. 10, 25 or 50?

@tempstudio
Author

Performance is equally bad with 10/30 layers.

10 layers:
prompt processing 2tk/s generation 1tk/s
30 layers:
prompt processing 2tk/s generation 0.5tk/s

@tempstudio
Author

Is there any possibility of the performance issue being fixed in llamalib?
If not, is it possible to provide a 2.x build that uses llamafile as a backend?

@amakropoulos
Collaborator

amakropoulos commented Sep 13, 2024

I really doubt it is a problem with LlamaLib because I use and extend code directly from llama.cpp and llamafile.

This is an overview of the different libraries:

  • llama.cpp
    It is the main implementation that all the other libraries use.
    Specifically for GPU it uses CUDA (Nvidia) and CUDA+HIP (AMD).
    This is the fastest but including CUDA in the builds increases the build size to 1GB / build.
    To support most Nvidia GPUs I include both CUDA 11 and 12 builds, which would mean 2 GB.
  • llamafile
    It packages and serves llama.cpp in just a single file for all OSes.
    Specifically for GPUs, it uses CUDA (Nvidia) and CUDA+HIP (AMD) if the system has CUDA already installed (rare, unless you are into AI).
    Otherwise it uses its own tinyBLAS implementation, which is slower than or at best equal to CUDA (from version 0.7 onwards).
    The benefit is that it needs less than 100MB to include in the build.
  • LlamaLib
    It extends llama.cpp with the functionality needed to use it as a Unity / C# library and builds binaries for the different architectures.
    I use the llama.cpp implementation but specifically for GPUs I hack it and use tinyBLAS to keep the build size small.

The source of the speed issue is most probably in the tinyBLAS implementation of llamafile.
If you have CUDA installed or use a llamafile version earlier than 0.7, llamafile will still use CUDA, which will give you the speed boost.

@amakropoulos
Collaborator

amakropoulos commented Sep 13, 2024

There are reasons why I don't use llamafile anymore, although I love the project:

  • it has antivirus issues (false positives). This is because it builds llama.cpp on the fly directly on the system that uses it.
    I actually whitelisted it myself for McAfee antivirus.
  • it can only be included as a server, not as a DLL.
    This can only be used in IL2CPP builds.
    Also someone could create a similar server locally and take over your game.
    I have spent a lot of time on workarounds to try and prevent that.
  • It can't be used for mobile deployment (Android / iOS).

For these reasons I can't bring it back to the project.
I'd prefer to find where the source of the problem is and solve it there.
It is tricky for me to work on AMD support because I don't have an AMD GPU and there is no supported one available on the cloud.

You could try the following to understand more about the issue using the latest llamafile.

Check the timings for both cases:
llamafile without CUDA

  • Uninstall CUDA
  • Delete the .llamafile folder from your system. It will be in your user directory (C:/Users/<USER>)
  • From cmd:
    • cd inside the directory that contains llamafile
    • run llamafile-0.8.13.exe -m <path_to_model> -ngl 10 -p "to be or" --nocompile --tinyblas

llamafile with CUDA

  • Install CUDA
  • Delete the .llamafile folder from your system. It will be in your user directory (C:/Users/<USER>)
  • From cmd:
    • cd inside the directory that contains llamafile
    • run llamafile-0.8.13.exe -m <path_to_model> -ngl 10 -p "to be or"

Then we could find out which implementation is the culprit.

@tempstudio
Author

I will give those a try. Can you build LlamaLib into a command-line standalone so that I can test that too, just in case there's something wonky going on with GPU resource sharing between the AI and Unity?

@tempstudio
Author

Here is the performance with tinyBLAS. I don't believe the CUDA run is needed as I'm using an AMD system and it doesn't support CUDA. I will be very happy if I can get this type of performance inside Unity.

.\llamafile-0.8.13.exe -m .\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 99 -p "to be or" --nocompile --tinyblas -c 2048

llama_print_timings: load time = 2364.86 ms
llama_print_timings: sample time = 55.42 ms / 773 runs ( 0.07 ms per token, 13948.79 tokens per second)
llama_print_timings: prompt eval time = 36.01 ms / 4 tokens ( 9.00 ms per token, 111.08 tokens per second)
llama_print_timings: eval time = 22152.88 ms / 772 runs ( 28.70 ms per token, 34.85 tokens per second)
llama_print_timings: total time = 22420.02 ms / 776 tokens
Log end

More logs that might be helpful:

import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++.exe not found on $PATH
get_rocm_bin_path: note: /D/Drivers/ROCM/5.7//bin/amdclang++.exe does not exist
get_rocm_bin_path: note: clang++.exe not found on $PATH
link_cuda_dso: note: dynamically linking /C/Users/Tony/.llamafile/v/0.8.13/ggml-rocm.dll
ggml_cuda_link: welcome to ROCm SDK with tinyBLAS
link_cuda_dso: GPU support loaded
llm_load_print_meta: model size = 4.58 GiB (4.89 BPW)
llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.32 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 3992.51 MiB
llm_load_tensors: CPU buffer size = 4685.30 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.49 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 669.48 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 4

Another piece of info: during execution, Task Manager shows the GPU usage at 1% instead of the 99% I see when using LlamaLib. This might be inaccurate.

@tempstudio
Author

FYI I got llama.cpp's Vulkan build to work (I needed to set GGML_VK_VISIBLE_DEVICES=0) and the timing is like this:

llama_perf_sampler_print: sampling time = 63.40 ms / 780 runs ( 0.08 ms per token, 12303.03 tokens per second)
llama_perf_context_print: load time = 2719.64 ms
llama_perf_context_print: prompt eval time = 184.44 ms / 4 tokens ( 46.11 ms per token, 21.69 tokens per second)
llama_perf_context_print: eval time = 12409.56 ms / 775 runs ( 16.01 ms per token, 62.45 tokens per second)
llama_perf_context_print: total time = 12738.39 ms / 779 tokens

So it's (potentially) faster to run Vulkan than HIP with tinyBLAS.
Maybe that's an easier thing to get working than HIP?

@amakropoulos
Collaborator

Thanks for all the testing!
I have already included Vulkan as a fallback, but it is only called if HIP doesn't work.
You can switch to it by disabling these 2 lines:
https://github.com/undreamai/LLMUnity/blob/main/Runtime/LLMLib.cs#L368
https://github.com/undreamai/LLMUnity/blob/main/Runtime/LLMLib.cs#L374
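
For context, the fallback works by trying architectures in a fixed order, with Vulkan listed after the HIP entries, so disabling those lines lets the loader fall through to Vulkan. A minimal sketch of the idea, with hypothetical entry names (the actual entries are the two linked lines in LLMLib.cs):

using System.Collections.Generic;

public static class GpuFallbackSketch
{
    // Hypothetical illustration only: the real list is built in LLMLib.cs at the linked lines,
    // and the entry names below are made up for this sketch.
    public static List<string> Architectures()
    {
        List<string> archs = new List<string>();
        // archs.Add("windows-hip");   // disabling the HIP entries (the two linked lines) ...
        archs.Add("windows-vulkan");   // ... makes the loader fall through to the Vulkan build
        archs.Add("windows-noavx");    // CPU-only build stays as the last resort
        return archs;
    }
}
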
Could you check whether that works better?

@amakropoulos
Collaborator

Could you also try the following to see if the build works at the same speed as tinyBLAS?

  • Setup
  • from command line:
    • cd inside the directory, and inside the windows-hip directory
    • run undreamai_server.exe -m <path_to_Llama-3.1> -ngl 99 -c 2048 --port 13333 --template "llama3 chat"
  • You can then use it from Unity as a remote server (see the sketch after this list)
    • Open the SimpleInteraction sample
    • Delete the LLM GameObject
    • Enable the LLMCharacter GameObject Remote flag
    • Run the scene and start a chat
  • Check from command line the timings
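
A minimal sketch of the remote setup from script, assuming the remote, host and port fields and the Chat method that LLMCharacter exposes in the LLMUnity docs (enabling the Remote flag in the Inspector as described above achieves the same thing):

using UnityEngine;
using LLMUnity;

public class RemoteServerTest : MonoBehaviour
{
    public LLMCharacter llmCharacter;

    async void Start()
    {
        llmCharacter.remote = true;        // use the external server instead of a local LLM object
        llmCharacter.host = "localhost";   // undreamai_server.exe runs on the same machine
        llmCharacter.port = 13333;         // must match the --port argument above
        string reply = await llmCharacter.Chat("Hello!");
        Debug.Log(reply);
    }
}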

@amakropoulos
Collaborator

Could we maybe have a call to resolve this? It would be really helpful!
You can find me at the Discord server.

@tempstudio
Author

tempstudio commented Sep 18, 2024

(1) Vulkan doesn't work because of this problem: it detects the same graphics card twice and then fails to load:
ggerganov/llama.cpp#9516
I tried to use the C# API to set environment variables, but that seems to behave very strangely: it only takes effect after a full restart of the editor. So it doesn't work, and it keeps refusing to work even though

UnityEngine.Debug.Log(Environment.GetEnvironmentVariable("GGML_VK_VISIBLE_DEVICES"));

prints 0 - until the editor and the Unity Hub are restarted.
This isn't going to fly for a production build.
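
For reference, the standard .NET behavior is that Environment.SetEnvironmentVariable with the default (process) scope affects only the current process and any child processes started afterwards, with no restart required; whether that helps here depends on when the native plugin reads the variable, which is the part that behaves strangely in the editor. A minimal sketch:

using System;
using UnityEngine;

public class VulkanDeviceFilter : MonoBehaviour
{
    void Awake()
    {
        // Process-scoped: visible to this process and to child processes started after this call.
        Environment.SetEnvironmentVariable("GGML_VK_VISIBLE_DEVICES", "0");
        Debug.Log(Environment.GetEnvironmentVariable("GGML_VK_VISIBLE_DEVICES"));
    }
}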

(2) The performance of the HIP server is as bad as it is in the editor:

INFO [           print_timings] prompt eval time     =   90492.37 ms /   195 tokens (  464.06 ms per token,     2.15 tokens per second) | tid="2696" timestamp=1726616971 id_slot=0 id_task=0 t_prompt_processing=90492.37 n_prompt_tokens_processed=195 t_token=464.06343589743585 n_tokens_second=2.154877809035171
INFO [           print_timings] generation eval time =   72538.86 ms /    45 runs   ( 1611.97 ms per token,     0.62 tokens per second) | tid="2696" timestamp=1726616971 id_slot=0 id_task=0 t_token_generation=72538.859 n_decoded=45 t_token=1611.9746444444443 n_tokens_second=0.6203571522954339

(3) The Vulkan server works with the right env variable. The performance of the Vulkan server matches llama.cpp.
