
Qwen-7B-Instruct Model numpy.core._exceptions._ArrayMemoryError: #1542

Closed
khoinpd0411 opened this issue Jun 18, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@khoinpd0411

khoinpd0411 commented Jun 18, 2024

I cannot run the quantized Qwen2-7B-Instruct model locally. The system keeps reporting a MemoryError, which seems quite strange. The same problem does not happen with other models such as Mistral-7B-Instruct. I have also tried a lower-bit quantized version, but it does not work either. My local machine has 16 GB of CPU RAM.

I have also updated the package to the latest version:
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

llama_model_loader: loaded meta data with 21 key-value pairs and 339 tensors from /home/user/llama.cpp/models/Qwen2/qwen2-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = qwen2-7b
llama_model_loader: - kv 2: qwen2.block_count u32 = 28
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 17
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q5_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 421
llm_load_vocab: token to piece cache size = 0.9352 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 3584
llm_load_print_meta: n_head = 28
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18944
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.62 B
llm_load_print_meta: model size = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name = qwen2-7b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_tensors: ggml ctx size = 0.16 MiB
llm_load_tensors: CPU buffer size = 5186.92 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1792.00 MiB
llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 304.00 MiB
llama_new_context_with_model: graph nodes = 875
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Traceback (most recent call last):

File "/home/user/anaconda3/envs/llm-app/lib/python3.8/site-packages/llama_cpp/llama.py", line 406, in init
self.scores: npt.NDArray[np.single] = np.ndarray(
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 18.6 GiB for an array with shape (32768, 152064) and data type float32
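
For context, the requested allocation is the pre-allocated scores array of shape (n_ctx, n_vocab) in float32 shown in the traceback. A minimal sketch of the arithmetic, using the values from the log above:

```python
# Size of the pre-allocated logits/scores array: n_ctx * n_vocab * 4 bytes (float32).
n_ctx = 32768      # context length from the log
n_vocab = 152064   # Qwen2 vocabulary size from the log
size_gib = n_ctx * n_vocab * 4 / 1024**3
print(f"{size_gib:.1f} GiB")  # ~18.6 GiB, matching the error message
```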

@abetlen abetlen added the bug Something isn't working label Jun 20, 2024
@abetlen
Owner

abetlen commented Jun 20, 2024

@khoinpd0411 sorry about that. Currently the Llama class keeps all past logits in memory, which can take up a lot of memory for larger context sizes. I do plan to fix this in the future, but it requires a larger change to the Llama class internals. For now I would recommend reducing your context size from 32k, and that should work.
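
For example, a minimal sketch of loading the model with a smaller context window via the llama-cpp-python Llama constructor (the model path is the one from the log above; 4096 is just an illustrative value):

```python
from llama_cpp import Llama

# A smaller n_ctx shrinks the pre-allocated scores array
# (n_ctx * n_vocab * 4 bytes of float32) as well as the KV cache.
llm = Llama(
    model_path="/home/user/llama.cpp/models/Qwen2/qwen2-7b-instruct-q5_k_m.gguf",
    n_ctx=4096,  # down from the model's 32k training context
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```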

@khoinpd0411
Author

Thank you so much for your response! Reducing the context size indeed allows the model to load. I also noticed that although the default context size for both Qwen2-7B and Mistral-7B is 32k, Qwen2-7B's vocab size is roughly 4 times larger than its counterpart's, which leads to the memory problem that does not occur with Mistral-7B.

@abetlen
Owner

abetlen commented Jun 22, 2024

@khoinpd0411 the size of that array is also proportional to the vocab size: Qwen2 has a vocab of ~150k while Mistral only has a 32k vocab.
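
To make the comparison concrete, a rough calculation under the same 32k context (Mistral's vocab size is approximated as 32,000 here):

```python
# Scores array size = n_ctx * n_vocab * 4 bytes (float32), with logits kept for
# every position in the context window.
def scores_gib(n_ctx: int, n_vocab: int) -> float:
    return n_ctx * n_vocab * 4 / 1024**3

print(f"Qwen2-7B:   {scores_gib(32768, 152064):.1f} GiB")  # ~18.6 GiB
print(f"Mistral-7B: {scores_gib(32768, 32000):.1f} GiB")   # ~3.9 GiB
```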
