
Help me... #745

Closed
AndreyRGW opened this issue Apr 3, 2023 · 8 comments

Comments

AndreyRGW commented Apr 3, 2023

Starting the web UI...
Warning: --gptq_bits is deprecated and will be removed. Use --wbits instead.
Warning: --gptq_pre_layer is deprecated and will be removed. Use --prelayer instead.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Loading binary C:\Users\RGWyo\AppData\Local\Programs\Python\Python310\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.dll...
C:\Users\RGWyo\AppData\Local\Programs\Python\Python310\lib\site-packages\bitsandbytes\cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
The following models are available:

1. alpaca-13b
2. chatgpt4all
3. codegen-6B-multi
4. llama-13b-hf-int4
5. llama-7b-hf
6. llama-7b-hf-int4
7. rugpt3large_based_on_gpt2

Which one do you want to load? 1-7

6

Loading llama-7b-hf-int4...
CUDA extension not installed.
Loading model ...
Traceback (most recent call last):
  File "F:\WBC\text-generation-webui\text-generation-webui\server.py", line 276, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "F:\WBC\text-generation-webui\text-generation-webui\modules\models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "F:\WBC\text-generation-webui\text-generation-webui\modules\GPTQ_loader.py", line 114, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
  File "F:\WBC\text-generation-webui\text-generation-webui\modules\GPTQ_loader.py", line 45, in _load_quant
    model.load_state_dict(torch.load(checkpoint))
  File "C:\Users\RGWyo\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "C:\Users\RGWyo\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\serialization.py", line 1172, in _load
    result = unpickler.load()
  File "C:\Users\RGWyo\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\serialization.py", line 1142, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "C:\Users\RGWyo\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\serialization.py", line 1116, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "C:\Users\RGWyo\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\serialization.py", line 217, in default_restore_location
    result = fn(storage, location)
  File "C:\Users\RGWyo\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\serialization.py", line 182, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "C:\Users\RGWyo\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\serialization.py", line 166, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I just installed the web UI in a clean folder.

Win11
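For reference, the remedy the error message itself suggests looks roughly like this minimal sketch (the checkpoint filename is a placeholder). It only remaps the serialized CUDA storages to CPU so the file can be deserialized; it does not restore GPU inference, which is unavailable here because the installed torch/bitsandbytes build has no GPU support:

```python
import torch

# Remap CUDA-serialized storages to CPU so the checkpoint can at least
# be deserialized on a machine where torch.cuda.is_available() is False.
# "llama-7b-hf-int4.pt" is a placeholder for the real checkpoint path.
state_dict = torch.load("llama-7b-hf-int4.pt", map_location=torch.device("cpu"))
```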

AndreyRGW (author) commented:

Why does the web UI use my system Python and not the micromamba environment?

AndreyRGW (author) commented Apr 3, 2023

I tried creating an environment in Anaconda3; the error is exactly the same, except that the Anaconda3 folder is now used instead of my Python install.

AndreyRGW (author) commented:

Loading llama-7b-hf-int4...
Traceback (most recent call last):
  File "F:\WBC\text-generation-webui\text-generation-webui\server.py", line 276, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "F:\WBC\text-generation-webui\text-generation-webui\modules\models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "F:\WBC\text-generation-webui\text-generation-webui\modules\GPTQ_loader.py", line 114, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
  File "F:\WBC\text-generation-webui\text-generation-webui\modules\GPTQ_loader.py", line 36, in _load_quant
    make_quant(model, layers, wbits, groupsize, faster=faster_kernel, kernel_switch_threshold=kernel_switch_threshold)
TypeError: make_quant() got an unexpected keyword argument 'faster'

Got a new error
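A `TypeError` like this usually means the GPTQ-for-LLaMa checkout and the web UI loader are out of sync: the loader passes `faster=` but the installed `make_quant` no longer (or not yet) accepts it. A minimal sketch for checking the locally installed signature, assuming the default `repositories/GPTQ-for-LLaMa` checkout and that `make_quant` is importable from its `quant` module, as `GPTQ_loader.py` does:

```python
import inspect
import sys
from pathlib import Path

# Assumption: the default checkout location used by the web UI.
sys.path.insert(0, str(Path("repositories/GPTQ-for-LLaMa")))
from quant import make_quant  # the function GPTQ_loader.py calls

# If 'faster' is absent from the printed signature, the checkout
# predates (or postdates) what the web UI loader expects.
print(inspect.signature(make_quant))
```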

AndreyRGW (author) commented:

Now I get this error with LLaMA:

RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
        Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.qzeros", "model.layers.0.self_attn.o_proj.qzeros", "model.layers.0.self_attn.q_proj.qzeros", "model.layers.0.self_attn.v_proj.qzeros", "model.layers.0.mlp.down_proj.qzeros", "model.layers.0.mlp.gate_proj.qzeros", "model.layers.0.mlp.up_proj.qzeros", "model.layers.1.self_attn.k_proj.qzeros", "model.layers.1.self_attn.o_proj.qzeros", "model.layers.1.self_attn.q_proj.qzeros", "model.layers.1.self_attn.v_proj.qzeros", "model.layers.1.mlp.down_proj.qzeros", "model.layers.1.mlp.gate_proj.qzeros", "model.layers.1.mlp.up_proj.qzeros", "model.layers.2.self_attn.k_proj.qzeros", "model.layers.2.self_attn.o_proj.qzeros", "model.layers.2.self_attn.q_proj.qzeros", "model.layers.2.self_attn.v_proj.qzeros", "model.layers.2.mlp.down_proj.qzeros", "model.layers.2.mlp.gate_proj.qzeros", "model.layers.2.mlp.up_proj.qzeros", "model.layers.3.self_attn.k_proj.qzeros", "model.layers.3.self_attn.o_proj.qzeros", "model.layers.3.self_attn.q_proj.qzeros", "model.layers.3.self_attn.v_proj.qzeros", "model.layers.3.mlp.down_proj.qzeros", "model.layers.3.mlp.gate_proj.qzeros", "model.layers.3.mlp.up_proj.qzeros", "model.layers.4.self_attn.k_proj.qzeros", "model.layers.4.self_attn.o_proj.qzeros", "model.layers.4.self_attn.q_proj.qzeros", "model.layers.4.self_attn.v_proj.qzeros", "model.layers.4.mlp.down_proj.qzeros", "model.layers.4.mlp.gate_proj.qzeros", "model.layers.4.mlp.up_proj.qzeros", "model.layers.5.self_attn.k_proj.qzeros", "model.layers.5.self_attn.o_proj.qzeros", "model.layers.5.self_attn.q_proj.qzeros", "model.layers.5.self_attn.v_proj.qzeros", "model.layers.5.mlp.down_proj.qzeros", "model.layers.5.mlp.gate_proj.qzeros", "model.layers.5.mlp.up_proj.qzeros", "model.layers.6.self_attn.k_proj.qzeros", "model.layers.6.self_attn.o_proj.qzeros", "model.layers.6.self_attn.q_proj.qzeros", "model.layers.6.self_attn.v_proj.qzeros", "model.layers.6.mlp.down_proj.qzeros", "model.layers.6.mlp.gate_proj.qzeros", "model.layers.6.mlp.up_proj.qzeros", "model.layers.7.self_attn.k_proj.qzeros", "model.layers.7.self_attn.o_proj.qzeros", "model.layers.7.self_attn.q_proj.qzeros", "model.layers.7.self_attn.v_proj.qzeros", "model.layers.7.mlp.down_proj.qzeros", etc...
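Missing keys that all end in `qzeros` typically point to a checkpoint quantized with an older GPTQ-for-LLaMa format (which stored `zeros`) than the `QuantLinear` the loader builds (which expects `qzeros`). A minimal sketch, with a placeholder path, for checking which format a `.pt` file actually contains:

```python
import torch

# "path/to/model-4bit.pt" is a placeholder for the failing checkpoint.
sd = torch.load("path/to/model-4bit.pt", map_location="cpu")
print("old-format keys :", sum(k.endswith(".zeros") for k in sd))   # pre-qzeros GPTQ
print("new-format keys :", sum(k.endswith(".qzeros") for k in sd))  # current GPTQ
```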

AndreyRGW (author) commented:

Downloaded LLaMA from here.

No errors so far

AndreyRGW (author) commented:

Loading llama-7b-4bit...
Loading model ...
Done.
Loaded the model in 13.98 seconds.
Adding the LoRA chatgpt4all to the model...
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\gradio\routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\gradio\blocks.py", line 1108, in process_api
    result = await self.call_function(
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\gradio\blocks.py", line 915, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "F:\WBC\text-generation-webui\text-generation-webui\server.py", line 76, in load_lora_wrapper
    add_lora_to_model(selected_lora)
  File "F:\WBC\text-generation-webui\text-generation-webui\modules\LoRA.py", line 34, in add_lora_to_model
    shared.model = PeftModel.from_pretrained(shared.model, Path(f"{shared.args.lora_dir}/{lora_name}"), **params)
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\peft\peft_model.py", line 143, in from_pretrained
    model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](model, config)
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\peft\peft_model.py", line 514, in __init__
    super().__init__(model, peft_config)
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\peft\peft_model.py", line 79, in __init__
    self.base_model = LoraModel(peft_config, model)
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\peft\tuners\lora.py", line 118, in __init__
    self._find_and_replace()
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\peft\tuners\lora.py", line 179, in _find_and_replace
    self._replace_module(parent, target_name, new_module, target)
UnboundLocalError: local variable 'new_module' referenced before assignment

Got errors with the LoRA.
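The `UnboundLocalError` in peft's `_find_and_replace` typically fires when none of the LoRA's target modules match a layer type peft supports: GPTQ-quantized `QuantLinear` layers are not plain `torch.nn.Linear`, so `new_module` is never assigned before `_replace_module` is called. Since this code path changed across early peft releases, a quick hedged check of the versions in the active environment:

```python
import peft
import torch

# Versions in the active Anaconda environment; applying a LoRA on top of a
# GPTQ-quantized model generally needed a patched loader in peft releases
# of this era.
print("peft :", peft.__version__)
print("torch:", torch.__version__)
```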

AndreyRGW (author) commented Apr 3, 2023

Again, errors with Alpaca-13B-int4:

Big error
Loading alpaca-13b...
Loading model ...
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\gradio\routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\gradio\blocks.py", line 1108, in process_api
    result = await self.call_function(
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\gradio\blocks.py", line 915, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "F:\WBC\text-generation-webui\text-generation-webui\server.py", line 71, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "F:\WBC\text-generation-webui\text-generation-webui\modules\models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "F:\WBC\text-generation-webui\text-generation-webui\modules\GPTQ_loader.py", line 114, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
  File "F:\WBC\text-generation-webui\text-generation-webui\modules\GPTQ_loader.py", line 45, in _load_quant
    model.load_state_dict(torch.load(checkpoint))
  File "C:\ProgramData\Anaconda3\envs\textgen2\lib\site-packages\torch\nn\modules\module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
        Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.qzeros", "model.layers.0.self_attn.o_proj.qzeros", "model.layers.0.self_attn.q_proj.qzeros", "model.layers.0.self_attn.v_proj.qzeros", "model.layers.0.mlp.down_proj.qzeros", "model.layers.0.mlp.gate_proj.qzeros", "model.layers.0.mlp.up_proj.qzeros", "model.layers.1.self_attn.k_proj.qzeros", "model.layers.1.self_attn.o_proj.qzeros", "model.layers.1.self_attn.q_proj.qzeros", "model.layers.1.self_attn.v_proj.qzeros", "model.layers.1.mlp.down_proj.qzeros", "model.layers.1.mlp.gate_proj.qzeros", "model.layers.1.mlp.up_proj.qzeros", "model.layers.2.self_attn.k_proj.qzeros", "model.layers.2.self_attn.o_proj.qzeros", "model.layers.2.self_attn.q_proj.qzeros", "model.layers.2.self_attn.v_proj.qzeros", "model.layers.2.mlp.down_proj.qzeros", "model.layers.2.mlp.gate_proj.qzeros", "model.layers.2.mlp.up_proj.qzeros", "model.layers.3.self_attn.k_proj.qzeros", "model.layers.3.self_attn.o_proj.qzeros", "model.layers.3.self_attn.q_proj.qzeros", "model.layers.3.self_attn.v_proj.qzeros", "model.layers.3.mlp.down_proj.qzeros", "model.layers.3.mlp.gate_proj.qzeros", "model.layers.3.mlp.up_proj.qzeros", "model.layers.4.self_attn.k_proj.qzeros", "model.layers.4.self_attn.o_proj.qzeros", "model.layers.4.self_attn.q_proj.qzeros", "model.layers.4.self_attn.v_proj.qzeros", "model.layers.4.mlp.down_proj.qzeros", "model.layers.4.mlp.gate_proj.qzeros", "model.layers.4.mlp.up_proj.qzeros", "model.layers.5.self_attn.k_proj.qzeros", "model.layers.5.self_attn.o_proj.qzeros", "model.layers.5.self_attn.q_proj.qzeros", "model.layers.5.self_attn.v_proj.qzeros", "model.layers.5.mlp.down_proj.qzeros", "model.layers.5.mlp.gate_proj.qzeros", "model.layers.5.mlp.up_proj.qzeros", "model.layers.6.self_attn.k_proj.qzeros", "model.layers.6.self_attn.o_proj.qzeros", "model.layers.6.self_attn.q_proj.qzeros", "model.layers.6.self_attn.v_proj.qzeros", "model.layers.6.mlp.down_proj.qzeros", "model.layers.6.mlp.gate_proj.qzeros", "model.layers.6.mlp.up_proj.qzeros", "model.layers.7.self_attn.k_proj.qzeros", "model.layers.7.self_attn.o_proj.qzeros", "model.layers.7.self_attn.q_proj.qzeros", "model.layers.7.self_attn.v_proj.qzeros", "model.layers.7.mlp.down_proj.qzeros", "model.layers.7.mlp.gate_proj.qzeros", "model.layers.7.mlp.up_proj.qzeros", "model.layers.8.self_attn.k_proj.qzeros", "model.layers.8.self_attn.o_proj.qzeros", "model.layers.8.self_attn.q_proj.qzeros", "model.layers.8.self_attn.v_proj.qzeros", "model.layers.8.mlp.down_proj.qzeros", "model.layers.8.mlp.gate_proj.qzeros", "model.layers.8.mlp.up_proj.qzeros", "model.layers.9.self_attn.k_proj.qzeros", "model.layers.9.self_attn.o_proj.qzeros", "model.layers.9.self_attn.q_proj.qzeros", "model.layers.9.self_attn.v_proj.qzeros", "model.layers.9.mlp.down_proj.qzeros", "model.layers.9.mlp.gate_proj.qzeros", "model.layers.9.mlp.up_proj.qzeros", "model.layers.10.self_attn.k_proj.qzeros", "model.layers.10.self_attn.o_proj.qzeros", "model.layers.10.self_attn.q_proj.qzeros", "model.layers.10.self_attn.v_proj.qzeros", "model.layers.10.mlp.down_proj.qzeros", "model.layers.10.mlp.gate_proj.qzeros", "model.layers.10.mlp.up_proj.qzeros", "model.layers.11.self_attn.k_proj.qzeros", "model.layers.11.self_attn.o_proj.qzeros", "model.layers.11.self_attn.q_proj.qzeros", "model.layers.11.self_attn.v_proj.qzeros", "model.layers.11.mlp.down_proj.qzeros", "model.layers.11.mlp.gate_proj.qzeros", "model.layers.11.mlp.up_proj.qzeros", "model.layers.12.self_attn.k_proj.qzeros", "model.layers.12.self_attn.o_proj.qzeros", 
"model.layers.12.self_attn.q_proj.qzeros", "model.layers.12.self_attn.v_proj.qzeros", "model.layers.12.mlp.down_proj.qzeros", "model.layers.12.mlp.gate_proj.qzeros", "model.layers.12.mlp.up_proj.qzeros", "model.layers.13.self_attn.k_proj.qzeros", "model.layers.13.self_attn.o_proj.qzeros", "model.layers.13.self_attn.q_proj.qzeros", "model.layers.13.self_attn.v_proj.qzeros", "model.layers.13.mlp.down_proj.qzeros", "model.layers.13.mlp.gate_proj.qzeros", "model.layers.13.mlp.up_proj.qzeros", "model.layers.14.self_attn.k_proj.qzeros", "model.layers.14.self_attn.o_proj.qzeros", "model.layers.14.self_attn.q_proj.qzeros", "model.layers.14.self_attn.v_proj.qzeros", "model.layers.14.mlp.down_proj.qzeros", "model.layers.14.mlp.gate_proj.qzeros", "model.layers.14.mlp.up_proj.qzeros", "model.layers.15.self_attn.k_proj.qzeros", "model.layers.15.self_attn.o_proj.qzeros", "model.layers.15.self_attn.q_proj.qzeros", "model.layers.15.self_attn.v_proj.qzeros", "model.layers.15.mlp.down_proj.qzeros", "model.layers.15.mlp.gate_proj.qzeros", "model.layers.15.mlp.up_proj.qzeros", "model.layers.16.self_attn.k_proj.qzeros", "model.layers.16.self_attn.o_proj.qzeros", "model.layers.16.self_attn.q_proj.qzeros", "model.layers.16.self_attn.v_proj.qzeros", "model.layers.16.mlp.down_proj.qzeros", "model.layers.16.mlp.gate_proj.qzeros", "model.layers.16.mlp.up_proj.qzeros", "model.layers.17.self_attn.k_proj.qzeros", "model.layers.17.self_attn.o_proj.qzeros", "model.layers.17.self_attn.q_proj.qzeros", "model.layers.17.self_attn.v_proj.qzeros", "model.layers.17.mlp.down_proj.qzeros", "model.layers.17.mlp.gate_proj.qzeros", "model.layers.17.mlp.up_proj.qzeros", "model.layers.18.self_attn.k_proj.qzeros", "model.layers.18.self_attn.o_proj.qzeros", "model.layers.18.self_attn.q_proj.qzeros", "model.layers.18.self_attn.v_proj.qzeros", "model.layers.18.mlp.down_proj.qzeros", "model.layers.18.mlp.gate_proj.qzeros", "model.layers.18.mlp.up_proj.qzeros", "model.layers.19.self_attn.k_proj.qzeros", "model.layers.19.self_attn.o_proj.qzeros", "model.layers.19.self_attn.q_proj.qzeros", "model.layers.19.self_attn.v_proj.qzeros", "model.layers.19.mlp.down_proj.qzeros", "model.layers.19.mlp.gate_proj.qzeros", "model.layers.19.mlp.up_proj.qzeros", "model.layers.20.self_attn.k_proj.qzeros", "model.layers.20.self_attn.o_proj.qzeros", "model.layers.20.self_attn.q_proj.qzeros", "model.layers.20.self_attn.v_proj.qzeros", "model.layers.20.mlp.down_proj.qzeros", "model.layers.20.mlp.gate_proj.qzeros", "model.layers.20.mlp.up_proj.qzeros", "model.layers.21.self_attn.k_proj.qzeros", "model.layers.21.self_attn.o_proj.qzeros", "model.layers.21.self_attn.q_proj.qzeros", "model.layers.21.self_attn.v_proj.qzeros", "model.layers.21.mlp.down_proj.qzeros", "model.layers.21.mlp.gate_proj.qzeros", "model.layers.21.mlp.up_proj.qzeros", "model.layers.22.self_attn.k_proj.qzeros", "model.layers.22.self_attn.o_proj.qzeros", "model.layers.22.self_attn.q_proj.qzeros", "model.layers.22.self_attn.v_proj.qzeros", "model.layers.22.mlp.down_proj.qzeros", "model.layers.22.mlp.gate_proj.qzeros", "model.layers.22.mlp.up_proj.qzeros", "model.layers.23.self_attn.k_proj.qzeros", "model.layers.23.self_attn.o_proj.qzeros", "model.layers.23.self_attn.q_proj.qzeros", "model.layers.23.self_attn.v_proj.qzeros", "model.layers.23.mlp.down_proj.qzeros", "model.layers.23.mlp.gate_proj.qzeros", "model.layers.23.mlp.up_proj.qzeros", "model.layers.24.self_attn.k_proj.qzeros", "model.layers.24.self_attn.o_proj.qzeros", "model.layers.24.self_attn.q_proj.qzeros", 
"model.layers.24.self_attn.v_proj.qzeros", "model.layers.24.mlp.down_proj.qzeros", "model.layers.24.mlp.gate_proj.qzeros", "model.layers.24.mlp.up_proj.qzeros", "model.layers.25.self_attn.k_proj.qzeros", "model.layers.25.self_attn.o_proj.qzeros", "model.layers.25.self_attn.q_proj.qzeros", "model.layers.25.self_attn.v_proj.qzeros", "model.layers.25.mlp.down_proj.qzeros", "model.layers.25.mlp.gate_proj.qzeros", "model.layers.25.mlp.up_proj.qzeros", "model.layers.26.self_attn.k_proj.qzeros", "model.layers.26.self_attn.o_proj.qzeros", "model.layers.26.self_attn.q_proj.qzeros", "model.layers.26.self_attn.v_proj.qzeros", "model.layers.26.mlp.down_proj.qzeros", "model.layers.26.mlp.gate_proj.qzeros", "model.layers.26.mlp.up_proj.qzeros", "model.layers.27.self_attn.k_proj.qzeros", "model.layers.27.self_attn.o_proj.qzeros", "model.layers.27.self_attn.q_proj.qzeros", "model.layers.27.self_attn.v_proj.qzeros", "model.layers.27.mlp.down_proj.qzeros", "model.layers.27.mlp.gate_proj.qzeros", "model.layers.27.mlp.up_proj.qzeros", "model.layers.28.self_attn.k_proj.qzeros", "model.layers.28.self_attn.o_proj.qzeros", "model.layers.28.self_attn.q_proj.qzeros", "model.layers.28.self_attn.v_proj.qzeros", "model.layers.28.mlp.down_proj.qzeros", "model.layers.28.mlp.gate_proj.qzeros", "model.layers.28.mlp.up_proj.qzeros", "model.layers.29.self_attn.k_proj.qzeros", "model.layers.29.self_attn.o_proj.qzeros", "model.layers.29.self_attn.q_proj.qzeros", "model.layers.29.self_attn.v_proj.qzeros", "model.layers.29.mlp.down_proj.qzeros", "model.layers.29.mlp.gate_proj.qzeros", "model.layers.29.mlp.up_proj.qzeros", "model.layers.30.self_attn.k_proj.qzeros", "model.layers.30.self_attn.o_proj.qzeros", "model.layers.30.self_attn.q_proj.qzeros", "model.layers.30.self_attn.v_proj.qzeros", "model.layers.30.mlp.down_proj.qzeros", "model.layers.30.mlp.gate_proj.qzeros", "model.layers.30.mlp.up_proj.qzeros", "model.layers.31.self_attn.k_proj.qzeros", "model.layers.31.self_attn.o_proj.qzeros", "model.layers.31.self_attn.q_proj.qzeros", "model.layers.31.self_attn.v_proj.qzeros", "model.layers.31.mlp.down_proj.qzeros", "model.layers.31.mlp.gate_proj.qzeros", "model.layers.31.mlp.up_proj.qzeros".
        Unexpected key(s) in state_dict: "model.layers.32.self_attn.q_proj.zeros", "model.layers.32.self_attn.q_proj.scales", "model.layers.32.self_attn.q_proj.bias", "model.layers.32.self_attn.q_proj.qweight", "model.layers.32.self_attn.k_proj.zeros", "model.layers.32.self_attn.k_proj.scales", "model.layers.32.self_attn.k_proj.bias", "model.layers.32.self_attn.k_proj.qweight", "model.layers.32.self_attn.v_proj.zeros", "model.layers.32.self_attn.v_proj.scales", "model.layers.32.self_attn.v_proj.bias", "model.layers.32.self_attn.v_proj.qweight", "model.layers.32.self_attn.o_proj.zeros", "model.layers.32.self_attn.o_proj.scales", "model.layers.32.self_attn.o_proj.bias", "model.layers.32.self_attn.o_proj.qweight", "model.layers.32.self_attn.rotary_emb.inv_freq", "model.layers.32.mlp.gate_proj.zeros", "model.layers.32.mlp.gate_proj.scales", "model.layers.32.mlp.gate_proj.bias", "model.layers.32.mlp.gate_proj.qweight", "model.layers.32.mlp.down_proj.zeros", "model.layers.32.mlp.down_proj.scales", "model.layers.32.mlp.down_proj.bias", "model.layers.32.mlp.down_proj.qweight", "model.layers.32.mlp.up_proj.zeros", "model.layers.32.mlp.up_proj.scales", "model.layers.32.mlp.up_proj.bias", "model.layers.32.mlp.up_proj.qweight", "model.layers.32.input_layernorm.weight", "model.layers.32.post_attention_layernorm.weight", "model.layers.33.self_attn.q_proj.zeros", "model.layers.33.self_attn.q_proj.scales", "model.layers.33.self_attn.q_proj.bias", "model.layers.33.self_attn.q_proj.qweight", "model.layers.33.self_attn.k_proj.zeros", "model.layers.33.self_attn.k_proj.scales", "model.layers.33.self_attn.k_proj.bias", "model.layers.33.self_attn.k_proj.qweight", "model.layers.33.self_attn.v_proj.zeros", "model.layers.33.self_attn.v_proj.scales", "model.layers.33.self_attn.v_proj.bias", "model.layers.33.self_attn.v_proj.qweight", "model.layers.33.self_attn.o_proj.zeros", "model.layers.33.self_attn.o_proj.scales", "model.layers.33.self_attn.o_proj.bias", "model.layers.33.self_attn.o_proj.qweight", "model.layers.33.self_attn.rotary_emb.inv_freq", "model.layers.33.mlp.gate_proj.zeros", "model.layers.33.mlp.gate_proj.scales", "model.layers.33.mlp.gate_proj.bias", "model.layers.33.mlp.gate_proj.qweight", "model.layers.33.mlp.down_proj.zeros", "model.layers.33.mlp.down_proj.scales", "model.layers.33.mlp.down_proj.bias", "model.layers.33.mlp.down_proj.qweight", "model.layers.33.mlp.up_proj.zeros", "model.layers.33.mlp.up_proj.scales", "model.layers.33.mlp.up_proj.bias", "model.layers.33.mlp.up_proj.qweight", "model.layers.33.input_layernorm.weight", "model.layers.33.post_attention_layernorm.weight", "model.layers.34.self_attn.q_proj.zeros", "model.layers.34.self_attn.q_proj.scales", "model.layers.34.self_attn.q_proj.bias", "model.layers.34.self_attn.q_proj.qweight", "model.layers.34.self_attn.k_proj.zeros", "model.layers.34.self_attn.k_proj.scales", "model.layers.34.self_attn.k_proj.bias", "model.layers.34.self_attn.k_proj.qweight", "model.layers.34.self_attn.v_proj.zeros", "model.layers.34.self_attn.v_proj.scales", "model.layers.34.self_attn.v_proj.bias", "model.layers.34.self_attn.v_proj.qweight", "model.layers.34.self_attn.o_proj.zeros", "model.layers.34.self_attn.o_proj.scales", "model.layers.34.self_attn.o_proj.bias", "model.layers.34.self_attn.o_proj.qweight", "model.layers.34.self_attn.rotary_emb.inv_freq", "model.layers.34.mlp.gate_proj.zeros", "model.layers.34.mlp.gate_proj.scales", "model.layers.34.mlp.gate_proj.bias", "model.layers.34.mlp.gate_proj.qweight", "model.layers.34.mlp.down_proj.zeros", 
"model.layers.34.mlp.down_proj.scales", "model.layers.34.mlp.down_proj.bias", "model.layers.34.mlp.down_proj.qweight", "model.layers.34.mlp.up_proj.zeros", "model.layers.34.mlp.up_proj.scales", "model.layers.34.mlp.up_proj.bias", "model.layers.34.mlp.up_proj.qweight", "model.layers.34.input_layernorm.weight", "model.layers.34.post_attention_layernorm.weight", "model.layers.35.self_attn.q_proj.zeros", "model.layers.35.self_attn.q_proj.scales", "model.layers.35.self_attn.q_proj.bias", "model.layers.35.self_attn.q_proj.qweight", "model.layers.35.self_attn.k_proj.zeros", "model.layers.35.self_attn.k_proj.scales", "model.layers.35.self_attn.k_proj.bias", "model.layers.35.self_attn.k_proj.qweight", "model.layers.35.self_attn.v_proj.zeros", "model.layers.35.self_attn.v_proj.scales", "model.layers.35.self_attn.v_proj.bias", "model.layers.35.self_attn.v_proj.qweight", "model.layers.35.self_attn.o_proj.zeros", "model.layers.35.self_attn.o_proj.scales", "model.layers.35.self_attn.o_proj.bias", "model.layers.35.self_attn.o_proj.qweight", "model.layers.35.self_attn.rotary_emb.inv_freq", "model.layers.35.mlp.gate_proj.zeros", "model.layers.35.mlp.gate_proj.scales", "model.layers.35.mlp.gate_proj.bias", "model.layers.35.mlp.gate_proj.qweight", "model.layers.35.mlp.down_proj.zeros", "model.layers.35.mlp.down_proj.scales", "model.layers.35.mlp.down_proj.bias", "model.layers.35.mlp.down_proj.qweight", "model.layers.35.mlp.up_proj.zeros", "model.layers.35.mlp.up_proj.scales", "model.layers.35.mlp.up_proj.bias", "model.layers.35.mlp.up_proj.qweight", "model.layers.35.input_layernorm.weight", "model.layers.35.post_attention_layernorm.weight", "model.layers.36.self_attn.q_proj.zeros", "model.layers.36.self_attn.q_proj.scales", "model.layers.36.self_attn.q_proj.bias", "model.layers.36.self_attn.q_proj.qweight", "model.layers.36.self_attn.k_proj.zeros", "model.layers.36.self_attn.k_proj.scales", "model.layers.36.self_attn.k_proj.bias", "model.layers.36.self_attn.k_proj.qweight", "model.layers.36.self_attn.v_proj.zeros", "model.layers.36.self_attn.v_proj.scales", "model.layers.36.self_attn.v_proj.bias", "model.layers.36.self_attn.v_proj.qweight", "model.layers.36.self_attn.o_proj.zeros", "model.layers.36.self_attn.o_proj.scales", "model.layers.36.self_attn.o_proj.bias", "model.layers.36.self_attn.o_proj.qweight", "model.layers.36.self_attn.rotary_emb.inv_freq", "model.layers.36.mlp.gate_proj.zeros", "model.layers.36.mlp.gate_proj.scales", "model.layers.36.mlp.gate_proj.bias", "model.layers.36.mlp.gate_proj.qweight", "model.layers.36.mlp.down_proj.zeros", "model.layers.36.mlp.down_proj.scales", "model.layers.36.mlp.down_proj.bias", "model.layers.36.mlp.down_proj.qweight", "model.layers.36.mlp.up_proj.zeros", "model.layers.36.mlp.up_proj.scales", "model.layers.36.mlp.up_proj.bias", "model.layers.36.mlp.up_proj.qweight", "model.layers.36.input_layernorm.weight", "model.layers.36.post_attention_layernorm.weight", "model.layers.37.self_attn.q_proj.zeros", "model.layers.37.self_attn.q_proj.scales", "model.layers.37.self_attn.q_proj.bias", "model.layers.37.self_attn.q_proj.qweight", "model.layers.37.self_attn.k_proj.zeros", "model.layers.37.self_attn.k_proj.scales", "model.layers.37.self_attn.k_proj.bias", "model.layers.37.self_attn.k_proj.qweight", "model.layers.37.self_attn.v_proj.zeros", "model.layers.37.self_attn.v_proj.scales", "model.layers.37.self_attn.v_proj.bias", "model.layers.37.self_attn.v_proj.qweight", "model.layers.37.self_attn.o_proj.zeros", "model.layers.37.self_attn.o_proj.scales", 
"model.layers.37.self_attn.o_proj.bias", "model.layers.37.self_attn.o_proj.qweight", "model.layers.37.self_attn.rotary_emb.inv_freq", "model.layers.37.mlp.gate_proj.zeros", "model.layers.37.mlp.gate_proj.scales", "model.layers.37.mlp.gate_proj.bias", "model.layers.37.mlp.gate_proj.qweight", "model.layers.37.mlp.down_proj.zeros", "model.layers.37.mlp.down_proj.scales", "model.layers.37.mlp.down_proj.bias", "model.layers.37.mlp.down_proj.qweight", "model.layers.37.mlp.up_proj.zeros", "model.layers.37.mlp.up_proj.scales", "model.layers.37.mlp.up_proj.bias", "model.layers.37.mlp.up_proj.qweight", "model.layers.37.input_layernorm.weight", "model.layers.37.post_attention_layernorm.weight", "model.layers.38.self_attn.q_proj.zeros", "model.layers.38.self_attn.q_proj.scales", "model.layers.38.self_attn.q_proj.bias", "model.layers.38.self_attn.q_proj.qweight", "model.layers.38.self_attn.k_proj.zeros", "model.layers.38.self_attn.k_proj.scales", "model.layers.38.self_attn.k_proj.bias", "model.layers.38.self_attn.k_proj.qweight", "model.layers.38.self_attn.v_proj.zeros", "model.layers.38.self_attn.v_proj.scales", "model.layers.38.self_attn.v_proj.bias", "model.layers.38.self_attn.v_proj.qweight", "model.layers.38.self_attn.o_proj.zeros", "model.layers.38.self_attn.o_proj.scales", "model.layers.38.self_attn.o_proj.bias", "model.layers.38.self_attn.o_proj.qweight", "model.layers.38.self_attn.rotary_emb.inv_freq", "model.layers.38.mlp.gate_proj.zeros", "model.layers.38.mlp.gate_proj.scales", "model.layers.38.mlp.gate_proj.bias", "model.layers.38.mlp.gate_proj.qweight", "model.layers.38.mlp.down_proj.zeros", "model.layers.38.mlp.down_proj.scales", "model.layers.38.mlp.down_proj.bias", "model.layers.38.mlp.down_proj.qweight", "model.layers.38.mlp.up_proj.zeros", "model.layers.38.mlp.up_proj.scales", "model.layers.38.mlp.up_proj.bias", "model.layers.38.mlp.up_proj.qweight", "model.layers.38.input_layernorm.weight", "model.layers.38.post_attention_layernorm.weight", "model.layers.39.self_attn.q_proj.zeros", "model.layers.39.self_attn.q_proj.scales", "model.layers.39.self_attn.q_proj.bias", "model.layers.39.self_attn.q_proj.qweight", "model.layers.39.self_attn.k_proj.zeros", "model.layers.39.self_attn.k_proj.scales", "model.layers.39.self_attn.k_proj.bias", "model.layers.39.self_attn.k_proj.qweight", "model.layers.39.self_attn.v_proj.zeros", "model.layers.39.self_attn.v_proj.scales", "model.layers.39.self_attn.v_proj.bias", "model.layers.39.self_attn.v_proj.qweight", "model.layers.39.self_attn.o_proj.zeros", "model.layers.39.self_attn.o_proj.scales", "model.layers.39.self_attn.o_proj.bias", "model.layers.39.self_attn.o_proj.qweight", "model.layers.39.self_attn.rotary_emb.inv_freq", "model.layers.39.mlp.gate_proj.zeros", "model.layers.39.mlp.gate_proj.scales", "model.layers.39.mlp.gate_proj.bias", "model.layers.39.mlp.gate_proj.qweight", "model.layers.39.mlp.down_proj.zeros", "model.layers.39.mlp.down_proj.scales", "model.layers.39.mlp.down_proj.bias", "model.layers.39.mlp.down_proj.qweight", "model.layers.39.mlp.up_proj.zeros", "model.layers.39.mlp.up_proj.scales", "model.layers.39.mlp.up_proj.bias", "model.layers.39.mlp.up_proj.qweight", "model.layers.39.input_layernorm.weight", "model.layers.39.post_attention_layernorm.weight", "model.layers.0.self_attn.k_proj.zeros", "model.layers.0.self_attn.o_proj.zeros", "model.layers.0.self_attn.q_proj.zeros", "model.layers.0.self_attn.v_proj.zeros", "model.layers.0.mlp.down_proj.zeros", "model.layers.0.mlp.gate_proj.zeros", "model.layers.0.mlp.up_proj.zeros", 
"model.layers.1.self_attn.k_proj.zeros", "model.layers.1.self_attn.o_proj.zeros", "model.layers.1.self_attn.q_proj.zeros", "model.layers.1.self_attn.v_proj.zeros", "model.layers.1.mlp.down_proj.zeros", "model.layers.1.mlp.gate_proj.zeros", "model.layers.1.mlp.up_proj.zeros", "model.layers.2.self_attn.k_proj.zeros", "model.layers.2.self_attn.o_proj.zeros", "model.layers.2.self_attn.q_proj.zeros", "model.layers.2.self_attn.v_proj.zeros", "model.layers.2.mlp.down_proj.zeros", "model.layers.2.mlp.gate_proj.zeros", "model.layers.2.mlp.up_proj.zeros", "model.layers.3.self_attn.k_proj.zeros", "model.layers.3.self_attn.o_proj.zeros", "model.layers.3.self_attn.q_proj.zeros", "model.layers.3.self_attn.v_proj.zeros", "model.layers.3.mlp.down_proj.zeros", "model.layers.3.mlp.gate_proj.zeros", "model.layers.3.mlp.up_proj.zeros", "model.layers.4.self_attn.k_proj.zeros", "model.layers.4.self_attn.o_proj.zeros", "model.layers.4.self_attn.q_proj.zeros", "model.layers.4.self_attn.v_proj.zeros", "model.layers.4.mlp.down_proj.zeros", "model.layers.4.mlp.gate_proj.zeros", "model.layers.4.mlp.up_proj.zeros", "model.layers.5.self_attn.k_proj.zeros", "model.layers.5.self_attn.o_proj.zeros", "model.layers.5.self_attn.q_proj.zeros", "model.layers.5.self_attn.v_proj.zeros", "model.layers.5.mlp.down_proj.zeros", "model.layers.5.mlp.gate_proj.zeros", "model.layers.5.mlp.up_proj.zeros", "model.layers.6.self_attn.k_proj.zeros", "model.layers.6.self_attn.o_proj.zeros", "model.layers.6.self_attn.q_proj.zeros", "model.layers.6.self_attn.v_proj.zeros", "model.layers.6.mlp.down_proj.zeros", "model.layers.6.mlp.gate_proj.zeros", "model.layers.6.mlp.up_proj.zeros", "model.layers.7.self_attn.k_proj.zeros", "model.layers.7.self_attn.o_proj.zeros", "model.layers.7.self_attn.q_proj.zeros", "model.layers.7.self_attn.v_proj.zeros", "model.layers.7.mlp.down_proj.zeros", "model.layers.7.mlp.gate_proj.zeros", "model.layers.7.mlp.up_proj.zeros", "model.layers.8.self_attn.k_proj.zeros", "model.layers.8.self_attn.o_proj.zeros", "model.layers.8.self_attn.q_proj.zeros", "model.layers.8.self_attn.v_proj.zeros", "model.layers.8.mlp.down_proj.zeros", "model.layers.8.mlp.gate_proj.zeros", "model.layers.8.mlp.up_proj.zeros", "model.layers.9.self_attn.k_proj.zeros", "model.layers.9.self_attn.o_proj.zeros", "model.layers.9.self_attn.q_proj.zeros", "model.layers.9.self_attn.v_proj.zeros", "model.layers.9.mlp.down_proj.zeros", "model.layers.9.mlp.gate_proj.zeros", "model.layers.9.mlp.up_proj.zeros", "model.layers.10.self_attn.k_proj.zeros", "model.layers.10.self_attn.o_proj.zeros", "model.layers.10.self_attn.q_proj.zeros", "model.layers.10.self_attn.v_proj.zeros", "model.layers.10.mlp.down_proj.zeros", "model.layers.10.mlp.gate_proj.zeros", "model.layers.10.mlp.up_proj.zeros", "model.layers.11.self_attn.k_proj.zeros", "model.layers.11.self_attn.o_proj.zeros", "model.layers.11.self_attn.q_proj.zeros", "model.layers.11.self_attn.v_proj.zeros", "model.layers.11.mlp.down_proj.zeros", "model.layers.11.mlp.gate_proj.zeros", "model.layers.11.mlp.up_proj.zeros", "model.layers.12.self_attn.k_proj.zeros", "model.layers.12.self_attn.o_proj.zeros", "model.layers.12.self_attn.q_proj.zeros", "model.layers.12.self_attn.v_proj.zeros", "model.layers.12.mlp.down_proj.zeros", "model.layers.12.mlp.gate_proj.zeros", "model.layers.12.mlp.up_proj.zeros", "model.layers.13.self_attn.k_proj.zeros", "model.layers.13.self_attn.o_proj.zeros", "model.layers.13.self_attn.q_proj.zeros", "model.layers.13.self_attn.v_proj.zeros", "model.layers.13.mlp.down_proj.zeros", 
"model.layers.13.mlp.gate_proj.zeros", "model.layers.13.mlp.up_proj.zeros", "model.layers.14.self_attn.k_proj.zeros", "model.layers.14.self_attn.o_proj.zeros", "model.layers.14.self_attn.q_proj.zeros", "model.layers.14.self_attn.v_proj.zeros", "model.layers.14.mlp.down_proj.zeros", "model.layers.14.mlp.gate_proj.zeros", "model.layers.14.mlp.up_proj.zeros", "model.layers.15.self_attn.k_proj.zeros", "model.layers.15.self_attn.o_proj.zeros", "model.layers.15.self_attn.q_proj.zeros", "model.layers.15.self_attn.v_proj.zeros", "model.layers.15.mlp.down_proj.zeros", "model.layers.15.mlp.gate_proj.zeros", "model.layers.15.mlp.up_proj.zeros", "model.layers.16.self_attn.k_proj.zeros", "model.layers.16.self_attn.o_proj.zeros", "model.layers.16.self_attn.q_proj.zeros", "model.layers.16.self_attn.v_proj.zeros", "model.layers.16.mlp.down_proj.zeros", "model.layers.16.mlp.gate_proj.zeros", "model.layers.16.mlp.up_proj.zeros", "model.layers.17.self_attn.k_proj.zeros", "model.layers.17.self_attn.o_proj.zeros", "model.layers.17.self_attn.q_proj.zeros", "model.layers.17.self_attn.v_proj.zeros", "model.layers.17.mlp.down_proj.zeros", "model.layers.17.mlp.gate_proj.zeros", "model.layers.17.mlp.up_proj.zeros", "model.layers.18.self_attn.k_proj.zeros", "model.layers.18.self_attn.o_proj.zeros", "model.layers.18.self_attn.q_proj.zeros", "model.layers.18.self_attn.v_proj.zeros", "model.layers.18.mlp.down_proj.zeros", "model.layers.18.mlp.gate_proj.zeros", "model.layers.18.mlp.up_proj.zeros", "model.layers.19.self_attn.k_proj.zeros", "model.layers.19.self_attn.o_proj.zeros", "model.layers.19.self_attn.q_proj.zeros", "model.layers.19.self_attn.v_proj.zeros", "model.layers.19.mlp.down_proj.zeros", "model.layers.19.mlp.gate_proj.zeros", "model.layers.19.mlp.up_proj.zeros", "model.layers.20.self_attn.k_proj.zeros", "model.layers.20.self_attn.o_proj.zeros", "model.layers.20.self_attn.q_proj.zeros", "model.layers.20.self_attn.v_proj.zeros", "model.layers.20.mlp.down_proj.zeros", "model.layers.20.mlp.gate_proj.zeros", "model.layers.20.mlp.up_proj.zeros", "model.layers.21.self_attn.k_proj.zeros", "model.layers.21.self_attn.o_proj.zeros", "model.layers.21.self_attn.q_proj.zeros", "model.layers.21.self_attn.v_proj.zeros", "model.layers.21.mlp.down_proj.zeros", "model.layers.21.mlp.gate_proj.zeros", "model.layers.21.mlp.up_proj.zeros", "model.layers.22.self_attn.k_proj.zeros", "model.layers.22.self_attn.o_proj.zeros", "model.layers.22.self_attn.q_proj.zeros", "model.layers.22.self_attn.v_proj.zeros", "model.layers.22.mlp.down_proj.zeros", "model.layers.22.mlp.gate_proj.zeros", "model.layers.22.mlp.up_proj.zeros", "model.layers.23.self_attn.k_proj.zeros", "model.layers.23.self_attn.o_proj.zeros", "model.layers.23.self_attn.q_proj.zeros", "model.layers.23.self_attn.v_proj.zeros", "model.layers.23.mlp.down_proj.zeros", "model.layers.23.mlp.gate_proj.zeros", "model.layers.23.mlp.up_proj.zeros", "model.layers.24.self_attn.k_proj.zeros", "model.layers.24.self_attn.o_proj.zeros", "model.layers.24.self_attn.q_proj.zeros", "model.layers.24.self_attn.v_proj.zeros", "model.layers.24.mlp.down_proj.zeros", "model.layers.24.mlp.gate_proj.zeros", "model.layers.24.mlp.up_proj.zeros", "model.layers.25.self_attn.k_proj.zeros", "model.layers.25.self_attn.o_proj.zeros", "model.layers.25.self_attn.q_proj.zeros", "model.layers.25.self_attn.v_proj.zeros", "model.layers.25.mlp.down_proj.zeros", "model.layers.25.mlp.gate_proj.zeros", "model.layers.25.mlp.up_proj.zeros", "model.layers.26.self_attn.k_proj.zeros", 
"model.layers.26.self_attn.o_proj.zeros", "model.layers.26.self_attn.q_proj.zeros", "model.layers.26.self_attn.v_proj.zeros", "model.layers.26.mlp.down_proj.zeros", "model.layers.26.mlp.gate_proj.zeros", "model.layers.26.mlp.up_proj.zeros", "model.layers.27.self_attn.k_proj.zeros", "model.layers.27.self_attn.o_proj.zeros", "model.layers.27.self_attn.q_proj.zeros", "model.layers.27.self_attn.v_proj.zeros", "model.layers.27.mlp.down_proj.zeros", "model.layers.27.mlp.gate_proj.zeros", "model.layers.27.mlp.up_proj.zeros", "model.layers.28.self_attn.k_proj.zeros", "model.layers.28.self_attn.o_proj.zeros", "model.layers.28.self_attn.q_proj.zeros", "model.layers.28.self_attn.v_proj.zeros", "model.layers.28.mlp.down_proj.zeros", "model.layers.28.mlp.gate_proj.zeros", "model.layers.28.mlp.up_proj.zeros", "model.layers.29.self_attn.k_proj.zeros", "model.layers.29.self_attn.o_proj.zeros", "model.layers.29.self_attn.q_proj.zeros", "model.layers.29.self_attn.v_proj.zeros", "model.layers.29.mlp.down_proj.zeros", "model.layers.29.mlp.gate_proj.zeros", "model.layers.29.mlp.up_proj.zeros", "model.layers.30.self_attn.k_proj.zeros", "model.layers.30.self_attn.o_proj.zeros", "model.layers.30.self_attn.q_proj.zeros", "model.layers.30.self_attn.v_proj.zeros", "model.layers.30.mlp.down_proj.zeros", "model.layers.30.mlp.gate_proj.zeros", "model.layers.30.mlp.up_proj.zeros", "model.layers.31.self_attn.k_proj.zeros", "model.layers.31.self_attn.o_proj.zeros", "model.layers.31.self_attn.q_proj.zeros", "model.layers.31.self_attn.v_proj.zeros", "model.layers.31.mlp.down_proj.zeros", "model.layers.31.mlp.gate_proj.zeros", "model.layers.31.mlp.up_proj.zeros".
        size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32000, 5120]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
        size mismatch for model.layers.0.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.0.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.0.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.0.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.0.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.0.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.0.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.0.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.0.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.0.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.0.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.0.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.0.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.0.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.0.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.0.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.0.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.0.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.0.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.0.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.0.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.0.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.0.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.1.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.1.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.1.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.1.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.1.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.1.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.1.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.1.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.1.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.1.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.1.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.1.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.1.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.1.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.1.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.1.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.1.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.1.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.1.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.1.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.1.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.1.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.1.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.2.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.2.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.2.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.2.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.2.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.2.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.2.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.2.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.2.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.2.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.2.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.2.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.2.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.2.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.2.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.2.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.2.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.2.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.2.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.2.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.2.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.2.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.2.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.3.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.3.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.3.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.3.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.3.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.3.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.3.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.3.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.3.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.3.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.3.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.3.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.3.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.3.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.3.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.3.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.3.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.3.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.3.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.3.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.3.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.3.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.3.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        [... identical size mismatches repeat for model.layers.4 through model.layers.18 and every remaining layer: each checkpoint tensor arrives with the 5120 / 13824 shapes while the current model expects 4096 / 11008 ...]
        size mismatch for model.layers.18.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.18.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.18.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.18.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.18.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.18.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.18.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.18.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.18.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.18.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.18.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.18.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.18.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.18.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.18.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.18.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.18.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.18.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.18.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.18.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.19.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.19.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.19.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.19.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.19.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.19.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.19.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.19.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.19.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.19.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.19.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.19.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.19.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.19.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.19.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.19.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.19.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.19.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.19.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.19.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.19.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.19.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.19.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.20.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.20.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.20.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.20.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.20.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.20.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.20.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.20.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.20.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.20.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.20.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.20.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.20.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.20.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.20.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.20.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.20.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.20.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.20.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.20.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.20.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.20.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.20.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.21.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.21.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.21.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.21.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.21.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.21.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.21.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.21.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.21.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.21.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.21.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.21.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.21.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.21.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.21.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.21.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.21.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.21.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.21.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.21.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.21.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.21.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.21.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.22.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.22.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.22.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.22.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.22.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.22.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.22.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.22.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.22.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.22.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.22.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.22.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.22.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.22.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.22.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.22.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.22.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.22.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.22.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.22.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.22.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.22.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.22.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.23.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.23.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.23.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.23.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.23.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.23.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.23.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.23.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.23.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.23.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.23.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.23.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.23.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.23.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.23.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.23.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.23.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.23.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.23.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.23.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.23.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.23.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.23.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.24.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.24.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.24.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.24.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.24.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.24.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.24.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.24.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.24.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.24.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.24.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.24.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.24.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.24.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.24.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.24.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.24.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.24.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.24.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.24.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.24.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.24.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.24.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.25.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.25.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.25.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.25.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.25.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.25.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.25.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.25.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.25.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.25.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.25.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.25.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.25.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.25.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.25.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.25.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.25.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.25.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.25.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.25.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.25.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.25.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.25.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.26.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.26.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.26.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.26.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.26.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.26.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.26.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.26.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.26.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.26.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.26.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.26.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.26.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.26.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.26.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.26.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.26.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.26.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.26.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.26.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.26.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.26.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.26.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.27.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.27.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.27.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.27.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.27.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.27.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.27.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.27.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.27.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.27.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.27.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.27.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.27.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.27.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.27.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.27.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.27.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.27.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.27.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.27.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.27.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.27.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.27.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.28.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.28.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.28.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.28.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.28.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.28.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.28.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.28.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.28.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.28.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.28.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.28.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.28.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.28.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.28.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.28.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.28.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.28.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.28.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.28.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.28.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.28.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.28.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.29.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.29.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.29.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.29.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.29.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.29.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.29.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.29.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.29.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.29.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.29.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.29.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.29.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.29.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.29.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.29.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.29.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.29.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.29.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.29.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.29.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.29.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.29.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.30.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.30.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.30.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.30.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.30.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.30.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.30.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.30.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.30.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.30.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.30.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.30.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.30.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.30.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.30.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.30.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.30.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.30.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.30.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.30.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.30.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.30.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.30.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.31.self_attn.k_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.31.self_attn.k_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.31.self_attn.k_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.31.self_attn.o_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.31.self_attn.o_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.31.self_attn.o_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.31.self_attn.q_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.31.self_attn.q_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.31.self_attn.q_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.31.self_attn.v_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.31.self_attn.v_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.31.self_attn.v_proj.qweight: copying a param with shape torch.Size([640, 5120]) from checkpoint, the shape in current model is torch.Size([512, 4096]).
        size mismatch for model.layers.31.mlp.down_proj.scales: copying a param with shape torch.Size([5120, 1]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for model.layers.31.mlp.down_proj.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.31.mlp.down_proj.qweight: copying a param with shape torch.Size([1728, 5120]) from checkpoint, the shape in current model is torch.Size([1376, 4096]).
        size mismatch for model.layers.31.mlp.gate_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.31.mlp.gate_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.31.mlp.gate_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.31.mlp.up_proj.scales: copying a param with shape torch.Size([13824, 1]) from checkpoint, the shape in current model is torch.Size([1, 11008]).
        size mismatch for model.layers.31.mlp.up_proj.bias: copying a param with shape torch.Size([13824]) from checkpoint, the shape in current model is torch.Size([11008]).
        size mismatch for model.layers.31.mlp.up_proj.qweight: copying a param with shape torch.Size([640, 13824]) from checkpoint, the shape in current model is torch.Size([512, 11008]).
        size mismatch for model.layers.31.input_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.layers.31.post_attention_layernorm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for model.norm.weight: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([4096]).
        size mismatch for lm_head.weight: copying a param with shape torch.Size([32000, 5120]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
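
Every mismatched tensor above has the 13B dimensions in the checkpoint (hidden size 5120, MLP intermediate size 13824) while the instantiated model uses the 7B config (4096 and 11008), so the .pt file GPTQ_loader picked up for llama-7b-hf-int4 looks like it is actually a 13B quantization. As a minimal sketch for checking which size a checkpoint was quantized from (the path is a placeholder for whatever pt_path resolved to):

import torch

# Placeholder path: substitute the pt_path that GPTQ_loader resolved.
sd = torch.load("models/llama-7b-4bit.pt", map_location="cpu")

# The embedding table is [vocab_size, hidden_size]; hidden_size identifies
# the model: 4096 -> 7B, 5120 -> 13B, 6656 -> 30B, 8192 -> 65B.
hidden = sd["model.embed_tokens.weight"].shape[1]
sizes = {4096: "7B", 5120: "13B", 6656: "30B", 8192: "65B"}
print(f"hidden size {hidden} -> LLaMA {sizes.get(hidden, 'unknown')} checkpoint")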


bartman081523 commented Apr 5, 2023

I have this error too with all LLaMA-type models in 4-bit mode. I updated today and reinstalled GPTQ from @oobabooga and transformers from the Hugging Face GitHub. The full trace is big; the relevant parts are:

RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.qzeros", "model.layers.0.self_attn.o_proj.qzeros", "model.layers.0.self_attn.q_proj.qzeros", "model.layers.0.self_attn.v_proj.qzeros",
...
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32000, 5120]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).

I think this is the info that helped me: #734 (comment)
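
If I read that trace right, the missing qzeros keys mean the checkpoint was quantized before GPTQ-for-LLaMa started emitting qzeros tensors, so the current loader cannot use it and the file has to be re-quantized (or loaded with a matching older commit); the embed_tokens shape mismatch additionally shows a 13B checkpoint being loaded with the 7B config, as above. A minimal sketch to check a file for the newer format (the path is a placeholder):

import torch

# Placeholder path: point this at the 4-bit .pt checkpoint in question.
sd = torch.load("llama-7b-4bit.pt", map_location="cpu")

# Newer GPTQ checkpoints store per-group zero points as *.qzeros tensors;
# their absence means the file predates the format the loader expects.
print("qzeros present:", any(k.endswith(".qzeros") for k in sd))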
