Describe the bug
The issue is that, while using any 4-bit model (LLaMA, Alpaca, etc.) across two GPUs, one of two errors occurs when generating a message, depending on which version of GPTQ-for-LLaMa is used.
This happens with both the newest and the "older" models (older meaning quantized with group size, but not with the latest quantization method). For the older models, I used the models from #530 (comment)
For the new models, I used the models from https://huggingface.co/Neko-Institute-of-Science
If using ooba GPTQ, the error is "TypeError: vecquant4matmul(): incompatible function arguments."; it generates just one token and then stops working. GPTQ build used: https://github.com/oobabooga/GPTQ-for-LLaMa
If using qwopqwop200 GPTQ, the error is "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!". GPTQ build used: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda
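For context: the qwopqwop200 RuntimeError is PyTorch's generic cross-device error, raised whenever one tensor (a weight, an activation, or a quantization buffer) sits on cuda:0 while another sits on cuda:1. A minimal sketch reproducing and fixing that error class (illustrative only, not GPTQ internals; it assumes a machine with two CUDA devices):

```python
import torch

# Two operands on different GPUs: any op that combines them raises
# "Expected all tensors to be on the same device, ...".
a = torch.randn(4, 4, device="cuda:0")
b = torch.randn(4, 4, device="cuda:1")

try:
    _ = a @ b
except RuntimeError as e:
    print(e)

# The generic fix: move one operand onto the other's device first.
c = a @ b.to(a.device)
print(c.device)  # cuda:0
```

The ooba-fork TypeError is pybind11's message for calling a compiled extension with arguments it was not built for, which as far as I understand usually points at a stale or mismatched quant_cuda build rather than at the model files.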
Is there an existing issue for this?
I have searched the existing issues
Reproduction
Use any 4-bit model on 2 GPUs and the issue should happen with either ooba GPTQ or qwopqwop200 GPTQ.
(For example, run python server.py --chat --extensions api --listen --wbits 4 --listen-port 7990 --gpu-memory 10 10 and then choose any 4-bit 30B model in the web UI, or use --gpu-memory 5 5 and any 4-bit 13B model.)
Then try to generate any message or impersonate, and the error should appear.
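For reference, the --gpu-memory 10 10 flag above sets a per-GPU memory cap; a minimal sketch of what such a split looks like with the standard transformers/accelerate loading path (the model path is a placeholder, and this is a plain load, not the GPTQ loader):

```python
from transformers import AutoModelForCausalLM

# Hypothetical illustration: cap each GPU at 10 GiB and let accelerate
# decide which layers go to cuda:0 and which to cuda:1.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",                      # placeholder path
    device_map="auto",                    # automatic layer placement
    max_memory={0: "10GiB", 1: "10GiB"},  # per-device caps, like --gpu-memory 10 10
)
```

With a split like this, every quantized matmul kernel has to receive tensors that live on that layer's own device, which is exactly the invariant the errors above show being violated.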
Screenshot
For the ooba GPTQ, this is the issue: (screenshot of the "TypeError: vecquant4matmul(): incompatible function arguments." traceback)
For the qwopqwop200 GPTQ, this is the issue: (screenshot of the "RuntimeError: Expected all tensors to be on the same device" traceback)
Logs
System Info
Hi there, how did you manage to specify the GPU you want to use? I know it is something related to "Device: cuda:0" or "Device='cuda:1'", but where do you define that value? Or do you pass it as a parameter?
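For general reference, explicit device placement in plain PyTorch looks like the sketch below; this is generic PyTorch, not a flag of this web UI:

```python
import torch

# Pick a device explicitly; "cuda:1" is the second GPU.
device = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cuda:0")

x = torch.randn(2, 2, device=device)  # create a tensor directly on that GPU
y = x.to(device)                      # or move an existing tensor onto it
print(y.device)
```

Another common approach is to launch the process with the CUDA_VISIBLE_DEVICES environment variable set (for example CUDA_VISIBLE_DEVICES=1), so that only one GPU is visible to the process and it shows up as cuda:0.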