[BUG] Failed to start the LLM service in LoRA mode #1110

Closed
jackaihfia2334 opened this issue Aug 15, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@jackaihfia2334

Following section 5.1.3 of the README to load the LLM via LoRA, I ran this command:
PEFT_SHARE_BASE_WEIGHTS=true python3 -m fastchat.serve.multi_model_worker \
--model-path /data2/peft_chapi \
--model-names peft-dummy-1 \
--num-gpus 1

The error is as follows:
root@docker-desktop:/data1/llm/code/Langchain-Chatchat# PEFT_SHARE_BASE_WEIGHTS=true python3 -m fastchat.serve.multi_model_worker
--model-path /data2/peft_chapi
--model-names peft-dummy-1
--num-gpus 1
2023-08-15 08:46:30 | INFO | model_worker | args: Namespace(host='localhost', port=21002, worker_address='http://localhost:21002', controller_address='http://localhost:21001', revision='main', device='cuda', gpus=None, num_gpus=1, max_gpu_memory=None, load_8bit=False, cpu_offloading=False, gptq_ckpt=None, gptq_wbits=16, gptq_groupsize=-1, gptq_act_order=False, awq_ckpt=None, awq_wbits=16, awq_groupsize=-1, model_path=['/data2/peft_chapi'], model_names=[['peft-dummy-1']], limit_worker_concurrency=5, stream_interval=2, no_register=False)
2023-08-15 08:46:30 | INFO | model_worker | Loading the model ['peft-dummy-1'] on worker b2222c56 ...
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 14%|██████████████████████▎ | 1/7 [00:17<01:47, 17.87s/it]
Loading checkpoint shards: 29%|████████████████████████████████████████████▌ | 2/7 [00:37<01:34, 18.91s/it]
Loading checkpoint shards: 43%|██████████████████████████████████████████████████████████████████▊ | 3/7 [00:56<01:15, 18.98s/it]
Loading checkpoint shards: 57%|█████████████████████████████████████████████████████████████████████████████████████████▏ | 4/7 [01:14<00:55, 18.50s/it]
Loading checkpoint shards: 71%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 5/7 [01:34<00:37, 18.95s/it]
Loading checkpoint shards: 86%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 6/7 [01:53<00:19, 19.07s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [02:04<00:00, 16.34s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [02:04<00:00, 17.73s/it]
2023-08-15 08:48:34 | ERROR | stderr |
2023-08-15 08:48:42 | INFO | model_worker | Register to controller
2023-08-15 08:48:42 | ERROR | stderr | Traceback (most recent call last):
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 174, in _new_conn
2023-08-15 08:48:42 | ERROR | stderr | conn = connection.create_connection(
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 95, in create_connection
2023-08-15 08:48:42 | ERROR | stderr | raise err
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 85, in create_connection
2023-08-15 08:48:42 | ERROR | stderr | sock.connect(sa)
2023-08-15 08:48:42 | ERROR | stderr | ConnectionRefusedError: [Errno 111] Connection refused
2023-08-15 08:48:42 | ERROR | stderr |
2023-08-15 08:48:42 | ERROR | stderr | During handling of the above exception, another exception occurred:
2023-08-15 08:48:42 | ERROR | stderr |
2023-08-15 08:48:42 | ERROR | stderr | Traceback (most recent call last):
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 714, in urlopen
2023-08-15 08:48:42 | ERROR | stderr | httplib_response = self._make_request(
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 415, in _make_request
2023-08-15 08:48:42 | ERROR | stderr | conn.request(method, url, **httplib_request_kw)
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 244, in request
2023-08-15 08:48:42 | ERROR | stderr | super(HTTPConnection, self).request(method, url, body=body, headers=headers)
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/lib/python3.10/http/client.py", line 1282, in request
2023-08-15 08:48:42 | ERROR | stderr | self._send_request(method, url, body, headers, encode_chunked)
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/lib/python3.10/http/client.py", line 1328, in _send_request
2023-08-15 08:48:42 | ERROR | stderr | self.endheaders(body, encode_chunked=encode_chunked)
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/lib/python3.10/http/client.py", line 1277, in endheaders
2023-08-15 08:48:42 | ERROR | stderr | self._send_output(message_body, encode_chunked=encode_chunked)
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/lib/python3.10/http/client.py", line 1037, in _send_output
2023-08-15 08:48:42 | ERROR | stderr | self.send(msg)
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/lib/python3.10/http/client.py", line 975, in send
2023-08-15 08:48:42 | ERROR | stderr | self.connect()
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 205, in connect
2023-08-15 08:48:42 | ERROR | stderr | conn = self._new_conn()
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 186, in _new_conn
2023-08-15 08:48:42 | ERROR | stderr | raise NewConnectionError(
2023-08-15 08:48:42 | ERROR | stderr | urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fb3a0304b20>: Failed to establish a new connection: [Errno 111] Connection refused
2023-08-15 08:48:42 | ERROR | stderr |
2023-08-15 08:48:42 | ERROR | stderr | During handling of the above exception, another exception occurred:
2023-08-15 08:48:42 | ERROR | stderr |
2023-08-15 08:48:42 | ERROR | stderr | Traceback (most recent call last):
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 486, in send
2023-08-15 08:48:42 | ERROR | stderr | resp = conn.urlopen(
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 798, in urlopen
2023-08-15 08:48:42 | ERROR | stderr | retries = retries.increment(
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 592, in increment
2023-08-15 08:48:42 | ERROR | stderr | raise MaxRetryError(_pool, url, error or ResponseError(cause))
2023-08-15 08:48:42 | ERROR | stderr | urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=21001): Max retries exceeded with url: /register_worker (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb3a0304b20>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-08-15 08:48:42 | ERROR | stderr |
2023-08-15 08:48:42 | ERROR | stderr | During handling of the above exception, another exception occurred:
2023-08-15 08:48:42 | ERROR | stderr |
2023-08-15 08:48:42 | ERROR | stderr | Traceback (most recent call last):
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2023-08-15 08:48:42 | ERROR | stderr | return _run_code(code, main_globals, None,
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2023-08-15 08:48:42 | ERROR | stderr | exec(code, run_globals)
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/fastchat/serve/multi_model_worker.py", line 207, in <module>
2023-08-15 08:48:42 | ERROR | stderr | w = ModelWorker(
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/fastchat/serve/model_worker.py", line 225, in __init__
2023-08-15 08:48:42 | ERROR | stderr | self.init_heart_beat()
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/fastchat/serve/model_worker.py", line 93, in init_heart_beat
2023-08-15 08:48:42 | ERROR | stderr | self.register_to_controller()
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/fastchat/serve/model_worker.py", line 108, in register_to_controller
2023-08-15 08:48:42 | ERROR | stderr | r = requests.post(url, json=data)
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 115, in post
2023-08-15 08:48:42 | ERROR | stderr | return request("post", url, data=data, json=json, **kwargs)
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 59, in request
2023-08-15 08:48:42 | ERROR | stderr | return session.request(method=method, url=url, **kwargs)
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 589, in request
2023-08-15 08:48:42 | ERROR | stderr | resp = self.send(prep, **send_kwargs)
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 703, in send
2023-08-15 08:48:42 | ERROR | stderr | r = adapter.send(request, **kwargs)
2023-08-15 08:48:42 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 519, in send
2023-08-15 08:48:42 | ERROR | stderr | raise ConnectionError(e, request=request)
2023-08-15 08:48:42 | ERROR | stderr | requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=21001): Max retries exceeded with url: /register_worker (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb3a0304b20>: Failed to establish a new connection: [Errno 111] Connection refused'))
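
The last line of the traceback names the endpoint the worker could not reach. As a minimal sketch for reading logs like this one, the snippet below pulls the host, port, and URL path out of that line. The helper name `failing_endpoint` is made up for illustration, the regular expression only assumes the message layout shown in the log above, and the sample line is abbreviated.

```python
import re

# Matches the summary that appears in the final exception line of the log above.
_POOL_RE = re.compile(
    r"HTTPConnectionPool\(host='(?P<host>[^']+)', port=(?P<port>\d+)\): "
    r"Max retries exceeded with url: (?P<path>\S+)"
)


def failing_endpoint(log_line):
    """Return (host, port, path) of the unreachable endpoint, or None if absent."""
    match = _POOL_RE.search(log_line)
    if match is None:
        return None
    return match["host"], int(match["port"]), match["path"]


# Abbreviated copy of the last line of the traceback.
line = (
    "requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', "
    "port=21001): Max retries exceeded with url: /register_worker (Caused by ...)"
)
print(failing_endpoint(line))  # -> ('localhost', 21001, '/register_worker')
```

Here the result points at `localhost:21001`, which is the `controller_address` printed in the worker's `args` line, so the refused connection is the worker's registration call to the controller rather than anything in model loading.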

@jackaihfia2334 jackaihfia2334 added the bug Something isn't working label Aug 15, 2023
@xiongot

xiongot commented Aug 15, 2023

[image: ea2d906eff6f11091986033cf864e9d]
Could it be a model path problem? I haven't figured out how the path should be named either.

@jackaihfia2334
Author

jackaihfia2334 commented Aug 15, 2023 via email

@xiongot

xiongot commented Aug 15, 2023 via email

@xiongot

xiongot commented Aug 15, 2023 via email

@jackaihfia2334
Author

jackaihfia2334 commented Aug 15, 2023 via email

@jackaihfia2334
Author

> Mine is p-tuning. Which kind of PEFT are you using? Did it work even without "peft" in the path?
>
> Original message from Mao Yufan, 08/15/2023, 18:10: My model loaded successfully. My guess is that it means the file that stores the adapter needs "peft" in its name.
>
> Original message from Yixuan, Tuesday, August 15, 2023, 6:02 PM: Could it be a model path problem? I haven't figured out how it should be named either.

After checking the FastChat documentation, I found that the controller service has to be started first (python3 -m fastchat.serve.controller). Then run python3 -m fastchat.serve.model_worker --model-path /data1/llm/peft_test --model-names peft_test --num-gpus 1
With that, the model loads successfully. However, the subsequent API calls still seem to have problems such as mismatched interfaces, and I'm still working through those.
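
The start order described above can be sketched as a small launcher: start the controller, wait until its port accepts connections, then start the worker. This is only an illustration under stated assumptions. The two commands and the model path are the ones quoted in this comment, port 21001 is the controller port shown in the worker's `args` line in the log, and the helper names `port_is_open`, `wait_for_port`, and `launch` are made up here. It uses `sys.executable` in place of `python3` so both processes run in the current interpreter's environment.

```python
import socket
import subprocess
import sys
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, deadline_s=60.0, poll_s=0.5):
    """Poll until host:port accepts connections; return False on timeout."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if port_is_open(host, port):
            return True
        time.sleep(poll_s)
    return False


def launch():
    """Start the controller, wait for it, then run the worker in the foreground."""
    controller_cmd = [sys.executable, "-m", "fastchat.serve.controller"]
    worker_cmd = [
        sys.executable, "-m", "fastchat.serve.model_worker",
        "--model-path", "/data1/llm/peft_test",  # path taken from the comment
        "--model-names", "peft_test",
        "--num-gpus", "1",
    ]
    controller = subprocess.Popen(controller_cmd)
    try:
        # Assumed controller address: localhost:21001, as printed in the log.
        if not wait_for_port("localhost", 21001):
            raise RuntimeError("controller did not start listening on port 21001")
        subprocess.run(worker_cmd, check=True)
    finally:
        # Stop the controller once the worker exits or startup fails.
        controller.terminate()
```

Calling `launch()` on a machine with FastChat installed would run both processes. Waiting on the port instead of sleeping for a fixed time avoids the race that produces the "Connection refused" error when the worker registers before the controller is listening.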

@Maxliag

Maxliag commented Aug 16, 2023

> Mine is p-tuning. Which kind of PEFT are you using? Did it work even without "peft" in the path?

Mine is also fine-tuned with p-tuning. Did yours run successfully?

@xiongot

xiongot commented Aug 16, 2023

> Mine is also fine-tuned with p-tuning. Did yours run successfully?

No, it didn't. It keeps showing "wait controller running", and I also don't know where the .bin file and config file produced by PEFT training should go: in the same folder as the base model, or somewhere else? Previous versions read them from a dedicated p-tuning folder.

@xiongot

xiongot commented Aug 16, 2023

> Mine is also fine-tuned with p-tuning. Did yours run successfully?

Can I just replace the base model's config file with the PEFT one and then put the .bin file into the base model folder?
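
The thread leaves the file layout question open. As a hedged aid for checking what a directory actually contains, the sketch below reports whether a path looks like an adapter saved by the Hugging Face PEFT library, which writes an `adapter_config.json` recording `base_model_name_or_path` and `peft_type`. The function name `describe_adapter_dir` is made up for illustration, and the "peft in the path" check only reflects the guess made earlier in this thread, not confirmed FastChat behaviour. A P-Tuning checkpoint produced outside the PEFT library may not have this file at all.

```python
import json
from pathlib import Path


def describe_adapter_dir(path):
    """Summarise whether `path` looks like a PEFT-library adapter directory."""
    root = Path(path)
    config_file = root / "adapter_config.json"
    info = {
        # The naming rule guessed at in the thread (an assumption, not verified).
        "has_peft_in_path": "peft" in str(root).lower(),
        "has_adapter_config": config_file.is_file(),
        "base_model": None,
        "peft_type": None,
    }
    if info["has_adapter_config"]:
        config = json.loads(config_file.read_text(encoding="utf-8"))
        # Fields written by the PEFT library when an adapter is saved.
        info["base_model"] = config.get("base_model_name_or_path")
        info["peft_type"] = config.get("peft_type")
    return info
```

For example, `describe_adapter_dir("/data2/peft_chapi")` would show whether the directory from the original command carries an adapter config and which base model it points to. If `base_model` is set, the adapter records where its base weights live, which suggests the adapter files can stay in their own folder instead of being copied into the base model directory.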

@jackaihfia2334
Author

> No, it didn't. It keeps showing "wait controller running", and I also don't know where the .bin file and config file produced by PEFT training should go: in the same folder as the base model, or somewhere else? Previous versions read them from a dedicated p-tuning folder.

You need to start the controller service first (python3 -m fastchat.serve.controller).

@chenkaiC4

I'm also fine-tuning with P-Tuning v2. The specific problem is described at the link below; it should be a very common one:

#1130 (comment)
