Worker thread stuck in die state #1815
Comments
Another example: a similar thing happened with a worker. If I ssh into the pod and do the curl for the model:
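(For context, a curl against the model would look roughly like the following, assuming TorchServe's default inference port 8080 and the easyocr model discussed later in this thread; the actual command and output from the pod were not captured above.)

# hypothetical inference call from inside the pod; image.jpg is a placeholder input
curl http://localhost:8080/predictions/easyocr -T image.jpg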
With that said, is there basically no way to catch the worker die-stuck error?
Had quite a few similar issues recently, and I think in general all of them seem to point to some race condition when gRPC gets cancelled for some reason (maybe from the client? or a timeout?) and the corresponding worker gets stuck in a die-ish (zombie) state.
Hi @hgong-snap, is there any chance you can share a repro? In particular your .mar file and config.
An update: I tried to repro this offline.

Setup
Get patched client: use this patched version of
Get input data: I picked one of the slowest models I know of, from sample_text.txt
Run test
EDIT: I tried just running it and hitting Ctrl-C a few times, and if you look at the torchserve logs output in the console where you type in
Hi @msaroufim, here are the mar file and config file:

mar file:
For repro, I guess the best way is to cancel the request on the client side? Or set a short enough gRPC timeout so that gRPC will internally cancel the request?
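For illustration, a minimal sketch of those two options, assuming TorchServe's default gRPC inference port 7070 and the same generated inference_pb2 / inference_pb2_grpc stubs used in the repro script further down this thread:

import time
import grpc
import inference_pb2
import inference_pb2_grpc

channel = grpc.insecure_channel("localhost:7070")
stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
with open("image.jpg", "rb") as f:
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input={"data": f.read()})

# Option 1: cancel the in-flight call explicitly from the client.
future = stub.Predictions.future(request)
time.sleep(0.05)      # give the request a moment to reach the server
future.cancel()       # client-side cancellation is propagated to the server

# Option 2: let gRPC cancel the call via a deadline that is too short to meet.
try:
    stub.Predictions(request, timeout=0.02)
except grpc.RpcError as e:
    print(e.code())   # typically StatusCode.DEADLINE_EXCEEDED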
@hgong-snap confirming we can finally repro, will get back to you with a solution soon.
I have the same problem. Any ideas?
Same issue with only 1 worker. The model works correctly for a while, then the worker stops with the error below; it then just tries to start a new worker and fails again for a while.
@MHatemAbdelhamid The issue reported in this ticket is that the worker thread hangs and never gets a chance to recreate a new worker thread. Your case is different, since a new worker thread is created. In your case, most likely there is something wrong with either your model implementation or the input data.
@lxning But the model does work in a normal state; this only happens in stress testing.
It works normally under normal conditions; the error only happens with a large number of users. Suddenly, when the number of users increases, it goes into an infinite loop of trying to create the worker and then failing.
@MHatemAbdelhamid According to your description ("the error only happens on large number of users, suddenly when the number of users increases"), it seems to be a capacity issue. I suggest you find the error log to identify the root cause. I believe the issue in this ticket is different from the one you are facing.
Hi @lxning @msaroufim, thanks for the quick fix. Unfortunately #1854 does not seem to fully mitigate the issue. I built the image from the latest master with
I can successfully repro it locally with the following setup:
import sys
import grpc
import time
import argparse
import inference_pb2
import inference_pb2_grpc
import management_pb2
import management_pb2_grpc
from datetime import datetime
from google.protobuf import empty_pb2

parser = argparse.ArgumentParser()
parser.add_argument("--timeout", required=True)
parser.add_argument("--port", required=True)
parser.add_argument("-n", required=True)

if __name__ == '__main__':
    args, _ = parser.parse_known_args()
    port = args.port
    num_runs = int(args.n)

    # Read the test image once and build the prediction request.
    with open('image.jpg', 'rb') as f:
        data = f.read()
    input_data = {'data': data}
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input=input_data)

    # Open a gRPC channel to the TorchServe inference API.
    channel = grpc.insecure_channel(f'localhost:{port}')
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

    # Fire num_runs back-to-back requests; timed-out calls raise and are printed.
    for i in range(num_runs):
        try:
            start = datetime.now()
            response = stub.Predictions(request, timeout=float(args.timeout))
            print("request time:", datetime.now() - start)
            print(response)
        except Exception as e:
            print(e)
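The setup details were not fully captured above; a hypothetical invocation of the two simulation tabs referenced in the result below (grpc_client.py is a placeholder filename for the script above, 7070 is TorchServe's default gRPC inference port, and the -n counts are arbitrary):

# tab 1: well-behaved client with a generous timeout
python grpc_client.py --port 7070 --timeout 2 -n 1000

# tab 2: aggressive client whose requests keep timing out and being cancelled
python grpc_client.py --port 7070 --timeout 0.02 -n 10000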
Result: it takes torchserve anywhere from several seconds to several minutes to resolve the issue before tab 1's simulation output goes back to normal, but it is also likely that all workers get stuck forever and tab 1 just sees a flood of errors.
@hgong-snap This is precisely the problem I faced. @lxning Any thoughts?
@hgong-snap I tried the steps you provided. There is no exception or dead worker thread in ts_log.log.zip. The message "grpc client call already cancelled" was logged when TS was trying to send a response but the client side had already timed out. The reason the tab 1 client with timeout=2 recovers slowly is that the tab 2 client with timeout=0.02 runs at 100x the rate of tab 1, which means there are 100x as many requests from tab 2 accumulated in TS's internal job queue. These jobs are still processed by TS even though the client has already cancelled the connection. This scenario gives you the wrong impression that the worker dies and takes a long time to recover. Meanwhile, I filed #1863 to optimize TorchServe performance.
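To make that backlog effect concrete, here is a rough back-of-the-envelope sketch; the per-request model latency and run length are assumed values, not numbers from this thread:

# Assumed: the model takes ~1 s per request. Each client loops again as soon as
# its call returns or times out; tab 2's calls always time out, so it submits
# roughly 1 / 0.02 = 50 requests/s (100x tab 1, whose timeout is 2 s).
model_latency_s = 1.0                 # assumed per-request processing time
tab2_timeout_s = 0.02                 # tab 2 client timeout
run_duration_s = 60.0                 # assumed length of the tab 2 run

enqueue_rate = 1.0 / tab2_timeout_s   # ~50 requests/s submitted by tab 2
drain_rate = 1.0 / model_latency_s    # ~1 request/s completed by the worker
backlog = (enqueue_rate - drain_rate) * run_duration_s
print(f"queued jobs after {run_duration_s:.0f}s: ~{backlog:.0f}")
print(f"time until tab 1 sees normal latency again: ~{backlog / drain_rate:.0f}s")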
@lxning I updated my script to accept another parameter, sleep_time:

import sys
import grpc
import time
import argparse
import inference_pb2
import inference_pb2_grpc
import management_pb2
import management_pb2_grpc
from datetime import datetime
from google.protobuf import empty_pb2

parser = argparse.ArgumentParser()
parser.add_argument("--timeout", required=True, type=float)
parser.add_argument("--port", required=True, type=int)
parser.add_argument("--sleep_time", default=0, type=float)
parser.add_argument("-n", required=True, type=int)

if __name__ == '__main__':
    args, _ = parser.parse_known_args()
    port = args.port
    num_runs = int(args.n)

    # Read the test image once and build the prediction request.
    with open('image.jpg', 'rb') as f:
        data = f.read()
    input_data = {'data': data}
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input=input_data)

    # Open a gRPC channel to the TorchServe inference API.
    channel = grpc.insecure_channel(f'localhost:{port}')
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

    # Same request loop as before, but optionally sleep between requests
    # to throttle the request rate.
    for i in range(num_runs):
        try:
            start = datetime.now()
            response = stub.Predictions(request, timeout=args.timeout)
            print("request time:", datetime.now() - start)
            print(response)
        except Exception as e:
            print(e)
        if args.sleep_time:
            time.sleep(args.sleep_time)

I recorded a video to illustrate the problem; some good timestamps to look at:
so that QPS=1 now.
https://drive.google.com/file/d/1zNlFTX6h2AO_DVQHgwZmbHClKy1zcfOI/view?usp=sharing
@hgong-snap Thank you for recording the workflow; I can see the following exception in the video.
After the PR, the above exception should be gone. Instead, only the warning "grpc client call already cancelled" is shown in the log (e.g. ts_log.log.zip). Could you please check whether you can see such a warning in your log? If not, most likely you are still using the old source code. Here is an example of building a docker image based on master.
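(The exact command from the original comment was not captured above; from memory, a dev image is built from the master branch with the repo's docker/build_image.sh script, roughly as below. The flags are assumptions and may differ between TorchServe versions.)

# hypothetical dev build from the master branch, GPU variant with CUDA 10.2
cd serve/docker
./build_image.sh -bt dev -g -cv cu102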
@lxning Hmm, I'm pretty sure I built the image from the latest master though... but just in case, would you mind uploading the docker image you use so that I can try it on my end? (ideally with GPU support)
@hgong-snap I verified the master branch on both the local host and the docker nightly build. You can fetch the torchserve nightly build at https://hub.docker.com/r/pytorch/torchserve-nightly/tags.
@lxning thanks. I tried to build the dev image locally with
after this completed successfully, I ran
but got an error:
Any hint why?
@hgong-snap A GPU docker image should specify the CUDA version, for example:
Here is the detailed information about the torchserve docker image build.
Could you please directly pull the torchserve docker nightly build, which is based on cu102, to quickly verify? e.g.
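(A plausible pull command for that nightly image, assuming the latest-gpu tag listed on the Docker Hub page above; the exact tag from the original comment was not captured.)

docker pull pytorch/torchserve-nightly:latest-gpu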
@lxning verified on
@hgong-snap Great, thank you for the verification.
Quick question: in order to run torchserve with the fix locally (e.g. on a Mac) without docker, will
be sufficient?
Yep, don't forget to run
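(The exact commands referenced in this exchange were not captured; from memory, a from-source install of the master branch looks roughly like the following, with script names that may have changed between releases.)

# hypothetical from-source install of the fix from master
git clone https://github.com/pytorch/serve.git
cd serve
python ./ts_scripts/install_dependencies.py
python ./ts_scripts/install_from_src.py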
Hi, I am still getting this issue.
🐛 Describe the bug
We encountered problems similar to #1531, and it happens quite often.
See logs below. We have two workers (9000, 9001) for this model. After worker 9000 got an exception, it got stuck in an unknown state: it didn't terminate itself, so no new worker can be added automatically, but in the meantime it won't receive incoming traffic, which essentially means we only have one worker (9001) now.
The problem is that this worker is in a stuck state: it does not destruct itself (successfully) and it can't receive any traffic, yet it is still counted as an active worker, so torchserve won't add more workers (because the current worker count = 2). Normally the worker would die and torchserve would retry it (e.g. we would find
Retry worker: 9001 in 1 seconds.
in the log). If I curl the management API, it still shows both workers as healthy.
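(For reference, the management API check looks roughly like this, assuming TorchServe's default management port 8081 and the easyocr model name from the logs below; it reports per-worker status, and in this state it kept listing both workers as healthy.)

curl http://localhost:8081/models/easyocr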
Error logs
Worker 9000 died because of an exception. It didn't have any log after 2022-08-25 21:21:14.056 PDT. Selected logs:
[WARN ] W-9000-easyocr_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
    at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
    at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
    at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
    at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) ~[model-server.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]
[INFO ] W-9000-easyocr_1.0-stdout MODEL_LOG - Frontend disconnected.
[INFO ] W-9000-easyocr_1.0 ACCESS_LOG - /127.0.0.1:40592 "gRPC org.pytorch.serve.grpc.inference.InferenceAPIsService/Predictions HTTP/2.0" 13 109
[INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED
[INFO ] W-9000-easyocr_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-easyocr_1.0-stderr
Installation instructions
N/A unrelated
Model Packaging
N/A unrelated
config.properties
No response
Versions
Used the 0.6.0-gpu docker image.
Repro instructions
N/A
Possible Solution
No response