
Worker thread stuck in die state #1815

Closed
hgong-snap opened this issue Aug 26, 2022 · 30 comments
Assignees
Labels
bug (Something isn't working), p0 (high priority)

Comments

@hgong-snap

hgong-snap commented Aug 26, 2022

🐛 Describe the bug

We encountered a problem similar to #1531, and it happens quite often.

See the logs below. We have two workers (9000, 9001) for this model. After worker 9000 hit an exception, it got stuck in an unknown state: it didn't terminate itself, so no replacement worker could be added automatically, but at the same time it stopped receiving incoming traffic, which essentially means we only have one worker (9001) now.

The problem is that this worker is stuck: it does not destroy itself (successfully), and it can't receive any traffic, yet it is still counted as an active worker, so torchserve won't add more workers (because the current worker count is 2). Normally the worker would die and torchserve would retry it (e.g. we'd find Retry worker: 9001 in 1 seconds. in the log).

If I curl the management API, it still shows that both workers are healthy.
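For reference, the check looks like this (the 17201 management port comes from the config.properties posted further down in this thread; a stock install would use 8081):

# list the model and its workers via the management API
curl http://localhost:17201/models/easyocr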

Error logs

Worker 9000 died because of an exception. It produced no logs after 2022-08-25 21:21:14.056 PDT. Selected logs:

[WARN ] W-9000-easyocr_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.

io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
        at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
        at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) ~[model-server.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]

[INFO ] W-9000-easyocr_1.0-stdout MODEL_LOG - Frontend disconnected.
[INFO ] W-9000-easyocr_1.0 ACCESS_LOG - /127.0.0.1:40592 "gRPC org.pytorch.serve.grpc.inference.InferenceAPIsService/Predictions HTTP/2.0" 13 109
[INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED

[INFO ] W-9000-easyocr_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-easyocr_1.0-stderr

Screen Shot 2022-08-25 at 10 10 38 PM

Screen Shot 2022-08-25 at 10 09 24 PM

Installation instructions

N/A unrelated

Model Packaging

N/A unrelated

config.properties

No response

Versions

Used the 0.6.0-gpu docker image.

Repro instructions

N/A

Possible Solution

No response

hgong-snap changed the title from "Worker thread stuck after got client Cancelled" to "Worker thread die stuck" Aug 26, 2022
@hgong-snap
Author

hgong-snap commented Aug 26, 2022

Another example of the same thing: worker W-9000-easyocr died, and GPU usage dropped to almost half (2 workers; 1 worker died/stuck, so only 1 worker is functioning).

If I ssh into the pod and curl the management API for the model:

root@ad-ranker-pytorch-ranker-usc-a-588566f68-dt8lj:/home/model-server# curl localhost:17201/models/easyocr
[
  {
    "modelName": "easyocr",
    "modelVersion": "1.0",
    "modelUrl": "easyocr.mar",
    "runtime": "python",
    "minWorkers": 2,
    "maxWorkers": 2,
    "batchSize": 5,
    "maxBatchDelay": 25,
    "loadedAtStartup": true,
    "workers": [
      {
        "id": "9000",
        "startTime": "2022-08-26T06:26:26.166Z",
        "status": "READY",
        "memoryUsage": 2979250176,
        "pid": 1042,
        "gpu": true,
        "gpuUsage": "gpuId::0 utilization.gpu [%]::63 % utilization.memory [%]::20 % memory.used [MiB]::5755 MiB"
      },
      {
        "id": "9001",
        "startTime": "2022-08-26T06:26:26.168Z",
        "status": "READY",
        "memoryUsage": 2992365568,
        "pid": 1043,
        "gpu": true,
        "gpuUsage": "gpuId::0 utilization.gpu [%]::63 % utilization.memory [%]::20 % memory.used [MiB]::5755 MiB"
      }
    ]
  }
]

Screen Shot 2022-08-26 at 12 04 45 AM

Screen Shot 2022-08-25 at 11 59 16 PM

@hgong-snap
Author

hgong-snap commented Aug 26, 2022

With that said,

  1. I can't tell whether a worker is in this stuck-dead state or not.
  2. Even if I know a worker has died, I can't tell which one died, nor kill/restart that specific worker.

so basically there's really no way to catch the stuck-dead worker error?
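The closest built-in knob I can find is the scale-workers call on the management API, but it applies to all workers of the model, not one specific worker, and it's unclear whether it would clear a truly hung one. A sketch:

# scale the model's workers down to 0 and back up to 2 to force them to be recreated
# (synchronous=true blocks until scaling finishes)
curl -X PUT "http://localhost:17201/models/easyocr?min_worker=0&synchronous=true"
curl -X PUT "http://localhost:17201/models/easyocr?min_worker=2&synchronous=true"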

msaroufim added the bug (Something isn't working) and p0 (high priority) labels Aug 27, 2022
hgong-snap changed the title from "Worker thread die stuck" to "Worker thread die stuck state" Aug 27, 2022
hgong-snap changed the title from "Worker thread die stuck state" to "Worker thread stuck in die state" Aug 27, 2022
@hgong-snap
Author

hgong-snap commented Aug 31, 2022

We've had quite a few similar issues recently, and in general they all seem to point to some race condition when the gRPC call gets cancelled for some reason (maybe by the client? or a timeout?), after which the corresponding worker gets stuck in a dead-ish (zombie) state.

msaroufim self-assigned this Aug 31, 2022
@msaroufim
Member

Hi @hgong-snap, is there any chance you can share a repro? In particular your .mar file and config.properties; otherwise I'll see if this occurs on smaller toy models.

@msaroufim
Member

msaroufim commented Sep 6, 2022

An update: I tried to repro this offline.

Setup

git clone https://github.com/pytorch/serve
mkdir model_store
torchserve --start --model-store model_store
python -m grpc_tools.protoc --proto_path=frontend/server/src/main/resources/proto/ --python_out=ts_scripts --grpc_python_out=ts_scripts frontend/server/src/main/resources/proto/inference.proto frontend/server/src/main/resources/proto/management.proto

Get patched client

Use this patched version of torchserve_grpc_client.py: https://gist.github.com/msaroufim/8a9f7e8a0a58ff5687c4266677c10fe3

Get input data

I picked one of the slowest models I know of from docs/model_zoo.md and then ran 10000 inferences in a row.

sample_text.txt

Bloomberg has decided to publish a new report on global economic situation.

Run test

for i in $(seq 0 1 10000)
do 
    python ts_scripts/torchserve_grpc_client.py infer bert_seqc_without_torchscript sample_text.txt
    echo $i
done

EDIT: I tried just hitting Ctrl-C a few times, and if you look at the torchserve logs printed in the console where you ran torchserve --start, you should now see that the backend worker thread died:

2022-09-07T11:12:14,853 [WARN ] W-9000-bert_seqc_without_torchscript_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
        at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
        at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) [model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]

@hgong-snap
Author

Hi @msaroufim, here are the mar file and config file:

mar file:
https://drive.google.com/file/d/1GaNfIhvAZn-7hFE1BlfDyTXajrFAAWuR/view?usp=sharing
config.properties:

service_envelope=body

grpc_inference_port=17100
grpc_management_port=17101

inference_address=http://0.0.0.0:17200
management_address=http://0.0.0.0:17201
metrics_address=http://127.0.0.1:17202

default_workers_per_model=2
install_py_dep_per_model=true
load_models=all
# async_logging=true
job_queue_size=500
default_response_timeout=5
unregister_model_timeout=5
models={\
  "easyocr": {\
    "1.0": {\
      "defaultVersion": true,\
      "marName": "easyocr.mar",\
      "minWorkers": 2,\
      "maxWorkers": 2,\
      "batchSize": 5,\
      "maxBatchDelay": 25,\
      "responseTimeout": 2000\
    }\
  },\
  "suggestive_timesformer": {\
    "0.01": {\
      "defaultVersion": true,\
      "marName": "suggestive_timesformer.mar",\
      "minWorkers": 1,\
      "maxWorkers": 1,\
      "batchSize": 8,\
      "maxBatchDelay": 250,\
      "responseTimeout": 5000\
    }\
  }\
}

@hgong-snap
Author

For the repro, I guess the best way is to cancel the request on the client side, or to set a short enough gRPC timeout so that gRPC will cancel the request internally?

@msaroufim
Member

@hgong-snap confirming we can finally repro, will get back to you with a solution soon

@MohamedAliRashad

I have the same problem. Any ideas?

@MHatemAbdelhamid

Same issue with only 1 worker.

The model works correctly for a while, then the worker stops with the error below; after that it just tries to start a new worker and fails again for a while.


 2022-09-08T11:44:09,800 [INFO ] epollEventLoopGroup-5-121 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STOPPED
2022-09-08T11:44:09,800 [WARN ] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-unified_salient_and_beit_1.0-stderr
2022-09-08T11:44:09,800 [WARN ] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-unified_salient_and_beit_1.0-stdout
2022-09-08T11:44:09,800 [INFO ] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2022-09-08T11:44:09,998 [INFO ] W-9000-unified_salient_and_beit_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-unified_salient_and_beit_1.0-stderr
2022-09-08T11:44:09,998 [INFO ] W-9000-unified_salient_and_beit_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-unified_salient_and_beit_1.0-stdout
2022-09-08T11:44:10,801 [DEBUG] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/conda/bin/python3.8, /opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000]

@lxning
Collaborator

lxning commented Sep 8, 2022

@MHatemAbdelhamid The issue reported in this ticket is that the worker thread hangs and never gets a chance to recreate a new worker thread. Your case is different, since a new worker thread is created. In your case, most likely there is something wrong with either your model implementation or the input data.

@MohamedAliRashad

@lxning But the model does work under normal conditions; this only happens under stress testing.

@MHatemAbdelhamid

@MHatemAbdelhamid The issue reported in this ticket is that the worker thread hangs and never gets a chance to recreate a new worker thread. Your case is different, since a new worker thread is created. In your case, most likely there is something wrong with either your model implementation or the input data.

It works normally under normal conditions; the error only happens with a large number of users. As soon as the number of users increases, it goes into an infinite loop of trying to create the worker and then failing.

@lxning
Collaborator

lxning commented Sep 13, 2022

@MHatemAbdelhamid According to your description ("the error only happens with a large number of users"), it seems to be a capacity issue. I suggest you look through the error log to identify the root cause. I believe the issue in this ticket is different from the one you faced.

@hgong-snap
Author

hgong-snap commented Sep 14, 2022

Hi @lxning @msaroufim ,

Thanks for the quick fix. Unfortunately #1854 does not seem to fully mitigate the issue.

I built the image from the latest master with ./build_image.sh -g [reference] and then deployed it. A similar error happened:

  • [18:41:01] W-9001 throws exception:
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
	at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
	at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
	at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) [model-server.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

Screen Shot 2022-09-13 at 7 39 38 PM

  • [18:14:02] W-9001 stopped, and there have been no log messages since then

Screen Shot 2022-09-13 at 7 41 17 PM

  • W-9000 was still fine at that point, but finally died at 18:55

Screen Shot 2022-09-13 at 7 43 31 PM

This matches the observed GPU memory usage well:
Screen Shot 2022-09-13 at 7 44 14 PM

@hgong-snap
Author

We currently implemented a workaround for GKE: if there's no worker available, just restart the pod (i.e. restart torchserve); a sketch of the check is below.
As you can see, this error can happen very often (8pm-11pm): #workers goes 2->1->0->restart->2->1->0, repeat....

Screen Shot 2022-09-14 at 11 40 31 PM
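Roughly, the shape of that check is the following (a sketch only, not our exact probe; note that, as mentioned above, a stuck worker can still report READY, so the real signal may need to come from your own traffic/worker metrics):

#!/bin/bash
# restart-on-no-workers check: exit non-zero so the orchestrator restarts the pod.
# Counts workers the management API reports as READY for the model.
ready=$(curl -s http://localhost:17201/models/easyocr | grep -c '"status": "READY"')
if [ "$ready" -eq 0 ]; then
  echo "no workers available, forcing a restart"
  exit 1
fi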

@hgong-snap
Author

hgong-snap commented Sep 16, 2022

@msaroufim @lxning

I can successfully repro it locally with the following setup.

Setup

  • in config.properties, I set min/max workers to 2
  • start torchserve locally and wait until it's fully up
  • have a simple Python gRPC client:
import sys
import grpc
import time
import argparse
import inference_pb2
import inference_pb2_grpc
import management_pb2
import management_pb2_grpc
from datetime import datetime
from google.protobuf import empty_pb2


parser = argparse.ArgumentParser()
parser.add_argument("--timeout", required=True)
parser.add_argument("--port", required=True)
parser.add_argument("-n", required=True)


if __name__ == '__main__':
    args, _ = parser.parse_known_args()
    port = args.port
    num_runs = int(args.n)

    with open('image.jpg', 'rb') as f:
        data = f.read()
    input_data = {'data': data}
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input=input_data)

    channel = grpc.insecure_channel(f'localhost:{port}')
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
    for i in range(num_runs):
        try:
            start = datetime.now()
            response = stub.Predictions(request, timeout=float(args.timeout))
            print("request time:", datetime.now() - start)
            print(response)
        except Exception as e:
            print(e)
  • in one shell tab, run the client with a normal timeout (2s) and a very large number of requests, simulating a consistent stream of normal requests:
python3 torchserve_grpc_client.py --port=17100 --timeout=2 -n=100000000
  • in another shell tab, run the client with a very short timeout (0.02s), simulating some heavy/bad requests that will eventually time out:
python3 torchserve_grpc_client.py --port=17100 --timeout=0.02 -n=100

Result

Sometimes it takes torchserve several seconds to several minutes to resolve the issue and tab 1's output goes back to normal, but it is also likely that all workers get stuck forever and tab 1 sees a flood of errors.

@MohamedAliRashad

@hgong-snap This is precisely the problem I faced.

@lxning Any thoughts?

@lxning
Collaborator

lxning commented Sep 16, 2022

@hgong-snap I tried the steps you provided. There is no exception or dead worker thread in ts_log.log.zip. The message "grpc client call already cancelled" was logged when TS was trying to send a response after the client side had already timed out.

The reason tab 1 (timeout=2) recovers slowly is that tab 2 (timeout=0.02) has 100x the request rate of tab 1. That means 100x as many requests from tab 2 accumulate in TS's internal job queue. These jobs are still processed by TS even though the client has already cancelled the connection. This scenario gives you the wrong impression that the worker dies and takes a long time to recover.

Meanwhile, I filed #1863 to optimize TorchServe performance.
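(For context, the queue depth and worker response timeout involved here come from the config.properties posted earlier in this thread; the annotations are mine:)

job_queue_size=500          # number of inference jobs the frontend will queue before rejecting new ones
default_response_timeout=5  # seconds before a backend worker is deemed unresponsive (default 120)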

@hgong-snap
Author

hgong-snap commented Sep 16, 2022

@lxning I updated my script to accept another parameter, --sleep_time, which configures how long the client should wait/sleep between requests.
New client script:

import sys
import grpc
import time
import argparse
import inference_pb2
import inference_pb2_grpc
import management_pb2
import management_pb2_grpc
from datetime import datetime
from google.protobuf import empty_pb2


parser = argparse.ArgumentParser()
parser.add_argument("--timeout", required=True, type=float)
parser.add_argument("--port", required=True, type=int)
parser.add_argument("--sleep_time", default=0, type=float)
parser.add_argument("-n", required=True, type=int)


if __name__ == '__main__':
    args, _ = parser.parse_known_args()
    port = args.port
    num_runs = int(args.n)

    with open('image.jpg', 'rb') as f:
        data = f.read()
    input_data = {'data': data}
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input=input_data)

    channel = grpc.insecure_channel(f'localhost:{port}')
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
    for i in range(num_runs):
        try:
            start = datetime.now()
            response = stub.Predictions(request, timeout=args.timeout)
            print("request time:", datetime.now() - start)
            print(response)
        except Exception as e:
            print(e)
        if args.sleep_time:
            time.sleep(args.sleep_time)

I recorded a video to illustrate the problem. Some good timestamps to look at:

  • 0:10 I send bad requests in tab 2 with
python3 grpc_client/torchserve_grpc_client2.py --port=17100 --timeout=0.02 -n=10 --sleep_time=1

so that QPS=1 now.

  • 0:16 tab 1 starts seeing worker issues, either INTERNAL or DEADLINE_EXCEEDED
  • 0:34 tab 1 recovered (lucky)
  • 0:45 I send bad requests in tab 2 again
  • tab 1 has never recovered since then; at the time of writing tab 1 still has 0 successes (>15 min now).

https://drive.google.com/file/d/1zNlFTX6h2AO_DVQHgwZmbHClKy1zcfOI/view?usp=sharing

@lxning
Collaborator

lxning commented Sep 17, 2022

@hgong-snap Thank you for recording the workflow, I can see the following exception in the video.

io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
        at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
        at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) [model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]

After the PR, the above exception should be gone. Instead, only the warning "grpc client call already cancelled" is shown in the log (e.g. ts_log.log.zip). Could you please check whether you can see such a warning in your log? If not, most likely you are still using the old source code.
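A quick way to check (assuming the default log location):

# the fixed build logs only the WARN line; the old build logs the full StatusRuntimeException stack
grep "call already cancelled" logs/ts_log.log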

Here is an example of building a docker image based on master:

cd docker
./build_image.sh -bt dev

@hgong-snap
Author

hgong-snap commented Sep 17, 2022

@lxning Hmm, I'm pretty sure I built the image from the latest master... but just in case, would you mind uploading the docker image you use so that I can try it on my end? (Ideally with GPU support.)

@lxning
Collaborator

lxning commented Sep 19, 2022

@hgong-snap I verified the master branch on both the local host and the docker nightly build.

You can fetch the torchserve nightly build from https://hub.docker.com/r/pytorch/torchserve-nightly/tags:

  • docker pull pytorch/torchserve-nightly:latest-gpu
  • docker pull pytorch/torchserve-nightly:latest-cpu

@hgong-snap
Author

@lxning Thanks. I tried to build the dev image locally with

./build_image.sh -bt dev -g

After this completed successfully, I ran

docker run -it --rm  pytorch/torchserve:dev-gpu

but it fails with:

Traceback (most recent call last):
  File "/usr/bin/torchserve", line 33, in <module>
    sys.exit(load_entry_point('torchserve==0.6.0b20220919', 'console_scripts', 'torchserve')())
  File "/usr/bin/torchserve", line 22, in importlib_load_entry_point
    for entry_point in distribution(dist_name).entry_points
  File "/usr/lib/python3.8/importlib/metadata.py", line 503, in distribution
    return Distribution.from_name(distribution_name)
  File "/usr/lib/python3.8/importlib/metadata.py", line 177, in from_name
    raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: torchserve

Any hint as to why the torchserve Python package is not installed in the docker image?

@lxning
Collaborator

lxning commented Sep 19, 2022

@hgong-snap The GPU docker image build should specify a CUDA version. For example:
./build_image.sh -bt dev -g -cv cu102

Here is the detailed information about the torchserve docker image build.

Could you please directly pull the torchserve docker nightly build (which is based on cu102) to quickly verify? e.g.
docker pull pytorch/torchserve-nightly:latest-gpu

@hgong-snap
Author

@lxning Verified on pytorch/torchserve-nightly:latest-gpu and the issue seems to be solved!

@lxning
Collaborator

lxning commented Sep 20, 2022

@hgong-snap Great, thank you for the verification.

@hgong-snap
Author

Quick question:

in order to run torchserve with the fix locally (e.g. on a Mac) without docker, will

pip install torchserve-nightly

be sufficient?

@msaroufim
Member

msaroufim commented Sep 21, 2022

Yep. Don't forget to run ts_scripts/install_dependencies first, though, with dev dependencies and GPU support if needed; a sketch is below.
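Something like this (a sketch; the flags are the usual install_dependencies options, adjust or drop --cuda for CPU-only):

# from a torchserve checkout: install dev dependencies (add e.g. --cuda=cu102 for GPU)
python ts_scripts/install_dependencies.py --environment=dev
# then install the nightly wheel
pip install torchserve-nightly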

@arnavmehta7
Contributor

arnavmehta7 commented Jun 22, 2023

Hi, I am still getting this issue.
I have built the image through build.sh
