
Worker thread stuck in die state #1815

Closed
hgong-snap opened this issue Aug 26, 2022 · 30 comments
Assignees
Labels
bug (Something isn't working), p0 (high priority)

Comments

@hgong-snap

hgong-snap commented Aug 26, 2022

🐛 Describe the bug

We encountered a problem similar to #1531, and it happens quite often.

See the logs below. We have two workers (9000, 9001) for this model. After worker 9000 hit an exception, it got stuck in an unknown state: it didn't terminate itself, so no replacement worker could be added automatically, but at the same time it stopped receiving incoming traffic, which essentially means we only have one worker (9001) now.

The problem is that this worker is stuck: it does not destroy itself (successfully), and it can't receive any traffic, yet it is still counted as an active worker, so torchserve won't add more workers (because the current worker count is 2). Normally the worker would die and torchserve would retry it (e.g. we'd find Retry worker: 9001 in 1 seconds. in the log).

If I curl the management API, it still shows that both workers are healthy.
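For reference, the check looks like this (the 17201 management port comes from the config.properties posted further down in this thread; a stock install would use 8081):

# list the model and its workers via the management API
curl http://localhost:17201/models/easyocr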

Error logs

Worker 9000 died because of an exception. It produced no logs after 2022-08-25 21:21:14.056 PDT. Selected logs:

[WARN ] W-9000-easyocr_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.

io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
        at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
        at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) ~[model-server.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]

[INFO ] W-9000-easyocr_1.0-stdout MODEL_LOG - Frontend disconnected.
[INFO ] W-9000-easyocr_1.0 ACCESS_LOG - /127.0.0.1:40592 "gRPC org.pytorch.serve.grpc.inference.InferenceAPIsService/Predictions HTTP/2.0" 13 109
[INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED

[INFO ] W-9000-easyocr_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-easyocr_1.0-stderr

Screen Shot 2022-08-25 at 10 10 38 PM

Screen Shot 2022-08-25 at 10 09 24 PM

Installation instructions

N/A unrelated

Model Packaging

N/A unrelated

config.properties

No response

Versions

Used the 0.6.0-gpu docker image.

Repro instructions

N/A

Possible Solution

No response

hgong-snap changed the title from "Worker thread stuck after got client Cancelled" to "Worker thread die stuck" Aug 26, 2022
@hgong-snap
Author

hgong-snap commented Aug 26, 2022

Another example of the same thing: worker W-9000-easyocr died, and GPU usage dropped to almost half (2 workers; 1 worker died/stuck, so only 1 worker is functioning).

If I ssh into the pod and curl the management API for the model:

root@ad-ranker-pytorch-ranker-usc-a-588566f68-dt8lj:/home/model-server# curl localhost:17201/models/easyocr
[
  {
    "modelName": "easyocr",
    "modelVersion": "1.0",
    "modelUrl": "easyocr.mar",
    "runtime": "python",
    "minWorkers": 2,
    "maxWorkers": 2,
    "batchSize": 5,
    "maxBatchDelay": 25,
    "loadedAtStartup": true,
    "workers": [
      {
        "id": "9000",
        "startTime": "2022-08-26T06:26:26.166Z",
        "status": "READY",
        "memoryUsage": 2979250176,
        "pid": 1042,
        "gpu": true,
        "gpuUsage": "gpuId::0 utilization.gpu [%]::63 % utilization.memory [%]::20 % memory.used [MiB]::5755 MiB"
      },
      {
        "id": "9001",
        "startTime": "2022-08-26T06:26:26.168Z",
        "status": "READY",
        "memoryUsage": 2992365568,
        "pid": 1043,
        "gpu": true,
        "gpuUsage": "gpuId::0 utilization.gpu [%]::63 % utilization.memory [%]::20 % memory.used [MiB]::5755 MiB"
      }
    ]
  }
]

Screen Shot 2022-08-26 at 12 04 45 AM

Screen Shot 2022-08-25 at 11 59 16 PM

@hgong-snap
Author

hgong-snap commented Aug 26, 2022

With that said,

  1. I can't tell whether a worker is in this stuck-dead state or not.
  2. Even if I know a worker has died, I can't tell which one died, nor kill/restart that specific worker.

so basically there's really no way to catch the stuck-dead worker error?
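The closest built-in knob I can find is the scale-workers call on the management API, but it applies to all workers of the model, not one specific worker, and it's unclear whether it would clear a truly hung one. A sketch:

# scale the model's workers down to 0 and back up to 2 to force them to be recreated
# (synchronous=true blocks until scaling finishes)
curl -X PUT "http://localhost:17201/models/easyocr?min_worker=0&synchronous=true"
curl -X PUT "http://localhost:17201/models/easyocr?min_worker=2&synchronous=true"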

msaroufim added the bug (Something isn't working) and p0 (high priority) labels Aug 27, 2022
hgong-snap changed the title from "Worker thread die stuck" to "Worker thread die stuck state" Aug 27, 2022
hgong-snap changed the title from "Worker thread die stuck state" to "Worker thread stuck in die state" Aug 27, 2022
@hgong-snap
Author

hgong-snap commented Aug 31, 2022

We've had quite a few similar issues recently, and in general they all seem to point to some race condition when the gRPC call gets cancelled for some reason (maybe by the client? or a timeout?), after which the corresponding worker gets stuck in a dead-ish (zombie) state.

msaroufim self-assigned this Aug 31, 2022
@msaroufim
Member

Hi @hgong-snap, is there any chance you can share a repro? In particular your .mar file and config.properties; otherwise I'll see if this occurs on smaller toy models.

@msaroufim
Member

msaroufim commented Sep 6, 2022

An update: I tried to repro this offline.

Setup

git clone https://github.com/pytorch/serve
mkdir model_store
torchserve --start --model-store model_store
python -m grpc_tools.protoc --proto_path=frontend/server/src/main/resources/proto/ --python_out=ts_scripts --grpc_python_out=ts_scripts frontend/server/src/main/resources/proto/inference.proto frontend/server/src/main/resources/proto/management.proto

Get patched client

Use this patched version of torchserve_grpc_client.py: https://gist.github.com/msaroufim/8a9f7e8a0a58ff5687c4266677c10fe3

Get input data

I picked one of the slowest models I know of from docs/model_zoo.md and then ran 10000 inferences in a row.

sample_text.txt

Bloomberg has decided to publish a new report on global economic situation.

Run test

for i in $(seq 0 1 10000)
do 
    python ts_scripts/torchserve_grpc_client.py infer bert_seqc_without_torchscript sample_text.txt
    echo $i
done

EDIT: I tried just hitting Ctrl-C a few times, and if you look at the torchserve logs printed in the console where you ran torchserve --start, you should now see that the backend worker thread died:

2022-09-07T11:12:14,853 [WARN ] W-9000-bert_seqc_without_torchscript_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
        at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
        at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) [model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]

@hgong-snap
Author

Hi @msaroufim, here are the mar file and config file:

mar file:
https://drive.google.com/file/d/1GaNfIhvAZn-7hFE1BlfDyTXajrFAAWuR/view?usp=sharing
config.properties:

service_envelope=body

grpc_inference_port=17100
grpc_management_port=17101

inference_address=http://0.0.0.0:17200
management_address=http://0.0.0.0:17201
metrics_address=http://127.0.0.1:17202

default_workers_per_model=2
install_py_dep_per_model=true
load_models=all
# async_logging=true
job_queue_size=500
default_response_timeout=5
unregister_model_timeout=5
models={\
  "easyocr": {\
    "1.0": {\
      "defaultVersion": true,\
      "marName": "easyocr.mar",\
      "minWorkers": 2,\
      "maxWorkers": 2,\
      "batchSize": 5,\
      "maxBatchDelay": 25,\
      "responseTimeout": 2000\
    }\
  },\
  "suggestive_timesformer": {\
    "0.01": {\
      "defaultVersion": true,\
      "marName": "suggestive_timesformer.mar",\
      "minWorkers": 1,\
      "maxWorkers": 1,\
      "batchSize": 8,\
      "maxBatchDelay": 250,\
      "responseTimeout": 5000\
    }\
  }\
}

@hgong-snap
Author

For the repro, I guess the best way is to cancel the request on the client side, or to set a short enough gRPC timeout so that gRPC will cancel the request internally?

@msaroufim
Member

@hgong-snap confirming we can finally repro, will get back to you with a solution soon

@MohamedAliRashad

I have the same problem. Any ideas?

@MHatemAbdelhamid

Same issue with only 1 worker.

The model works correctly for a while, then the worker stops with the error below; after that it just tries to start a new worker and fails again for a while.


 2022-09-08T11:44:09,800 [INFO ] epollEventLoopGroup-5-121 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STOPPED
2022-09-08T11:44:09,800 [WARN ] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-unified_salient_and_beit_1.0-stderr
2022-09-08T11:44:09,800 [WARN ] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-unified_salient_and_beit_1.0-stdout
2022-09-08T11:44:09,800 [INFO ] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2022-09-08T11:44:09,998 [INFO ] W-9000-unified_salient_and_beit_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-unified_salient_and_beit_1.0-stderr
2022-09-08T11:44:09,998 [INFO ] W-9000-unified_salient_and_beit_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-unified_salient_and_beit_1.0-stdout
2022-09-08T11:44:10,801 [DEBUG] W-9000-unified_salient_and_beit_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/conda/bin/python3.8, /opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000]

@lxning
Collaborator

lxning commented Sep 8, 2022

@MHatemAbdelhamid The issue reported in this ticket is that the worker thread hangs and never gets a chance to recreate a new worker thread. Your case is different, since a new worker thread is created. In your case, most likely there is something wrong with either your model implementation or the input data.

@MohamedAliRashad

@lxning But the model does work under normal conditions; this only happens under stress testing.

@MHatemAbdelhamid

@MHatemAbdelhamid The issue reported in this ticket is that the worker thread hangs and never gets a chance to recreate a new worker thread. Your case is different, since a new worker thread is created. In your case, most likely there is something wrong with either your model implementation or the input data.

It works normally under normal conditions; the error only happens with a large number of users. As soon as the number of users increases, it goes into an infinite loop of trying to create the worker and then failing.

@lxning
Collaborator

lxning commented Sep 13, 2022

@MHatemAbdelhamid According to your description ("the error only happens with a large number of users"), it seems to be a capacity issue. I suggest you look through the error log to identify the root cause. I believe the issue in this ticket is different from the one you faced.

@hgong-snap
Author

hgong-snap commented Sep 14, 2022

Hi @lxning @msaroufim ,

Thanks for the quick fix. Unfortunately #1854 does not seem to fully mitigate the issue.

I built the image from the latest master with ./build_image.sh -g [reference] and then deployed it. A similar error happened:

  • [18:41:01] W-9001 throws exception:
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
	at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
	at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
	at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) [model-server.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

Screen Shot 2022-09-13 at 7 39 38 PM

  • [18:14:02] W-9001 stopped, and there have been no log messages since then

Screen Shot 2022-09-13 at 7 41 17 PM

  • W-9000 was still fine at that point, but finally died at 18:55

Screen Shot 2022-09-13 at 7 43 31 PM

This matches the observed GPU memory usage well:
Screen Shot 2022-09-13 at 7 44 14 PM

@hgong-snap
Author

We currently implemented a workaround for GKE: if there's no worker available, just restart the pod (i.e. restart torchserve); a sketch of the check is below.
As you can see, this error can happen very often (8pm-11pm): #workers goes 2->1->0->restart->2->1->0, repeat....

Screen Shot 2022-09-14 at 11 40 31 PM
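Roughly, the shape of that check is the following (a sketch only, not our exact probe; note that, as mentioned above, a stuck worker can still report READY, so the real signal may need to come from your own traffic/worker metrics):

#!/bin/bash
# restart-on-no-workers check: exit non-zero so the orchestrator restarts the pod.
# Counts workers the management API reports as READY for the model.
ready=$(curl -s http://localhost:17201/models/easyocr | grep -c '"status": "READY"')
if [ "$ready" -eq 0 ]; then
  echo "no workers available, forcing a restart"
  exit 1
fi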

@hgong-snap
Author

hgong-snap commented Sep 16, 2022

@msaroufim @lxning

I can successfully repro it locally with the following setup.

Setup

  • in config.properties, I set min/max workers to 2
  • start torchserve locally and wait until it's fully up
  • have a simple Python gRPC client:
import sys
import grpc
import time
import argparse
import inference_pb2
import inference_pb2_grpc
import management_pb2
import management_pb2_grpc
from datetime import datetime
from google.protobuf import empty_pb2


parser = argparse.ArgumentParser()
parser.add_argument("--timeout", required=True)
parser.add_argument("--port", required=True)
parser.add_argument("-n", required=True)


if __name__ == '__main__':
    args, _ = parser.parse_known_args()
    port = args.port
    num_runs = int(args.n)

    with open('image.jpg', 'rb') as f:
        data = f.read()
    input_data = {'data': data}
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input=input_data)

    channel = grpc.insecure_channel(f'localhost:{port}')
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
    for i in range(num_runs):
        try:
            start = datetime.now()
            response = stub.Predictions(request, timeout=float(args.timeout))
            print("request time:", datetime.now() - start)
            print(response)
        except Exception as e:
            print(e)
  • in one shell tab, run the client with a normal timeout (2s) and a very large number of requests, simulating a consistent stream of normal requests:
python3 torchserve_grpc_client.py --port=17100 --timeout=2 -n=100000000
  • in another shell tab, run the client with a very short timeout (0.02s), simulating some heavy/bad requests that will eventually time out:
python3 torchserve_grpc_client.py --port=17100 --timeout=0.02 -n=100

Result

Sometimes it takes torchserve several seconds to several minutes to resolve the issue and tab 1's output goes back to normal, but it is also likely that all workers get stuck forever and tab 1 sees a flood of errors.

@MohamedAliRashad

@hgong-snap This is precisely the problem I faced.

@lxning Any thoughts?

@lxning
Collaborator

lxning commented Sep 16, 2022

@hgong-snap I tried the steps you provided. There is no exception or dead worker thread in ts_log.log.zip. The message "grpc client call already cancelled" was logged when TS was trying to send a response after the client side had already timed out.

The reason tab 1 (timeout=2) recovers slowly is that tab 2 (timeout=0.02) has 100x the request rate of tab 1. That means 100x as many requests from tab 2 accumulate in TS's internal job queue. These jobs are still processed by TS even though the client has already cancelled the connection. This scenario gives you the wrong impression that the worker dies and takes a long time to recover.

Meanwhile, I filed #1863 to optimize TorchServe performance.
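(For context, the queue depth and worker response timeout involved here come from the config.properties posted earlier in this thread; the annotations are mine:)

job_queue_size=500          # number of inference jobs the frontend will queue before rejecting new ones
default_response_timeout=5  # seconds before a backend worker is deemed unresponsive (default 120)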

@hgong-snap
Author

hgong-snap commented Sep 16, 2022

@lxning I updated my script to accept another parameter, --sleep_time, which configures how long the client should wait/sleep between requests.
New client script:

import sys
import grpc
import time
import argparse
import inference_pb2
import inference_pb2_grpc
import management_pb2
import management_pb2_grpc
from datetime import datetime
from google.protobuf import empty_pb2


parser = argparse.ArgumentParser()
parser.add_argument("--timeout", required=True, type=float)
parser.add_argument("--port", required=True, type=int)
parser.add_argument("--sleep_time", default=0, type=float)
parser.add_argument("-n", required=True, type=int)


if __name__ == '__main__':
    args, _ = parser.parse_known_args()
    port = args.port
    num_runs = int(args.n)

    with open('image.jpg', 'rb') as f:
        data = f.read()
    input_data = {'data': data}
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input=input_data)

    channel = grpc.insecure_channel(f'localhost:{port}')
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
    for i in range(num_runs):
        try:
            start = datetime.now()
            response = stub.Predictions(request, timeout=args.timeout)
            print("request time:", datetime.now() - start)
            print(response)
        except Exception as e:
            print(e)
        if args.sleep_time:
            time.sleep(args.sleep_time)

I recorded a video to illustrate the problem. Some good timestamps to look at:

  • 0:10 I send bad requests in tab 2 with
python3 grpc_client/torchserve_grpc_client2.py --port=17100 --timeout=0.02 -n=10 --sleep_time=1

so that QPS=1 now.

  • 0:16 tab 1 starts seeing worker issues, either INTERNAL or DEADLINE_EXCEEDED
  • 0:34 tab 1 recovered (lucky)
  • 0:45 I send bad requests in tab 2 again
  • tab 1 has never recovered since then; at the time of writing tab 1 still has 0 successes (>15 min now).

https://drive.google.com/file/d/1zNlFTX6h2AO_DVQHgwZmbHClKy1zcfOI/view?usp=sharing

@lxning
Collaborator

lxning commented Sep 17, 2022

@hgong-snap Thank you for recording the workflow, I can see the following exception in the video.

io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
        at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
        at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) [model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]

After the PR, the above exception should be gone. Instead, only the warning "grpc client call already cancelled" is shown in the log (e.g. ts_log.log.zip). Could you please check whether you can see such a warning in your log? If not, most likely you are still using the old source code.
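A quick way to check (assuming the default log location):

# the fixed build logs only the WARN line; the old build logs the full StatusRuntimeException stack
grep "call already cancelled" logs/ts_log.log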

Here is an example of building a docker image based on master:

cd docker
./build_image.sh -bt dev

@hgong-snap
Author

hgong-snap commented Sep 17, 2022

@lxning Hmm, I'm pretty sure I built the image from the latest master... but just in case, would you mind uploading the docker image you use so that I can try it on my end? (Ideally with GPU support.)

@lxning
Collaborator

lxning commented Sep 19, 2022

@hgong-snap I verified the master branch on both the local host and the docker nightly build.

You can fetch the torchserve nightly build from https://hub.docker.com/r/pytorch/torchserve-nightly/tags:

  • docker pull pytorch/torchserve-nightly:latest-gpu
  • docker pull pytorch/torchserve-nightly:latest-cpu

@hgong-snap
Author

@lxning Thanks. I tried to build the dev image locally with

./build_image.sh -bt dev -g

After this completed successfully, I ran

docker run -it --rm  pytorch/torchserve:dev-gpu

but it fails with:

Traceback (most recent call last):
  File "/usr/bin/torchserve", line 33, in <module>
    sys.exit(load_entry_point('torchserve==0.6.0b20220919', 'console_scripts', 'torchserve')())
  File "/usr/bin/torchserve", line 22, in importlib_load_entry_point
    for entry_point in distribution(dist_name).entry_points
  File "/usr/lib/python3.8/importlib/metadata.py", line 503, in distribution
    return Distribution.from_name(distribution_name)
  File "/usr/lib/python3.8/importlib/metadata.py", line 177, in from_name
    raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: torchserve

Any hint as to why the torchserve Python package is not installed in the docker image?

@lxning
Collaborator

lxning commented Sep 19, 2022

@hgong-snap The GPU docker image build should specify a CUDA version. For example:
./build_image.sh -bt dev -g -cv cu102

Here is the detailed information about the torchserve docker image build.

Could you please directly pull the torchserve docker nightly build (which is based on cu102) to quickly verify? e.g.
docker pull pytorch/torchserve-nightly:latest-gpu

@hgong-snap
Author

@lxning Verified on pytorch/torchserve-nightly:latest-gpu and the issue seems to be solved!

@lxning
Collaborator

lxning commented Sep 20, 2022

@hgong-snap Great, thank you for the verification.

@hgong-snap
Author

Quick question:

in order to run torchserve with the fix locally (e.g. on a Mac) without docker, will

pip install torchserve-nightly

be sufficient?

@msaroufim
Member

msaroufim commented Sep 21, 2022

Yep. Don't forget to run ts_scripts/install_dependencies first, though, with dev dependencies and GPU support if needed; a sketch is below.
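Something like this (a sketch; the flags are the usual install_dependencies options, adjust or drop --cuda for CPU-only):

# from a torchserve checkout: install dev dependencies (add e.g. --cuda=cu102 for GPU)
python ts_scripts/install_dependencies.py --environment=dev
# then install the nightly wheel
pip install torchserve-nightly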

@arnavmehta7
Contributor

arnavmehta7 commented Jun 22, 2023

Hi, I am still getting this issue.
I have built the image through build.sh
