Inaccurate request handling when configuring queue policy #5783

Closed
wurthel opened this issue May 13, 2023 · 1 comment · Fixed by triton-inference-server/core#237

wurthel commented May 13, 2023

Description
When I configure the model's queue policy by setting the "default_timeout_microseconds" or "max_queue_size" parameters, Triton does not handle requests the way I expect.

Triton Information
Docker image: nvcr.io/nvidia/tritonserver:23.03-py3

To Reproduce

Model
import time

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def execute(self, requests):
        logger = pb_utils.Logger
        logger.log_info(f"got {len(requests)} requests")

        responses = []
        for request in requests:
            logger.log_info(f"processing request #{request.request_id()}")
            inp = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()

            # emulate some work
            time.sleep(2)

            output_tensors = [pb_utils.Tensor("OUTPUT0", inp.astype(np.float32))]
            inference_response = pb_utils.InferenceResponse(output_tensors=output_tensors)
            logger.log_info(f"request #{request.request_id()} processed")
            responses.append(inference_response)

        return responses
Model Config
backend: "python"
name: "simple_model"
input [{
  name: "INPUT0"
  data_type: TYPE_FP32
  dims: [ 4 ]
}]
output [{
  name: "OUTPUT0"
  data_type: TYPE_FP32
  dims: [ 4 ]
}]
instance_group [{
  count: 1
  kind: KIND_CPU
}]
dynamic_batching {
  default_queue_policy {
    timeout_action: REJECT
    default_timeout_microseconds: 1000000
  }
}
Test Case
import concurrent
import logging
import time
from concurrent.futures import Future
from functools import partial

import numpy as np
import pytest
from tritonclient.grpc import InferenceServerClient, InferInput
from tritonclient.utils import InferenceServerException, np_to_triton_dtype

logger = logging.getLogger(__name__)


@pytest.fixture(scope="session")
def client():
    return InferenceServerClient("localhost:8001")


@pytest.fixture()
def inputs():
    xs = np.zeros(4).astype(np.float32)
    inp = InferInput("INPUT0", xs.shape, np_to_triton_dtype(xs.dtype))
    inp.set_data_from_numpy(xs)
    return [inp]


def callback(future, idx, result, error):
    if result is not None:
        logger.debug(f"[ID {idx}] got result: {result}")
        future.set_result(result)
    else:
        logger.debug(f"[ID {idx}] got exception: {error}")
        future.set_exception(error)


def test_timeout(client, inputs):
    num_requests = 4
    futures = []
    for idx in range(num_requests):
        f = Future()
        client.async_infer(
            model_name="simple_model",
            inputs=inputs,
            request_id=str(idx),
            callback=partial(callback, future=f, idx=idx),
        )
        futures.append(f)
        time.sleep(0.01)

    concurrent.futures.wait(futures)

    assert futures[0].result() is not None
    for f in futures[1:]:
        with pytest.raises(InferenceServerException, match="Request timeout expired"):
            f.result()

Expected behavior
Since

  • the timeout is set to 1 second
  • I send 4 requests almost simultaneously (10 ms apart)
  • each request takes 2 seconds to be processed

I expect:

  • only the first request will be processed successfully
  • the other 3 requests will be rejected due to the "Request timeout expired" error

but I get:

  • the first 3 requests are processed
  • the last request is rejected
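
For reference, with a single model instance and the 1 s queue timeout above, the timeline I expect is roughly:

  • t ≈ 0 s: all 4 requests arrive; request 0 is dequeued and starts executing (~2 s of work)
  • t ≈ 1 s: requests 1, 2 and 3 have now waited in the queue longer than "default_timeout_microseconds", so they should be rejected
  • t ≈ 2 s: request 0 completes successfully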

Logs produced by the test case:

pytest test_python_model.py --log-cli-level=DEBUG
---------------------------------------------------------------------------- Captured log call ----------------------------------------------------------------------------
DEBUG    test_python_model:test_python_model.py:30 [ID 0] got result: <tritonclient.grpc.InferResult object at 0x1155c4220>
DEBUG    test_python_model:test_python_model.py:33 [ID 3] got exception: [StatusCode.UNAVAILABLE] Request timeout expired
DEBUG    test_python_model:test_python_model.py:30 [ID 1] got result: <tritonclient.grpc.InferResult object at 0x1155b8340>
DEBUG    test_python_model:test_python_model.py:30 [ID 2] got result: <tritonclient.grpc.InferResult object at 0x1155b8a30>
========================================================================= short test summary info =========================================================================
FAILED test_python_model.py::test_timeout - Failed: DID NOT RAISE <class 'tritonclient.utils.InferenceServerException'>

I am experiencing a similar issue with another queue policy option. Let's modify the configuration I provided above by setting "max_queue_size":

...
dynamic_batching {
  default_queue_policy {
    timeout_action: REJECT
    max_queue_size: 1
  }
}

So, even though the maximum queue size is 1, if I send 4 requests they are all processed. If I send 5 requests, only the last one is rejected with the "Exceeds maximum queue size" error.
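
A minimal sketch of a test for this case, appended to the same test file as above (it reuses its imports, fixtures, and callback; the test name is mine, and the assertions encode the behavior I currently observe rather than what I would expect):

def test_max_queue_size(client, inputs):
    num_requests = 5
    futures = []
    for idx in range(num_requests):
        f = Future()
        client.async_infer(
            model_name="simple_model",
            inputs=inputs,
            request_id=str(idx),
            callback=partial(callback, future=f, idx=idx),
        )
        futures.append(f)
        time.sleep(0.01)

    concurrent.futures.wait(futures)

    # Observed: with max_queue_size: 1 the first 4 requests still succeed,
    # and only the 5th is rejected.
    for f in futures[:-1]:
        assert f.result() is not None
    with pytest.raises(InferenceServerException, match="Exceeds maximum queue size"):
        futures[-1].result()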

Quick investigation

System monitoring shows 3 additional threads running under the main triton_python_backend_stub process:

- /opt/tritonserver/backends/python/triton_python_backend_stub ... simple_model_0
| - /opt/tritonserver/backends/python/triton_python_backend_stub ... simple_model_0
| - /opt/tritonserver/backends/python/triton_python_backend_stub ... simple_model_0
| - /opt/tritonserver/backends/python/triton_python_backend_stub ... simple_model_0

I am not completely sure, but I believe this might explain the current behavior: in the "test_timeout" test case, when I send 4 requests, the first 3 are taken by those workers and only one is kept in the queue.

When I set "max_queue_size", something similar happens: 3 requests are taken by those workers, 1 is kept in the queue, and the 5th request is rejected because the queue is full.
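
One way to check this hypothesis (just a sketch, I have not fully verified it) is to log a wall-clock timestamp at the top of execute() in the model above; the server log would then show whether requests #1 and #2 only start executing after their 1 s queue timeout has already expired:

    def execute(self, requests):
        logger = pb_utils.Logger
        # If the hypothesis is right, requests #1 and #2 should only reach this
        # point well after their 1 s queue timeout has already expired, i.e.
        # they were taken out of the timeout-checked queue early.
        logger.log_info(
            f"execute() called at {time.time():.3f} with request ids "
            f"{[r.request_id() for r in requests]}"
        )
        # ... the rest is identical to the execute() shown under "Model" above ...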
