Inaccurate request handling when configuring queue policy #5783

Closed
wurthel opened this issue May 13, 2023 · 1 comment · Fixed by triton-inference-server/core#237

wurthel commented May 13, 2023

Description
When I configure the model's queue policy by setting the "default_timeout_microseconds" or "max_queue_size" parameters, Triton does not handle requests the way I expect.

Triton Information
Docker image: nvcr.io/nvidia/tritonserver:23.03-py3

To Reproduce

Model
import time

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def execute(self, requests):
        logger = pb_utils.Logger
        logger.log_info(f"got {len(requests)} requests")

        responses = []
        for request in requests:
            logger.log_info(f"processing request #{request.request_id()}")
            inp = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()

            # emulate some work
            time.sleep(2)

            output_tensors = [pb_utils.Tensor("OUTPUT0", inp.astype(np.float32))]
            inference_response = pb_utils.InferenceResponse(output_tensors=output_tensors)
            logger.log_info(f"request #{request.request_id()} processed")
            responses.append(inference_response)

        return responses
Model Config
backend: "python"
name: "simple_model"
input [{
  name: "INPUT0"
  data_type: TYPE_FP32
  dims: [ 4 ]
}]
output [{
  name: "OUTPUT0"
  data_type: TYPE_FP32
  dims: [ 4 ]
}]
instance_group [{
  count: 1
  kind: KIND_CPU
}]
dynamic_batching {
  default_queue_policy {
    timeout_action: REJECT
    default_timeout_microseconds: 1000000
  }
}
Test Case
import concurrent
import logging
import time
from concurrent.futures import Future
from functools import partial

import numpy as np
import pytest
from tritonclient.grpc import InferenceServerClient, InferInput
from tritonclient.utils import InferenceServerException, np_to_triton_dtype

logger = logging.getLogger(__name__)


@pytest.fixture(scope="session")
def client():
    return InferenceServerClient("localhost:8001")


@pytest.fixture()
def inputs():
    xs = np.zeros(4).astype(np.float32)
    inp = InferInput("INPUT0", xs.shape, np_to_triton_dtype(xs.dtype))
    inp.set_data_from_numpy(xs)
    return [inp]


def callback(future, idx, result, error):
    if result is not None:
        logger.debug(f"[ID {idx}] got result: {result}")
        future.set_result(result)
    else:
        logger.debug(f"[ID {idx}] got exception: {error}")
        future.set_exception(error)


def test_timeout(client, inputs):
    num_requests = 4
    futures = []
    for idx in range(num_requests):
        f = Future()
        client.async_infer(
            model_name="simple_model",
            inputs=inputs,
            request_id=str(idx),
            callback=partial(callback, future=f, idx=idx),
        )
        futures.append(f)
        time.sleep(0.01)

    concurrent.futures.wait(futures)

    assert futures[0].result() is not None
    for f in futures[1:]:
        with pytest.raises(InferenceServerException, match="Request timeout expired"):
            f.result()

Expected behavior
Since

  • the timeout is set to 1 second
  • I send 4 requests almost simultaneously (10 ms apart)
  • each request takes 2 seconds to be processed

I expect:

  • only the first request will be processed successfully
  • the other 3 requests will be rejected due to the "Request timeout expired" error

but I get:

  • the first 3 requests are processed
  • the last request is rejected
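
For reference, with a single model instance and the 1 s queue timeout above, the timeline I expect is roughly:

  • t ≈ 0 s: all 4 requests arrive; request 0 is dequeued and starts executing (~2 s of work)
  • t ≈ 1 s: requests 1, 2 and 3 have now waited in the queue longer than "default_timeout_microseconds", so they should be rejected
  • t ≈ 2 s: request 0 completes successfully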

Logs produced by the test case:

pytest test_python_model.py --log-cli-level=DEBUG
---------------------------------------------------------------------------- Captured log call ----------------------------------------------------------------------------
DEBUG    test_python_model:test_python_model.py:30 [ID 0] got result: <tritonclient.grpc.InferResult object at 0x1155c4220>
DEBUG    test_python_model:test_python_model.py:33 [ID 3] got exception: [StatusCode.UNAVAILABLE] Request timeout expired
DEBUG    test_python_model:test_python_model.py:30 [ID 1] got result: <tritonclient.grpc.InferResult object at 0x1155b8340>
DEBUG    test_python_model:test_python_model.py:30 [ID 2] got result: <tritonclient.grpc.InferResult object at 0x1155b8a30>
========================================================================= short test summary info =========================================================================
FAILED test_python_model.py::test_timeout - Failed: DID NOT RAISE <class 'tritonclient.utils.InferenceServerException'>

I am experiencing a similar issue with another queue policy option. Let's modify the configuration I provided above by setting "max_queue_size":

...
dynamic_batching {
  default_queue_policy {
    timeout_action: REJECT
    max_queue_size: 1
  }
}

So, even though the maximum queue size is 1, if I send 4 requests they are all processed. If I send 5 requests, only the last one is rejected with the "Exceeds maximum queue size" error.
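
A minimal sketch of a test for this case, appended to the same test file as above (it reuses its imports, fixtures, and callback; the test name is mine, and the assertions encode the behavior I currently observe rather than what I would expect):

def test_max_queue_size(client, inputs):
    num_requests = 5
    futures = []
    for idx in range(num_requests):
        f = Future()
        client.async_infer(
            model_name="simple_model",
            inputs=inputs,
            request_id=str(idx),
            callback=partial(callback, future=f, idx=idx),
        )
        futures.append(f)
        time.sleep(0.01)

    concurrent.futures.wait(futures)

    # Observed: with max_queue_size: 1 the first 4 requests still succeed,
    # and only the 5th is rejected.
    for f in futures[:-1]:
        assert f.result() is not None
    with pytest.raises(InferenceServerException, match="Exceeds maximum queue size"):
        futures[-1].result()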

Quick investigation

System monitoring shows 3 additional threads running under the main triton_python_backend_stub process:

- /opt/tritonserver/backends/python/triton_python_backend_stub ... simple_model_0
| - /opt/tritonserver/backends/python/triton_python_backend_stub ... simple_model_0
| - /opt/tritonserver/backends/python/triton_python_backend_stub ... simple_model_0
| - /opt/tritonserver/backends/python/triton_python_backend_stub ... simple_model_0

I am not completely sure, but I believe this might explain the current behavior: in the "test_timeout" test case, when I send 4 requests, the first 3 are taken by those workers and only one is kept in the queue.

When I set "max_queue_size", something similar happens: 3 requests are taken by those workers, 1 is kept in the queue, and the 5th request is rejected because the queue is full.
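
One way to check this hypothesis (just a sketch, I have not fully verified it) is to log a wall-clock timestamp at the top of execute() in the model above; the server log would then show whether requests #1 and #2 only start executing after their 1 s queue timeout has already expired:

    def execute(self, requests):
        logger = pb_utils.Logger
        # If the hypothesis is right, requests #1 and #2 should only reach this
        # point well after their 1 s queue timeout has already expired, i.e.
        # they were taken out of the timeout-checked queue early.
        logger.log_info(
            f"execute() called at {time.time():.3f} with request ids "
            f"{[r.request_id() for r in requests]}"
        )
        # ... the rest is identical to the execute() shown under "Model" above ...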
