
Support CUDA Graph #9978

Merged · 20 commits · Mar 7, 2022

Conversation

@feihugis (Contributor) commented Dec 9, 2021

Description

This PR adds support for CUDA Graph. This feature can significantly reduce the CPU overhead of calling CUDA APIs by submitting the entire graph to the GPU with a single call to cudaGraphLaunch.

Motivation and Context

  • Why is this change required? What problem does it solve?
    This feature is very helpful for reducing model latency, especially for online inference, where the CPU overhead described above is a bottleneck. For example, it reduced the 95th-percentile latency of a transformer-based online inference model (with 148 million parameters) from 4.3 ms to 2.1 ms.
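For context, the capture-and-replay flow described above can be sketched with the CUDA runtime stream-capture API. This is a minimal illustrative sketch, not this PR's actual implementation: `launch_model_kernels` is a hypothetical placeholder for enqueuing the model's kernels, error handling is elided, and running it requires a CUDA device.

```cpp
#include <cuda_runtime.h>

// Hypothetical helper that enqueues all of the model's kernels on the stream.
void launch_model_kernels(cudaStream_t stream);

void run_with_cuda_graph(int num_inferences) {
  cudaStream_t stream;
  cudaGraph_t graph;
  cudaGraphExec_t instance;
  cudaStreamCreate(&stream);

  // 1. Capture: record every kernel enqueued on the stream into a graph.
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  launch_model_kernels(stream);
  cudaStreamEndCapture(stream, &graph);

  // 2. Instantiate the executable graph once (pre-CUDA-12 signature).
  cudaGraphInstantiate(&instance, graph, nullptr, nullptr, 0);

  // 3. Replay: one cudaGraphLaunch call per inference instead of one
  //    launch call per kernel, removing most of the CPU launch overhead.
  for (int i = 0; i < num_inferences; ++i) {
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);
  }

  cudaGraphExecDestroy(instance);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
}
```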

@hariharans29 (Member)

Hi @feihugis: Thanks for this contribution.

I had a question about the transformer model you are referring to in the description. From what I have read about CUDA Graphs, they cannot support dynamic control flow, meaning any ONNX model that has control flow ops (If, Loop, and Scan) cannot be captured as a CUDA Graph. The same is called out in PyTorch as well (https://pytorch.org/docs/master/notes/cuda.html#constraints), so this is something that needs to be explicitly disallowed in the capture phase. Does the model you see gains for with this feature have no control flow nodes in it?

@weixingzhang (Contributor)

There are two ways of supporting CUDA Graph: 1) using the graph capture APIs, or 2) building the graph directly with APIs such as cudaGraphCreate/cudaGraphAddNode. Since the input of ORT is an ONNX graph, one thought is that the CUDA graph could be built directly in ORT using way #2, based on the ONNX graph, instead of using the capture APIs.

@hariharans29 (Member) commented Jan 27, 2022

I would have thought the graph capture APIs are just way simpler (this PR is essentially that, and PyTorch seems to be using capture as well). Is there any advantage to using (2)?

@feihugis (Contributor, Author)

Thanks @hariharans29 for your review. The two transformer models I tested do not have control flow nodes in them. The constraints in PyTorch apply here as well. If any constraint is violated during capture, either an error will be raised or the captured graph may not produce correct results. The first scenario (raised errors) seems OK; the second scenario seems hard to identify, so it seems better to let users handle it. Control flow ops fall into the second scenario: since CUDA Graph captures the kernels enqueued in the stream, capture may still work but record only one branch, so the results may not be correct for all inputs (similar to PyTorch tracing). Some docs need to be added to explain these limitations. Explicitly disallowing these cases seems too strict; for example, users could capture different graphs for different branches of the control flow.

@feihugis (Contributor, Author)

Thanks @weixingzhang for your suggestions. There is a limitation in the second way: it is not easy to determine the actually executed CUDA kernels from the ONNX graph. For example, for a MatMul node, the actual kernel selected by CUDA will differ for different input dtypes and shapes.

@hariharans29 (Member)

"Explicitly disallowing these cases seems too strict; for example, users could capture different graphs for different branches of the control flow."

But at least it will reduce some associated maintenance overhead, and we wouldn't have to spend time debugging silent errors from executing kernels of the wrong subgraph of a control flow node. As far as I can tell, your design currently only allows capturing one graph per CUDA EP, and even if we did allow capturing multiple graphs per EP, we still wouldn't know which graph instance to execute for a new input (if the model had dynamic control flow nodes).

@feihugis (Contributor, Author)

@hariharans29 Got your point now! I will add a check to explicitly disable the cases that CUDA Graph cannot fully support. Regarding support for multiple CUDA graphs, at the beginning I thought multiple sessions could be created for different graphs, but I have not evaluated that yet and am not sure whether creating multiple sessions would be allowed. And you are right: multiple graphs still could not handle control flow very well.
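The kind of pre-capture check mentioned above could be sketched as follows. This is a simplified illustration over plain dict-based nodes (a hypothetical representation), not this PR's code; a real check would walk the ONNX GraphProto, including the graph-typed attributes of If/Loop/Scan nodes.

```python
# Ops that CUDA Graph capture cannot handle correctly (dynamic control flow).
CONTROL_FLOW_OPS = {"If", "Loop", "Scan"}

def has_control_flow(nodes):
    """Return True if any node (or nested subgraph node) is a control-flow op.

    Each node is a dict with an 'op_type' string and an optional 'subgraphs'
    list of nested node lists (standing in for ONNX graph attributes).
    """
    for node in nodes:
        if node["op_type"] in CONTROL_FLOW_OPS:
            return True
        for sub in node.get("subgraphs", []):
            if has_control_flow(sub):
                return True
    return False
```

For example, a model whose node list is `[{"op_type": "MatMul"}, {"op_type": "Relu"}]` would pass, while one containing an If or Loop node (even nested) would be rejected before capture begins.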

@feihugis feihugis force-pushed the cuda_graph branch 3 times, most recently from 002559b to fdaebfa Compare February 1, 2022 00:12
@feihugis feihugis force-pushed the cuda_graph branch 2 times, most recently from 6495dfa to 1e1c41c Compare February 4, 2022 22:33
@hariharans29 (Member)

/azp run Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@hariharans29 (Member)

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 6 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@hariharans29 (Member)

/azp run onnxruntime-binary-size-checks-ci-pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@hariharans29 (Member) commented Mar 4, 2022

@pranavsharma - Could you please take another look at Fei's recent changes based on the offline discussion ?

onnxruntime/core/session/inference_session.h (outdated review thread)
}

Status ReplayGraph() {
if (cached_execution_provider_for_graph_replay_) {
Contributor

It'll be good to check for IsGraphCaptured here as well.

Contributor Author

ORT_ENFORCE(IsGraphCaptured()); is added.

Member

Doesn't the EP's ReplayGraph() already enforce that? This seems redundant.

onnxruntime/core/session/inference_session.h (outdated review thread)
onnxruntime/core/session/inference_session.cc (outdated review thread)
onnxruntime/core/session/inference_session.cc (outdated review thread)
@feihugis (Contributor, Author) commented Mar 4, 2022

@hariharans29 @pranavsharma Thanks for the review! The comments have been addressed.

@pranavsharma (Contributor) left a comment

LGTM 👍 @hariharans29 I believe you're going to have a follow-up PR with some documentation?

@hariharans29 (Member) commented Mar 5, 2022

Yes, I was waiting until the PR was ready to be merged (as the design was in a state of constant flux). I will add the documentation next.

@hariharans29 (Member)

/azp run Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@azure-pipelines

Azure Pipelines successfully started running 6 pipeline(s).

@hariharans29 (Member)

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@hariharans29 (Member)

/azp run onnxruntime-binary-size-checks-ci-pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@hariharans29 hariharans29 merged commit 60acfd3 into microsoft:master Mar 7, 2022
lavanyax pushed a commit to intel/onnxruntime that referenced this pull request Mar 29, 2022
chilo-ms added a commit that referenced this pull request Jun 21, 2023
CUDA EP already supports [CUDA graph](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs), and we observed that some models can benefit from using CUDA graph with `trtexec`. Therefore, this PR enables CUDA graph support for the TRT EP.

The implementation is based on #9978, with the same constraints as below:

- Models with control-flow ops (i.e. If, Loop, and Scan ops) are not supported.
- Usage of CUDA graphs is limited to models wherein all the model ops (graph nodes) can be partitioned to the TRT EP.
- The input/output types of the model need to be tensors.
- Shapes of inputs/outputs cannot change across inference calls.
- IOBinding is required.
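The IOBinding requirement above follows from replay semantics: inputs and outputs must live at fixed GPU addresses so the captured graph can be replayed with updated contents. A hedged usage sketch against the ONNX Runtime Python API is below; the model path and the input/output names (`input_ids`, `logits`) and shapes are made-up placeholders, and running it requires a CUDA-capable onnxruntime-gpu build.

```python
import numpy as np
import onnxruntime as ort

# Enable CUDA graph capture/replay via the CUDA EP provider option.
providers = [("CUDAExecutionProvider", {"enable_cuda_graph": "1"})]
sess = ort.InferenceSession("model.onnx", providers=providers)

# Pin inputs/outputs to fixed GPU buffers via IOBinding.
x = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 128), dtype=np.int64), "cuda", 0)
y = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 10), dtype=np.float32), "cuda", 0)
binding = sess.io_binding()
binding.bind_ortvalue_input("input_ids", x)
binding.bind_ortvalue_output("logits", y)

# The first run captures the CUDA graph; subsequent runs replay it.
sess.run_with_iobinding(binding)

# Update input contents in place (same shape, same address) and replay.
x.update_inplace(np.ones((1, 128), dtype=np.int64))
sess.run_with_iobinding(binding)
result = y.numpy()
```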