
CUDA graph support for TRT EP #16081

Merged
merged 18 commits into main from chi/cuda_graph on Jun 21, 2023

Conversation

chilo-ms
Contributor

The CUDA EP already supports CUDA Graph, and we have observed that some models benefit from using CUDA Graph with trtexec. Therefore, this PR enables CUDA Graph support for the TRT EP.

The implementation is based on #9978, with the same constraints as below (a brief usage sketch follows the list):

  • Models with control-flow ops (i.e. If, Loop and Scan ops) are not supported.
  • Usage of CUDA Graphs is limited to models in which all ops (graph nodes) can be partitioned to the TRT EP.
  • The input/output types of the model need to be tensors.
  • Shapes of inputs/outputs cannot change across inference calls.
  • IOBinding is required.
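For reference, here is a minimal sketch of what a session run under these constraints could look like from the Python API, using IOBinding so that input/output buffers stay at fixed device addresses across runs. The provider option name trt_cuda_graph_enable, the model path, and the input/output names and shapes are illustrative assumptions rather than details taken from this PR's diff; check the TensorRT EP provider-options documentation for the exact setting.

```python
import numpy as np
import onnxruntime as ort

providers = [
    # Assumed option name for enabling CUDA Graph on the TRT EP; verify against
    # the TensorRT EP provider-options documentation for your ORT version.
    ("TensorrtExecutionProvider", {"trt_cuda_graph_enable": True}),
    "CUDAExecutionProvider",
]
sess = ort.InferenceSession("model.onnx", providers=providers)  # hypothetical model

# IOBinding keeps inputs/outputs at fixed device addresses, which the captured
# graph replays against on every subsequent run.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)           # hypothetical shape
x_dev = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)
y_dev = ort.OrtValue.ortvalue_from_shape_and_type((1, 1000), np.float32, "cuda", 0)

binding = sess.io_binding()
binding.bind_ortvalue_input("input", x_dev)    # hypothetical input name
binding.bind_ortvalue_output("output", y_dev)  # hypothetical output name

sess.run_with_iobinding(binding)  # first run: the graph is captured
sess.run_with_iobinding(binding)  # later runs: the captured graph is replayed
print(y_dev.numpy()[0, :5])
```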

@hariharans29
Member

hariharans29 commented May 24, 2023

Thanks @chilo-ms.
@tlwu had a couple of fixes for CUDA Graphs (open PRs) - I was wondering if you needed similar fixes while testing your models?

@chilo-ms
Contributor Author

chilo-ms commented May 25, 2023

> Thanks @chilo-ms. @tlwu had a couple of fixes for CUDA Graphs (open PRs) - I was wondering if you needed similar fixes while testing your models?

Yes, I noticed @tianleiwu had a PR (#15005) and I'm checking with him whether he plans to get it merged. Since he is OOF, do you know why the PR is pending?
I will ask our partner from Nvidia to test the model and see whether they encounter the issue that needs the fix.

@tianleiwu
Contributor

> > Thanks @chilo-ms. @tlwu had a couple of fixes for CUDA Graphs (open PRs) - I was wondering if you needed similar fixes while testing your models?
>
> Yes, I noticed @tianleiwu had a PR (#15005) and I'm checking with him whether he plans to get it merged. Since he is OOF, do you know why the PR is pending? I will ask our partner from Nvidia to test the model and see whether they encounter the issue that needs the fix.

The PR (#15005) is ready for review. Please help review it. You may need to resolve conflicts after it is merged. Thanks.

@chilo-ms
Contributor Author

> > > Thanks @chilo-ms. @tlwu had a couple of fixes for CUDA Graphs (open PRs) - I was wondering if you needed similar fixes while testing your models?
> >
> > Yes, I noticed @tianleiwu had a PR (#15005) and I'm checking with him whether he plans to get it merged. Since he is OOF, do you know why the PR is pending? I will ask our partner from Nvidia to test the model and see whether they encounter the issue that needs the fix.
>
> The PR (#15005) is ready for review. Please help review it. You may need to resolve conflicts after it is merged. Thanks.

The PR looks good to me (I did not review the multi-stream part). I will wait for your PR to be merged into main first, then remove the CUDA version macro in my PR and test it again.

@tianleiwu
Contributor

LGTM. Please resolve conflicts.

@chilo-ms chilo-ms merged commit 4e3cff6 into main Jun 21, 2023
88 of 91 checks passed
@chilo-ms chilo-ms deleted the chi/cuda_graph branch June 21, 2023 16:36
fs-eire pushed a commit that referenced this pull request Jan 31, 2024
### Description
This PR extends the graph capture capability to the JS EP, similar to #16081. For the JS EP, however, we do not use CUDA Graph; instead, we record all GPU commands and replay them, which removes most of the CPU overhead and avoids the GPU waiting on the CPU.

mobilenetv2-12 improves from 6 ms to 3.7 ms on an NVIDIA 3090 and from 4.58 ms to 3.38 ms on an Intel A770.

All limitations are similar to those of the CUDA EP:
1. Models with control-flow ops (i.e. If, Loop and Scan ops) are not supported.
2. Usage of graph capture is limited to models in which all ops can be partitioned to the JS EP or CPU EP, with no memory copy between them.
3. Shapes of inputs/outputs cannot change across inference calls.
4. IOBinding is required.

Usage is as follows.
Method 1: specify output buffers explicitly.
```
const sessionOptions = {
  executionProviders: [
    {
      name: "webgpu",
    },
  ],
  enableGraphCapture: true,
};
const session = await ort.InferenceSession.create('./models/mobilenetv2-12.onnx', sessionOptions);

// prepare the inputBuffer/outputBuffer
... ...

const feeds = {
    'input': ort.Tensor.fromGpuBuffer(inputBuffer, { dataType: 'float32', dims })
};

const fetches = {
    'output': ort.Tensor.fromGpuBuffer(outputBuffer, { dataType: 'float32', dims: [1, 1000] })
};

let results = await session.run(feeds, fetches);  // The first run captures the graph.

// update inputBuffer content
... ...
results = await session.run(feeds, fetches);  // The second and subsequent runs replay the captured graph.

... ...
session.release();
```
Method 2: do not specify output buffers explicitly. Internally, when graph capture is enabled, all output locations are set to 'gpu-buffer'.
```
const sessionOptions = {
  executionProviders: [
    {
      name: "webgpu",
    },
  ],
  enableGraphCapture: true,
};
const session = await ort.InferenceSession.create('./models/mobilenetv2-12.onnx', sessionOptions);

// prepare the inputBuffer
... ...

const feeds = {
    'input': ort.Tensor.fromGpuBuffer(inputBuffer, { dataType: 'float32', dims })
};

let results = await session.run(feeds);  // The first run captures the graph.

// update inputBuffer content
... ...
results = await session.run(feeds);  // The second and subsequent runs replay the captured graph.

... ...
session.release();
```
fs-eire pushed a commit that referenced this pull request Mar 15, 2024
fs-eire pushed a commit that referenced this pull request Mar 15, 2024
siweic0 pushed a commit to siweic0/onnxruntime-web that referenced this pull request May 9, 2024