Fix cuda graph capture #15005

Merged · tianleiwu merged 5 commits into main from tlwu/fix_cuda_graph on Jun 15, 2023
Conversation

tianleiwu (Contributor) commented Mar 11, 2023

Description

Fix two issues related to CUDA graph capture: #14942 and #15002.

Issue 1: Previously, graph capture started at the second run. However, memory pattern optimization also allocates memory during the second run, and cudaMalloc is not allowed during graph capture. In this PR, graph capture starts only after two runs to avoid the issue.
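To illustrate the constraint, here is a minimal, hypothetical CUDA sketch (the `RunOnce` helper stands in for one inference run; this is not ORT's actual code): a cudaMalloc issued between cudaStreamBeginCapture and cudaStreamEndCapture invalidates the capture, so capture must begin only after the allocation-heavy warm-up runs.

```cpp
#include <cuda_runtime.h>

__global__ void Step(float* p) { p[threadIdx.x] += 1.0f; }

// Hypothetical stand-in for one inference run (a real run launches many kernels).
void RunOnce(float* buf, cudaStream_t stream) {
  Step<<<1, 32, 0, stream>>>(buf);
}

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  float* buf = nullptr;
  cudaMalloc(&buf, 32 * sizeof(float));  // allocating during run 1 is fine

  RunOnce(buf, stream);            // run 1: memory pattern is being planned
  RunOnce(buf, stream);            // run 2: pattern optimization may cudaMalloc
  cudaStreamSynchronize(stream);   // allocations have settled by now

  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  RunOnce(buf, stream);            // run 3: safe to capture; a cudaMalloc here
                                   // would fail the capture
  cudaStreamEndCapture(stream, &graph);

  cudaGraphDestroy(graph);
  cudaFree(buf);
  cudaStreamDestroy(stream);
  return 0;
}
```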

Issue 2: #13495 introduced multi-stream support, but stream cleanup calls cudaStreamSynchronize, which is not allowed during CUDA graph capture. In this PR, we move stream cleanup to after CUDA graph capture.
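A similar sketch of the ordering fix (the `CleanupStream` helper is hypothetical, not the actual ORT cleanup path): synchronization calls such as cudaStreamSynchronize are rejected while a global-mode capture is in progress, so cleanup must be deferred until after cudaStreamEndCapture.

```cpp
#include <cuda_runtime.h>

__global__ void Work(int* p) { *p += 1; }

// Hypothetical stand-in for ORT's per-run stream cleanup.
void CleanupStream(cudaStream_t s) {
  cudaStreamSynchronize(s);  // illegal while a global-mode capture is active
  cudaStreamDestroy(s);
}

int main() {
  cudaStream_t main_stream, helper_stream;
  cudaStreamCreate(&main_stream);
  cudaStreamCreate(&helper_stream);
  int* p = nullptr;
  cudaMalloc(&p, sizeof(int));

  cudaGraph_t graph;
  cudaStreamBeginCapture(main_stream, cudaStreamCaptureModeGlobal);
  Work<<<1, 1, 0, main_stream>>>(p);
  // CleanupStream(helper_stream);  // WRONG here: the synchronize would
  //                                // invalidate the capture in progress
  cudaStreamEndCapture(main_stream, &graph);

  CleanupStream(helper_stream);     // OK: moved after capture, as this PR does

  cudaGraphDestroy(graph);
  cudaFree(p);
  cudaStreamDestroy(main_stream);
  return 0;
}
```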

Also update the SqueezeNet test model with a dynamic axis so that we can test with a larger batch size, and add a test that reproduces the bug (when changing the minimum number of runs from 2 back to 1).

tianleiwu marked this pull request as draft on March 11, 2023 02:22
hariharans29 (Member) commented Mar 14, 2023

> However, memory pattern optimization will allocate memory from the second run

If memory pattern optimization will always allocate memory on the second run, why doesn't this reproduce universally for all models? (I seem to recall we had a C++ test for a simple model, and this issue never happened there.) Is it possible that, while memory pattern optimization always kicks in (based on the session option), it only triggers an arena extension for some models, based on the peak memory usage identified by the planner in the first run and the arena's state at that point in time?

tianleiwu (Contributor, Author) commented Mar 14, 2023

> If memory pattern optimization will always allocate memory on the second run, why doesn't this reproduce universally for all models? […]

I think the default arena settings are why it doesn't reproduce there. Let me change the arena settings in the test case; then it should be able to reproduce.

tianleiwu requested a review from a team as a code owner on June 13, 2023 05:29
tianleiwu marked this pull request as draft on June 13, 2023 05:31
tianleiwu marked this pull request as ready for review on June 13, 2023 16:09
tianleiwu (Contributor, Author) commented

> Is it possible that, while memory pattern optimization always kicks in (based on the session option), it only triggers an arena extension for some models, based on the peak memory usage identified by the planner in the first run and the arena's state at that point in time?

It is due to the arena's default initial buffer of 1 MB and the kNextPowerOfTwo extend strategy, which can leave extra memory that covers the small allocations. I updated the arena settings to use kSameAsRequested and a larger batch size, so that 1 MB is not enough. Now the new test can reproduce the bug.
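For reference, a sketch of how such a session could be configured through the ONNX Runtime C++ API, assuming the CUDA EP provider-option keys `enable_cuda_graph` and `arena_extend_strategy`; the model filename is a placeholder for the updated SqueezeNet test model:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  const OrtApi& api = Ort::GetApi();

  // Configure the CUDA execution provider via V2 options.
  OrtCUDAProviderOptionsV2* cuda_options = nullptr;
  Ort::ThrowOnError(api.CreateCUDAProviderOptions(&cuda_options));
  const char* keys[] = {"enable_cuda_graph", "arena_extend_strategy"};
  // kSameAsRequested: the arena grows by exactly the requested size, so a
  // batch needing more than the 1 MB initial buffer forces an extension.
  const char* values[] = {"1", "kSameAsRequested"};
  Ort::ThrowOnError(api.UpdateCUDAProviderOptions(cuda_options, keys, values, 2));

  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "cuda_graph_test");
  Ort::SessionOptions session_options;
  session_options.AppendExecutionProvider_CUDA_V2(*cuda_options);
  api.ReleaseCUDAProviderOptions(cuda_options);

  // "squeezenet_dynamic.onnx" is a placeholder for the test model with a
  // dynamic batch dimension.
  Ort::Session session(env, "squeezenet_dynamic.onnx", session_options);
  // ... bind large-batch inputs/outputs on GPU and call session.Run() ...
  return 0;
}
```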

hariharans29 (Member) commented

Thanks for catching this.

hariharans29 (Member) left a review comment

LGTM for the core changes. Didn't review the multi-stream specific changes.

tianleiwu merged commit 9be1332 into main on Jun 15, 2023
tianleiwu deleted the tlwu/fix_cuda_graph branch on June 15, 2023 01:10