Fix cuda graph capture #15005
Conversation
Force-pushed from efef6b1 to a5f3810.
I think the default Arena setting masks it. Let me change the Arena setting in the test case, and it should be able to reproduce the bug.
Force-pushed from e4f89f0 to 12cbbd0.
It is due to the Arena's default setting of a 1 MB initial buffer combined with the NextPowerOfTwo extend strategy, which can leave extra memory that covers small allocations. I updated the Arena setting to use SameAsRequested, and also used a larger batch size so that 1 MB is not enough. Now the new test reproduces the bug.
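The interaction described above can be illustrated with a toy model. This is a hypothetical simplification of a BFC-style arena, not ONNX Runtime's actual allocator; the function name and strategy strings are assumptions for illustration only:

```python
def arena_device_allocs(requests, initial_bytes, strategy):
    """Count how many device allocations (cudaMalloc stand-ins) a toy
    bump-style arena would make for a sequence of allocation requests.
    Hypothetical simplification of an arena allocator."""
    capacity = initial_bytes
    used = 0
    device_allocs = 0
    for nbytes in requests:
        if used + nbytes > capacity:
            if strategy == "kNextPowerOfTwo":
                # round the extension up to the next power of two
                grow = 1 << (nbytes - 1).bit_length()
            else:  # "kSameAsRequested": extend by exactly the requested size
                grow = nbytes
            capacity += grow
            device_allocs += 1  # this is the call that is illegal mid-capture
        used += nbytes
    return device_allocs

# A generous 1 MB initial buffer absorbs small second-run allocations,
# so no device allocation happens during capture and the bug stays hidden:
assert arena_device_allocs([4096, 4096], 1 << 20, "kNextPowerOfTwo") == 0
# SameAsRequested plus a large batch forces a fresh device allocation:
assert arena_device_allocs([1 << 20, 2 << 20], 1 << 20, "kSameAsRequested") >= 1
```

This is why the updated test needs both the SameAsRequested strategy and a batch large enough to exceed the initial buffer: either condition alone can let the allocation be served from existing arena memory.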
Thanks for catching this.
LGTM for the core changes. Didn't review the multi-stream specific changes.
Description
Fix two issues related to cuda graph capture: #14942 and #15002
Issue 1: Previously, graph capture started at the second run. However, memory pattern optimization allocates memory during the second run, and cudaMalloc is not allowed during graph capture. In this PR, graph capture starts after two runs to avoid the issue.
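The run-index arithmetic behind this fix can be sketched as follows. The constant and function names here are hypothetical, used only to show why one warmup run was not enough:

```python
# Memory-pattern optimization performs its allocation on the 2nd run
# (index 1, 0-indexed), as described in the PR.
MEMORY_PATTERN_ALLOC_RUN = 1

def first_capture_run(min_regular_runs_before_capture):
    # Runs 0 .. min-1 are normal warmup runs; capture begins on run `min`.
    return min_regular_runs_before_capture

# Old behavior: capturing on the 2nd run collides with the
# memory-pattern allocation (cudaMalloc during capture -> failure).
assert first_capture_run(1) == MEMORY_PATTERN_ALLOC_RUN
# New behavior: two warmup runs let the allocation finish before capture.
assert first_capture_run(2) > MEMORY_PATTERN_ALLOC_RUN
```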
Issue 2: #13495 introduced multi-stream support, but stream cleanup calls cudaStreamSynchronize, which is not allowed during cuda graph capture. In this PR, we move stream cleanup to after cuda graph capture.
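The deferral can be sketched as a small state machine. This is a sketch, not the actual ORT implementation; the class and method names are hypothetical, and cudaStreamSynchronize is modeled as appending to a list:

```python
class CaptureAwareCleanup:
    """Defer stream cleanup while a cuda graph capture is in progress,
    since cudaStreamSynchronize is illegal between Begin/EndCapture."""
    def __init__(self):
        self.capturing = False
        self.deferred = []       # streams whose cleanup was postponed
        self.synchronized = []   # streams actually synchronized (cleaned up)

    def cleanup_stream(self, stream_id):
        if self.capturing:
            # postpone: synchronizing now would abort the capture
            self.deferred.append(stream_id)
        else:
            self.synchronized.append(stream_id)

    def end_capture(self):
        self.capturing = False
        # run the cleanups that were postponed during capture
        self.synchronized.extend(self.deferred)
        self.deferred.clear()

sess = CaptureAwareCleanup()
sess.capturing = True
sess.cleanup_stream(0)            # deferred, no synchronize mid-capture
assert sess.synchronized == []
sess.end_capture()                # deferred cleanup runs after capture ends
assert sess.synchronized == [0]
```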
Update the SqueezeNet test model with a dynamic axis so that we can test with a larger batch size. Add a test that reproduces the bug (when changing the minimum number of runs from 2 back to 1).
Motivation and Context