[Core][Distributed] fix __del__ exception #4488

Closed

youkaichao
Member

Exception warnings found in main branch:

https://buildkite.com/vllm/ci/builds/6086#018f2c52-88b7-4d39-a948-1529eca365e8

tests/distributed/test_chunked_prefill_distributed.py::test_models[16-5-half-meta-llama/Llama-2-7b-hf]
/usr/local/lib/python3.10/dist-packages/_pytest/unraisableexception.py:80: PytestUnraisableExceptionWarning: Exception ignored in: <function NCCLCommunicator.__del__ at 0x7f6ce2d739a0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 285, in __del__
    dist.destroy_process_group()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1656, in destroy_process_group
    assert pg is not None
AssertionError

Try to see if this fixes the exception.

@youkaichao
Member Author

@simon-mo @zhuohan123 this is even trickier than I thought. After I removed the code that raised the exception, the test hung for half an hour: https://buildkite.com/vllm/ci/builds/6120#018f2d8c-dc83-4595-bd26-68eb581d89fa

It is pure coincidence that the current CI passes: the exception happens before the deadlock can :(

I will try to figure out what happened ...

@youkaichao youkaichao closed this Apr 30, 2024
@youkaichao youkaichao deleted the fix_del_exception branch April 30, 2024 23:42