Pytorch workers keep crashing if master is not up yet. #125
Comments
@TimZaman you are right. Workers need the master to be up before they start. It might cause the workers to restart a few times during startup, but it shouldn't cause any issues. Have you seen any side effects because of this? |
Thanks for your response! Nice work on this KF project; I might try to contribute a bit. You are right: for our examples I don't think there would be any side effect, other than that it's not ideal that they crash, of course (crash reports, logs, listed restarts, wasted cycles). I was assuming this is where the dedicated pytorch-operator controller would come into play. We cannot look into other people's code, and their workers might do things before the distributed initialisation happens, so ideally we tackle this somehow. |
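(For reference, below is a sketch of one possible user-side workaround: blocking at the top of the worker script until the master is reachable, so no user code runs beforehand. The `wait_for_master` helper, its timeout values, and the reliance on `MASTER_ADDR`/`MASTER_PORT` being injected into the worker's environment are assumptions for illustration, not something provided by the operator or by PyTorch.)

```python
import os
import socket
import time

# Hypothetical user-side workaround (not part of pytorch-operator): block at the
# top of the worker script until the master pod accepts TCP connections, so no
# user code runs before the distributed master is reachable.
def wait_for_master(timeout_s: float = 300.0, poll_s: float = 5.0) -> None:
    addr = os.environ["MASTER_ADDR"]    # assumed to be injected into the pod
    port = int(os.environ["MASTER_PORT"])
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            # Try to open (and immediately close) a TCP connection to the master.
            with socket.create_connection((addr, port), timeout=poll_s):
                return
        except OSError:
            time.sleep(poll_s)
    raise TimeoutError(f"master {addr}:{port} not reachable after {timeout_s}s")
```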
@TimZaman Thanks for your interest and contributions. This issue can be taken care of; I don't think there is any problem in handling it. I will have a look at it. |
Adding label /area 0.5.0 |
@TimZaman Fix is merged |
Close this issue? |
Hi guys, |
Can you create a separate issue? |
I've been observing this for a while now, and I'm now confident it consistently happens (for me):
A) If the master is up first, things go well and no pods crash.
B) If the master is not up first, the worker pods keep crashing with the error below until the master pod is up, and then things run fine. See below:
The reason this occurs is simply that PyTorch workers require the master to be up so they can connect to it. If they cannot connect to the master, they die; this is intended behaviour (and has nothing to do with the pytorch-operator or K8s). However, I would expect the pytorch operator to handle this correctly and bring the master up before the others.
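For context, here is a minimal sketch of the rendezvous a worker performs on startup, assuming the common env:// initialisation where PyTorch reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment; the backend choice is illustrative. If the master pod is not listening yet, the init call fails and the worker process exits, which is exactly the crash loop described above.

```python
import os
import torch.distributed as dist

# Minimal sketch of a worker's startup rendezvous (env:// initialisation).
# Every process, workers included, connects to the store hosted by rank 0 at
# MASTER_ADDR:MASTER_PORT; if the master pod is not up yet, this call fails
# and the worker process exits.
def init_distributed() -> None:
    dist.init_process_group(
        backend="gloo",            # illustrative; "nccl" is typical for GPU jobs
        init_method="env://",      # reads MASTER_ADDR / MASTER_PORT from the env
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )

if __name__ == "__main__":
    init_distributed()
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} joined the job")
```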