
Pytorch workers keep crashing if master is not up yet. #125

Closed
TimZaman opened this issue Jan 11, 2019 · 8 comments

Comments

@TimZaman
Contributor

I've been observing this for a while, and I'm now confident it consistently happens (for me):

  • Create a pytorch job:
    A) If the master is up first, things go well and no pods crash.
    B) If the master is not up first, the worker pods keep crashing with the error below until the master pod is up, after which things run fine. See below:
 kubectl logs optimizer-worker-0
2019-01-11 05:43:30,310 INFO     main(rmq_host=rmq.default.svc.cluster.local, rmq_port=5672, batch_size=12)
2019-01-11 05:43:30,310 INFO     init_distribution
Traceback (most recent call last):
  File "optimizer.py", line 459, in <module>
    pretrained_model=args.pretrained_model,
  File "optimizer.py", line 422, in main
    init_distribution()
  File "optimizer.py", line 413, in init_distribution
    torch.distributed.init_process_group(backend=backend)
  File "/root/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 354, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/root/.local/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, start_daemon)
ValueError: host not found: Name or service not known

The reason this occurs is simply that PyTorch workers require the master to be up so they can connect to it. If they cannot connect to the master, they die; this is intended behaviour (and has nothing to do with the pytorch operator or K8s). However, I would expect the pytorch operator to handle this correctly and bring the master up before the others.
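
In the meantime, a worker-side workaround is simply to retry the rendezvous until the master is reachable. The snippet below is only a rough sketch of that idea (not code from this repo): it assumes the env:// rendezvous the operator configures (MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE in the environment), and the attempt count and delay are arbitrary:

import logging
import time

import torch.distributed as dist

def init_distribution_with_retry(backend="gloo", max_attempts=30, delay_s=10):
    # Retry until the master's hostname resolves and accepts connections.
    # A failed attempt raises before the process group is initialised,
    # so calling init_process_group again is safe.
    for attempt in range(1, max_attempts + 1):
        try:
            dist.init_process_group(backend=backend)  # env:// rendezvous by default
            return
        except (ValueError, RuntimeError) as exc:  # e.g. "host not found", connection refused
            logging.info("rendezvous attempt %d/%d failed: %s", attempt, max_attempts, exc)
            time.sleep(delay_s)
    raise RuntimeError("master never became reachable")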

@johnugeorge
Member

@TimZaman you are right. Workers need the master to be up before they start. This might cause workers to restart a few times during startup, but it shouldn't cause any issue. Have you seen any side effects because of this?

@TimZaman
Contributor Author

Thanks for your response! Nice work on this KF project; I might try to contribute a bit.

You are right: for our examples I don't think there would be any side effect, other than that it's not ideal for them to crash, of course (crash reports, noisy logs, listed restarts, wasted cycles). I was assuming this is where the dedicated pytorch operator controller would come in. We cannot look into other people's code, and their workers might do things before the distributed initialisation happens, so ideally we tackle this somehow.
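
To make that concrete, the kind of guard I have in mind is a tiny wait-for-master step that runs before the user's code ever touches torch.distributed, e.g. something the operator could run ahead of the worker entrypoint. Purely a sketch under my own assumptions (MASTER_ADDR pointing at the master's Service name, made-up timeouts), not a proposal for the actual implementation:

import os
import socket
import sys
import time

def wait_for_master(timeout_s=300, poll_s=5):
    # Block until the master's DNS name resolves, so user code never sees
    # "host not found" from the env:// rendezvous.
    master = os.environ["MASTER_ADDR"]  # assumed to be set by the operator on every replica
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            socket.getaddrinfo(master, None)
            return
        except socket.gaierror:
            time.sleep(poll_s)
    sys.exit("master %s did not become resolvable within %d s" % (master, timeout_s))

if __name__ == "__main__":
    wait_for_master()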

@johnugeorge
Member

@TimZaman Thanks for your interest and contributions. This issue can be taken care of; I don't think there is any problem in handling it. I will have a look at it.

@johnugeorge
Member

johnugeorge commented Feb 7, 2019

Adding label /area 0.5.0

@johnugeorge
Member

@TimZaman Fix is merged

@johnugeorge
Member

Close this issue?

@Akmazad

Akmazad commented Apr 9, 2019

Hi guys,
I'm facing the same problem now and am not sure if it is due to other issues in my code. I'm using pytorch/1.1.0a0-py36, openmpi/3.1.3, cuda/9.0, dist_method="env://", and backend=nccl.

File "/apps/pytorch/1.1.0a0-py36/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon)
ValueError: host not found: Name or service not known

Any suggestions?

@johnugeorge
Member

Can you create a separate issue?
