Description
torchrun --nproc-per-node 4 fails with exitcode: -7 when launching the official DDP example inside the pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime Docker image; the same launch succeeds in an older pytorch/pytorch image.
Context
- PyTorch version: 2.0.0
- Operating System and version: Docker image pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
Your Environment
- Installed using source? [yes/no]: no (prebuilt Docker image)
- Are you planning to deploy it using docker container? [yes/no]: yes
- Is it a CPU or GPU environment?: GPU
- Which example are you using: https://github.com/pytorch/examples/blob/main/distributed/ddp/main.py
- Link to code or data to repro [if any]: the pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime Docker image (version checks are sketched below)
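To confirm these versions inside the container, standard PyTorch and NVIDIA checks (nothing specific to this report) are:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
nvidia-smi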
Expected Behavior
The DDP example at https://github.com/pytorch/examples/blob/main/distributed/ddp/main.py should run to completion.
Current Behavior
The launch fails with exit code -7 (the worker process is killed by signal 7, SIGBUS):

torchrun --nproc-per-node 4 train.py
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 108) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.0.0', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Possible Solution
Running the same torchrun command in the docker.io/pytorch/pytorch:2110 environment succeeds: the DDP example at https://github.com/pytorch/examples/blob/main/distributed/ddp/main.py runs without errors.
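For comparison, a sketch of the working run, assuming the container is started with GPU access and torchrun is on the image's PATH (the docker run flags below are illustrative, not the exact ones used):

# same script and launch command; only the base image differs
docker run --gpus all --rm -it -v "$PWD":/workspace -w /workspace \
    docker.io/pytorch/pytorch:2110 \
    torchrun --nproc-per-node 4 train.py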
Steps to Reproduce
1. Start a container from pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime on a machine with 4 GPUs.
2. Copy the DDP example https://github.com/pytorch/examples/blob/main/distributed/ddp/main.py into the container (saved as train.py).
3. Run torchrun --nproc-per-node 4 train.py; the launch fails with exitcode: -7 as shown above. A command-line sketch follows.
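A sketch of the failing setup, assuming the example script is saved as train.py on the host and the machine has at least 4 GPUs (the docker run flags are illustrative):

# save https://github.com/pytorch/examples/blob/main/distributed/ddp/main.py locally as train.py

# start a container from the affected image with GPU access
docker run --gpus all --rm -it -v "$PWD":/workspace -w /workspace \
    pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime bash

# inside the container, launch the example on 4 processes
torchrun --nproc-per-node 4 train.py
# fails with exitcode: -7 as shown under Current Behavior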