PyTorch 2.0.0: DDP training fails under torchrun, but the old version works #1144

Open
@alicera

Context

  • PyTorch version: 2.0.0
  • Operating System and version:
    Docker image pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime

Your Environment

  • Installed using source? [yes/no]:
  • Are you planning to deploy it using docker container? [yes/no]:
  • Is it a CPU or GPU environment?: GPU
  • Which example are you using:
  • Link to code or data to repro [if any]:
    pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime

Expected Behavior

The DDP example should run to completion without errors:
https://github.com/pytorch/examples/blob/main/distributed/ddp/main.py

Current Behavior

torchrun --nproc-per-node 4 train.py 
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 108) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.0', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
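
The traceback above only shows the launcher raising ChildFailedError; the useful detail is `exitcode: -7`. Assuming torchrun follows the usual Python convention that a negative exit code means the worker was killed by that signal number, the following sketch decodes it. The helper name `describe_exit_code` is made up for illustration and is not part of torch.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Describe a child exit code, treating negative values as signal numbers.

    Assumption: the launcher reports "-N" when the child was terminated by
    signal N, as Python's subprocess and multiprocessing modules do.
    """
    if exitcode >= 0:
        return f"exited with status {exitcode}"
    signum = -exitcode
    try:
        # Signal names are platform-specific; look them up on this machine.
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown signal"
    return f"killed by signal {signum} ({name})"


if __name__ == "__main__":
    # The value reported for local_rank 0 in the log above.
    print(describe_exit_code(-7))
```

On Linux x86-64, signal 7 is SIGBUS. In containers a bus error is often associated with exhausted shared memory (/dev/shm), but that is only a plausible cause here and is not confirmed by the log.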

Possible Solution

The same torchrun command runs successfully in the docker.io/pytorch/pytorch:2110 image,
so using that environment works around the failure.

https://github.com/pytorch/examples/blob/main/distributed/ddp/main.py
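
Since one image works and the other does not, it may help to run the same small report in both containers and compare the output. This is a minimal sketch, not an official diagnostic: the function names `shm_report` and `environment_report` are hypothetical, and checking /dev/shm is based on the assumption that shared memory could be involved.

```python
import platform
import shutil
import sys


def shm_report(path: str = "/dev/shm") -> str:
    """Summarize the size of the shared-memory mount, if it exists."""
    try:
        usage = shutil.disk_usage(path)
    except OSError:
        return f"{path}: not available"
    mib = 1024 * 1024
    return f"{path}: total={usage.total // mib} MiB, free={usage.free // mib} MiB"


def environment_report() -> dict:
    """Collect basic facts about the interpreter, OS, shared memory, and torch."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "shm": shm_report(),
    }
    try:
        import torch  # optional; the report still works without it
    except ImportError:
        report["torch"] = "not installed"
        return report
    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If the /dev/shm total turns out to be small in the failing container, starting it with a larger `--shm-size` or with `--ipc=host` is one thing to try; whether that resolves this particular failure is untested.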

Steps to Reproduce

...

Failure Logs [if any]
