resource_tracker: There appear to be 45 leaked semaphore objects to clean up at shutdown #18

Open
anti-machinee opened this issue May 17, 2022 · 10 comments

Comments

@anti-machinee

anti-machinee commented May 17, 2022

Please review this error

```
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 45 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
mpirun noticed that process rank 7 with PID 0 on node ip-<> exited on signal 9 (Killed).
```
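
If it helps with triage: the resource_tracker warning is usually a side effect rather than the root cause. When a worker is killed with signal 9 before Python's multiprocessing cleanup can run, the named POSIX semaphores it registered are never released, and the shared resource_tracker reports them at shutdown. A minimal sketch of just that mechanism (my own illustration, not the actual training code):

```python
# A minimal sketch (not the training code): a child process creates
# multiprocessing semaphores and is then SIGKILLed before it can release
# them, so the shared resource_tracker reports them as leaked when the
# parent exits, producing the same UserWarning shown above.
import multiprocessing as mp
import os
import signal


def worker():
    # Each mp.Semaphore() is backed by a named POSIX semaphore that gets
    # registered with the resource_tracker helper process.
    sems = [mp.Semaphore() for _ in range(45)]
    # Simulate an external kill (e.g. the OOM killer): no cleanup runs,
    # so all 45 semaphores stay registered as "leaked".
    os.kill(os.getpid(), signal.SIGKILL)


if __name__ == "__main__":
    mp.set_start_method("spawn")
    p = mp.Process(target=worker)
    p.start()
    p.join()
    # At interpreter shutdown the tracker prints:
    #   UserWarning: resource_tracker: There appear to be 45 leaked
    #   semaphore objects to clean up at shutdown
```

In other words, the semaphore count is a symptom; the question is why rank 7 was killed with signal 9 in the first place (commonly memory pressure and the OOM killer).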

@JoeyTPChou

JoeyTPChou commented May 18, 2022

I encountered a similar issue while running the PyTorch GPT-2 example on 8 Gaudi cards (AWS DL1 instance). Both 1.4.0 and 1.4.1 showed this error. The error we got from 1.4.0:

```
....
2022-04-28 22:31:43 | INFO | root | Reducer buckets have been rebuilt in this iteration.
2022-04-28 22:33:47 | INFO | train_inner | epoch 001:     10 / 16405 loss=18.858, ppl=475154, wps=37885.1, ups=0.07, wpb=524288, bsz=512, num_updates=10, lr=6.099e-06, gnorm=18.355, clip=100, train_wall=148, wall=212
2022-04-28 22:36:06 | INFO | train_inner | epoch 001:     20 / 16405 loss=16.001, ppl=65586.3, wps=37670.6, ups=0.07, wpb=524288, bsz=512, num_updates=20, lr=1.2098e-05, gnorm=4.887, clip=100, train_wall=139, wall=352
2022-04-28 22:38:25 | INFO | train_inner | epoch 001:     30 / 16405 loss=14.704, ppl=26695.2, wps=37699, ups=0.07, wpb=524288, bsz=512, num_updates=30, lr=1.8097e-05, gnorm=2.606, clip=100, train_wall=139, wall=491
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/hcl_remote_device.cpp::69(onReceivedArt): The condition [ entry.seqNumber == (lastReceived + 1) ] failed. Illegal ART seqNumber, from rank (0), seqNumber
terminate called after throwing an instance of 'c10::Error'
  what():  Collective call returned error
Exception raised from operator() at /tmp/pip-req-build-_kqiu0aw/habana_frameworks/torch/core/hccl/ProcessGroupHCCL.cpp:647 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f295d4eed2c in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xf5 (0x7f295d4cec4d in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x29dc7 (0x7f28b36f2dc7 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/core/_hccl_C.so)
frame #3: <unknown function> + 0x2836a (0x7f28b36f136a in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/core/_hccl_C.so)
frame #4: <unknown function> + 0xd6de4 (0x7f295d3a2de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f2969839609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f2969973163 in /lib/x86_64-linux-gnu/libc.so.6)
....
Traceback (most recent call last):
  File "train.py", line 14, in <module>
    cli_main()
  File "/GPT2/fairseq_cli/train.py", line 537, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/GPT2/fairseq/distributed/utils.py", line 369, in call_main
    torch.multiprocessing.spawn(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 5 terminated with signal SIGKILL
Couldn't import apex.normalization.fused_layer_norm.FusedLayerNorm, using torch.nn.LayerNorm
Couldn't import apex.normalization.fused_layer_norm.FusedLayerNorm, using torch.nn.LayerNorm
/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 128 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
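
The ProcessExitedException in the traceback is torch.multiprocessing.spawn reporting that one rank died with SIGKILL; the leaked-semaphore warning then comes from whatever semaphores the killed rank had registered (DataLoader queues and similar) and never released. A minimal sketch of just the exception path, assuming only stock PyTorch (no Habana stack, no fairseq):

```python
# A minimal sketch, assuming only stock PyTorch: torch.multiprocessing.spawn
# surfaces a rank killed with signal 9 as the ProcessExitedException shown
# in the traceback above.
import os
import signal

import torch.multiprocessing as mp


def run(rank):
    if rank == 5:
        # Mimic an external kill (e.g. the OOM killer) taking out one rank.
        os.kill(os.getpid(), signal.SIGKILL)
    # The real workers would run the training loop here.


if __name__ == "__main__":
    # Raises torch.multiprocessing.spawn.ProcessExitedException:
    #   "process 5 terminated with signal SIGKILL"
    mp.spawn(run, nprocs=8)
```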

@greg-serochi
Collaborator

greg-serochi commented May 18, 2022

Hi @JoeyTPChou, some follow up questions here:

  1. Did you kill this process? The team is curious why it's listed as killed.

  2. Are you just running the default commands from the GPT2 Model site? https://github.com/HabanaAI/Model-References/tree/master/PyTorch/nlp/GPT2

  3. Does this happen with Single Card? I know this is not ideal, but we want to confirm if this is a DDP or Synapse SW issue

  4. Can you please provide the full dmesg log for this failure (see the capture sketch after this list)? However, I assume what you posted today is the main error section.

  5. We have a dedicated snapshot tool https://github.com/HabanaAI/Snapshot_For_Debug that you can run to capture the relevant log files.
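
For item 4, a hedged convenience sketch of one way to capture the full dmesg log and flag any OOM-killer entries (a common reason a rank exits with signal 9); this is not part of the Snapshot_For_Debug tool, which remains the way to collect the full set of logs:

```python
# A convenience sketch (not part of the Snapshot_For_Debug tool): dump the
# full kernel log to a file for attaching to the issue, and print any lines
# that point at the OOM killer, a common reason a rank exits with signal 9.
# Reading the kernel ring buffer may require root inside the container.
import re
import subprocess

log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout

with open("dmesg_full.log", "w") as f:
    f.write(log)

for line in log.splitlines():
    if re.search(r"out of memory|oom-killer|killed process", line, re.IGNORECASE):
        print(line)
```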

@JoeyTPChou

JoeyTPChou commented May 21, 2022

Hi @greg-serochi, just wanted to let you know I haven't forgotten this issue. Since the issue is non-deterministic, my 1st epoch has reached ~3000 iterations this time and is still running. Let me reply to some of your questions:

1. Did you kill this process? The team is curious why it's listed as killed.
No, I didn't kill it; the process was killed on its own.

2. Are you just running the default commands from the GPT2 Model site? https://github.com/HabanaAI/Model-References/tree/master/PyTorch/nlp/GPT2
Yes, I am. I used the PyTorch Docker file and the example script to run on 8 Gaudi cards on a single DL1 instance.

3. Does this happen with Single Card? I know this is not ideal, but we want to confirm if this is a DDP or Synapse SW issue?
I ran it on 8 Gaudi cards on AWS.

4. Can you please provide the full dmesg log for this failure? However, I'd assumed what you posted today is the main error section.
5. We have a dedicated snapshot tool https://github.com/HabanaAI/Snapshot_For_Debug that you can run to capture the relevant log files

I will try these after this run gets killed.

@JoeyTPChou

@greg-serochi How can I send you the dmesg output and the log files generated by gather_info_docker.py?

@greg-serochi
Collaborator

We are taking this internally for further debugging. Once we have a resolution, we will provide an update.

@JoeyTPChou

Hi Greg, thanks for the update. Do you think the fix will be included in the next release (1.5)?

@greg-serochi
Collaborator

We cannot comment on future fixes until they are released. Any update will be in our release notes and/or we'll update this thread.

@greg-serochi
Collaborator

We have made updates to our GPT2 model in the 1.5.0 version of our SynapseAI software stack, which was released today.

@JoeyTPChou

Thanks for the great news! Has the 1.5.0 image also been released? Do we need a new AMI and Docker image?

@greg-serochi
Collaborator

greg-serochi commented Jun 16, 2022

The 1.5.0 content was released today. You should use Habana's Base AMI and Docker images based on 1.5.0:

Base AMI: https://aws.amazon.com/marketplace/search?searchTerms=habana%C2%AE
Docker Images: https://gallery.ecr.aws/habanalabs/pytorch-installer
