resource_tracker: There appear to be 45 leaked semaphore objects to clean up at shutdown #18

Open
anti-machinee opened this issue May 17, 2022 · 10 comments

Comments

@anti-machinee

anti-machinee commented May 17, 2022

Please review this error

```
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 45 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
mpirun noticed that process rank 7 with PID 0 on node ip-<> exited on signal 9 (Killed).
```
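
If it helps with triage: the resource_tracker warning is usually a side effect rather than the root cause. When a worker is killed with signal 9 before Python's multiprocessing cleanup can run, the named POSIX semaphores it registered are never released, and the shared resource_tracker reports them at shutdown. A minimal sketch of just that mechanism (my own illustration, not the actual training code):

```python
# A minimal sketch (not the training code): a child process creates
# multiprocessing semaphores and is then SIGKILLed before it can release
# them, so the shared resource_tracker reports them as leaked when the
# parent exits, producing the same UserWarning shown above.
import multiprocessing as mp
import os
import signal


def worker():
    # Each mp.Semaphore() is backed by a named POSIX semaphore that gets
    # registered with the resource_tracker helper process.
    sems = [mp.Semaphore() for _ in range(45)]
    # Simulate an external kill (e.g. the OOM killer): no cleanup runs,
    # so all 45 semaphores stay registered as "leaked".
    os.kill(os.getpid(), signal.SIGKILL)


if __name__ == "__main__":
    mp.set_start_method("spawn")
    p = mp.Process(target=worker)
    p.start()
    p.join()
    # At interpreter shutdown the tracker prints:
    #   UserWarning: resource_tracker: There appear to be 45 leaked
    #   semaphore objects to clean up at shutdown
```

In other words, the semaphore count is a symptom; the question is why rank 7 was killed with signal 9 in the first place (commonly memory pressure and the OOM killer).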

@JoeyTPChou

JoeyTPChou commented May 18, 2022

I encountered a similar issue while running the PyTorch GPT-2 example on 8 Gaudi cards (AWS DL1 instance). Both 1.4.0 and 1.4.1 showed this error. The error we got from 1.4.0:

```
....
2022-04-28 22:31:43 | INFO | root | Reducer buckets have been rebuilt in this iteration.
2022-04-28 22:33:47 | INFO | train_inner | epoch 001:     10 / 16405 loss=18.858, ppl=475154, wps=37885.1, ups=0.07, wpb=524288, bsz=512, num_updates=10, lr=6.099e-06, gnorm=18.355, clip=100, train_wall=148, wall=212
2022-04-28 22:36:06 | INFO | train_inner | epoch 001:     20 / 16405 loss=16.001, ppl=65586.3, wps=37670.6, ups=0.07, wpb=524288, bsz=512, num_updates=20, lr=1.2098e-05, gnorm=4.887, clip=100, train_wall=139, wall=352
2022-04-28 22:38:25 | INFO | train_inner | epoch 001:     30 / 16405 loss=14.704, ppl=26695.2, wps=37699, ups=0.07, wpb=524288, bsz=512, num_updates=30, lr=1.8097e-05, gnorm=2.606, clip=100, train_wall=139, wall=491
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/hcl_remote_device.cpp::69(onReceivedArt): The condition [ entry.seqNumber == (lastReceived + 1) ] failed. Illegal ART seqNumber, from rank (0), seqNumber
terminate called after throwing an instance of 'c10::Error'
  what():  Collective call returned error
Exception raised from operator() at /tmp/pip-req-build-_kqiu0aw/habana_frameworks/torch/core/hccl/ProcessGroupHCCL.cpp:647 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f295d4eed2c in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xf5 (0x7f295d4cec4d in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x29dc7 (0x7f28b36f2dc7 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/core/_hccl_C.so)
frame #3: <unknown function> + 0x2836a (0x7f28b36f136a in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/core/_hccl_C.so)
frame #4: <unknown function> + 0xd6de4 (0x7f295d3a2de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f2969839609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f2969973163 in /lib/x86_64-linux-gnu/libc.so.6)
....
Traceback (most recent call last):
  File "train.py", line 14, in <module>
    cli_main()
  File "/GPT2/fairseq_cli/train.py", line 537, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/GPT2/fairseq/distributed/utils.py", line 369, in call_main
    torch.multiprocessing.spawn(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 5 terminated with signal SIGKILL
Couldn't import apex.normalization.fused_layer_norm.FusedLayerNorm, using torch.nn.LayerNorm
Couldn't import apex.normalization.fused_layer_norm.FusedLayerNorm, using torch.nn.LayerNorm
/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 128 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
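
The ProcessExitedException in the traceback is torch.multiprocessing.spawn reporting that one rank died with SIGKILL; the leaked-semaphore warning then comes from whatever semaphores the killed rank had registered (DataLoader queues and similar) and never released. A minimal sketch of just the exception path, assuming only stock PyTorch (no Habana stack, no fairseq):

```python
# A minimal sketch, assuming only stock PyTorch: torch.multiprocessing.spawn
# surfaces a rank killed with signal 9 as the ProcessExitedException shown
# in the traceback above.
import os
import signal

import torch.multiprocessing as mp


def run(rank):
    if rank == 5:
        # Mimic an external kill (e.g. the OOM killer) taking out one rank.
        os.kill(os.getpid(), signal.SIGKILL)
    # The real workers would run the training loop here.


if __name__ == "__main__":
    # Raises torch.multiprocessing.spawn.ProcessExitedException:
    #   "process 5 terminated with signal SIGKILL"
    mp.spawn(run, nprocs=8)
```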

@greg-serochi
Collaborator

greg-serochi commented May 18, 2022

Hi @JoeyTPChou, some follow up questions here:

  1. Did you kill this process? The team is curious why it's listed as killed.

  2. Are you just running the default commands from the GPT2 Model site? https://github.com/HabanaAI/Model-References/tree/master/PyTorch/nlp/GPT2

  3. Does this happen with Single Card? I know this is not ideal, but we want to confirm if this is a DDP or Synapse SW issue

  4. Can you please provide the full dmesg log for this failure (see the capture sketch after this list)? However, I assume what you posted today is the main error section.

  5. We have a dedicated snapshot tool https://github.com/HabanaAI/Snapshot_For_Debug that you can run to capture the relevant log files.
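
For item 4, a hedged convenience sketch of one way to capture the full dmesg log and flag any OOM-killer entries (a common reason a rank exits with signal 9); this is not part of the Snapshot_For_Debug tool, which remains the way to collect the full set of logs:

```python
# A convenience sketch (not part of the Snapshot_For_Debug tool): dump the
# full kernel log to a file for attaching to the issue, and print any lines
# that point at the OOM killer, a common reason a rank exits with signal 9.
# Reading the kernel ring buffer may require root inside the container.
import re
import subprocess

log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout

with open("dmesg_full.log", "w") as f:
    f.write(log)

for line in log.splitlines():
    if re.search(r"out of memory|oom-killer|killed process", line, re.IGNORECASE):
        print(line)
```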

@JoeyTPChou

JoeyTPChou commented May 21, 2022

Hi @greg-serochi, just wanted to let you know I haven't forgotten this issue. Since the issue is non-deterministic, my 1st epoch has reached ~3000 iterations this time and is still running. Let me reply to some of your questions:

1. Did you kill this process? The team is curious why it's listed as killed.
No, I didn't kill it; the process was killed on its own.

2. Are you just running the default commands from the GPT2 Model site? https://github.com/HabanaAI/Model-References/tree/master/PyTorch/nlp/GPT2
Yes, I am. I used the PyTorch Docker file and the example script to run on 8 Gaudi cards on a single DL1 instance.

3. Does this happen with Single Card? I know this is not ideal, but we want to confirm if this is a DDP or Synapse SW issue?
I ran it on 8 Gaudi cards on AWS.

4. Can you please provide the full dmesg log for this failure? However, I'd assumed what you posted today is the main error section.
5. We have a dedicated snapshot tool https://github.com/HabanaAI/Snapshot_For_Debug that you can run to capture the relevant log files

I will try these after this run gets killed.

@JoeyTPChou

@greg-serochi How can I send you the dmesg output and the log files generated by gather_info_docker.py?

@greg-serochi
Collaborator

We are taking this internally for further debugging. Once we have a resolution, we will provide an update.

@JoeyTPChou

Hi Greg, thanks for the update. Do you think the fix will be included in the next release (1.5)?

@greg-serochi
Collaborator

We cannot comment on future fixes until they are released. Any update will be in our release notes and/or we'll update this thread.

@greg-serochi
Collaborator

We have made updates to our GPT2 model in the 1.5.0 version of our SynapseAI software stack, which was released today.

@JoeyTPChou

Thanks for the great news! Has the 1.5.0 image also been released? Do we need a new AMI and Docker image?

@greg-serochi
Collaborator

greg-serochi commented Jun 16, 2022

The 1.5.0 content was released today. You should use Habana's Base AMI and Docker images based on 1.5.0:

Base AMI: https://aws.amazon.com/marketplace/search?searchTerms=habana%C2%AE
Docker Images: https://gallery.ecr.aws/habanalabs/pytorch-installer
