error -9 when training caffe-alexnet model #17

srinivasmangipudi · 2018-03-21T12:16:10Z

The job runs for about 2 minutes, but at process #60 it crashes with the following error (see the attached image).

humphd · 2018-03-21T14:41:00Z

I'm not entirely sure, but my guess is that this is either your system running out of memory or a problem with it picking incorrectly between cpu vs. gpu.

Could this be NVIDIA/DIGITS#1402?
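
For context on the out-of-memory guess: if the `-9` here is a subprocess return code as Python reports it (an assumption, since the thread only shows the number), a negative value means the process was terminated by that signal number. Signal 9 is SIGKILL on Linux and macOS, which is what the kernel sends when it kills a process for exhausting memory. A minimal sketch of that decoding, where `describe_returncode` is a hypothetical helper and not part of DIGITS or Caffe:

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Explain a subprocess-style return code (hypothetical helper)."""
    if returncode < 0:
        # Python's subprocess module reports "killed by signal N" as -N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    print(describe_returncode(-9))
```

A SIGKILL by itself does not prove an out-of-memory condition, but it fits that explanation better than a crash inside the training code would.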

ln3333 · 2018-04-10T09:37:19Z

I reproduced the error on my Docker box. By default, Docker allocates 2 GB of memory for the pod on my MacBook, which is insufficient in this case. According to the DIGITS dashboard, the training uses about 3 GB of memory.

In my case, increasing the memory in the Docker preferences panel fixed it. Click the Docker whale icon -> Preferences -> Advanced -> Memory, then increase the value accordingly.
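
To confirm that the new setting has taken effect, one option is to read the memory limit from inside the container. The sketch below assumes a Linux container that exposes the usual cgroup files (the two paths listed are the standard cgroup v2 and v1 locations) and uses the roughly 3 GB figure from the DIGITS dashboard as the threshold. The function names are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Standard cgroup locations for the memory limit; which one exists
# depends on the cgroup version the container runtime uses.
CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",                    # cgroup v2
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
)

# Approximate memory the training needed, per the dashboard figure above.
REQUIRED_BYTES = 3 * 1024**3


def parse_limit(text: str) -> Optional[int]:
    """Parse a cgroup limit file; cgroup v2 writes 'max' for no limit."""
    value = text.strip()
    if value == "max":
        return None
    return int(value)


def container_memory_limit() -> Optional[int]:
    """Return the limit in bytes, or None if unlimited or not found."""
    for name in CGROUP_LIMIT_FILES:
        path = Path(name)
        if path.is_file():
            try:
                return parse_limit(path.read_text())
            except (OSError, ValueError):
                continue
    return None


def has_enough_memory(limit: Optional[int],
                      required: int = REQUIRED_BYTES) -> bool:
    """Treat a missing or unlimited value as sufficient."""
    return limit is None or limit >= required


if __name__ == "__main__":
    limit = container_memory_limit()
    if limit is None:
        print("No memory limit detected")
    else:
        print(f"Memory limit: {limit / 1024**3:.1f} GiB")
    print("Enough for training:", has_enough_memory(limit))
```

With the default 2 GB allocation this check would report that the limit is too low, which matches the crash described above.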

humphd · 2018-04-10T14:19:02Z

@ln3333 thanks for this, I've added a note and pushed it. Closing.

I'm in the process of rewriting this for TensorFlow and TensorFlow.js right now in #14, so I think further debugging of DIGITS issues isn't necessary.

humphd closed this as completed in caf3ca0 Apr 10, 2018
