error -9 when training caffe-alexnet model #17

Closed
srinivasmangipudi opened this issue Mar 21, 2018 · 3 comments

Comments

@srinivasmangipudi

The job ran for about 2 minutes, but at process #60 it crashes with the following error (see the attached image).

[Screenshot: 2018-03-21 at 5:44:41 PM]

@humphd
Owner

humphd commented Mar 21, 2018

I'm not entirely sure, but my guess is that either your system is running out of memory or it is choosing incorrectly between CPU and GPU.

Could it be NVIDIA/DIGITS#1402?

@ln3333

ln3333 commented Apr 10, 2018

I reproduced the error on my Docker box. By default, Docker allocates 2 GB of memory to the container on my MacBook, which is insufficient in this case. According to the DIGITS dashboard, training uses about 3 GB of memory.

In my case, increasing the memory in the Docker preferences panel works. Click the Docker whale icon -> Preferences -> Advanced -> Memory, then increase the value accordingly.
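
For anyone wondering why the failure shows up as -9: a minimal sketch, assuming DIGITS is surfacing the return code of the training subprocess the way Python's `subprocess` module reports it. On POSIX systems a negative return code -N means the child was terminated by signal N, and signal 9 is SIGKILL, which is what the kernel sends when it kills a process for running out of memory. That would be consistent with the memory explanation above. The helper name `describe_return_code` is hypothetical and is not part of DIGITS.

```python
import signal


def describe_return_code(returncode: int) -> str:
    """Explain a subprocess return code using Python's convention.

    On POSIX, subprocess reports a negative value -N when the child
    was terminated by signal N. Non-negative values are ordinary
    exit statuses.
    """
    if returncode < 0:
        try:
            # Map the signal number to its symbolic name, e.g. 9 -> SIGKILL.
            name = signal.Signals(-returncode).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = f"signal {-returncode}"
        return f"terminated by {name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # The code from the issue title.
    print(describe_return_code(-9))
```

On Linux and macOS, passing -9 reports SIGKILL. A SIGKILL on its own does not prove an out-of-memory condition, so it is still worth checking the memory graph in the DIGITS dashboard as described above.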

@humphd closed this as completed in caf3ca0 on Apr 10, 2018
@humphd
Owner

humphd commented Apr 10, 2018

@ln3333 thanks for this, I've added a note and pushed it. Closing.

I'm in the process of rewriting this for TensorFlow and TensorFlow.js right now in #14, so I think further debugging of DIGITS issues isn't necessary.
