
Check failed: n_gpus <= n_available_devices (1 vs. -4) Starting from gpu id: 6, there are only -4 available devices, while n_gpus is set to: 1 #4618

Closed
pseudotensor opened this issue Jun 28, 2019 · 9 comments · Fixed by #5160

@pseudotensor (Contributor) commented Jun 28, 2019

xgboost.core.XGBoostError: [12:18:20] /root/repo/xgboost/include/xgboost/../../src/common/common.h:193: Check failed: n_gpus <= n_available_devices (1 vs. -4) Starting from gpu id: 6, there are only -4 available devices, while n_gpus is set to: 1

Stack trace returned 10 entries:
[bt] (0) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(dmlc::StackTrace(unsigned long)+0x54) [0x7fc7fdd22604]
[bt] (1) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x1d) [0x7fc7fdd22d9d]
[bt] (2) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::GPUSet::All(int, int, int)+0x527) [0x7fc7fdf18927]
[bt] (3) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::obj::RegLossObj<xgboost::obj::LinearSquareLoss>::Configure(std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&)+0x233) [0x7fc7fdf4ec33]
[bt] (4) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(void xgboost::ObjFunction::Configure<std::_Rb_tree_iterator<std::pair<std::string const, std::string> > >(std::_Rb_tree_iterator<std::pair<std::string const, std::string> >, std::_Rb_tree_iterator<std::pair<std::string const, std::string> >)+0xd1) [0x7fc7fdda49d1]
[bt] (5) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::LearnerImpl::Load(dmlc::Stream*)+0xc9a) [0x7fc7fddb2f6a]
[bt] (6) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(XGBoosterLoadModelFromBuffer+0x50) [0x7fc7fdd2deb0]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fc95de35e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fc95de358ab]
[bt] (9) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2be) [0x7fc95e049b9e]

Similar context as before, @trivialfis: the model was trained on an EC2 instance with 8 K80s, 32 vCPUs, and 488 GB of RAM, and then that pickle file was run on a system with 2 GPUs.

@sh1ng

@pseudotensor (Contributor Author)

So basically, during unpickling, gpu_id is still checked.

I tried a case going from 4 GPUs to 1 and didn't always hit the same problem; it depends on the gpu_id used.

For the same package, going from GPU to CPU did work. That is:

CUDA_VISIBLE_DEVICES= python ...

shows no error. This is what I think @trivialfis solved before. gpu_id just needs to not fail on pickle load either when the number of visible GPUs changes. The user needs a chance to adjust the system resources the model uses for (e.g.) predict.
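
For reference, a minimal sketch of the scenario being described; the data, file name, and exact parameters are made up, and parameter names follow the 0.90-era API, so treat it as an illustration rather than a verified repro:

```python
# Sketch only: pickling a GPU-trained booster and unpickling it on a machine
# with fewer GPUs, which is where the n_gpus/n_available_devices check fires.
# Data, file name, and gpu_id are hypothetical.
import pickle

import numpy as np
import xgboost as xgb

# --- Machine A: 8 GPUs ---
X, y = np.random.rand(1000, 10), np.random.rand(1000)
dtrain = xgb.DMatrix(X, label=y)
params = {"tree_method": "gpu_hist", "gpu_id": 6, "objective": "reg:squarederror"}
bst = xgb.train(params, dtrain, num_boost_round=10)
with open("model.pkl", "wb") as f:
    pickle.dump(bst, f)

# --- Machine B: 2 GPUs ---
# Per the stack trace above, unpickling goes through LearnerImpl::Load, which
# re-runs the objective's Configure() and validates the stored gpu_id against
# the (now smaller) device count, raising:
#   Check failed: n_gpus <= n_available_devices ...
with open("model.pkl", "rb") as f:
    bst = pickle.load(f)
```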

@pseudotensor (Contributor Author)

xgboost used to be smart and would wrap gpu_id by the number of visible devices. I know @RAMitchell said that was taken out, but this is a reason to put back in such basic smarts. xgboost shouldn't be made completely dumb.
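
For illustration, a minimal sketch of the wrapping behaviour being asked for here, written as a standalone Python helper (normalize_gpu_id is a hypothetical name, not an XGBoost API):

```python
# Hypothetical sketch of "wrap gpu_id by the number of visible devices";
# not XGBoost's actual implementation.
def normalize_gpu_id(requested_gpu_id: int, n_visible_devices: int) -> int:
    """Map the stored gpu_id onto a device that exists on this machine."""
    if n_visible_devices <= 0:
        raise RuntimeError("No visible CUDA devices; fall back to CPU.")
    return requested_gpu_id % n_visible_devices

# A model trained with gpu_id=6 and loaded on a 2-GPU box would use device 0.
assert normalize_gpu_id(6, 2) == 0
```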

@trivialfis (Member)

@pseudotensor Thanks for the issue. I will try looking into the unpickling.

but this is a reason to put back in such basic smarts.

It was removed only because I don't know how to make it work. Multi-GPU only started working after removing it in #3851, plus a few months' worth of debugging effort ... Before that, any change of gpu_id could cause a series of troubles.

@trivialfis (Member)

I'm preparing for a relocation, so things might be slow.

@trivialfis (Member) commented Jul 15, 2019

I'm investigating the JSON serialization format.
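
For context, this JSON work eventually surfaced as a Booster-level config round trip in later releases (1.0+); a sketch of that API, with a hypothetical model file, just to show where parameters like gpu_id end up living:

```python
# Sketch of the JSON-based configuration round trip available in later XGBoost
# releases (>= 1.0); not what existed at the time of this issue.
import json

import xgboost as xgb

bst = xgb.Booster()
bst.load_model("model.bin")             # hypothetical model file

config = json.loads(bst.save_config())  # full learner configuration as JSON
# gpu_id lives somewhere under config["learner"] (exact key varies by version)
bst.load_config(json.dumps(config))     # feed an (optionally edited) config back
```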

@trivialfis (Member)

Closing, as n_gpus is no longer supported.

@pseudotensor (Contributor Author)

Not sure that is the correct view. gpu_id can still be off and needs to be chosen. If you train with gpu_id 8 and move to a system where only gpu_id 0 is allowed, this would still fail.
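
As a hedged mitigation sketch for that scenario (untested; file names are hypothetical): on a machine where the model still loads, reset gpu_id to a device id that exists everywhere and re-pickle. Whether the updated value survives the binary round trip depends on the XGBoost version.

```python
# Untested mitigation sketch: re-target gpu_id before shipping the pickle.
# File names are hypothetical; behaviour depends on the XGBoost version.
import pickle

import xgboost as xgb

with open("model_gpu8.pkl", "rb") as f:   # loads fine where gpu_id 8 exists
    bst = pickle.load(f)

bst.set_param({"gpu_id": 0})              # device 0 exists on every GPU box
with open("model_gpu0.pkl", "wb") as f:
    pickle.dump(bst, f)
```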

@trivialfis (Member)

You are right. My mistake; I will take a closer look. The binary model loading is giving me a huge headache.

@trivialfis trivialfis reopened this Aug 17, 2019
@trivialfis trivialfis self-assigned this Aug 18, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Mar 27, 2020