
Check failed: n_gpus <= n_available_devices (1 vs. -4) Starting from gpu id: 6, there are only -4 available devices, while n_gpus is set to: 1 #4618

Closed
pseudotensor opened this issue Jun 28, 2019 · 9 comments · Fixed by #5160

@pseudotensor (Contributor) commented Jun 28, 2019

xgboost.core.XGBoostError: [12:18:20] /root/repo/xgboost/include/xgboost/../../src/common/common.h:193: Check failed: n_gpus <= n_available_devices (1 vs. -4) Starting from gpu id: 6, there are only -4 available devices, while n_gpus is set to: 1

Stack trace returned 10 entries:
[bt] (0) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(dmlc::StackTrace(unsigned long)+0x54) [0x7fc7fdd22604]
[bt] (1) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x1d) [0x7fc7fdd22d9d]
[bt] (2) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::GPUSet::All(int, int, int)+0x527) [0x7fc7fdf18927]
[bt] (3) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::obj::RegLossObj<xgboost::obj::LinearSquareLoss>::Configure(std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&)+0x233) [0x7fc7fdf4ec33]
[bt] (4) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(void xgboost::ObjFunction::Configure<std::_Rb_tree_iterator<std::pair<std::string const, std::string> > >(std::_Rb_tree_iterator<std::pair<std::string const, std::string> >, std::_Rb_tree_iterator<std::pair<std::string const, std::string> >)+0xd1) [0x7fc7fdda49d1]
[bt] (5) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::LearnerImpl::Load(dmlc::Stream*)+0xc9a) [0x7fc7fddb2f6a]
[bt] (6) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(XGBoosterLoadModelFromBuffer+0x50) [0x7fc7fdd2deb0]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fc95de35e40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fc95de358ab]
[bt] (9) /home/stefanp/RB/scoring-pipeline/env/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2be) [0x7fc95e049b9e]

Similar context as before, @trivialfis: the model was trained on an EC2 instance with 8 K80s, 32 vCPUs, and 488 GB of RAM, and then that pickle file was run on a system with 2 GPUs.

@sh1ng

@pseudotensor (Contributor Author)

So basically, during unpickling, gpu_id is still checked.

I tried a case going from 4 GPUs to 1 and didn't always hit the same problem; it depends on the gpu_id used.

For the same package, going from GPU to CPU did work. That is:

CUDA_VISIBLE_DEVICES= python ...

shows no error. This is what I think @trivialfis solved before. gpu_id just needs to not fail on pickle load either when the number of visible GPUs changes. The user needs a chance to adjust the system resources the model uses for (e.g.) predict.
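
For reference, a minimal sketch of the scenario being described; the data, file name, and exact parameters are made up, and parameter names follow the 0.90-era API, so treat it as an illustration rather than a verified repro:

```python
# Sketch only: pickling a GPU-trained booster and unpickling it on a machine
# with fewer GPUs, which is where the n_gpus/n_available_devices check fires.
# Data, file name, and gpu_id are hypothetical.
import pickle

import numpy as np
import xgboost as xgb

# --- Machine A: 8 GPUs ---
X, y = np.random.rand(1000, 10), np.random.rand(1000)
dtrain = xgb.DMatrix(X, label=y)
params = {"tree_method": "gpu_hist", "gpu_id": 6, "objective": "reg:squarederror"}
bst = xgb.train(params, dtrain, num_boost_round=10)
with open("model.pkl", "wb") as f:
    pickle.dump(bst, f)

# --- Machine B: 2 GPUs ---
# Per the stack trace above, unpickling goes through LearnerImpl::Load, which
# re-runs the objective's Configure() and validates the stored gpu_id against
# the (now smaller) device count, raising:
#   Check failed: n_gpus <= n_available_devices ...
with open("model.pkl", "rb") as f:
    bst = pickle.load(f)
```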

@pseudotensor (Contributor Author)

xgboost used to be smart and would wrap gpu_id by the number of visible devices. I know @RAMitchell said that was taken out, but this is a reason to put back in such basic smarts. xgboost shouldn't be made completely dumb.
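
For illustration, a minimal sketch of the wrapping behaviour being asked for here, written as a standalone Python helper (normalize_gpu_id is a hypothetical name, not an XGBoost API):

```python
# Hypothetical sketch of "wrap gpu_id by the number of visible devices";
# not XGBoost's actual implementation.
def normalize_gpu_id(requested_gpu_id: int, n_visible_devices: int) -> int:
    """Map the stored gpu_id onto a device that exists on this machine."""
    if n_visible_devices <= 0:
        raise RuntimeError("No visible CUDA devices; fall back to CPU.")
    return requested_gpu_id % n_visible_devices

# A model trained with gpu_id=6 and loaded on a 2-GPU box would use device 0.
assert normalize_gpu_id(6, 2) == 0
```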

@trivialfis (Member)

@pseudotensor Thanks for the issue. I will try looking into the unpickling.

but this is a reason to put back in such basic smarts.

It was removed only because I don't know how to make it work. Multi-GPU only started working after removing it in #3851, plus a few months' worth of debugging effort ... Before that, any change of gpu_id could cause a series of troubles.

@trivialfis (Member)

I'm preparing for a relocation, so things might be slow.

@trivialfis (Member) commented Jul 15, 2019

I'm investigating the JSON serialization format.
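
For context, this JSON work eventually surfaced as a Booster-level config round trip in later releases (1.0+); a sketch of that API, with a hypothetical model file, just to show where parameters like gpu_id end up living:

```python
# Sketch of the JSON-based configuration round trip available in later XGBoost
# releases (>= 1.0); not what existed at the time of this issue.
import json

import xgboost as xgb

bst = xgb.Booster()
bst.load_model("model.bin")             # hypothetical model file

config = json.loads(bst.save_config())  # full learner configuration as JSON
# gpu_id lives somewhere under config["learner"] (exact key varies by version)
bst.load_config(json.dumps(config))     # feed an (optionally edited) config back
```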

@trivialfis (Member)

Closing, as n_gpus is no longer supported.

@pseudotensor (Contributor Author)

Not sure that is the correct view. gpu_id can still be off and needs to be chosen. If you train with gpu_id 8 and move to a system where only gpu_id 0 is allowed, this would still fail.
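
As a hedged mitigation sketch for that scenario (untested; file names are hypothetical): on a machine where the model still loads, reset gpu_id to a device id that exists everywhere and re-pickle. Whether the updated value survives the binary round trip depends on the XGBoost version.

```python
# Untested mitigation sketch: re-target gpu_id before shipping the pickle.
# File names are hypothetical; behaviour depends on the XGBoost version.
import pickle

import xgboost as xgb

with open("model_gpu8.pkl", "rb") as f:   # loads fine where gpu_id 8 exists
    bst = pickle.load(f)

bst.set_param({"gpu_id": 0})              # device 0 exists on every GPU box
with open("model_gpu0.pkl", "wb") as f:
    pickle.dump(bst, f)
```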

@trivialfis (Member)

You are right. My mistake; I will take a closer look. The binary model loading is giving me a huge headache.

@trivialfis trivialfis reopened this Aug 17, 2019
@trivialfis trivialfis self-assigned this Aug 18, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Mar 27, 2020