Check failed: n_gpus <= n_available_devices (1 vs. -4) Starting from gpu id: 6, there are only -4 available devices, while n_gpus is set to: 1 #4618
From the code, I see: https://github.com/dmlc/xgboost/blob/master/src/common/common.cc
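The negative device count in the error message falls out of simple arithmetic: if available devices are counted from `gpu_id` upward, a `gpu_id` saved on a larger machine goes negative on a smaller one. The function below is a pure-Python sketch of that logic (an assumption for illustration, not xgboost's actual C++ code from `common.cc`):

```python
def check_devices(gpu_id: int, n_gpus: int, n_visible: int) -> str:
    """Sketch of a device-count check; not xgboost's real implementation."""
    # Devices available from gpu_id up to the last visible device.
    n_available = n_visible - gpu_id
    if n_gpus <= n_available:
        return "ok"
    return (f"Check failed: n_gpus <= n_available_devices "
            f"({n_gpus} vs. {n_available}) Starting from gpu id: {gpu_id}, "
            f"there are only {n_available} available devices, "
            f"while n_gpus is set to: {n_gpus}")

# A model pickled with gpu_id=6 on an 8-GPU box, unpickled where only
# 2 GPUs are visible, reproduces the "(1 vs. -4)" message from the title:
print(check_devices(gpu_id=6, n_gpus=1, n_visible=2))
```

With `gpu_id=0` the same check passes, which is why the failure "depends upon gpu_id used".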
So basically, gpu_id is still checked during unpickling. I tried a case going from 4 GPUs to 1 and didn't always hit the same problem; it depends upon the gpu_id used. For the same pickle, going from GPU to CPU did work. That is:
shows no error. This is what I think @trivialfis solved before. gpu_id just needs to not fail on pickle load either, if the number of visible GPUs changes. The user needs time to adjust the system resources used by the model for (e.g.) predict.
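The behavior being asked for, unpickling never fails on a stale gpu_id, and any validation is deferred until the device is actually needed, can be sketched in pure Python. The `Booster` class below is a toy stand-in, not xgboost's real class:

```python
import pickle

class Booster:
    """Toy stand-in for a GPU-trained model (not xgboost's real class)."""
    def __init__(self, gpu_id: int):
        self.gpu_id = gpu_id

    def __setstate__(self, state):
        # Restore the saved gpu_id verbatim: do NOT validate it against the
        # current machine here, so unpickling never fails.
        self.__dict__.update(state)

    def predict(self, n_visible: int):
        # Validate only when a device is actually needed; by this point the
        # user has had a chance to change gpu_id or move to CPU.
        if self.gpu_id >= n_visible:
            raise ValueError(f"gpu_id {self.gpu_id} not available "
                             f"({n_visible} visible devices)")
        return "prediction"

blob = pickle.dumps(Booster(gpu_id=6))   # saved on a machine with 8 GPUs
model = pickle.loads(blob)               # loads fine even with fewer GPUs
model.gpu_id = 0                         # user adjusts before predict
print(model.predict(n_visible=2))
```

The key design point is that `__setstate__` is purely a data restore; the device check lives in `predict`, where the user can still recover.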
xgboost used to be smart and wrap gpu_id by the number of visible devices. I know @RAMitchell said that was taken out, but this is a reason to put such basic smarts back in. We shouldn't make xgboost super dumb.
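The "wrap" behavior described above amounts to folding an out-of-range gpu_id back into the visible range with a modulo. A minimal sketch (assumed semantics, not the removed xgboost code):

```python
def wrap_gpu_id(gpu_id: int, n_visible: int) -> int:
    """Fold a possibly out-of-range gpu_id into [0, n_visible)."""
    return gpu_id % n_visible

# gpu_id 6 saved on an 8-GPU machine, loaded where only 2 GPUs are visible:
print(wrap_gpu_id(6, 2))
# An in-range gpu_id is left unchanged:
print(wrap_gpu_id(1, 4))
```

With wrapping, the failing case from the title would silently map to a valid device instead of crashing; the trade-off (per the comment below) is that it complicated multi-GPU support.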
@pseudotensor Thanks for the issue. I will try looking into the unpickling.
It was removed only because I don't know how to make it work. Multi-GPU only started working after removing it in #3851 and a few months' worth of debugging effort ... Before that, any change of
I'm preparing for a relocation, so things might be slow.
I'm investigating the JSON serialization format.
Closing as
Not sure that is the correct view. gpu_id can still be off-range and needs to be chosen. If you train with gpu_id 8 and move to a system where only gpu_id 0 is allowed, this would still fail.
You are right. My mistake and will take a closer look. The binary model loading is giving me a huge headache. |
Similar context as before, @trivialfis. It was 8 K80s, 32 vCPUs, and 488 GiB of RAM on EC2, and then that pickle file was run on a system with 2 GPUs.
@sh1ng