
add gpu_hist support to Spark #4175

Closed
wants to merge 1 commit into from

Conversation

rongou
Contributor

rongou commented Feb 22, 2019

Two parts to the PR:

I've tested this on GCP with a 20-node Spark Standalone cluster, 1 T4 GPU per node.

@RAMitchell @canonizer @CodingCat @mt-jones

@CodingCat
Member

CodingCat commented Feb 22, 2019

Thanks for the contribution

First of all, we should not merge this before the 0.82 release (which is happening soon): it has never been tested by anyone except the author, and that can't realistically be done in just a few days.

Second, I don't think XGBoost-Spark is ready to support GPUs, since Spark itself is far from ready:

  1. I haven't seen a clear agreement in the Spark community about how to support GPUs (e.g. [WIP] allow building spark gpu docker images apache/spark#23347)

  2. per our colleagues at Uber, there are a lot of issues to resolve before Spark's GPU support is mature. For example, when I run a task in a CPU stage, the CPU is released as soon as the task finishes; with GPUs, the GPU resources have to stay allocated for the whole life cycle of the Spark application. (Even Databricks admits everything is in beta: https://docs.databricks.com/user-guide/clusters/gpu.html)

  3. even Spark itself has no CI tests covering its integration with GPUs, so it's hard for us, as a downstream library, to ensure the quality of this feature in XGBoost-Spark.

So I don't see myself agreeing to this fancy feature in the near future.

@@ -171,7 +179,7 @@ private[spark] trait GeneralParams extends Params {

final def getSeed: Long = $(seed)

-  setDefault(numRound -> 1, numWorkers -> 1, nthread -> 1,
+  setDefault(numRound -> 1, numWorkers -> 1, nthread -> 1, nGpus -> 1,
Member


this should be 0 at least

Contributor Author


I'd be happy to change it, but it's only used when tree_method is set to gpu_hist, in which case the user probably expects to grab a GPU. It's also the default on the C++ side (https://github.com/dmlc/xgboost/blob/master/src/tree/param.h#L200), so having this default seems less surprising and more consistent.
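The argument above can be illustrated with a small sketch (hypothetical names, not the real XGBoost4J-Spark code): because nGpus is only consulted when tree_method is gpu_hist, a default of 1 never causes a CPU run to grab a GPU, while matching what a gpu_hist user expects.

```scala
// Hypothetical sketch of the default-resolution logic under discussion.
// TrainingParams and gpusToAllocate are illustrative, not the PR's code.
case class TrainingParams(treeMethod: String = "auto", nGpus: Int = 1)

def gpusToAllocate(p: TrainingParams): Int =
  // nGpus only matters for the GPU tree method; CPU methods ignore it,
  // so defaulting it to 1 (as on the C++ side) is safe for CPU users.
  if (p.treeMethod == "gpu_hist") p.nGpus else 0
```

With the defaults, `gpusToAllocate(TrainingParams())` is 0; only an explicit `treeMethod = "gpu_hist"` makes the default `nGpus = 1` take effect.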

@rongou
Contributor Author

rongou commented Feb 22, 2019

@CodingCat Thanks for the quick review! I don't know about fancy, XGBoost GPU support was added in 2016. :)

Yeah, it's fine to hold off until 0.82 is released, and I totally agree that not everything is ready. But isn't that the whole point of open source development? The Cathedral and the Bazaar, release early and release often, given enough eyeballs all bugs are shallow, and so on.

To address your specific concerns:

  1. The issue with [WIP] allow building spark gpu docker images apache/spark#23347 is mainly about how to customize Docker images for Spark in general, not anything specific to GPU support. There is a JIRA issue to figure this out: https://issues.apache.org/jira/browse/SPARK-24655.
  2. Yes, currently Spark is not really aware of any accelerators, but I understand Databricks is working on it. Until that is ready, maybe XGBoost-Spark can offer very limited GPU support that isn't ideal or perfect in any way, but is at least a working solution? It would also provide an important use case for Databricks to optimize for.
  3. I can work on making the current CI happy, and perhaps add some tests for distributed GPU training if Jenkins supports it.

Looking through past issues, people have asked for this (#2983, #3499), so at least there is some demand.

@CodingCat
Member

CodingCat commented Feb 22, 2019

XGBoost GPU support was added in 2016. :)

XGBoost GPU != XGBoost-Spark GPU

release early and release often, given enough eyeballs all bugs are shallow, etc.

It's not the same situation: here we know there are problems, so why push known problems onto users?

The issue with apache/spark#23347 is mainly about how to customize Docker images for Spark

Here is something copied from that thread:

"One other concern is that adding another option that we say is ready to run out of the box for GPUs, is that we have to maintain this mode and ensure it is tested in CI." - this is from mccheah, one of the main people behind Spark-on-Kubernetes, which is currently the only way to make Spark "run" with GPUs.

Until that is ready, maybe XGBoost-Spark can have very limited GPU support that's not ideal or perfect in any way but at least a working solution

No, that is not the definition of "working", especially now that XGBoost is no longer a research project. If you track how Spark accepts features, it has grown more and more conservative over time.

I can work on making the current CI happy, and perhaps add some tests for distributed GPU training if Jenkins supports it.

I mean that Spark, as the base of XGBoost-Spark, should first prove that it supports GPUs in a mature way, so feel free to work with the Spark community on this.

Looking through past issues, people have asked for this (#2983, #3499), so at least there is some demand.

I think #3499 is to be addressed in #4095

Regarding #2983: since the release of XGBoost-GPU, how many issues have been raised about GPU support, and how many of those are about distributed GPU training in Spark? I don't think the numbers are convincing enough for us to take the risk of claiming it is supported in XGBoost-Spark when even Spark doesn't claim that.

@CodingCat
Member

CodingCat commented Feb 22, 2019

Since you work for NVIDIA, I would suggest hosting a library under NVIDIA's GitHub organization, based on XGBoost-Spark, to support distributed GPU training. That is the major interest for NVIDIA, but as a community member I have concerns about the quality of this feature in the master branch.

And some day (I also hope it happens ASAP), when Spark supports GPUs better, I'd be more than happy to work with you to bring the feature here.

@@ -284,7 +286,7 @@ private[spark] object BoosterParams {

val supportedBoosters = HashSet("gbtree", "gblinear", "dart")

-  val supportedTreeMethods = HashSet("auto", "exact", "approx", "hist")
+  val supportedTreeMethods = HashSet("auto", "exact", "approx", "hist", "gpu_hist", "gpu_exact")
Contributor


Most likely, gpu_exact doesn't support running in a distributed setting. Could you remove it from the list of supported tree methods?
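Applying the reviewer's suggestion would leave the whitelist with only gpu_hist added. A sketch of the resulting set, plus a hypothetical validation helper (the helper is not part of the PR; HashSet mirrors the convention the diff itself uses):

```scala
import scala.collection.immutable.HashSet

// Per the review: gpu_exact most likely cannot run distributed,
// so only gpu_hist joins the existing CPU tree methods.
val supportedTreeMethods = HashSet("auto", "exact", "approx", "hist", "gpu_hist")

// Hypothetical helper: fail fast on an unsupported tree_method value.
def validateTreeMethod(m: String): Unit =
  require(supportedTreeMethods.contains(m), s"unsupported tree_method: $m")
```

Validating eagerly on the Spark side surfaces a bad tree_method before any distributed training is launched, rather than failing deep inside the native booster.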

@rongou rongou closed this Feb 22, 2019
@srowen
Contributor

srowen commented Apr 18, 2019

FWIW, there is an actual design for GPU resource scheduling in Spark 3.0: https://issues.apache.org/jira/browse/SPARK-24615. It will probably land in 3.0 in some form; that would be a good time to try GPU-aware scheduling. Anything else is a little hacky.
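For context, the SPARK-24615 work referenced above shipped in Spark 3.0 as resource-aware scheduling, configured roughly as below (values are illustrative only, and the discovery-script path is a placeholder):

```properties
# Spark 3.0+ GPU-aware scheduling (illustrative values)
spark.executor.resource.gpu.amount=1
spark.task.resource.gpu.amount=1
# Script that reports the GPU addresses available on each executor host
spark.executor.resource.gpu.discoveryScript=/path/to/getGpusResources.sh
```

With this in place, Spark itself tracks GPU assignments per task, which is exactly the capability the discussion above says XGBoost-Spark should wait for.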

lock bot locked as resolved and limited conversation to collaborators Jul 17, 2019

4 participants