
add gpu_hist support to Spark #4175

Closed
wants to merge 1 commit into from

Conversation

rongou
Contributor

rongou commented Feb 22, 2019

Two parts to the PR:

I've tested this on GCP with a 20-node Spark Standalone cluster, 1 T4 GPU per node.

@RAMitchell @canonizer @CodingCat @mt-jones

@CodingCat
Member

CodingCat commented Feb 22, 2019

Thanks for the contribution

First of all, we should not merge this before the 0.82 release (which is happening soon): it has never been tested by anyone except the author, and that can't realistically be done in just a few days.

Second, I don't think XGBoost-Spark is ready to support GPUs, since Spark itself is far from ready:

  1. I haven't seen a clear agreement in the Spark community about how to support GPUs (e.g. [WIP] allow building spark gpu docker images apache/spark#23347)

  2. per our colleagues at Uber, there are a lot of issues to resolve before Spark's GPU support is mature. For example, when I run a task in a CPU stage, the CPU is released as soon as the task finishes; with GPUs, the GPU resources have to stay allocated for the whole life cycle of the Spark application. (Even Databricks admits everything is in beta: https://docs.databricks.com/user-guide/clusters/gpu.html)

  3. even Spark itself has no CI tests covering its integration with GPUs, so it's hard for us, as a downstream library, to ensure the quality of this feature in XGBoost-Spark.

So I don't see myself agreeing to this fancy feature in the near future.

@@ -171,7 +179,7 @@ private[spark] trait GeneralParams extends Params {

final def getSeed: Long = $(seed)

-  setDefault(numRound -> 1, numWorkers -> 1, nthread -> 1,
+  setDefault(numRound -> 1, numWorkers -> 1, nthread -> 1, nGpus -> 1,
Member


this should be 0 at least

Contributor Author


I'd be happy to change it, but it's only used when tree_method is set to gpu_hist, in which case the user probably expects to grab a GPU. It's also the default on the C++ side (https://github.com/dmlc/xgboost/blob/master/src/tree/param.h#L200), so having this default seems less surprising and more consistent.
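The argument above can be illustrated with a small sketch (hypothetical names, not the real XGBoost4J-Spark code): because nGpus is only consulted when tree_method is gpu_hist, a default of 1 never causes a CPU run to grab a GPU, while matching what a gpu_hist user expects.

```scala
// Hypothetical sketch of the default-resolution logic under discussion.
// TrainingParams and gpusToAllocate are illustrative, not the PR's code.
case class TrainingParams(treeMethod: String = "auto", nGpus: Int = 1)

def gpusToAllocate(p: TrainingParams): Int =
  // nGpus only matters for the GPU tree method; CPU methods ignore it,
  // so defaulting it to 1 (as on the C++ side) is safe for CPU users.
  if (p.treeMethod == "gpu_hist") p.nGpus else 0
```

With the defaults, `gpusToAllocate(TrainingParams())` is 0; only an explicit `treeMethod = "gpu_hist"` makes the default `nGpus = 1` take effect.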

@rongou
Contributor Author

rongou commented Feb 22, 2019

@CodingCat Thanks for the quick review! I don't know about fancy, XGBoost GPU support was added in 2016. :)

Yeah, it's fine to hold off until 0.82 is released, and I totally agree that not everything is ready. But isn't that the whole point of open source development? The Cathedral and the Bazaar, release early and release often, given enough eyeballs all bugs are shallow, and so on.

To address your specific concerns:

  1. The issue with [WIP] allow building spark gpu docker images apache/spark#23347 is mainly about how to customize Docker images for Spark in general, not anything specific to GPU support. There is a JIRA issue to figure this out: https://issues.apache.org/jira/browse/SPARK-24655.
  2. Yes, currently Spark is not really aware of any accelerators, but I understand Databricks is working on it. Until that is ready, maybe XGBoost-Spark can offer very limited GPU support that isn't ideal or perfect in any way, but is at least a working solution? It would also provide an important use case for Databricks to optimize for.
  3. I can work on making the current CI happy, and perhaps add some tests for distributed GPU training if Jenkins supports it.

Looking through past issues, people have asked for this (#2983, #3499), so at least there is some demand.

@CodingCat
Member

CodingCat commented Feb 22, 2019

XGBoost GPU support was added in 2016. :)

XGBoost GPU != XGBoost-Spark GPU

release early and release often, given enough eyeballs all bugs are shallow, etc.

It's not the same situation: here we know there are problems, so why push known problems onto users?

The issue with apache/spark#23347 is mainly about how to customize Docker images for Spark

Here is something copied from that thread:

"One other concern is that adding another option that we say is ready to run out of the box for GPUs, is that we have to maintain this mode and ensure it is tested in CI." - this is from mccheah, one of the main people behind Spark-on-Kubernetes, which is currently the only way to make Spark "run" with GPUs.

Until that is ready, maybe XGBoost-Spark can have very limited GPU support that's not ideal or perfect in any way but at least a working solution

No, that is not the definition of "working", especially now that XGBoost is no longer a research project. If you track how Spark accepts features, it has grown more and more conservative over time.

I can work on making the current CI happy, and perhaps add some tests for distributed GPU training if Jenkins supports it.

I mean that Spark, as the base of XGBoost-Spark, should first prove that it supports GPUs in a mature way, so feel free to work with the Spark community on this.

Looking through past issues, people have asked for this (#2983, #3499), so at least there is some demand.

I think #3499 is to be addressed in #4095

Regarding #2983: since the release of XGBoost-GPU, how many issues have been raised about GPU support, and how many of those are about distributed GPU training in Spark? I don't think the numbers are convincing enough for us to take the risk of claiming it is supported in XGBoost-Spark when even Spark doesn't claim that.

@CodingCat
Member

CodingCat commented Feb 22, 2019

Since you work for NVIDIA, I would suggest hosting a library under NVIDIA's GitHub organization, based on XGBoost-Spark, to support distributed GPU training. That is the major interest for NVIDIA, but as a community member I have concerns about the quality of this feature in the master branch.

And some day (I also hope it happens ASAP), when Spark supports GPUs better, I'd be more than happy to work with you to bring the feature here.

@@ -284,7 +286,7 @@ private[spark] object BoosterParams {

val supportedBoosters = HashSet("gbtree", "gblinear", "dart")

-  val supportedTreeMethods = HashSet("auto", "exact", "approx", "hist")
+  val supportedTreeMethods = HashSet("auto", "exact", "approx", "hist", "gpu_hist", "gpu_exact")
Contributor


Most likely, gpu_exact doesn't support running in a distributed setting. Could you remove it from the list of supported tree methods?
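Applying the reviewer's suggestion would leave the whitelist with only gpu_hist added. A sketch of the resulting set, plus a hypothetical validation helper (the helper is not part of the PR; HashSet mirrors the convention the diff itself uses):

```scala
import scala.collection.immutable.HashSet

// Per the review: gpu_exact most likely cannot run distributed,
// so only gpu_hist joins the existing CPU tree methods.
val supportedTreeMethods = HashSet("auto", "exact", "approx", "hist", "gpu_hist")

// Hypothetical helper: fail fast on an unsupported tree_method value.
def validateTreeMethod(m: String): Unit =
  require(supportedTreeMethods.contains(m), s"unsupported tree_method: $m")
```

Validating eagerly on the Spark side surfaces a bad tree_method before any distributed training is launched, rather than failing deep inside the native booster.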

@rongou rongou closed this Feb 22, 2019
@srowen
Contributor

srowen commented Apr 18, 2019

FWIW, there is an actual design for GPU resource scheduling in Spark 3.0: https://issues.apache.org/jira/browse/SPARK-24615. It will probably land in 3.0 in some form; that would be a good time to try GPU-aware scheduling. Anything else is a little hacky.
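For context, the SPARK-24615 work referenced above shipped in Spark 3.0 as resource-aware scheduling, configured roughly as below (values are illustrative only, and the discovery-script path is a placeholder):

```properties
# Spark 3.0+ GPU-aware scheduling (illustrative values)
spark.executor.resource.gpu.amount=1
spark.task.resource.gpu.amount=1
# Script that reports the GPU addresses available on each executor host
spark.executor.resource.gpu.discoveryScript=/path/to/getGpusResources.sh
```

With this in place, Spark itself tracks GPU assignments per task, which is exactly the capability the discussion above says XGBoost-Spark should wait for.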

lock bot locked as resolved and limited conversation to collaborators Jul 17, 2019

4 participants