
Move prediction cache to Learner. #5220

Merged: 3 commits into dmlc:master, Feb 14, 2020
Conversation

@trivialfis (Member) commented Jan 22, 2020

Clean-ups

  • Remove duplicated cache in Learner and GBM.
  • Remove ad-hoc fix of invalid cache.
  • Remove PredictFromCache in predictors.
  • Remove prediction cache for linear altogether, as it only moves the prediction into the training process without providing any actual overall speed gain.
  • The cache is now unique to Learner, which means the ownership is no longer shared by any other components.

Changes

  • Add version to prediction cache.
  • Use weak ptr to check expired DMatrix.
  • Pass shared pointer instead of raw pointer.
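As a rough illustration of how the version and the weak pointer combine (all names here are hypothetical toy stand-ins, not xgboost's actual classes): a cache entry can hold a version counter plus a std::weak_ptr, letting the Learner detect both a stale cache and a freed DMatrix without extending the matrix's lifetime:

```cpp
#include <cstdint>
#include <memory>

// Toy stand-in for xgboost's DMatrix; purely illustrative.
struct DMatrix {};

// Hypothetical cache entry: the weak_ptr never keeps the DMatrix
// alive, and the version records how many trees are already folded
// into the cached predictions.
struct PredictionCacheEntry {
  std::weak_ptr<DMatrix> ref;  // expires when the DMatrix is freed
  std::uint32_t version{0};    // trees reflected in the cached preds
};

// The cached predictions are usable only if the DMatrix still exists
// and the cache has seen every tree in the model.
bool CacheIsValid(PredictionCacheEntry const& entry,
                  std::uint32_t model_trees) {
  return !entry.ref.expired() && entry.version == model_trees;
}
```

This is only a sketch of the mechanism described in the bullet points, under the assumption that validity is "not expired and up to date with the model".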

Risks

Previously, the DMatrix was tied to the Booster with a shared pointer, which meant access to it was always safe since it couldn't expire before the Booster. Now we need to ensure that safety ourselves.

Advantages

As we no longer store the shared pointer, this should reduce memory usage when doing grid search with libraries like Scikit-Learn (during evaluation the training matrix can be released). Also, performance for training continuation (incremental training, checkpoint recovery) will be significantly better, as we cache all encountered DMatrices.

Lastly, this PR is intended to ease the development of our DMatrix refactoring plan, which requires an easily accessible and reliable prediction cache so that we can use quantile values as a proxy for actual data.

Related issues and PRs:

PRs:

#5272

Issues:

#3946
#4786
#4774
#4482

@trivialfis (Member Author)

Requires some more thought to make a clean design for the cache.

@RAMitchell (Member) left a comment:

As an idea, what if the learner handles all prediction caching and when it needs to update the cache it calls predict only on the most recent tree. We remove all functions like "UpdatePredictionCache" and the caching mechanism is isolated to the learner. There might be a small loss of efficiency because we do not make use of the information in the Updater, but we still achieve algorithmic efficiency by predicting from only one tree per iteration.
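The idea above amounts to the learner remembering how many trees its cached margin already includes and, each iteration, predicting only with the trees added since. A toy sketch (hypothetical names; trees reduced to simple functions, not xgboost's real API):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Toy tree: maps a row's feature value to a leaf contribution.
using Tree = std::function<double(double)>;

// Adds only the contribution of trees not yet folded into the cached
// predictions, then marks the cache as current. This is the
// "predict from only the new trees per iteration" idea in miniature.
void UpdateCache(std::vector<double>* preds, std::vector<double> const& rows,
                 std::vector<Tree> const& trees, std::size_t* tree_begin) {
  for (std::size_t t = *tree_begin; t < trees.size(); ++t) {
    for (std::size_t i = 0; i < rows.size(); ++i) {
      (*preds)[i] += trees[t](rows[i]);
    }
  }
  *tree_begin = trees.size();  // cache now reflects every tree
}
```

Per round this touches one tree (or one forest) rather than the whole model, which is the algorithmic efficiency the comment refers to.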

@trivialfis (Member Author) commented Feb 10, 2020

@hcho3 We talked about adding a custom segfault handler like: https://github.com/bombela/backward-cpp

I added a simple one in this PR for debugging on Jenkins (and it was helpful). But it breaks the JVM package's error handling/recovery logic, so I will remove it once the PR is done. You may find it useful in the future, as it's super simple (roughly 50 added lines).

@trivialfis trivialfis marked this pull request as ready for review February 10, 2020 18:52
@trivialfis trivialfis changed the title [WIP] Move prediction cache to Learner. Move prediction cache to Learner. Feb 10, 2020
Two resolved review threads on include/xgboost/predictor.h (outdated).
@RAMitchell (Member) left a comment:

Nice work, I think the approach is solid.

This PR is too large to review properly. The gblinear changes can be extracted. There are unnecessary formatting changes and changes not related to prediction caching. Changing the DMatrix pointer arguments to shared pointers could potentially be extracted.

I want to review the changes to the caching logic more carefully and in parts. You can help by breaking things down as much as possible. As you know, this part of the code base is very bug prone.

src/learner.cc (resolved review thread)
src/learner.cc (outdated):

predictions_.SetDevice(generic_parameters_.gpu_id);
predictions_.Resize(predt.predictions.Size());
predictions_.Copy(predt.predictions);
(Member)

Is there an extra copy here that wasn't there before?

@trivialfis (Member Author) commented Feb 11, 2020:

@RAMitchell It's the copy moved from Predictor::PredictFromCache. I removed that function altogether: now that the cache is managed by Learner, the copying is also done by Learner. It's one copy fewer overall, as I also removed the temporary vector in the C API.

(Member)

If I use two differently sized evaluation matrices, the predictions vector will get resized twice every single iteration. This is a very common case.

src/learner.cc (resolved review thread)
src/learner.cc (outdated):
this->PredictRaw(data.get(), predts, training, ntree_limit);
out_preds->SetDevice(generic_parameters_.gpu_id);
out_preds->Resize(predts->predictions.Size());
out_preds->Copy(predts->predictions);
(Member)

Again, is there extra copying happening here?

src/learner.cc (resolved review thread)
src/tree/updater_prune.cc (resolved review thread)
@trivialfis (Member Author)

Will revert some changes. I sometimes added formatting changes because clangd warns via clang-tidy checks, and it's quite difficult to focus otherwise.

@trivialfis (Member Author)

> Changing the DMatrix pointer arguments to shared pointers could potentially be extracted.

This might not be possible as we need the shared pointer for constructing std::weak_ptr.
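For context, this is a language-level constraint: std::weak_ptr has no constructor taking a raw pointer, only one observing a shared_ptr's control block. A minimal illustration (toy DMatrix type, hypothetical helper name):

```cpp
#include <memory>

// Toy stand-in for the real class.
struct DMatrix {};

// A weak_ptr can only be bound to an existing shared_ptr; there is no
// std::weak_ptr<DMatrix>(DMatrix*) constructor. So any call site that
// should feed the weak_ptr-based cache must hand over a shared_ptr.
std::weak_ptr<DMatrix> MakeCacheKey(std::shared_ptr<DMatrix> const& p) {
  return std::weak_ptr<DMatrix>{p};
}
```

This is why the raw-pointer-to-shared-pointer signature change is hard to split out from the caching change itself.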

@RAMitchell (Member)

You could do the shared pointers first. This way we get past some harmless changes before we look at the core logic.

int old_ntree = model.trees.size() - num_new_trees;
// update cache entry
for (auto& kv : (*cache_)) {
@trivialfis (Member Author) commented Feb 11, 2020:

It might be worth noting that before this PR, this function updated the cache for every DMatrix, including validation datasets. Now it only updates the training DMatrix, as I don't think the updater can/should have information about validation datasets in the future (otherwise that would be peeking, which violates the purpose of validation).

@trivialfis (Member Author)

Not sure why, but putting a HostDeviceVector into thread-local storage crashes Python on Windows at the end of execution. @RAMitchell have you ever experienced this?

@RAMitchell (Member)

If you use device vectors as static variables, they can get destructed after the CUDA context has been released, leading to errors.
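This is the classic destruction-order hazard: objects are destroyed in reverse order of construction, so a device buffer created before a context outlives that context and frees its memory too late. A CUDA-free sketch of the pattern (toy types, not xgboost code):

```cpp
#include <string>
#include <vector>

// Records teardown order so it can be inspected.
std::vector<std::string> g_events;

// Toy model of the CUDA context.
struct Context {
  ~Context() { g_events.push_back("context torn down"); }
};

// Toy model of a device vector whose cleanup needs a live context.
struct DeviceBuffer {
  ~DeviceBuffer() { g_events.push_back("buffer freed"); }
};

// Objects are destroyed in reverse order of construction (the same
// rule statics follow at process exit): the context dies first, and
// the buffer is freed afterwards, when the context is already gone.
void SimulateTeardown() {
  DeviceBuffer buffer;  // constructed first, destroyed last
  Context context;      // constructed last, destroyed first
}
```

In the real case, "buffer freed" would be a device-memory release attempted after CUDA teardown, which is plausibly where the crash reported above comes from.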

@trivialfis (Member Author)

This might make #5207 difficult to implement...

@trivialfis (Member Author)

@RAMitchell Anyway, a problem for another day. I cleaned up the PR per your request. Formatting changes are reverted. But the change in linear is kept: since the GBM no longer has access to the global prediction cache, the code removal for linear is necessary.

@RAMitchell (Member) left a comment:

Looks very nice. Is using an integer for prediction state and begin/end ranges the most robust approach? You have already noted that this causes ambiguity when there are multiple output groups. In a perfect world we would have unique hashes for every tree generated and be able to see which trees have contributed to a prediction vector. Maybe we could get closer to this?

It looks to me like your version is smarter and only computes predictions that it needs, now that some state is carried along with the prediction vector. If you change the updaters to return false in UpdatePredictionCache you could get an idea of performance regression from removing these functions. Of course we cannot do this until gpu_predictor is able to predict from Ellpack, otherwise we have memory allocation problems.


Two review threads on src/predictor/cpu_predictor.cc.
@trivialfis (Member Author) commented Feb 12, 2020

@RAMitchell

> If I use two differently sized evaluation matrices, the predictions vector will get resized twice every single iteration.

Before this PR, Learner stored another prediction vector for each DMatrix (hence two prediction vectors per DMatrix: one for the cache, one as a temporary). I can restore that to avoid the resize. Alternatively, I can change the metric input to a span and resize the prediction vector only when it doesn't have sufficient space for the current DMatrix.

Update:

I restored the additional vector for each matrix to avoid any further refactoring in this PR.

@trivialfis (Member Author) commented Feb 12, 2020

I don't have a better name than version. If you are boosting a forest with 4 trees per forest, the version advances by 4 each round, so boosted_rounds is not an option. This is non-ideal, but right now there's not much we can do.
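In other words, the version counts trees rather than rounds, so the two diverge as soon as each round adds more than one tree. The arithmetic, as a trivially checkable sketch (hypothetical helper name, illustrative only):

```cpp
#include <cstdint>

// Illustrative bookkeeping only: the cache version tracks trees, so a
// forest of k trees advances it by k per boosting round.
std::uint32_t VersionAfter(std::uint32_t rounds,
                           std::uint32_t trees_per_round) {
  return rounds * trees_per_round;
}
```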

As for the other problem regarding the refresh updater that I discussed offline with @RAMitchell: it should be fine, as training continuation requires at least one serialization, which clears the cache.

@mli (Member) commented Feb 12, 2020

Codecov Report

Merging #5220 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #5220   +/-   ##
=======================================
  Coverage   83.76%   83.76%           
=======================================
  Files          11       11           
  Lines        2409     2409           
=======================================
  Hits         2018     2018           
  Misses        391      391           

Last update 29eeea7...4fcc43e.

@RAMitchell (Member) left a comment:

LGTM.

Looking through the predict function it seems like there are about four copies between generating the predictions and reaching the end user. I think we need to look at optimising this process. We can look at this when considering prediction to device memory.

The other thing I think can be improved with the predict API is passing in a bunch of bools to change behaviour. I think there should be separate functions for separate functionality.

@trivialfis (Member Author)

The copying is annoying. I don't have any thoughts on how to get rid of it yet.

@trivialfis trivialfis merged commit c35cdec into dmlc:master Feb 14, 2020
@trivialfis trivialfis deleted the prediction-cache branch February 14, 2020 05:04
@hcho3 mentioned this pull request Feb 21, 2020
@lock lock bot locked as resolved and limited conversation to collaborators May 20, 2020