
[native] [jvm-packages] allow rebuild prediction cache when it is not initialized #5272

Closed · wants to merge 19 commits

Conversation

@CodingCat (Member) commented Feb 2, 2020

The major purpose is to fix the performance issue when training from a checkpoint.

We call prediction in three places: UpdateOneIter (for the residuals), evaluation, and updating the prediction cache. Because of the prediction cache, we only really perform prediction when updating the cache and reuse the produced results in the other two places.

The performance degradation comes from the fact that when we train from a checkpoint we lose the prediction cache. As a result, we perform a full prediction in both UpdateOneIter and EvalOneIter (since the cache is lost, we do not predict when updating the cache).

This PR fixes the issue by adding the DMatrix back to the cache when the cache has been lost (this only happens in the JVM packages, and only when a checkpoint is set).
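A rough sketch of the intended behavior, using made-up names (Learner, PredictionCacheEntry, EnsureCached); this is not the actual XGBoost code, only the idea of recreating a missing cache entry so later prediction calls can reuse it:

```cpp
#include <map>
#include <memory>
#include <vector>

// Hypothetical stand-ins for illustration only; not XGBoost's real types.
struct DMatrix {};

struct PredictionCacheEntry {
  std::shared_ptr<DMatrix> data;
  std::vector<float> predictions;  // cached per-row predictions
};

class Learner {
 public:
  // When training resumes from a checkpoint the cache starts out empty, so the
  // entry for the training DMatrix is recreated here instead of letting
  // UpdateOneIter and EvalOneIter fall back to a full prediction every round.
  void EnsureCached(std::shared_ptr<DMatrix> const& dmat) {
    if (cache_.find(dmat.get()) == cache_.end()) {
      cache_[dmat.get()] = PredictionCacheEntry{dmat, {}};
    }
  }

 private:
  std::map<DMatrix const*, PredictionCacheEntry> cache_;
};

int main() {
  Learner learner;
  auto dtrain = std::make_shared<DMatrix>();
  learner.EnsureCached(dtrain);  // rebuilds the missing cache entry once
}
```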

relevant issues

#3946

#4786

#4774

@CodingCat (Member Author)

@trivialfis would you please help review? (I will remove those std::cout calls right before merging)

@CodingCat (Member Author)

@trams the last fix for checkpoint performance is here

@trivialfis (Member)

We face a similar issue. My thought is to cache all DMatrix objects, instead of only the ones passed in through the constructors.

@CodingCat (Member Author)

We face a similar issue. My thought is to cache all DMatrix objects, instead of only the ones passed in through the constructors.

That sounds like what I did here: I added a parameter, adding_all_to_cache, to avoid breaking the current behavior... but my limited C++ experience makes me struggle with some double-free error (?)

@trivialfis (Member)

I'm refactoring the prediction cache. If you don't mind me being slow ( I'm working from home with a laptop ) I will take it from here.

@CodingCat (Member Author)

I'm refactoring the prediction cache. If you don't mind me being slow ( I'm working from home with a laptop ) I will take it from here.

sure, that's awesome!

@CodingCat (Member Author)

I'm refactoring the prediction cache. If you don't mind me being slow ( I'm working from home with a laptop ) I will take it from here.

take care BTW given the current situation

@trivialfis (Member)

Yup. Thanks!

@CodingCat (Member Author)

I'm refactoring the prediction cache. If you don't mind me being slow ( I'm working from home with a laptop ) I will take it from here.

I saw your PR. Yeah, that's kind of a must-do. I found that my approach here is problematic in terms of the scope of the shared_ptr, which is very hard to manage with the current architecture.

@trivialfis (Member)

I expect that PR to simplify the management of the cache a lot. Once it is merged we can refactor the existing interfaces together. I don't mind breaking changes for this caching behavior. As for memory usage concerns, I plan to utilize weak_ptr in the future for detecting destructed DMatrix objects and hence cleaning up the cache.
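A minimal sketch of the weak_ptr idea described above, with made-up names (PredictionCache, CacheEntry, Prune); this is not the interface that eventually landed, only an illustration of detecting destructed DMatrix objects:

```cpp
#include <cstddef>
#include <iostream>
#include <memory>
#include <unordered_map>
#include <vector>

// Made-up types for illustration; not the actual XGBoost prediction cache.
struct DMatrix {};

struct CacheEntry {
  std::weak_ptr<DMatrix> data;     // does not keep the DMatrix alive
  std::vector<float> predictions;  // cached predictions for this DMatrix
};

class PredictionCache {
 public:
  // Every DMatrix handed in gets an entry; nothing is special-cased.
  CacheEntry& Entry(std::shared_ptr<DMatrix> const& m) {
    Prune();
    CacheEntry& e = cache_[m.get()];
    e.data = m;
    return e;
  }

  // Drop entries whose DMatrix has already been destructed.
  void Prune() {
    for (auto it = cache_.begin(); it != cache_.end();) {
      if (it->second.data.expired()) {
        it = cache_.erase(it);
      } else {
        ++it;
      }
    }
  }

  std::size_t Size() const { return cache_.size(); }

 private:
  std::unordered_map<DMatrix const*, CacheEntry> cache_;
};

int main() {
  PredictionCache cache;
  {
    auto dtrain = std::make_shared<DMatrix>();
    cache.Entry(dtrain);
    std::cout << cache.Size() << "\n";  // 1
  }                                     // dtrain destructed here
  cache.Prune();
  std::cout << cache.Size() << "\n";    // 0: the stale entry was cleaned up
}
```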

@RAMitchell (Member)

A weak pointer sounds perfect; then we just add everything to the cache without worrying.

@trams (Contributor) left a comment

Thank you for submitting this change!

I like your approach: you just want to recreate the cache entry if it is missing.
I am a little bit worried about the implementation, but mostly because I do not really understand one line of code (see comments).

P.S. Full disclosure: I am no longer part of Criteo, so I can't test this change on their dataset anymore.

@@ -202,6 +202,7 @@ class LearnerImpl : public Learner {
tparam_.UpdateAllowUnknown(args);
mparam_.UpdateAllowUnknown(args);
generic_parameters_.UpdateAllowUnknown(args);
std::cout << "all_to_prediction_cache:" << generic_parameters_.adding_all_to_cache << "\n";
@trams (Contributor)

Just do not forget to remove it before merging

return;
} else {
std::cout << "adding dmatrix to cache\n";
(*cache_)[dmat].data = static_cast<std::shared_ptr<DMatrix>>(dmat);
@trams (Contributor)

I am a bit confused by this line of code. It has been quite a while since I last dived into this part of the code.
(Full disclosure: I recently switched companies and I no longer work at Criteo on xgboost :( )

My C++ is a little bit rusty with regard to static_cast in this context, so please feel free to correct me.
I am reading https://en.cppreference.com/w/cpp/language/static_cast and trying to figure out which case applies here.
The constructor of shared_ptr is marked explicit, so this can't be an implicit object creation, but then I do not understand what it can be; shared_ptr is not a child class of DMatrix.

Are you creating a new shared_ptr from the raw pointer dmat? If so, I strongly advise against it. It usually leads to errors.

I thought about a similar fix, but I rejected it because I could not figure out how to retrieve the already existing shared_ptr from just a DMatrix* (I want a copy of that reference-counting pointer).

What I had in mind is changing the interface of Predict to pass a shared_ptr in the first place instead of just a raw pointer. That would permit us to recreate the cache.

@trams (Contributor)

I consulted with my C++ friends and they pointed me to https://stackoverflow.com/questions/32713083/explicit-constructor-and-static-cast
So the static_cast here will actually call the constructor and create a second, separate shared_ptr which does not know about any of the other shared_ptrs managing this pointer.

Sooner or later this will result in a double free, because each of these smart pointers will try to free the resource.

In the end, I suggest waiting for @trivialfis's refactoring. It should enable you (or us) to solve this problem.
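A minimal standalone reproduction of the problem described above (not XGBoost code): rewrapping a raw pointer in a new shared_ptr creates a second, independent owner, and each owner eventually deletes the same object.

```cpp
#include <memory>

struct DMatrix {};

int main() {
  auto owner = std::make_shared<DMatrix>();  // first owner, use_count == 1
  DMatrix* raw = owner.get();                // raw pointer, as Predict receives it

  // Equivalent to static_cast<std::shared_ptr<DMatrix>>(raw): this invokes the
  // explicit shared_ptr(DMatrix*) constructor and starts a *second*,
  // independent control block that knows nothing about `owner`.
  std::shared_ptr<DMatrix> second(raw);

  // Both `owner` and `second` believe they solely own the object; when they go
  // out of scope, each deletes the same DMatrix -> double free / crash.
  // The usual way out is to pass the existing shared_ptr along (or recover it
  // via std::enable_shared_from_this) instead of rewrapping the raw pointer.
}
```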

@CodingCat (Member Author)

Yeah, that's why the current unit test is broken.

@trams (Contributor) commented Feb 3, 2020

@alois-bissuel, this change should fix the checkpointing performance issue. Could you ask the people at Criteo who work on xgboost to test it? Thank you.

@trivialfis (Member)

@CodingCat Could you please help take a look at #5220?

@trivialfis (Member)

@CodingCat I believe the issue is resolved in #5220, as it caches all DMatrix objects. Can you test it out?
