
[native] [jvm-packages] allow rebuild prediction cache when it is not initialized #5272

Closed · wants to merge 19 commits

Conversation

@CodingCat (Member) commented Feb 2, 2020

The major purpose is to fix the performance issue when training from a checkpoint.

We call prediction in three places: UpdateOneIter (for the residuals), evaluation, and updating the prediction cache. Because of the prediction cache, we only really perform prediction when updating the cache and reuse the produced results in the other two places.

The performance degradation comes from the fact that when we train from a checkpoint we lose the prediction cache. As a result, we perform a full prediction in both UpdateOneIter and EvalOneIter (since the cache is lost, we do not predict when updating the cache).

This PR fixes the issue by adding the DMatrix back to the cache when the cache has been lost (this only happens in the JVM packages, and only when a checkpoint is set).
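A rough sketch of the intended behavior, using made-up names (Learner, PredictionCacheEntry, EnsureCached); this is not the actual XGBoost code, only the idea of recreating a missing cache entry so later prediction calls can reuse it:

```cpp
#include <map>
#include <memory>
#include <vector>

// Hypothetical stand-ins for illustration only; not XGBoost's real types.
struct DMatrix {};

struct PredictionCacheEntry {
  std::shared_ptr<DMatrix> data;
  std::vector<float> predictions;  // cached per-row predictions
};

class Learner {
 public:
  // When training resumes from a checkpoint the cache starts out empty, so the
  // entry for the training DMatrix is recreated here instead of letting
  // UpdateOneIter and EvalOneIter fall back to a full prediction every round.
  void EnsureCached(std::shared_ptr<DMatrix> const& dmat) {
    if (cache_.find(dmat.get()) == cache_.end()) {
      cache_[dmat.get()] = PredictionCacheEntry{dmat, {}};
    }
  }

 private:
  std::map<DMatrix const*, PredictionCacheEntry> cache_;
};

int main() {
  Learner learner;
  auto dtrain = std::make_shared<DMatrix>();
  learner.EnsureCached(dtrain);  // rebuilds the missing cache entry once
}
```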

relevant issues

#3946

#4786

#4774

@CodingCat (Member Author)

@trivialfis would you please help review? (I will remove those std::cout calls right before merging)

@CodingCat (Member Author)

@trams the last fix for checkpoint performance is here

@trivialfis (Member)

We face a similar issue. My thought is to cache all DMatrix objects, instead of only the ones passed in through the constructors.

@CodingCat (Member Author)

We face a similar issue. My thought is to cache all DMatrix objects, instead of only the ones passed in through the constructors.

That sounds like what I did here: I added a parameter, adding_all_to_cache, to avoid breaking the current behavior... but my limited C++ experience makes me struggle with some double-free error (?)

@trivialfis (Member)

I'm refactoring the prediction cache. If you don't mind me being slow ( I'm working from home with a laptop ) I will take it from here.

@CodingCat (Member Author)

I'm refactoring the prediction cache. If you don't mind me being slow ( I'm working from home with a laptop ) I will take it from here.

sure, that's awesome!

@CodingCat (Member Author)

I'm refactoring the prediction cache. If you don't mind me being slow ( I'm working from home with a laptop ) I will take it from here.

take care BTW given the current situation

@trivialfis (Member)

Yup. Thanks!

@CodingCat (Member Author)

I'm refactoring the prediction cache. If you don't mind me being slow ( I'm working from home with a laptop ) I will take it from here.

I saw your PR. Yeah, that's kind of a must-do. I found that my approach here is problematic in terms of the scope of the shared_ptr, which is very hard to manage with the current architecture.

@trivialfis (Member)

I expect that PR to simplify the management of the cache a lot. Once it is merged we can refactor the existing interfaces together. I don't mind breaking changes for this caching behavior. As for memory usage concerns, I plan to utilize weak_ptr in the future for detecting destructed DMatrix objects and hence cleaning up the cache.
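A minimal sketch of the weak_ptr idea described above, with made-up names (PredictionCache, CacheEntry, Prune); this is not the interface that eventually landed, only an illustration of detecting destructed DMatrix objects:

```cpp
#include <cstddef>
#include <iostream>
#include <memory>
#include <unordered_map>
#include <vector>

// Made-up types for illustration; not the actual XGBoost prediction cache.
struct DMatrix {};

struct CacheEntry {
  std::weak_ptr<DMatrix> data;     // does not keep the DMatrix alive
  std::vector<float> predictions;  // cached predictions for this DMatrix
};

class PredictionCache {
 public:
  // Every DMatrix handed in gets an entry; nothing is special-cased.
  CacheEntry& Entry(std::shared_ptr<DMatrix> const& m) {
    Prune();
    CacheEntry& e = cache_[m.get()];
    e.data = m;
    return e;
  }

  // Drop entries whose DMatrix has already been destructed.
  void Prune() {
    for (auto it = cache_.begin(); it != cache_.end();) {
      if (it->second.data.expired()) {
        it = cache_.erase(it);
      } else {
        ++it;
      }
    }
  }

  std::size_t Size() const { return cache_.size(); }

 private:
  std::unordered_map<DMatrix const*, CacheEntry> cache_;
};

int main() {
  PredictionCache cache;
  {
    auto dtrain = std::make_shared<DMatrix>();
    cache.Entry(dtrain);
    std::cout << cache.Size() << "\n";  // 1
  }                                     // dtrain destructed here
  cache.Prune();
  std::cout << cache.Size() << "\n";    // 0: the stale entry was cleaned up
}
```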

@RAMitchell (Member)

A weak pointer sounds perfect; then we just add everything to the cache without worrying.

@trams (Contributor) left a comment

Thank you for submitting this change!

I like your approach: you just want to recreate the cache entry if it is missing.
I am a little bit worried about the implementation, but mostly because I do not really understand one line of code (see comments).

P.S. Full disclosure: I am no longer part of Criteo, so I can't test this change on their dataset anymore.

@@ -202,6 +202,7 @@ class LearnerImpl : public Learner {
tparam_.UpdateAllowUnknown(args);
mparam_.UpdateAllowUnknown(args);
generic_parameters_.UpdateAllowUnknown(args);
std::cout << "all_to_prediction_cache:" << generic_parameters_.adding_all_to_cache << "\n";
@trams (Contributor)

Just do not forget to remove it before merging

return;
} else {
std::cout << "adding dmatrix to cache\n";
(*cache_)[dmat].data = static_cast<std::shared_ptr<DMatrix>>(dmat);
@trams (Contributor)

I am a bit confused by this line of code. It has been quite a while since I last dived into this part of the code.
(Full disclosure: I recently switched companies and I no longer work at Criteo on xgboost :( )

My C++ is a little bit rusty with regard to static_cast in this context, so please feel free to correct me.
I am reading https://en.cppreference.com/w/cpp/language/static_cast and trying to figure out which case applies here.
The constructor of shared_ptr is marked explicit, so this can't be an implicit object creation, but then I do not understand what it can be; shared_ptr is not a child class of DMatrix.

Are you creating a new shared_ptr from the raw pointer dmat? If so, I strongly advise against it. It usually leads to errors.

I thought about a similar fix, but I rejected it because I could not figure out how to retrieve the already existing shared_ptr from just a DMatrix* (I want a copy of that reference-counting pointer).

What I had in mind is changing the interface of Predict to pass a shared_ptr in the first place instead of just a raw pointer. That would permit us to recreate the cache.

@trams (Contributor)

I consulted with my C++ friends and they pointed me to https://stackoverflow.com/questions/32713083/explicit-constructor-and-static-cast
So the static_cast here will actually call the constructor and create a second, separate shared_ptr which does not know about any of the other shared_ptrs managing this pointer.

Sooner or later this will result in a double free, because each of these smart pointers will try to free the resource.

In the end, I suggest waiting for @trivialfis's refactoring. It should enable you (or us) to solve this problem.
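A minimal standalone reproduction of the problem described above (not XGBoost code): rewrapping a raw pointer in a new shared_ptr creates a second, independent owner, and each owner eventually deletes the same object.

```cpp
#include <memory>

struct DMatrix {};

int main() {
  auto owner = std::make_shared<DMatrix>();  // first owner, use_count == 1
  DMatrix* raw = owner.get();                // raw pointer, as Predict receives it

  // Equivalent to static_cast<std::shared_ptr<DMatrix>>(raw): this invokes the
  // explicit shared_ptr(DMatrix*) constructor and starts a *second*,
  // independent control block that knows nothing about `owner`.
  std::shared_ptr<DMatrix> second(raw);

  // Both `owner` and `second` believe they solely own the object; when they go
  // out of scope, each deletes the same DMatrix -> double free / crash.
  // The usual way out is to pass the existing shared_ptr along (or recover it
  // via std::enable_shared_from_this) instead of rewrapping the raw pointer.
}
```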

@CodingCat (Member Author)

Yeah, that's why the current unit test is broken.

@trams (Contributor) commented Feb 3, 2020

@alois-bissuel, this change should fix the checkpointing performance issue. Could you ask the people at Criteo who work on xgboost to test it? Thank you.

@trivialfis (Member)

@CodingCat Could you please help take a look at #5220?

@trivialfis (Member)

@CodingCat I believe the issue is resolved in #5220, as it caches all DMatrix objects. Can you test it out?
