-
Notifications
You must be signed in to change notification settings - Fork 184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove stop process. #143
Remove stop process. #143
Conversation
Why we want to do this? Will this break some xgb jvm unit test? |
@CodingCat Not sure, is there any test that's expecting rabit to bring down the all workers in the process? I will submit a PR on XGBoost to run the full test instead. Also you proposed that we might subsume rabit into XGBoost and I refused to do so in before. I'm starting to think that you are right, what do you think of it now? |
I'm working on dmlc/xgboost#4826, it's still on early planning. I need to cleanup the rabit codebase first. |
I think error on exit were default implementation until we realized this is causing trouble in jvm side, we added throw exception with this flag. Please test if it will break any unit test but feature wise seems okay ASFAIK. |
@chenqin I revisited your doc around bootstrapping cache recently. Do you have suggestions on handling duplicated allreduce function calls? |
Sorry about late reply, can you share new approaches. Regarding to duplicated allreduce function calls, I think the issue unresolved is if caller try to pass in from a encapsulated method where we may not be able to generate a unique caller footprint. It will leave bootstrap cache hard to match historical cache. I used to have some thoughts to fix this..
|
So far I want to remove single point recovery. I tried to think about falling back to using seq number as the calls should have strict ordering and deterministic. But with async programming, there might be infinite edge cases and violates the ordering for multiple DMatrix construction. Feel free to correct me if I'm wrong. My expectation is XGBoost can fail gracefully on both spark and dask. |
sounds good, if we know for sure compute layer can't offer single point recovery, we should consider remove all caches and make implementation much simple |
@chenqin Thanks for the replies, let me think about it. |
I think there are one or more tests relying on this tag to handle missing value or invalid metrics or something in task layer to avoid shutting down JVM process and let SparkContextKiller to handle the post action if you remove this, the test may not be runnable
I think it makes sense to move rabit under XGB, as I do not think anyone else are relying on this module and XGB has a tight dependency with Rabit |
Can we merge this now that all tests on XGBoost are passing? |
No description provided.