Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add JSON IO for various components. #4732

Closed
wants to merge 3 commits into from

Conversation

trivialfis
Copy link
Member

@trivialfis trivialfis commented Aug 4, 2019

  • Split up model IO and serialization.
  • Unify the use of num_feature, base_score and num_output_group across XGBoost with learner_model_param_ (LearnerModelParam). This is passed as a pointer to const across XGBoost. Deprecate all duplicated parameter. Mark original model parameter as legacy.
  • Unify num_output_group and num_class.
  • Add JSON for both model IO and serialisation.
  • Remove duplicated parameter in split evaluator.
  • Fix wrong predict contribution tests.
  • Add tests for JSON IO in both cxx and Python.
  • Save updaters for carrying out training continuation.
  • Rigorous tests for training continuation.
  • Define prediction cache in gbtree instead of predictor, as the cache needs to be in sync between cpu_predictor and gpu_predictor.
  • Add basic documentation for the serialisation format.
  • Enabled save/load config in Python pickle.

Todos:

@trivialfis trivialfis mentioned this pull request Aug 4, 2019
10 tasks
@trivialfis
Copy link
Member Author

@hetong007 Do you have any suggestion for releasing CRAN source distribution with CMake instead of Make?

@trivialfis
Copy link
Member Author

Also @pommedeterresautee ;-).

@hetong007
Copy link
Member

I'm probably the least experienced developer of XGBoost in terms of CMake. Adding @khotilov as another potential source of suggestions.

src/c_api/c_api.cc Outdated Show resolved Hide resolved
src/common/common.h Outdated Show resolved Hide resolved
src/common/json.cc Outdated Show resolved Hide resolved
src/learner.cc Outdated Show resolved Hide resolved
@hcho3
Copy link
Collaborator

hcho3 commented Aug 5, 2019

@trivialfis @hetong007 I'm quite nervous about requiring CMake in the R package. There are lots of different machines the R package get tested, and we have to pass every single one of them in order to keep the package on CRAN. See https://cran.r-project.org/web/checks/check_results_xgboost.html for the list of machines. My concern is that not every one of the testing machines would have CMake installed.

Of course, we could simply ask users to compile from source and not bother with CRAN, but then we'd lose the ability to install XGBoost with install.packages('xgboost'). Note that the MXNet R package is not on CRAN for similar reasons.

@trivialfis
Copy link
Member Author

@hcho3 Thanks for the link. Let's keep them there for now. Will open a new issue when I have something concrete.

@pommedeterresautee
Copy link
Member

Hi, there is another point to take into account, I have made a PR on remotes package and when (soon I hope) it will be merged, XGBoost R package from Github will be installable in 1 line directly on R with remotes or devtools instead of current process (git clone taking care of submodules, etc.) requiring a working toolchain outside of the R environment.

I am wondering if switching to Cmake would break things or even worst make it harder for new comers to setup the toolchain to be able to install github code.

@trivialfis
Copy link
Member Author

trivialfis commented Aug 5, 2019

@trams Please help reviewing. This is extracted from previous monolithic PR with specialization and schema removed.

Currently the performance is quite bad, it takes about 1.5s to load and 2.0s to save for a 25MB JSON model. Running valgrind most of the time spent on memory allocation (std::vector from JsonArray). I plan to add a custom pool allocator for all objects or use input string as buffer in a later PR. For now I'm going to prioritize on correctness.

Will fix the failing tests. Unsurprisingly, handling those dynamic parameters is difficult. But the general structure for JSON integration should be ready for review.

@trivialfis
Copy link
Member Author

@pommedeterresautee Thanks for the information. Let me see if I can simplify those makefiles.

@trams
Copy link
Contributor

trams commented Aug 5, 2019

@trivialfis, thank you for mentioning me in this review.
Right now (till the end of the week) I am attending the conference and I am not sure I will have time to give a comprehensive review

As for performance I recently learned about https://github.com/lemire/simdjson a very fast json parser (~2-3Gb per second). I do not know your reasons to implement your own json parser vs reusing the third party but either way I suggest to dive into their code for tips on how they implemented high performance json parser (see here one of their benchmarks https://lemire.me/blog/2019/08/02/json-parsing-simdjson-vs-json-for-modern-c/)

@trivialfis
Copy link
Member Author

trivialfis commented Aug 5, 2019

@trams Thanks, I'm aware of SIMD json. But requiring avx is not feasible in XGBoost. I will try simpler method like iterative parsing, single allocator etc. I'm not too worried about that.

We don't want to add dependencies.

@trivialfis
Copy link
Member Author

Emm. I might need to add an UpdateAllowUnknown method to dmlc::Parameter so that I don't have to save the parameter cache.

@trivialfis trivialfis changed the title Add JSON IO for various components. [WIP] Add JSON IO for various components. Aug 6, 2019
@trivialfis
Copy link
Member Author

trivialfis commented Aug 6, 2019

@RAMitchell Let me try to extract an even smaller part of this first to push the cudf PR forward.

@trivialfis
Copy link
Member Author

@hcho3 @RAMitchell The added Load method is very much like our Configure method, except for Json is structured while configuration is not. I wonder if there's a way to merge two of them.

Copy link
Contributor

@trams trams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not full review. I hope to find some time to finish it in next few days

CMakeLists.txt Outdated Show resolved Hide resolved
include/xgboost/gbm.h Outdated Show resolved Hide resolved
include/xgboost/tree_model.h Outdated Show resolved Hide resolved
python-package/xgboost/compat.py Outdated Show resolved Hide resolved
src/gbm/gblinear_model.cc Outdated Show resolved Hide resolved
src/gbm/gbtree_model.cc Outdated Show resolved Hide resolved
src/learner.cc Outdated Show resolved Hide resolved
src/learner.cc Outdated Show resolved Hide resolved
src/learner.cc Outdated Show resolved Hide resolved
src/learner.cc Outdated Show resolved Hide resolved
src/objective/rank_obj.cc Outdated Show resolved Hide resolved
src/tree/tree_model.cc Outdated Show resolved Hide resolved
j_node[5] = n.DefaultLeft();
CHECK(IsA<Boolean>(j_node[5]));

v_j_nodes[i] = j_node;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is here an implicit constructor call to Array? If so could you make it explicit to make it easier to read?

std::vector<Json> v_j_nodes(nodes_.size());
for (size_t i = 0; i < nodes_.size(); ++i) {
auto const& n = nodes_[i];
std::vector<Json> j_node(6);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a perfect case for using C++11 initializer_list

src/tree/tree_model.cc Outdated Show resolved Hide resolved
src/tree/tree_model.cc Outdated Show resolved Hide resolved
tests/cpp/test_learner.cc Outdated Show resolved Hide resolved
@trivialfis
Copy link
Member Author

@trams Thanks for the review. Those are very helpful suggestions. Will get back to this once available.

include/xgboost/gbm.h Outdated Show resolved Hide resolved
include/xgboost/learner.h Outdated Show resolved Hide resolved
python-package/xgboost/compat.py Outdated Show resolved Hide resolved
src/gbm/gblinear.cc Outdated Show resolved Hide resolved
src/gbm/gbtree.cc Outdated Show resolved Hide resolved
@trivialfis
Copy link
Member Author

trivialfis commented Sep 26, 2019

Related open issues: #4878 , #4877, #4855, #4755, #4758, #4639, #4618, #4072, #3980, #3771, #5003

Just leaving a note here.

@trivialfis
Copy link
Member Author

@RAMitchell To summarize, other than bug fixes, this PR does 4 things:

  • Add JSON IO for XGBoost.
  • Replace duplicated parameters, deprecate the old one when deleting is not feasible.
  • Handle prediction cache syncing.
  • Exhausted tests for training continuation, including boosted trees, boosted forest, multi-class, logit.

So why is that so huge? Because the JSON IO creates a framework for easy testing. It's meant to help creating reproducible train, where binary IO is not feasible. Also with JSON, you can access and verify all internal parameters of XGBoost. Hence easy testing for second item.

doc/tutorials/saving_model.rst Outdated Show resolved Hide resolved
include/xgboost/c_api.h Outdated Show resolved Hide resolved
include/xgboost/c_api.h Outdated Show resolved Hide resolved
include/xgboost/learner.h Outdated Show resolved Hide resolved
src/gbm/gblinear_model.h Outdated Show resolved Hide resolved
src/gbm/gblinear_model.h Outdated Show resolved Hide resolved
@trivialfis
Copy link
Member Author

@RAMitchell The R tests actually passes on Jenkins ... I'm gonna ignore the error I found.

> require(data.table)
Loading required package: data.table
data.table 1.12.2 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com
> a <- data.table(c=1:3, d=2:4)
> str(a)
Classesdata.tableand 'data.frame':	3 obs. of  2 variables:
 $ c: int  1 2 3
 $ d: int  2 3 4
 - attr(*, ".internal.selfref")=<externalptr> 

Notice the single quote around data.table on my machine is Unicode, weirdly it doesn't seem to be the case for data.frame. The Unicode breaks testthat::expect_output's regex searching.

* Add tests for JSON IO in both CXX.
* Rigorous tests for training continuation.
* Add basic documentation for the serialization format.
@hcho3
Copy link
Collaborator

hcho3 commented Dec 24, 2019

@trivialfis Can we close this, or is there more work to be done in JSON serialization?

@trivialfis
Copy link
Member Author

Will close it once R package is updated.

@trivialfis
Copy link
Member Author

Sorry for being slow.

@trivialfis trivialfis closed this Apr 10, 2020
@trivialfis trivialfis deleted the json-prototype branch September 18, 2020 05:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants