Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Initial commit for JSON. #4683

Closed
wants to merge 2 commits into from
Closed

Conversation

trivialfis
Copy link
Member

@trivialfis trivialfis commented Jul 19, 2019

Currently depends on #4577 . This is a monolithic PR for early review. I will try to split it up into 2 or 3 dependent PRs later once it's ready. I can create stacked PRs if pushing to origin is allowed. :-)

TODOs:

  • Revise schema.
  • Locales
  • Better error reporting.
  • Check for implicit copying.
  • Post profiling results.
  • Enable GPU device configuration for varying environments.
  • Tree node stat instance count.
  • Enable pickling support.
  • Enable rabit checkpoint.
  • default_left is not saved.

@hcho3 @RAMitchell Sorry for the long wait.

Notes to myself

leaf_child_cnt is not yet omitted.
num_boosted_rounds is not saved.

Merged parts

Initial implementation #4708
Slightly improved version #4739

Copy link
Member

@RAMitchell RAMitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JSON parsing code looks like it should go in dmlc-core. How hard is it for us to upstream it? I am open to having it in xgboost if it's too difficult, but we should at least consider it.

@@ -0,0 +1,1059 @@
################################
XGBoost JSON Schema, Version 1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the plan to maintain this? I can see this going out of date quickly without some kind of CI enforcement. This looks like we are exposing the complete serialisation format such that someone could construct a working xgboost model completely external to xgboost. This could be a very useful feature, but also with some cost to support. Note that up until now we have not provided a schema for our binary format, only tried to enforce backwards compatibility.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RAMitchell We can automate the schema enforcement if we express our model schema using JSON Schema. There are tools available to validate JSON files against a given JSON schema.

@hcho3
Copy link
Collaborator

hcho3 commented Jul 20, 2019

@RAMitchell I agree that NIH (JSON parsing code) should eventually be merged into dmlc-core. For now, let's keep it in XGBoost and make sure it works.

@hcho3
Copy link
Collaborator

hcho3 commented Jul 20, 2019

I can create stacked PRs if pushing to origin is allowed. :-)

Why don’t you create a new branch dev_json and push your code there?

@trivialfis
Copy link
Member Author

trivialfis commented Jul 22, 2019

@hcho3 Jenkins seems to return a 404 at first load. It works fine after a refresh. I have been experiencing the error for a few days so it might not be an incident.

Screenshot from 2019-07-22 04-14-06

Copy link
Contributor

@trams trams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your tremendous work

I am very interested in this change. It would simply writing converters of xgboost to other formats.

General question.
How close is your format to xgboost json dump?

include/xgboost/tree_model.h Show resolved Hide resolved
src/c_api/c_api.cc Show resolved Hide resolved
for (size_t i = 0; i < n_weights; ++i) {
auto w = weight[i];
if (i != n_weights - 1) {
raw += std::to_string(w) + ',';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned that std::to_string is locale dependent (https://en.cppreference.com/w/cpp/string/basic_string/to_string)
and in some locales (russian or ukrainian) they use ',' as a floating point which may break the output.

What do you think about using a locale independent function (if we can use C++17 we should consider std::to_chars)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also as far as I understood documentation std::to_string just calls fprintf (https://en.cppreference.com/w/cpp/io/c/fprintf) which by default outputs floating point numbers with precision 6 which is not always enough to reconstruct the same number

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will look into std::to_chars, thanks for the suggestion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementing std::{from, to}_chars for integer is easy, but floating point number is messy. I think I'm just gonna use std::locale to work around it instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think worst case we can set locale at the top of conversion.

I am more concern that we should output at least numeric_limit::max_digits10 digits so we can reconstruct the same float and I do not believe that the default does this (I think it outputs only 6 instead of 9 digits)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be useful but I've not done C++ for a few year and might be outdated: https://www.boost.org/doc/libs/1_59_0/libs/locale/doc/html/std_locales.html

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hcho3 Good to know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, as a general rule what version of a standard are we using for xgboost?
C++14?

Copy link
Collaborator

@hcho3 hcho3 Jul 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/tree/tree_model.cc Outdated Show resolved Hide resolved
@trivialfis
Copy link
Member Author

How close is your format to xgboost json dump?

Not close. The dump prints only trees, but prints with much more readable structure. While this one dumps everything, but in a much more compact way for serialization.

@hcho3
Copy link
Collaborator

hcho3 commented Jul 23, 2019

@trivialfis It looks like the result page is inaccessible from anonymous (not logged-in) users. I got the same 404 when using Incognito Browsing Mode. Let me take a look.

Update. Doing a refresh removes 404. Investigating now.

Copy link
Contributor

@trams trams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rather like this pull request. I have several minor comments

size_t i = 0;
auto constexpr u0 = static_cast<unsigned char>('0');
auto constexpr u10 = static_cast<unsigned char>(10);
while (i < size && std::isspace(str[i]) && str[i] != '\0') { ++i; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that std::isspace('\0') is already false (see the bottom of this page https://en.cppreference.com/w/cpp/string/byte/isdigit) so the extra check std[i] != '\0' is excessive

auto constexpr u0 = static_cast<unsigned char>('0');
auto constexpr u10 = static_cast<unsigned char>(10);
while (i < size && std::isspace(str[i]) && str[i] != '\0') { ++i; }
for (; i < size && std::isdigit(str[i]); ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to this https://en.cppreference.com/w/cpp/string/byte/isdigit the standard requires the static cast to unsigned char. I do not understand exactly why...

auto const size = str.size();
size_t i = 0;
auto constexpr u0 = static_cast<unsigned char>('0');
auto constexpr u10 = static_cast<unsigned char>(10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor] I think it is a bit confusing to have two similar looking variables u0 and u10 which represent very different things. I would just use literal 10 in the formula

src/common/json.cc Outdated Show resolved Hide resolved
src/common/json.cc Outdated Show resolved Hide resolved
while (true) {
char ch = GetNextChar();
while (ch != ']' && ch != -1) { ch = GetNextChar(); }
ch = GetNextNonSpaceChar();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand after line 622 we would have
ch == ']' or ch == -1
the first one is non white space. The second case indicates that the stream has been at the end.
Either way we do not need to skip non space chars. Or do I miss something here?

AppendFloat(s.loss_chg, false);
AppendFloat(s.sum_hess, false);
AppendFloat(s.base_weight, false);
AppendInt(s.leaf_child_cnt, true);;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: two semicolons


std::string tree_raw;
tree_raw.reserve(
stats_.size() * std::numeric_limits<Number::Float>::max_digits10 * 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to figure out why you reserve this amount of memory.
It seems each stat has 3 floats and one integer so I would expect that the total length needed is

stats_.size() * ( 2 + float_length*3 + int_length + 3) + 2
2 is []
3 are commas
float_length = std::numeric_limitsNumber::Float::max_digits10 + 1
int_length ...

out["stats"] = JsonRaw(std::move(tree_raw));

tree_raw.clear();
tree_raw.reserve(nodes_.size() * 9 * 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess 9 comes from amount of digits one need to save while serializing float. Why do we have 2 here? We have 5 ints and 1 float below

I see that the logic for preallocating buffers for arrays is quite complex and I think this is not the only place we have it. What do you think about extracting this into separate functions?

It is not very critical. It is rather the optimization.

tests/cpp/common/test_json.cc Outdated Show resolved Hide resolved
@trivialfis
Copy link
Member Author

@trams Thanks for the review. I'm no expert with string manipulations so your suggestions are really helpful.

@trivialfis
Copy link
Member Author

Extracting implementation to #4732 for easier review.

@trivialfis
Copy link
Member Author

trivialfis commented Aug 12, 2019

Sorry folks, this is no longer blocking 1.0 as I think it's necessary to improve dmlc::Parameter first before finishing JSON integration. See #4755 . I want to do it right so it might take some more time. Please understand.

@trivialfis trivialfis closed this Dec 21, 2019
@trivialfis trivialfis deleted the json-rebase branch December 23, 2019 12:10
@lock lock bot locked as resolved and limited conversation to collaborators Mar 22, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants