Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Redesign dmlc::Parameter. #4755

Closed
trivialfis opened this issue Aug 9, 2019 · 4 comments
Closed

[RFC] Redesign dmlc::Parameter. #4755

trivialfis opened this issue Aug 9, 2019 · 4 comments

Comments

@trivialfis
Copy link
Member

trivialfis commented Aug 9, 2019

There are a few problems in dmlc::Parameter I want to address:

Issues

Parameter update

Currently dmlc::Parameter uses a static manager (__MANAGER()__ method) class to carry out all the initialization steps. The problem is this manager is created per-class instead of per-instance, so there is no way that it can understand whether a particular parameter instance has been initialized. As a side effect, we cannot do update operation on it, which results in every call of Configure a complete re-initialization. See #4738 . Once JSON is here the problem will spread through all other parameters.

  • Adding a initialized_ field to dmlc::Parameter won't work as it will break all binary IO logic (size is changed).
  • Adding a std::map<Parameter*, bool> initialized_ field to the manager won't work, as distinguishing instance by pointer breaks easily with stack allocated objects:
{
  Param p;
}
{
  Param p0;  // Same address as p
}
  • Adding non-trivial constructor/destructor to dmlc::Parameter calling __MANAGER()__ won't work. I tried and get a dead lock from static variables, stuck at :
__cxa_guard_acquire 
  • Adding std::map<std::string, FieldAccessEntry*>> to manager won't work, again it operates on class instead of instance.

Initialization

Another issue is the parameter can't initialize itself. It troubles me and @RAMitchell for quite a while now as those primitive types have undefined values at initialization. The stack example above, param0 might has same value as param (actually compiler just reuse param for param0 even in debugging compilation). Also, one can imagine calling any method from __MANAGER__ is quite dangerous.

Description is not very useful.

The one most duplicated code in XGB is those descriptions of parameters, as there's no way we can query the document from C++ level, resulting in those descriptions being copied across XGBoost.

Namespace

As XGBoost has so many parameters that can form a small embedded language, putting all of them into one std::vector makes precise validation really difficult.

What's next

I don't have any concrete idea yet. But discussions are welcomed. @RAMitchell @CodingCat @hcho3 Please join.

@hcho3
Copy link
Collaborator

hcho3 commented Aug 9, 2019

Yes! This is a long-due discussion. See #3803, for example, where we have no way to detect runtime update of a field in dmlc::Parameter and thus leak memory.

@RAMitchell
Copy link
Member

Thanks @trivialfis. As this functionality lives in dmlc-core it can be very difficult for us to change as we will impose changes on mxnet etc.

I see two paths forward:
a) We build our own xgboost parameter classes
b) We RFC these issues up to dmlc-core and try to engage developers of dependent libraries

Upstreaming could be relatively easy if the existing functionality is more or less preserved, otherwise difficult if we are changing some features they rely on.

I think we can create a linked issue on dmlc-core, and when we have a proof of concept PR for how to solve these issues, decide on a) or b).

Aso, we have two other pieces of code that are candidates to be upstreamed in the future:

  • common::Span
  • improved JSON parser.

@tqchen
Copy link
Member

tqchen commented Aug 10, 2019

I think some of the features could be built on top of the current Paramater. It can be a bit hard to say that lazy initialization(ignore what is being configured) is really what we want. As the simplest way to reason about the object is to treat it as immutable and update all the field if necessary.

One thing we could do is to add helper factory functions to initialize the parameters. Additionally, we might be able to move away from POD, and just add default value initializers now that we have cxx11, and create custom serializer for these parameters

In terms of documentation. I started to believe more that duplication for better documentation could be better than an adhoc solution that generates semi complete docs. It also allows us to create language specific docs. So as long as we have reasonable duplication it should be fine.

@trivialfis
Copy link
Member Author

@tqchen Thanks for the suggestions. Will think about these later.

As the simplest way to reason about the object is to treat it as immutable and update all the field if necessary.

I still want finer control over each hyper parameter.

One thing we could do is to add helper factory functions to initialize the parameters. Additionally, we might be able to move away from POD.

That's a good idea, I will try to stay close to dmlc::Parameter as much as possible so we can upstream the changes in the future.

It also allows us to create language specific docs.

This is not a priority for now as we don't have frequent change in parameter, nor do I understand the document system for all bindings.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants