
[RFC] Callback interface for logging internal information, to aid debugging #4837

Closed
hcho3 opened this issue Sep 5, 2019 · 5 comments · Fixed by #5088
hcho3 commented Sep 5, 2019

Recently, I've seen many users report regressions in model performance (e.g. accuracy, AUC) after upgrading XGBoost to the latest version. For example, see https://discuss.xgboost.ai/t/learning-with-xgboost-0-90-vs-1-0-0/1068/5.

I'd like to work with @chenqin to set up regression tests. As part of that, we need more than just traces of evaluation-set metrics (accuracy, AUC, etc.), since metrics capture only aggregate information. Much useful information, such as gradient values, split scores, and quantile sketches, remains hidden deep within the C++ codebase.

I propose creating a callback interface to expose useful information for verbose logging. Once we can log these pieces of information, we can use them to detect potential regressions and troubleshoot anomalies. Logging traces of non-aggregate quantities will let us take a closer look at the internal workings of XGBoost. We could even build a "debugger" to help users find out why their XGBoost models are performing poorly.

For each of the quantities listed below, XGBoost will call the callback function with the quantity as an argument. Here are the possible callback interfaces (there's more than one since the quantities have different types):

  • void callback(const char* keyword, const float* array, const uint64_t* shape, int dim)
  • void callback(const char* keyword, const double* array, const uint64_t* shape, int dim)
  • void callback(const char* keyword, const GradientPair* array, const uint64_t* shape, int dim)
    • GradientPair is defined to be the C structure consisting of two float fields.
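On the Python side, these C signatures could be mirrored with ctypes. The sketch below is an assumption about how a binding might declare the prototypes, not existing XGBoost code; the `GradientPair` mirror and the prototype names are hypothetical:

```python
import ctypes

# Hypothetical ctypes mirror of the proposed C structure: two float fields.
class GradientPair(ctypes.Structure):
    _fields_ = [("grad", ctypes.c_float), ("hess", ctypes.c_float)]

# Prototypes matching the three proposed callback signatures.
FloatCallback = ctypes.CFUNCTYPE(
    None,                             # void return
    ctypes.c_char_p,                  # const char* keyword
    ctypes.POINTER(ctypes.c_float),   # const float* array
    ctypes.POINTER(ctypes.c_uint64),  # const uint64_t* shape
    ctypes.c_int)                     # int dim
DoubleCallback = ctypes.CFUNCTYPE(
    None, ctypes.c_char_p, ctypes.POINTER(ctypes.c_double),
    ctypes.POINTER(ctypes.c_uint64), ctypes.c_int)
GradientPairCallback = ctypes.CFUNCTYPE(
    None, ctypes.c_char_p, ctypes.POINTER(GradientPair),
    ctypes.POINTER(ctypes.c_uint64), ctypes.c_int)
```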

Here is what I envision to be the user experience:

import numpy as np
import xgboost

# Note: some details are elided for brevity

arrays = {}

def handle_float_array(keyword, array, shape, dim):
    arrays[keyword] = np.array(array).reshape(shape[:dim])

bst = xgboost.train(params={ ... },
                    callbacks=[..., debug_logging_callback(handle_float_array)])

# Now arrays should contain a bunch of arrays extracted from XGBoost internals

Each quantity will have a unique keyword associated with it, e.g. "gradient", "hessian", "histogram", "quantile_sketch", etc.

Requirements

  • Don't pay for what you don't use: This feature shall impose no runtime overhead unless it is turned on.
  • Should work in distributed environments, e.g. Spark, Dask, Kubernetes

List of quantities we should log

  • yhat_i: Predicted labels (model outputs) for data points (per data point, per boosting iteration)
  • x_i: Input features (per data point)
  • l(y_i, yhat_i): Loss value ("residual") computed from the true and predicted labels (per data point, per boosting iteration)
    • We can use this to highlight rows that each decision tree handles well and poorly.
  • Gradient and hessian values computed from the loss value (per data point, per boosting iteration)
  • Histograms (partial sums) over gradient and hessian values (one histogram per split node)
    • We can use these to figure out why and how each split is chosen (which feature? which threshold?)
  • Quantile sketches (per feature)
    • We can visualize the estimated feature quantiles and determine whether the estimates are reasonable given the actual distribution of the data. Quantile estimates are used to speed up training.
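As a concrete illustration of the per-point gradient and hessian quantities above, here is what they look like for squared-error loss, computed in plain NumPy rather than inside XGBoost; the label and prediction values are made up:

```python
import numpy as np

# Illustration only (not XGBoost source): for squared-error loss
# l = 0.5 * (y - yhat)^2, the per-point gradient is (yhat - y) and
# the per-point hessian is the constant 1.
y    = np.array([1.0, 0.0, 1.0])   # true labels
yhat = np.array([0.8, 0.3, 0.4])   # predictions at some iteration

grad = yhat - y          # -> [-0.2, 0.3, -0.6]
hess = np.ones_like(y)   # -> [1., 1., 1.]
```

These are the per-point, per-iteration arrays the callback would emit under keywords such as "gradient" and "hessian".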
@hcho3 hcho3 changed the title [RFC] Callback interface for debugging [RFC] Callback interface for logging internal information, to aid debugging Sep 5, 2019
hcho3 commented Sep 5, 2019

Related: TensorFlow lets users "poke" into internal quantities and visualize them in TensorBoard. https://www.tensorflow.org/guide/summaries_and_tensorboard

chenqin commented Sep 5, 2019

We might just start with something simple, like storing a structured CSV file alongside the model. That gives a better apples-to-apples comparison. We can leave visualization out for now and see how people actually want to interact with this information.

On the debugging side, it could be interesting to see how folks set breakpoints on certain conditions (number of iterations, depth of tree, etc.). We might need to put more thought into where the best place is for Python users to access this internal info, and whether we allow users to manually overwrite those values.

hcho3 commented Sep 5, 2019

@chenqin Yes, visualization can be left as future work. The main point of the proposal is to create a callback interface where a user-defined function is called to log internal information. (We can build CSV serialization logic on top of the callback interface.)
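A minimal sketch of such CSV serialization built on top of the proposed callback, assuming the same (keyword, array, shape, dim) signature as in the RFC; the handler name, in-memory buffer, and sample values are illustrative:

```python
import csv
import io
import numpy as np

buf = io.StringIO()       # stand-in for a CSV file stored next to the model
writer = csv.writer(buf)

def handle_float_array(keyword, array, shape, dim):
    # Flatten each logged quantity into one CSV row:
    # keyword, shape..., values...
    flat = np.asarray(array).reshape(-1)
    writer.writerow([keyword, *shape[:dim], *flat])

# Simulate the callback firing for a 2x2 array of gradient values.
handle_float_array("gradient", [[0.1, -0.2], [0.3, 0.4]], (2, 2), 2)
```

Rows like this can be diffed across XGBoost versions to spot where two runs first diverge.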

For now, this proposal does not consider an interactive debugger. The internal info will be strictly read-only. So the user will run a full training job and then analyze the log file after the fact.

trivialfis commented Sep 6, 2019

@hcho3 I can add a "vomit" option to verbosity if you want. Then it can log stuff to a file or to stdout, and we can redirect that to a file. I'm not sure how a Python callback can extract arbitrary information from C++.

hcho3 commented Sep 6, 2019

I'm not sure how a Python callback can extract arbitrary information from C++.

The Python callback will be passed to a C API function, XGBRegisterDebugLoggingCallback(), which registers the callback function at the C++ level.
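A rough sketch of what that registration might look like through ctypes, under the assumption that the proposed XGBRegisterDebugLoggingCallback() takes a single function pointer; the register() helper and the library handle are hypothetical:

```python
import ctypes

# Prototype for the float-array flavor of the proposed callback.
CALLBACK_TYPE = ctypes.CFUNCTYPE(
    None, ctypes.c_char_p, ctypes.POINTER(ctypes.c_float),
    ctypes.POINTER(ctypes.c_uint64), ctypes.c_int)

def register(lib, py_handler):
    # Wrap the Python handler and hand the function pointer to C++.
    # Keep a reference to the wrapper so it is not garbage-collected
    # while the C++ side still holds the pointer.
    wrapper = CALLBACK_TYPE(py_handler)
    lib.XGBRegisterDebugLoggingCallback(wrapper)
    return wrapper
```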

I can create a vomit option to verbosity if you want. Then it can log stuffs into file or to stdout and we redirect them to file

This might be easier to manage than callbacks. Let me think this through.

lock bot locked as resolved and limited conversation to collaborators Mar 10, 2020