
[RFC] Callback interface for logging internal information, to aid debugging #4837

Closed
hcho3 opened this issue Sep 5, 2019 · 5 comments · Fixed by #5088
hcho3 commented Sep 5, 2019

Recently, I've seen many users report regressions in model performance (e.g. accuracy, AUC) after upgrading XGBoost to the latest version. For example, see https://discuss.xgboost.ai/t/learning-with-xgboost-0-90-vs-1-0-0/1068/5.

I'd like to work with @chenqin to set up regression tests. As part of that, we need more than just traces of evaluation-set metrics (accuracy, AUC, etc.), since metrics capture only aggregate information. Much useful information, such as gradient values, split scores, and quantile sketches, remains hidden deep within the C++ codebase.

I propose creating a callback interface to expose useful information for verbose logging. Once we can log these pieces of information, we can use them to detect potential regressions and troubleshoot anomalies. Logging traces of non-aggregate quantities will let us take a closer look at the internal workings of XGBoost. We could even build a "debugger" to help users find out why their XGBoost models are performing poorly.

For each of the quantities listed below, XGBoost will call the callback function with the quantity as an argument. Here are the possible callback interfaces (there's more than one since the quantities have different types):

  • void callback(const char* keyword, const float* array, const uint64_t* shape, int dim)
  • void callback(const char* keyword, const double* array, const uint64_t* shape, int dim)
  • void callback(const char* keyword, const GradientPair* array, const uint64_t* shape, int dim)
    • GradientPair is defined to be the C structure consisting of two float fields.
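On the Python side, these C signatures could be mirrored with ctypes. The sketch below is an assumption about how a binding might declare the prototypes, not existing XGBoost code; the `GradientPair` mirror and the prototype names are hypothetical:

```python
import ctypes

# Hypothetical ctypes mirror of the proposed C structure: two float fields.
class GradientPair(ctypes.Structure):
    _fields_ = [("grad", ctypes.c_float), ("hess", ctypes.c_float)]

# Prototypes matching the three proposed callback signatures.
FloatCallback = ctypes.CFUNCTYPE(
    None,                             # void return
    ctypes.c_char_p,                  # const char* keyword
    ctypes.POINTER(ctypes.c_float),   # const float* array
    ctypes.POINTER(ctypes.c_uint64),  # const uint64_t* shape
    ctypes.c_int)                     # int dim
DoubleCallback = ctypes.CFUNCTYPE(
    None, ctypes.c_char_p, ctypes.POINTER(ctypes.c_double),
    ctypes.POINTER(ctypes.c_uint64), ctypes.c_int)
GradientPairCallback = ctypes.CFUNCTYPE(
    None, ctypes.c_char_p, ctypes.POINTER(GradientPair),
    ctypes.POINTER(ctypes.c_uint64), ctypes.c_int)
```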

Here is what I envision to be the user experience:

import numpy as np
import xgboost

# Note: some details are elided for brevity

arrays = {}

def handle_float_array(keyword, array, shape, dim):
    arrays[keyword] = np.array(array).reshape(shape[:dim])

bst = xgboost.train(params={ ... },
                    callbacks=[..., debug_logging_callback(handle_float_array)])

# Now arrays should contain a bunch of arrays extracted from XGBoost internals

Each quantity will have a unique keyword associated with it, e.g. "gradient", "hessian", "histogram", "quantile_sketch", etc.

Requirements

  • Don't pay for what you don't use: This feature shall impose no runtime overhead unless it is turned on.
  • Should work in distributed environments, e.g. Spark, Dask, Kubernetes

List of quantities we should log

  • yhat_i: Predicted labels (model outputs) for data points (per data point, per boosting iteration)
  • x_i: Input features (per data point)
  • l(y_i, yhat_i): Loss value ("residual") computed from the true and predicted labels (per data point, per boosting iteration)
    • We can use this to highlight rows that each decision tree handles well and poorly.
  • Gradient and hessian values computed from the loss value (per data point, per boosting iteration)
  • Histograms (partial sums) over gradient and hessian values (one histogram per split node)
    • We can use these to figure out why and how each split is chosen (which feature? which threshold?)
  • Quantile sketches (per feature)
    • We can visualize the estimated feature quantiles and determine whether the estimates are reasonable given the actual distribution of the data. Quantile estimates are used to speed up training.
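As a concrete illustration of the per-point gradient and hessian quantities above, here is what they look like for squared-error loss, computed in plain NumPy rather than inside XGBoost; the label and prediction values are made up:

```python
import numpy as np

# Illustration only (not XGBoost source): for squared-error loss
# l = 0.5 * (y - yhat)^2, the per-point gradient is (yhat - y) and
# the per-point hessian is the constant 1.
y    = np.array([1.0, 0.0, 1.0])   # true labels
yhat = np.array([0.8, 0.3, 0.4])   # predictions at some iteration

grad = yhat - y          # -> [-0.2, 0.3, -0.6]
hess = np.ones_like(y)   # -> [1., 1., 1.]
```

These are the per-point, per-iteration arrays the callback would emit under keywords such as "gradient" and "hessian".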
@hcho3 hcho3 changed the title [RFC] Callback interface for debugging [RFC] Callback interface for logging internal information, to aid debugging Sep 5, 2019
hcho3 commented Sep 5, 2019

Related: TensorFlow lets users "poke" into internal quantities and visualize them in TensorBoard. https://www.tensorflow.org/guide/summaries_and_tensorboard

chenqin commented Sep 5, 2019

We might just start with something simple, like storing a structured CSV file alongside the model. That gives a better apples-to-apples comparison. We can leave visualization out for now and see how people actually want to interact with this information.

On the debugging side, it could be interesting to see how folks set breakpoints on certain conditions (number of iterations, depth of tree, etc.). We might need to put more thought into where the best place is for Python users to access this internal info, and whether we allow users to manually overwrite those values.

hcho3 commented Sep 5, 2019

@chenqin Yes, visualization can be left as future work. The main point of the proposal is to create a callback interface where a user-defined function is called to log internal information. (We can build CSV serialization logic on top of the callback interface.)
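A minimal sketch of such CSV serialization built on top of the proposed callback, assuming the same (keyword, array, shape, dim) signature as in the RFC; the handler name, in-memory buffer, and sample values are illustrative:

```python
import csv
import io
import numpy as np

buf = io.StringIO()       # stand-in for a CSV file stored next to the model
writer = csv.writer(buf)

def handle_float_array(keyword, array, shape, dim):
    # Flatten each logged quantity into one CSV row:
    # keyword, shape..., values...
    flat = np.asarray(array).reshape(-1)
    writer.writerow([keyword, *shape[:dim], *flat])

# Simulate the callback firing for a 2x2 array of gradient values.
handle_float_array("gradient", [[0.1, -0.2], [0.3, 0.4]], (2, 2), 2)
```

Rows like this can be diffed across XGBoost versions to spot where two runs first diverge.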

For now, this proposal does not consider an interactive debugger. The internal info will be strictly read-only. So the user will run a full training job and then analyze the log file after the fact.

trivialfis commented Sep 6, 2019

@hcho3 I can add a "vomit" option to verbosity if you want. Then it can log stuff to a file or to stdout, and we can redirect that to a file. I'm not sure how a Python callback can extract arbitrary information from C++.

hcho3 commented Sep 6, 2019

I'm not sure how a Python callback can extract arbitrary information from C++.

The Python callback will be passed to a C API function, XGBRegisterDebugLoggingCallback(), which registers the callback function at the C++ level.
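A rough sketch of what that registration might look like through ctypes, under the assumption that the proposed XGBRegisterDebugLoggingCallback() takes a single function pointer; the register() helper and the library handle are hypothetical:

```python
import ctypes

# Prototype for the float-array flavor of the proposed callback.
CALLBACK_TYPE = ctypes.CFUNCTYPE(
    None, ctypes.c_char_p, ctypes.POINTER(ctypes.c_float),
    ctypes.POINTER(ctypes.c_uint64), ctypes.c_int)

def register(lib, py_handler):
    # Wrap the Python handler and hand the function pointer to C++.
    # Keep a reference to the wrapper so it is not garbage-collected
    # while the C++ side still holds the pointer.
    wrapper = CALLBACK_TYPE(py_handler)
    lib.XGBRegisterDebugLoggingCallback(wrapper)
    return wrapper
```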

I can create a vomit option to verbosity if you want. Then it can log stuffs into file or to stdout and we redirect them to file

This might be easier to manage than callbacks. Let me think this through.

lock bot locked as resolved and limited conversation to collaborators Mar 10, 2020