[c++] Add Bagging by Query for Lambdarank #6623

Merged
merged 17 commits into from
Oct 2, 2024

shiyu1994
Collaborator

Add an option to bag by query instead of by individual item in lambdarank, as suggested by @metpavel. Sampling whole queries is a more reasonable way to bag in ranking tasks, because it keeps each query's items together. Performance comparison on the MS LTR dataset:
With bagging_freq=1, bagging_fraction=0.1 and bagging_by_query=true:

[LightGBM] [Info] Iteration:100, training ndcg@1 : 0.528525
[LightGBM] [Info] Iteration:100, training ndcg@3 : 0.502271
[LightGBM] [Info] Iteration:100, training ndcg@5 : 0.5034
[LightGBM] [Info] 23.123690 seconds elapsed, finished iteration 100
[LightGBM] [Info] Finished training

With the same settings and bagging_by_query=false:

[LightGBM] [Info] Iteration:100, training ndcg@1 : 0.524889
[LightGBM] [Info] Iteration:100, training ndcg@3 : 0.502272
[LightGBM] [Info] Iteration:100, training ndcg@5 : 0.502838
[LightGBM] [Info] 43.811966 seconds elapsed, finished iteration 100
[LightGBM] [Info] Finished training

Without bagging:

[LightGBM] [Info] Iteration:100, training ndcg@1 : 0.535041
[LightGBM] [Info] Iteration:100, training ndcg@3 : 0.509657
[LightGBM] [Info] Iteration:100, training ndcg@5 : 0.510785
[LightGBM] [Info] 50.232102 seconds elapsed, finished iteration 100
[LightGBM] [Info] Finished training
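
To make the difference between the two bagging modes concrete, here is a minimal pure-Python sketch. It is an illustration only, not LightGBM's implementation: the function names `query_boundaries`, `bag_by_row` and `bag_by_query` are hypothetical, and the sampling scheme (a fixed-size draw without replacement) is an assumption made for simplicity.

```python
import random
from typing import List, Sequence


def query_boundaries(group_sizes: Sequence[int]) -> List[int]:
    """Turn per-query sizes into cumulative row offsets, e.g. [2, 3] -> [0, 2, 5]."""
    bounds = [0]
    for size in group_sizes:
        bounds.append(bounds[-1] + size)
    return bounds


def bag_by_row(num_rows: int, fraction: float, rng: random.Random) -> List[int]:
    """Row-level bagging: rows are drawn without regard to their query."""
    k = max(1, int(num_rows * fraction))
    return sorted(rng.sample(range(num_rows), k))


def bag_by_query(group_sizes: Sequence[int], fraction: float,
                 rng: random.Random) -> List[int]:
    """Query-level bagging: draw whole queries, then keep all of their rows."""
    bounds = query_boundaries(group_sizes)
    num_queries = len(group_sizes)
    k = max(1, int(num_queries * fraction))
    chosen = sorted(rng.sample(range(num_queries), k))
    rows: List[int] = []
    for q in chosen:
        # Every row of a chosen query is kept, so the query stays intact.
        rows.extend(range(bounds[q], bounds[q + 1]))
    return rows
```

With row-level sampling at a small fraction, most queries are left with only a few of their items, whereas query-level sampling leaves each selected query complete and drops the others entirely.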

Comment on lines +52 to +56
void GetGradients(const double* scores, const data_size_t /*num_sampled_queries*/, const data_size_t* /*sampled_query_indices*/, score_t* gradients, score_t* hessians) const override {
  LaunchGetGradientsKernel(scores, gradients, hessians);
  SynchronizeCUDADevice(__FILE__, __LINE__);
}

Collaborator

@neNasko1 is this something that might be missing for CUDA support in #6586?
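
The override quoted above leaves its `num_sampled_queries` and `sampled_query_indices` arguments unused. As a hedged illustration of what such arguments could be used for, the Python sketch below computes gradients only for the sampled queries and leaves the other rows at zero. This is an assumption about the intent, not a transcription of LightGBM code: `gradients_for_sampled_queries` is a hypothetical name, and `centered` is a toy stand-in for a real per-query ranking gradient.

```python
from typing import Callable, List, Optional, Sequence


def centered(scores: Sequence[float]) -> List[float]:
    """Toy per-query 'gradient': score minus the query mean (not lambdarank)."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


def gradients_for_sampled_queries(
    scores: Sequence[float],
    bounds: Sequence[int],
    sampled_queries: Optional[Sequence[int]],
    per_query_grad: Callable[[Sequence[float]], List[float]] = centered,
) -> List[float]:
    """Fill gradients for sampled queries only; unsampled rows stay at 0.0.

    `bounds` holds cumulative row offsets per query. Passing None for
    `sampled_queries` means "use every query".
    """
    grads = [0.0] * len(scores)
    if sampled_queries is None:
        sampled_queries = range(len(bounds) - 1)
    for q in sampled_queries:
        start, end = bounds[q], bounds[q + 1]
        grads[start:end] = per_query_grad(scores[start:end])
    return grads
```

In this sketch the work is proportional to the number of rows in the sampled queries. An implementation that ignores the sampled indices, as the quoted override does, processes every query regardless of the bag.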

@shiyu1994
Collaborator Author

@guolinke Could you please help to review this when you have time? Thanks.

Collaborator

@StrikerRUS StrikerRUS left a comment

Just some very minor suggestions from me below:

include/LightGBM/objective_function.h (outdated, resolved)
tests/python_package_test/test_engine.py (outdated, resolved)
tests/python_package_test/test_engine.py (outdated, resolved)
Comment on lines +4534 to +4535
assert ndcg_score_bagging_by_query >= ndcg_score - 0.1
assert ndcg_score_no_bagging_by_query >= ndcg_score - 0.1
Collaborator

The PR description states that bagging_by_query=True should improve metrics, but I don't see any comparison between bagging_by_query=True and bagging_by_query=False here...

Collaborator Author

I didn't add that comparison because I found the result can be random when the dataset is small. For example, on CPU, bagging_by_query=True gets a higher NDCG on the toy test dataset (even higher than without bagging), while on GPU bagging_by_query=True can give worse results than bagging_by_query=False. When the dataset is large, for example the MS LTR dataset, the results are less random, and bagging_by_query=True should improve performance, as shown in the description of this PR.

Collaborator Author

In addition, we also see a significant improvement in training speed with bagging_by_query=True.

Collaborator

OK, I see. So this test is for something like "bagging_by_query=True doesn't break training".
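
The kind of check discussed in this thread can be sketched in plain Python as follows. This is a hedged illustration rather than the test in `test_engine.py`: `ndcg_at_k` and `not_much_worse` are hypothetical helpers, the gain `2^label - 1` with a `log2` position discount is one common NDCG definition, and returning 0.0 for a query with no relevant items is a convention chosen for this sketch.

```python
import math
from typing import Sequence


def dcg_at_k(labels_in_rank_order: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the top-k positions."""
    return sum(
        (2 ** label - 1) / math.log2(position + 2)
        for position, label in enumerate(labels_in_rank_order[:k])
    )


def ndcg_at_k(labels: Sequence[int], scores: Sequence[float], k: int) -> float:
    """NDCG@k for a single query: DCG of the predicted order over the ideal DCG."""
    predicted = [label for _, label in
                 sorted(zip(scores, labels), key=lambda pair: -pair[0])]
    ideal = sorted(labels, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0  # sketch convention for queries without relevant items
    return dcg_at_k(predicted, k) / ideal_dcg


def not_much_worse(score: float, baseline: float, tolerance: float = 0.1) -> bool:
    """Smoke-test style check: the score may trail the baseline by at most `tolerance`."""
    return score >= baseline - tolerance
```

A tolerance-based check like `not_much_worse` only guards against training being broken. It does not show that one bagging mode is better than the other, which matches the conclusion reached in this thread.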

shiyu1994 and others added 4 commits September 6, 2024 10:42
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Collaborator

@StrikerRUS StrikerRUS left a comment

LGTM, thanks!

@shiyu1994 shiyu1994 merged commit d1d218c into master Oct 2, 2024
45 checks passed
@shiyu1994 shiyu1994 deleted the bagging/bagging-by-query-for-lambdarank branch October 2, 2024 16:19
5 participants