training with external memory part 1 of 2 #4486
Conversation
- this pr focuses on computing the quantiles using multiple gpus on a dataset that uses the external cache capabilities
- there will be a follow-up pr soon after this one that will support creation of histogram indices on large datasets as well
- both of these changes are required to support training with external memory
- the sparse pages in dmatrix are taken in batches, and the cut matrices are incrementally built
- also snuck in some perf changes related to sketch aggregation across multiple features and multiple sparse page batches: instead of aggregating the summary inside each device and merging it later, it is aggregated in place when the device is working on different rows of the same feature
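
For readers skimming the change, here is a rough sketch of the batched flow described above. It is illustrative only - the loop shape and the names `SketchBatch`, `MakeCuts`, and `GetRowBatches` are assumptions rather than the exact code in this pr:

```cpp
// illustrative sketch: incremental quantile computation over externally-cached pages
size_t row_stride = 0;
SketchContainer sketch_container(info.num_col_);   // one WXQSketch per feature
for (const auto &batch : dmat->GetRowBatches()) {  // SparsePages streamed from disk
  // each batch updates the per-feature sketches in place; the running
  // maximum of non-zeros per row across batches becomes the row stride
  row_stride = std::max(row_stride,
                        SketchBatch(dist, batch, info, &sketch_container));
}
sketch_container.MakeCuts(&hmat);  // finalize the cut matrix from the merged sketches
```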
@canonizer @RAMitchell @rongou - please review. this is very similar to #4448, and i have split it up per the earlier review request.

@hcho3 any reason why i do not see all the checks with this pr (typically there are ~10 of them)?

@sriramch Looks like the GitHub webhook didn't get activated. Let me try to re-trigger.

@sriramch See https://www.githubstatus.com/. Looks like there is an issue with notification delivery. We'll have to wait until it's fixed.

thanks @hcho3 for looking and helping!

@sriramch For now, I've manually triggered all Jenkins tests (Linux/Win64).

@sriramch I'm having some issues with triggering Linux tests. Can we wait until the webhook issue is fixed?
Codecov Report

```
@@          Coverage Diff           @@
##           master    #4486  +/-  ##
======================================
  Coverage   81.42%   81.42%
======================================
  Files           9        9
  Lines        1626     1626
======================================
  Hits         1324     1324
  Misses        302      302
```

Continue to review the full report at Codecov.
src/common/hist_util.cu (Outdated)

```cpp
 */
struct SketchContainer {
  std::vector<HistCutMatrix::WXQSketch> sketches_;  // NOLINT
  std::vector<std::unique_ptr<std::mutex>> col_locks_;  // NOLINT
```

Feel free to use std::vector<std::mutex>, unless there is a reason to use std::unique_ptr.

elements within a vector have to be copy/assign-able or move-able. a mutex is neither of those.

I think copying/assignment/moving only applies when resizing the vector. Creating a mutex vector of a fixed size should be possible. I'm not a C++ expert here, however, so I may be wrong.
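
As a standalone illustration of the point under discussion (not code from this pr): a std::vector<std::mutex> can be constructed at a fixed size, because in-place construction only requires default-constructibility; it is the growth operations that need a movable element type.

```cpp
#include <mutex>
#include <vector>

int main() {
  std::vector<std::mutex> locks(16);  // OK: elements are default-constructed in place
  // locks.resize(32);                // does not compile: resize requires MoveInsertable
  // locks.push_back(std::mutex{});   // does not compile: std::mutex is not movable
  std::lock_guard<std::mutex> guard(locks[0]);  // normal usage
  return 0;
}
```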
src/common/hist_util.cu (Outdated)

```cpp
      n_rows_(row_end - row_begin), param_(std::move(param)), sketch_container_(sketch_container) {
  }

  inline size_t GetRowStride() {
```

Any reason to have a method here? If the purpose is to make row_stride_ accessible, consider making DeviceShard a struct.

what is the issue with having this accessor? i can make it a const member function, since the state isn't mutated. deviceshard already delineates what is public and private, and i do not want to loosen the accessibility restrictions without fully understanding why it was made a class in the first place.
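
A minimal sketch of the const-accessor variant mentioned above (illustrative, not the exact pr code):

```cpp
class DeviceShard {
 public:
  // const accessor: exposes row_stride_ read-only without turning
  // DeviceShard into a struct with fully public members
  size_t GetRowStride() const { return row_stride_; }

 private:
  size_t row_stride_{0};  // max non-zeros per row seen so far
};
```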
```cpp
    WXQSketch::LimitSizeLevel(row_batch.Size(), eps, &dummy_nlevel, &n_cuts_);
    // double ncuts to be the same as the number of values
    // in the temporary buffers of the sketches
    n_cuts_ *= 2;
```

Is it possible to keep doubling n_cuts_?
src/common/hist_util.cu (Outdated)

```diff
 }
 };

-void Sketch(const SparsePage& batch, const MetaInfo& info,
-            HistCutMatrix* hmat, int gpu_batch_nrows) {
+size_t SketchBatch(const GPUDistribution &dist, const SparsePage &batch,
```

The row stride can definitely be passed through a shard or class variable. No need to return a value here.

will do
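
One way to carry the row stride on the shard instead of through return values, as the comment suggests - a hypothetical sketch, with all names illustrative:

```cpp
#include <algorithm>
#include <cstddef>

// hypothetical shard that accumulates the row stride across sparse page batches
struct DeviceShard {
  size_t row_stride{0};  // running max of non-zeros per row

  // called once per batch; batch_stride would come from scanning
  // the batch's row offsets on the device
  void AccumulateRowStride(size_t batch_stride) {
    row_stride = std::max(row_stride, batch_stride);
  }
};
```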
src/common/hist_util.cu (Outdated)

```diff
-void Sketch(const SparsePage& batch, const MetaInfo& info,
-            HistCutMatrix* hmat, int gpu_batch_nrows) {
+size_t SketchBatch(const GPUDistribution &dist, const SparsePage &batch,
+                   const MetaInfo &info, SketchContainer *sketch_container) {
```

sketch_container should be a member of a class or shard.

will do
src/tree/updater_gpu_hist.cu (Outdated)

```diff
-common::DeviceSketch(batch, *info_, param_, &hmat_, hist_maker_param_.gpu_batch_nrows);
+// TODO(sriramch): The return value will be used when we add support for histogram
+// index creation for multiple batches
+(void)common::DeviceSketch(param_, hist_maker_param_.gpu_batch_nrows, dmat, &hmat_);
```

No need to cast the return value to void. Just ignoring it is fine.

i'll clean all these up in a follow-up pr for the histogram. i added a todo and did this just to remind myself. the void cast is just to note, for now, that the return value is ignored.

A TODO describing that you ignore the value is fine. No need to cast to void; ignoring return values in C/C++ is OK.
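
For reference, the styles being debated, in a standalone snippet (ComputeRowStride is a stand-in for DeviceSketch):

```cpp
#include <cstddef>
#include <tuple>  // for std::ignore

size_t ComputeRowStride() { return 42; }  // stand-in for DeviceSketch

int main() {
  size_t stride = ComputeRowStride();  // use the value
  ComputeRowStride();                  // plain ignore: perfectly legal in C and C++
  (void)ComputeRowStride();            // cast to void: flags "intentionally ignored"
  std::ignore = ComputeRowStride();    // an explicit "C++ style" alternative
  (void)stride;                        // silence unused-variable warnings in this demo
  return 0;
}
```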
```diff
 namespace xgboost {
 namespace common {

-void TestDeviceSketch(const GPUSet& devices) {
+void TestDeviceSketch(const GPUSet& devices, bool use_external_memory = false) {
```

Do you need a default parameter value here? Wouldn't updating calls to TestDeviceSketch be easier?

i do not see an issue either way. i'll change it nonetheless.
```cpp
  std::shared_ptr<xgboost::DMatrix> *dmat = nullptr;

  size_t num_cols = 1;
  if (!use_external_memory) {
```

Consider if (use_external_memory) and swapping the branches.

will do
```diff
 HistCutMatrix hmat_gpu;
-DeviceSketch(batch, (*dmat)->Info(), p, &hmat_gpu, gpu_batch_nrows);
+(void)DeviceSketch(p, gpu_batch_nrows, (*dmat).get(), &hmat_gpu);
```

No need to cast to void.

this is to indicate that the return value is ignored. i can do it c++ style if you want!
```diff
 HistCutMatrix hmat_gpu;
-DeviceSketch(batch, (*dmat)->Info(), p, &hmat_gpu, gpu_batch_nrows);
+(void)DeviceSketch(p, gpu_batch_nrows, (*dmat).get(), &hmat_gpu);
```

dmat->get()

will do...
src/common/hist_util.cu (Outdated)

```cpp
  // Initialize Sketches for this dmatrix
  sketches_.resize(info.num_col_);
  col_locks_.resize(info.num_col_);
#pragma omp parallel for schedule(static) if (info.num_col_ > 1000)
```

This number occurs in more than one place. Consider defining a constant and using it instead.

done
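
A minimal sketch of the suggested change; the constant's name here is hypothetical:

```cpp
// hypothetical name - one definition shared by every site that used the bare 1000
constexpr size_t kOmpNumColsParallelizeLimit = 1000;

// ...
#pragma omp parallel for schedule(static) if (info.num_col_ > kOmpNumColsParallelizeLimit)
```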
```cpp
  }

 private:
  std::vector<std::unique_ptr<DeviceShard>> shards_;
  const tree::TrainParam &param_;
  int gpu_batch_nrows_;
  size_t row_stride_;
  std::unique_ptr<SketchContainer> sketch_container_;
};

size_t DeviceSketch
```

Please document what value is being returned. Alternatively, consider using a pointer parameter instead of returning a value.
src/common/hist_util.cu (Outdated)

```diff
 }

-/* Builds the sketches on the GPU */
+/* Builds the sketches on the GPU for the dmatrix */
```

Please document the return value.

done
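
The requested documentation might look roughly like this, assuming the return value is the row stride (the wording and doc style are illustrative):

```cpp
/*!
 * \brief Builds the quantile sketches on the GPU for the whole dmatrix,
 *        iterating over its sparse page batches, and writes the cuts to hmat.
 * \return the row stride, i.e. the maximum number of non-zero entries in any
 *         row, needed later when building the histogram index for all batches
 */
size_t DeviceSketch(const tree::TrainParam &param, int gpu_batch_nrows,
                    DMatrix *dmat, HistCutMatrix *hmat);
```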
```diff
@@ -51,7 +51,7 @@ void TestDeviceSketch(const GPUSet& devices, bool use_external_memory = false) {

 // find the cuts on the GPU
 HistCutMatrix hmat_gpu;
-(void)DeviceSketch(p, gpu_batch_nrows, (*dmat).get(), &hmat_gpu);
+(void)DeviceSketch(p, gpu_batch_nrows, dmat->get(), &hmat_gpu);
```

Could you also check that the correct row stride is returned?

done
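
The requested check could look like the following GoogleTest fragment (hypothetical; the expected value would come from the test's known data):

```cpp
// hypothetical: verify the returned row stride along with the cuts
size_t row_stride = DeviceSketch(p, gpu_batch_nrows, dmat->get(), &hmat_gpu);
ASSERT_EQ(row_stride, expected_row_stride);  // expected_row_stride known from the test matrix
```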
w.r.t. the std::vector<std::mutex> suggestion: to accommodate this, we have to explicitly make the type non-copyable/non-movable/non-assignable. i have done just that to prevent such accidental usage.
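
A minimal sketch of that guard (illustrative; the real container also holds the per-feature sketches):

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

struct SketchContainer {
  std::vector<std::mutex> col_locks_;  // one lock per feature column

  explicit SketchContainer(size_t num_cols) : col_locks_(num_cols) {}

  // explicitly delete copy/move so the container - and the mutexes inside -
  // can never be accidentally copied, moved, or assigned
  SketchContainer(const SketchContainer &) = delete;
  SketchContainer(SketchContainer &&) = delete;
  SketchContainer &operator=(const SketchContainer &) = delete;
  SketchContainer &operator=(SketchContainer &&) = delete;
};
```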
@sriramch So is the external memory feature now complete? That is, can I take today's master branch and train with GPU and external memory data?

@hcho3 - what i meant was that if i include this commit in my branch, then the cli just sits and spins, chewing a lot of memory on a large dataset that has the external cache enabled, before it can even create those sparse pages on disk. this seems like a clear regression; hence, i reverted this commit in my private branch. i did not refer to the jenkins build error - sorry for not being clear earlier. if you are interested, the full set of external memory training changes are here
@sriramch Actually, your CLI config would be sufficient for me to diagnose the issue. |
@sriramch Indeed, your CLI config makes the problem very clear. I will resolve this issue very soon. |