
Ensure that LoadSequentialFile() actually read the whole file #5831

Merged
13 commits merged on Jul 4, 2020
include/xgboost/json_io.h (4 changes: 3 additions & 1 deletion)

@@ -9,12 +9,12 @@
#include <vector>
#include <memory>
#include <string>
#include <cinttypes>
#include <utility>
#include <map>
#include <limits>
#include <sstream>
#include <locale>
#include <cinttypes>

namespace xgboost {
/*
@@ -86,6 +86,8 @@ class JsonReader {
msg += "\", got: \"";
if (got == -1) {
msg += "EOF\"";
} else if (got == 0) {
msg += "\\0\"";
} else {
msg += std::to_string(got) + " \"";
}
src/common/io.cc (32 changes: 14 additions & 18 deletions)

@@ -7,9 +7,10 @@
#include <unistd.h>
#endif // defined(__unix__)
#include <algorithm>
#include <cstdio>
#include <fstream>
#include <string>
#include <utility>
#include <cstdio>

#include "xgboost/logging.h"
#include "io.h"
@@ -120,27 +121,22 @@ std::string LoadSequentialFile(std::string fname) {
#if defined(__linux__)
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif // defined(__linux__)
ssize_t bytes_read = read(fd, &buffer[0], f_size_bytes);
hcho3 (Collaborator, Author) commented on Jun 28, 2020:

@trivialfis Previously, we called read() only once. For a file containing 2896075944 bytes (2.69 GB), read() only reads 2147479552 bytes (1.99 GB), and the variables would be set as follows:

bytes_read = 2147479552, f_size_bytes = 2896075944

This PR ensures that bytes_read ends up identical to f_size_bytes by calling read() multiple times in a while loop.
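
For readers skimming the diff, a self-contained sketch of that retry pattern follows (the helper name ReadFully is an assumption for illustration, not code from this PR; it additionally treats an early EOF as a failure):

#include <unistd.h>  // read()

// Sketch only: read exactly `size` bytes from `fd` into `buf`. A single
// read() may transfer fewer bytes than requested (on Linux at most
// 2147479552 bytes per call), so keep calling it until the buffer is full.
bool ReadFully(int fd, char *buf, ssize_t size) {
  ssize_t bytes_read = 0;
  while (bytes_read < size) {
    ssize_t result = read(fd, buf + bytes_read, size - bytes_read);
    if (result <= 0) {
      return false;  // error (< 0) or unexpected EOF (== 0); caller reports it
    }
    bytes_read += result;
  }
  return true;
}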

if (bytes_read < 0) {
close(fd);
ReadErr();
ssize_t bytes_read = 0;
while (bytes_read < f_size_bytes) {
hcho3 (Collaborator, Author) commented:

@trivialfis Take a look at this while loop. This is a commonly used construct in UNIX programs.

ssize_t result = read(fd, &buffer[bytes_read], f_size_bytes - bytes_read);
if (result < 0) {
close(fd);
ReadErr();
}
bytes_read += result;
}
close(fd);
buffer.back() = '\0';
#else // defined(__unix__)
FILE *f = fopen(fname.c_str(), "r");
if (f == NULL) {
std::string msg;
OpenErr();
}
fseek(f, 0, SEEK_END);
auto fsize = ftell(f);
fseek(f, 0, SEEK_SET);
hcho3 (Collaborator, Author) commented on lines -135 to -137 (Jun 27, 2020):

Note: According to the MSDN documentation, ftell() on Windows does not yield the correct byte offset when the file was opened in text mode.

Fix: we do away with fseek() and ftell() entirely and use std::istreambuf_iterator to read the whole file into std::string. The fix only applies on Windows.

On Linux, we use stat() to measure the file size and then call read() to read all the bytes in the file.
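
The stat() call itself is outside the hunks shown here; as a rough sketch of the size measurement described above (a hedged illustration, not the exact code in this PR):

#include <sys/stat.h>
#include <sys/types.h>
#include <string>

// Sketch only: ask the kernel for the file size so the destination buffer
// can be sized before the read() loop shown above runs.
off_t FileSizeBytes(const std::string &fname) {
  struct stat fs;
  if (stat(fname.c_str(), &fs) != 0) {
    return -1;  // caller would raise the "open failed" error here
  }
  return fs.st_size;
}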


buffer.resize(fsize + 1);
fread(&buffer[0], 1, fsize, f);
fclose(f);
std::ifstream ifs(fname);
Member commented:

Please note that this is terribly slow.

hcho3 (Collaborator, Author) commented on Jun 27, 2020:

Slower than std::fread()? Note that this line is only supposed to run on platforms where POSIX is not available.

hcho3 (Collaborator, Author) commented on Jun 27, 2020:

std::ifstream shows competitive performance with respect to std::fread().

Member commented:

Interesting, as I just tried on Linux it's about 10x slower. But I guess when the file is that huge, JSON parsing would be the bottleneck here.
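
For reference only (this PR keeps the istreambuf_iterator approach): a common way to avoid the per-character overhead discussed above is to size the string once and read it in a single call; opening the stream in binary mode also sidesteps the Windows text-mode offset caveat from the earlier comment.

#include <fstream>
#include <string>

// Reference sketch, not the code added by this PR: read a whole file into a
// pre-sized std::string with one ifstream::read() call.
std::string ReadWholeFileBuffered(const std::string &fname) {
  std::ifstream ifs(fname, std::ios::binary | std::ios::ate);
  if (!ifs) {
    return {};
  }
  std::string buffer(static_cast<size_t>(ifs.tellg()), '\0');
  ifs.seekg(0, std::ios::beg);
  ifs.read(&buffer[0], static_cast<std::streamsize>(buffer.size()));
  return buffer;
}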

buffer = std::string((std::istreambuf_iterator<char>(ifs)),
(std::istreambuf_iterator<char>()));
#endif // defined(__unix__)
buffer.back() = '\0';
return buffer;
}

src/common/json.cc (2 changes: 2 additions & 0 deletions)

@@ -427,6 +427,8 @@ void JsonReader::Error(std::string msg) const {
for (auto c : raw_portion) {
if (c == '\n') {
portion += "\\n";
} else if (c == '\0') {
portion += "\\0";
} else {
portion += c;
}
tests/ci_build/ci_build.sh (1 change: 1 addition & 0 deletions)

@@ -207,6 +207,7 @@ ${DOCKER_BINARY} run --rm --pid=host \
-v "${WORKSPACE}":/workspace \
-w /workspace \
${USER_IDS} \
--shm-size=3g \
"${CI_DOCKER_EXTRA_PARAMS[@]}" \
"${DOCKER_IMG_NAME}" \
"${COMMAND[@]}"
tests/cpp/common/test_io.cc (50 changes: 50 additions & 0 deletions)

@@ -2,6 +2,11 @@
* Copyright (c) by XGBoost Contributors 2019
*/
#include <gtest/gtest.h>
#include <dmlc/filesystem.h>
#include <atomic>
#include <type_traits>
#include <fstream>
#include <cstdint>
#include "../../../src/common/io.h"

namespace xgboost {
@@ -39,5 +44,50 @@ TEST(IO, FixedSizeStream) {
ASSERT_EQ(huge_buffer, out_buffer);
}
}


#if SIZE_MAX == 0xFFFFFFFFFFFFFFFF // Only run this test on 64-bit system
TEST(IO, LoadSequentialFile) {
const size_t nbyte = static_cast<size_t>(2896075944LL); // About 2.69 GB
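// Deliberately larger than 2147479552 bytes, the most a single read() call
// transfers on Linux, so that the retry loop in LoadSequentialFile() is
// exercised.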
static_assert(sizeof(size_t) == 8, "Assumption failed: size_t was assumed to be 8-bytes long");
static_assert(std::is_same<size_t, std::string::size_type>::value,
"Assumption failed: size_type of std::string was assumed to be 8-bytes long");

dmlc::TemporaryDirectory tempdir;
std::string path = "/dev/shm/xgboost_test_io_big_file.txt";
{
std::ofstream f(path);
if (!f) { // /dev/shm not present
LOG(INFO) << "No /dev/shm; using dmlc::TemporaryDirectory instead";
path = tempdir.path + "/xgboost_test_io_big_file.txt";
f = std::ofstream(path);
}
CHECK(f);
std::string str(nbyte, 'a');
CHECK_EQ(str.size(), nbyte);
f << str;
}
{
std::string str = LoadSequentialFile(path);
CHECK_GE(str.size(), nbyte);
dmlc::OMPException omp_exc;
std::atomic<bool> success{true};
#pragma omp parallel for schedule(static)
for (int64_t i = 0; i < static_cast<int64_t>(nbyte); ++i) {
omp_exc.Run([&] {
if (str[i] != 'a' && success.load(std::memory_order_acquire)) {
success.store(false, std::memory_order_release);
LOG(FATAL) << "Big file got corrupted. Expected: str[" << i << "] = 'a', "
<< "Actual: str[" << i << "] = '"
<< (str[i] ? std::string(1, str[i]) : std::string("\\0")) << "'";
}
});
}
omp_exc.Rethrow();
CHECK(success.load(std::memory_order_acquire));
}
}
#endif // SIZE_MAX == 0xFFFFFFFFFFFFFFFF

} // namespace common
} // namespace xgboost