Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that LoadSequentialFile() actually read the whole file #5831

Merged
merged 13 commits into from
Jul 4, 2020
4 changes: 3 additions & 1 deletion include/xgboost/json_io.h
Original file line number Diff line number Diff line change
Expand Up @@ -9,12 +9,12 @@
#include <vector>
#include <memory>
#include <string>
#include <cinttypes>
#include <utility>
#include <map>
#include <limits>
#include <sstream>
#include <locale>
#include <cinttypes>

namespace xgboost {
/*
Expand Down Expand Up @@ -86,6 +86,8 @@ class JsonReader {
msg += "\", got: \"";
if (got == -1) {
msg += "EOF\"";
} else if (got == 0) {
msg += "\\0\"";
} else {
msg += std::to_string(got) + " \"";
}
Expand Down
45 changes: 12 additions & 33 deletions src/common/io.cc
Original file line number Diff line number Diff line change
Expand Up @@ -7,9 +7,10 @@
#include <unistd.h>
#endif // defined(__unix__)
#include <algorithm>
#include <cstdio>
#include <fstream>
#include <string>
#include <utility>
#include <cstdio>

#include "xgboost/logging.h"
#include "io.h"
Expand Down Expand Up @@ -108,39 +109,17 @@ std::string LoadSequentialFile(std::string fname) {
};

std::string buffer;
#if defined(__unix__)
struct stat fs;
if (stat(fname.c_str(), &fs) != 0) {
OpenErr();
}

size_t f_size_bytes = fs.st_size;
buffer.resize(f_size_bytes + 1);
int32_t fd = open(fname.c_str(), O_RDONLY);
#if defined(__linux__)
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif // defined(__linux__)
ssize_t bytes_read = read(fd, &buffer[0], f_size_bytes);
Copy link
Collaborator Author

@hcho3 hcho3 Jun 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trivialfis Previously, we called read() only once. For a file containing 2896075944 bytes (2.69 GB), read() only reads 2147479552 bytes (1.99 GB), and the variables would be set as follows:

bytes_read = 2147479552, f_size_bytes = 2896075944

This PR ensures that bytes_read would be identical to f_size_bytes, by calling read() multiple times in a while loop.

if (bytes_read < 0) {
close(fd);
ReadErr();
}
close(fd);
#else // defined(__unix__)
FILE *f = fopen(fname.c_str(), "r");
if (f == NULL) {
std::string msg;
OpenErr();
}
fseek(f, 0, SEEK_END);
auto fsize = ftell(f);
fseek(f, 0, SEEK_SET);
Comment on lines -135 to -137
Copy link
Collaborator Author

@hcho3 hcho3 Jun 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: According to the MSDN documentation, ftell() on Windows does not yield the correct byte offset when the file was opened in the text mode.

Fix. We do away with fseek() and ftell() entirely and use std::istreambuf_iterator to read the whole file into std::string. The fix only applies on Windows.

On Linux, we use stat() to measure the file size and then call read() to read all the bytes in the file.


buffer.resize(fsize + 1);
fread(&buffer[0], 1, fsize, f);
fclose(f);
#endif // defined(__unix__)
// Open in binary mode so that correct file size can be computed with seekg().
// This accommodates Windows platform:
// https://docs.microsoft.com/en-us/cpp/standard-library/basic-istream-class?view=vs-2019#seekg
std::ifstream ifs(fname, std::ios_base::binary | std::ios_base::in);
ifs.seekg(0, std::ios_base::end);
const size_t file_size = static_cast<size_t>(ifs.tellg());
ifs.seekg(0, std::ios_base::beg);
buffer.resize(file_size + 1);
ifs.read(&buffer[0], file_size);
buffer.back() = '\0';

return buffer;
}

Expand Down
1 change: 0 additions & 1 deletion src/common/io.h
Original file line number Diff line number Diff line change
Expand Up @@ -75,7 +75,6 @@ class FixedSizeStream : public PeekableInStream {
std::string buffer_;
};

// Optimized for consecutive file loading in unix like systime.
std::string LoadSequentialFile(std::string fname);

inline std::string FileExtension(std::string const& fname) {
Expand Down
2 changes: 2 additions & 0 deletions src/common/json.cc
Original file line number Diff line number Diff line change
Expand Up @@ -427,6 +427,8 @@ void JsonReader::Error(std::string msg) const {
for (auto c : raw_portion) {
if (c == '\n') {
portion += "\\n";
} else if (c == '\0') {
portion += "\\0";
} else {
portion += c;
}
Expand Down