enhancements for <filesystem> #3850

Merged
merged 22 commits into from
Jul 14, 2023

achabense
Contributor

  • refactor path::lexically_normal (replacing list with vector)
  • drop <list> inclusion
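For readers less familiar with the function being refactored, here is a minimal usage sketch of `path::lexically_normal`. The helper name `normalized` is hypothetical and only serves the illustration; it returns the result in generic (`/`-separated) form so the text is the same on Windows and POSIX.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: normalize a path purely lexically (no filesystem access)
// and return it in generic format, i.e. with '/' as the separator.
std::string normalized(const char* text) {
    return fs::path(text).lexically_normal().generic_string();
}
```

With this helper, `normalized("cat/./dog/..")` and `normalized("cat/.///dog/../")` both yield `cat/`, matching the first two test cases in the benchmark below (which expect `cat\` in native Windows form).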
Benchmark for lexically_normal
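#include <algorithm>
#include <exception>
#include <filesystem>
#include <iostream>
#include <iterator>
#include <string_view>
#include <utility>
#include <vector>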

#include "benchmark/benchmark.h"

using namespace std;
namespace fs = std::filesystem;

fs::path _lexically_normal_bef(const fs::path& path) {
    return path.lexically_normal();
}

fs::path _lexically_normal_now(const fs::path& path) {
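    const auto& _Text = path.native(); // same string as the private member path::_Text
    using fs::_Is_slash; // MSVC STL internal predicate; lets the unqualified uses below resolve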
    constexpr auto preferred_separator = fs::path::preferred_separator;

    constexpr wstring_view _Dot = L"."sv;
    constexpr wstring_view _Dot_dot = L".."sv;

    // N4950 [fs.path.generic]/6:
    // "Normalization of a generic format pathname means:"

    // "1. If the path is empty, stop."
    if (path.empty()) {
        return {};
    }

    // "2. Replace each slash character in the root-name with a preferred-separator."
    const auto _First = _Text.data();
    const auto _Last = _First + _Text.size();
    const auto _Root_name_end = fs::_Find_root_name_end(_First, _Last);
    fs::path::string_type _Normalized(_First, _Root_name_end);
    _STD replace(_Normalized.begin(), _Normalized.end(), L'/', L'\\');

    // "3. Replace each directory-separator with a preferred-separator.
    // [ Note: The generic pathname grammar (29.11.7.1) defines directory-separator
    // as one or more slashes and preferred-separators. -end note ]"
    vector<wstring_view> _Vec; // Empty wstring_view means directory-separator
    // that will be normalized to a preferred-separator.
    // Non-empty wstring_view means filename.
    _Vec.reserve(13); // avoid frequent re-allocations
    bool _Has_root_directory = false; // true: there is a slash right after root name.
    auto _Pos = _Root_name_end;
    if (_Pos != _Last && _Is_slash(*_Pos)) {
        _Has_root_directory = true;
        _Normalized += preferred_separator;
        ++_Pos;
        while (_Pos != _Last && _Is_slash(*_Pos)) { ++_Pos; }
    }
    // _Vec will start with a filename (if not empty).
    while (_Pos != _Last) {
        if (_Is_slash(*_Pos)) {
            if (_Vec.empty() || !_Vec.back().empty()) {
                // collapse one or more slashes and preferred-separators to one empty wstring_view
                _Vec.emplace_back();
            }
            ++_Pos;
        }
        else {
            const auto _Filename_end = _STD find_if(_Pos + 1, _Last, _Is_slash);
            _Vec.emplace_back(_Pos, static_cast<size_t>(_Filename_end - _Pos));
            _Pos = _Filename_end;
        }
    }

    // "4. Remove each dot filename and any immediately following directory-separator."
    // "5. As long as any appear, remove a non-dot-dot filename immediately followed by a
    // directory-separator and a dot-dot filename, along with any immediately following directory-separator."
    // "6. If there is a root-directory, remove all dot-dot filenames
    // and any directory-separators immediately following them.
    // [ Note: These dot-dot filenames attempt to refer to nonexistent parent directories. -end note ]"
    auto _New_end = _Vec.begin();
    for (auto _Pos = _Vec.begin(); _Pos != _Vec.end();) {
        auto _Elem = *_Pos++; // _Pos points at end or a separator after ++.
        if (_Elem == _Dot) {
            // ignore dot (and following separator).
            if (_Pos == _Vec.end()) { break; }
        }
        else if (_Elem == _Dot_dot) {
            if (_New_end != _Vec.begin() && *prev(_New_end, 2) != _Dot_dot) {
                // note: _New_end == _Vec.begin() + 2n
                // remove preceding non-dot-dot filename and separator.
                _New_end -= 2;
                if (_Pos == _Vec.end()) { break; }
            }
            else if (!_Has_root_directory) {
                // due to 6, append dot-dot only when !_Has_root_directory.
                *_New_end++ = _Dot_dot;
                if (_Pos == _Vec.end()) { break; }
                *_New_end++ = {}; // as _Pos != _Vec.end(), it points at a separator; add it.
            }
            else {
                // ignore dot-dot (and following separator).
                if (_Pos == _Vec.end()) { break; }
            }
        }
        else {
            // append normal filename and separator.
            *_New_end++ = _Elem;
            if (_Pos == _Vec.end()) { break; }
            *_New_end++ = {}; // add separator.
        }
        ++_Pos; // _Pos points at a separator here in all cases; skip it.
    }
    _Vec.erase(_New_end, _Vec.end());

    // "7. If the last filename is dot-dot, remove any trailing directory-separator."
    if (_Vec.size() >= 2 && _Vec.back().empty() && *(_STD prev(_Vec.end(), 2)) == _Dot_dot) {
        _Vec.pop_back();
    }

    // Build up _Normalized by flattening _Vec.
    for (const auto& _Elem : _Vec) {
        if (_Elem.empty()) {
            _Normalized += preferred_separator;
        }
        else {
            _Normalized += _Elem;
        }
    }

    // "8. If the path is empty, add a dot."
    if (_Normalized.empty()) {
        _Normalized = _Dot;
    }

    // "The result of normalization is a path in normal form, which is said to be normalized."
    return fs::path(_STD move(_Normalized));
}

using path = fs::path;
using _pfn = fs::path(*)(const fs::path&);

// modified from tests\P2018R1_filesystem\test.cpp
void _check(_pfn _lexically_normal, const path& path, wstring_view target) {
    fs::path result = _lexically_normal(path);
    if (result != target) [[unlikely]] {
        terminate();
        //wcerr << quoted(path.native()) << "->" << quoted(result.native()) << '\n';
    }
}

void test_1(_pfn _lexically_normal) {
    _check(_lexically_normal, path(LR"(cat/./dog/..)"sv), LR"(cat\)"sv);
    _check(_lexically_normal, path(LR"(cat/.///dog/../)"sv), LR"(cat\)"sv);
    _check(_lexically_normal, path(LR"(.)"sv), LR"(.)"sv);
    _check(_lexically_normal, path(LR"(.\)"sv), LR"(.)"sv);
    _check(_lexically_normal, path(LR"(.\.)"sv), LR"(.)"sv);
    _check(_lexically_normal, path(LR"(.\.\)"sv), LR"(.)"sv);
}

void test_2(_pfn _lexically_normal) {
    _check(_lexically_normal, path(LR"()"sv), LR"()"sv);
    _check(_lexically_normal, path(LR"(X:)"sv), LR"(X:)"sv);
    _check(_lexically_normal, path(LR"(X:DriveRelative)"sv), LR"(X:DriveRelative)"sv);
    _check(_lexically_normal, path(LR"(X:\)"sv), LR"(X:\)"sv);
    _check(_lexically_normal, path(LR"(X:/)"sv), LR"(X:\)"sv);
    _check(_lexically_normal, path(LR"(X:\\\)"sv), LR"(X:\)"sv);
    _check(_lexically_normal, path(LR"(X:///)"sv), LR"(X:\)"sv);
}

void test_3(_pfn _lexically_normal) {
    _check(_lexically_normal, path(LR"(X:\DosAbsolute)"sv), LR"(X:\DosAbsolute)"sv);
    _check(_lexically_normal, path(LR"(X:/DosAbsolute)"sv), LR"(X:\DosAbsolute)"sv);
    _check(_lexically_normal, path(LR"(X:\\\DosAbsolute)"sv), LR"(X:\DosAbsolute)"sv);
    _check(_lexically_normal, path(LR"(X:///DosAbsolute)"sv), LR"(X:\DosAbsolute)"sv);
    _check(_lexically_normal, path(LR"(\RootRelative)"sv), LR"(\RootRelative)"sv);
    _check(_lexically_normal, path(LR"(/RootRelative)"sv), LR"(\RootRelative)"sv);
    _check(_lexically_normal, path(LR"(\\\RootRelative)"sv), LR"(\RootRelative)"sv);
    _check(_lexically_normal, path(LR"(///RootRelative)"sv), LR"(\RootRelative)"sv);
    _check(_lexically_normal, path(LR"(\\server\share)"sv), LR"(\\server\share)"sv);
    _check(_lexically_normal, path(LR"(//server/share)"sv), LR"(\\server\share)"sv);
    _check(_lexically_normal, path(LR"(\\server\\\share)"sv), LR"(\\server\share)"sv);
    _check(_lexically_normal, path(LR"(//server///share)"sv), LR"(\\server\share)"sv);
}

void test_4(_pfn _lexically_normal) {
    _check(_lexically_normal, path(LR"(\\?\device)"sv), LR"(\\?\device)"sv);
    _check(_lexically_normal, path(LR"(//?/device)"sv), LR"(\\?\device)"sv);
    _check(_lexically_normal, path(LR"(\??\device)"sv), LR"(\??\device)"sv);
    _check(_lexically_normal, path(LR"(/??/device)"sv), LR"(\??\device)"sv);
    _check(_lexically_normal, path(LR"(\\.\device)"sv), LR"(\\.\device)"sv);
    _check(_lexically_normal, path(LR"(//./device)"sv), LR"(\\.\device)"sv);
}

void test_5(_pfn _lexically_normal) {
    _check(_lexically_normal, path(LR"(\\?\UNC\server\share)"sv), LR"(\\?\UNC\server\share)"sv);
    _check(_lexically_normal, path(LR"(//?/UNC/server/share)"sv), LR"(\\?\UNC\server\share)"sv);
    _check(_lexically_normal, path(LR"(C:\a/b\\c\/d/\e//f)"sv), LR"(C:\a\b\c\d\e\f)"sv);
}

void test_6(_pfn _lexically_normal) {
    _check(_lexically_normal, path(LR"(C:\meow\)"sv), LR"(C:\meow\)"sv);
    _check(_lexically_normal, path(LR"(C:\meow/)"sv), LR"(C:\meow\)"sv);
    _check(_lexically_normal, path(LR"(C:\meow\\)"sv), LR"(C:\meow\)"sv);
    _check(_lexically_normal, path(LR"(C:\meow\/)"sv), LR"(C:\meow\)"sv);
    _check(_lexically_normal, path(LR"(C:\meow/\)"sv), LR"(C:\meow\)"sv);
    _check(_lexically_normal, path(LR"(C:\meow//)"sv), LR"(C:\meow\)"sv);
}

void test_7(_pfn _lexically_normal) {
    _check(_lexically_normal, path(LR"(C:\a\.\b\.\.\c\.\.\.)"sv), LR"(C:\a\b\c\)"sv);
    _check(_lexically_normal, path(LR"(C:\a\.\b\.\.\c\.\.\.\)"sv), LR"(C:\a\b\c\)"sv);
    _check(_lexically_normal, path(LR"(C:\a\b\c\d\e\..\f\..\..\..\g\h)"sv), LR"(C:\a\b\g\h)"sv);
    _check(_lexically_normal, path(LR"(C:\a\b\c\d\e\..\f\..\..\..\g\h\..)"sv), LR"(C:\a\b\g\)"sv);
    _check(_lexically_normal, path(LR"(C:\a\b\c\d\e\..\f\..\..\..\g\h\..\)"sv), LR"(C:\a\b\g\)"sv);
}

void test_8(_pfn _lexically_normal) {
    _check(_lexically_normal, path(LR"(..\..\..)"sv), LR"(..\..\..)"sv);
    _check(_lexically_normal, path(LR"(..\..\..\)"sv), LR"(..\..\..)"sv);
    _check(_lexically_normal, path(LR"(..\..\..\a\b\c)"sv), LR"(..\..\..\a\b\c)"sv);
    _check(_lexically_normal, path(LR"(\..\..\..)"sv), LR"(\)"sv);
    _check(_lexically_normal, path(LR"(\..\..\..\)"sv), LR"(\)"sv);
    _check(_lexically_normal, path(LR"(\..\..\..\a\b\c)"sv), LR"(\a\b\c)"sv);
    _check(_lexically_normal, path(LR"(a\..)"sv), LR"(.)"sv);
    _check(_lexically_normal, path(LR"(a\..\)"sv), LR"(.)"sv);
}

void test_9(_pfn _lexically_normal) {
    _check(_lexically_normal, path(LR"(/\server/\share/\a/\b/\c/\./\./\d/\../\../\../\../\../\../\../\other/x/y/z/.././..\meow.txt)"sv)
        , LR"(\\server\other\x\meow.txt)"sv);
}

template<auto _test>
void BM_bef(benchmark::State& state) {
    for (auto _ : state) {
        _test(_lexically_normal_bef);
    }
}

template<auto _test>
void BM_now(benchmark::State& state) {
    for (auto _ : state) {
        _test(_lexically_normal_now);
    }
}

void BM_bef_Tot(benchmark::State& state) {
    for (auto _ : state) {
        test_1(_lexically_normal_bef);
        test_2(_lexically_normal_bef);
        test_3(_lexically_normal_bef);
        test_4(_lexically_normal_bef);
        test_5(_lexically_normal_bef);
        test_6(_lexically_normal_bef);
        test_7(_lexically_normal_bef);
        test_8(_lexically_normal_bef);
        test_9(_lexically_normal_bef);
    }
}

void BM_now_Tot(benchmark::State& state) {
    for (auto _ : state) {
        test_1(_lexically_normal_now);
        test_2(_lexically_normal_now);
        test_3(_lexically_normal_now);
        test_4(_lexically_normal_now);
        test_5(_lexically_normal_now);
        test_6(_lexically_normal_now);
        test_7(_lexically_normal_now);
        test_8(_lexically_normal_now);
        test_9(_lexically_normal_now);
    }
}

BENCHMARK(BM_bef<test_1>);
BENCHMARK(BM_now<test_1>);
BENCHMARK(BM_bef<test_2>);
BENCHMARK(BM_now<test_2>);
BENCHMARK(BM_bef<test_3>);
BENCHMARK(BM_now<test_3>);
BENCHMARK(BM_bef<test_4>);
BENCHMARK(BM_now<test_4>);
BENCHMARK(BM_bef<test_5>);
BENCHMARK(BM_now<test_5>);
BENCHMARK(BM_bef<test_6>);
BENCHMARK(BM_now<test_6>);
BENCHMARK(BM_bef<test_7>);
BENCHMARK(BM_now<test_7>);
BENCHMARK(BM_bef<test_8>);
BENCHMARK(BM_now<test_8>);
BENCHMARK(BM_bef<test_9>);
BENCHMARK(BM_now<test_9>);
BENCHMARK(BM_bef_Tot);
BENCHMARK(BM_now_Tot);

// typical path
void test_10(_pfn _lexically_normal) {
    const path p(LR"(C:\Program Files\Azure Data Studio\resources\app\extensions\bat\snippets\batchfile.code-snippets)");
    benchmark::DoNotOptimize(_lexically_normal(p));
}

void test_11(_pfn _lexically_normal) {
    const path p(LR"(..\snippets\batchfile.code-snippets)");
    benchmark::DoNotOptimize(_lexically_normal(p));
}

BENCHMARK(BM_bef<test_10>);
BENCHMARK(BM_now<test_10>);
BENCHMARK(BM_bef<test_11>);
BENCHMARK(BM_now<test_11>);

BENCHMARK_MAIN();
Result

[Image: benchmark results]

The main bottleneck of the original implementation was memory allocation. As the benchmark shows, the new approach works much better for "long" paths (like cat/./dog/.. or longer). For very short paths (like X:///) the efficiency gain is not significant.
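The allocation point can be illustrated with a simplified sketch. It is not the STL code: the helper name `split_components`, the narrow-character input, and the `/`-only separator are assumptions made for brevity. The idea it shows is the one used in the benchmark above: store non-owning views into the original text in a vector with reserved capacity, so a typical path needs a single allocation instead of one per list node.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: split `text` on '/' into views that point into `text`.
// No characters are copied; the only allocation is the vector's buffer.
std::vector<std::string_view> split_components(std::string_view text) {
    std::vector<std::string_view> parts;
    parts.reserve(13); // same idea as _Vec.reserve(13) in the benchmark

    std::size_t pos = 0;
    while (pos < text.size()) {
        if (text[pos] == '/') {
            ++pos; // skip separators; runs of '/' collapse
            continue;
        }
        std::size_t end = text.find('/', pos);
        if (end == std::string_view::npos) {
            end = text.size();
        }
        parts.push_back(text.substr(pos, end - pos));
        pos = end;
    }
    return parts;
}
```

The returned views are only valid while the original text is alive, which is why the real implementation flattens them into the result string before returning.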

@achabense achabense requested a review from a team as a code owner July 5, 2023 21:25
@StephanTLavavej StephanTLavavej added performance Must go faster filesystem C++17 filesystem labels Jul 5, 2023
@StephanTLavavej
Member

As lexically_normal is pure computation, it would seem reasonable to add your benchmark to this PR. (It only needs to measure the improved implementation of lexically_normal, not provide the old implementation for comparison.)

Additionally, { break; } will fail clang-format, which adds our conventional newlines here. As a reminder, we recommend configuring your editor to clang-format on save, to avoid having such formatting issues detected by the CI system. Opening the repo in VSCode with a clang-format extension will automatically do this:

"editor.formatOnSave": false,
"[cpp]": {
    "editor.formatOnSave": true
},
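To illustrate the formatting remark, the sketch below shows the brace layout that the single-line `{ break; }` form would presumably be expanded to, assuming the repository's clang-format configuration puts the body of a short block on its own line (as the comment above implies). The function `first_slash` is a hypothetical helper used only to carry the example.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, used only to show the brace layout.
// Returns the index of the first '/' or '\\' in `text`, or text.size() if none.
std::size_t first_slash(std::wstring_view text) {
    std::size_t pos = 0;
    for (; pos != text.size(); ++pos) {
        if (text[pos] == L'/' || text[pos] == L'\\') {
            break; // body on its own line instead of `{ break; }`
        }
    }
    return pos;
}
```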

@achabense
Contributor Author

I will update these benchmarks once I have time.

It only needs to measure the improved implementation of lexically_normal, not provide the old implementation for comparison.

Does this mean that I only need to provide test cases in the benchmark, and call library functions directly, roughly like this?

Example
#include <filesystem>

#include "benchmark/benchmark.h"

using namespace std;
namespace fs = std::filesystem;

void BM_lexically_normal(benchmark::State& state) {
  const fs::path p( // a plain object; a const reference bound to a temporary is not needed here
      LR"(C:\Program Files\Azure Data Studio\resources\app\extensions\bat\snippets\batchfile.code-snippets)");
  for (auto _ : state) {
    benchmark::DoNotOptimize(p.lexically_normal());
  }
}

BENCHMARK(BM_lexically_normal);
// more tests...

BENCHMARK_MAIN();

@StephanTLavavej
Member

Thanks!

Does this mean that I only need to provide test cases in the benchmark, and call library functions directly, roughly like this?

Yep, then we can compare the results before and after proposed changes.

Contributor

@frederick-vs-ja frederick-vs-ja left a comment

We should make calls to std::prev ADL-proof. As you've changed to use vector now, I believe we can use _New_end[-2]/_Vec.end()[-2].
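A small sketch of the suggestion, under the assumption that the elements are `wstring_view`s in a `vector` as in the benchmark above. Because vector iterators are random-access, indexing with a negative offset reaches the same element as `prev(it, 2)` without an unqualified function call that argument-dependent lookup could redirect. The helper name `drop_separator_after_dot_dot` is hypothetical; it mirrors step 7 of the normalization.

```cpp
#include <string_view>
#include <vector>

// Hypothetical helper mirroring step 7: if the last filename is "..",
// remove the trailing separator (represented by an empty view).
void drop_separator_after_dot_dot(std::vector<std::wstring_view>& vec) {
    // vec.end()[-2] is the element before the last one; the size check
    // guarantees that it exists before it is read.
    if (vec.size() >= 2 && vec.back().empty() && vec.end()[-2] == L"..") {
        vec.pop_back();
    }
}
```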

achabense and others added 5 commits July 6, 2023 15:53
Co-authored-by: A. Jiang <de34@live.cn>
Co-authored-by: A. Jiang <de34@live.cn>
Contributor

@strega-nil-ms strega-nil-ms left a comment

That is the only thing I see; this is fantastic otherwise. Thanks so much!

@StephanTLavavej
Member

Thanks, looks great! 😻 🗃️ I pushed minor changes, the most significant being an update to benchmarks/CMakeLists.txt so the new benchmark is automatically built by the suite. FYI @strega-nil-ms as you had already approved.

@StephanTLavavej StephanTLavavej removed their assignment Jul 7, 2023
LR"(X:DriveRelative)"sv,
LR"(\\server\\\share)"sv,
LR"(STL/.github/workflows/../..)"sv,
LR"(C:\Program Files\Azure Data Studio\resources\app\extensions\bat\snippets\batchfile.code-snippets)"sv,
Contributor Author

@achabense achabense Jul 8, 2023

Is it better to replace these two test cases with more generic ones?

Member

Seems fine to me since it's a realistic Microsoft path. If it were from some other program then I'd request a generic meow-style example.

Contributor Author

@achabense achabense Jul 8, 2023

Yes, it is a real path on my computer.

For C:\Program Files\Azure Data Studio... I found a better one:
C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\VC\Snippets (it has the same number of segments and looks shorter than the last test case, /\server/\share/\a..., which makes it more obvious that the test cases are ordered by increasing number of segments).

(I'm not sure whether I can push to ready-to-merge PRs.)

Contributor

If you want to add more tests, it's probably better to do it in a new PR to not reset the review process.

Member

Generally, when a PR has been moved to Ready To Merge, changes should only be pushed when they're big enough to be worth it (fixing bugs or major omissions), or they're small enough to be unquestionably safe like comment typo fixes. You should ping the maintainers who approved, on GitHub and/or Discord, so they notice and can take another look.

Basically, we maintainers love doing work, but we want to focus it on necessary work that keeps PRs flowing and the codebase moving forward. We try to minimize avoidable work, so we prefer to avoid scenarios like:

  • When a PR that was ready to go is changed in a way that requires further rework.
    • If the change was an attempt to fix a real bug, then this is not annoying - we'd rather pull a PR at the last minute than merge a bug. However, if the change wasn't really necessary, then iterating on it again is extra work.
  • When changes are pushed after Ready To Merge without maintainers being notified.
    • This is a potential process loophole so we don't like it - we require all codebase modifications to be reviewed by another maintainer so last-minute changes that try to sneak in are not cool. (This is applied uniformly; if maintainer X creates a PR, and maintainer Y approves, then X doesn't get to push further changes without Y taking a look. The only exceptions are fixing build breaks and test failures during mirroring, in which case we notify but don't block on getting re-approval.)
  • When changes are pushed after a maintainer has self-assigned the PR and commented "I'm mirroring this to the MSVC-internal codebase, please notify me of any changes", and they aren't for a really good reason.
    • Resetting this mirroring process is extra extra work and adds a lot of delay.

Contributor Author

hm, thanks for clarification 👀

@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit f6401b1 into microsoft:main Jul 14, 2023
35 checks passed
@StephanTLavavej
Member

Thanks for improving the performance of my old implementation here! 😻 🚀 🐇

@achabense achabense deleted the _For_filesystem branch July 14, 2023 13:02