SSE2 & AVX2 std::find & std::count #2434

Merged
merged 94 commits into from
Apr 4, 2022

Conversation

AlexGuteniev
Contributor

@AlexGuteniev AlexGuteniev commented Dec 18, 2021

📝 Summary

🏁 Perf benchmark

Benchmark
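#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdlib>  // abort
#include <iomanip>  // std::setw
#include <iostream>
#include <ranges>
#include <intrin.h> // _ReadWriteBarrier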

enum class Kind {
    FindSized,
    FindUnsized,
    Count,
};

template<typename T>
void benchmark_find(T* a, std::size_t max, size_t start, size_t pos, Kind kind, size_t rep) {
    size_t count_expected;
    std::fill_n(a, max, '0');
    if (pos < max && pos >= start) {
        a[pos] = '1';
        count_expected = 1;
    }
    else {
        if (kind == Kind::FindUnsized) {
            abort();
        }
        count_expected = 0;
    }

    auto t1 = std::chrono::steady_clock::now();

    switch (kind)
    {
    case Kind::FindSized:
        for (std::size_t s = 0; s < rep; s++) {
            _ReadWriteBarrier(); // To avoid the compiler moving `memchr` out of loop

            if (std::ranges::find(a + start, a + max, '1') != a + pos) {
                abort();
            }
        }
        break;
    case Kind::FindUnsized:
        for (std::size_t s = 0; s < rep; s++) {
            _ReadWriteBarrier(); // To avoid the compiler moving `memchr` out of loop

            if (std::ranges::find(a + start, std::unreachable_sentinel, '1') != a + pos) {
                abort();
            }
        }
        break;
    case Kind::Count:
        for (std::size_t s = 0; s < rep; s++) {
            _ReadWriteBarrier(); // To avoid the compiler moving `memchr` out of loop

            if (std::ranges::count(a + start, a + max, '1') != count_expected) {
                abort();
            }
        }
        break;
    }

    auto t2 = std::chrono::steady_clock::now();

    const char* op_str = nullptr;
    switch (kind)
    {
    case Kind::FindSized:
        op_str = "find sized";
        break;
    case Kind::FindUnsized:
        op_str = "find unsized";
        break;
    case Kind::Count:
        op_str = "count";
        break;
    }
    std::cout << std::setw(10) << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() << "s  -- "
        << "Op " << op_str << " Size " << sizeof(T) << " byte elements, array size " << max
        << " starting at " << start << " found at " << pos << "; " << rep << " repetitions \n";
}


constexpr std::size_t Nmax = 8192;

alignas(64) std::uint8_t    a8[Nmax];
alignas(64) std::uint16_t   a16[Nmax];
alignas(64) std::uint32_t   a32[Nmax];
alignas(64) std::uint64_t   a64[Nmax];

int main()
{
    std::cout << "Vector alg used: " << _USE_STD_VECTOR_ALGORITHMS << "\n";

    benchmark_find(a8, Nmax, 0, 3456, Kind::FindSized, 1000000);
    benchmark_find(a16, Nmax, 0, 3456, Kind::FindSized, 1000000);
    benchmark_find(a32, Nmax, 0, 3456, Kind::FindSized, 1000000);
    benchmark_find(a64, Nmax, 0, 3456, Kind::FindSized, 1000000);

    benchmark_find(a8, Nmax, 0, 3456, Kind::FindUnsized, 1000000);
    benchmark_find(a16, Nmax, 0, 3456, Kind::FindUnsized, 1000000);
    benchmark_find(a32, Nmax, 0, 3456, Kind::FindUnsized, 1000000);
    benchmark_find(a64, Nmax, 0, 3456, Kind::FindUnsized, 1000000);

    benchmark_find(a8, Nmax, 0, 3456, Kind::Count, 1000000);
    benchmark_find(a16, Nmax, 0, 3456, Kind::Count, 1000000);
    benchmark_find(a32, Nmax, 0, 3456, Kind::Count, 1000000);
    benchmark_find(a64, Nmax, 0, 3456, Kind::Count, 1000000);

    std::cout << "Done\n";

    return 0;
}
Benchmark run and results
**********************************************************************
** Visual Studio 2022 Developer Command Prompt v17.1.0-pre.1.1
** Copyright (c) 2021 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64'

C:\Program Files\Microsoft Visual Studio\2022\Preview>cd/d C:\Project\vector_find_benchmark

C:\Project\vector_find_benchmark>set INCLUDE=C:\Project\STL\out\build\x64\out\inc;%INCLUDE%

C:\Project\vector_find_benchmark>set LIB=C:\Project\STL\out\build\x64\out\lib\amd64;%LIB%

C:\Project\vector_find_benchmark>set PATH=C:\Project\STL\out\build\x64\out\bin\amd64;%PATH%

C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=0 /nologo vector_find_benchmark.cpp
vector_find_benchmark.cpp

C:\Project\vector_find_benchmark>vector_find_benchmark.exe
Vector alg used: 0
  0.123933s  -- Op find sized Size 1 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
  0.911876s  -- Op find sized Size 2 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
  0.917381s  -- Op find sized Size 4 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
  0.975016s  -- Op find sized Size 8 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
    1.1216s  -- Op find unsized Size 1 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
   1.14476s  -- Op find unsized Size 2 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
    1.1659s  -- Op find unsized Size 4 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
   1.11486s  -- Op find unsized Size 8 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
    4.2511s  -- Op count Size 1 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
   4.23966s  -- Op count Size 2 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
   4.26283s  -- Op count Size 4 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
   3.19778s  -- Op count Size 8 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
Done

C:\Project\vector_find_benchmark>cl /O2 /std:c++latest /EHsc /D_USE_STD_VECTOR_ALGORITHMS=1 /nologo vector_find_benchmark.cpp
vector_find_benchmark.cpp

C:\Project\vector_find_benchmark>vector_find_benchmark.exe
Vector alg used: 1
 0.0566276s  -- Op find sized Size 1 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
 0.0979886s  -- Op find sized Size 2 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
  0.185901s  -- Op find sized Size 4 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
  0.357151s  -- Op find sized Size 8 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
  0.040667s  -- Op find unsized Size 1 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
 0.0683121s  -- Op find unsized Size 2 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
  0.239233s  -- Op find unsized Size 4 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
  0.247233s  -- Op find unsized Size 8 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
  0.185978s  -- Op count Size 1 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
  0.349529s  -- Op count Size 2 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
  0.722243s  -- Op count Size 4 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
   1.59221s  -- Op count Size 8 byte elements, array size 8192 starting at 0 found at 3456; 1000000 repetitions
Done
Results table
| Operation | Element size | Before | After |
|---|---|---|---|
| find sized | 1 byte | 0.123933s | 0.0566276s |
| find sized | 2 bytes | 0.911876s | 0.0979886s |
| find sized | 4 bytes | 0.917381s | 0.185901s |
| find sized | 8 bytes | 0.975016s | 0.357151s |
| find unsized | 1 byte | 1.1216s | 0.040667s |
| find unsized | 2 bytes | 1.14476s | 0.0683121s |
| find unsized | 4 bytes | 1.1659s | 0.239233s |
| find unsized | 8 bytes | 1.11486s | 0.247233s |
| count | 1 byte | 4.2511s | 0.185978s |
| count | 2 bytes | 4.23966s | 0.349529s |
| count | 4 bytes | 4.26283s | 0.722243s |
| count | 8 bytes | 3.19778s | 1.59221s |

TL;DR: every case benefits significantly, and some cases improve by more than an order of magnitude.
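To make the summary concrete, the speedup factor of each row is simply the "before" time divided by the "after" time. The sketch below recomputes those factors from the table; the `Timing` struct and both helper functions are names made up for this illustration and are not part of the PR.

```cpp
#include <cassert>
#include <cstddef>

// One row of the results table. Timings are in seconds, copied from the post.
struct Timing {
    const char* label;
    double before_s;
    double after_s;
};

constexpr Timing timings[] = {
    {"find sized, 1 byte", 0.123933, 0.0566276},
    {"find sized, 2 bytes", 0.911876, 0.0979886},
    {"find sized, 4 bytes", 0.917381, 0.185901},
    {"find sized, 8 bytes", 0.975016, 0.357151},
    {"find unsized, 1 byte", 1.1216, 0.040667},
    {"find unsized, 2 bytes", 1.14476, 0.0683121},
    {"find unsized, 4 bytes", 1.1659, 0.239233},
    {"find unsized, 8 bytes", 1.11486, 0.247233},
    {"count, 1 byte", 4.2511, 0.185978},
    {"count, 2 bytes", 4.23966, 0.349529},
    {"count, 4 bytes", 4.26283, 0.722243},
    {"count, 8 bytes", 3.19778, 1.59221},
};

// Speedup factor of one row: how many times faster the vectorized run was.
inline double speedup(const Timing& t) {
    assert(t.after_s > 0.0);
    return t.before_s / t.after_s;
}

// Number of rows whose speedup is at least `factor`.
inline std::size_t rows_with_speedup_at_least(double factor) {
    std::size_t rows = 0;
    for (const Timing& t : timings) {
        if (speedup(t) >= factor) {
            ++rows;
        }
    }
    return rows;
}
```

With these numbers, four rows reach a factor of ten or more (find unsized and count, each for 1-byte and 2-byte elements), and every row is at least twice as fast.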

When AVX is disabled, the benefit becomes small in the following cases:

  • 64-bit, find sized, 1-byte elements: memchr also uses SSE2, and my version is very similar to it, so it beats memchr only slightly
  • 64-bit, all algorithms, 8-byte elements: a vector holds only two elements, so the vectorization overhead becomes comparable to the gain

Still, no regressions were found in the tested cases.
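For readers unfamiliar with the approach, here is a minimal sketch of an SSE2 find over 4-byte elements. It is an illustration written for this note, not the code from the PR: `find_u32_sse2` and `lowest_set_bit` are hypothetical names, and the real implementation may handle alignment and tails differently. A 16-byte SSE2 register holds 16 / sizeof(T) elements, so this loop tests four elements per iteration, while 8-byte elements would get only two, which is why the gain shrinks there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: index of the lowest set bit of a nonzero mask.
inline unsigned lowest_set_bit(unsigned mask) {
    assert(mask != 0);
    unsigned index = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        ++index;
    }
    return index;
}

// Hypothetical sketch: returns a pointer to the first element equal to
// `value` in [first, last), or `last` if there is none.
inline const std::uint32_t* find_u32_sse2(
    const std::uint32_t* first, const std::uint32_t* last, std::uint32_t value) {
    const __m128i needle = _mm_set1_epi32(static_cast<int>(value));

    // Vector part: compare four elements at a time.
    while (last - first >= 4) {
        const __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(first));
        const __m128i equal = _mm_cmpeq_epi32(data, needle);
        // One mask bit per byte, so each matching element sets four bits.
        const unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(equal));
        if (mask != 0) {
            return first + lowest_set_bit(mask) / 4;
        }
        first += 4;
    }

    // Scalar tail: fewer than four elements remain.
    for (; first != last; ++first) {
        if (*first == value) {
            return first;
        }
    }
    return last;
}
```

The unaligned load keeps the sketch short; it never reads outside [first, last) because the vector loop only runs while at least four elements remain.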

⚖️ Size impact

The change adds more code. It also overrides (disables) the /Os option for vector_algorithm.cpp via a pragma.
DLLs and their PDBs are not affected; static libraries are.
The impact is negligible for static libs and small for import libs.
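The post does not say which pragma is used. As an assumption for illustration, MSVC's `#pragma optimize("t", on)` is one way to favor fast code for the functions that follow it, even when the translation unit is compiled with /Os. The function `count_byte` below is a hypothetical placeholder for the code that should stay speed-optimized.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#ifdef _MSC_VER
// MSVC only: favor fast code for functions defined after this point,
// regardless of a size-favoring option on the command line.
#pragma optimize("t", on)
#endif

// Hypothetical hot function that should not be optimized for size.
std::size_t count_byte(const std::uint8_t* first, std::size_t size, std::uint8_t value) {
    assert(first != nullptr || size == 0);
    std::size_t result = 0;
    for (std::size_t i = 0; i < size; ++i) {
        result += (first[i] == value) ? 1u : 0u;
    }
    return result;
}

#ifdef _MSC_VER
// Restore the optimization settings given on the command line.
#pragma optimize("", on)
#endif
```

The `_MSC_VER` guard is only there so that the sketch also compiles with other compilers, which would otherwise warn about an unknown pragma.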

Table
| File name | Size before (bytes) | Size after (bytes) |
|---|---|---|
| libcpmt.lib | 31,713,452 | 31,801,394 |
| libcpmt1.lib | 32,551,896 | 32,641,100 |
| libcpmtd.lib | 33,254,860 | 33,352,526 |
| libcpmtd0.lib | 32,095,116 | 32,192,820 |
| libcpmtd1.lib | 32,987,834 | 33,085,806 |
| msvcprt.lib | 1,038,290 | 1,119,228 |
| msvcprtd.lib | 1,045,802 | 1,133,746 |
| stl_asan.lib | 3,030 | 3,030 |

✔️ Test coverage

  • Expand Dev11_0316853_find_memchr_optimization to test the new types, mostly their limits
  • Add GH_002431_byte_range_find_with_unreachable_sentinel to test that elements are read as if sequentially when the size is not known
  • Expand VSO_0000000_vector_algorithms to test the newly vectorized cases with various sizes
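The "as if sequential" requirement matters because a find without a known end must not touch memory past the match that a scalar loop would never have reached. One common technique, sketched below as an assumption and not taken from the PR, is to use only 16-byte aligned loads: an aligned load cannot cross a page boundary, so it cannot fault on a page the scalar loop would not have read. `find_unbounded_sse2` and `lowest_set_bit` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: index of the lowest set bit of a nonzero mask.
inline unsigned lowest_set_bit(unsigned mask) {
    assert(mask != 0);
    unsigned index = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        ++index;
    }
    return index;
}

// Hypothetical sketch: find `value` starting at `first` with no known end.
// Precondition: `value` occurs somewhere at or after `first`.
// Note: aligned blocks may cover bytes outside the caller's range; this
// relies on page-granular memory protection and is not portable C++.
inline const std::uint8_t* find_unbounded_sse2(const std::uint8_t* first, std::uint8_t value) {
    assert(first != nullptr);
    const auto address = reinterpret_cast<std::uintptr_t>(first);
    const unsigned misalignment = static_cast<unsigned>(address & 15u);
    const std::uint8_t* block = first - misalignment; // round down to 16 bytes
    const __m128i needle = _mm_set1_epi8(static_cast<char>(value));

    // First block: discard matches in the bytes that precede `first`.
    __m128i data = _mm_load_si128(reinterpret_cast<const __m128i*>(block));
    unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(data, needle)));
    mask &= 0xFFFFu << misalignment;

    // Following blocks: aligned, so each load stays within one page.
    while (mask == 0) {
        block += 16;
        data = _mm_load_si128(reinterpret_cast<const __m128i*>(block));
        mask = static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(data, needle)));
    }
    return block + lowest_set_bit(mask);
}
```

A test in the spirit of the one named above would place the match right before an inaccessible page and check that the search returns without faulting.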

⚠️ ARM64EC Note

I didn't try to make this work on ARM64EC, as I'm not sure how to do it correctly.

  • From the other parallel algorithms, it looks like I should enable SSE but not AVX (see the _M_ARM64EC usage)
  • From <bit>, it looks like the ARM version of popcount is used instead of the x86 one
  • Finally, parallel algorithms are disabled under _M_HYBRID, which might have something to do with ARM64EC

It would be good to have "Make ARM64EC built on GitHub" (#2310) addressed, so that these changes can be tested.
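Popcount comes up in the list above because a vectorized count typically compares a whole vector against the value, extracts one bit per byte with a movemask, and adds the number of set bits to the total. The sketch below shows that pattern for 1-byte elements with SSE2. It is an assumption about the general shape, not the PR's code; `count_bytes_sse2` and `portable_popcount` are hypothetical names, and the bit-counting loop stands in for a hardware popcount instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical stand-in for a hardware popcount: number of set bits.
inline unsigned portable_popcount(unsigned mask) {
    unsigned bits = 0;
    while (mask != 0) {
        mask &= mask - 1; // clear the lowest set bit
        ++bits;
    }
    return bits;
}

// Hypothetical sketch: counts the bytes equal to `value` in [first, first + size).
inline std::size_t count_bytes_sse2(const std::uint8_t* first, std::size_t size, std::uint8_t value) {
    assert(first != nullptr || size == 0);
    const __m128i needle = _mm_set1_epi8(static_cast<char>(value));
    std::size_t result = 0;
    std::size_t i = 0;

    // Vector part: 16 bytes per iteration, one mask bit per matching byte.
    for (; i + 16 <= size; i += 16) {
        const __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(first + i));
        const unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(data, needle)));
        result += portable_popcount(mask);
    }

    // Scalar tail: fewer than 16 bytes remain.
    for (; i < size; ++i) {
        result += (first[i] == value) ? 1u : 0u;
    }
    return result;
}
```

On a target without the x86 popcount instruction, only the bit-counting step would need a different implementation; the compare and movemask steps are plain SSE2.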

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner December 18, 2021 17:03
@AlexGuteniev AlexGuteniev changed the title Vector find SSE2 & AVX2 std::find Dec 18, 2021
@StephanTLavavej StephanTLavavej added the bug Something isn't working label Mar 31, 2022
@StephanTLavavej StephanTLavavej removed their assignment Mar 31, 2022
@StephanTLavavej
Member

This is amazing, thanks! 😻 I think this will be ready to merge after resolving the issues I found - please let me know if you'd like me to push changes for them (none were in the core vectorized machinery - the number was just large enough that I didn't feel comfortable pushing changes without asking first).

@AlexGuteniev
Contributor Author

Yes please go ahead!

@StephanTLavavej StephanTLavavej removed the affects redist Results in changes to separately compiled bits label Apr 1, 2022
@StephanTLavavej StephanTLavavej self-assigned this Apr 1, 2022
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej
Member

To fix an internal build break (where the new vector algorithms wouldn't compile for ARM64EC), I've completely disabled the vector algorithms for ARM64EC. This includes the existing attempt to enable the SSE2 parts of reverse and reverse_copy. When investigating how to add ARM64EC build coverage, @cbezault found he had to do the same thing.

We can explore re-enabling the vector algorithms (and extending them to find/count) for ARM64EC in the future, but for now I believe that this is the lowest-risk option.
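As a rough illustration of what such a switch can look like, the sketch below gates a vectorized path on the target macros. The macro `EXAMPLE_USE_VECTOR_ALGORITHMS` and the function `example_find_path` are hypothetical names for this note; the library's actual condition may differ. The explicit `_M_ARM64EC` check is needed because MSVC also defines `_M_X64` when targeting ARM64EC.

```cpp
// Hypothetical opt-out: enable the vectorized path only for plain x86/x64,
// and disable it for ARM64EC and for _M_HYBRID builds.
#if (defined(_M_IX86) || defined(_M_X64)) && !defined(_M_ARM64EC) && !defined(_M_HYBRID)
#define EXAMPLE_USE_VECTOR_ALGORITHMS 1
#else
#define EXAMPLE_USE_VECTOR_ALGORITHMS 0
#endif

// Reports which implementation a caller would get on this target.
inline const char* example_find_path() {
#if EXAMPLE_USE_VECTOR_ALGORITHMS
    return "vectorized";
#else
    return "scalar";
#endif
}
```

Keeping the decision in one macro means that re-enabling the vectorized path for ARM64EC later would be a change to a single condition.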

@StephanTLavavej
Member

Thanks for this major performance improvement and bugfix! 🚀 🐞 😻

@cbezault
Contributor

cbezault commented Apr 4, 2022

Disabling these algorithms for ARM64EC will be a major perf degradation for x64 binaries running on ARM64 devices 😢 that dynamically link or that statically link and directly target ARM64EC.

@StephanTLavavej
Member

I'll happily approve re-enabling them once we have confidence that we're not damaging reverse/reverse_copy correctness - the broken instruction you mentioned made me extremely nervous that we had over-extended the code beyond our test coverage.

Labels
bug Something isn't working performance Must go faster
Projects
None yet
6 participants