Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve count vectorization: replace popcnt implementation with vector counting #4614

Merged
merged 27 commits into from
Apr 26, 2024

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Apr 21, 2024

Somewhat better in a long run, some pessimization for a short one.
Before:

----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
bm<uint8_t, Op::Count>/8021/3056        97.4 ns         95.3 ns     10000000
bm<uint8_t, Op::Count>/63/62            3.86 ns         3.13 ns    194782609
bm<uint8_t, Op::Count>/31/30            9.98 ns         8.89 ns     89600000
bm<uint8_t, Op::Count>/15/14            5.00 ns         4.97 ns    144516129
bm<uint8_t, Op::Count>/7/6              3.08 ns         3.14 ns    248888889
bm<uint16_t, Op::Count>/8021/3056        198 ns          193 ns      3733333
bm<uint16_t, Op::Count>/63/62           3.92 ns         3.75 ns    179200000
bm<uint16_t, Op::Count>/31/30           3.51 ns         3.53 ns    203636364
bm<uint16_t, Op::Count>/15/14           4.06 ns         2.13 ns    248888889
bm<uint16_t, Op::Count>/7/6             3.31 ns         1.79 ns    280000000
bm<uint32_t, Op::Count>/8021/3056        364 ns          245 ns      3446154
bm<uint32_t, Op::Count>/63/62           4.62 ns         2.20 ns    248888889
bm<uint32_t, Op::Count>/31/30           3.45 ns         2.29 ns    464592593
bm<uint32_t, Op::Count>/15/14           2.83 ns         1.50 ns    407272727
bm<uint32_t, Op::Count>/7/6             3.05 ns         2.08 ns    263529412
bm<uint64_t, Op::Count>/8021/3056        737 ns          600 ns      1120000
bm<uint64_t, Op::Count>/63/62           8.65 ns         6.25 ns    100000000
bm<uint64_t, Op::Count>/31/30           4.75 ns         3.39 ns    248888889
bm<uint64_t, Op::Count>/15/14           3.54 ns         1.79 ns    280000000
bm<uint64_t, Op::Count>/7/6             3.08 ns         2.05 ns    640000000

After:

----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
bm<uint8_t, Op::Count>/8021/3056        77.5 ns         53.5 ns     27151515
bm<uint8_t, Op::Count>/63/62            4.34 ns         3.52 ns    497777778
bm<uint8_t, Op::Count>/31/30            10.1 ns         7.98 ns    172307692
bm<uint8_t, Op::Count>/15/14            5.30 ns         3.96 ns    320000000
bm<uint8_t, Op::Count>/7/6              3.50 ns         2.42 ns    471578947
bm<uint16_t, Op::Count>/8021/3056        136 ns          105 ns     10000000
bm<uint16_t, Op::Count>/63/62           4.85 ns         3.40 ns    358400000
bm<uint16_t, Op::Count>/31/30           4.33 ns         3.31 ns    471578947
bm<uint16_t, Op::Count>/15/14           5.08 ns         3.71 ns    358400000
bm<uint16_t, Op::Count>/7/6             3.64 ns         2.49 ns    426666667
bm<uint32_t, Op::Count>/8021/3056        254 ns          206 ns      6892308
bm<uint32_t, Op::Count>/63/62           4.60 ns         2.52 ns    670690059
bm<uint32_t, Op::Count>/31/30           4.23 ns         1.80 ns    995555556
bm<uint32_t, Op::Count>/15/14           3.72 ns         1.66 ns   1000000000
bm<uint32_t, Op::Count>/7/6             3.88 ns         1.48 ns    896000000
bm<uint64_t, Op::Count>/8021/3056        724 ns          392 ns      3584000
bm<uint64_t, Op::Count>/63/62           6.15 ns         2.42 ns    426666667
bm<uint64_t, Op::Count>/31/30           4.09 ns         2.07 ns    828499485
bm<uint64_t, Op::Count>/15/14           3.43 ns         1.42 ns   1000000000
bm<uint64_t, Op::Count>/7/6             3.15 ns         1.25 ns    814545455

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner April 21, 2024 16:39
@StephanTLavavej StephanTLavavej added the performance Must go faster label Apr 21, 2024
@StephanTLavavej StephanTLavavej self-assigned this Apr 21, 2024
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
tests/std/tests/VSO_0000000_vector_algorithms/test.cpp Outdated Show resolved Hide resolved
@StephanTLavavej

This comment was marked as resolved.

@StephanTLavavej StephanTLavavej removed their assignment Apr 25, 2024
@AlexGuteniev

This comment was marked as resolved.

@AlexGuteniev AlexGuteniev marked this pull request as draft April 25, 2024 13:13
@AlexGuteniev AlexGuteniev marked this pull request as ready for review April 25, 2024 14:14
@AlexGuteniev
Copy link
Contributor Author

Let me see if I'll be able to optimize this in easier way to reason about, or if making tail scalar would be better here.

I think it is good to use __popcnt approach for the tail. I've pushed this change.

@StephanTLavavej StephanTLavavej self-assigned this Apr 25, 2024
@StephanTLavavej
Copy link
Member

Thanks! 😻 I pushed a commit to further clarify the comments. Things make so much more sense now!

I verified that my manual 20 GB test case passes. I reran the benchmarks, and it looks like you've reversed the perf damage I did, which is great. Curiously, bm<uint64_t, Op::Count>/8021/3056 is now 2x faster compared to main, which wasn't present in the previous iterations - I'm not sure what changed for the 8-byte codepath though!

Benchmark (Times in ns) main Alex1 STL Alex2
bm<uint8_t, Op::Count>/8021/3056 113 62.3 62.7 61.2
bm<uint8_t, Op::Count>/63/62 4.70 5.36 5.80 5.56
bm<uint8_t, Op::Count>/31/30 6.84 10.5 10.5 9.73
bm<uint8_t, Op::Count>/15/14 6.80 9.99 10.0 6.85
bm<uint8_t, Op::Count>/7/6 5.10 6.62 6.59 5.11
bm<uint16_t, Op::Count>/8021/3056 277 117 118 116
bm<uint16_t, Op::Count>/63/62 4.91 6.46 8.10 6.77
bm<uint16_t, Op::Count>/31/30 4.28 6.07 7.68 6.31
bm<uint16_t, Op::Count>/15/14 5.13 7.21 7.06 6.78
bm<uint16_t, Op::Count>/7/6 4.87 6.65 6.63 5.52
bm<uint32_t, Op::Count>/8021/3056 543 223 224 221
bm<uint32_t, Op::Count>/63/62 5.34 6.64 7.68 6.59
bm<uint32_t, Op::Count>/31/30 4.49 5.78 6.84 5.77
bm<uint32_t, Op::Count>/15/14 3.84 5.56 6.63 5.50
bm<uint32_t, Op::Count>/7/6 4.90 5.77 5.77 5.54
bm<uint64_t, Op::Count>/8021/3056 864 872 865 438 💚
bm<uint64_t, Op::Count>/63/62 9.61 9.65 9.63 7.27
bm<uint64_t, Op::Count>/31/30 6.19 6.41 6.22 5.56
bm<uint64_t, Op::Count>/15/14 4.70 4.61 4.93 4.67
bm<uint64_t, Op::Count>/7/6 4.08 4.12 4.50 4.48

@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 362228f into microsoft:main Apr 26, 2024
39 checks passed
@StephanTLavavej
Copy link
Member

Thanks again for continuing to refine this important algorithm, and working so hard to educate my cat-sized brain about SIMD! 🐈 🧠 💚

@AlexGuteniev AlexGuteniev deleted the count_is_not_find branch April 27, 2024 04:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants