Spannified internals of BigInteger #35565

Merged
merged 99 commits into from
Oct 8, 2021

Conversation

sakno
Contributor

@sakno sakno commented Apr 28, 2020

Proposed refactoring in accordance with issue #22609. The implementation plan consists of the following steps:

  • Replace unmanaged pointers with managed ones as well as remove unsafe code and pinning
  • Beautify stack allocation
  • Use spans wherever possible, especially for memory slicing
  • Simplify (or possibly remove altogether) the BitsBuffer value type
  • Spannify FastReducer value type
  • Attempt to replace some array allocations with span slicing
  • Square, Multiply and bitwise operations allocate arrays on the heap when their length is greater than or equal to the stack allocation threshold. Maybe replace this with array pooling using the shared ArrayPool<T>?
  • BMI intrinsics (moved to a separate issue: BigInteger performance improvements #41495)
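
The stack-allocation-with-pooling idea from the list above can be sketched roughly as follows. This is only an illustration: the threshold value, the class name and the `HammingDistance` helper are hypothetical and are not taken from the PR.

```csharp
using System;
using System.Buffers;
using System.Numerics;

static class ScratchBufferSketch
{
    // Hypothetical threshold (in uint limbs); the value used by BigInteger is not stated in this thread.
    private const int StackAllocThreshold = 64;

    // Counts differing bits between two equally sized limb sequences.
    // The scratch buffer exists only to demonstrate how the buffer is acquired and released.
    public static int HammingDistance(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right)
    {
        if (left.Length != right.Length)
            throw new ArgumentException("Operands must have the same length.");

        int size = left.Length;
        uint[]? rented = null;

        // Small inputs use the stack; larger ones rent from the shared pool instead of allocating.
        Span<uint> scratch = (size <= StackAllocThreshold
            ? stackalloc uint[StackAllocThreshold]
            : rented = ArrayPool<uint>.Shared.Rent(size)).Slice(0, size);

        try
        {
            for (int i = 0; i < scratch.Length; i++)
                scratch[i] = left[i] ^ right[i];

            int bits = 0;
            foreach (uint limb in scratch)
                bits += BitOperations.PopCount(limb);
            return bits;
        }
        finally
        {
            // Rented arrays must be returned; stack memory is released automatically.
            if (rented != null)
                ArrayPool<uint>.Shared.Return(rented);
        }
    }

    static void Main()
    {
        Console.WriteLine(HammingDistance(new uint[] { 0b1010 }, new uint[] { 0b0110 })); // prints 2
    }
}
```

Slicing to `size` matters in both branches: the stack buffer is always threshold-sized, and `Rent` may hand back an array longer than requested.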

The spannified versions of the internal and private static methods look pretty nice. However, I'm not sure about the performance of passing a span to a method. If RyuJIT applies scalar replacement, that's good news. Otherwise, it may be better to pass the length and a managed pointer to the first element as separate arguments to ensure that they are passed in registers. This version was implemented in the first commit. I need advice here, as well as a preliminary code review, because further work depends entirely on the signatures of the spannified methods.
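
To make the two candidate signature styles concrete, here is a minimal sketch of an in-place addition written both ways. The method name `AddSelf`, the class name and the precondition `left.Length >= right.Length` are assumptions for illustration, not the PR's actual code.

```csharp
using System;
using System.Runtime.CompilerServices;

static class AddSelfSketch
{
    // Span-based form: reference and length travel together and indexing is bounds-checked.
    // Assumes left.Length >= right.Length (hypothetical precondition for this sketch).
    public static uint AddSelf(Span<uint> left, ReadOnlySpan<uint> right)
    {
        ulong carry = 0;
        int i = 0;
        for (; i < right.Length; i++)
        {
            ulong digit = (ulong)left[i] + right[i] + carry;
            left[i] = (uint)digit;
            carry = digit >> 32;
        }
        for (; carry != 0 && i < left.Length; i++)
        {
            ulong digit = left[i] + carry;
            left[i] = (uint)digit;
            carry = digit >> 32;
        }
        return (uint)carry;
    }

    // Managed pointer + length form: no bounds checks, so the caller is fully trusted.
    public static uint AddSelf(ref uint left, int leftLength, ref uint right, int rightLength)
    {
        ulong carry = 0;
        int i = 0;
        for (; i < rightLength; i++)
        {
            ref uint l = ref Unsafe.Add(ref left, i);
            ulong digit = (ulong)l + Unsafe.Add(ref right, i) + carry;
            l = (uint)digit;
            carry = digit >> 32;
        }
        for (; carry != 0 && i < leftLength; i++)
        {
            ref uint l = ref Unsafe.Add(ref left, i);
            ulong digit = l + carry;
            l = (uint)digit;
            carry = digit >> 32;
        }
        return (uint)carry;
    }

    static void Main()
    {
        uint[] a = { 0xFFFFFFFF, 0 };
        uint[] b = { 1 };
        uint carry = AddSelf(a, b);
        Console.WriteLine($"{a[0]} {a[1]} carry={carry}"); // prints "0 1 carry=0"

        uint[] c = { 0xFFFFFFFF, 0 };
        AddSelf(ref c[0], c.Length, ref b[0], b.Length);
        Console.WriteLine($"{c[0]} {c[1]}"); // prints "0 1"
    }
}
```

The span overload keeps the safety net at the cost of possible bounds checks; whether the JIT removes them, and how the span is passed, is exactly the open question raised above.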

@ghost

ghost commented Apr 28, 2020

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

Member

@gfoidl gfoidl left a comment


You can get rid of some Unsafe.Add usage by letting the JIT do the work and have the safety back.
(Didn't comment on all places, please do one pass of clean-up.)

Plus some nits.

@sakno
Contributor Author

sakno commented Apr 29, 2020

@gfoidl, could you please take a look at the next iteration (commit a1e5006)? I tried to reduce memory allocation for bitwise operations, but I'm not satisfied with the need for a delegate type, which was introduced to reduce code duplication. Moreover, this approach leads to a delegate instance allocation on every call. AFAIK, the current version of C# still doesn't support method pointers.

ToUInt32Array has been renamed to CopyTo and now works with memory allocated by the immediate caller.
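
One common way to share a loop across operations without a delegate is a generic method constrained to a struct that implements a small interface. The sketch below shows that pattern; the interface and type names are hypothetical, and it is not claimed to be what the PR ended up doing. (C# 9 later added function pointers via `delegate*`, which require an unsafe context.)

```csharp
using System;

// Hypothetical operator abstraction used only for this sketch.
interface IBitwiseOp
{
    uint Invoke(uint left, uint right);
}

readonly struct AndOp : IBitwiseOp { public uint Invoke(uint left, uint right) => left & right; }
readonly struct OrOp : IBitwiseOp { public uint Invoke(uint left, uint right) => left | right; }
readonly struct XorOp : IBitwiseOp { public uint Invoke(uint left, uint right) => left ^ right; }

static class BitwiseSketch
{
    // TOp is a struct, so no delegate or other heap object is created per call,
    // and the JIT compiles a separate specialization for each operator type.
    // Assumes left, right and result all have the same length.
    public static void Apply<TOp>(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> result)
        where TOp : struct, IBitwiseOp
    {
        TOp op = default;
        for (int i = 0; i < result.Length; i++)
            result[i] = op.Invoke(left[i], right[i]);
    }

    static void Main()
    {
        uint[] result = new uint[2];
        Apply<XorOp>(new uint[] { 0b1100, 1 }, new uint[] { 0b1010, 1 }, result);
        Console.WriteLine($"{result[0]} {result[1]}"); // prints "6 0"
    }
}
```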

@danmoseley
Member

@GrabYourPitchforks sounds good

@danmoseley
Member

@sakno I haven't used it myself, but I believe you can run this on the before and after results to get a better comparison:
https://github.com/dotnet/performance/blob/main/src/tools/ResultsComparer/README.md
Hopefully it looks at allocations as well.

@stephentoub
Member

stephentoub commented Aug 26, 2021

> I believe you can run this on the before and after results to get a better comparison

I just add this to my dotnet run benchmarkdotnet cmd line:

--corerun d:\coreclrtest\main\corerun.exe d:\coreclrtest\pr\corerun.exe

where those main and pr folders contain a copy of the contents of D:\repos\runtime\artifacts\bin\testhost\net6.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0 before and after the change.

@danmoseley
Member

Just checking in on this one since it's so old. It seems the next action is with @GrabYourPitchforks, who offered to have another look through. And @sakno, if you are interested, you could try the trick above to get a single table.

@GrabYourPitchforks
Member

I had something come up which will make me busy for the next few days, but I can continue driving this once my schedule frees up. Should we un-milestone from 6.0 then?

@danmoseley modified the milestones: 6.0.0, 7.0.0 Aug 30, 2021
@danmoseley
Member

Yes, they won't take this change into 6.0.

@sakno
Contributor Author

sakno commented Sep 6, 2021

| Method | Branch | numberString | arguments | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ctor_ByteArray | main | -2147483648 | ? | 13.954 ns | 0.1019 ns | 0.0851 ns | 13.941 ns | 13.819 ns | 14.118 ns | 1.00 | 0.00 | - | - |
| Ctor_ByteArray | PR | -2147483648 | ? | 14.761 ns | 0.0901 ns | 0.0799 ns | 14.760 ns | 14.580 ns | 14.865 ns | 1.06 | 0.01 | - | - |
| ToByteArray | main | -2147483648 | ? | 20.727 ns | 0.7729 ns | 0.8900 ns | 20.618 ns | 19.637 ns | 22.860 ns | 1.00 | 0.00 | 0.0101 | 32 B |
| ToByteArray | PR | -2147483648 | ? | 20.204 ns | 0.3181 ns | 0.2975 ns | 20.061 ns | 19.861 ns | 20.859 ns | 0.99 | 0.04 | 0.0102 | 32 B |
| Parse | main | -2147483648 | ? | 194.382 ns | 4.7403 ns | 5.4589 ns | 192.446 ns | 189.305 ns | 209.048 ns | 1.00 | 0.00 | 0.0427 | 136 B |
| Parse | PR | -2147483648 | ? | 185.266 ns | 0.9011 ns | 0.7525 ns | 184.929 ns | 184.252 ns | 186.515 ns | 0.94 | 0.03 | 0.0426 | 136 B |
| ToStringX | main | -2147483648 | ? | 80.635 ns | 3.3895 ns | 3.7674 ns | 78.761 ns | 75.194 ns | 88.298 ns | 1.00 | 0.00 | 0.0126 | 40 B |
| ToStringX | PR | -2147483648 | ? | 72.292 ns | 0.6056 ns | 0.5369 ns | 72.084 ns | 71.747 ns | 73.678 ns | 0.89 | 0.05 | 0.0126 | 40 B |
| ToStringD | main | -2147483648 | ? | 76.453 ns | 0.8156 ns | 0.7230 ns | 76.179 ns | 75.101 ns | 77.658 ns | 1.00 | 0.00 | 0.0481 | 152 B |
| ToStringD | PR | -2147483648 | ? | 71.113 ns | 0.2899 ns | 0.2712 ns | 71.067 ns | 70.690 ns | 71.631 ns | 0.93 | 0.01 | 0.0484 | 152 B |
| Add | main | ? | 1024,1024 bits | 56.347 ns | 1.1633 ns | 1.1946 ns | 55.965 ns | 55.397 ns | 59.876 ns | 1.00 | 0.00 | 0.0510 | 160 B |
| Add | PR | ? | 1024,1024 bits | 56.439 ns | 0.2406 ns | 0.2009 ns | 56.427 ns | 56.043 ns | 56.775 ns | 1.00 | 0.02 | 0.0509 | 160 B |
| Subtract | main | ? | 1024,1024 bits | 57.248 ns | 0.3050 ns | 0.2853 ns | 57.254 ns | 56.701 ns | 57.831 ns | 1.00 | 0.00 | 0.0482 | 152 B |
| Subtract | PR | ? | 1024,1024 bits | 57.372 ns | 0.5303 ns | 0.4960 ns | 57.226 ns | 56.800 ns | 58.195 ns | 1.00 | 0.01 | 0.0484 | 152 B |
| Multiply | main | ? | 1024,1024 bits | 1,018.982 ns | 5.7862 ns | 5.4124 ns | 1,017.372 ns | 1,010.925 ns | 1,029.580 ns | 1.00 | 0.00 | 0.0857 | 280 B |
| Multiply | PR | ? | 1024,1024 bits | 1,070.928 ns | 4.9263 ns | 4.6081 ns | 1,070.226 ns | 1,064.082 ns | 1,078.841 ns | 1.05 | 0.01 | 0.0859 | 280 B |
| GreatestCommonDivisor | main | ? | 1024,1024 bits | 10,890.242 ns | 76.4818 ns | 67.7992 ns | 10,885.738 ns | 10,799.565 ns | 10,999.333 ns | 1.00 | 0.00 | 0.0870 | 304 B |
| GreatestCommonDivisor | PR | ? | 1024,1024 bits | 11,022.761 ns | 86.6214 ns | 76.7876 ns | 10,992.526 ns | 10,948.845 ns | 11,209.706 ns | 1.01 | 0.01 | - | - |
| ModPow | main | ? | 1024,1024,64 bits | 176,924.900 ns | 1,493.7503 ns | 1,247.3492 ns | 177,199.311 ns | 174,384.102 ns | 178,660.212 ns | 1.00 | 0.00 | - | 305 B |
| ModPow | PR | ? | 1024,1024,64 bits | 167,633.288 ns | 803.3346 ns | 627.1908 ns | 167,786.488 ns | 166,646.389 ns | 168,750.669 ns | 0.95 | 0.01 | - | 33 B |
| Divide | main | ? | 1024,512 bits | 642.343 ns | 4.2045 ns | 3.7272 ns | 642.788 ns | 636.328 ns | 648.605 ns | 1.00 | 0.00 | 0.0286 | 96 B |
| Divide | PR | ? | 1024,512 bits | 662.570 ns | 7.4316 ns | 6.2057 ns | 661.077 ns | 656.037 ns | 678.460 ns | 1.03 | 0.01 | 0.0291 | 96 B |
| Remainder | main | ? | 1024,512 bits | 701.984 ns | 3.9658 ns | 3.7096 ns | 701.717 ns | 696.460 ns | 708.885 ns | 1.00 | 0.00 | 0.0748 | 240 B |
| Remainder | PR | ? | 1024,512 bits | 657.863 ns | 3.2204 ns | 3.0123 ns | 657.702 ns | 652.353 ns | 663.535 ns | 0.94 | 0.00 | 0.0264 | 88 B |
| Ctor_ByteArray | main | 123 | ? | 9.109 ns | 0.1340 ns | 0.1253 ns | 9.082 ns | 8.944 ns | 9.307 ns | 1.00 | 0.00 | - | - |
| Ctor_ByteArray | PR | 123 | ? | 8.949 ns | 0.0736 ns | 0.0688 ns | 8.951 ns | 8.853 ns | 9.075 ns | 0.98 | 0.02 | - | - |
| ToByteArray | main | 123 | ? | 16.474 ns | 0.1133 ns | 0.1060 ns | 16.497 ns | 16.320 ns | 16.688 ns | 1.00 | 0.00 | 0.0101 | 32 B |
| ToByteArray | PR | 123 | ? | 16.809 ns | 0.2037 ns | 0.1905 ns | 16.770 ns | 16.528 ns | 17.156 ns | 1.02 | 0.01 | 0.0102 | 32 B |
| Parse | main | 123 | ? | 125.084 ns | 1.7400 ns | 1.5425 ns | 125.059 ns | 123.403 ns | 129.047 ns | 1.00 | 0.00 | 0.0328 | 104 B |
| Parse | PR | 123 | ? | 125.753 ns | 1.0728 ns | 0.9510 ns | 125.476 ns | 124.542 ns | 127.676 ns | 1.01 | 0.02 | 0.0328 | 104 B |
| ToStringX | main | 123 | ? | 60.686 ns | 0.4545 ns | 0.4251 ns | 60.796 ns | 59.957 ns | 61.458 ns | 1.00 | 0.00 | 0.0100 | 32 B |
| ToStringX | PR | 123 | ? | 61.624 ns | 0.3370 ns | 0.3153 ns | 61.519 ns | 61.216 ns | 62.294 ns | 1.02 | 0.01 | 0.0100 | 32 B |
| ToStringD | main | 123 | ? | 43.681 ns | 0.1716 ns | 0.1521 ns | 43.717 ns | 43.313 ns | 43.854 ns | 1.00 | 0.00 | 0.0101 | 32 B |
| ToStringD | PR | 123 | ? | 45.596 ns | 0.3587 ns | 0.3180 ns | 45.640 ns | 45.087 ns | 46.047 ns | 1.04 | 0.01 | 0.0101 | 32 B |
| Ctor_ByteArray | main | 123456789012(...)901234567890 [200] | ? | 144.698 ns | 1.3993 ns | 1.3089 ns | 144.616 ns | 142.887 ns | 147.656 ns | 1.00 | 0.00 | 0.0352 | 112 B |
| Ctor_ByteArray | PR | 123456789012(...)901234567890 [200] | ? | 143.912 ns | 0.6425 ns | 0.6010 ns | 144.048 ns | 143.003 ns | 144.906 ns | 0.99 | 0.01 | 0.0355 | 112 B |
| ToByteArray | main | 123456789012(...)901234567890 [200] | ? | 69.283 ns | 0.5334 ns | 0.4454 ns | 69.176 ns | 68.703 ns | 70.031 ns | 1.00 | 0.00 | 0.0356 | 112 B |
| ToByteArray | PR | 123456789012(...)901234567890 [200] | ? | 70.821 ns | 0.3920 ns | 0.3667 ns | 70.853 ns | 70.167 ns | 71.335 ns | 1.02 | 0.01 | 0.0357 | 112 B |
| Parse | main | 123456789012(...)901234567890 [200] | ? | 1,743.566 ns | 7.1283 ns | 5.9525 ns | 1,742.759 ns | 1,734.400 ns | 1,755.052 ns | 1.00 | 0.00 | 0.3071 | 984 B |
| Parse | PR | 123456789012(...)901234567890 [200] | ? | 1,779.592 ns | 8.3089 ns | 7.3656 ns | 1,779.426 ns | 1,770.266 ns | 1,794.882 ns | 1.02 | 0.00 | 0.3131 | 984 B |
| ToStringX | main | 123456789012(...)901234567890 [200] | ? | 464.817 ns | 2.9414 ns | 2.6075 ns | 464.115 ns | 461.699 ns | 469.537 ns | 1.00 | 0.00 | 0.1137 | 360 B |
| ToStringX | PR | 123456789012(...)901234567890 [200] | ? | 456.938 ns | 2.6321 ns | 2.4621 ns | 456.267 ns | 453.956 ns | 461.539 ns | 0.98 | 0.01 | 0.1135 | 360 B |
| ToStringD | main | 123456789012(...)901234567890 [200] | ? | 1,083.860 ns | 6.4113 ns | 5.6835 ns | 1,083.582 ns | 1,072.836 ns | 1,094.347 ns | 1.00 | 0.00 | 0.3151 | 992 B |
| ToStringD | PR | 123456789012(...)901234567890 [200] | ? | 1,041.487 ns | 9.7771 ns | 8.6671 ns | 1,038.352 ns | 1,032.884 ns | 1,058.836 ns | 0.96 | 0.01 | 0.3155 | 992 B |
| Add | main | ? | 16,16 bits | 11.229 ns | 0.0470 ns | 0.0393 ns | 11.229 ns | 11.162 ns | 11.305 ns | 1.00 | 0.00 | - | - |
| Add | PR | ? | 16,16 bits | 7.047 ns | 0.0570 ns | 0.0506 ns | 7.045 ns | 6.973 ns | 7.149 ns | 0.63 | 0.00 | - | - |
| Subtract | main | ? | 16,16 bits | 11.216 ns | 0.0828 ns | 0.0774 ns | 11.204 ns | 11.092 ns | 11.341 ns | 1.00 | 0.00 | - | - |
| Subtract | PR | ? | 16,16 bits | 6.494 ns | 0.0715 ns | 0.0634 ns | 6.489 ns | 6.424 ns | 6.664 ns | 0.58 | 0.01 | - | - |
| Multiply | main | ? | 16,16 bits | 10.973 ns | 0.1227 ns | 0.1147 ns | 10.972 ns | 10.802 ns | 11.209 ns | 1.00 | 0.00 | - | - |
| Multiply | PR | ? | 16,16 bits | 7.742 ns | 0.1315 ns | 0.1230 ns | 7.731 ns | 7.602 ns | 8.052 ns | 0.71 | 0.01 | - | - |
| GreatestCommonDivisor | main | ? | 16,16 bits | 76.718 ns | 0.4069 ns | 0.3607 ns | 76.750 ns | 76.264 ns | 77.468 ns | 1.00 | 0.00 | - | - |
| GreatestCommonDivisor | PR | ? | 16,16 bits | 76.764 ns | 0.4106 ns | 0.3640 ns | 76.640 ns | 76.164 ns | 77.423 ns | 1.00 | 0.01 | - | - |
| ModPow | main | ? | 16,16,16 bits | 179.341 ns | 0.7725 ns | 0.7226 ns | 179.333 ns | 177.896 ns | 180.873 ns | 1.00 | 0.00 | - | - |
| ModPow | PR | ? | 16,16,16 bits | 200.235 ns | 1.0669 ns | 0.9458 ns | 200.308 ns | 198.739 ns | 202.131 ns | 1.12 | 0.01 | - | - |
| Divide | main | ? | 16,8 bits | 9.324 ns | 0.0411 ns | 0.0343 ns | 9.323 ns | 9.277 ns | 9.385 ns | 1.00 | 0.00 | - | - |
| Divide | PR | ? | 16,8 bits | 10.364 ns | 0.0747 ns | 0.0662 ns | 10.370 ns | 10.247 ns | 10.496 ns | 1.11 | 0.01 | - | - |
| Remainder | main | ? | 16,8 bits | 9.433 ns | 0.0795 ns | 0.0705 ns | 9.420 ns | 9.358 ns | 9.602 ns | 1.00 | 0.00 | - | - |
| Remainder | PR | ? | 16,8 bits | 10.091 ns | 0.0647 ns | 0.0605 ns | 10.081 ns | 9.978 ns | 10.178 ns | 1.07 | 0.01 | - | - |
| ModPow | main | ? | 16384,16384,64 bits | 2,982,284.055 ns | 20,304.5000 ns | 18,992.8419 ns | 2,981,140.681 ns | 2,951,121.619 ns | 3,024,813.569 ns | 1.00 | 0.00 | - | 2,236 B |
| ModPow | PR | ? | 16384,16384,64 bits | 3,149,539.852 ns | 52,327.1029 ns | 48,946.8045 ns | 3,144,966.418 ns | 3,064,405.190 ns | 3,231,942.696 ns | 1.06 | 0.02 | - | 47 B |
| Divide | main | ? | 65536,32768 bits | 5,211,368.132 ns | 20,965.3161 ns | 18,585.2067 ns | 5,207,763.708 ns | 5,186,675.948 ns | 5,253,423.531 ns | 1.00 | 0.00 | - | 12,363 B |
| Divide | PR | ? | 65536,32768 bits | 5,469,820.721 ns | 23,984.5013 ns | 22,435.1174 ns | 5,474,290.333 ns | 5,426,964.667 ns | 5,501,347.646 ns | 1.05 | 0.01 | - | 4,160 B |
| Remainder | main | ? | 65536,32768 bits | 5,274,423.949 ns | 40,466.6923 ns | 35,872.6687 ns | 5,260,983.656 ns | 5,228,020.708 ns | 5,351,705.146 ns | 1.00 | 0.00 | - | 12,355 B |
| Remainder | PR | ? | 65536,32768 bits | 5,327,585.356 ns | 21,686.8215 ns | 20,285.8663 ns | 5,329,391.406 ns | 5,293,440.948 ns | 5,360,807.240 ns | 1.01 | 0.01 | - | 4,146 B |
| Add | main | ? | 65536,65536 bits | 2,261.169 ns | 17.0800 ns | 15.1409 ns | 2,261.419 ns | 2,236.812 ns | 2,289.504 ns | 1.00 | 0.00 | 2.6132 | 8,224 B |
| Add | PR | ? | 65536,65536 bits | 2,474.616 ns | 12.8627 ns | 11.4024 ns | 2,476.329 ns | 2,454.378 ns | 2,490.396 ns | 1.09 | 0.01 | 2.6163 | 8,224 B |
| Subtract | main | ? | 65536,65536 bits | 2,298.021 ns | 25.6683 ns | 22.7543 ns | 2,294.305 ns | 2,268.096 ns | 2,338.891 ns | 1.00 | 0.00 | 2.6141 | 8,216 B |
| Subtract | PR | ? | 65536,65536 bits | 2,523.440 ns | 28.7476 ns | 24.0056 ns | 2,519.823 ns | 2,496.089 ns | 2,586.340 ns | 1.10 | 0.01 | 2.6145 | 8,216 B |
| Multiply | main | ? | 65536,65536 bits | 927,272.318 ns | 18,900.2201 ns | 21,765.5216 ns | 928,409.371 ns | 902,183.483 ns | 974,650.134 ns | 1.00 | 0.00 | 47.7941 | 153,523 B |
| Multiply | PR | ? | 65536,65536 bits | 1,004,700.576 ns | 3,988.3908 ns | 3,730.7432 ns | 1,004,817.846 ns | 998,809.113 ns | 1,011,070.479 ns | 1.08 | 0.02 | 4.1667 | 16,414 B |
| GreatestCommonDivisor | main | ? | 65536,65536 bits | 5,591,038.422 ns | 34,612.6535 ns | 32,376.6975 ns | 5,589,624.021 ns | 5,530,768.417 ns | 5,649,730.417 ns | 1.00 | 0.00 | - | 16,451 B |
| GreatestCommonDivisor | PR | ? | 65536,65536 bits | 6,993,940.330 ns | 78,301.4450 ns | 73,243.2202 ns | 6,972,483.861 ns | 6,905,103.250 ns | 7,155,806.472 ns | 1.25 | 0.02 | - | 32 B |

@jeffhandley
Member

@sakno Thanks for sharing those performance numbers. We're finally down to the final stages of getting this merged!

@GrabYourPitchforks, @bartonjs, @tannergooding, and I all spent time reviewing the performance results, looking over the changes again, and discussing with each other what the risks and benefits are with these changes. With such a large refactoring, I needed to ensure we're looking at this from all angles, including potential future improvements. We've concluded that the improvements you've made here do indeed put us in a better position overall. Thank you!

I've asked @GrabYourPitchforks to resolve the 2 conflicts that exist right now; he'll do that later this week and push directly to the PR's branch. After that, assuming the CI shows up as green, we'll finally merge this in. These changes will then be part of .NET 7.0 Preview 1.

Thank you so much for your persistence and all of your effort on this, @sakno! This is a significant contribution to .NET.

@sakno
Contributor Author

sakno commented Oct 6, 2021

@jeffhandley, @GrabYourPitchforks, thanks for the feedback! I can resolve the conflicts myself without any problem if you want. However, I see usage of the index expression (xd[^1]) that was previously proposed by Tanner and then rejected by Stephen. If we decide to use index expressions, then we need to use them everywhere across the BigInteger code base. Otherwise, they should be transformed to xd[xd.Length - 1].

@stephentoub
Member

> However, I see the usage of index expression (xd[^1]) that was previously proposed by Tanner and then rejected by Stephen.

Where did I reject it?

@sakno
Contributor Author

sakno commented Oct 6, 2021

@stephentoub, here is the reverting PR: #57297. It was also applied to this PR.
Comment by @tannergooding

Did some brief additional investigation and it looks like the only changes that really need to be reverted are the ones using C# indexing expressions.

The motivation was:

But to your specific question: yes, consider removing range expressions. We don't use them frequently within the base libraries because they tend to interfere with assembly trimming.

@stephentoub
Member

stephentoub commented Oct 6, 2021

Right, that wasn't me rejecting their usage, that was their usage in those cases showing to regress performance.

@jeffhandley
Member

@sakno Thanks! We'd happily take you up on the offer to resolve the 2 conflicts. And good catch on the range expressions having snuck in. In your conflict resolution, let's change those instances back to xd[xd.Length - 1].
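
For reference, the two spellings under discussion read the same element. The sketch below only demonstrates that equivalence; the helper names are hypothetical, and it makes no claim about the generated code or the performance difference mentioned in the thread.

```csharp
using System;

static class HighLimbSketch
{
    // Explicit form that the thread settles on.
    public static uint HighLimbExplicit(ReadOnlySpan<uint> xd) => xd[xd.Length - 1];

    // C# 8 index-from-end form.
    public static uint HighLimbFromEnd(ReadOnlySpan<uint> xd) => xd[^1];

    static void Main()
    {
        uint[] limbs = { 10, 20, 30 };
        Console.WriteLine(HighLimbExplicit(limbs)); // prints 30
        Console.WriteLine(HighLimbFromEnd(limbs));  // prints 30
    }
}
```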

@sakno
Contributor Author

sakno commented Oct 7, 2021

@jeffhandley , done.

Member

@GrabYourPitchforks GrabYourPitchforks left a comment


We missed one thing in earlier reviews, but otherwise commit 65485fa LGTM! 🥳

- uint divHi = right[rightLength - 1];
- uint divLo = rightLength > 1 ? right[rightLength - 2] : 0;
+ uint divHi = right[right.Length - 1];
+ uint divLo = right.Length > 1 ? right[right.Length - 2] : 0;

  // We measure the leading zeros of the divisor
  int shift = LeadingZeros(divHi);
Member


Cleanup opportunity we didn't spot earlier: get rid of this LeadingZeros method and prefer BitOperations.LeadingZeroCount instead.
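
A minimal sketch of the suggested cleanup, assuming the custom helper simply counts leading zero bits of a 32-bit limb. `LeadingZerosLoop` is a hypothetical stand-in (the real `LeadingZeros` body is not shown in this thread), and `DivisorShift` is an illustrative name.

```csharp
using System;
using System.Numerics;

static class LeadingZerosSketch
{
    // Hypothetical stand-in for a hand-written helper, kept only for comparison.
    public static int LeadingZerosLoop(uint value)
    {
        if (value == 0)
            return 32;

        int count = 0;
        while ((value & 0x80000000u) == 0)
        {
            count++;
            value <<= 1;
        }
        return count;
    }

    // Same computation via the framework API, which can map to a hardware instruction.
    public static int DivisorShift(ReadOnlySpan<uint> right)
    {
        uint divHi = right[right.Length - 1];
        return BitOperations.LeadingZeroCount(divHi);
    }

    static void Main()
    {
        uint[] divisor = { 5, 1 };
        Console.WriteLine(DivisorShift(divisor)); // prints 31
        Console.WriteLine(LeadingZerosLoop(1));   // prints 31
    }
}
```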

@GrabYourPitchforks
Member

@sakno We're at the final step! If you want to address #35565 (comment) (it's just deleting some dead code) as part of this PR, please feel free to do so, otherwise we can tackle it in a follow-up PR. Whatever you decide, I think we're good to merge right after we kick CI a bit.

@jeffhandley
Member

The test failure is #60119, which was fixed in #60140.

@sakno
Contributor Author

sakno commented Oct 8, 2021

@GrabYourPitchforks, we have one more potential cleanup in the BigIntegerCalculator.Utils file. There is a public static int Compare(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right) method that could be replaced with the MemoryExtensions.SequenceCompareTo method. It uses an IComparable<T>.CompareTo constrained call to compare elements. The span element is of type uint, but I expect that the JIT will be able to inline the CompareTo implementation.

Oh sorry, not applicable. SequenceCompareTo starts from the element at index zero, which is not suitable for BigInteger purposes.

Also, ActualLength can be replaced with MemoryExtensions.TrimEnd everywhere.
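
A small sketch of both points, assuming limbs are stored least significant first (index 0 is the lowest limb) and that compared operands have equal length. The method bodies are illustrative, not the PR's code.

```csharp
using System;

static class CompareSketch
{
    // Magnitude comparison must start at the most significant limb, i.e. the last index.
    // Assumes left.Length == right.Length (hypothetical precondition for this sketch).
    public static int CompareMagnitude(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right)
    {
        for (int i = left.Length - 1; i >= 0; i--)
        {
            if (left[i] != right[i])
                return left[i] < right[i] ? -1 : 1;
        }
        return 0;
    }

    // Length without trailing (most significant) zero limbs, expressed via MemoryExtensions.TrimEnd.
    public static int ActualLength(ReadOnlySpan<uint> value) => value.TrimEnd(0u).Length;

    static void Main()
    {
        ReadOnlySpan<uint> left = new uint[] { 5, 1 };  // 1 * 2^32 + 5
        ReadOnlySpan<uint> right = new uint[] { 1, 2 }; // 2 * 2^32 + 1

        Console.WriteLine(CompareMagnitude(left, right));       // prints -1: left is smaller
        Console.WriteLine(left.SequenceCompareTo(right) > 0);   // prints True: index 0 decides, wrong order
        Console.WriteLine(ActualLength(new uint[] { 7, 0, 0 })); // prints 1
    }
}
```

SequenceCompareTo is lexicographic from index zero, so it ranks `left` above `right` here even though `left` is numerically smaller.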

Member

@jeffhandley jeffhandley left a comment


Thank you, @sakno! 💯

@jeffhandley merged commit 22b516c into dotnet:main Oct 8, 2021
@ghost locked as resolved and limited conversation to collaborators Nov 7, 2021
@dakersnar
Contributor

@sakno I'm skimming these changes trying to determine if this PR is the cause of #70330. If you happen to have any insight, let me know.

Labels: area-System.Numerics, community-contribution (indicates that the PR has been added by a community member)