.NET 8 Per-Preview Performance report on WASM, Mono AOT, and Interpreter #84302

Closed
kotlarmilos opened this issue Apr 4, 2023 · 1 comment
Labels
area-Codegen-AOT-mono tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark tracking This issue is tracking the completion of other related issues.

kotlarmilos commented Apr 4, 2023

This report provides an overview of the major performance improvements and regressions in WASM, Mono AOT, and the Interpreter across the .NET 8 per-preview releases. It focuses on relevant improvements and regressions that are either in progress or under investigation; these are tracked separately. Reports #77490 and #79288 track active speed and size regressions, respectively.

The full benchmark report will be available in a form similar to #79245 and https://devblogs.microsoft.com/dotnet/performance_improvements_in_net_7/ when .NET 8 is released.

Setup

According to https://github.com/dotnet/perf-autofiling-issues, the following configurations are used.

Operating System Bit Processor Name
macOS 13.0 Arm64 Apple M1
ubuntu 18.04 X64 Intel Xeon CPU E5-1650 v4 3.60GHz

More details on .NET performance benchmarking are available at https://github.com/dotnet/performance.

Preview 7

The following section presents only major improvements with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Mono AOT compiler

The performance regressions and improvements are analyzed separately in #89238.

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 7.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 7.

Name Baseline Value Compare Value % Difference
PerfLabTests.EnumPerf.EnumEquals 646.25 229.29 -64.52
System.Tests.Perf_Enum.ToString_NonFlags_Small(value: TopDirectoryOnly) 633.28 235.90 -62.74
System.Tests.Perf_Enum.ToString_Format_Flags_Large(value: All, format: "g") 667.24 271.04
System.Reflection.Attributes.IsDefinedClassHitInherit 1315.59 562.93 -57.21
System.Reflection.Activator<EmptyStruct>.CreateInstanceGeneric 721.39 330.82 -54.14
System.Numerics.Tests.Perf_Vector4.SubtractOperatorBenchmark 20.82 9.59 -53.92
System.Reflection.Invoke.Method0_NoParms 853.86 399.59 -53.20
System.Numerics.Tests.Perf_Matrix4x4.CreateRotationZBenchmark 78.54 40.02 -49.03
System.Reflection.Attributes.IsDefinedMethodBaseMissInherit 2512.81 1431.26 -43.04
System.Numerics.Tests.Perf_Matrix4x4.MultiplyByScalarBenchmark 183.31 106.83 -41.71
System.Tests.Perf_Enum.InterpolateIntoStringBuilder_Flags(value: 32) 7501.15 4383.76 -41.55
System.Numerics.Tests.Perf_Vector3.TransformNormalByMatrix4x4Benchmark 189.92 111.79 -41.13
System.IO.Tests.Perf_RandomAccess.ReadScatter(fileSize: 1048576, buffersSize: 16384, options: None) 400115.22
System.Numerics.Tests.Perf_Matrix4x4.CreateRotationXWithCenterBenchmark 90.04 60.34 -32.98
System.Globalization.Tests.StringSearch.IsSuffix_DifferentLastChar(Options: (en-US, IgnoreCase, True)) 1024.28
System.Tests.Perf_Enum.StringFormat(value: Red, Green) 7002.80 4942.10
"System.Tests.Perf_Enum.ToString_Flags(value: Red Orange Yellow Green
System.Numerics.Tests.Perf_VectorOf<Byte>.AddBenchmark 11.28 8.19 -27.44
System.Numerics.Tests.Perf_Vector4.DivideByScalarBenchmark 30.25 21.97 -27.36
System.Numerics.Tests.Perf_Vector2.EqualsBenchmark 35.85 27.68 -22.78
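
The % Difference column in the table above appears to be the relative change from the baseline value to the compare value, so a negative number means less time per operation. The following is a minimal sketch under that assumption; the function name `percent_difference` is hypothetical and is not part of any benchmark tooling. Published values may differ in the last digit, presumably because they are computed from unrounded measurements.

```python
def percent_difference(baseline: float, compare: float) -> float:
    """Relative change from baseline to compare, in percent.

    Assumed definition: (compare - baseline) / baseline * 100.
    Negative results indicate an improvement (lower measured value).
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (compare - baseline) / baseline * 100.0


# Values taken from the PerfLabTests.EnumPerf.EnumEquals row above.
print(round(percent_difference(646.25, 229.29), 2))
```

The guard on a zero baseline matters for rows whose baseline rounds to 0.00, where a percentage is not meaningful.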

Vectorization of Vector4 in #87822 improved over 100 microbenchmarks reported in dotnet/perf-autofiling-issues#19758 and dotnet/perf-autofiling-issues#19760.

The fix for the empty-partition path in Enumerable.Select in #88425 improved the EmptyTakeSelectToArray microbenchmarks, as reported in dotnet/perf-autofiling-issues#19761.

Optimizing the BigInteger operators +, -, and * for trivial cases in #84733 improved some of the BigInteger microbenchmarks reported in dotnet/perf-autofiling-issues#19762.

Precomputing the CallInfo structure in #88369 improved about 200 microbenchmarks.

The BCL change #86287 and vectorization of Vector128 in #88064 improved about a dozen Equals microbenchmarks.

Regressions

Here is a list of the top 20 regressed microbenchmarks in Preview 7.

Name Baseline Value Compare Value % Difference
System.Collections.CtorFromCollection<String>.FrozenDictionary(Size: 512) 44266.49 396363.53 795.40
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.EqualsAllBenchmark 6.90 9.58 38.82
Microsoft.Extensions.DependencyInjection.TimeToFirstService.Scoped(Mode: "Expressions") 49567.25 65031.35 31.19
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.BitwiseOrOperatorBenchmark 9.62 12.45 29.41
System.Numerics.Tests.Perf_VectorOf<SByte>.OnesComplementOperatorBenchmark 6.04 7.80 29.23
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.AllBitsSetBenchmark 2.04 2.61 28.32
System.Tests.Perf_GC<Byte>.NewOperator_Array(length: 10000) 4495.94 5733.46 27.52
System.Memory.Span<Char>.SequenceEqual(Size: 33) 85.83 108.56 26.49
System.Numerics.Tests.Perf_VectorOf<Single>.AddOperatorBenchmark 7.67 9.58 24.98
Microsoft.Extensions.DependencyInjection.TimeToFirstService.Scoped(Mode: "ILEmit") 49928.88 62377.01 24.93
System.Memory.Constructors<String>.SpanFromArray 15.59 19.40 24.46
Microsoft.Extensions.DependencyInjection.ScopeValidation.TransientWithScopeValidation 1815.08 2227.85 22.74
System.Numerics.Tests.Perf_VectorOf<Int64>.EqualityOperatorBenchmark 6.56 7.77 18.48
System.IO.Tests.Perf_File.CopyToOverwrite(size: 4096) 47118.52 55507.12 17.80
System.Tests.Perf_Decimal.TryParse(value: "123456.789") 895.48 1023.98 14.34
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.AllBitsSetBenchmark 1.48 1.69 14.11
System.Numerics.Tests.Perf_VectorOf<UInt16>.AndNotBenchmark 9.16 10.44 13.96
System.Memory.Span<Byte>.IndexOfValue(Size: 33) 58.20 65.95 13.31
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.BitwiseOrOperatorBenchmark 7.62 8.61 12.96
System.Tests.Perf_Int32.ParseSpan(value: "2147483647") 206.91 233.69 12.94

Preview 6

The following section presents only major improvements with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Mono AOT WASM

The following sections present improvements and regressions introduced in Mono AOT WASM in Preview 6.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 6.

Name Baseline Value Compare Value % Difference
System.Numerics.Tests.Perf_Quaternion.LengthBenchmark 0.38 0.00 -100
System.Numerics.Tests.Perf_Quaternion.NegationOperatorBenchmark 1.87 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.CountBenchmark 0.34 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark 0.22 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.InequalityOperatorBenchmark 0.97 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.CountBenchmark 0.29 0.00 -100
System.Tests.Perf_Enum.HasFlag 1.35 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.EqualityOperatorBenchmark 2.28 0.01 <
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark 0.22 0.00 -99.57
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.GreaterThanAllBenchmark 2.50 0.02 -99.35
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.UnaryNegateOperatorBenchmark 85.94 2.58 -97.00
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.UnaryNegateOperatorBenchmark 85.93 4.27 -95.02
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.UnaryNegateOperatorBenchmark 85.94 4.30 -94.99
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.UnaryNegateOperatorBenchmark 85.93 4.35 -94.94
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.LessThanOrEqualBenchmark 2.91 0.26 -91.04
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualityOperatorBenchmark 2.26 0.25 -88.80
System.Numerics.Tests.Perf_Vector3.UnitZBenchmark 3.84 0.54 -85.93
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.BitwiseAndBenchmark 4.07 0.69 -83.07
System.Runtime.Intrinsics.Tests.Perf_Vector128.FloorFloatBenchmark 20.82 3.59 -82.73
System.Net.Primitives.Tests.IPAddressPerformanceTests.TryWriteBytes(address: 1020:3040:5060:7080:9010:1112:1314:1516) 78.86 13.78 -82.52

Regressions

Here is a list of the top 20 regressed microbenchmarks in Preview 6.

Name Baseline Value Compare Value % Difference
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.CountBenchmark 0.00 0.14 26004.19
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.CountBenchmark 0.00 0.07 12106.45
System.Numerics.Tests.Perf_VectorOf<Double>.CountBenchmark 0.09 3.36 3767.73
System.Numerics.Tests.Perf_VectorOf<Single>.CountBenchmark 0.00 0.06 2106.86
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.AllBitsSetBenchmark 1.95 10.77 452.08
System.Numerics.Tests.Perf_VectorOf<Single>.CountBenchmark 0.00 0.01 405.57
System.Numerics.Tests.Perf_VectorOf<UInt16>.MaxBenchmark 0.75 3.50 365.24
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.DotBenchmark 0.87 3.58 312.42
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GreaterThanOrEqualBenchmark 0.92 3.67 300.46
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GreaterThanOrEqualBenchmark 0.92 3.55 286.90
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.DotBenchmark 0.78 2.61 236.42
System.Numerics.Tests.Perf_VectorOf<SByte>.OnesComplementOperatorBenchmark 0.75 2.51 236.33
System.Numerics.Tests.Perf_VectorOf<SByte>.BitwiseOrBenchmark 2.62 8.52 225.70
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.ZeroBenchmark 2.00 5.96 198.55
System.Numerics.Tests.Perf_VectorOf<Int64>.ZeroBenchmark 1.98 5.88 196.21
System.Numerics.Tests.Perf_VectorOf<UInt16>.MultiplyBenchmark 3.10 9.12 194.26
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.EqualsBenchmark 0.98 2.75 180.71
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.EqualsBenchmark 0.98 2.69 174.16
System.Numerics.Tests.Perf_VectorOf<SByte>.UnaryNegateOperatorBenchmark 1.08 2.80 159.06
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.MinBenchmark 2.70 6.92 156.32
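
Several rows in the tables above have a baseline or compare value that rounds to 0.00, which yields differences such as -100 or 26004.19. The sketch below shows one way such rows could be set aside before ranking. The noise floor of 0.5 (in the same units as the tables) and the names `Row` and `split_by_noise_floor` are illustrative assumptions, not part of the report's tooling; benchmark names are abbreviated from the table above.

```python
from typing import NamedTuple


class Row(NamedTuple):
    name: str
    baseline: float
    compare: float


def split_by_noise_floor(rows, floor=0.5):
    """Split rows into (reliable, near_zero).

    A row counts as near-zero when either measurement is below `floor`.
    The default of 0.5 is an arbitrary value chosen for illustration.
    """
    reliable, near_zero = [], []
    for row in rows:
        if min(row.baseline, row.compare) < floor:
            near_zero.append(row)
        else:
            reliable.append(row)
    return reliable, near_zero


rows = [
    Row("Perf_Vector128Of<Single>.CountBenchmark", 0.00, 0.14),
    Row("Perf_Vector128Of<UInt64>.AllBitsSetBenchmark", 1.95, 10.77),
    Row("Perf_VectorOf<SByte>.BitwiseOrBenchmark", 2.62, 8.52),
]
reliable, near_zero = split_by_noise_floor(rows)
print([r.name for r in near_zero])
```

Separating the two groups keeps the very large percentages from dominating a ranking while leaving the raw rows available for inspection.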

Mono AOT compiler

The performance regressions and improvements are analyzed separately in #89238.

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 6.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 6.

Name Baseline Value Compare Value % Difference
System.Numerics.Tests.Perf_VectorOf<Double>.CountBenchmark 0.00 0.00 -100
System.Numerics.Tests.Perf_VectorOf<Int32>.CountBenchmark 0.02 0.00 -100
System.Numerics.Tests.Perf_VectorOf<UInt32>.CountBenchmark 0.00 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.CountBenchmark 0.40 0.00 -100
System.Numerics.Tests.Perf_VectorOf<SByte>.OneBenchmark 76.06 1.57 -97.93
System.Numerics.Tests.Perf_VectorOf<Byte>.OneBenchmark 76.01 1.87 -97.53
System.Numerics.Tests.Perf_VectorOf<SByte>.NegateBenchmark 221.32 6.26 -97.16
System.Numerics.Tests.Perf_VectorOf<SByte>.UnaryNegateOperatorBenchmark 221.61 6.27 -97.16
System.Numerics.Tests.Perf_VectorOf<Byte>.UnaryNegateOperatorBenchmark 214.44 6.20 -97.10
System.Numerics.Tests.Perf_VectorOf<Byte>.NegateBenchmark 214.55 6.37 -97.02
System.Numerics.Tests.Perf_VectorOf<SByte>.SubtractBenchmark 231.29 7.90 -96.58
System.Numerics.Tests.Perf_VectorOf<SByte>.SubtractionOperatorBenchmark 221.04 7.90 -96.42
System.Numerics.Tests.Perf_VectorOf<UInt16>.OneBenchmark 50.92 1.83 -96.41
System.Numerics.Tests.Perf_VectorOf<Byte>.AddBenchmark 216.21 7.83 -96.37
System.Numerics.Tests.Perf_VectorOf<Byte>.SubtractBenchmark 214.79 7.79 -96.37
System.Numerics.Tests.Perf_VectorOf<Byte>.SubtractionOperatorBenchmark 215.60 7.92 -96.32
System.Numerics.Tests.Perf_VectorOf<SByte>.MultiplyOperatorBenchmark 225.86 8.35 -96.30
System.Numerics.Tests.Perf_VectorOf<Byte>.AddOperatorBenchmark 209.41 7.95 -96.20
System.Numerics.Tests.Perf_VectorOf<SByte>.MultiplyBenchmark 217.21 8.39 -96.13
System.Numerics.Tests.Perf_VectorOf<SByte>.AddOperatorBenchmark 214.44 8.33 -96.11

Vectorization of Vector<T> operators improved over 200 microbenchmarks, as reported in dotnet/perf-autofiling-issues#18537.

Changes in #87219 introduced Math.BigMul in the NextUInt64 random method and improved several microbenchmarks reported in dotnet/perf-autofiling-issues#18690.
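
Math.BigMul computes the full 128-bit product of two 64-bit integers and returns it as high and low 64-bit halves. The sketch below models that operation in Python and shows one common use of a widening multiply in random number generation: mapping a uniformly random 64-bit value into the range [0, bound). Whether #87219 uses it in exactly this way is an assumption. The function names are hypothetical, and this simplified version omits the rejection step that production implementations usually add to remove a small bias.

```python
import random

MASK64 = (1 << 64) - 1


def big_mul(a: int, b: int):
    """Model of a 64x64 -> 128-bit unsigned multiply.

    Returns (high, low): the upper and lower 64 bits of a * b.
    """
    product = (a & MASK64) * (b & MASK64)
    return product >> 64, product & MASK64


def scale_to_range(random_bits: int, bound: int) -> int:
    """Map a 64-bit value to [0, bound) using the high half of the product.

    Simplified: no rejection step, so results carry a tiny bias.
    """
    high, _low = big_mul(random_bits, bound)
    return high


value = scale_to_range(random.getrandbits(64), 10)
print(0 <= value < 10)
```

Taking the high half replaces a modulo operation with a single multiplication, which is the usual motivation for this technique.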

About 120 microbenchmarks improved, as reported in dotnet/perf-autofiling-issues#19027, potentially due to #87555 or other interpreter and BCL changes.

Frozen dictionary creation improved by 72% in #87510.

Regressions

Here is a list of the top 20 regressed microbenchmarks in Preview 6.

Name Baseline Value Compare Value % Difference
System.Numerics.Tests.Perf_VectorOf<Int64>.CountBenchmark 0.01 0.23 2775.54
System.Numerics.Tests.Perf_VectorOf<UInt64>.CountBenchmark 0.01 0.17 2177.17
System.Numerics.Tests.Perf_VectorOf<UInt16>.ZeroBenchmark 2.24 4.95 121.29
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualityOperatorBenchmark 7.65 16.63 117.46
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.OnesComplementOperatorBenchmark 3.03 6.11 101.75
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark 0.04 0.08 86.25
System.Numerics.Tests.Perf_VectorOf<UInt64>.GreaterThanAllBenchmark 18.37 33.12 80.26
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get_EnumerateHeaders_Validated(ssl: True, chunkedResponse: False, responseLength: 100000) 2230622.93 3965252.94 77.76
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.CountBenchmark 0.12 0.20 69.81
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get(ssl: True, chunkedResponse: False, responseLength: 100000) 2181340.94 3635706.61 66.67
System.Numerics.Tests.Perf_VectorOf<Byte>.LessThanOrEqualAnyBenchmark 18.27 30.07 64.56
System.Numerics.Tests.Perf_Vector4.ZeroBenchmark 1.36 2.10 55.23
HardwareIntrinsics.RayTracer.SoA.Render 1.15 1.76 52.81
System.Numerics.Tests.Perf_Vector2.DivideByScalarBenchmark 13.77 20.17 46.46
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get(ssl: True, chunkedResponse: True, responseLength: 100000) 2621801.93 3807493.79 45.22
System.Runtime.Intrinsics.Tests.Perf_Vector128.ConvertDoubleToLongBenchmark 64.48 89.74 39.17
System.Linq.Tests.Perf_Enumerable.WhereSingleOrDefault_LastElementMatches(input: Array) 2714.67 3708.23 36.59
System.Memory.Constructors_ValueTypesOnly<Byte>.SpanFromPointerLength 6.95 9.47 36.28
Span.IndexerBench.CoveredIndex3(length: 1024) 16595.22 22106.92 33.21
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: False) 867.68 1154.02 33.00

Preview 5

A number of improvements introduced in Preview 5 are worth calling out individually. The following section presents only major improvements with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Mono AOT compiler

The performance regressions and improvements are analyzed separately in #89238.

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 5.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 5.

Name Baseline Value Compare Value % Difference
System.Numerics.Tests.Perf_VectorOf<Single>.CountBenchmark 0.18 0.00 -100
System.Numerics.Tests.Perf_VectorOf<UInt16>.CountBenchmark 0.10 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.CountBenchmark 0.01 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.CountBenchmark 0.03 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.CountBenchmark 1.12 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.CountBenchmark 0.22 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.CountBenchmark 0.08 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.CountBenchmark 0.48 0.00 -99.74
System.Numerics.Tests.Perf_VectorOf<UInt32>.CountBenchmark 0.14 0.00 -99.30
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.CountBenchmark 2.36 0.12 -95.07
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.DivideBenchmark 127.11 7.82 -93.85
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.MultiplyOperatorBenchmark 123.89 7.68 -93.80
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.MultiplyBenchmark 126.45 7.94 -93.71
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.MultiplyOperatorBenchmark 125.08 7.87 -93.70
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.DivisionOperatorBenchmark 123.79 7.83 -93.67
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.DivideBenchmark 126.19 8.05 -93.62
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.MultiplyBenchmark 127.05 8.23 -93.52
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.DivisionOperatorBenchmark 123.95 8.22 -93.37
System.Numerics.Tests.Perf_VectorOf<UInt64>.CountBenchmark 0.06 0.01 -86.49
System.Collections.Tests.Perf_Dictionary.ContainsValue(Items: 3000) 483385521.57 66414495.75 -86.26

Vectorization of IndexOf in #85437 improved System.Text.RegularExpressions microbenchmarks reported in dotnet/perf-autofiling-issues#17517. Addition of Vector128 and PackedSimd in #82773 improved about 70 microbenchmarks reported in dotnet/perf-autofiling-issues#17563 and dotnet/perf-autofiling-issues#17819.

Changes in Plane and Quaternion improved several microbenchmarks reported in dotnet/perf-autofiling-issues#18043.

The change in #85528 addressed performance problems with code like EqualityComparer<T>.Default.Equals(), which improved over 200 microbenchmarks reported in dotnet/perf-autofiling-issues#18349. The implementation of the float32 Vector128.Equals intrinsic improved the System.Numerics.Tests microbenchmarks.

Regressions

Here is a list of the top 20 regressed microbenchmarks in Preview 5.

Name Baseline Value Compare Value % Difference
System.Numerics.Tests.Perf_Vector2.ZeroBenchmark 0.03 1.05 3076.49
System.Numerics.Tests.Perf_VectorOf<Double>.ZeroBenchmark 2.96 9.10 207.86
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.BitwiseOrOperatorBenchmark 8.51 21.64 154.37
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.GreaterThanOrEqualAnyBenchmark 24.29 47.23 94.44
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.InequalityOperatorBenchmark 3.94 7.15 81.24
System.Numerics.Tests.Perf_Plane.CreateFromVerticesBenchmark 76.92 132.40 72.12
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.ConditionalSelectBenchmark 11.14 17.45 56.64
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: False, UseSharedPool: False) 1877.78 2918.99 55.44
System.Diagnostics.Perf_Process.StartAndWaitForExit 1286337.51 1968645.19 53.04
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanAllBenchmark 24.23 36.78 51.79
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.ZeroBenchmark 2.99 4.47 49.41
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.SubtractionOperatorBenchmark 7.62 11.13 45.99
System.Memory.Span<Char>.Reverse(Size: 512) 789.89 1116.00 41.28
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: False, UseSharedPool: False) 1963.38 2745.38 39.82
System.Numerics.Tests.Perf_VectorOf<Single>.LessThanAllBenchmark 59.72 82.75 38.57
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.EqualityOperatorBenchmark 27.40 37.64 37.35
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (, None, False)) 6382.39 8678.93 35.98
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.OnesComplementBenchmark 6.38 8.61 34.98
System.Numerics.Tests.Perf_VectorOf<Int64>.ZeroBenchmark 2.81 3.78 34.72
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanOrEqualAllBenchmark 26.61 35.79 34.51

Preview 4

A number of improvements introduced in Preview 4 are worth calling out individually. The following section presents only major improvements with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Mono AOT compiler

The following sections present improvements and regressions introduced in the Mono AOT compiler in Preview 4.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 4.

Name Baseline Value Compare Value % Difference
System.Numerics.Tests.Perf_VectorOf<SByte>.CountBenchmark 0.01 0.00 -100
System.Numerics.Tests.Perf_VectorOf<UInt16>.CountBenchmark 0.01 0.00 -100
System.Numerics.Tests.Perf_VectorOf<UInt32>.CountBenchmark 0.01 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.CountBenchmark 0.01 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark 0.01 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.CountBenchmark 0.01 0.00 -100
System.Tests.Perf_DateTime.ToString(format: "s") 417.41 103.88 -75.11
System.Tests.Perf_DateTimeOffset.ToString(format: "s") 431.57 114.37 -73.49
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 100000) 25903.87 7803.06 -69.87
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 10000) 25653.57 7923.08 -69.11
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 10000000) 24916.24 7700.13 -69.09
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 1000000) 25328.88 7962.83 -68.56
System.Collections.Tests.Add_Remove_SteadyState<Int32>.Queue(Count: 512) 18.37 8.31 -54.78
System.Threading.Tests.Perf_Volatile.Read_double 0.26 0.12 -53.92
System.Numerics.Tests.Perf_VectorOf<Byte>.ZeroBenchmark 5.66 2.67 -52.77
System.Net.Primitives.Tests.IPAddressPerformanceTests.TryFormat(address: 1020:3040:5060:7080:9010:1112:1314:1516) 243.27 128.93 -46.99
System.Numerics.Tests.Perf_Vector3.DistanceSquaredBenchmark 16.92 9.15 -45.90
System.Numerics.Tests.Perf_Vector3.DistanceBenchmark 23.13 13.70 -40.79
PerfLabTests.EnumPerf.ObjectGetType 0.03 0.02 -38.31
System.Numerics.Tests.Perf_Vector3.DivideByVector3OperatorBenchmark 17.44 10.91 -37.47

BCL changes in #84210 and #84210 improved Guid.Parse and vectorized all sets in Regex, as reported in dotnet/perf-autofiling-issues#15183 and dotnet/perf-autofiling-issues#15177.

Implementation of fast path for mini_init_method_rgctx in #84226 improved over 50 microbenchmarks reported in dotnet/perf-autofiling-issues#15717, dotnet/perf-autofiling-issues#15796, and dotnet/perf-autofiling-issues#15799.

Intrinsics get_Count and get_AllBitsSet on arm64 improved around 400 microbenchmarks, as reported in dotnet/perf-autofiling-issues#15800, dotnet/perf-autofiling-issues#15718, and dotnet/perf-autofiling-issues#15797.

Allowing inlining of methods containing constructor calls and intrinsifying additional calls to Type:op_Equality improved over 100 microbenchmarks reported in dotnet/perf-autofiling-issues#16371 and dotnet/perf-autofiling-issues#16509.

V128 SIMD intrinsics on Arm64 across all codegen engines in #84289 improved over 400 microbenchmarks reported in dotnet/perf-autofiling-issues#16460, dotnet/perf-autofiling-issues#16621, and dotnet/perf-autofiling-issues#16660. Adding Vector128.ConvertXX and Vector128.Create as intrinsics on arm64 improved 48 microbenchmarks reported in dotnet/perf-autofiling-issues#17314 and in dotnet/perf-autofiling-issues#17315.

Making Guid.HexsToChars aggressively inlined in #85322 improved a couple of microbenchmarks.

Regressions

Here is a list of the top 20 regressed microbenchmarks in Preview 4.

Name Baseline Value Compare Value % Difference
System.Tests.Perf_String.Substring_IntInt(s: "dzsdzsDDZSDZSDZSddsz", i1: 7, i2: 4) 23.92 42.38 77.13
System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0) 14.05 23.66 68.37
System.Buffers.Text.Tests.Utf8FormatterTests.FormatterInt32(value: 4) 13.98 22.92 64.00
Benchstone.BenchI.IniArray.Test 186909527.87 304502098.85 62.91
Span.IndexerBench.Ref(length: 1024) 686.54 1110.42 61.74
System.Tests.Perf_Int64.TryParse(value: "9223372036854775807") 58.15 93.40 60.60
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.DivideBenchmark 23.30 37.16 59.44
System.Tests.Perf_Int64.TryParse(value: "-9223372036854775808") 59.06 93.58 58.45
System.Tests.Perf_Int64.TryParseSpan(value: "9223372036854775807") 59.71 93.89 57.26
System.Buffers.Binary.Tests.BinaryReadAndWriteTests.MeasureReverseUsingNtoH 1432.42 2191.50 52.99
System.Tests.Perf_Int64.TryParseSpan(value: "-9223372036854775808") 61.80 94.18 52.39
System.Threading.Tests.Perf_Volatile.Write_double 0.23 0.35 52.13
System.Numerics.Tests.Perf_VectorOf<Int32>.EqualsBenchmark 0.81 1.23 50.47
System.Tests.Perf_String.Trim(s: "Test ") 76.12 113.79 49.48
System.Tests.Perf_UInt16.Parse(value: "12345") 35.63 52.72 47.98
System.Tests.Perf_Int64.Parse(value: "-9223372036854775808") 62.30 91.72 47.22
System.Tests.Perf_UInt64.Parse(value: "18446744073709551615") 70.51 103.27 46.44
System.Tests.Perf_Int64.Parse(value: "9223372036854775807") 61.62 90.17 46.34
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.SumBenchmark 2.76 3.99 44.34
System.Collections.Tests.Perf_BitArray.BitArrayGet(Size: 512) 8039.61 11602.79 44.32

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 4.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 4.

Name Baseline Value Compare Value % Difference
System.Numerics.Tests.Perf_VectorOf<Byte>.CountBenchmark 0.00 0.00 -100
System.Numerics.Tests.Perf_VectorOf<Int16>.CountBenchmark 0.18 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.CountBenchmark 0.16 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.CountBenchmark 1.29 0.00 -100
System.Numerics.Tests.Perf_VectorOf<SByte>.CountBenchmark 0.20 0.00 -99.20
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.CountBenchmark 0.07 0.00 -95.73
System.Tests.Perf_DateTime.ToString(format: "s") 2233.23 281.76 -87.38
System.Text.Json.Serialization.Tests.ColdStartSerialization<SimpleStructWithProperties>.NewJsonSerializerContext 185975.98 28969.63 -84.42
System.Tests.Perf_DateTimeOffset.ToString(format: "s") 2311.74 385.39 -83.32
System.Numerics.Tests.Perf_VectorOf<Int32>.CountBenchmark 0.44 0.10 -77.43
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 10000000) 45039.52 12494.67 -72.25
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 10000) 44649.63 12502.98 -71.99
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 1000000) 45124.15 13007.76 -71.17
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 100000) 44604.36 13258.02 -70.27
System.Reflection.Invoke.Ctor0_NoParams 393.98 123.35 -68.69
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.CountBenchmark 0.00 0.00 -68.38
System.Tests.Perf_DateTimeOffset.ToString(format: null) 6639.43 2509.03 -62.21
System.Reflection.Activator<EmptyClass>.CreateInstanceGeneric 575.27 221.73 -61.45
System.Tests.Perf_DateTimeOffset.ToString(value: 12/30/2017 3:45:22 AM -08:00) 6959.23 2746.69 -60.53
System.Memory.ReadOnlySpan.Trim(input: "") 49.19 19.80 -59.73

Implementation of IUtf8SpanFormattable in #84469 caused both improvements and regressions, as reported in dotnet/perf-autofiling-issues#15630 and dotnet/perf-autofiling-issues#15626. DateTime{Offset} formatting changes improved about 120 microbenchmarks reported in dotnet/perf-autofiling-issues#17009. PR #85288 improved about 30 microbenchmarks reported in dotnet/perf-autofiling-issues#17245. Handling Utf8Formatter.TryFormat by delegating to the relevant helpers in #85277 improved about 30 microbenchmarks.

Regressions

Here is a list of the top 20 regressed microbenchmarks in Preview 4.

Name Baseline Value Compare Value % Difference
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.CountBenchmark 0.00 0.23 9893.94
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark 0.02 0.75 4216.78
System.Numerics.Tests.Perf_VectorOf<UInt32>.CountBenchmark 0.00 0.12 3988.20
Microsoft.Extensions.DependencyInjection.ActivatorUtilitiesBenchmark.Factory 276.60 852.40 208.17
System.Numerics.Tests.Perf_VectorOf<UInt64>.AbsBenchmark 2.32 4.51 94.06
System.Numerics.Tests.Perf_VectorOf<UInt16>.AbsBenchmark 2.37 4.34 83.29
System.Numerics.Tests.Perf_Vector2.ZeroBenchmark 0.44 0.78 78.01
System.Memory.Constructors<Byte>.ArrayAsSpan 12.20 21.63 77.34
Microsoft.Extensions.Primitives.Performance.StringValuesBenchmark.Indexer_FirstElement_String 8.60 14.85 72.68
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get(ssl: True, chunkedResponse: False, responseLength: 100000) 1903905.78 3227992.49 69.54
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.OnesComplementBenchmark 6.62 10.83 63.43
System.Buffers.Text.Tests.Utf8FormatterTests.FormatterDecimal(value: 123456.789) 491.42 801.06 63.00
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.OnesComplementOperatorBenchmark 6.29 10.12 60.75
Microsoft.AspNetCore.Server.Kestrel.Performance.PipeThroughputBenchmark.Parse_ParallelAsync(Length: 4096, Chunks: 1) 8112.10 12805.61 57.85
System.Memory.Constructors<Byte>.MemoryMarshalCreateReadOnlySpan 7.75 12.19 57.15
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.CountBenchmark 0.12 0.19 54.21
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.BitwiseAndBenchmark 8.47 12.73 50.32
System.Numerics.Tests.Constructor.ConstructorBenchmark_Int16 29.48 43.17 46.45
System.Numerics.Tests.Perf_VectorOf<UInt16>.InequalityOperatorBenchmark 19.53 27.98 43.23
System.Numerics.Tests.Perf_VectorOf<UInt64>.BitwiseOrBenchmark 39.39 55.74 41.51

Preview 3

The following section gives an overview of only the major improvements, with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Mono AOT compiler

The following sections present improvements and regressions introduced in the Mono AOT compiler in Preview 3.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 3.

Name Baseline Value Compare Value % Difference
System.Numerics.Tests.Perf_VectorOf<Byte>.CountBenchmark 0.01 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.CountBenchmark 0.01 0.00 -100
System.Tests.Perf_Boolean.ToString(value: True) 0.23 0.00 -100
System.Numerics.Tests.Perf_Vector4.EqualityOperatorBenchmark 1.96 0.80 -59.04
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.SumBenchmark 6.65 3.26 -50.93
System.Numerics.Tests.Perf_Vector4.InequalityOperatorBenchmark 1.39 0.74 -46.53
System.Tests.Perf_Enum.HasFlag 0.23 0.13 -44.47
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_uint 1096.23 667.83 -39.07
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_ulong 1102.75 746.09 -32.34
System.Numerics.Tests.Perf_BitOperations.Log2_ulong 1320.59 895.14 -32.21
System.Tests.Perf_String.IndexerCheckLengthHoisting 88.84 60.29 -32.13
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanOrEqualAllBenchmark 4.44 3.03 -31.65
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.SumBenchmark 4.02 2.76 -31.25
System.Numerics.Tests.Perf_VectorOf<SByte>.MinBenchmark 48.27 33.34 -30.93
Inlining.InlineGCStruct.WithFormat 2.86 1.99 -30.52
PerfLabTests.CastingPerf.ObjScalarValueType 108762.72 76497.64 -29.66
System.Numerics.Tests.Perf_VectorOf<Byte>.InequalityOperatorBenchmark 0.55 0.39 -29.07
Microsoft.Extensions.Primitives.StringSegmentBenchmark.Equals_Object_Invalid 2.86 2.04 -28.66
System.Numerics.Tests.Perf_VectorOf<UInt64>.EqualityOperatorBenchmark 0.52 0.37 -28.49
System.Numerics.Tests.Perf_VectorOf<UInt64>.InequalityOperatorBenchmark 0.62 0.45 -28.32
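
The % Difference column in the table above appears consistent with the relative change of the compare value against the baseline, so negative values mean the benchmark got faster. The following Python sketch is an assumption about how the column is derived, not the reporting tool's code, and `percent_difference` is an illustrative name. Because the table shows rounded values, recomputed percentages can differ from the reported ones in the last digit.

```python
def percent_difference(baseline: float, compare: float) -> float:
    """Relative change of compare against baseline, in percent (negative = faster)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (compare - baseline) / baseline * 100.0


# Rows copied from the table above: (name, baseline, compare, reported % difference).
rows = [
    ("Perf_BitOperations.LeadingZeroCount_uint", 1096.23, 667.83, -39.07),
    ("Perf_BitOperations.Log2_ulong", 1320.59, 895.14, -32.21),
    ("PerfLabTests.CastingPerf.ObjScalarValueType", 108762.72, 76497.64, -29.66),
]

for name, baseline, compare, reported in rows:
    computed = percent_difference(baseline, compare)
    print(f"{name}: computed {computed:.2f}%, reported {reported:.2f}%")
```

Rows whose values are close to zero, such as the CountBenchmark entries, cannot be recomputed reliably this way, since the displayed baseline and compare values have lost most of their precision.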

The most improved benchmark grouping is System.Numerics, as outlined in dotnet/perf-autofiling-issues#14023, dotnet/perf-autofiling-issues#14224, dotnet/perf-autofiling-issues#14573, and dotnet/perf-autofiling-issues#14322. The changes implemented in #82420, #83337, and #83094 introduced Arm64 SIMD operations and improved about 1000 microbenchmarks.

Regressions

Here is a list of top 20 regressed microbenchmarks in Preview 3.

Name Baseline Value Compare Value % Difference
System.Numerics.Tests.Perf_VectorOf<Byte>.ZeroBenchmark 2.65 5.66 113.78
System.Numerics.Tests.Perf_BitOperations.Log2_uint 791.53 1539.09 94.44
System.Collections.Tests.Add_Remove_SteadyState<Int32>.Queue(Count: 512) 9.64 18.37 90.64
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_HeavyEscaping(NumberOfBytes: 1000) 2769.97 5142.05 85.63
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_NoEscaping(NumberOfBytes: 1000) 2771.03 5139.62 85.47
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_HeavyEscaping(NumberOfBytes: 100) 377.30 646.53 71.35
System.Numerics.Tests.Perf_BitOperations.PopCount_uint 668.42 1104.04 65.17
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_NoEscaping(NumberOfBytes: 100) 377.61 598.53 58.50
System.Threading.Tests.Perf_Volatile.Read_double 0.16 0.26 57.96
System.Memory.Span<Char>.Reverse(Size: 512) 258.69 407.47 57.51
PerfLabTests.LowLevelPerf.StructWithInterfaceInterfaceMethod 154024.04 239168.34 55.27
System.Text.Json.Tests.Perf_Segment.ReadSingleSegmentSequenceByN(numberOfBytes: 8192, TestCase: Json4KB) 13635.35 20935.97 53.54
System.Text.Json.Tests.Perf_Reader.ReadSpanEmptyLoop(IsDataCompact: True, TestCase: Json4KB) 10415.86 15732.85 51.04
System.Text.Json.Tests.Perf_Reader.ReadSingleSpanSequenceEmptyLoop(IsDataCompact: True, TestCase: Json4KB) 10436.16 15712.23 50.55
System.Numerics.Tests.Perf_VectorOf<Int32>.EqualityOperatorBenchmark 0.24 0.36 50.01
System.Collections.IndexerSetReverse.Array(Size: 512) 456.86 681.13 49.08
System.Collections.IndexerSet<Int32>.Span(Size: 512) 458.27 682.26 48.87
System.Numerics.Tests.Perf_VectorOf<Int64>.EqualityOperatorBenchmark 0.27 0.40 48.57
System.Numerics.Tests.Perf_BitOperations.PopCount_ulong 745.13 1102.84 48.00
System.Text.Json.Tests.Perf_Reader.ReadReturnBytes(IsDataCompact: False, TestCase: Json40KB) 158074.36 231420.75 46.39

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 3.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 3.

Name Baseline Value Compare Value % Difference
System.Numerics.Tests.Perf_VectorOf<Single>.CountBenchmark 0.16 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.CountBenchmark 0.01 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.CountBenchmark 0.11 0.00 -100
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.CountBenchmark 0.43 0.00 -100
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql1fehvs91yzkt9xam7ahjbhvpd9edll13ab46i74ktwwgkgbi792e5gkuuzevo5qm8qt83edag7zovoe686gmtw730kms2i5xgji4xcp25287q68fvhwszd3mszht2uh7bchlgkj5qnq1x9m4lg7vwn8cq5l756akua6oyx9k71bmxbysnmhvxvlxde4k9maumfgxd8gxhxx4mwpph2ttyox9zilt3ylv1q9s4bopfuoa8qlrzodg2q67sh85wx4slcd6w7ufnendaxai633ove2ktbaxdt2sz6y6mo42473xd274gz833p6hj3mu77c4m4od9e5s8btxleh0efqnu9zj9rwtbk5758lio35b3q426j5fwwq1qyknfedrsmqyfw1m38mkkotdf7n0vr6p3erhy8dkzntr9fwjrslxjgrbegih0n6bpb5bfuy55bu65ce9kejcfifxwpcs05umrsb8kvd64q2iwugbbi7vd35g5ho0rff9rhombgzzaniyq7bbjbqr88jyw4ccgnoyl31of3a5thv0vg08gnrqzxas800hewtw8tnwgw5pav81ntdpdd62689x3iqpc317y82b3e2trbpdzieoxldaz009tz37gqmh4bdp1bv9lnl5s58udb11z0h7i2sdl5nbyhjyfzxwzezmp4qx0i3eyvsd3fg8sryq9jhlvkonnfcvb4snl4mcbimdzg49tzdhqjmfxfcq3p1st6b9x2xyevo17evpqp4yc4f2rm0f26ivr3t2f5m0boc44vituxaovcqy1jrkcs6im2kdu3jvcexx2k76egve63aon5a6nbxss4rcke90npmqp35qluf571ms160y2nhaqef835wah41qru8tauu362v0r8konl8", oldChar: 'b', newChar: '+') 99861.87 2074.68 -97.92
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.CountBenchmark 2.79 0.07 -97.41
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.UnaryNegateOperatorBenchmark 234.80 6.26 -97.33
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.UnaryNegateOperatorBenchmark 246.33 6.63 -97.30
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.NegateBenchmark 235.81 6.49 -97.24
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.NegateBenchmark 235.54 6.56 -97.21
System.Numerics.Tests.Perf_VectorOf<UInt64>.CountBenchmark 3.10 0.09 -97.00
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanBenchmark 273.32 8.63 -96.84
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanBenchmark 273.20 8.91 -96.73
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.EqualsStaticBenchmark 273.84 9.19 -96.64
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.SubtractBenchmark 247.26 8.65 -96.50
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanBenchmark 250.97 8.85 -96.47
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.SubtractBenchmark 244.27 8.76 -96.41
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.MultiplyOperatorBenchmark 249.17 8.97 -96.40
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.AddBenchmark 238.40 8.67 -96.36
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.AddOperatorBenchmark 236.35 8.68 -96.32

The most improved benchmark groupings are System.Buffers, System.Collections, System.Memory, and System.Text, as outlined in dotnet/perf-autofiling-issues#14324, dotnet/perf-autofiling-issues#14325, dotnet/perf-autofiling-issues#14326, dotnet/perf-autofiling-issues#14355, dotnet/perf-autofiling-issues#14359, and dotnet/perf-autofiling-issues#14361. The changes implemented in #83498 and #83490 increased the inlining length limit from 20 to 30 and implemented shr.un.imm, which together improved over 1000 microbenchmarks.
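
To illustrate what an immediate-operand form such as shr.un.imm buys an interpreter, here is a toy Python sketch. It assumes a simplified instruction list of `(opcode, argument)` tuples and a hypothetical peephole pass named `fuse_shift_immediates`; it is not the Mono Interpreter's IR or its actual implementation. The idea is that a constant load followed by an unsigned right shift can be dispatched as one instruction instead of two.

```python
def fuse_shift_immediates(code):
    """Fuse ('ldc.i4', n) followed by ('shr.un', None) into ('shr.un.imm', n).

    Toy peephole pass over a hypothetical (opcode, argument) instruction list.
    """
    fused = []
    i = 0
    while i < len(code):
        op, arg = code[i]
        if op == "ldc.i4" and i + 1 < len(code) and code[i + 1][0] == "shr.un":
            fused.append(("shr.un.imm", arg))  # one dispatch instead of two
            i += 2
        else:
            fused.append((op, arg))
            i += 1
    return fused


def shr_un_i4(value: int, shift: int) -> int:
    """Unsigned 32-bit right shift: the vacated high bits are filled with zeros.

    The shift count is masked to 5 bits, as C# does for 32-bit operands.
    """
    return (value & 0xFFFFFFFF) >> (shift & 31)


program = [("ldloc", 0), ("ldc.i4", 3), ("shr.un", None), ("ret", None)]
print(fuse_shift_immediates(program))
print(hex(shr_un_i4(-8, 1)))
```

Fewer dispatched instructions matter more in an interpreter than in compiled code, because each instruction pays a fixed decode-and-dispatch cost.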

The change "Add vector horizontal sums on Arm64" (#83675) improved about 20 microbenchmarks, as detailed in dotnet/perf-autofiling-issues#14531.

Changes in #83512 caused both improvements and regressions as reported in dotnet/perf-autofiling-issues#15008 and dotnet/perf-autofiling-issues#15154.

Regressions

Here is a list of top 20 regressed microbenchmarks in Preview 3.

Name Baseline Value Compare Value % Difference
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.CountBenchmark 0.00 0.12 661187.94
System.Numerics.Tests.Perf_VectorOf<Int16>.CountBenchmark 0.01 0.18 2061.26
System.Numerics.Tests.Perf_Vector3.EqualsBenchmark 23.78 443.27 1764.35
System.Numerics.Tests.Perf_Vector4.EqualsBenchmark 24.01 406.03 1590.83
System.Numerics.Tests.Perf_Vector2.EqualsBenchmark 33.71 435.39 1191.71
System.Numerics.Tests.Perf_Matrix3x2.EqualsBenchmark 162.13 1346.77 730.69
System.Numerics.Tests.Perf_Plane.EqualsBenchmark 57.84 411.46 611.36
System.Numerics.Tests.Perf_Quaternion.EqualsBenchmark 80.35 436.94 443.80
System.Numerics.Tests.Perf_VectorOf<SByte>.CountBenchmark 0.04 0.20 431.24
System.Numerics.Tests.Perf_Matrix4x4.EqualsBenchmark 376.19 1808.21 380.66
System.Numerics.Tests.Perf_Vector4.ZeroBenchmark 0.99 2.52 154.02
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.EqualsBenchmark 124.90 305.09 144.27
System.Numerics.Tests.Perf_VectorOf<Int32>.CountBenchmark 0.19 0.44 127.07
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.EqualsBenchmark 191.86 410.58 113.99
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.EqualsBenchmark 199.71 410.56 105.57
System.Threading.Tests.Perf_Thread.CurrentThread 3.50 6.37 81.95
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get_EnumerateHeaders_Unvalidated(ssl: True, chunkedResponse: True, responseLength: 100000) 1951914.28 3529445.53 80.81
System.Text.Json.Serialization.Tests.ReadJson<BinaryData>.DeserializeFromReader(Mode: SourceGen) 33011.31 59326.04 79.71
System.Globalization.Tests.StringSearch.IsSuffix_DifferentLastChar(Options: (en-US, OrdinalIgnoreCase, False)) 913.26 1618.90 77.26
System.Text.Json.Serialization.Tests.ReadJson<BinaryData>.DeserializeFromReader(Mode: Reflection) 32968.66 58440.45 77.26
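
The very large percentage in the first row of the table above is what a relative change looks like when the baseline is close to zero. The Python sketch below assumes that % Difference is computed as (compare - baseline) / baseline * 100, and inverts that formula to estimate the baseline that was rounded to 0.00 for display. The function name `implied_baseline` is illustrative, and the formula is an assumption rather than something the report states.

```python
def implied_baseline(compare: float, pct_difference: float) -> float:
    """Invert pct = (compare - baseline) / baseline * 100 to estimate the baseline."""
    if pct_difference <= -100.0:
        # A -100% row means the compare value is (displayed as) zero,
        # so the baseline cannot be recovered from it.
        raise ValueError("baseline is not recoverable from a -100% difference")
    return compare / (1.0 + pct_difference / 100.0)


# First row of the table above: baseline shown as 0.00, compare 0.12, +661187.94%.
baseline = implied_baseline(0.12, 661187.94)
print(f"implied baseline: {baseline:.7f}")
print(f"rounded for display: {baseline:.2f}")
```

Under that assumption, the absolute change in such rows is tiny even though the percentage is huge, so they are better read alongside the absolute values than ranked by percentage alone.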

Preview 2

Preview 2 introduced a number of improvements worth calling out individually. This section covers only the major improvements, with high-level analysis. Treat the analysis with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Mono AOT compiler

The following sections present improvements and regressions introduced in the Mono AOT compiler in Preview 2.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 2. Full report available here.

Name Baseline Value Compare Value Difference % Difference
System.Collections.Concurrent.Count<Int32>.Dictionary(Size: 512) 34.07 μs 310.43 ns -33756.76 ns 99%
System.Collections.Concurrent.Count<String>.Dictionary(Size: 512) 17.32 μs 314.25 ns -17007.28 ns 98%
System.Tests.Perf_Decimal.Floor 81.17 ns 16.81 ns -64.36 ns 79%
System.Tests.Perf_Decimal.Round 82.24 ns 18.69 ns -63.55 ns 77%
System.Tests.Perf_UInt32.TryFormat(value: 0) 78.23 ns 20.05 ns -58.18 ns 74%
System.Tests.Perf_Int32.TryFormat(value: 4) 78.02 ns 20.47 ns -57.55 ns 74%
System.Collections.TryGetValueFalse<String, String>.ConcurrentDictionary(Size: 512) 44.69 μs 12.92 μs -31.77 μs 71%
System.Tests.Perf_Decimal.Divide 346.08 ns 102.16 ns -243.92 ns 70%
System.Collections.ContainsKeyFalse<String, String>.ConcurrentDictionary(Size: 512) 45.29 μs 13.50 μs -31.79 μs 70%
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_HeavyEscaping(NumberOfBytes: 1000) 8.93 μs 2.77 μs -6.16 μs 69%
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_NoEscaping(NumberOfBytes: 1000) 8.83 μs 2.77 μs -6.06 μs 69%
System.Tests.Perf_UInt64.TryFormat(value: 0) 84.40 ns 26.53 ns -57.87 ns 69%
System.Tests.Perf_Byte.ToString(value: 255) 91.65 ns 29.95 ns -61.69 ns 67%
System.Tests.Perf_Version.TryFormat3 265.42 ns 88.04 ns -177.38 ns 67%
System.Tests.Perf_Version.TryFormat4 345.05 ns 115.05 ns -230.00 ns 67%
System.Collections.TryGetValueTrue<String, String>.ConcurrentDictionary(Size: 512) 49.50 μs 16.53 μs -32.97 μs 67%
System.Tests.Perf_Version.TryFormat2 176.63 ns 59.61 ns -117.02 ns 66%
System.Collections.ContainsKeyTrue<String, String>.ConcurrentDictionary(Size: 512) 50.43 μs 17.54 μs -32.89 μs 65%
LinqBenchmarks.Where01ForX 1.57 secs 548.00 ms -1022.61 ms 65%
LinqBenchmarks.Where01LinqMethodX 1.68 secs 588.39 ms -1095.38 ms 65%
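
The table above mixes units (ns, μs, ms, secs) and reports improvements as positive percentages. The Python sketch below shows one way to read such rows: normalize both values to nanoseconds, take compare minus baseline for the Difference column, and take (baseline - compare) / baseline for % Difference. Both the conversion table and the sign convention are inferred from the rows themselves, and the names `to_ns` and `improvement_percent` are illustrative. Small mismatches against the Difference column are expected, because the displayed values are rounded.

```python
# Conversion factors to nanoseconds for the unit suffixes used in the table.
UNIT_TO_NS = {"ns": 1.0, "μs": 1e3, "ms": 1e6, "secs": 1e9}


def to_ns(value: float, unit: str) -> float:
    """Convert a displayed value such as (34.07, 'μs') to nanoseconds."""
    return value * UNIT_TO_NS[unit]


def improvement_percent(baseline_ns: float, compare_ns: float) -> float:
    """Positive when compare is faster than baseline, negative when slower."""
    return (baseline_ns - compare_ns) / baseline_ns * 100.0


# Rows copied from the table above: (name, baseline, unit, compare, unit).
rows = [
    ("Count<Int32>.Dictionary(Size: 512)", 34.07, "μs", 310.43, "ns"),
    ("Perf_Decimal.Floor", 81.17, "ns", 16.81, "ns"),
    ("LinqBenchmarks.Where01ForX", 1.57, "secs", 548.00, "ms"),
]

for name, b_val, b_unit, c_val, c_unit in rows:
    baseline = to_ns(b_val, b_unit)
    compare = to_ns(c_val, c_unit)
    print(f"{name}: difference {compare - baseline:.2f} ns, "
          f"{improvement_percent(baseline, compare):.0f}%")
```

With this convention a slowdown yields a negative percentage, which matches how the regression tables for this preview are signed.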

The most improved benchmark groupings are System.Collections, System.Decimal, System.Int, and System.Text, as outlined in dotnet/perf-autofiling-issues#12996, dotnet/perf-autofiling-issues#13006, dotnet/perf-autofiling-issues#13217, and dotnet/perf-autofiling-issues#13264. The changes implemented in #81695 intrinsified RuntimeHelpers.CreateSpan<T>, which is widely used in the BCL, and replaced the icall performance path.

Arm64 SIMD operations implemented in #83094 and #82420 improved over 1000 microbenchmarks, according to dotnet/perf-autofiling-issues#13808, dotnet/perf-autofiling-issues#13807, dotnet/perf-autofiling-issues#14023, and dotnet/perf-autofiling-issues#13990.

The System.Collections benchmark grouping was improved by the changes made in #81902, as outlined in dotnet/perf-autofiling-issues#13220. The changes added support for v128 constants and improved performance in about 75 microbenchmarks.

The System.Text benchmark grouping was improved by the addition of S.R.I Vectors in JsonReaderHelper, introduced in #81758 and outlined in dotnet/perf-autofiling-issues#12993. Furthermore, the improved handling of the ldtoken+ldtoken+Type::op_Equality optimization implemented in #81277 significantly improved the System.Text benchmark grouping, as detailed in dotnet/perf-autofiling-issues#12313.

The changes introduced in #81306, which removed types deriving from JsonTypeInfo<T>, had a positive impact on the System.Numerics and System.Collections benchmark groupings, as reported in dotnet/perf-autofiling-issues#12488 and dotnet/perf-autofiling-issues#12550.

All of the changes mentioned above are speed-related microbenchmark improvements. There was also a significant size improvement on WASM and iOS from enabling deduplication of generics. Issue #80419 contains references to changes that reduced size on disk (SOD) by about 11% and 3%, respectively.

Regressions

Here is a list of the top 20 microbenchmark regressions in Preview 2. Full report available here.

Name Baseline Value Compare Value Difference % Difference
System.Tests.Perf_Random.Next_long_unseeded 10.17 ns 28.84 ns 18.67 ns -184%
System.Numerics.Tests.Perf_Vector4.EqualityOperatorBenchmark 0.79 ns 1.96 ns 1.17 ns -148%
System.Numerics.Tests.Perf_Vector3.TransformByMatrix4x4Benchmark 60.14 ns 140.30 ns 80.17 ns -133%
System.Numerics.Tests.Perf_Vector3.TransformNormalByMatrix4x4Benchmark 60.73 ns 132.19 ns 71.46 ns -118%
System.Numerics.Tests.Perf_Vector4.TransformVector3ByMatrix4x4Benchmark 62.72 ns 131.48 ns 68.76 ns -110%
System.Numerics.Tests.Perf_Vector4.TransformByMatrix4x4Benchmark 63.09 ns 131.10 ns 68.00 ns -108%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix4x4Benchmark 56.47 ns 112.12 ns 55.65 ns -99%
System.Numerics.Tests.Perf_Quaternion.LengthSquaredBenchmark 7.76 ns 14.35 ns 6.59 ns -85%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix4x4Benchmark 56.66 ns 103.10 ns 46.44 ns -82%
System.Numerics.Tests.Perf_Vector4.TransformVector2ByMatrix4x4Benchmark 61.08 ns 103.66 ns 42.58 ns -70%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix3x2Benchmark 20.85 ns 35.00 ns 14.15 ns -68%
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_uint 667.85 ns 1.10 μs 428.39 ns -64%
System.Tests.Perf_Random.Next_long_long_unseeded 14.28 ns 22.44 ns 8.15 ns -57%
System.Numerics.Tests.Perf_Quaternion.ConjugateBenchmark 18.32 ns 28.76 ns 10.44 ns -57%
System.Numerics.Tests.Perf_Quaternion.InverseBenchmark 26.70 ns 41.60 ns 14.89 ns -56%
System.Numerics.Tests.Perf_Quaternion.LengthBenchmark 13.45 ns 20.35 ns 6.90 ns -51%
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_ulong 745.74 ns 1.10 μs 357.01 ns -48%
System.Numerics.Tests.Perf_BitOperations.Log2_ulong 894.61 ns 1.32 μs 425.98 ns -48%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix3x2Benchmark 21.03 ns 30.87 ns 9.85 ns -47%
System.Numerics.Tests.Perf_Vector3.ReflectBenchmark 37.23 ns 54.13 ns 16.90 ns -45%

Here is a list of ongoing regressions in the Preview 2 snapshot, with short descriptions.

Issue report Description
dotnet/perf-autofiling-issues#12546 Quaternion and Plane SIMD intrinsics
dotnet/perf-autofiling-issues#12957 Improve ConcurrentDictionary performance for strings
dotnet/perf-autofiling-issues#12660 Improved codegen of the vector accelerated System.Numerics.* types
dotnet/perf-autofiling-issues#13187 Implementation of Lemire's nearly divisionless method
dotnet/perf-autofiling-issues#13500 Use of Array.Reverse<T> in ImmutableArray<T>.Builder.Reverse
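
The entry for dotnet/perf-autofiling-issues#13187 names Lemire's nearly divisionless method, a technique for mapping a random 32-bit word to a bounded range that needs a modulo operation only on a rare path. The Python sketch below follows the published 32-bit algorithm for illustration only; it is not the runtime's implementation, and `bounded_uint32` and `next_uint32` are hypothetical names.

```python
import random


def bounded_uint32(s: int, next_uint32) -> int:
    """Return an unbiased integer in [0, s) for 0 < s < 2**32.

    next_uint32 is any callable returning uniform 32-bit unsigned integers.
    """
    if not 0 < s < (1 << 32):
        raise ValueError("s must satisfy 0 < s < 2**32")
    m = next_uint32() * s              # product fits in 64 bits
    low = m & 0xFFFFFFFF               # low 32 bits of the product
    if low < s:                        # rare path: a rejection may be needed
        threshold = ((1 << 32) - s) % s    # equals 2**32 mod s; the only division
        while low < threshold:
            m = next_uint32() * s
            low = m & 0xFFFFFFFF
    return m >> 32                     # high 32 bits are the result


rng = random.Random(0)
samples = [bounded_uint32(6, lambda: rng.getrandbits(32)) for _ in range(10)]
print(samples)
```

The common case costs one multiplication and one comparison, so the method mainly pays off where multiplication is much cheaper than division.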

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 2.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 2. Full report available here.

Name Baseline Value Compare Value Difference % Difference
System.Collections.Concurrent.Count<Int32>.Dictionary(Size: 512) 140.03 μs 1.76 μs -138.26 μs 99%
System.Collections.Concurrent.Count<String>.Dictionary(Size: 512) 136.03 μs 1.86 μs -134.17 μs 99%
System.Threading.Tests.Perf_Interlocked.CompareExchange_long 37.56 ns 6.66 ns -30.90 ns 82%
System.Threading.Tests.Perf_Interlocked.CompareExchange_int 34.18 ns 8.33 ns -25.85 ns 76%
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: False) 3.81 μs 1.09 μs -2.72 μs 71%
System.Numerics.Tests.Perf_Vector4.ZeroBenchmark 3.21 ns 0.99 ns -2.22 ns 69%
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: False) 3.42 μs 1.06 μs -2.36 μs 69%
System.Tests.Perf_Decimal.Floor 175.25 ns 65.77 ns -109.48 ns 62%
System.Numerics.Tests.Perf_Quaternion.LengthBenchmark 63.64 ns 24.08 ns -39.56 ns 62%
System.Numerics.Tests.Perf_Quaternion.InequalityOperatorBenchmark 89.74 ns 34.82 ns -54.93 ns 61%
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: False, UseSharedPool: False) 4.34 μs 1.70 μs -2.64 μs 61%
System.Tests.Perf_Decimal.Round 191.52 ns 75.77 ns -115.76 ns 60%
System.Numerics.Tests.Perf_Quaternion.DotBenchmark 77.60 ns 31.33 ns -46.27 ns 60%
System.Numerics.Tests.Perf_Quaternion.DivideBenchmark 88.55 ns 36.47 ns -52.07 ns 59%
System.Tests.Perf_Random.Next_int_int_unseeded 154.47 ns 65.37 ns -89.11 ns 58%
System.Numerics.Tests.Perf_Quaternion.IsIdentityBenchmark 81.52 ns 35.06 ns -46.46 ns 57%
System.Numerics.Tests.Perf_Quaternion.SubtractionOperatorBenchmark 83.75 ns 36.09 ns -47.67 ns 57%
System.Numerics.Tests.Perf_Quaternion.SubtractBenchmark 84.49 ns 36.50 ns -47.99 ns 57%
System.Collections.CtorFromCollection<Int32>.ConcurrentDictionary(Size: 512) 461.77 μs 200.10 μs -261.67 μs 57%
System.Tests.Perf_UInt64.TryFormat(value: 0) 250.12 ns 109.72 ns -140.40 ns 56%

The most improved benchmark groupings are System.Collections, System.Numerics, and System.Decimal, as outlined in dotnet/perf-autofiling-issues#12504, dotnet/perf-autofiling-issues#12544, dotnet/perf-autofiling-issues#13303, dotnet/perf-autofiling-issues#13247, dotnet/perf-autofiling-issues#13752, dotnet/perf-autofiling-issues#13761, and dotnet/perf-autofiling-issues#12744. The changes implemented in #81335, which intrinsified System.Numerics.* types, in #82093, which intrinsified CreateSpan, and in #81782, which introduced common Vector128 SIMD operations widely used in the BCL, improved over 1000 microbenchmarks.

The implementation of synch block fast paths created a regression in the Mono AOT compiler (#81380), but led to an improvement of about 100 microbenchmarks in the Mono Interpreter, as detailed in dotnet/perf-autofiling-issues#13245.

Similar to the change in the AOT compiler, the changes introduced in #81306, which removed types deriving from JsonTypeInfo<T>, improved several microbenchmarks in the Mono Interpreter. The change "Improve ConcurrentDictionary performance for strings" (#81557) led to the improvements reported in dotnet/perf-autofiling-issues#13003. Also, code refactors led to several improvements presented in dotnet/perf-autofiling-issues#12301.

Regressions

Here is a list of the top 20 microbenchmark regressions in Preview 2. Full report available here.

Name Baseline Value Compare Value Difference % Difference
System.Numerics.Tests.Perf_VectorOf<UInt64>.CountBenchmark 0.06 ns 3.10 ns 3.04 ns -5,059%
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.CountBenchmark 0.36 ns 1.75 ns 1.39 ns -391%
System.Collections.TryAddDefaultSize<String>.ConcurrentDictionary(Count: 512) 297.96 μs 574.34 μs 276.38 μs -93%
System.Numerics.Tests.Perf_Vector2.UnitYBenchmark 7.38 ns 13.69 ns 6.31 ns -85%
HardwareIntrinsics.RayTracer.SoA.Render 2.41 ns 4.38 ns 1.97 ns -82%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix3x2Benchmark 48.06 ns 86.28 ns 38.22 ns -80%
System.IO.Compression.Brotli.Compress_WithoutState(level: Fastest, file: "TestDocument.pdf") 291.36 μs 522.83 μs 231.47 μs -79%
System.IO.Compression.Brotli.Compress_WithState(level: Fastest, file: "TestDocument.pdf") 296.93 μs 525.99 μs 229.06 μs -77%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix3x2Benchmark 44.65 ns 75.61 ns 30.96 ns -69%
System.Memory.Constructors_ValueTypesOnly<Byte>.ReadOnlyFromPointerLength 6.33 ns 10.49 ns 4.16 ns -66%
PerfLabTests.EnumPerf.ObjectGetTypeNoBoxing 3.87 ns 6.20 ns 2.32 ns -60%
System.Numerics.Tests.Perf_Vector3.SquareRootBenchmark 23.34 ns 37.02 ns 13.68 ns -59%
System.Numerics.Tests.Perf_Vector3.TransformNormalByMatrix4x4Benchmark 124.53 ns 196.66 ns 72.12 ns -58%
System.Diagnostics.Perf_Process.StartAndWaitForExit 871.51 μs 1.35 ms 474.57 μs -54%
System.Numerics.Tests.Perf_Vector3.TransformByMatrix4x4Benchmark 144.68 ns 217.99 ns 73.31 ns -51%
System.Collections.AddGivenSize<String>.List(Size: 512) 12.21 μs 18.32 μs 6.11 μs -50%
System.IO.Tests.BinaryWriterExtendedTests.WriteAsciiCharArray(StringLengthInChars: 2000000) 8.14 ms 12.20 ms 4.06 ms -50%
System.Numerics.Tests.Perf_VectorOf<Int32>.ZeroBenchmark 3.20 ns 4.80 ns 1.59 ns -50%
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: True) 5.73 μs 8.56 μs 2.83 μs -49%
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: True) 5.62 μs 8.37 μs 2.75 μs -49%

Here is a list of ongoing regressions in the Preview 2 snapshot, with short descriptions.

Issue report Description
dotnet/perf-autofiling-issues#12707 Use of unimplemented Vector operations
dotnet/perf-autofiling-issues#13747 Intrinsified common Vector128 operations

Preview 1

This section presents an overview of the major performance improvements and regressions in the Mono Interpreter in .NET 8 Preview 1.

Improvements

Here is a list of the top 20 microbenchmark improvements in Preview 1.

Name Baseline Value Compare Value Difference % Difference
System.Numerics.Tests.Perf_VectorOf<Byte>.LessThanAnyBenchmark 292.17 ns 18.88 ns -273.29 ns 94%
System.Numerics.Tests.Perf_VectorOf<Byte>.LessThanOrEqualAnyBenchmark 298.08 ns 20.47 ns -277.61 ns 93%
System.Numerics.Tests.Perf_VectorOf<SByte>.LessThanOrEqualAnyBenchmark 294.38 ns 20.33 ns -274.05 ns 93%
System.Numerics.Tests.Perf_VectorOf<SByte>.LessThanAnyBenchmark 298.45 ns 20.63 ns -277.82 ns 93%
System.Numerics.Tests.Perf_VectorOf<Byte>.GreaterThanOrEqualAllBenchmark 331.73 ns 24.25 ns -307.48 ns 93%
System.Numerics.Tests.Perf_VectorOf<UInt16>.GreaterThanOrEqualAllBenchmark 218.05 ns 20.58 ns -197.47 ns 91%
System.Numerics.Tests.Perf_VectorOf<Int16>.GreaterThanAllBenchmark 209.57 ns 20.48 ns -189.08 ns 90%
System.Numerics.Tests.Perf_VectorOf<Int16>.GreaterThanOrEqualAllBenchmark 231.47 ns 23.03 ns -208.44 ns 90%
System.Numerics.Tests.Perf_VectorOf<Int16>.LessThanOrEqualAnyBenchmark 188.87 ns 20.02 ns -168.84 ns 89%
System.Numerics.Tests.Perf_VectorOf<Int16>.LessThanAnyBenchmark 186.21 ns 20.05 ns -166.16 ns 89%
System.Numerics.Tests.Perf_VectorOf<UInt16>.LessThanOrEqualAnyBenchmark 189.87 ns 20.76 ns -169.11 ns 89%
System.Numerics.Tests.Perf_VectorOf<UInt16>.LessThanAnyBenchmark 186.54 ns 21.38 ns -165.15 ns 89%
System.Memory.Span<Byte>.IndexOfAnyFourValues(Size: 512) 11.82 μs 1.60 μs -10.23 μs 87%
System.Memory.Span<Byte>.IndexOfAnyFiveValues(Size: 512) 14.32 μs 2.42 μs -11.90 μs 83%
System.Numerics.Tests.Perf_VectorOf<Int32>.GreaterThanAllBenchmark 120.71 ns 20.59 ns -100.11 ns 83%
System.Numerics.Tests.Perf_VectorOf<UInt32>.GreaterThanAllBenchmark 124.72 ns 21.39 ns -103.32 ns 83%
System.Numerics.Tests.Perf_VectorOf<Single>.GreaterThanOrEqualAllBenchmark 136.11 ns 24.20 ns -111.91 ns 82%
System.Numerics.Tests.Perf_VectorOf<Single>.GreaterThanAllBenchmark 128.50 ns 24.30 ns -104.20 ns 81%
System.Numerics.Tests.Perf_VectorOf<UInt64>.GreaterThanAllBenchmark 105.81 ns 20.48 ns -85.33 ns 81%
System.Numerics.Tests.Perf_VectorOf<Int64>.GreaterThanAllBenchmark 105.16 ns 20.57 ns -84.60 ns 80%

Preview 1 introduced a number of improvements worth calling out individually. This section covers only the major improvements, with high-level analysis.
Treat the analysis with caution; readers are encouraged to examine the benchmark reports for a thorough analysis.

The most improved benchmark groupings are System.Runtime.Vectors, System.Runtime.Intrinsics, and System.Collections, as outlined here and in dotnet/perf-autofiling-issues#10468.
Adding a stobj.vt.noref version for the no-reference case, which is twice as fast as stobj.v, improved over 400 microbenchmarks, as outlined in dotnet/perf-autofiling-issues#10468 and dotnet/perf-autofiling-issues#10464.

SpanHelpers are widely used in the BCL, so improvements to them can significantly improve performance. Changes in 200a90a, 7fa0d5b, and c0447bc removed Mono-specific SpanHelpers, replaced branch patterns with super-instructions, and improved detection of dead bblocks. Over 300 microbenchmarks improved, as outlined in dotnet/perf-autofiling-issues#10989 and dotnet/perf-autofiling-issues#11155.
Change #77331 simplified the getitem.span opcode and avoided the typical use of ldloca with it, which improved over 50 microbenchmarks.

Allowing vtypes with a single scalar field to be passed to native code using the faster code path improved the System.Text and System.Collections benchmark groupings, as outlined in dotnet/perf-autofiling-issues#10987 and dotnet/perf-autofiling-issues#10938. The assumption is that those libraries rely on ObjectHandleOnStack types.

An intrinsic for string allocation (newstr), added in #79392, improved various microbenchmarks, as outlined in dotnet/perf-autofiling-issues#10694 and dotnet/perf-autofiling-issues#10670.

9a65109 contributed to dotnet/perf-autofiling-issues#10695 and dotnet/perf-autofiling-issues#10671.

All of the changes mentioned above are speed improvements of microbenchmarks. There was also a significant size improvement in WebAssembly from #79672, which reduced size on disk (SOD) of the Blazor template application by ~270 KB by trimming the S.N.Vector class in non-SIMD cases. Deduplication of symbols in WebAssembly achieves additional size savings.

Regressions

Here is a list of the top 20 microbenchmark regressions in Preview 1.

Name Baseline Value Compare Value Difference % Difference
System.Numerics.Tests.Perf_VectorOf<Byte>.CountBenchmark 0.10 ns 1.10 ns 1.00 ns -969%
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql 11.63 μs 101.96 μs 90.33 μs -777%
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58l", ol 1.30 μs 8.82 μs 7.52 μs -578%
System.Tests.Perf_Byte.ToString(value: 255) 38.31 ns 257.96 ns 219.65 ns -573%
System.Tests.Perf_String.Replace_String(text: "This is a very nice sentence. This is another very nice sentence.", oldValue: "a", newValue: "b") 962.59 ns 6.30 μs 5335.40 ns -554%
PerfLabTests.LowLevelPerf.IntegerFormatting 6.08 ms 34.30 ms 28.21 ms -464%
System.Tests.Perf_Int32.ToString(value: 2147483647) 59.17 ns 332.19 ns 273.01 ns -461%
System.Tests.Perf_Int16.ToString(value: 32767) 53.24 ns 297.84 ns 244.60 ns -459%
System.Tests.Perf_Int32.ToString(value: 12345) 52.90 ns 293.56 ns 240.66 ns -455%
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'i', newChar: 'I') 531.46 ns 2.89 μs 2355.30 ns -443%
System.Tests.Perf_SByte.ToString(value: 127) 52.62 ns 276.41 ns 223.79 ns -425%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix4x4Benchmark 21.70 ns 108.97 ns 87.28 ns -402%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix4x4Benchmark 26.37 ns 114.02 ns 87.65 ns -332%
System.Numerics.Tests.Perf_Matrix4x4.MultiplyByMatrixOperatorBenchmark 246.08 ns 1.04 μs 797.11 ns -324%
System.Numerics.Tests.Perf_Matrix4x4.MultiplyByMatrixBenchmark 243.24 ns 1.02 μs 779.98 ns -321%
System.Tests.Perf_Byte.ToString(value: 0) 7.06 ns 27.18 ns 20.11 ns -285%
System.Numerics.Tests.Perf_Matrix4x4.CreateTranslationFromScalarXYZ 25.27 ns 91.61 ns 66.34 ns -263%
System.Numerics.Tests.Perf_Matrix4x4.AddBenchmark 90.93 ns 304.20 ns 213.27 ns -235%
System.Numerics.Tests.Perf_Matrix4x4.LerpBenchmark 141.51 ns 443.45 ns 301.94 ns -213%
System.Numerics.Tests.Perf_Matrix4x4.SubtractOperatorBenchmark 100.31 ns 307.60 ns 207.29 ns -207%

Here is a list of ongoing regressions in the Preview 1 snapshot, with short descriptions.

Issue report Description
dotnet/perf-autofiling-issues#12299 Extracted code outside of interp main loop
dotnet/perf-autofiling-issues#11449 Investigating
dotnet/perf-autofiling-issues#11453 Redundant ldloca and stfld opcodes in the new Matrix4x4 implementation
dotnet/perf-autofiling-issues#11147 New ASCII APIs
#79973 Dependencies update
#79336 Managed implementation of UInt32ToDecStr
#79876 Unoptimized pattern ldstr; if (uncommon) throw ex (string)
@kotlarmilos kotlarmilos added tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark tracking This issue is tracking the completion of other related issues. labels Apr 4, 2023
@kotlarmilos kotlarmilos added this to the Future milestone Apr 4, 2023
@kotlarmilos kotlarmilos self-assigned this Apr 4, 2023

ghost commented Apr 4, 2023

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

This report provides an overview of the major performance improvements and regressions in Mono AOT and Interpreter during the timeframe of .NET 8 per-preview releases.

[WIP] Preview 3

This report presents an overview of the major performance improvements and regressions in Mono AOT and Interpreter in .NET 8 Preview 3.
Full benchmark report will be available in form similar to #79245 and https://devblogs.microsoft.com/dotnet/performance_improvements_in_net_7/ when .NET 8 is released.

Preview 3 introduced a number of improvements worth calling out individually. This section covers only the major improvements, with high-level analysis. Treat the analysis with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Setup

According to the https://github.com/dotnet/perf-autofiling-issues, the following configurations are used.

Operating System Bit Processor Name
macOS 13.0 Arm64 Apple M1
ubuntu 18.04 X64 Intel Xeon CPU E5-1650 v4 3.60GHz

More details on .NET performance benchmarking are available at https://github.com/dotnet/performance.

Mono AOT compiler

The following sections present improvements and regressions introduced in the Mono AOT compiler in Preview 3.

Improvements

The most improved benchmark grouping is System.Numerics, as outlined in dotnet/perf-autofiling-issues#14023, dotnet/perf-autofiling-issues#14224, dotnet/perf-autofiling-issues#14573, and dotnet/perf-autofiling-issues#14322. The changes implemented in #82420, #83337, and #83094 introduced Arm64 SIMD operations and improved about 1000 microbenchmarks.

Regressions

This report focuses on relevant regressions that are either in progress or under investigation; they are tracked separately. Reports #77490 and #79288 track active speed and size regressions, respectively.

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 3.

Improvements

The most improved benchmark groupings are System.Buffers, System.Collections, System.Memory, and System.Text, as outlined in dotnet/perf-autofiling-issues#14324, dotnet/perf-autofiling-issues#14325, dotnet/perf-autofiling-issues#14326, dotnet/perf-autofiling-issues#14355, dotnet/perf-autofiling-issues#14359, and dotnet/perf-autofiling-issues#14361. The changes implemented in #83498 and #83490 increased the inlining length limit from 20 to 30 and implemented shr.un.imm, which together improved over 1000 microbenchmarks.

The change "Add vector horizontal sums on Arm64" (#83675) improved about 20 microbenchmarks, as detailed in dotnet/perf-autofiling-issues#14531.
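
For readers unfamiliar with the term, a horizontal sum reduces all lanes of a single vector to one scalar, as opposed to adding two vectors lane by lane. The following Python sketch shows one common way to express it, by repeatedly adding the two halves of the vector. It only illustrates the operation; `horizontal_sum` is a hypothetical name, and how #83675 maps the operation to Arm64 instructions is not shown here.

```python
def horizontal_sum(lanes):
    """Sum all lanes of one vector by repeatedly adding its two halves."""
    lanes = list(lanes)
    # Assumes a power-of-two lane count, as with 128-bit vectors (2, 4, 8, 16 lanes).
    if not lanes or len(lanes) & (len(lanes) - 1) != 0:
        raise ValueError("lane count must be a power of two")
    while len(lanes) > 1:
        half = len(lanes) // 2
        lanes = [lanes[i] + lanes[i + half] for i in range(half)]
    return lanes[0]


if __name__ == "__main__":
    print(horizontal_sum([1, 2, 3, 4]))  # prints 10
```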

Regressions

This report focuses on relevant regressions that are either being worked on or under investigation; they are tracked separately. Reports #77490 and #79288 track active speed and size regressions, respectively.


Preview 2

This report presents an overview of major performance improvements and regressions in Mono AOT and Interpreter in .NET 8 Preview 2.
The full benchmark report will be available in a form similar to #79245 and https://devblogs.microsoft.com/dotnet/performance_improvements_in_net_7/ when .NET 8 is released.

Several improvements introduced in Preview 2 are worth calling out individually. This section presents only the major improvements, with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.

Setup

According to https://github.com/dotnet/perf-autofiling-issues, the following configurations are used.

Operating System Bit Processor Name
macOS 13.0 Arm64 Apple M1
ubuntu 18.04 X64 Intel Xeon CPU E5-1650 v4 3.60GHz

More details on .NET performance benchmarking are available at https://github.com/dotnet/performance.

Mono AOT compiler

The following sections present improvements and regressions introduced in the Mono AOT compiler in Preview 2.

Improvements

Here are the top 20 microbenchmark improvements in Preview 2. The full report is available here.

Name Baseline Value Compare Value Difference % Difference
System.Collections.Concurrent.Count<Int32>.Dictionary(Size: 512) 34.07 μs 310.43 ns -33756.76 ns 99%
System.Collections.Concurrent.Count<String>.Dictionary(Size: 512) 17.32 μs 314.25 ns -17007.28 ns 98%
System.Tests.Perf_Decimal.Floor 81.17 ns 16.81 ns -64.36 ns 79%
System.Tests.Perf_Decimal.Round 82.24 ns 18.69 ns -63.55 ns 77%
System.Tests.Perf_UInt32.TryFormat(value: 0) 78.23 ns 20.05 ns -58.18 ns 74%
System.Tests.Perf_Int32.TryFormat(value: 4) 78.02 ns 20.47 ns -57.55 ns 74%
System.Collections.TryGetValueFalse<String, String>.ConcurrentDictionary(Size: 512) 44.69 μs 12.92 μs -31.77 μs 71%
System.Tests.Perf_Decimal.Divide 346.08 ns 102.16 ns -243.92 ns 70%
System.Collections.ContainsKeyFalse<String, String>.ConcurrentDictionary(Size: 512) 45.29 μs 13.50 μs -31.79 μs 70%
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_HeavyEscaping(NumberOfBytes: 1000) 8.93 μs 2.77 μs -6.16 μs 69%
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_NoEscaping(NumberOfBytes: 1000) 8.83 μs 2.77 μs -6.06 μs 69%
System.Tests.Perf_UInt64.TryFormat(value: 0) 84.40 ns 26.53 ns -57.87 ns 69%
System.Tests.Perf_Byte.ToString(value: 255) 91.65 ns 29.95 ns -61.69 ns 67%
System.Tests.Perf_Version.TryFormat3 265.42 ns 88.04 ns -177.38 ns 67%
System.Tests.Perf_Version.TryFormat4 345.05 ns 115.05 ns -230.00 ns 67%
System.Collections.TryGetValueTrue<String, String>.ConcurrentDictionary(Size: 512) 49.50 μs 16.53 μs -32.97 μs 67%
System.Tests.Perf_Version.TryFormat2 176.63 ns 59.61 ns -117.02 ns 66%
System.Collections.ContainsKeyTrue<String, String>.ConcurrentDictionary(Size: 512) 50.43 μs 17.54 μs -32.89 μs 65%
LinqBenchmarks.Where01ForX 1.57 secs 548.00 ms -1022.61 ms 65%
LinqBenchmarks.Where01LinqMethodX 1.68 secs 588.39 ms -1095.38 ms 65%
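
The % Difference column is not defined in this report. The rows appear consistent with a relative change measured against the baseline, where positive values are improvements and negative values are regressions. A minimal Python sketch under that assumption follows; the helper names are hypothetical and are not part of the benchmark tooling.

```python
# Multipliers to nanoseconds for the units that appear in the tables.
NS_PER_UNIT = {"ns": 1.0, "μs": 1e3, "ms": 1e6, "secs": 1e9}


def to_ns(value, unit):
    """Convert a timing such as (34.07, "μs") to nanoseconds."""
    return value * NS_PER_UNIT[unit]


def percent_difference(baseline_ns, compare_ns):
    """Relative change against the baseline; positive means faster.

    Assumed formula, inferred from the table rows rather than taken
    from the benchmark tooling itself.
    """
    return (baseline_ns - compare_ns) / baseline_ns * 100.0


if __name__ == "__main__":
    # System.Tests.Perf_Decimal.Floor row: 81.17 ns -> 16.81 ns
    base, comp = to_ns(81.17, "ns"), to_ns(16.81, "ns")
    print(f"{comp - base:.2f} ns, {percent_difference(base, comp):.0f}%")  # -64.36 ns, 79%
```

Converting both values to the same unit first matters for rows where the baseline and compare values are reported in different units, such as the first row of the table (μs against ns).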

The most improved benchmark groupings are System.Collections, System.Decimal, System.Int, and System.Text, as outlined in dotnet/perf-autofiling-issues#12996, dotnet/perf-autofiling-issues#13006, dotnet/perf-autofiling-issues#13217, and dotnet/perf-autofiling-issues#13264. The changes implemented in #81695 intrinsified RuntimeHelpers.CreateSpan<T>, which is widely used in the BCL, and replaced the icall path.

Arm64 SIMD operations implemented in #83094 and #82420 improved over 1000 microbenchmarks, according to dotnet/perf-autofiling-issues#13808, dotnet/perf-autofiling-issues#13807, dotnet/perf-autofiling-issues#14023, and dotnet/perf-autofiling-issues#13990.

The System.Collections benchmark grouping was improved by the changes made in #81902, as outlined in dotnet/perf-autofiling-issues#13220. The changes added support for v128 constants and improved performance in about 75 microbenchmarks.

The System.Text benchmark grouping was improved by the addition of System.Runtime.Intrinsics (S.R.I) vectors in JsonReaderHelper, introduced in #81758 and outlined in dotnet/perf-autofiling-issues#12993. Furthermore, the improved handling of the ldtoken+ldtoken+Type::op_Equality pattern implemented in #81277 significantly improved the System.Text benchmark grouping, as detailed in dotnet/perf-autofiling-issues#12313.

The changes introduced in #81306, which removed types deriving from JsonTypeInfo<T>, had a positive impact on the System.Numerics and System.Collections benchmark groupings, as reported in dotnet/perf-autofiling-issues#12488 and dotnet/perf-autofiling-issues#12550.

All of the changes mentioned above are speed-related improvements of microbenchmarks. There was also a significant size improvement on WASM and iOS from enabling deduplication of generics. Issue #80419 contains references to changes that reduced size on disk (SOD) by about 11% and 3%, respectively.

Regressions

Here are the top 20 microbenchmark regressions in Preview 2. The full report is available here.

Name Baseline Value Compare Value Difference % Difference
System.Tests.Perf_Random.Next_long_unseeded 10.17 ns 28.84 ns 18.67 ns -184%
System.Numerics.Tests.Perf_Vector4.EqualityOperatorBenchmark 0.79 ns 1.96 ns 1.17 ns -148%
System.Numerics.Tests.Perf_Vector3.TransformByMatrix4x4Benchmark 60.14 ns 140.30 ns 80.17 ns -133%
System.Numerics.Tests.Perf_Vector3.TransformNormalByMatrix4x4Benchmark 60.73 ns 132.19 ns 71.46 ns -118%
System.Numerics.Tests.Perf_Vector4.TransformVector3ByMatrix4x4Benchmark 62.72 ns 131.48 ns 68.76 ns -110%
System.Numerics.Tests.Perf_Vector4.TransformByMatrix4x4Benchmark 63.09 ns 131.10 ns 68.00 ns -108%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix4x4Benchmark 56.47 ns 112.12 ns 55.65 ns -99%
System.Numerics.Tests.Perf_Quaternion.LengthSquaredBenchmark 7.76 ns 14.35 ns 6.59 ns -85%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix4x4Benchmark 56.66 ns 103.10 ns 46.44 ns -82%
System.Numerics.Tests.Perf_Vector4.TransformVector2ByMatrix4x4Benchmark 61.08 ns 103.66 ns 42.58 ns -70%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix3x2Benchmark 20.85 ns 35.00 ns 14.15 ns -68%
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_uint 667.85 ns 1.10 μs 428.39 ns -64%
System.Tests.Perf_Random.Next_long_long_unseeded 14.28 ns 22.44 ns 8.15 ns -57%
System.Numerics.Tests.Perf_Quaternion.ConjugateBenchmark 18.32 ns 28.76 ns 10.44 ns -57%
System.Numerics.Tests.Perf_Quaternion.InverseBenchmark 26.70 ns 41.60 ns 14.89 ns -56%
System.Numerics.Tests.Perf_Quaternion.LengthBenchmark 13.45 ns 20.35 ns 6.90 ns -51%
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_ulong 745.74 ns 1.10 μs 357.01 ns -48%
System.Numerics.Tests.Perf_BitOperations.Log2_ulong 894.61 ns 1.32 μs 425.98 ns -48%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix3x2Benchmark 21.03 ns 30.87 ns 9.85 ns -47%
System.Numerics.Tests.Perf_Vector3.ReflectBenchmark 37.23 ns 54.13 ns 16.90 ns -45%

This report focuses on relevant regressions that are either being worked on or under investigation; they are tracked separately. Reports #77490 and #79288 track active speed and size regressions, respectively.

Here is a list of ongoing regressions in the Preview 2 snapshot, each with a short description.

Issue report Description
dotnet/perf-autofiling-issues#12546 Quaternion and Plane SIMD intrinsics
dotnet/perf-autofiling-issues#12957 Improve ConcurrentDictionary performance for strings
dotnet/perf-autofiling-issues#12660 Improved codegen of the vector accelerated System.Numerics.* types
dotnet/perf-autofiling-issues#13187 Implementation of Lemire's nearly divisionless method
dotnet/perf-autofiling-issues#13500 Use of Array.Reverse<T> in ImmutableArray<T>.Builder.Reverse
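
For context on the entry that mentions Lemire's nearly divisionless method: the technique maps a uniformly random 64-bit value into the range [0, bound) without bias, and needs a modulo operation only on a rarely taken slow path. The Python sketch below illustrates the published algorithm. It is not the .NET implementation, and `bounded_random` and its `next_u64` parameter are hypothetical names used only for this example.

```python
import random

MASK64 = (1 << 64) - 1


def bounded_random(bound, next_u64=lambda: random.getrandbits(64)):
    """Return an unbiased integer in [0, bound) for 0 < bound < 2**64."""
    if not 0 < bound < (1 << 64):
        raise ValueError("bound must be in (0, 2**64)")
    product = next_u64() * bound      # conceptually a 64x64 -> 128-bit multiply
    low = product & MASK64            # low 64 bits of the product
    if low < bound:
        # Slow path: the only place where a modulo is computed.
        threshold = ((1 << 64) - bound) % bound
        while low < threshold:
            # Reject values that would bias the result and draw again.
            product = next_u64() * bound
            low = product & MASK64
    return product >> 64              # high 64 bits are the result


if __name__ == "__main__":
    print([bounded_random(6) for _ in range(10)])
```

The common case costs one multiplication and one comparison, which is why the method is attractive for native code; the report does not analyze why it regressed the affected benchmarks under Mono AOT.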

Mono Interpreter

The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 2.

Improvements

Here are the top 20 microbenchmark improvements in Preview 2. The full report is available here.

Name Baseline Value Compare Value Difference % Difference
System.Collections.Concurrent.Count<Int32>.Dictionary(Size: 512) 140.03 μs 1.76 μs -138.26 μs 99%
System.Collections.Concurrent.Count<String>.Dictionary(Size: 512) 136.03 μs 1.86 μs -134.17 μs 99%
System.Threading.Tests.Perf_Interlocked.CompareExchange_long 37.56 ns 6.66 ns -30.90 ns 82%
System.Threading.Tests.Perf_Interlocked.CompareExchange_int 34.18 ns 8.33 ns -25.85 ns 76%
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: False) 3.81 μs 1.09 μs -2.72 μs 71%
System.Numerics.Tests.Perf_Vector4.ZeroBenchmark 3.21 ns 0.99 ns -2.22 ns 69%
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: False) 3.42 μs 1.06 μs -2.36 μs 69%
System.Tests.Perf_Decimal.Floor 175.25 ns 65.77 ns -109.48 ns 62%
System.Numerics.Tests.Perf_Quaternion.LengthBenchmark 63.64 ns 24.08 ns -39.56 ns 62%
System.Numerics.Tests.Perf_Quaternion.InequalityOperatorBenchmark 89.74 ns 34.82 ns -54.93 ns 61%
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: False, UseSharedPool: False) 4.34 μs 1.70 μs -2.64 μs 61%
System.Tests.Perf_Decimal.Round 191.52 ns 75.77 ns -115.76 ns 60%
System.Numerics.Tests.Perf_Quaternion.DotBenchmark 77.60 ns 31.33 ns -46.27 ns 60%
System.Numerics.Tests.Perf_Quaternion.DivideBenchmark 88.55 ns 36.47 ns -52.07 ns 59%
System.Tests.Perf_Random.Next_int_int_unseeded 154.47 ns 65.37 ns -89.11 ns 58%
System.Numerics.Tests.Perf_Quaternion.IsIdentityBenchmark 81.52 ns 35.06 ns -46.46 ns 57%
System.Numerics.Tests.Perf_Quaternion.SubtractionOperatorBenchmark 83.75 ns 36.09 ns -47.67 ns 57%
System.Numerics.Tests.Perf_Quaternion.SubtractBenchmark 84.49 ns 36.50 ns -47.99 ns 57%
System.Collections.CtorFromCollection<Int32>.ConcurrentDictionary(Size: 512) 461.77 μs 200.10 μs -261.67 μs 57%
System.Tests.Perf_UInt64.TryFormat(value: 0) 250.12 ns 109.72 ns -140.40 ns 56%

The most improved benchmark groupings are System.Collections, System.Numerics, and System.Decimal, as outlined in dotnet/perf-autofiling-issues#12504, dotnet/perf-autofiling-issues#12544, dotnet/perf-autofiling-issues#13303, dotnet/perf-autofiling-issues#13247, dotnet/perf-autofiling-issues#13752, dotnet/perf-autofiling-issues#13761, and dotnet/perf-autofiling-issues#12744. Together, the changes in #81335, which intrinsified System.Numerics.* types, in #82093, which intrinsified CreateSpan, and in #81782, which introduced common Vector128 SIMD operations widely used in the BCL, improved over 1000 microbenchmarks.

The implementation of synch block fast paths created a regression in the Mono AOT compiler (#81380), but led to an improvement of about 100 microbenchmarks in the Mono Interpreter, as detailed in dotnet/perf-autofiling-issues#13245.

Similar to the change in the AOT compiler, the changes introduced in #81306, which removed types deriving from JsonTypeInfo<T>, improved several microbenchmarks in the Mono Interpreter. The change "Improve ConcurrentDictionary performance for strings" (#81557) led to the improvements in dotnet/perf-autofiling-issues#13003. Code refactoring also led to several improvements, presented in dotnet/perf-autofiling-issues#12301.

Regressions

Here are the top 20 microbenchmark regressions in Preview 2. The full report is available here.

Name Baseline Value Compare Value Difference % Difference
System.Numerics.Tests.Perf_VectorOf<UInt64>.CountBenchmark 0.06 ns 3.10 ns 3.04 ns -5,059%
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.CountBenchmark 0.36 ns 1.75 ns 1.39 ns -391%
System.Collections.TryAddDefaultSize<String>.ConcurrentDictionary(Count: 512) 297.96 μs 574.34 μs 276.38 μs -93%
System.Numerics.Tests.Perf_Vector2.UnitYBenchmark 7.38 ns 13.69 ns 6.31 ns -85%
HardwareIntrinsics.RayTracer.SoA.Render 2.41 ns 4.38 ns 1.97 ns -82%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix3x2Benchmark 48.06 ns 86.28 ns 38.22 ns -80%
System.IO.Compression.Brotli.Compress_WithoutState(level: Fastest, file: "TestDocument.pdf") 291.36 μs 522.83 μs 231.47 μs -79%
System.IO.Compression.Brotli.Compress_WithState(level: Fastest, file: "TestDocument.pdf") 296.93 μs 525.99 μs 229.06 μs -77%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix3x2Benchmark 44.65 ns 75.61 ns 30.96 ns -69%
System.Memory.Constructors_ValueTypesOnly<Byte>.ReadOnlyFromPointerLength 6.33 ns 10.49 ns 4.16 ns -66%
PerfLabTests.EnumPerf.ObjectGetTypeNoBoxing 3.87 ns 6.20 ns 2.32 ns -60%
System.Numerics.Tests.Perf_Vector3.SquareRootBenchmark 23.34 ns 37.02 ns 13.68 ns -59%
System.Numerics.Tests.Perf_Vector3.TransformNormalByMatrix4x4Benchmark 124.53 ns 196.66 ns 72.12 ns -58%
System.Diagnostics.Perf_Process.StartAndWaitForExit 871.51 μs 1.35 ms 474.57 μs -54%
System.Numerics.Tests.Perf_Vector3.TransformByMatrix4x4Benchmark 144.68 ns 217.99 ns 73.31 ns -51%
System.Collections.AddGivenSize<String>.List(Size: 512) 12.21 μs 18.32 μs 6.11 μs -50%
System.IO.Tests.BinaryWriterExtendedTests.WriteAsciiCharArray(StringLengthInChars: 2000000) 8.14 ms 12.20 ms 4.06 ms -50%
System.Numerics.Tests.Perf_VectorOf<Int32>.ZeroBenchmark 3.20 ns 4.80 ns 1.59 ns -50%
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: True) 5.73 μs 8.56 μs 2.83 μs -49%
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: True) 5.62 μs 8.37 μs 2.75 μs -49%

This report focuses on relevant regressions that are either being worked on or under investigation; they are tracked separately. Reports #77490 and #79288 track active speed and size regressions, respectively.

Here is a list of ongoing regressions in the Preview 2 snapshot, each with a short description.

Issue report Description
dotnet/perf-autofiling-issues#12707 Use of unimplemented Vector operations
dotnet/perf-autofiling-issues#13747 Intrinsified common Vector128 operations

Preview 1

This report presents an overview of major performance improvements and regressions in the Mono Interpreter in .NET 8 Preview 1.
The full benchmark report will be available in a form similar to #79245 and https://devblogs.microsoft.com/dotnet/performance_improvements_in_net_7/.

Setup

According to https://github.com/dotnet/perf-autofiling-issues, the following configurations are used.

Operating System Bit Processor Name
macOS 13.0 Arm64 Apple M1
ubuntu 18.04 X64 Intel Xeon CPU E5-1650 v4 3.60GHz

More details on .NET performance benchmarking are available at https://github.com/dotnet/performance.

Improvements

Here are the top 20 microbenchmark improvements in Preview 1.

Name Baseline Value Compare Value Difference % Difference
System.Numerics.Tests.Perf_VectorOf<Byte>.LessThanAnyBenchmark 292.17 ns 18.88 ns -273.29 ns 94%
System.Numerics.Tests.Perf_VectorOf<Byte>.LessThanOrEqualAnyBenchmark 298.08 ns 20.47 ns -277.61 ns 93%
System.Numerics.Tests.Perf_VectorOf<SByte>.LessThanOrEqualAnyBenchmark 294.38 ns 20.33 ns -274.05 ns 93%
System.Numerics.Tests.Perf_VectorOf<SByte>.LessThanAnyBenchmark 298.45 ns 20.63 ns -277.82 ns 93%
System.Numerics.Tests.Perf_VectorOf<Byte>.GreaterThanOrEqualAllBenchmark 331.73 ns 24.25 ns -307.48 ns 93%
System.Numerics.Tests.Perf_VectorOf<UInt16>.GreaterThanOrEqualAllBenchmark 218.05 ns 20.58 ns -197.47 ns 91%
System.Numerics.Tests.Perf_VectorOf<Int16>.GreaterThanAllBenchmark 209.57 ns 20.48 ns -189.08 ns 90%
System.Numerics.Tests.Perf_VectorOf<Int16>.GreaterThanOrEqualAllBenchmark 231.47 ns 23.03 ns -208.44 ns 90%
System.Numerics.Tests.Perf_VectorOf<Int16>.LessThanOrEqualAnyBenchmark 188.87 ns 20.02 ns -168.84 ns 89%
System.Numerics.Tests.Perf_VectorOf<Int16>.LessThanAnyBenchmark 186.21 ns 20.05 ns -166.16 ns 89%
System.Numerics.Tests.Perf_VectorOf<UInt16>.LessThanOrEqualAnyBenchmark 189.87 ns 20.76 ns -169.11 ns 89%
System.Numerics.Tests.Perf_VectorOf<UInt16>.LessThanAnyBenchmark 186.54 ns 21.38 ns -165.15 ns 89%
System.Memory.Span<Byte>.IndexOfAnyFourValues(Size: 512) 11.82 μs 1.60 μs -10.23 μs 87%
System.Memory.Span<Byte>.IndexOfAnyFiveValues(Size: 512) 14.32 μs 2.42 μs -11.90 μs 83%
System.Numerics.Tests.Perf_VectorOf<Int32>.GreaterThanAllBenchmark 120.71 ns 20.59 ns -100.11 ns 83%
System.Numerics.Tests.Perf_VectorOf<UInt32>.GreaterThanAllBenchmark 124.72 ns 21.39 ns -103.32 ns 83%
System.Numerics.Tests.Perf_VectorOf<Single>.GreaterThanOrEqualAllBenchmark 136.11 ns 24.20 ns -111.91 ns 82%
System.Numerics.Tests.Perf_VectorOf<Single>.GreaterThanAllBenchmark 128.50 ns 24.30 ns -104.20 ns 81%
System.Numerics.Tests.Perf_VectorOf<UInt64>.GreaterThanAllBenchmark 105.81 ns 20.48 ns -85.33 ns 81%
System.Numerics.Tests.Perf_VectorOf<Int64>.GreaterThanAllBenchmark 105.16 ns 20.57 ns -84.60 ns 80%
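
Most rows above exercise the "Any" and "All" comparison helpers on Vector<T>. As a reminder of their semantics, an "Any" comparison is true when the relation holds for at least one pair of corresponding elements, and an "All" comparison is true only when it holds for every pair. The Python sketch below models this on plain lists; the function names mirror the benchmark names but are otherwise hypothetical, and the sketch says nothing about how the interpreter executes these operations.

```python
import operator


def any_lanes(op, left, right):
    """True if op(l, r) holds for at least one pair of corresponding lanes."""
    if len(left) != len(right):
        raise ValueError("vectors must have the same lane count")
    return any(op(l, r) for l, r in zip(left, right))


def all_lanes(op, left, right):
    """True if op(l, r) holds for every pair of corresponding lanes."""
    if len(left) != len(right):
        raise ValueError("vectors must have the same lane count")
    return all(op(l, r) for l, r in zip(left, right))


def less_than_any(left, right):
    return any_lanes(operator.lt, left, right)


def greater_than_or_equal_all(left, right):
    return all_lanes(operator.ge, left, right)


if __name__ == "__main__":
    print(less_than_any([1, 5, 9], [2, 0, 0]))          # True: 1 < 2
    print(greater_than_or_equal_all([3, 1], [3, 2]))    # False: 1 < 2
```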

Several improvements introduced in Preview 1 are worth calling out individually. This section presents only the major improvements, with high-level analysis.
The analysis should be treated with caution, and readers are encouraged to examine the benchmark reports for a thorough analysis.

The most improved benchmark groupings are System.Runtime.Vectors, System.Runtime.Intrinsics, and System.Collections, as outlined here and in dotnet/perf-autofiling-issues#10468.
Adding a stobj.vt.noref version for the no-reference case, which is twice as fast as stobj.v, improved over 400 microbenchmarks, as outlined in dotnet/perf-autofiling-issues#10468 and dotnet/perf-autofiling-issues#10464.

SpanHelpers are widely used in the BCL, so improvements related to them can significantly improve performance. Changes in 200a90a, 7fa0d5b, and c0447bc removed Mono-specific SpanHelpers, replaced branch patterns with super-instructions, and improved detection of dead bblocks. Over 300 microbenchmarks improved, as outlined in dotnet/perf-autofiling-issues#10989 and dotnet/perf-autofiling-issues#11155.
Change #77331 simplified the getitem.span opcode and avoided the typical use of ldloca with it, which improved over 50 microbenchmarks.
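
The report does not describe how dead bblocks (basic blocks) are detected. As a general illustration only, a block is dead when no path from the method's entry block reaches it, which can be found with a reachability walk over the control-flow graph. The Python sketch below assumes a hypothetical graph given as a dictionary from block name to successor names; it is not the interpreter's algorithm.

```python
def find_dead_blocks(successors, entry):
    """Return the set of blocks that cannot be reached from the entry block."""
    reachable = set()
    stack = [entry]
    while stack:
        block = stack.pop()
        if block in reachable:
            continue
        reachable.add(block)
        # Follow every outgoing edge of the block.
        stack.extend(successors.get(block, ()))
    return set(successors) - reachable


if __name__ == "__main__":
    cfg = {
        "entry": ["a", "b"],
        "a": ["exit"],
        "b": ["exit"],
        "dead": ["exit"],   # no block branches here
        "exit": [],
    }
    print(find_dead_blocks(cfg, "entry"))  # prints {'dead'}
```

Removing such blocks before code generation means fewer instructions to process and can expose further simplifications, for example when a branch condition has become constant.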

Allowing vtypes with a single scalar field to be passed to native code using the faster code path improved the System.Text and System.Collections benchmark groupings, as outlined in dotnet/perf-autofiling-issues#10987 and dotnet/perf-autofiling-issues#10938. The assumption is that those libraries rely on ObjectHandleOnStack types.

The newstr intrinsic for string allocation, added in #79392, improved various microbenchmarks, as outlined in dotnet/perf-autofiling-issues#10694 and dotnet/perf-autofiling-issues#10670.

Commit 9a65109 contributed to dotnet/perf-autofiling-issues#10695 and dotnet/perf-autofiling-issues#10671.

We encourage readers to examine the benchmark reports and to call out major improvements not mentioned in this section.

All of the changes mentioned above are speed improvements of microbenchmarks. There was also a significant size improvement in WebAssembly from #79672, which reduced size on disk (SOD) of the Blazor template application by ~270 KB by trimming the S.N.Vector class in non-SIMD cases. Deduplication of symbols in WebAssembly achieves additional size savings.

Regressions

Here are the top 20 microbenchmark regressions in Preview 1.

Name Baseline Value Compare Value Difference % Difference
System.Numerics.Tests.Perf_VectorOf<Byte>.CountBenchmark 0.10 ns 1.10 ns 1.00 ns -969%
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql 11.63 μs 101.96 μs 90.33 μs -777%
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58l", ol 1.30 μs 8.82 μs 7.52 μs -578%
System.Tests.Perf_Byte.ToString(value: 255) 38.31 ns 257.96 ns 219.65 ns -573%
System.Tests.Perf_String.Replace_String(text: "This is a very nice sentence. This is another very nice sentence.", oldValue: "a", newValue: "b") 962.59 ns 6.30 μs 5335.40 ns -554%
PerfLabTests.LowLevelPerf.IntegerFormatting 6.08 ms 34.30 ms 28.21 ms -464%
System.Tests.Perf_Int32.ToString(value: 2147483647) 59.17 ns 332.19 ns 273.01 ns -461%
System.Tests.Perf_Int16.ToString(value: 32767) 53.24 ns 297.84 ns 244.60 ns -459%
System.Tests.Perf_Int32.ToString(value: 12345) 52.90 ns 293.56 ns 240.66 ns -455%
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'i', newChar: 'I') 531.46 ns 2.89 μs 2355.30 ns -443%
System.Tests.Perf_SByte.ToString(value: 127) 52.62 ns 276.41 ns 223.79 ns -425%
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix4x4Benchmark 21.70 ns 108.97 ns 87.28 ns -402%
System.Numerics.Tests.Perf_Vector2.TransformByMatrix4x4Benchmark 26.37 ns 114.02 ns 87.65 ns -332%
System.Numerics.Tests.Perf_Matrix4x4.MultiplyByMatrixOperatorBenchmark 246.08 ns 1.04 μs 797.11 ns -324%
System.Numerics.Tests.Perf_Matrix4x4.MultiplyByMatrixBenchmark 243.24 ns 1.02 μs 779.98 ns -321%
System.Tests.Perf_Byte.ToString(value: 0) 7.06 ns 27.18 ns 20.11 ns -285%
System.Numerics.Tests.Perf_Matrix4x4.CreateTranslationFromScalarXYZ 25.27 ns 91.61 ns 66.34 ns -263%
System.Numerics.Tests.Perf_Matrix4x4.AddBenchmark 90.93 ns 304.20 ns 213.27 ns -235%
System.Numerics.Tests.Perf_Matrix4x4.LerpBenchmark 141.51 ns 443.45 ns 301.94 ns -213%
System.Numerics.Tests.Perf_Matrix4x4.SubtractOperatorBenchmark 100.31 ns 307.60 ns 207.29 ns -207%

This report focuses on relevant regressions that are either being worked on or under investigation; they are tracked separately. Reports #77490 and #79288 track active speed and size regressions, respectively.

Here is a list of ongoing regressions in the Preview 1 snapshot, each with a short description.

Issue report Description
dotnet/perf-autofiling-issues#12299 Extracted code outside of interp main loop
dotnet/perf-autofiling-issues#11449 Investigating
dotnet/perf-autofiling-issues#11453 Redundant ldloca and stfld opcodes in the new Matrix4x4 implementation
dotnet/perf-autofiling-issues#11147 New ASCII APIs
#79973 Dependencies update
#79336 Managed implementation of UInt32ToDecStr
#79876 Unoptimized pattern ldstr; if (uncommon) throw ex (string)
Author: kotlarmilos
Assignees: kotlarmilos
Labels:

area-System.Numerics, tenet-performance, tenet-performance-benchmarks, tracking

Milestone: Future

@kotlarmilos kotlarmilos modified the milestones: Future, 8.0.0 May 22, 2023
@kotlarmilos kotlarmilos changed the title .NET 8 Per-Preview Performance report on Mono AOT and Interpreter .NET 8 Per-Preview Performance report on WASM, Mono AOT, and Interpreter Jul 12, 2023
@kotlarmilos kotlarmilos modified the milestones: 8.0.0, 9.0.0 Aug 11, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Jan 8, 2024