[mono] Implement synch block fast paths in managed #81380

Merged: 44 commits, Feb 15, 2023

Conversation

@lambdageek (Member)

Investigate one incremental alternative from #48058

Rather than switching to NativeAOT's Monitor implementation in one step, first get access to Mono's synchronization block from managed, and implement the fast paths for hash code and monitor code in C#.
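For context, here is a rough sketch of the kind of lock-word view this gives managed code. The flag bits, shift, and names below are illustrative assumptions for this sketch, not Mono's exact encoding:

  // Illustrative only: a thin-lock word as seen by the managed fast paths.
  internal readonly struct LockWord
  {
      private readonly long _value;

      private const long StatusInflated   = 1;  // fat lock installed
      private const long StatusHasHash    = 2;  // hash code stored in the word
      private const int  OwnerOrHashShift = 10; // payload sits above the status/nest bits

      private LockWord(long value) => _value = value;

      public static LockWord FromRaw(long value) => new LockWord(value);

      public bool IsFree     => _value == 0;
      public bool IsInflated => (_value & StatusInflated) != 0;
      public bool HasHash    => (_value & StatusHasHash) != 0;

      // For a flat (thin) lock the payload is the owner's small thread id;
      // for a flat word carrying a hash it is the hash code itself.
      public long Payload => (long)((ulong)_value >> OwnerOrHashShift);
  }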

@vargaz (Contributor) commented Feb 1, 2023

This might lead to perf problems since the previous optimized C code is replaced by the not-so-optimized JITted code.

@lambdageek (Member, Author) commented Feb 1, 2023

> This might lead to perf problems since the previous optimized C code is replaced by the not-so-optimized JITted code.

@vargaz not everything has intrinsics (Monitor_test_synchronised and Monitor_test_owner were just plain icalls). Also the interpreter doesn't have intrinsics for Enter(object, bool&).

On JIT/AOT the intrinsic is:

  if (mono_monitor_enter_v4_fast (_lock, ref lockTaken)) // no wrapper icall
      goto done;
  mono_monitor_enter_v4_internal (_lock, ref lockTaken); // icall with wrapper
  done:

I think it probably makes sense to keep this, but it's also possible that mono_monitor_enter_v4_fast is slower than a managed fast path. With aggressive inlining it could just turn into a volatile load and a couple of ALU ops and a CompareExchange. That might be cheaper than an icall.

For interp+AOT I think it will for sure be cheaper than an interp->native transition.

By the way, my plan is just to do the fast paths in managed. If we see contention, we'll still call mono_monitor_enter_v4_internal - that code is just too difficult to do in managed and would have a ton of icalls.
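A minimal sketch of such a managed fast path, assuming hypothetical helpers GetLockWordRef, GetCurrentThreadSmallId and EnterSlow; this is not the PR's actual code:

  using System;
  using System.Runtime.CompilerServices;
  using System.Threading;

  internal static class MonitorFastPath
  {
      private const int OwnerShift = 10; // illustrative; mirrors the shift in the IR dumps below

      [MethodImpl(MethodImplOptions.AggressiveInlining)]
      public static void Enter(object obj, ref bool lockTaken)
      {
          ArgumentNullException.ThrowIfNull(obj);

          // Uncontended case: CAS our small thread id into a zero lock word.
          ref long lockWord = ref GetLockWordRef(obj); // hypothetical header accessor
          long desired = (long)GetCurrentThreadSmallId() << OwnerShift;
          if (Interlocked.CompareExchange(ref lockWord, desired, 0) == 0)
          {
              lockTaken = true; // pure ALU ops plus one CAS, no icall
              return;
          }

          // Contention, recursion, inflation, etc.: defer to the native slow path.
          EnterSlow(obj, ref lockTaken);
      }

      // Hypothetical stubs standing in for runtime internals.
      private static ref long GetLockWordRef(object obj) => throw new NotImplementedException();
      private static uint GetCurrentThreadSmallId() => throw new NotImplementedException();
      private static void EnterSlow(object obj, ref bool lockTaken) => throw new NotImplementedException();
  }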

@lambdageek (Member, Author) commented Feb 2, 2023

On one laptop, I see this from running Release WASM System.Runtime.Tests using v8 without AOT:

Baseline:

  info: Total: 50616, Errors: 0, Failed: 0, Skipped: 92, Time: 130.155189s

This PR:

  info: Total: 50616, Errors: 0, Failed: 0, Skipped: 92, Time: 130.30528s

So we lost about 150ms, although neither run was very stable; I think my CPU is throttling.

I'm planning to look at optimizing the interpreter IR. For single-threaded wasm, every lock(obj) should be able to avoid going into native code except if obj has its hash code taken and was locked.

@lambdageek (Member, Author) commented Feb 2, 2023

Interesting. I didn't realize it, but we don't have Interlocked.CompareExchange intrinsics in the interpreter. That's unfortunate for this project.

@lambdageek (Member, Author) commented Feb 2, 2023

Added CompareExchange and some AggressiveInlining hints. I now see the following optimized IR for ReliableEnterTimeout on amd64 desktop. As far as I can tell, the fast path has calls to ThrowIfNull, GC.KeepAlive, Thread.CurrentThread and Thread.GetSmallId; after that we're manipulating the LockWord directly. I'll get updated timing for System.Runtime.Tests shortly.

Interpreter IR for ReliableEnterTimeout
Runtime method: System.Threading.Monitor:ReliableEnterTimeout (object,int,bool&) 0x16da01470
Locals size 128
Calculated stack height: 4, stated height: 8
IR_0000: ldstr          [104 <- nil], 0
IR_0003: mov.8          [96 <- 0],
IR_0006: call           [48 <- 96], 1
IR_000a: bge.i4.imm.sp  [nil <- 8], 0, IR_001c
IR_000e: beq.i4.imm.sp  [nil <- 8], 65535, IR_001c
IR_0012: ldstr          [104 <- nil], 2
IR_0015: newobj         [48 <- 96], 4
IR_001a: throw          [nil <- 48],
IR_001c: mov.8          [40 <- 0],
IR_001f: ldfld.i8       [48 <- 0], 8
IR_0023: mov.8          [48 <- 48],
IR_0026: mov.8          [96 <- 0],
IR_0029: call           [56 <- 96], 5
IR_002d: mov.vt         [32 <- 48], 8
IR_0031: mov.8          [48 <- 32],
IR_0034: bne.un.i8.imm.sp [nil <- 48], 0, IR_0070
IR_0038: call           [48 <- 96], 6
IR_003c: cknull         [96 <- 48],
IR_003f: call           [48 <- 96], 7
IR_0043: conv.i8.u4     [48 <- 48],
IR_0046: shl.i8.imm     [48 <- 48], 10
IR_004a: mov.8          [48 <- 48],
IR_004d: mov.vt         [56 <- 32], 8
IR_0051: ldflda         [64 <- 40], 8
IR_0055: mov.8          [72 <- 48],
IR_0058: mov.8          [80 <- 56],
IR_005b: mono_interlocked.cmpxchg.i8 [48 <- 64 72 80],
IR_0060: mov.8          [96 <- 40],
IR_0063: call           [56 <- 96], 5
IR_0067: mov.8          [56 <- 32],
IR_006a: bne.un.i8.s    [nil <- 48 56], IR_00f1
IR_006e: br.s           [nil <- nil], IR_00eb
IR_0070: mov.8          [48 <- 32],
IR_0073: ldc.i8         [56 <- nil], 2
IR_0079: and.i8         [48 <- 48 56],
IR_007d: ble.un.i8.imm.sp [nil <- 48], 0, IR_008d
IR_0081: mov.8          [96 <- 40],
IR_0084: call           [48 <- 96], 8
IR_0088: brfalse.i4.s   [nil <- 48], IR_00f1
IR_008b: br.s           [nil <- nil], IR_00eb
IR_008d: mov.8          [48 <- 32],
IR_0090: ldc.i8         [56 <- nil], 3
IR_0096: and.i8         [48 <- 48 56],
IR_009a: bne.un.i8.imm.sp [nil <- 48], 0, IR_00f1
IR_009e: call           [48 <- 96], 6
IR_00a2: cknull         [96 <- 48],
IR_00a5: call           [48 <- 96], 7
IR_00a9: mov.8          [56 <- 32],
IR_00ac: shr.un.i8.imm  [56 <- 56], 10
IR_00b0: bne.un.i4.s    [nil <- 56 48], IR_00f1
IR_00b4: ldloca.s       [96 <- nil], 32
IR_00b7: call           [48 <- 96], 9
IR_00bb: brfalse.i4.s   [nil <- 48], IR_00c0
IR_00be: br.s           [nil <- nil], IR_00f1
IR_00c0: mov.8          [48 <- 32],
IR_00c3: add.i8.imm     [48 <- 48], 4
IR_00c7: mov.8          [48 <- 48],
IR_00ca: mov.vt         [56 <- 32], 8
IR_00ce: ldflda         [64 <- 40], 8
IR_00d2: mov.8          [72 <- 48],
IR_00d5: mov.8          [80 <- 56],
IR_00d8: mono_interlocked.cmpxchg.i8 [48 <- 64 72 80],
IR_00dd: mov.8          [96 <- 40],
IR_00e0: call           [56 <- 96], 5
IR_00e4: mov.8          [56 <- 32],
IR_00e7: bne.un.i8.s    [nil <- 48 56], IR_00f1
IR_00eb: ldc.i4.1       [48 <- nil],
IR_00ed: stind.i1       [nil <- 16 48],
IR_00f0: ret.void       [nil <- nil],
IR_00f1: ldc.i4.1       [112 <- nil],
IR_00f3: mov.8.3        [nil <- nil], 96 <- 0, 104 <- 8, 120 <- 16
IR_00fa: call           [48 <- 96], 10
IR_00fe: ret.void       [nil <- nil],

@lambdageek (Member, Author)

With intrinsics

  info: Total: 50616, Errors: 0, Failed: 0, Skipped: 92, Time: 129.2762049s

@lambdageek (Member, Author)

One final update. I think this is ready for review @vargaz @kg @BrzVlad

Used an ObjectHeaderOnStack ref struct to pass around the ref Header that we need while also keeping the underlying object pointer on the stack. With this, the two fast paths (locking an unlocked object, either flat or previously inflated) only have calls for argument validation and are otherwise just ALU ops and a CompareExchange intrinsic.

Interpreter IR for ReliableEnterTimeout
Runtime method: System.Threading.Monitor:ReliableEnterTimeout (object,int,bool&) 0x16b9d5470
Locals size 144
Calculated stack height: 4, stated height: 8
IR_0000: ldstr          [120 <- nil], 0
IR_0003: mov.8          [112 <- 0],
IR_0006: call           [64 <- 112], 1
IR_000a: bge.i4.imm.sp  [nil <- 8], 0, IR_001c
IR_000e: beq.i4.imm.sp  [nil <- 8], 65535, IR_001c
IR_0012: ldstr          [120 <- nil], 2
IR_0015: newobj         [64 <- 112], 4
IR_001a: throw          [nil <- 64],
IR_001c: mov.8          [32 <- 0],
IR_001f: ldloca.s       [64 <- nil], 32
IR_0022: newobj_vt_inlined [80 <- 72], 8
IR_0026: stfld.i8       [nil <- 80 64], 0
IR_002a: mov.vt         [48 <- 72], 8
IR_002e: mov.8          [64 <- 72],
IR_0031: ldind.i8       [64 <- 64],
IR_0034: ldfld.i8       [64 <- 64], 8
IR_0038: mov.8          [64 <- 64],
IR_003b: mov.vt         [40 <- 64], 8
IR_003f: mov.8          [64 <- 40],
IR_0042: bne.un.i8.imm.sp [nil <- 64], 0, IR_007d
IR_0046: call           [64 <- 112], 5
IR_004a: cknull         [112 <- 64],
IR_004d: call           [64 <- 112], 6
IR_0051: conv.i8.u4     [64 <- 64],
IR_0054: shl.i8.imm     [64 <- 64], 10
IR_0058: mov.8          [64 <- 64],
IR_005b: mov.vt         [72 <- 40], 8
IR_005f: mov.8          [80 <- 48],
IR_0062: ldind.i8       [80 <- 80],
IR_0065: ldflda         [80 <- 80], 8
IR_0069: mov.8          [88 <- 64],
IR_006c: mov.8          [96 <- 72],
IR_006f: mono_interlocked.cmpxchg.i8 [64 <- 80 88 96],
IR_0074: mov.8          [72 <- 40],
IR_0077: bne.un.i8.s    [nil <- 64 72], IR_00fe
IR_007b: br.s           [nil <- nil], IR_00f8
IR_007d: mov.8          [64 <- 40],
IR_0080: ldc.i8         [72 <- nil], 2
IR_0086: and.i8         [64 <- 64 72],
IR_008a: ble.un.i8.imm.sp [nil <- 64], 0, IR_009b
IR_008e: mov.vt         [112 <- 48], 8
IR_0092: call           [64 <- 112], 7
IR_0096: brfalse.i4.s   [nil <- 64], IR_00fe
IR_0099: br.s           [nil <- nil], IR_00f8
IR_009b: mov.8          [64 <- 40],
IR_009e: ldc.i8         [72 <- nil], 3
IR_00a4: and.i8         [64 <- 64 72],
IR_00a8: bne.un.i8.imm.sp [nil <- 64], 0, IR_00fe
IR_00ac: call           [64 <- 112], 5
IR_00b0: cknull         [112 <- 64],
IR_00b3: call           [64 <- 112], 6
IR_00b7: mov.8          [72 <- 40],
IR_00ba: shr.un.i8.imm  [72 <- 72], 10
IR_00be: bne.un.i4.s    [nil <- 72 64], IR_00fe
IR_00c2: ldloca.s       [112 <- nil], 40
IR_00c5: call           [64 <- 112], 8
IR_00c9: brfalse.i4.s   [nil <- 64], IR_00ce
IR_00cc: br.s           [nil <- nil], IR_00fe
IR_00ce: mov.8          [64 <- 40],
IR_00d1: add.i8.imm     [64 <- 64], 4
IR_00d5: mov.8          [64 <- 64],
IR_00d8: mov.vt         [72 <- 40], 8
IR_00dc: mov.8          [80 <- 48],
IR_00df: ldind.i8       [80 <- 80],
IR_00e2: ldflda         [80 <- 80], 8
IR_00e6: mov.8          [88 <- 64],
IR_00e9: mov.8          [96 <- 72],
IR_00ec: mono_interlocked.cmpxchg.i8 [64 <- 80 88 96],
IR_00f1: mov.8          [72 <- 40],
IR_00f4: bne.un.i8.s    [nil <- 64 72], IR_00fe
IR_00f8: ldc.i4.1       [64 <- nil],
IR_00fa: stind.i1       [nil <- 16 64],
IR_00fd: ret.void       [nil <- nil],
IR_00fe: ldc.i4.1       [128 <- nil],
IR_0100: mov.8.3        [nil <- nil], 112 <- 0, 120 <- 8, 136 <- 16
IR_0107: call           [64 <- 112], 9
IR_010b: ret.void       [nil <- nil],

@BrzVlad (Member) commented Feb 3, 2023

The numbers from the library test suite are not really useful, since there is no significant difference there. I was able to miraculously find the benchmark suite that I used for testing when I implemented the monitor thin locks ages ago. Running those benchmarks on jit and interp, with and without your change, might provide some better context on the performance implications of this.
tests.zip

@kg (Contributor) commented Feb 3, 2023

> only has calls for argument validation and is otherwise just ALU ops and a CompareExchange intrinsic.

Can the argument validation methods be aggressively inlined so that the fast path never performs a call? Calls aren't cheap in interp AFAIK; those alone might be worse than doing all this in an icall. You could put the failure case for validation in a call.
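One common shape for this (a sketch, not the PR's code): keep the check inline and move only the throw out of line, so the hot path performs no call when validation passes:

  using System;
  using System.Runtime.CompilerServices;

  internal static class MonitorArgumentValidation
  {
      // Hot path: small enough to inline; performs no call unless the check fails.
      [MethodImpl(MethodImplOptions.AggressiveInlining)]
      public static void EnsureNotNull(object? obj)
      {
          if (obj is null)
              ThrowObjNull();
      }

      // Cold path: kept out of line so it doesn't bloat the inlined caller.
      [MethodImpl(MethodImplOptions.NoInlining)]
      private static void ThrowObjNull() => throw new ArgumentNullException("obj");
  }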

- avoid duplicated lockword and sync block manipulations
- Instead of treating InternalGetHashCode/InternalTryGetHashCode as intrinsics in the interpreter, intrinsify RuntimeHelpers.GetHashCode / RuntimeHelpers.TryGetHashCode and avoid the managed code entirely when possible (see the sketch below)
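A hedged sketch of what that managed hash-code fast path might look like; the flag bits, shift, and GetLockWordRef helper are assumptions, not the PR's actual code:

  using System.Runtime.CompilerServices;
  using System.Threading;

  internal static class ObjectHashFastPath
  {
      private const long StatusInflated = 1;  // illustrative flag bits
      private const long StatusHasHash  = 2;
      private const int  HashShift      = 10;

      [MethodImpl(MethodImplOptions.AggressiveInlining)]
      public static bool TryGetHashCode(object obj, out int hash)
      {
          long lw = Volatile.Read(ref GetLockWordRef(obj)); // hypothetical accessor

          // Flat word already carrying a hash: extract it with pure ALU ops.
          if ((lw & StatusHasHash) != 0 && (lw & StatusInflated) == 0)
          {
              hash = (int)((ulong)lw >> HashShift);
              return true;
          }

          // Inflated lock or no hash assigned yet: caller takes the slow path.
          hash = 0;
          return false;
      }

      private static ref long GetLockWordRef(object obj) => throw new System.NotImplementedException();
  }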
The code that recognizes GC transition icalls in the interpreter only assumed that it would see
mono_threads_enter_gc_safe_region_unbalanced /
mono_threads_exit_gc_safe_region_unbalanced

which are used by managed-to-native wrappers.

In most cases for native-to-managed wrappers the marshaller emits
mono_threads_attach_coop / mono_threads_detach_coop and those are
handled elsewhere by setting the needs_thread_attach flag on the
InterpMethod.

However when the mono_marshal_get_thunk_invoke_wrapper API is used, we
emit a thunk invoke wrapper which uses
mono_threads_enter_gc_unsafe_region_unbalanced /
mono_threads_exit_gc_unsafe_region_unbalanced

Recognize those two icalls and set the needs_thread_attach
flag.  (This is slightly more work than necessary -
mono_threads_attach_coop checks if the thread was previously attached
to the runtime and attaches it if it wasn't.  The thunk invoke
wrappers seem to assume the API caller attaches the thread - so we pay
for an extra check.  But on the other hand we don't need to change the
execution-time behavior of the interpreter by reusing existing mechanisms.)
  append_stloc_tail(builder, getArgU16(ip, 1), WasmOpcode.i32_store);
  break;
  case MintOpcode.MINT_MONO_CMPXCHG_I8:
  // because i64 values can't pass through JS cleanly (c.f getRawCwrap and
Review comment (Contributor):

None of these values enter/exit JS since the implementation of cmpxchg_i64 is C and the trace is pure wasm, so it should be fine to use i64s directly here

Review comment (Contributor):

Because you're using getRawCwrap there's no JS wrapper involved (see `const result = (<any>Module)["asm"][name];`) - it should be a raw wasm export which is valid to shove into the import table of the trace module and wire everything through, I believe. If it's not possible it should be

@lambdageek (Member, Author) commented Feb 10, 2023:

No, it doesn't work - Wasm.instantiate throws an import error because EMSCRIPTEN_KEEPALIVE makes some weird signature for mono_jiterp_cas_i64 if I pass int64_t arguments by value.

@lambdageek (Member, Author):

It's possible that something else was wrong when I was doing it by value. I'm not too inclined to investigate it right now, to be honest. We could make an issue to follow up.

For the i64 version, pass the address of the i64 values to the helper
function.

The issue is that since JS doesn't have i64, EMSCRIPTEN_KEEPALIVE
messes with the function's signature and we get import errors in the
JITed wasm module.  Passing by address works around the issue since
addresses are i32.
This reverts commit 5b498b8d4e265f8a68fe9453b535e48cf56756ce.
Make a "GC Unsafe Transition Builder" - that always calls
mono_threads_attach_coop / mono_threads_detach_coop.

Use it in the native to managed wrappers:
emit_thunk_invoke_wrapper and emit_managed_wrapper

This is a change in behavior for emit_thunk_invoke_wrapper -
previously it directly called
mono_threads_enter_gc_unsafe_region_unbalanced.

That means that compared to the previous behavior, the thunk invoke
wrappers are now a bit more lax: they will be able to be called on
threads that aren't attached to the runtime and they will attach
automatically.

On the other hand existing code will continue to work, with the extra
cost of a check of a thread local var.

Using mono_thread_attach_coop also makes invoke wrappers work
correctly in the interpreter - it special cases
mono_thread_attach_coop but not enter_gc_unsafe.
@lambdageek (Member, Author)

The benchmark perf numbers stayed roughly the same after rebasing.

[benchmark chart: desktop]

[benchmark chart: wasm]

  // stack that stores a pointer to the object, thus pinning the object.
  private unsafe ref struct ObjectHeaderOnStack
  {
      private Header** _header;
Review comment (Contributor):

If this is a Header** doesn't that mean there won't be a visible pointer to the object on the stack? Instead, it's a pointer-to-pointer, right?

@lambdageek (Member, Author):

_header is a pointer to a pointer, which means the thing it points to is on the stack (well... the thing it points to is in memory, but we happen to know it's on the stack) - that's the one we want the GC to notice.
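A fleshed-out sketch of that idea (the Header type, field, and helper names are assumptions; the real code differs, and in Mono the sync word actually sits after the vtable pointer rather than at offset 0):

  using System.Runtime.CompilerServices;

  internal struct Header
  {
      public long LockWord; // illustrative: the synchronization/hash word
  }

  // Holds a pointer to a stack slot that itself holds the object pointer.
  // The GC sees the object pointer in that slot, keeping the object alive
  // (and pinned) while we hold a ref into its header.
  internal unsafe ref struct ObjectHeaderOnStack
  {
      private Header** _header;

      // 'obj' must be a stack local so its address stays valid and
      // GC-visible for the lifetime of this struct.
      public static ObjectHeaderOnStack Create(ref object obj)
      {
          ObjectHeaderOnStack result = default;
          result._header = (Header**)Unsafe.AsPointer(ref obj);
          return result;
      }

      // Re-read the object pointer through the stack slot on every access.
      public ref Header GetHeader() => ref **_header;
  }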

lambdageek added a commit that referenced this pull request Feb 13, 2023
The code that recognizes GC transition icalls in the interpreter only assumed that it would see `mono_threads_enter_gc_safe_region_unbalanced` / `mono_threads_exit_gc_safe_region_unbalanced`, which are used by managed-to-native wrappers.

In most cases for native-to-managed wrappers the marshaller emits `mono_threads_attach_coop` / `mono_threads_detach_coop` and those are handled elsewhere by setting the `needs_thread_attach` flag on the `InterpMethod`.

However when the `mono_marshal_get_thunk_invoke_wrapper` API is used, we emit a thunk invoke wrapper which uses `mono_threads_enter_gc_unsafe_region_unbalanced` / `mono_threads_exit_gc_unsafe_region_unbalanced`.

Change the thunk invoke wrapper to also use attach_coop/detach_coop.  This makes the thunks a bit more flexible (they can now be called from unattached threads), at the cost of a TLS lookup.  Also fixes interpreter support since they use the existing attach/detach handling.

As an implementation detail, I added a GC Unsafe Transition Builder to the marshaling code.

Fixes a failure seen in #81380 in the MonoAPI/MonoMono/Thunks test (exposed by calling a cctor later than previously)

* Assert that CEE_MONO_GET_SP is followed by gc safe enter/exit icalls

   Only assert in debug builds

* [marshal] Add GCUnsafeTransitionBuilder; use for thunk_invoke_wrapper

The underlying C functions are null-friendly, and the managed intrinsics are allowed to accept null inputs. The assert is unnecessary.