
JIT: Fix case where we illegally optimize away a memory barrier on ARM32 #91827

Merged: jakobbotsch merged 1 commit into dotnet:main on Sep 11, 2023

Conversation

@jakobbotsch (Member) commented Sep 8, 2023

The ARM32/ARM64 backends have a peephole optimization that removes the latter of two consecutive memory barriers if no memory load/store has been seen between them. This optimization did not take calls into account, so a barrier could be removed even when a call (with potential memory operations) had executed since the previous one.

Fix #91732
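The elision logic and the fix can be sketched with a toy model (the `Emitter` type and method names below are illustrative, not the actual RyuJIT emitter API):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of the barrier-elision peephole. `lastWasBarrier` stands in
// for the emitter's tracking of the most recently emitted dmb.
struct Emitter
{
    std::vector<std::string> code;
    bool lastWasBarrier = false;

    void emitBarrier()
    {
        if (lastWasBarrier)
            return; // peephole: drop a dmb immediately following another dmb
        code.push_back("dmb");
        lastWasBarrier = true;
    }

    void emitLoadOrStore()
    {
        code.push_back("ldr/str");
        lastWasBarrier = false; // a memory op makes the next barrier meaningful
    }

    void emitCall()
    {
        code.push_back("blx");
        lastWasBarrier = false; // the fix: a call may also perform memory ops
    }
};
```

Without the reset in `emitCall`, the sequence `dmb; blx; dmb` would lose its second barrier, which is exactly what happened around the `CORINFO_HELP_MEMCPY` call below.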

The problem happened in ConcurrentQueueSegment<LargeStruct>.TryEnqueue in this code:

if (Interlocked.CompareExchange(ref _headAndTail.Tail, currentTail + 1, currentTail) == currentTail)
{
    // Successfully reserved the slot. Note that after the above CompareExchange, other threads
    // trying to return will end up spinning until we do the subsequent Write.
    slots[slotsIndex].Item = item;
    Volatile.Write(ref slots[slotsIndex].SequenceNumber, currentTail + 1);
    return true;
}

The block copy (the slots[slotsIndex].Item = item assignment) was compiled into a helper call to CORINFO_HELP_MEMCPY. The JIT then effectively turned the Volatile.Write into a plain store with no preceding memory barrier. The diff with this PR is:

@@ -1,42 +1,42 @@
 G_M50486_IG05:  ;; offset=0x0042
 ldr     r6, [r4+0x94]
 dmb     15
 ldr     r0, [r4+0x0C]
 and     r7, r0, r6
 ldr     r0, [r5+0x04]
 cmp     r7, r0
 bhs     SHORT G_M50486_IG07
 movs    r0, 72
 mul     r0, r7, r0
 adds    r0, 8
 ldr     r0, [r5+r0]
 dmb     15
 sub     r8, r0, r6
 cmp     r8, 0
 bne     SHORT G_M50486_IG06
 add     r0, r4, 148
 adds    r1, r6, 1
 mov     r2, r6
 movw    r3, 0x221
 movt    r3, 0xf732
 blx     r3		// System.Threading.Interlocked:CompareExchange(byref,int,int):int
 cmp     r0, r6
 bne     SHORT G_M50486_IG05
 movs    r2, 72
 mul     r2, r7, r2
 adds    r2, 8
 adds    r2, r5, r2
 add     r0, r2, 8
 add     r1, sp, 40
 movs    r2, 64
 movw    r12, 0x565f
 movt    r12, 0xf749
 blx     r12		// CORINFO_HELP_MEMCPY
 movs    r0, 72
 mul     r0, r7, r0
 adds    r0, 8
 adds    r3, r6, 1
+dmb     15
 str     r3, [r5+r0]
 movs    r0, 1
 b       SHORT G_M50486_IG03
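The ordering TryEnqueue depends on can be expressed with C++11 atomics (a minimal sketch of the same publish pattern, not the actual ConcurrentQueue code): the payload store must become visible before the release store of the sequence number, which on ARM32 is what the restored dmb enforces.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Publish pattern from TryEnqueue, reduced to two variables: write the
// payload, then release-store the sequence number. Volatile.Write has
// release semantics, which the JIT implements on ARM32 with a dmb
// before the store; eliding that barrier breaks the guarantee.
int payload = 0;
std::atomic<int> sequence{0};

void writer()
{
    payload = 42;                                 // slots[slotsIndex].Item = item;
    sequence.store(1, std::memory_order_release); // Volatile.Write(ref ..., currentTail + 1);
}

int reader()
{
    while (sequence.load(std::memory_order_acquire) != 1)
    {
        // spin until the writer publishes, mirroring the comment in TryEnqueue
    }
    return payload; // the acquire/release pair guarantees this sees 42
}
```

With the barrier removed, a reader that observes the new sequence number could still read a stale payload, which is the corruption reported in #91732.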

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 8, 2023
@ghost ghost assigned jakobbotsch Sep 8, 2023
@ghost commented Sep 8, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@jakobbotsch jakobbotsch added this to the 8.0.0 milestone Sep 8, 2023
@@ -8908,6 +8908,7 @@ void emitter::emitIns_Call(EmitCallType callType,

     dispIns(id);
     appendToCurIG(id);
+    emitLastMemBarrier = nullptr; // Cannot optimize away future memory barriers
@jakobbotsch (Member, Author) commented Sep 8, 2023
I think this entire peephole should be switched to use the new backwards traversal peephole mechanism, which I expect would have quite significant throughput (TP) improvements, as we currently have code in the very hot appendToCurIG to handle this peephole. I've opened #91825 to track that. However, given that this PR should be backported, I've kept the fix here surgical.

@jakobbotsch (Member, Author)
cc @dotnet/jit-contrib

Significant number of diffs on ARM32. No diffs on ARM64, presumably (as @EgorBo told me) because we use combined memory barrier and memory operations there now.

@jakobbotsch (Member, Author)
Failure is #91838

@jakobbotsch jakobbotsch merged commit 6d5ea33 into dotnet:main Sep 11, 2023
124 of 127 checks passed
@jakobbotsch (Member, Author)
/backport to release/8.0

@github-actions (Contributor)
Started backporting to release/8.0: https://github.com/dotnet/runtime/actions/runs/6144973534

@jakobbotsch jakobbotsch deleted the fix-91732 branch September 11, 2023 10:16
@jakobbotsch jakobbotsch changed the title JIT: Fix case where we illegally optimize away a memory barrier on ARM32/ARM64 JIT: Fix case where we illegally optimize away a memory barrier on ARM32 Sep 11, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Oct 11, 2023