Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tests] System.Text.Json.Tests segfault, for Libraries Test Run checked coreclr Linux_musl x64 Debug #46100

Closed
radical opened this issue Dec 15, 2020 · 23 comments
Labels
area-VM-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Milestone

Comments

@radical
Copy link
Member

radical commented Dec 15, 2020

Segfault seen in a test run: https://dev.azure.com/dnceng/public/_build/results?buildId=922206&view=ms.vss-test-web.build-test-results-tab&runId=29288458&paneView=debug&resultId=173384 on #46048 .

+ ./RunTests.sh --runtime-path /root/helix/work/correlation
----- start Tue Dec 15 21:24:03 UTC 2020 =============== To repro directly: =====================================================
pushd .
/root/helix/work/correlation/dotnet exec --runtimeconfig System.Text.Json.Tests.runtimeconfig.json --depsfile System.Text.Json.Tests.deps.json xunit.console.dll System.Text.Json.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing 
popd
===========================================================================================================
/root/helix/work/workitem /root/helix/work/workitem
  Discovering: System.Text.Json.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.Json.Tests (found 2250 of 2294 test cases)
  Starting:    System.Text.Json.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 161:    22 Segmentation fault      (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Text.Json.Tests.runtimeconfig.json --depsfile System.Text.Json.Tests.deps.json xunit.console.dll System.Text.Json.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/root/helix/work/workitem
----- end Tue Dec 15 21:24:32 UTC 2020 ----- exit code 139 ----------------------------------------------------------
exit code 139 means SIGSEGV Illegal memory access. Deref invalid pointer, overrunning buffer, stack overflow etc. Core dumped.
@jkotas
Copy link
Member

jkotas commented Dec 16, 2020

Crash during background GC:

(lldb) bt
* thread #1, name = 'dotnet', stop reason = signal SIGSEGV
  * frame #0: 0x00007f4cc2e87ed5 libcoreclr.so`WKS::gc_heap::background_mark_simple1(unsigned char*) [inlined] MethodTable::GetFlag(this=<unavailable>, flag=enum_flag_HasComponentSize) const at methodtable.h:3705:16
    frame #1: 0x00007f4cc2e87ed5 libcoreclr.so`WKS::gc_heap::background_mark_simple1(unsigned char*) [inlined] MethodTable::HasComponentSize(this=<unavailable>) const at methodtable.h:1688
    frame #2: 0x00007f4cc2e87ed5 libcoreclr.so`WKS::gc_heap::background_mark_simple1(unsigned char*) [inlined] WKS::my_get_size(ob=<unavailable>) at gc.cpp:10315
    frame #3: 0x00007f4cc2e87ece libcoreclr.so`WKS::gc_heap::background_mark_simple1(oo=<unavailable>) at gc.cpp:21714
    frame #4: 0x00007f4cc2e88726 libcoreclr.so`WKS::gc_heap::background_mark_simple(o=<unavailable>) at gc.cpp:21815:17
    frame #5: 0x00007f4cc2e09713 libcoreclr.so`BlockScanBlocksWithoutUserData(TableSegment*, unsigned int, unsigned int, ScanCallbackInfo*) [inlined] ScanConsecutiveHandlesWithoutUserData(pValue=<unavailable>, pLast=<unavailable>, pInfo=0x00007f4cc2206458) at handletablescan.cpp:443:13
    frame #6: 0x00007f4cc2e096d4 libcoreclr.so`BlockScanBlocksWithoutUserData(pSegment=0x00007f4cc2520000, uBlock=58, uCount=1, pInfo=0x00007f4cc2206458) at handletablescan.cpp:555
    frame #7: 0x00007f4cc2e0a432 libcoreclr.so`xxxTableScanQueuedBlocksAsync(HandleTable*, TableSegment*, Holder<CrstBase*, &(CrstBase::AcquireLock(CrstBase*)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*) [inlined] ProcessScanQNode(pAsyncInfo=0x00007f4cc22062e0) at handletablescan.cpp:1231:9
    frame #8: 0x00007f4cc2e0a402 libcoreclr.so`xxxTableScanQueuedBlocksAsync(HandleTable*, TableSegment*, Holder<CrstBase*, &(CrstBase::AcquireLock(CrstBase*)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*) at handletablescan.cpp:1201
    frame #9: 0x00007f4cc2e0a3f3 libcoreclr.so`xxxTableScanQueuedBlocksAsync(pTable=0x000055bba8bb82f0, pSegment=0x00007f4cc2520708, pCrstHolder=0x00007f4cc2206498)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*) at handletablescan.cpp:1312
    frame #10: 0x00007f4cc2e0a7a6 libcoreclr.so`xxxAsyncSegmentIterator(pTable=0x000055bba8bb82f0, pPrevSegment=0x00007f4cc2520000, pCrstHolder=0x00007f4cc2206498)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*) at handletablescan.cpp:1537:9
    frame #11: 0x00007f4cc2e0acff libcoreclr.so`TableScanHandles(pTable=0x000055bba8bb82f0, puType=<unavailable>, uTypeCount=2, pfnSegmentIterator=(libcoreclr.so`xxxAsyncSegmentIterator(HandleTable*, TableSegment*, Holder<CrstBase*, &(CrstBase::AcquireLock(CrstBase*)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*) at handletablescan.cpp:1526), pfnBlockHandler=<unavailable>, pInfo=0x00007f4cc2206458, pCrstHolder=<unavailable>)(HandleTable*, TableSegment*, Holder<CrstBase*, &(CrstBase::AcquireLock(CrstBase*)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*), void (*)(TableSegment*, unsigned int, unsigned int, ScanCallbackInfo*), ScanCallbackInfo*, Holder<CrstBase*, &(CrstBase::AcquireLock(CrstBase*)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*) at handletablescan.cpp:1699:24
    frame #12: 0x00007f4cc2e0adf4 libcoreclr.so`xxxTableScanHandlesAsync(pTable=0x000055bba8bb82f0, puType=<unavailable>, uTypeCount=<unavailable>, pfnSegmentIterator=<unavailable>, pfnBlockHandler=<unavailable>, pInfo=<unavailable>, pCrstHolder=<unavailable>)(HandleTable*, TableSegment*, Holder<CrstBase*, &(CrstBase::AcquireLock(CrstBase*)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*), void (*)(TableSegment*, unsigned int, unsigned int, ScanCallbackInfo*), ScanCallbackInfo*, Holder<CrstBase*, &(CrstBase::AcquireLock(CrstBase*)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*) at handletablescan.cpp:1781:5
    frame #13: 0x00007f4cc2e05f48 libcoreclr.so`HndScanHandlesForGC(hTable=0x000055bba8bb82f0, scanProc=<unavailable>, param1=<unavailable>, param2=139967664260992, types=0x00007f4cc2206538, typeCount=2, condemned=<unavailable>, maxgen=<unavailable>, flags=<unavailable>)(Object**, unsigned long*, unsigned long, unsigned long), unsigned long, unsigned long, unsigned int const*, unsigned int, unsigned int, unsigned int, unsigned int) at handletable.cpp:789:9
    frame #14: 0x00007f4cc2e0e4d1 libcoreclr.so`Ref_TraceNormalRoots(condemned=<unavailable>, maxgen=2, sc=0x00007f4cc22065a8, fn=(libcoreclr.so`WKS::gc_heap::background_promote(Object**, ScanContext*, unsigned int) at gc.cpp:21840))(Object**, ScanContext*, unsigned int)) at objecthandle.cpp:1059:21
    frame #15: 0x00007f4cc2e7a3e3 libcoreclr.so`WKS::gc_heap::background_mark_phase() at gc.cpp:30188:9
    frame #16: 0x00007f4cc2e79591 libcoreclr.so`WKS::gc_heap::gc1() at gc.cpp:18723:13
    frame #17: 0x00007f4cc2e9716c libcoreclr.so`WKS::gc_heap::bgc_thread_function() at gc.cpp:31343:9
    frame #18: 0x00007f4cc2ccfca4 libcoreclr.so`(anonymous namespace)::CreateSuspendableThread(void (*)(void*), void*, char16_t const*)::$_0::__invoke(void*) [inlined] (anonymous namespace)::CreateSuspendableThread(void (*)(void*), void*, char16_t const*)::$_0::operator()(void*) const at gcenv.ee.cpp:1382:17
    frame #19: 0x00007f4cc2ccfc21 libcoreclr.so`(anonymous namespace)::CreateSuspendableThread(argument=<unavailable>)(void*), void*, char16_t const*)::$_0::__invoke(void*) at gcenv.ee.cpp:1362
    frame #20: 0x00007f4cc30ef088 libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x000055bba8de0de0) at thread.cpp:1845:16

@jkotas
Copy link
Member

jkotas commented Dec 16, 2020

(lldb) verifyheap
object 00007f4c2972f9f8: bad member 00007F4C292C7658 at 00007F4C2972FA48
Last good object: 00007F4C2972F9D0.
Object 00007f4c298208d0 contains free object 00007F4C29688830 at 00007F4C298208E8.
Last good object: 00007F4C29820870.

@danmoseley
Copy link
Member

@jkotas is it possible to say where this should go - JIT, JSON, ..?

@jkotas jkotas added area-VM-coreclr tenet-reliability Reliability/stability related issue (stress, load problems, etc.) and removed area-System.Text.Json labels Dec 16, 2020
@jkotas
Copy link
Member

jkotas commented Dec 16, 2020

(lldb) dumpobj 00007f4c2972f9f8
Name:        System.Object[]
MethodTable: 00007f4c48f5ea78
EEClass:     00007f4c48f5e9b0
Size:        1328(0x530) bytes
Array:       Rank 1, Number of elements 163, Type CLASS
Fields:
None
(lldb) gcroot 00007f4c2972f9f8
HandleTable:
    00007F4CC2528530 (strong handle)
    -> 00007F4C296C8ED0 System.Runtime.CompilerServices.GCHeapHash
    -> 00007F4C2972F9F8 System.Object[]

Found 1 unique roots (run 'gcroot -all' to see all roots).
(lldb) dumpobj 00007f4c298208d0
Name:        System.Collections.Generic.Dictionary`2+Entry[[System.String, System.Private.CoreLib],[System.String, System.Private.CoreLib]][]
MethodTable: 00007f4c495edbd0
EEClass:     00007f4c495edb08
Size:        432(0x1b0) bytes
Array:       Rank 1, Number of elements 17, Type VALUETYPE
Fields:
None
(lldb) gcroot -all 00007f4c298208d0
Found 0 roots.

@jkotas
Copy link
Member

jkotas commented Dec 16, 2020

We need more dumps to see the pattern

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Dec 18, 2020
@mangod9 mangod9 added this to the 6.0.0 milestone Dec 18, 2020
@k15tfu
Copy link
Contributor

k15tfu commented Dec 25, 2020

@jkotas I think I see the same problem on Linux x64 with .NET Core 3.1.8:

(lldb) bt
* thread #1, name = 'dotnet', stop reason = signal SIGSEGV
  * frame #0: 0x00007f59b8c46100 libcoreclr.so`WKS::gc_heap::background_mark_simple1(oo=<unavailable>) at gc.cpp:19382:43
    frame #1: 0x00007f59b8c66474 libcoreclr.so`BlockScanBlocksWithoutUserData(TableSegment*, unsigned int, unsigned int, ScanCallbackInfo*) [inlined] ScanConsecutiveHandlesWithoutUserData(pValue=0x00007f59ba8715f8, pLast=<unavailable>, pInfo=0x00007f58a9ffaba0) at handletablescan.cpp:444:13
    frame #2: 0x00007f59b8c66441 libcoreclr.so`BlockScanBlocksWithoutUserData(pSegment=0x00007f59ba870000, uBlock=2, uCount=1, pInfo=0x00007f58a9ffaba0) at handletablescan.cpp:556
    frame #3: 0x00007f59b8c67412 libcoreclr.so`xxxTableScanQueuedBlocksAsync(HandleTable*, TableSegment*, Holder<CrstBase*, &(CrstBase::AcquireLock(CrstBase*)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*) [inlined] ProcessScanQNode(pAsyncInfo=0x00007f58a9ffaa10, pQNode=<unavailable>) at handletablescan.cpp:1232:9
    frame #4: 0x00007f59b8c673e2 libcoreclr.so`xxxTableScanQueuedBlocksAsync(HandleTable*, TableSegment*, Holder<CrstBase*, &(CrstBase::AcquireLock(CrstBase*)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*) at handletablescan.cpp:1202
    frame #5: 0x00007f59b8c673d3 libcoreclr.so`xxxTableScanQueuedBlocksAsync(pTable=<unavailable>, pSegment=<unavailable>, pCrstHolder=0x00007f58a9ffabd0)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*) at handletablescan.cpp:1313
    frame #6: 0x00007f59b8c67841 libcoreclr.so`xxxAsyncSegmentIterator(pTable=0x00000000019c2c30, pPrevSegment=0x00007f59ba870000, pCrstHolder=0x00007f58a9ffabd0)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*) at handletablescan.cpp:1538:9
    frame #7: 0x00007f59b8c67c43 libcoreclr.so`TableScanHandles(pTable=0x00000000019c2c30, puType=<unavailable>, uTypeCount=0, pfnSegmentIterator=(libcoreclr.so`xxxAsyncSegmentIterator(HandleTable*, TableSegment*, Holder<CrstBase*, &(CrstBase::AcquireLock(CrstBase*)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*) at handletablescan.cpp:1527), pfnBlockHandler=(libcoreclr.so`BlockQueueBlocksForAsyncScan(TableSegment*, unsigned int, unsigned int, ScanCallbackInfo*) at handletablescan.cpp:1067), pInfo=0x00007f58a9ffaba0, pCrstHolder=<unavailable>)(HandleTable*, TableSegment*, Holder<CrstBase*, &(CrstBase::AcquireLock(CrstBase*)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*), void (*)(TableSegment*, unsigned int, unsigned int, ScanCallbackInfo*), ScanCallbackInfo*, Holder<CrstBase*, &(CrstBase::AcquireLock(CrstBase*)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*) at handletablescan.cpp:1700:24
    frame #8: 0x00007f59b8c67d1b libcoreclr.so`xxxTableScanHandlesAsync(pTable=0x00000000019c2c30, puType=<unavailable>, uTypeCount=<unavailable>, pfnSegmentIterator=<unavailable>, pfnBlockHandler=<unavailable>, pInfo=<unavailable>, pCrstHolder=<unavailable>)(HandleTable*, TableSegment*, Holder<CrstBase*, &(CrstBase::AcquireLock(CrstBase*)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*), void (*)(TableSegment*, unsigned int, unsigned int, ScanCallbackInfo*), ScanCallbackInfo*, Holder<CrstBase*, &(CrstBase::AcquireLock(CrstBase*)), &(CrstBase::ReleaseLock(CrstBase*)), 0ul, &(int CompareDefault<CrstBase*>(CrstBase*, CrstBase*)), true>*) at handletablescan.cpp:1782:5
    frame #9: 0x00007f59b8c6403a libcoreclr.so`HndScanHandlesForGC(hTable=0x00000000019c2c30, scanProc=(libcoreclr.so`PinObject(Object**, unsigned long*, unsigned long, unsigned long) at objecthandle.cpp:257), param1=140018785955096, param2=140023328696816, types=<unavailable>, typeCount=<unavailable>, condemned=<unavailable>, maxgen=<unavailable>, flags=<unavailable>)(Object**, unsigned long*, unsigned long, unsigned long), unsigned long, unsigned long, unsigned int const*, unsigned int, unsigned int, unsigned int, unsigned int) at handletable.cpp:790:9
    frame #10: 0x00007f59b8bf0bff libcoreclr.so`Ref_TracePinningRoots(condemned=2, maxgen=2, sc=<unavailable>, fn=(libcoreclr.so`WKS::gc_heap::background_promote(Object**, ScanContext*, unsigned int) at gc.cpp:19633))(Object**, ScanContext*, unsigned int)) at objecthandle.cpp:1028:21
    frame #11: 0x00007f59b8c67fe3 libcoreclr.so`GCScan::GcScanHandles(fn=(libcoreclr.so`WKS::gc_heap::background_promote(Object**, ScanContext*, unsigned int) at gc.cpp:19633), condemned=2, max_gen=2, sc=0x00007f58a9ffad18)(Object**, ScanContext*, unsigned int), int, int, ScanContext*) at gcscan.cpp:169:9
    frame #12: 0x00007f59b8c3b9e9 libcoreclr.so`WKS::gc_heap::background_mark_phase() at gc.cpp:26936:9
    frame #13: 0x00007f59b8c3ae6d libcoreclr.so`WKS::gc_heap::gc1() at gc.cpp:16335:13
    frame #14: 0x00007f59b8c5288c libcoreclr.so`WKS::gc_heap::bgc_thread_function() at gc.cpp:28105:9
    frame #15: 0x00007f59b8c634d2 libcoreclr.so`(anonymous namespace)::CreateSuspendableThread(void (*)(void*), void*, char16_t const*)::$_0::__invoke(void*) [inlined] (anonymous namespace)::CreateSuspendableThread(void (*)(void*), void*, char16_t const*)::$_0::operator()(void*) const at gcenv.ee.cpp:1301:17
    frame #16: 0x00007f59b8c63411 libcoreclr.so`(anonymous namespace)::CreateSuspendableThread(argument=<unavailable>)(void*), void*, char16_t const*)::$_0::__invoke(void*) at gcenv.ee.cpp:1281
    frame #17: 0x00007f59b8d7561d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x00007f58ac0074e0) at thread.cpp:1807:16
    frame #18: 0x00007f59ba4706db libpthread.so.0`start_thread(arg=0x00007f58a9ffb700) at pthread_create.c:463
    frame #19: 0x00007f59b965671f libc.so.6`__GI___clone at clone.S:95
(lldb) sos VerifyHeap
No heap corruption detected.

In another crash, I have the same backtrace, but the heap is corrupted:

(lldb) sos VerifyHeap
Object 00007f26d17c22a8 has an invalid method table.
Last good object: 00007F26D17C2280.
(lldb) dumpobj 00007f26d17c22a8
<Note: this object has an invalid CLASS field>
Invalid object
DumpObj 00007f26d17c22a8  failed

@jkotas
Copy link
Member

jkotas commented Dec 25, 2020

Crash with this stacktrace is a very generic symptom of GC hole or GC heap corruption. It is unlikely that the problem you are seeing has the same root cause as the problem that this issue is tracking.

@steveharter
Copy link
Member

It will likely be necessary to run the STJ tests in a tight loop on a local Linux or OSX environment in attempt to repro the crash and narrow down the tests and call stacks.

FWIW here's commits from the two weeks prior to the first reported failure on 12\15\2020 containing keywords including "JIT", "GC", and "OSX":

7696202 Auto-generate all C++ header overrides of the entire jit interface (#45920)
08322e8 Fix GC hole with STOREIND of LCL_VAR_ADDR/LCL_FLD_ADDR (#45818)
61a92b7 Fix JitDisasmWithGC without JitDump (#45938)
bc12def JIT: consistently mark method table accesses as invariant (#45854)
c400049 JIT: implement return merging stress mode (#45917)
56ecd32 Fix debugger attach+read on Linux 32 bit (#45506)
f6949e4 Fix SOS clrstack w^x issues on osx-arm64 (#45849)
6e33437 Do YieldProcessor normalization in preemptive GC mode (#45913)
d731112 Fix dump of GC vars at call sites (#45815)
e516cf3 JIT: remove early prop's type propagation (#45655)
29325a6 JIT: don't compute edge weights when there's no profile data (#45615)
b66138c Move return buffer handling from interop to the JIT (#39294)
c218928 Return the result of PAL_InjectActivation in Thread::InjectGcSuspension (#44731)
76b52a5 Add OSX arm64 jit. (#45467)
44cf945 A small jit changes for Arm64 apple. (#45461)

@Anipik Anipik added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Feb 25, 2021
@jkotas
Copy link
Member

jkotas commented Feb 25, 2021

Do we have the more dumps for this crash now?

Note that this is crash during background GC, so the test that triggered the problem is likely finished running by the time the process crashes. The best way to investigate crashes like this one is stresslog.

@steveharter
Copy link
Member

Do we have the more dumps for this crash now?

No new dumps that I'm aware of although we have been getting other failures that may be related: #47805

@jkotas
Copy link
Member

jkotas commented Feb 25, 2021

To make progress on this, we really need to fix the CI to collect the dumps for crashes like this one. I believe that it used to work at one point in not so distant past. Why is the CI not collecting the crash dumps anymore?

@danmoseley
Copy link
Member

If I go to the last failure listed in #47805 https://dev.azure.com/dnceng/public/_build/results?buildId=982059&view=ms.vss-test-web.build-test-results-tab&runId=30778306&resultId=173593&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab
and click on the test and on the Helix tab
I see
image
which has our cool how to markdown that has the commands in it.

Does that work?

@danmoseley
Copy link
Member

What is a little more troublesome is going from Kusto (ie, a query of failures for the tests) to debugging the dump, because there is not a good way to get the dump file without a link to the build. I know @safern was working on a change so that runfo get-helix-payload will get the dump as well as everything else. Then it should be easy to get dump+payload for all failures for a test going back in time until they start discarding them.

@safern
Copy link
Member

safern commented Feb 25, 2021

I know @safern was working on a change so that runfo get-helix-payload will get the dump as well as everything else. Then it should be easy to get dump+payload for all failures for a test going back in time until they start discarding them.

I'm about to put a PR into runfo 😄

@danmoseley
Copy link
Member

I'm about to put a PR into runfo 😄

Nice! After that we can update how-to-debug-dump.md to match.

@jkotas
Copy link
Member

jkotas commented Feb 26, 2021

Are there any dumps for the musl x64 failure that this issue is about? (I am not setup to debug the OSX x64 dump at the moment.)

@danmoseley
Copy link
Member

danmoseley commented Feb 26, 2021

Here's what I find

https://engsrvprod.kusto.windows.net/engineeringdata?query=H4sIAAAAAAAEAG2RQWvDMAyF74H8BxF6SGEk63baIYMxVmgPg9HBzk6sNlpjq8g2bcZ%2b%2fJxkJTvsJvT8nj7JW65dmnzDuUVBeAsYUMMjqANDvrq71ctZ3HGQBqGqIGuoPIW6o6bU7C36UoL1ZLAU3LuyRaVdaZTzKNng%2f2SycCSrK7IWBXL4YDluPBoHc%2f5aCK3u%2bldlpim7PiaY4h0vvtg6trFy3mWz4%2bVC%2fpn1%2bHp1%2f5AmS2ALiw73vrimLIQObfRzPXT%2bo1lTh24wXqE2%2bi9VVMcs55V4dybfxv1ZsMiAZZYj%2bq9YaHMaIV0wRgl9ITQc75Mvoe6nEw%2bOm6l86kjFH0iTH1pYlQSKAQAA&web=0

Jobs
| where Queued > ago (120d)
| where Source == "ci/public/dotnet/runtime/refs/heads/master"
| join kind=inner ( WorkItems
| where FriendlyName == "System.Text.Json.Tests"
| where ExitCode == 139
) on $left.Name == $right.JobName
| join kind=inner Files on WorkItemId
| where FileName startswith "core." or FileName endswith ".dmp"
| summarize count() by QueueName, QueueAlias

QueueName QueueAlias count_
osx.1013.amd64.open osx.1013.amd64.open 15
ubuntu.1804.armarch.open ubuntu.1804.arm64.open 4
ubuntu.1604.amd64.open.rt alpine.312.amd64.open 2
osx.1014.amd64.open osx.1014.amd64.open 1

then looking at just Alpine, I find only two

https://engsrvprod.kusto.windows.net/engineeringdata?query=H4sIAAAAAAAEAG1RTUvDQBC9B%2fIfhtBDhbCxrQgeKmix0B4EqeJ5m50202Z3wuyGtuCPN4n9Qr0t%2b968j5k5L30cfcGuQEF4q7FGA4%2bg1wz9wfDW3FzABdeSI4zHkOSUVfWypDwzHByGTGoXyGImuPJZgdr4zGofUJJ2fsPkYEvOjMk5FOjDJ8t2FtB6uOhPhdCZ8vCq7Y%2fL4tAoWPWO%2b6Dmnl3z8sEnl4mXPYUJm449GD3E0Q2wg16Jq6BOKj2hddHM87L9%2bVX1qSTtOy9dVuRQjQZDpcXe3ymu0CXAck38w7TmxPyv5pRK9G2iU9uZua7boF1IH7QEv6NQNItlQdXZnuFmJ0dQGVt1RpXwBvMALZ5eBUxh0WqhSc%2brSWHCznOJH0IpPIt2eZEeTxlHcfQNUaYKIwACAAA%3d&web=0

Jobs
| where Queued > ago (120d)
| where Source == "ci/public/dotnet/runtime/refs/heads/master"
| join kind=inner ( WorkItems
| where FriendlyName == "System.Text.Json.Tests"
| where ExitCode == 139
) on $left.Name == $right.JobName
| where QueueAlias == "alpine.312.arm64.open" or QueueAlias == "alpine.312.amd64.open"
| join kind=inner Files on WorkItemId
| where FileName startswith "core." or FileName endswith ".dmp"
| project Name, QueueAlias, Started, ExitCode, ConsoleUri, Branch, Source

Name QueueAlias Started ExitCode ConsoleUri Branch Source
7cb6bbd8-e0e8-4ee2-9f37-bf5b3e6e6693 alpine.312.amd64.open 2020-11-26 11:54:33.5660000 139 https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-master-7cb6bbd8e0e84ee29f/System.Text.Json.Tests/console.fbc07370.log?sv=2019-07-07&se=2020-12-16T11%3A54%3A32Z&sr=c&sp=rl&sig=hT8QhYbgTOthE7sDPFLV%2FCyNIB%2FFmqesag9K1TKKs0k%3D refs/heads/master ci/public/dotnet/runtime/refs/heads/master
00624f66-390f-4a06-8904-833bc7cc6b4f alpine.312.amd64.open 2020-12-29 14:40:27.7330000 139 https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-master-00624f66390f4a0689/System.Text.Json.Tests/console.c8d16e70.log?sv=2019-07-07&se=2021-01-18T14%3A40%3A26Z&sr=c&sp=rl&sig=AAvKrGMGlpgrOEwNu6dUFSgh3AJUuV1Sa3oZoLkJpnM%3D refs/heads/master ci/public/dotnet/runtime/refs/heads/master

both both of these URLs give me

<Error>
<Code>AuthenticationFailed</Code>
<Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. RequestId:f76a8b7b-b01e-00e3-71e3-0ba8a1000000 Time:2021-02-26T02:03:35.7795736Z</Message>
<AuthenticationErrorDetail>Signed expiry time [Mon, 18 Jan 2021 14:40:26 GMT] must be after signed start time [Fri, 26 Feb 2021 02:03:35 GMT]</AuthenticationErrorDetail>
</Error>

@MattGal what's up there? if it's deleted, why doesn't it just give me 404?

If I comment out the "Source" line (so we get PR's as well, which may hvae their own bugs) I get 23 hits, but only 2 URL's work

ConsoleUri
https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-47901-merge-a1f817076a5b454b8a/System.Text.Json.Tests/console.747db9c7.log?sv=2019-07-07&se=2021-03-01T17%3A54%3A25Z&sr=c&sp=rl&sig=%2FYR3DJ3Puq%2Bj25phEgVbR5o0xhwS%2BKe5ByFMUe%2BqAjs%3D
https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-48377-merge-8fd73184daa84815aa/System.Text.Json.Tests/console.b51f55e2.log?sv=2019-07-07&se=2021-03-09T03%3A57%3A58Z&sr=c&sp=rl&sig=q5CwA%2Fe0dpyuWEdSvU0ejW6aTsaFLeyfNzcH54kNPu8%3D

The second one looks most promising as it's x64. Of course, it may be a PR issue. @safern I will need your help getting to the dump.

@danmoseley
Copy link
Member

BTW, no evidence anything started on 12/15.

adding
| summarize count() by format_datetime(bin(Started, 1d), 'MM-dd'), QueueAlias

Started QueueAlias count_
10-29 alpine.312.amd64.open 1
11-17 alpine.312.arm64.open 1
11-18 alpine.312.amd64.open 1
11-21 alpine.312.amd64.open 1
11-22 alpine.312.amd64.open 1
11-25 alpine.312.amd64.open 2
11-25 alpine.312.arm64.open 1
11-26 alpine.312.amd64.open 1
12-04 alpine.312.amd64.open 1
12-05 alpine.312.amd64.open 1
12-05 alpine.312.arm64.open 1
12-09 alpine.312.amd64.open 1
12-15 alpine.312.amd64.open 1
12-17 alpine.312.arm64.open 2
12-24 alpine.312.amd64.open 1
12-29 alpine.312.amd64.open 1
01-08 alpine.312.arm64.open 1
01-10 alpine.312.amd64.open 1
01-12 alpine.312.amd64.open 1
02-09 alpine.312.arm64.open 1
02-17 alpine.312.amd64.open 1

@safern
Copy link
Member

safern commented Feb 26, 2021

Nice! After that we can update how-to-debug-dump.md to match.

As promised: jaredpar/runfo#98. Once that is merged and a new version is published I'll update how-to-debug-dump.md

@danmoseley
Copy link
Member

@safern with your improvement, is it possible to dig up an alpine dump for this one for @jkotas to look at? Seems this is our highest hitting isse in CI right now.

@safern
Copy link
Member

safern commented Mar 17, 2021

@danmoseley I couldn't find any crash for these tests on Alpine after my improvement where the crash wasn't caused by the PR change itself.

https://dataexplorer.azure.com/clusters/engsrvprod/databases/engineeringdata?query=H4sIAAAAAAAAA3WSTUvDQBCG7/kVy55S0FStFEEiVLSQHkSp4FHW7NROm+yE3QlpwR/vJjZu48dtlnln5nlntgAWK4UFaJGKBb256EM0a7AgllTbHEROhhUaJ6QmNsBjWxvGEmQ0Hote+2ipAssI7kifryHfgpbfHZkKasDGTzXUMCtQuVGQ90k5Kyo0IEcBhJVlz3cj1DvFkzPdpipLG8i5Rc70SQfg0kpZB68bRyYORKFRJ0ruYIUGGclk3nMq5PRq2jJuCI3YotEpGgM2il/IbjOGMqxkbhGMLvYPqoSudLl3XpA8w46ThR/rI8dusJrObFfwn9Vkcn6RzEo9vRya5tp1Q26Vvt8hy9+ue8I2PmxpJMh8pa+juT/rEbx/DTnkmppTpnCgvt/AZ5D/dPvX1g5/yUMEuE8NCWR4ZAIAAA==

let failed = Jobs
| where Source contains "dotnet/runtime"
| where tolower(QueueAlias) contains tolower("Alpine")
| where Started > ago(30d)
| project JobId, Props=parse_json(Properties)
| where Props.DefinitionId == "686"
| join kind=inner
(WorkItems
| where FriendlyName == "System.Text.Json.Tests"
| where Status == "BadExit"
| project JobId, WorkItemId, Started) on JobId;
Files
| where FileName contains "how-to"
| where WorkItemFriendlyName contains "System.Text.Json"
| join kind=inner
failed on WorkItemId

@jkotas
Copy link
Member

jkotas commented Mar 17, 2021

So should we close this issue if we do not see these crashes anymore?

Seems this is our highest hitting isse in CI right now.

I assume that it is the OSX crash (#47805). Is that right?

@danmoseley
Copy link
Member

Yes. You are right. The Infrastructure health has both, but this one is falling off the list. it's the other that is hot.

I think we can close it. I confirm what @safern saw, with this query

https://engsrvprod.kusto.windows.net/engineeringdata?query=H4sIAAAAAAAEAJVRTWvCQBC9C%2fkPQ%2fCgEDZVS6EHC61U0EOh2NLzmh3NaHYnzG5QoT%2b%2bSeoXbS%2b9LfvezPuYOS991PmEXY6C8FphhQYeQK8ZeoPhjelfwAVXkiGMxxBnlJbVsqAsNRwchlQqF8hiKrjyaY7a%2bNRqH1BiYPn%2fJLm40d0wOdiSM2NyDgV68MGynQW0Hi6%2bpkLoTHF40fZbY3Gola16w31Qc8%2bufvng48vE857ChE3LHozuo04f2EG3wFVQpy1doXVez%2fOy%2bflR0WNB2rdauijJoRoNhkqLvbtVXKJrQ18RfzGtOTH%2fijmlAn3j6JR2Zq7j1mhr0gctwe8o5HWtLKha2TNcd3IElbFlK1QKbzAL0ODJlcEEFs0uNMm5mgQm7DwX%2bC6UwJNol%2bXJ8ZBRJ%2bp8AS7lPmE4AgAA&web=0

Jobs
| where Queued > ago (120d)
| where Source == "ci/public/dotnet/runtime/refs/heads/master" or Source == "ci/public/dotnet/runtime/refs/heads/main"
| join kind=inner ( WorkItems
| where FriendlyName == "System.Text.Json.Tests"
| where ExitCode == 139
) on $left.Name == $right.JobName
| where QueueAlias == "alpine.312.arm64.open" or QueueAlias == "alpine.312.amd64.open"
| join kind=inner Files on WorkItemId
| where FileName startswith "core." or FileName endswith ".dmp"
| project Name, QueueAlias, Started, ExitCode, ConsoleUri, Branch, Source

There are two very old hits.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-VM-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Projects
None yet
Development

No branches or pull requests

9 participants