Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

assert in ProcessWaitState on Linux arm64 #74795

Closed
wfurt opened this issue Aug 30, 2022 · 55 comments · Fixed by #75424
Closed

assert in ProcessWaitState on Linux arm64 #74795

wfurt opened this issue Aug 30, 2022 · 55 comments · Fixed by #75424
Assignees
Labels
arch-arm64 area-System.Net.Quic blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' Known Build Error Use this to report build issues in the .NET Helix tab os-linux Linux OS (any supported distro)
Milestone

Comments

@wfurt
Copy link
Member

wfurt commented Aug 30, 2022

Disabled tests:

Crash in System.Net.Requests.Tests Work Item - Last 7 days in Runfo:

  • 9/8 - 142x failures out of which 115x are on arm64-Mono
Day Run Details
9/7 Rolling run 6968 Core Dump - net7.0-Linux-Release-arm64-NativeAOT_Release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open
9/6 Rolling run 6173 2x Core Dump - net7.0-Linux-Release-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open & net7.0-Linux-Release-arm64-Mono_release-(Debian.11.Arm64.Open)Ubuntu.1804.Armarch.Open
9/4 Rolling run 4317 Code Rump - net7.0-Linux-Release-arm64-CoreCLR_release-(Ubuntu.2204.Arm64.Open)Ubuntu.1804.ArmArch.Open
8/30 Rolling run 1975580 2x Core Dump - net7.0-Linux-Release-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open & net7.0-Linux-Release-arm64-Mono_release-(Debian.11.Arm64.Open)Ubuntu.1804.Armarch.Open
8/29 Rolling run 1973318 Code Dump - net7.0-Linux-Release-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open
8/28 Rolling run 1972808 Core Dump - net7.0-Linux-Release-arm64-Mono_release-(Debian.11.Arm64.Open)Ubuntu.1804.Armarch.Open
8/28 Rolling run 1972465 Core Dump - net7.0-Linux-Release-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open
8/26 Rolling run 1970381 Core Dump - net7.0-Linux-Release-arm64-CoreCLR_release-(Ubuntu.2204.Arm64.Open)Ubuntu.1804.ArmArch.Open

Related to #69125.
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-aaacaf8e0a7f46c4ad/System.Net.Requests.Tests/1/console.1429bd54.log?%3Fhelixlogtype%3Dresult

/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: System.Net.Requests.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Net.Requests.Tests (found 349 of 367 test cases)
  Starting:    System.Net.Requests.Tests (parallel test collections = on, max threads = 4)
Process terminated. Error while reaping child. errno = 10
   at System.Environment.FailFast(System.String)
   at System.Diagnostics.ProcessWaitState.TryReapChild(Boolean)
   at System.Diagnostics.ProcessWaitState.CheckChildren(Boolean, Boolean)
   at System.Diagnostics.Process.OnSigChild(Int32, Int32)
./RunTests.sh: line 168:    21 Aborted                 (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Net.Requests.Tests.runtimeconfig.json --depsfile System.Net.Requests.Tests.deps.json xunit.console.dll System.Net.Requests.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE

dump: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-aaacaf8e0a7f46c4ad/System.Net.Requests.Tests/1/core.1001.21

errno 10 -> ENOCHILD

> clrstack -a
OS Thread Id: 0x1c (0)
        Child SP               IP Call Site
0000007EBADC5460 0000007fa357f200 [HelperMethodFrame_1OBJ: 0000007ebadc5460] System.Environment.FailFast(System.String)
0000007EBADC55E0 0000007F64861C34 System.Diagnostics.ProcessWaitState.TryReapChild(Boolean) [/_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/ProcessWaitState.Unix.cs @ 589]
    PARAMETERS:
        this (0x0000007EBADC5630) = 0x000000156f599f80
        configureConsole (0x0000007EBADC562C) = 0x0000000000000001
    LOCALS:
        0x0000007EBADC5620 = 0x000000156f599fc8
        0x0000007EBADC5618 = 0x0000000000000001
        0x0000007EBADC5610 = 0x0000000060230000
        0x0000007EBADC560C = 0x00000000ffffffff
        0x0000007EBADC5608 = 0x0000000000000001
        0x0000007EBADC5600 = 0x000000000000000a

0000007EBADC5640 0000007F648610C8 System.Diagnostics.ProcessWaitState.CheckChildren(Boolean, Boolean) [/_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/ProcessWaitState.Unix.cs @ 614]
    PARAMETERS:
        reapAll (0x0000007EBADC5724) = 0x0000000000000000
        configureConsole (0x0000007EBADC5720) = 0x0000000000000001
    LOCALS:
        0x0000007EBADC5718 = 0x000000156f42a280
        0x0000007EBADC5710 = 0x0000000000000001
        0x0000007EBADC570C = 0x0000000000000000
        0x0000007EBADC5708 = 0x000000000000003c
        0x0000007EBADC5700 = 0x000000156f599f80
        0x0000007EBADC56F8 = 0x0000000000000000
        0x0000007EBADC56F0 = 0x0000000000000000
        0x0000007EBADC56E8 = 0x0000000000000000
        0x0000007EBADC56C0 = 0x0000000000000000
        0x0000007EBADC56B0 = 0x0000000000000000
        0x0000007EBADC56A8 = 0x0000000000000000
        0x0000007EBADC5690 = 0x0000000000000000
        0x0000007EBADC5688 = 0x0000000000000000
        0x0000007EBADC5680 = 0x0000000000000000

0000007EBADC5730 0000007F64860B40 System.Diagnostics.Process.OnSigChild(Int32, Int32) [/_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/Process.Unix.cs @ 1104]
    PARAMETERS:
        reapAll = <no data>
        configureConsole (<CLR reg>) = 0x0000000000000001
    LOCALS:
        <CLR reg> = 0x0000000000000001
        <no data>
        <no data>

> dumpobj 0x000000156f599f80
Name:        System.Diagnostics.ProcessWaitState
MethodTable: 0000007f6480b9a8
EEClass:     0000007f64827760
Tracked Type: false
Size:        72(0x48) bytes
File:        /root/helix/work/correlation/shared/Microsoft.NETCore.App/8.0.0/System.Diagnostics.Process.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
0000007f62d49320  4000179        8        System.Object  0 instance 000000156f599fc8 _gate
0000007f63b4e388  400017a       20         System.Int32  1 instance               60 _processId
0000007f63b4ae88  400017b       28       System.Boolean  1 instance                1 _isChild
0000007f63b4ae88  400017c       29       System.Boolean  1 instance                1 _usesTerminal
0000007f63e07350  400017d       10 ...eading.Tasks.Task  0 instance 0000000000000000 _waitInProgress
0000007f63b4e388  400017e       24         System.Int32  1 instance                2 _outstandingRefCount
0000007f63b4ae88  400017f       2a       System.Boolean  1 instance                0 _exited
0000007f63c7cc10  4000180       2c ...Private.CoreLib]]  1 instance 000000156f599fac _exitCode
0000007f63f72c08  4000181       38      System.DateTime  1 instance 000000156f599fb8 _exitTime
0000007f645345b0  4000182       18 ....ManualResetEvent  0 instance 000000156f599fe0 _exitedEvent
0000007f6480d118  4000177       60 ...nostics.Process]]  0   static 000000156f42a230 s_processWaitStates
0000007f6480d118  4000178       68 ...nostics.Process]]  0   static 000000156f42a280 s_childProcessWaitStates

and with (some) symbols

> clrstack -i -a



Dumping managed stack and managed variables using ICorDebug.
=============================================================================
Child SP         IP               Call Site
0000007EBADC4A60 0000007fa357f200 [NativeStackFrame]
0000007EBADC5460 (null) [Internal call: 0000007EBADC5460]
0000007EBADC55E0 0000007f64861c34 [DEFAULT] [hasThis] Boolean System.Diagnostics.ProcessWaitState.TryReapChild(Boolean) (/root/helix/work/correlation/shared/Microsoft.NETCore.App/8.0.0/System.Diagnostics.Process.dll)

PARAMETERS:
  + System.Diagnostics.ProcessWaitState this @ 0x156f599f80
  + bool configureConsole  = true

LOCALS:
  + (Error 0x80004005 retrieving local variable 'local_0')
  + (Error 0x80004005 retrieving local variable 'local_1')
  + int exitCode  = 1612906496
  + int waitResult  = -1
  + (Error 0x80004005 retrieving local variable 'local_4')
  + int errorCode  = 10

0000007EBADC5640 0000007f648610c8 [DEFAULT] Void System.Diagnostics.ProcessWaitState.CheckChildren(Boolean,Boolean) (/root/helix/work/correlation/shared/Microsoft.NETCore.App/8.0.0/System.Diagnostics.Process.dll)

PARAMETERS:
  + bool reapAll  = false
  + bool configureConsole  = true

LOCALS:
  + (Error 0x80004005 retrieving local variable 'local_0')
  + (Error 0x80004005 retrieving local variable 'local_1')
  + bool checkAll  = false
  + int pid  = 60
  + System.Diagnostics.ProcessWaitState pws @ 0x156f599f80
  + int errorCode  = 0
  + System.Diagnostics.ProcessWaitState firstToRemove = null
  + System.Collections.Generic.List`1&lt;System.Diagnostics.ProcessWaitState&gt; additionalToRemove = null
  + (Error 0x80004005 retrieving local variable 'local_8')
  + System.Collections.Generic.KeyValuePair`2&lt;int,System.Diagnostics.ProcessWaitState&gt; kv @ 0x7ebadc56b0
  + System.Diagnostics.ProcessWaitState pws = null
  + (Error 0x80004005 retrieving local variable 'local_11')
  + System.Diagnostics.ProcessWaitState pws = null
  + (Error 0x80004005 retrieving local variable 'local_13')

0000007EBADC5730 0000007f64860b40 [DEFAULT] I4 System.Diagnostics.Process.OnSigChild(I4,I4) (/root/helix/work/correlation/shared/Microsoft.NETCore.App/8.0.0/System.Diagnostics.Process.dll)

PARAMETERS:
  + (Error 0x80131304 retrieving parameter 'reapAll')
  + int configureConsole  = 1

LOCALS:
  + bool childrenUsingTerminalPre  = true
  + (Error 0x80004005 retrieving local variable 'childrenUsingTerminalPost')
  + (Error 0x80004005 retrieving local variable 'local_2')

0000007EBADC5790 0000007f60218260 [NativeStackFrame]
Stack walk complete.
=============================================================================

cc: @tmds

{ "ErrorMessage":"Error while reaping child. errno = 10" } 

Report

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 0 0
@ghost
Copy link

ghost commented Aug 30, 2022

Tagging subscribers to this area: @dotnet/area-system-diagnostics-process
See info in area-owners.md if you want to be subscribed.

Issue Details

related to #69125.
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-aaacaf8e0a7f46c4ad/System.Net.Requests.Tests/1/console.1429bd54.log?%3Fhelixlogtype%3Dresult

/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: System.Net.Requests.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Net.Requests.Tests (found 349 of 367 test cases)
  Starting:    System.Net.Requests.Tests (parallel test collections = on, max threads = 4)
Process terminated. Error while reaping child. errno = 10
   at System.Environment.FailFast(System.String)
   at System.Diagnostics.ProcessWaitState.TryReapChild(Boolean)
   at System.Diagnostics.ProcessWaitState.CheckChildren(Boolean, Boolean)
   at System.Diagnostics.Process.OnSigChild(Int32, Int32)
./RunTests.sh: line 168:    21 Aborted                 (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Net.Requests.Tests.runtimeconfig.json --depsfile System.Net.Requests.Tests.deps.json xunit.console.dll System.Net.Requests.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE

dump: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-aaacaf8e0a7f46c4ad/System.Net.Requests.Tests/1/core.1001.21

errno 10 -> ENOCHILD

> clrstack -a
OS Thread Id: 0x1c (0)
        Child SP               IP Call Site
0000007EBADC5460 0000007fa357f200 [HelperMethodFrame_1OBJ: 0000007ebadc5460] System.Environment.FailFast(System.String)
0000007EBADC55E0 0000007F64861C34 System.Diagnostics.ProcessWaitState.TryReapChild(Boolean) [/_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/ProcessWaitState.Unix.cs @ 589]
    PARAMETERS:
        this (0x0000007EBADC5630) = 0x000000156f599f80
        configureConsole (0x0000007EBADC562C) = 0x0000000000000001
    LOCALS:
        0x0000007EBADC5620 = 0x000000156f599fc8
        0x0000007EBADC5618 = 0x0000000000000001
        0x0000007EBADC5610 = 0x0000000060230000
        0x0000007EBADC560C = 0x00000000ffffffff
        0x0000007EBADC5608 = 0x0000000000000001
        0x0000007EBADC5600 = 0x000000000000000a

0000007EBADC5640 0000007F648610C8 System.Diagnostics.ProcessWaitState.CheckChildren(Boolean, Boolean) [/_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/ProcessWaitState.Unix.cs @ 614]
    PARAMETERS:
        reapAll (0x0000007EBADC5724) = 0x0000000000000000
        configureConsole (0x0000007EBADC5720) = 0x0000000000000001
    LOCALS:
        0x0000007EBADC5718 = 0x000000156f42a280
        0x0000007EBADC5710 = 0x0000000000000001
        0x0000007EBADC570C = 0x0000000000000000
        0x0000007EBADC5708 = 0x000000000000003c
        0x0000007EBADC5700 = 0x000000156f599f80
        0x0000007EBADC56F8 = 0x0000000000000000
        0x0000007EBADC56F0 = 0x0000000000000000
        0x0000007EBADC56E8 = 0x0000000000000000
        0x0000007EBADC56C0 = 0x0000000000000000
        0x0000007EBADC56B0 = 0x0000000000000000
        0x0000007EBADC56A8 = 0x0000000000000000
        0x0000007EBADC5690 = 0x0000000000000000
        0x0000007EBADC5688 = 0x0000000000000000
        0x0000007EBADC5680 = 0x0000000000000000

0000007EBADC5730 0000007F64860B40 System.Diagnostics.Process.OnSigChild(Int32, Int32) [/_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/Process.Unix.cs @ 1104]
    PARAMETERS:
        reapAll = <no data>
        configureConsole (<CLR reg>) = 0x0000000000000001
    LOCALS:
        <CLR reg> = 0x0000000000000001
        <no data>
        <no data>

> dumpobj 0x000000156f599f80
Name:        System.Diagnostics.ProcessWaitState
MethodTable: 0000007f6480b9a8
EEClass:     0000007f64827760
Tracked Type: false
Size:        72(0x48) bytes
File:        /root/helix/work/correlation/shared/Microsoft.NETCore.App/8.0.0/System.Diagnostics.Process.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
0000007f62d49320  4000179        8        System.Object  0 instance 000000156f599fc8 _gate
0000007f63b4e388  400017a       20         System.Int32  1 instance               60 _processId
0000007f63b4ae88  400017b       28       System.Boolean  1 instance                1 _isChild
0000007f63b4ae88  400017c       29       System.Boolean  1 instance                1 _usesTerminal
0000007f63e07350  400017d       10 ...eading.Tasks.Task  0 instance 0000000000000000 _waitInProgress
0000007f63b4e388  400017e       24         System.Int32  1 instance                2 _outstandingRefCount
0000007f63b4ae88  400017f       2a       System.Boolean  1 instance                0 _exited
0000007f63c7cc10  4000180       2c ...Private.CoreLib]]  1 instance 000000156f599fac _exitCode
0000007f63f72c08  4000181       38      System.DateTime  1 instance 000000156f599fb8 _exitTime
0000007f645345b0  4000182       18 ....ManualResetEvent  0 instance 000000156f599fe0 _exitedEvent
0000007f6480d118  4000177       60 ...nostics.Process]]  0   static 000000156f42a230 s_processWaitStates
0000007f6480d118  4000178       68 ...nostics.Process]]  0   static 000000156f42a280 s_childProcessWaitStates

and with (some) symbols

> clrstack -i -a



Dumping managed stack and managed variables using ICorDebug.
=============================================================================
Child SP         IP               Call Site
0000007EBADC4A60 0000007fa357f200 [NativeStackFrame]
0000007EBADC5460 (null) [Internal call: 0000007EBADC5460]
0000007EBADC55E0 0000007f64861c34 [DEFAULT] [hasThis] Boolean System.Diagnostics.ProcessWaitState.TryReapChild(Boolean) (/root/helix/work/correlation/shared/Microsoft.NETCore.App/8.0.0/System.Diagnostics.Process.dll)

PARAMETERS:
  + System.Diagnostics.ProcessWaitState this @ 0x156f599f80
  + bool configureConsole  = true

LOCALS:
  + (Error 0x80004005 retrieving local variable 'local_0')
  + (Error 0x80004005 retrieving local variable 'local_1')
  + int exitCode  = 1612906496
  + int waitResult  = -1
  + (Error 0x80004005 retrieving local variable 'local_4')
  + int errorCode  = 10

0000007EBADC5640 0000007f648610c8 [DEFAULT] Void System.Diagnostics.ProcessWaitState.CheckChildren(Boolean,Boolean) (/root/helix/work/correlation/shared/Microsoft.NETCore.App/8.0.0/System.Diagnostics.Process.dll)

PARAMETERS:
  + bool reapAll  = false
  + bool configureConsole  = true

LOCALS:
  + (Error 0x80004005 retrieving local variable 'local_0')
  + (Error 0x80004005 retrieving local variable 'local_1')
  + bool checkAll  = false
  + int pid  = 60
  + System.Diagnostics.ProcessWaitState pws @ 0x156f599f80
  + int errorCode  = 0
  + System.Diagnostics.ProcessWaitState firstToRemove = null
  + System.Collections.Generic.List`1&lt;System.Diagnostics.ProcessWaitState&gt; additionalToRemove = null
  + (Error 0x80004005 retrieving local variable 'local_8')
  + System.Collections.Generic.KeyValuePair`2&lt;int,System.Diagnostics.ProcessWaitState&gt; kv @ 0x7ebadc56b0
  + System.Diagnostics.ProcessWaitState pws = null
  + (Error 0x80004005 retrieving local variable 'local_11')
  + System.Diagnostics.ProcessWaitState pws = null
  + (Error 0x80004005 retrieving local variable 'local_13')

0000007EBADC5730 0000007f64860b40 [DEFAULT] I4 System.Diagnostics.Process.OnSigChild(I4,I4) (/root/helix/work/correlation/shared/Microsoft.NETCore.App/8.0.0/System.Diagnostics.Process.dll)

PARAMETERS:
  + (Error 0x80131304 retrieving parameter 'reapAll')
  + int configureConsole  = 1

LOCALS:
  + bool childrenUsingTerminalPre  = true
  + (Error 0x80004005 retrieving local variable 'childrenUsingTerminalPost')
  + (Error 0x80004005 retrieving local variable 'local_2')

0000007EBADC5790 0000007f60218260 [NativeStackFrame]
Stack walk complete.
=============================================================================

cc: @tmds

Author: wfurt
Assignees: -
Labels:

arch-arm64, area-System.Diagnostics.Process, os-linux

Milestone: -

@tmds
Copy link
Member

tmds commented Aug 30, 2022

We had this last year for System.Net.Requests.Tests too: #55618.

This FailFast happens when some other code in the process reaps a child process that was started by .NET.

@jkotas jkotas added the Known Build Error Use this to report build issues in the .NET Helix tab label Aug 30, 2022
@danmoseley
Copy link
Member

In such a case, it's a fail fast necessary? Would it be reasonable to just continue?

@tmds
Copy link
Member

tmds commented Aug 30, 2022

Someone proposed such a change. We didn't take it because we obtain the exit code when we reap the child (from waitpid). So this affects the further state of the program. (#70705 (comment)).

@jozkee jozkee added this to the Future milestone Aug 30, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Aug 30, 2022
@danmoseley
Copy link
Member

Someone proposed such a change. We didn't take it because we obtain the exit code when we reap the child (from waitpid). So this affects the further state of the program. (#70705 (comment)).

I read that comment as a general statement that we prefer to fail fast than continue in a corrupted or undefined state (cc @jkotas). In this case, can we reasonably recover from this error? It seems we use the error code to release resources -- can we just do that? Or is a leak inevitable here?

It seems fragile to terminate the app suddenly in the case some child process terminates or is killed for some reason. I guess this does not happen on Windows.

@wfurt
Copy link
Member Author

wfurt commented Aug 30, 2022

One another questions is what this tart happening now. It can certainly be race condition that always existed. I tried to reproduce it locally (on ARM64) but so far no luck.
Only change I'm aware of is we updated test images to include MsQuic. That really should not impact this test set but timing may changed. I did quick search through MsQuic code and I did not see any attempts to deal with process management.
So the question is is what is reaping the process...

@jkotas
Copy link
Member

jkotas commented Aug 30, 2022

than continue in a corrupted or undefined state (cc @jkotas).

The corrupted state here is akin to calling free on a pointer that you do not own.

The situation is not triggered by child process terminating or being killed for some reason. The situation is triggered by some code in the current process handling it incorrectly.

It is also possible that the root cause has nothing to do with process reaping. Instead, the root cause may be a memory corruption somewhere else that this crash is a side-effect of.

@wfurt
Copy link
Member Author

wfurt commented Aug 31, 2022

I did check the test suites and I don't see any other attempts to reap or execute besides our RemoteExecutor @tmds.

16623 execve("/media/5bfe43cb-dd6e-413b-8b94-00f69f90ecc2/wfurt/runtime/artifacts/bin/testhost/net7.0-Linux-Debug-arm64/dotnet", ["/media/5bfe43cb-dd6e-413b-8b94-0"..., "exec", "--runtimeconfig", "System.Net.Requests.Tests.runtim"..., "--depsfile", "System.Net.Requests.Tests.deps.j"..., "xunit.console.dll", "System.Net.Requests.Tests.dll", "-xml", "testResults.xml", "-nologo", "-notrait", "category=OuterLoop", "-notrait", "category=failing", "-nocolor"], 0x7fe0751af8 /* 13 vars */) = 0
16644 execve("/mnt/runtime/artifacts/bin/testhost/net7.0-Linux-Debug-arm64/dotnet", ["/mnt/runtime/artifacts/bin/testh"..., "exec", "--runtimeconfig", "/mnt/runtime/artifacts/bin/Syste"..., "--depsfile", "/mnt/runtime/artifacts/bin/Syste"..., "/mnt/runtime/artifacts/bin/Syste"..., "System.Net.Requests.Tests, Versi"..., "System.Net.Tests.WebRequestTest+"..., "<DefaultWebProxy_SetThenGet_Valu"..., "/tmp/31jqjssf.ryi"], 0x7e8803b9a0 /* 13 vars */) = 0
16651 execve("/mnt/runtime/artifacts/bin/testhost/net7.0-Linux-Debug-arm64/dotnet", ["/mnt/runtime/artifacts/bin/testh"..., "exec", "--runtimeconfig", "/mnt/runtime/artifacts/bin/Syste"..., "--depsfile", "/mnt/runtime/artifacts/bin/Syste"..., "/mnt/runtime/artifacts/bin/Syste"..., "System.Net.Requests.Tests, Versi"..., "System.Net.Tests.GlobalProxySele"..., "<Select_Success>b__1_0", "/tmp/0rb3jbxa.g4r"], 0x7e90005e40 /* 13 vars */) = 0
16640 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16644, si_uid=1001, si_status=42, si_utime=10, si_stime=6} ---
16630 waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16644, si_uid=1001, si_status=42, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16630 wait4(16644, [{WIFEXITED(s) && WEXITSTATUS(s) == 42}], WNOHANG, NULL) = 16644
16630 waitid(P_ALL, 0, {}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16638 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16651, si_uid=1001, si_status=42, si_utime=18, si_stime=10} ---
16663 execve("/mnt/runtime/artifacts/bin/testhost/net7.0-Linux-Debug-arm64/dotnet", ["/mnt/runtime/artifacts/bin/testh"..., "exec", "--runtimeconfig", "/mnt/runtime/artifacts/bin/Syste"..., "--depsfile", "/mnt/runtime/artifacts/bin/Syste"..., "/mnt/runtime/artifacts/bin/Syste"..., "System.Net.Requests.Tests, Versi"..., "System.Net.Tests.WebRequestTest+"..., "<SendGetRequest_WithGlobalCacheP"..., "/tmp/dkgtmmfz.ado"], 0x7e8803d100 /* 13 vars */) = 0
16630 waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16651, si_uid=1001, si_status=42, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16630 wait4(16651, [{WIFEXITED(s) && WEXITSTATUS(s) == 42}], WNOHANG, NULL) = 16651
16630 waitid(P_ALL, 0, {}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16664 execve("/mnt/runtime/artifacts/bin/testhost/net7.0-Linux-Debug-arm64/dotnet", ["/mnt/runtime/artifacts/bin/testh"..., "exec", "--runtimeconfig", "/mnt/runtime/artifacts/bin/Syste"..., "--depsfile", "/mnt/runtime/artifacts/bin/Syste"..., "/mnt/runtime/artifacts/bin/Syste"..., "System.Net.Requests.Tests, Versi"..., "System.Net.Tests.AuthenticationM"..., "<CustomTargetNameDictionary_Vali"..., "/tmp/3ibo3kk5.gfq"], 0x7e90051de0 /* 13 vars */) = 0
16637 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16664, si_uid=1001, si_status=42, si_utime=10, si_stime=6} ---
16630 waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16664, si_uid=1001, si_status=42, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16630 wait4(16664, [{WIFEXITED(s) && WEXITSTATUS(s) == 42}], WNOHANG, NULL) = 16664
16630 waitid(P_ALL, 0, {}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16678 execve("/mnt/runtime/artifacts/bin/testhost/net7.0-Linux-Debug-arm64/dotnet", ["/mnt/runtime/artifacts/bin/testh"..., "exec", "--runtimeconfig", "/mnt/runtime/artifacts/bin/Syste"..., "--depsfile", "/mnt/runtime/artifacts/bin/Syste"..., "/mnt/runtime/artifacts/bin/Syste"..., "System.Net.Requests.Tests, Versi"..., "System.Net.Tests.AuthenticationM"..., "<Register_Unregister_ModuleCount"..., "/tmp/tuctrifw.tyv"], 0x7e9002e400 /* 13 vars */) = 0
16689 execve("/mnt/runtime/artifacts/bin/testhost/net7.0-Linux-Debug-arm64/dotnet", ["/mnt/runtime/artifacts/bin/testh"..., "exec", "--runtimeconfig", "/mnt/runtime/artifacts/bin/Syste"..., "--depsfile", "/mnt/runtime/artifacts/bin/Syste"..., "/mnt/runtime/artifacts/bin/Syste"..., "System.Net.Requests.Tests, Versi"..., "System.Net.Tests.HttpWebRequestT"..., "<DefaultMaximumResponseHeadersLe"..., "/tmp/5edbvy3h.1x1"], 0x7e8c07b110 /* 13 vars */) = 0
16637 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16678, si_uid=1001, si_status=42, si_utime=12, si_stime=9} ---
16690 execve("/mnt/runtime/artifacts/bin/testhost/net7.0-Linux-Debug-arm64/dotnet", ["/mnt/runtime/artifacts/bin/testh"..., "exec", "--runtimeconfig", "/mnt/runtime/artifacts/bin/Syste"..., "--depsfile", "/mnt/runtime/artifacts/bin/Syste"..., "/mnt/runtime/artifacts/bin/Syste"..., "System.Net.Requests.Tests, Versi"..., "System.Net.Tests.HttpWebRequestT"..., "<GetResponseAsync_ParametersAreC"..., "/tmp/ad05mcf0.phu", "False"], 0x7e8403e720 /* 13 vars */) = 0
16630 waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16678, si_uid=1001, si_status=42, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16630 wait4(16678, [{WIFEXITED(s) && WEXITSTATUS(s) == 42}], WNOHANG, NULL) = 16678
16630 waitid(P_ALL, 0, {}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16691 execve("/mnt/runtime/artifacts/bin/testhost/net7.0-Linux-Debug-arm64/dotnet", ["/mnt/runtime/artifacts/bin/testh"..., "exec", "--runtimeconfig", "/mnt/runtime/artifacts/bin/Syste"..., "--depsfile", "/mnt/runtime/artifacts/bin/Syste"..., "/mnt/runtime/artifacts/bin/Syste"..., "System.Net.Requests.Tests, Versi"..., "System.Net.Tests.AuthenticationM"..., "<CredentialPolicy_Roundtrip>b__6"..., "/tmp/jqxqxkup.irr"], 0x7e9002d320 /* 13 vars */) = 0
16638 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16689, si_uid=1001, si_status=42, si_utime=11, si_stime=6} ---
16630 waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16689, si_uid=1001, si_status=42, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16630 wait4(16689, [{WIFEXITED(s) && WEXITSTATUS(s) == 42}], WNOHANG, NULL) = 16689
16630 waitid(P_ALL, 0, {}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16637 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16691, si_uid=1001, si_status=42, si_utime=11, si_stime=11} ---
16630 waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16691, si_uid=1001, si_status=42, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16630 wait4(16691, [{WIFEXITED(s) && WEXITSTATUS(s) == 42}], WNOHANG, NULL) = 16691
16630 waitid(P_ALL, 0, {}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16711 execve("/mnt/runtime/artifacts/bin/testhost/net7.0-Linux-Debug-arm64/dotnet", ["/mnt/runtime/artifacts/bin/testh"..., "exec", "--runtimeconfig", "/mnt/runtime/artifacts/bin/Syste"..., "--depsfile", "/mnt/runtime/artifacts/bin/Syste"..., "/mnt/runtime/artifacts/bin/Syste"..., "System.Net.Requests.Tests, Versi"..., "System.Net.Tests.AuthenticationM"..., "<Register_UnregisterByScheme_Mod"..., "/tmp/hzkvicop.2dw"], 0x7e9002d3c0 /* 13 vars */) = 0
16728 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16711, si_uid=1001, si_status=42, si_utime=13, si_stime=5} ---
16630 waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16711, si_uid=1001, si_status=42, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16630 wait4(16711, [{WIFEXITED(s) && WEXITSTATUS(s) == 42}], WNOHANG, NULL) = 16711
16630 waitid(P_ALL, 0, {}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16718 --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=16663, si_uid=1001} ---
16719 --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=16663, si_uid=1001} ---
16728 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16663, si_uid=1001, si_status=42, si_utime=95, si_stime=32} ---
16630 waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16663, si_uid=1001, si_status=42, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
16630 wait4(16663, [{WIFEXITED(s) && WEXITSTATUS(s) == 42}], WNOHANG, NULL) = 16663
16630 waitid(P_ALL, 0, {}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0
....

I'm running the test suite in loop but as usually no repro. It is interesting that the dump above has _usesTerminal=1. I was thinking about differences in CI and one is that the test suite is executed from background Helix daemon vs running in foreground from interactive shell. I don't think it should make much difference for process management but the process code has some specific processing.

@tmds
Copy link
Member

tmds commented Aug 31, 2022

The process exit handling code hasn't changed much in the past years. I don't expect it to have a bug as it has been used probably millions of times in various situations without users reporting the errno = 10 case.

I think we should treat this as a glitch unless it reproduces more.

@wfurt
Copy link
Member Author

wfurt commented Aug 31, 2022

I run it whole night on ARM machine without repro. However it fails somewhat often in CI. As @jkotas mentioned it may be data corruption. But I did not see any signs and it fails in same way on different OSes and versions.

On the glitch path .... is there chance that is some kernel bug or behavior? I was thinking to run it ci with tracing enabled (and dump it on failure) but it seems like the process code does not really have any. What would we need to come to bottom of this (and how do we expect users to debug it?) runnings strace may not be practical in many cases IMHO.

@tmds
Copy link
Member

tmds commented Aug 31, 2022

However it fails somewhat often in CI.

I thought this was a single occurrence. Since when did it start failing in CI?

@wfurt
Copy link
Member Author

wfurt commented Aug 31, 2022

It feels like it start happening more in last few weeks. But as far as I can tell the #55618 was never solved, right?
There was #73972 so it is hard to know all the details without going through the logs by hand. (Kusto reports same result)
With the corruption we would generally get crash but the waitpid issue is different and consistent.

I see ~ 200+ failures in last two weeks. Mono was disabled in #74668 (see #74667)

@tmds
Copy link
Member

tmds commented Aug 31, 2022

#55618 was reported over a year ago, and it was closed after the issue had no longer occurred for 3 weeks. Since then it wasn't reported until this issue was created this week.

Having an idea of when it started to occur more can give us a clue as to what change triggered it.

I see ~ 200+ failures in last two weeks. Mono was disabled in #74668 (see #74667)

We've only seen this on arm64?

Does it occur less with coreclr than with mono?

The code asks the kernel if there are children that terminated using waitid:

16630 waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16711, si_uid=1001, si_status=42, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0

If there is one, it passes the pid (from si_pid) to reap the child:

16630 wait4(16711, [{WIFEXITED(s) && WEXITSTATUS(s) == 42}], WNOHANG, NULL) = 16711

In between these calls the child suddenly goes missing.

So either something has messed up the pid, or something else has reaped the child in the meanwhile.

If it's something else reaping the child, it's expected to be a native library making a blocking waitpid(-1, &status, 0) call (or wait(&status)) on a thread. This thread will race with .NET signal handling as soon as the child terminates.

If there is no native library doing this, then somehow the pid changed.
Or there is a bug in the kernel.

Note that the signal handling is running on is native pthread. It's not a managed thread.

@wfurt
Copy link
Member Author

wfurt commented Aug 31, 2022

I see it recently only on ARM. Some of the older cases are gone e.g. there is note of crash on Kusto but we no longer have console so we cannot check if the pattern is the same. I searched our code base (and msquic) and I did not see any direct uses of wait*. So I'm running out of ideas.

Now, since you mentioned the -1: #72830 is another mysterious case when Socket's SafeHandle turns into -1. There is also no local repro so far and insight is minimal.

@wfurt
Copy link
Member Author

wfurt commented Aug 31, 2022

actually, the runs are containers. I'm wondering if this is applicable:

// In two cases we reap all processes (and may inadvertently reap non-managed processes):
// - When the original disposition is SIG_IGN, children that terminated did not become zombies.
// Because overwrote the disposition, we have become responsible for reaping those processes.
// - pid 1 (the init daemon) is responsible for reaping orphaned children.
// Because containers usually don't have an init daemon .NET may be pid 1.
bool reapAll = g_pid == 1 || IsSigIgn(OrigActionFor(signalCode));

That would eventually call

 pid = Interop.Sys.WaitPidExitedNoHang(-1, out _);

@jkotas
Copy link
Member

jkotas commented Aug 31, 2022

Now, since you mentioned the -1: #72830 is another mysterious case when Socket's SafeHandle turns into -1.

Yes, both of these can have the case root cause.

Notice that both of these issues have been only seen in System.Net.Requests.Tests.

One possible explanation is that a method passes a pointer to a stack location to some asynchronous logic. The method goes out of scope, and some completely unrelated code starts running. And finally, the asynchronous logic wakes up and overwrites the location on the stack that is used for something completely different by now.

Does it ring any bells? Do we have a code in sockets that is passing a pointer to a stack location to some asynchronous logic?

@wfurt
Copy link
Member Author

wfurt commented Aug 31, 2022

I will take a look. #72830 happens in System.Net.Mail. That has really only one functional change recently and I don't see how that would cause either of the problems.

@jkotas
Copy link
Member

jkotas commented Aug 31, 2022

I will take a look. #72830 happens in System.Net.Mail.

Right, but the mysterious crash is still in sockets.

@tmds
Copy link
Member

tmds commented Sep 1, 2022

actually, the runs are containers. I'm wondering if this is applicable:

It isn't. It is applicable only for the main process in the container (pid 1). You can see reapAll = false in your debugger output.

I don't see any reports of this on x64.

Was Mono disabled because it happens more on Mono than it does on coreclr?

@jkotas
Copy link
Member

jkotas commented Sep 10, 2022

Our tests have a specific pattern that is prone to hit the lttng bug: The tests are initializing Quic on one thread while there is a remote executor exec storm of short-lived processes on other threads. I think that it is unlikely for real world applications out there to have the same pattern.

Until we get lttng fix in place, we may want to set QUIC_LTTng=0 in our CI docker images.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 11, 2022
@wfurt
Copy link
Member Author

wfurt commented Sep 11, 2022

This is not just the docker, right? Since this is showing up so far only on ARM64, we may choose to update only that.
I put up #75424 with alternative approach to load msquic early before tests runs. That also seems sufficient.

I'm working on issue for LTTNG. They don't use GitHub for tracking and my registration to submit issue is pending for approval. I look at the code again and changing wait to waitpid should be safe and it it should fix the observed behavior.

@karelz
Copy link
Member

karelz commented Sep 12, 2022

@wfurt do we understand why it is happening only on System.Net.Request on not on other test suites?
Will we need to initialize msquic in more test suites in future?

@karelz
Copy link
Member

karelz commented Sep 12, 2022

Also, does it impact also release/7.0 branch? Should we backport it as test-only change?

@jkotas
Copy link
Member

jkotas commented Sep 12, 2022

@wfurt do we understand why it is happening only on System.Net.Request on not on other test suites?
Will we need to initialize msquic in more test suites in future?

Any test that launches child processes and ends up loading Quic is prone to hitting this issue. I expect that we will see this problem in a few more tests over time.

An alternative big hammer fix that would address it everywhere would be to set QUIC_LTTng=0 in the environment. I think we try the proposed fix and see how it works, and only go back to the big hammer fix if we get signal that the problem is wide-spread.

does it impact also release/7.0 branch? Should we backport it as test-only change?

Yes.

@wfurt
Copy link
Member Author

wfurt commented Sep 12, 2022

Note this is initialization race. So when Quic is initialized befe test runs and child processes it is OK. That for example happens when we conditional tests where XUnit evaluates conditions before test run.

@ghost
Copy link

ghost commented Sep 12, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Disabled tests:

Crash in System.Net.Requests.Tests Work Item - Last 7 days in Runfo:

  • 9/8 - 142x failures out of which 115x are on arm64-Mono
Day Run Details
9/7 Rolling run 6968 Core Dump - net7.0-Linux-Release-arm64-NativeAOT_Release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open
9/6 Rolling run 6173 2x Core Dump - net7.0-Linux-Release-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open & net7.0-Linux-Release-arm64-Mono_release-(Debian.11.Arm64.Open)Ubuntu.1804.Armarch.Open
9/4 Rolling run 4317 Code Rump - net7.0-Linux-Release-arm64-CoreCLR_release-(Ubuntu.2204.Arm64.Open)Ubuntu.1804.ArmArch.Open
8/30 Rolling run 1975580 2x Core Dump - net7.0-Linux-Release-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open & net7.0-Linux-Release-arm64-Mono_release-(Debian.11.Arm64.Open)Ubuntu.1804.Armarch.Open
8/29 Rolling run 1973318 Code Dump - net7.0-Linux-Release-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open
8/28 Rolling run 1972808 Core Dump - net7.0-Linux-Release-arm64-Mono_release-(Debian.11.Arm64.Open)Ubuntu.1804.Armarch.Open
8/28 Rolling run 1972465 Core Dump - net7.0-Linux-Release-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open
8/26 Rolling run 1970381 Core Dump - net7.0-Linux-Release-arm64-CoreCLR_release-(Ubuntu.2204.Arm64.Open)Ubuntu.1804.ArmArch.Open

Related to #69125.
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-aaacaf8e0a7f46c4ad/System.Net.Requests.Tests/1/console.1429bd54.log?%3Fhelixlogtype%3Dresult

/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: System.Net.Requests.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Net.Requests.Tests (found 349 of 367 test cases)
  Starting:    System.Net.Requests.Tests (parallel test collections = on, max threads = 4)
Process terminated. Error while reaping child. errno = 10
   at System.Environment.FailFast(System.String)
   at System.Diagnostics.ProcessWaitState.TryReapChild(Boolean)
   at System.Diagnostics.ProcessWaitState.CheckChildren(Boolean, Boolean)
   at System.Diagnostics.Process.OnSigChild(Int32, Int32)
./RunTests.sh: line 168:    21 Aborted                 (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Net.Requests.Tests.runtimeconfig.json --depsfile System.Net.Requests.Tests.deps.json xunit.console.dll System.Net.Requests.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE

dump: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-aaacaf8e0a7f46c4ad/System.Net.Requests.Tests/1/core.1001.21

errno 10 -> ENOCHILD

> clrstack -a
OS Thread Id: 0x1c (0)
        Child SP               IP Call Site
0000007EBADC5460 0000007fa357f200 [HelperMethodFrame_1OBJ: 0000007ebadc5460] System.Environment.FailFast(System.String)
0000007EBADC55E0 0000007F64861C34 System.Diagnostics.ProcessWaitState.TryReapChild(Boolean) [/_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/ProcessWaitState.Unix.cs @ 589]
    PARAMETERS:
        this (0x0000007EBADC5630) = 0x000000156f599f80
        configureConsole (0x0000007EBADC562C) = 0x0000000000000001
    LOCALS:
        0x0000007EBADC5620 = 0x000000156f599fc8
        0x0000007EBADC5618 = 0x0000000000000001
        0x0000007EBADC5610 = 0x0000000060230000
        0x0000007EBADC560C = 0x00000000ffffffff
        0x0000007EBADC5608 = 0x0000000000000001
        0x0000007EBADC5600 = 0x000000000000000a

0000007EBADC5640 0000007F648610C8 System.Diagnostics.ProcessWaitState.CheckChildren(Boolean, Boolean) [/_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/ProcessWaitState.Unix.cs @ 614]
    PARAMETERS:
        reapAll (0x0000007EBADC5724) = 0x0000000000000000
        configureConsole (0x0000007EBADC5720) = 0x0000000000000001
    LOCALS:
        0x0000007EBADC5718 = 0x000000156f42a280
        0x0000007EBADC5710 = 0x0000000000000001
        0x0000007EBADC570C = 0x0000000000000000
        0x0000007EBADC5708 = 0x000000000000003c
        0x0000007EBADC5700 = 0x000000156f599f80
        0x0000007EBADC56F8 = 0x0000000000000000
        0x0000007EBADC56F0 = 0x0000000000000000
        0x0000007EBADC56E8 = 0x0000000000000000
        0x0000007EBADC56C0 = 0x0000000000000000
        0x0000007EBADC56B0 = 0x0000000000000000
        0x0000007EBADC56A8 = 0x0000000000000000
        0x0000007EBADC5690 = 0x0000000000000000
        0x0000007EBADC5688 = 0x0000000000000000
        0x0000007EBADC5680 = 0x0000000000000000

0000007EBADC5730 0000007F64860B40 System.Diagnostics.Process.OnSigChild(Int32, Int32) [/_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/Process.Unix.cs @ 1104]
    PARAMETERS:
        reapAll = <no data>
        configureConsole (<CLR reg>) = 0x0000000000000001
    LOCALS:
        <CLR reg> = 0x0000000000000001
        <no data>
        <no data>

> dumpobj 0x000000156f599f80
Name:        System.Diagnostics.ProcessWaitState
MethodTable: 0000007f6480b9a8
EEClass:     0000007f64827760
Tracked Type: false
Size:        72(0x48) bytes
File:        /root/helix/work/correlation/shared/Microsoft.NETCore.App/8.0.0/System.Diagnostics.Process.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
0000007f62d49320  4000179        8        System.Object  0 instance 000000156f599fc8 _gate
0000007f63b4e388  400017a       20         System.Int32  1 instance               60 _processId
0000007f63b4ae88  400017b       28       System.Boolean  1 instance                1 _isChild
0000007f63b4ae88  400017c       29       System.Boolean  1 instance                1 _usesTerminal
0000007f63e07350  400017d       10 ...eading.Tasks.Task  0 instance 0000000000000000 _waitInProgress
0000007f63b4e388  400017e       24         System.Int32  1 instance                2 _outstandingRefCount
0000007f63b4ae88  400017f       2a       System.Boolean  1 instance                0 _exited
0000007f63c7cc10  4000180       2c ...Private.CoreLib]]  1 instance 000000156f599fac _exitCode
0000007f63f72c08  4000181       38      System.DateTime  1 instance 000000156f599fb8 _exitTime
0000007f645345b0  4000182       18 ....ManualResetEvent  0 instance 000000156f599fe0 _exitedEvent
0000007f6480d118  4000177       60 ...nostics.Process]]  0   static 000000156f42a230 s_processWaitStates
0000007f6480d118  4000178       68 ...nostics.Process]]  0   static 000000156f42a280 s_childProcessWaitStates

and with (some) symbols

> clrstack -i -a



Dumping managed stack and managed variables using ICorDebug.
=============================================================================
Child SP         IP               Call Site
0000007EBADC4A60 0000007fa357f200 [NativeStackFrame]
0000007EBADC5460 (null) [Internal call: 0000007EBADC5460]
0000007EBADC55E0 0000007f64861c34 [DEFAULT] [hasThis] Boolean System.Diagnostics.ProcessWaitState.TryReapChild(Boolean) (/root/helix/work/correlation/shared/Microsoft.NETCore.App/8.0.0/System.Diagnostics.Process.dll)

PARAMETERS:
  + System.Diagnostics.ProcessWaitState this @ 0x156f599f80
  + bool configureConsole  = true

LOCALS:
  + (Error 0x80004005 retrieving local variable 'local_0')
  + (Error 0x80004005 retrieving local variable 'local_1')
  + int exitCode  = 1612906496
  + int waitResult  = -1
  + (Error 0x80004005 retrieving local variable 'local_4')
  + int errorCode  = 10

0000007EBADC5640 0000007f648610c8 [DEFAULT] Void System.Diagnostics.ProcessWaitState.CheckChildren(Boolean,Boolean) (/root/helix/work/correlation/shared/Microsoft.NETCore.App/8.0.0/System.Diagnostics.Process.dll)

PARAMETERS:
  + bool reapAll  = false
  + bool configureConsole  = true

LOCALS:
  + (Error 0x80004005 retrieving local variable 'local_0')
  + (Error 0x80004005 retrieving local variable 'local_1')
  + bool checkAll  = false
  + int pid  = 60
  + System.Diagnostics.ProcessWaitState pws @ 0x156f599f80
  + int errorCode  = 0
  + System.Diagnostics.ProcessWaitState firstToRemove = null
  + System.Collections.Generic.List`1&lt;System.Diagnostics.ProcessWaitState&gt; additionalToRemove = null
  + (Error 0x80004005 retrieving local variable 'local_8')
  + System.Collections.Generic.KeyValuePair`2&lt;int,System.Diagnostics.ProcessWaitState&gt; kv @ 0x7ebadc56b0
  + System.Diagnostics.ProcessWaitState pws = null
  + (Error 0x80004005 retrieving local variable 'local_11')
  + System.Diagnostics.ProcessWaitState pws = null
  + (Error 0x80004005 retrieving local variable 'local_13')

0000007EBADC5730 0000007f64860b40 [DEFAULT] I4 System.Diagnostics.Process.OnSigChild(I4,I4) (/root/helix/work/correlation/shared/Microsoft.NETCore.App/8.0.0/System.Diagnostics.Process.dll)

PARAMETERS:
  + (Error 0x80131304 retrieving parameter 'reapAll')
  + int configureConsole  = 1

LOCALS:
  + bool childrenUsingTerminalPre  = true
  + (Error 0x80004005 retrieving local variable 'childrenUsingTerminalPost')
  + (Error 0x80004005 retrieving local variable 'local_2')

0000007EBADC5790 0000007f60218260 [NativeStackFrame]
Stack walk complete.
=============================================================================

cc: @tmds

{ "ErrorMessage":"Error while reaping child. errno = 10" } 

Report

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 0 0
Author: wfurt
Assignees: wfurt
Labels:

arch-arm64, disabled-test, os-linux, blocking-clean-ci, area-System.Net.Quic, in-pr, Known Build Error

Milestone: 7.0.0

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 12, 2022
@wfurt
Copy link
Member Author

wfurt commented Sep 12, 2022

reopening for 7.0

@wfurt wfurt reopened this Sep 12, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 12, 2022
@karelz
Copy link
Member

karelz commented Sep 12, 2022

@wfurt can you please also re-enable the disabled test from PR #75266?

@wfurt
Copy link
Member Author

wfurt commented Sep 12, 2022

That was part of #75424 unless I missed something. @jkotas already asked for it during review.

@wfurt
Copy link
Member Author

wfurt commented Sep 12, 2022

#75266 does not seems to make it to 7.0 so that part is missing the back-port PR.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 12, 2022
@karelz karelz removed the disabled-test The test is disabled in source code against the issue label Sep 12, 2022
@karelz
Copy link
Member

karelz commented Sep 12, 2022

Yep, sorry - I looked accidentally only at 7.0 change where it was not disabled ... I updated to post and removed the disabled-test label.

@karelz
Copy link
Member

karelz commented Sep 13, 2022

Fixed in main (8.0) in PR #75424 and in 7.0 (RC2) in PR #75466.

@tmds
Copy link
Member

tmds commented Sep 15, 2022

For reference, the lttng issue created by @wfurt: https://bugs.lttng.org/issues/1359.

@ghost ghost locked as resolved and limited conversation to collaborators Oct 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-System.Net.Quic blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' Known Build Error Use this to report build issues in the .NET Helix tab os-linux Linux OS (any supported distro)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants