Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MigrationsRunner locks up on dotnet CLI first run #12365

Open
ZeroSkill1 opened this issue Jan 13, 2023 · 12 comments
Open

MigrationsRunner locks up on dotnet CLI first run #12365

ZeroSkill1 opened this issue Jan 13, 2023 · 12 comments
Assignees

Comments

@ZeroSkill1
Copy link

ZeroSkill1 commented Jan 13, 2023

NuGet Product Used

dotnet.exe

Product Version

dotnet 6.0.10 (SDK 6.0.402)+

Worked before?

dotnet 6.0.9 (SDK 6.0.401)

Impact

I'm unable to use this version

Repro Steps & Context

Since MigrationsRunner was introduced in NuGet.Client (used in .NET 6.0.10, SDK 6.0.402+), I have not been able to use dotnet at all.

I have opened an issue over at dotnet/sdk already and that led me to NuGet being the cause. As seen in the MigrationsRunner source, some kind of logic is executed that uses Mutexes. When dotnet runs for the first time, it executes the .Run() method, it seems to always hang and infinitely wait(?) for the Mutex to release, which never happens. As a result, dotnet cannot continue with its first run logic.

To reproduce:

  • Download any version of dotnet newer than 6.0.9 (SDK 6.0.401)
  • Make sure you do not have a directory called $HOME/.dotnet (because dotnet will not execute MigrationsRunner.Run() otherwise)
  • Run a basic command such as dotnet new console -o HelloWorld or even dotnet nuget
  • dotnet will hang, spiking the CPU utilization on one core.

What I've tried:

  • Removing $HOME/.nuget to see if that would be an issue (perhaps due to incorrect permissions)
  • Tracing syscalls of the dotnet binary, which showed the infinite Mutex wait

Verbose Logs

No response

@ZeroSkill1
Copy link
Author

I would've tried to debug this a little more myself, but I am unable to do so as it's impossible to get my hands on a debug build by compiling one. The dotnet-install script downloads an earlier version of the SDK, which, due to this issue, prevents dotnet from doing anything.

@kartheekp-ms
Copy link
Contributor

kartheekp-ms commented Jan 15, 2023

When dotnet runs for the first time, it executes the .Run() method, it seems to always hang and infinitely wait(?) for the Mutex to release, which never happens. As a result, dotnet cannot continue with its first run logic.

@ZeroSkill1 - https://github.com/dotnet/sdk/blob/dfe7bcb5ffe75746f35553c9919c317384c7c97b/src/Cli/Microsoft.DotNet.Configurer/DotnetFirstTimeUseConfigurer.cs#L62 line of code is not inserted into 6.0.10 .NET SDK version. However, in that version of .NET SDK, NuGet introduced a migration logic that runs only once whenever a NuGet command is invoked. For example, dotnet restore, dotnet nuget add source. In .NET SDK version 7.0.102 NuGet's migration code is invoked when first time notice is printed to console https://github.com/dotnet/sdk/blob/98720693e228934405de152651b2e21390c3bfec/src/Cli/Microsoft.DotNet.Configurer/DotnetFirstTimeUseConfigurer.cs#L62.

I created an Ubuntu 22.04 WSL instance locally and unable to reproduce the issue.

  • Installed .NET 7.0.102 SDK
  • user1@:~$ dotnet new console
Welcome to .NET 7.0!
---------------------
SDK Version: 7.0.102
//rest of the first-time notice removed to keep this short.

The template "Console App" was created successfully.

Processing post-creation actions...
Restoring /home/user1/project1.csproj:
  Determining projects to restore...
  Restored /home/user1/project1.csproj (in 67 ms).
Restore succeeded.
  • Navigated to /home/user/.local/share/NuGet/Migrations and found that a file has been created that indicates that NuGet migrations logic executed successfully.
user1@<MACHINENAME>:~/.local/share/NuGet/Migrations$ ls -l
total 0
-rw-r--r-- 1 user1 user1 0 Jan 14 17:22 1

In dotnet/sdk#29662 (comment) you have mentioned that I can very clearly see that it is waiting for a mutex by tracing the syscalls of the dotnet binary.

Related to this, we fixed a logic error in the migrations code that is causing abandoned mutex exceptions in NuGet/NuGet.Client#4895. This fix has not been released yet and but is part of next servicing release. Can you please paste the call stacks here where you have identified NuGet is waiting for a mutex? That would be interesting because, NuGet waits maximum of 1 minute to obtain mutex. https://github.com/NuGet/NuGet.Client/blob/0966444b35d8a74fa25adf0368b983f389884377/src/NuGet.Core/NuGet.Common/Migrations/MigrationRunner.cs#L35

Can you use dotnet-dump to get a (managed?) memory dump of the process, so we can try to see where it got stuck?

@kartheekp-ms kartheekp-ms added Functionality:Restore Product:dotnet.exe WaitingForCustomer Applied when a NuGet triage person needs more info from the OP and removed Triage:Investigate labels Jan 15, 2023
@ZeroSkill1

This comment was marked as outdated.

@ghost ghost added WaitingForClientTeam Customer replied, needs attention from client team. Do not apply this label manually. and removed WaitingForCustomer Applied when a NuGet triage person needs more info from the OP labels Jan 15, 2023
@ZeroSkill1
Copy link
Author

ZeroSkill1 commented Jan 15, 2023

Rerunning the dotnet executable does indeed confirm that something is wrong on the NuGet side of things. What is mentioned in the comment above happens as a result of this logic issue I assume.

Call stack of dotnet -h running in "first use" mode:

> clrstack                                                                                                                                       
OS Thread Id: 0x1677 (0)
        Child SP               IP Call Site
00007FFFCE598E70 00007F66168A9883 System.Collections.Generic.HashSet`1[[System.__Canon, System.Private.CoreLib]].AddIfNotPresent(System.__Canon, Int32 ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/HashSet.cs @ 1157]
00007FFFCE598F00 00007F66168A95DB System.Collections.Generic.HashSet`1[[System.__Canon, System.Private.CoreLib]].Add(System.__Canon) [/_/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/HashSet.cs @ 453]
00007FFFCE598F10 00007F6617C4F811 NuGet.Common.Migrations.Migration1.<GetPathsToCheck>g__AddAllParentDirectoriesUpToHome|3_0(System.String, <>c__DisplayClass3_0 ByRef)
00007FFFCE598F30 00007F6617C4F27B NuGet.Common.Migrations.Migration1.GetPathsToCheck()
00007FFFCE598F60 00007F6617C4F0C1 NuGet.Common.Migrations.Migration1.Run()
00007FFFCE598F80 00007F6617C4F939 NuGet.Common.Migrations.MigrationRunner.Run()
00007FFFCE598FD0 00007F66173A1DF5 Microsoft.DotNet.Configurer.DotnetFirstTimeUseConfigurer.Configure() [/_/src/Cli/Microsoft.DotNet.Configurer/DotnetFirstTimeUseConfigurer.cs @ 62]
00007FFFCE599000 00007F6616A2E7F7 Microsoft.DotNet.Cli.Program.ConfigureDotNetForFirstTimeUse(Microsoft.DotNet.Configurer.IFirstTimeUseNoticeSentinel, Microsoft.DotNet.Configurer.IAspNetCertificateSentinel, Microsoft.DotNet.Configurer.IFileSentinel, Boolean, Microsoft.DotNet.Configurer.DotnetFirstRunConfiguration, Microsoft.DotNet.Cli.Utils.IEnvironmentProvider, System.Collections.Generic.Dictionary`2<System.String,Double>) [/_/src/Cli/dotnet/Program.cs @ 296]
00007FFFCE599080 00007F6616A2E2EB Microsoft.DotNet.Cli.Program.ProcessArgs(System.String[], System.TimeSpan, Microsoft.DotNet.Cli.Telemetry.ITelemetry) [/_/src/Cli/dotnet/Program.cs @ 194]
00007FFFCE5991B0 00007F6616A2DBB4 Microsoft.DotNet.Cli.Program.Main(System.String[]) [/_/src/Cli/dotnet/Program.cs @ 77]

@kartheekp-ms
Copy link
Contributor

kartheekp-ms commented Jan 17, 2023

Looking at the call stack posted in the above comment, may be the while loop never breaks (I am guessing at this point). https://github.com/NuGet/NuGet.Client/blob/1e764301523e488a5ca79bdedf79df66f0a3ffbd/src/NuGet.Core/NuGet.Common/Migrations/Migration1.cs#L90-L94

                while (parent != homePath)
                {
                    pathsToCheck.Add(parent);
                    parent = Path.GetDirectoryName(parent);
                }

Please follow steps and let us know your feedback to troubleshoot this issue.

  1. Create $HOME/.dotnet/7.0.102.dotnetFirstUseSentinel file using command touch $HOME/.dotnet/7.0.102.dotnetFirstUseSentinel. Reference restore/build/publish hanging on .NET > 6.0.401 dotnet/sdk#29662 (comment) comment.
  2. Clear all the files under $HOME/.local/shared/NuGet/Migrations directory if any.
  3. Invoke any NuGet command for e.g., dotnet nuget --version.
  4. Verify that a file with the name 1 is created under $HOME/.local/share/NuGet/Migrations directory.

Are you able to reproduce this issue (meaning indefinite wait) after step no. 3?

@ghost ghost added WaitingForCustomer Applied when a NuGet triage person needs more info from the OP and removed WaitingForClientTeam Customer replied, needs attention from client team. Do not apply this label manually. labels Jan 17, 2023
@kartheekp-ms
Copy link
Contributor

@ZeroSkill1 - adding to above comment, I created a new VM and executed the steps you mentioned in the issue description and noticed that restore succeeded. Would it be possible to run this test on another Ubuntu 22.04 machine? I just want to eliminate the possibility of any environment variables causing these issues.

@ZeroSkill1
Copy link
Author

ZeroSkill1 commented Jan 18, 2023

@kartheekp-ms After creating 7.0.102.dotnetFirstUseSentinel in $HOME/.dotnet, this is the dump when running dotnet nuget --version:

00007FFF479C6098 00007F4E0E515086 System.IO.Path.GetDirectoryName(System.String) [/_/src/libraries/System.Private.CoreLib/src/System/IO/Path.cs @ 134]
00007FFF479C60C0 00007F4E0E6FF81A NuGet.Common.Migrations.Migration1.<GetPathsToCheck>g__AddAllParentDirectoriesUpToHome|3_0(System.String, <>c__DisplayClass3_0 ByRef)
00007FFF479C60E0 00007F4E0E6FF27B NuGet.Common.Migrations.Migration1.GetPathsToCheck()
00007FFF479C6110 00007F4E0E6FF0C1 NuGet.Common.Migrations.Migration1.Run()
00007FFF479C6130 00007F4E0E6FF939 NuGet.Common.Migrations.MigrationRunner.Run()
00007FFF479C6180 00007F4E0E60367F NuGet.CommandLine.XPlat.Program.MainInternal(System.String[], NuGet.CommandLine.XPlat.CommandOutputLogger)
00007FFF479C61E0 00007F4E0E6035E8 NuGet.CommandLine.XPlat.Program.Main(System.String[])

Note: in $HOME/.local/share/NuGet/Migrations, no files are created and there were none before either.

Additionally, the restore does succeed on a fresh VM.

@ghost ghost added WaitingForClientTeam Customer replied, needs attention from client team. Do not apply this label manually. and removed WaitingForCustomer Applied when a NuGet triage person needs more info from the OP labels Jan 18, 2023
@ZeroSkill1
Copy link
Author

ZeroSkill1 commented Jan 18, 2023

After running a diff on the environment variables of the root user and the normal user, I found the following.

Simply append a forward slash to $HOME:

export HOME=/home/{username}/; # this will trigger the bug
rm $HOME/.local/share/NuGet/Migrations -r; # make sure it runs migrations
dotnet nuget --version; # this will hang

I did not notice this before, but in my VM this slash does not exist in the $HOME variable. This also matches the behavior seen when running dotnet as the root user; since the $HOME variable is initialized to /root.

It looks like it causes an infinite loop.

@kartheekp-ms kartheekp-ms removed the WaitingForClientTeam Customer replied, needs attention from client team. Do not apply this label manually. label Jan 19, 2023
@kartheekp-ms
Copy link
Contributor

kartheekp-ms commented Jan 20, 2023

This also matches the behavior seen when running dotnet as the root user; since the $HOME variable is initialized to /root

@ZeroSkill1 - I am able to reproduce the indefinite wait appending "/" to the HOME environment variable when logged in as a normal user. But I don't understand how you were able to reproduce this issue while logged in as root user or in other words can you please explain how you were able to reproduce this issue while logged in as root user?

EDIT - I am unable to reproduce this issue on a new VM logged as root user whose HOME path is set to "/root". That being said I am able to reproduce the indefinite wait when HOME is set to "/root/" for the root user.

root@{ROOT}-vm-01:~# rm -rf \$HOME/.dotnet/ *
root@{ROOT}-vm-01:~# rm  $HOME/.local/share/NuGet/Migrations/ -rf

root@{ROOT}-vm-01:~# echo $HOME
/root

root@{ROOT}-vm-01:~# dotnet -h
Usage: dotnet [runtime-options] [path-to-application] [arguments]

//removed other output to keep this simple

Run 'dotnet [command] --help' for more information on a command.

root@{ROOT}-vm-01:~# dotnet nuget --version
NuGet Command Line
6.4.0.117

root@{ROOT}-vm-01:~# cd $HOME/.local/share/NuGet/Migrations/
root@{ROOT}-vm-01:~/.local/share/NuGet/Migrations# ls -l
total 0
-rw-r--r-- 1 root root 0 Jan 20 03:31 1

root@{ROOT}-vm-01:~/.local/share/NuGet/Migrations# cd $HOME
root@{ROOT}-vm-01:~#

@ZeroSkill1
Copy link
Author

I meant that the issue does not affect the root user.

@aortiz-msft aortiz-msft added the Priority:2 Issues for the current backlog. label Jan 30, 2023
@steveisok
Copy link

We are experiencing the same thing during certain CI runs within dotnet/runtime.

dotnet/runtime#80619

Microsoft.DotNet.XUnitConsoleRunner v2.5.0 (64-bit .NET 7.0.2)
  Discovering: JIT.HardwareIntrinsics.XUnitWrapper (method display = ClassAndMethod, method display options = None)
  Discovered:  JIT.HardwareIntrinsics.XUnitWrapper (found 6 of 362 test cases)
  Starting:    JIT.HardwareIntrinsics.XUnitWrapper (parallel test collections = on, max threads = 2)
    JIT/HardwareIntrinsics/Arm/ArmBase.Arm64/ArmBase.Arm64_ro/ArmBase.Arm64_ro.sh [FAIL]
      System.IO.IOException: The system cannot open the device or file specified. : 'NuGet-Migrations'
         at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
         at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name)
         at NuGet.Common.Migrations.MigrationRunner.Run()
         at Microsoft.DotNet.Configurer.DotnetFirstTimeUseConfigurer.Configure()
         at Microsoft.DotNet.Cli.Program.ConfigureDotNetForFirstTimeUse(IFirstTimeUseNoticeSentinel firstTimeUseNoticeSentinel, IAspNetCertificateSentinel aspNetCertificateSentinel, IFileSentinel toolPathSentinel, Boolean isDotnetBeingInvokedFromNativeInstaller, DotnetFirstRunConfiguration dotnetFirstRunConfiguration, IEnvironmentProvider environmentProvider, Dictionary`2 performanceMeasurements)
         at Microsoft.DotNet.Cli.Program.ProcessArgs(String[] args, TimeSpan startupTime, ITelemetry telemetryClient)
         at Microsoft.DotNet.Cli.Program.Main(String[] args)
      
      Return code:      1

@kartheekp-ms, would it be possible to use a global mutex instead?

@kartheekp-ms
Copy link
Contributor

@steveisok - This root cause of this issue is presence of trailing forward slash to the $HOME environment variable. AFAIK this issue is different from dotnet/runtime#80619. I added a comment dotnet/runtime#80619 (comment) explaining why we choose local mutex over global mutex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants