Move more pipelines to scheduled rather than rolling #60099
Conversation
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.
LGTM (if ok with @BruceForstall)
Tagging subscribers to this area: @dotnet/runtime-infrastructure

Issue Details
Follow-up to #59884. Setting scheduled triggers for the linker-tests, runtime-staging, and coreclr outerloop pipelines.
The failure is #60119.
@steveisok the build stage of 'Build iOS arm Release AllSubsets_Mono' was still running after 2 hours. Looking at the log, it just seems generally super slow, but there were some particularly slow parts, e.g.:
^^ this took 45 minutes.
I suspect that when we get a slow mac, all parts become super slow. I don't believe there's anything that part of the build is doing to cause such a slowdown. @akoeplinger @directhex what do you guys think?
Yeah, that looks like another instance of the "slow mac" issue.
@MattGal just curious, where are we currently with the "slow mac" issue? I recall core-eng was gathering data.
Investigation notes:
The rollout is ongoing; looking at that build, it definitely smells like an instance of this problem:
That's great - thanks @MattGal for all that work! Just curious, do we now have better telemetry or another data source to detect unusually slow machines? (Not sure how this would be defined/detected, given that workloads vary.) IIRC that made it harder to investigate initially.
Machine telemetry continues to be limited for anything that didn't happen in the last very short period of time (something like hours, not days), and of course, since the machines run whatever we send them, knowing when to alert on performance is nigh impossible. The MMS team is aware of this and wants to improve it, and is improving its ability to investigate, after the fact, machines that customers identify as problematic... but I don't think there are any public issues we can follow to track it.
Yeah, all I can think of is "% of times the build times out", since clearly it's not supposed to. Of course, sometimes that's the compiler hanging or whatnot. Anyway, sounds good.
I think we're on the same page here, but that metric can't work for lots of reasons. For our stuff (though I assume other companies have similar problems), your builds depend on something external (in this case, on-prem Helix test machines) and may be timing out or just taking longer than normal because of how many other runs are going, an outage in our services, a regression in a test causing all work to hang, AzDO package feeds being extremely slow/tar-pitted, or many other reasons. As such, it took a lot of twiddling to even get to the stage where it was clear the "noisy neighbor" issue was causing the problem and not one of these other reasons. I do think, though, that if they could track specific tasks on these pools (say, how long the same repo takes to clone), it could catch this, since a machine with a noisy neighbor often has 50+% longer clone times; but this data isn't in the same place as the machines' telemetry.
Follow-up to #59884.
Setting scheduled triggers for the linker-tests, runtime-staging, and coreclr outerloop pipelines.
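For context, below is a minimal sketch of what a scheduled (cron-based) trigger looks like in Azure Pipelines YAML. This is generic syntax, not the exact change in this PR; the dotnet/runtime pipelines are composed from shared templates under eng/pipelines, and the cron expression and branch names here are hypothetical example values.

```yaml
# Hypothetical sketch of an Azure Pipelines scheduled trigger.
# The real change in this PR goes through the repo's pipeline templates,
# so the actual YAML may look different.
schedules:
- cron: "0 8 * * *"           # example: run daily at 08:00 UTC
  displayName: Daily scheduled run
  branches:
    include:
    - main
  always: true                # run even if there are no new changes

# With a schedule in place, the per-push (rolling) CI trigger can be disabled:
trigger: none
```

The practical effect is that these pipelines stop running on every merge and instead run on a fixed cadence, which is the "scheduled rather than rolling" behavior this PR is after.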