
Move more pipelines to scheduled rather than rolling #60099

Merged: 2 commits merged into dotnet:main from the MoveMorePipelinesToScheduled branch on Oct 7, 2021

Conversation

@safern (Member) commented Oct 6, 2021:

Follow-up to #59884.

This sets the linker-tests, runtime-staging, and coreclr outerloop pipelines to scheduled triggers instead of rolling.
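
For context, here is a minimal sketch of what moving a pipeline from a rolling (CI) trigger to a scheduled trigger looks like in Azure Pipelines YAML; the cron expression, display name, branch filter, and the pr: none line are illustrative placeholders, not the actual values from the pipelines touched in this PR:

  # Hypothetical sketch, not the real diff: disable the rolling trigger and add a schedule.
  trigger: none            # no rolling builds on push
  pr: none                 # assumption: these outerloop-style pipelines also skip PR triggers
  schedules:
  - cron: "0 8 * * *"      # placeholder: once a day at 08:00 UTC
    displayName: Daily scheduled run
    branches:
      include:
      - main
    always: true           # run on every scheduled tick, even with no new commits

With always set to true the pipeline runs on each scheduled trigger regardless of changes; with the default of false it runs only when the included branches have new commits since the last successful scheduled run.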

@dotnet-issue-labeler commented:
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

Review thread on eng/pipelines/coreclr/ci.yml (outdated, resolved)
Review thread on eng/pipelines/runtime-linker-tests.yml (resolved)
@danmoseley (Member) left a review comment:

LGTM (if ok with @BruceForstall)

@ghost commented Oct 7, 2021:

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

Issue Details

Follow-up to #59884.

This sets the linker-tests, runtime-staging, and coreclr outerloop pipelines to scheduled triggers instead of rolling.

Author: safern
Assignees: -
Labels: area-Infrastructure
Milestone: -

@danmoseley (Member) commented:

The failure is #60119.

@danmoseley (Member) commented:

@steveisok the build stage of 'Build iOS arm Release AllSubsets_Mono' was still running after 2 hours. Looking at the log, it just seems generally super slow, but there were some particularly slow parts, e.g.:

2021-10-07T00:12:46.9144650Z   [ 98%] Building C object mono/mini/CMakeFiles/monosgen-objects.dir/__/__/version.c.o
2021-10-07T00:30:28.7707950Z   [ 98%] Built target monosgen-objects

^^ this took 45 mins.

@safern merged commit 82cf865 into dotnet:main on Oct 7, 2021.
@safern deleted the MoveMorePipelinesToScheduled branch on October 7, 2021 at 20:50.
@steveisok (Member) commented, quoting @danmoseley above:

@steveisok the build stage of 'Build iOS arm Release AllSubsets_Mono' was still running after 2 hours. Looking at the log, it just seems generally super slow, but there were some particularly slow parts, e.g.:

2021-10-07T00:12:46.9144650Z   [ 98%] Building C object mono/mini/CMakeFiles/monosgen-objects.dir/__/__/version.c.o
2021-10-07T00:30:28.7707950Z   [ 98%] Built target monosgen-objects

^^ this took 45 mins.

I suspect when we get a slow mac, all parts become super slow. I don't believe there's anything that part of the build is doing to cause such a slowdown.

@akoeplinger @directhex what do you guys think?

@akoeplinger (Member) commented:

Yeah, that looks like another instance of the "slow mac" issue.

@danmoseley (Member) commented:

@MattGal just curious, where are we currently with the "slow mac" issue? I recall core-eng was gathering data.

@MattGal (Member) commented Oct 8, 2021, quoting @danmoseley above:

@MattGal just curious, where are we currently with the "slow mac" issue? I recall core-eng was gathering data.

Investigation notes:

  • We did lots of experiments, rigging up 10 different runtime macOS builds and running them in a special pool, over and over, to look for what was happening.
  • It became clear that allowing two 3-core VMs on the same 6-core host to each use 100% CPU bogged down the disk and CPU enough to reliably trigger this symptom.
  • Using VMware settings, we found that limiting the max CPU % the VMs can use on the host significantly reduced the variance for obvious, measurable things like the clone/build steps. In experiments with 85%, 90%, and 95% caps, 90% seemed to win (not a giant sample size, but any limit under 100% helped).

The rollout is ongoing, and looking at that build, it definitely smells like an instance of this problem:

  • After checking with the hosted macOS team, by the end of today we should be at 40% conversion, and the plan is to stay there for some time to gauge the impact (i.e., make sure the thing we think we're fixing is getting fixed).
  • I verified with my counterpart that indeed the pool where your slow build ran has not been updated yet (whew).

@danmoseley (Member) commented:

That's great; thanks @MattGal for all that work! Just curious: do we now have better telemetry or another data source to detect unusually slow machines (not sure how this would be defined or detected, given that workloads vary)? IIRC that made it harder to investigate initially.

@MattGal (Member) commented Oct 8, 2021:

Machine telemetry continues to be limited to anything that happened in the last very short period of time (something like hours, not days), and of course, since the machines run whatever we send to them, knowing when to alert on perf is nigh impossible. The MMS team is aware of this and wants to improve it, and is improving its ability to investigate, after the fact, machines that customers identify as problematic... but I don't think there are any public issues we can follow to track it.

@danmoseley (Member) commented:

Yeah, all I can think of is "% of times the build times out", since clearly it's not supposed to. Of course, sometimes that's the compiler hanging or whatnot.

Anyway, sounds good.

@MattGal (Member) commented Oct 8, 2021, quoting @danmoseley above:

Yeah, all I can think of is "% of times the build times out", since clearly it's not supposed to. Of course, sometimes that's the compiler hanging or whatnot.

Anyway, sounds good.

I think we're on the same page here, but that metric can't work for lots of reasons. For our stuff (though I assume other companies have similar problems), your builds depend on something external (in this case, on-prem Helix test machines) and may be timing out or just taking longer than normal because of how many other runs are going on, an outage in our services, a regression in a test causing all work to hang, AzDO package feeds being extremely slow/tar-pitted, or many other reasons.

As such, it took a lot of twiddling to even get to the stage where it was clear that the "noisy neighbor" issue, and not one of these other causes, was the problem. I do think, though, that if they could track specific tasks on these pools (say, how long the same repo takes to clone), it could catch this, since a machine with a noisy neighbor often has 50+% longer clone times; but that data isn't in the same place as the machines' telemetry.

@ghost locked the conversation as resolved and limited it to collaborators on Nov 8, 2021.
9 participants