
Move more pipelines to scheduled rather than rolling #60099

Merged: 2 commits merged into dotnet:main from the MoveMorePipelinesToScheduled branch on Oct 7, 2021

Conversation

@safern (Member) commented Oct 6, 2021:

Follow-up to #59884.

This sets the linker-tests, runtime-staging, and coreclr outerloop pipelines to scheduled triggers instead of rolling.
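
For context, here is a minimal sketch of what moving a pipeline from a rolling (CI) trigger to a scheduled trigger looks like in Azure Pipelines YAML; the cron expression, display name, branch filter, and the pr: none line are illustrative placeholders, not the actual values from the pipelines touched in this PR:

  # Hypothetical sketch, not the real diff: disable the rolling trigger and add a schedule.
  trigger: none            # no rolling builds on push
  pr: none                 # assumption: these outerloop-style pipelines also skip PR triggers
  schedules:
  - cron: "0 8 * * *"      # placeholder: once a day at 08:00 UTC
    displayName: Daily scheduled run
    branches:
      include:
      - main
    always: true           # run on every scheduled tick, even with no new commits

With always set to true the pipeline runs on each scheduled trigger regardless of changes; with the default of false it runs only when the included branches have new commits since the last successful scheduled run.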

@dotnet-issue-labeler commented:
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

Review thread on eng/pipelines/coreclr/ci.yml (outdated, resolved)
Review thread on eng/pipelines/runtime-linker-tests.yml (resolved)
@danmoseley (Member) left a review comment:

LGTM (if ok with @BruceForstall)

@ghost commented Oct 7, 2021:

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

Issue Details

Follow-up to #59884.

This sets the linker-tests, runtime-staging, and coreclr outerloop pipelines to scheduled triggers instead of rolling.

Author: safern
Assignees: -
Labels: area-Infrastructure
Milestone: -

@danmoseley (Member) commented:

The failure is #60119.

@danmoseley (Member) commented:

@steveisok the build stage of 'Build iOS arm Release AllSubsets_Mono' was still running after 2 hours. Looking at the log, it just seems generally super slow, but there were some particularly slow parts, e.g.:

2021-10-07T00:12:46.9144650Z   [ 98%] Building C object mono/mini/CMakeFiles/monosgen-objects.dir/__/__/version.c.o
2021-10-07T00:30:28.7707950Z   [ 98%] Built target monosgen-objects

^^ this took 45 mins.

@safern merged commit 82cf865 into dotnet:main on Oct 7, 2021.
@safern deleted the MoveMorePipelinesToScheduled branch on October 7, 2021 at 20:50.
@steveisok (Member) commented, quoting @danmoseley above:

@steveisok the build stage of 'Build iOS arm Release AllSubsets_Mono' was still running after 2 hours. Looking at the log, it just seems generally super slow, but there were some particularly slow parts, e.g.:

2021-10-07T00:12:46.9144650Z   [ 98%] Building C object mono/mini/CMakeFiles/monosgen-objects.dir/__/__/version.c.o
2021-10-07T00:30:28.7707950Z   [ 98%] Built target monosgen-objects

^^ this took 45 mins.

I suspect when we get a slow mac, all parts become super slow. I don't believe there's anything that part of the build is doing to cause such a slowdown.

@akoeplinger @directhex what do you guys think?

@akoeplinger (Member) commented:

Yeah, that looks like another instance of the "slow mac" issue.

@danmoseley (Member) commented:

@MattGal just curious, where are we currently with the "slow mac" issue? I recall core-eng was gathering data.

@MattGal (Member) commented Oct 8, 2021, quoting @danmoseley above:

@MattGal just curious, where are we currently with the "slow mac" issue? I recall core-eng was gathering data.

Investigation notes:

  • We did lots of experiments, rigging up 10 different runtime macOS builds and running them in a special pool, over and over, to look for what was happening.
  • It became clear that allowing two 3-core VMs on the same 6-core host to each use 100% CPU bogged down the disk and CPU enough to reliably trigger this symptom.
  • Using VMware settings, we found that limiting the max CPU % the VMs can use on the host significantly reduced the variance for obvious, measurable things like the clone/build steps. In experiments with 85%, 90%, and 95% caps, 90% seemed to win (not a giant sample size, but any limit under 100% helped).

The rollout is ongoing, and looking at that build, it definitely smells like an instance of this problem:

  • After checking with the hosted macOS team, by the end of today we should be at 40% conversion, and the plan is to stay there for some time to gauge the impact (i.e., make sure the thing we think we're fixing is getting fixed).
  • I verified with my counterpart that indeed the pool where your slow build ran has not been updated yet (whew).

@danmoseley (Member) commented:

That's great; thanks @MattGal for all that work! Just curious: do we now have better telemetry or another data source to detect unusually slow machines (not sure how this would be defined or detected, given that workloads vary)? IIRC that made it harder to investigate initially.

@MattGal (Member) commented Oct 8, 2021:

Machine telemetry continues to be limited to anything that happened in the last very short period of time (something like hours, not days), and of course, since the machines run whatever we send to them, knowing when to alert on perf is nigh impossible. The MMS team is aware of this and wants to improve it, and is improving its ability to investigate, after the fact, machines that customers identify as problematic... but I don't think there are any public issues we can follow to track it.

@danmoseley (Member) commented:

Yeah, all I can think of is "% of times the build times out", since clearly it's not supposed to. Of course, sometimes that's the compiler hanging or whatnot.

Anyway, sounds good.

@MattGal (Member) commented Oct 8, 2021, quoting @danmoseley above:

Yeah, all I can think of is "% of times the build times out", since clearly it's not supposed to. Of course, sometimes that's the compiler hanging or whatnot.

Anyway, sounds good.

I think we're on the same page here, but that metric can't work for lots of reasons. For our stuff (though I assume other companies have similar problems), your builds depend on something external (in this case, on-prem Helix test machines) and may be timing out or just taking longer than normal because of how many other runs are going on, an outage in our services, a regression in a test causing all work to hang, AzDO package feeds being extremely slow/tar-pitted, or many other reasons.

As such, it took a lot of twiddling to even get to the stage where it was clear that the "noisy neighbor" issue, and not one of these other causes, was the problem. I do think, though, that if they could track specific tasks on these pools (say, how long the same repo takes to clone), it could catch this, since a machine with a noisy neighbor often has 50+% longer clone times; but that data isn't in the same place as the machines' telemetry.

@ghost locked the conversation as resolved and limited it to collaborators on Nov 8, 2021.
9 participants