Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CI failures #938

Closed
wants to merge 6 commits into from
Closed

Conversation

inancgumus
Copy link
Member

@inancgumus inancgumus commented Jun 16, 2023

The solution in this PR still needs to be better, but it's better than the current workflows. When we run it multiple times, tests eventually pass. Feel free to push commits to this PR, if needed.

PS: While testing this PR, GH has automatically closed the PR multiple times 😄

@inancgumus inancgumus closed this Jun 16, 2023
@inancgumus inancgumus force-pushed the fix/ci-flakiness-reduce-parallelism branch from 42ffe3e to 6b21ec7 Compare June 16, 2023 05:44
@inancgumus inancgumus reopened this Jun 16, 2023
@inancgumus inancgumus force-pushed the fix/ci-flakiness-reduce-parallelism branch from 3aeba0b to 1a54d3d Compare June 16, 2023 05:49
@inancgumus inancgumus changed the title Reduce parallelism on CI to reduce flakiness CI flakiness reduce experiments Jun 16, 2023
@inancgumus inancgumus force-pushed the fix/ci-flakiness-reduce-parallelism branch from 1a54d3d to 97dc7d8 Compare June 20, 2023 11:20
@inancgumus inancgumus closed this Jun 20, 2023
@inancgumus inancgumus force-pushed the fix/ci-flakiness-reduce-parallelism branch from 97dc7d8 to 92ac6b7 Compare June 20, 2023 13:01
@inancgumus inancgumus reopened this Jun 20, 2023
@inancgumus inancgumus force-pushed the fix/ci-flakiness-reduce-parallelism branch 3 times, most recently from 7598c0b to 9e28ccc Compare June 20, 2023 13:56
@inancgumus inancgumus changed the title CI flakiness reduce experiments Fix CI failures Jun 20, 2023
@inancgumus inancgumus force-pushed the fix/ci-flakiness-reduce-parallelism branch 2 times, most recently from 94957b5 to ae6fbe7 Compare June 20, 2023 15:52
@inancgumus inancgumus requested review from ankur22 and ka3de June 20, 2023 16:11
@inancgumus inancgumus self-assigned this Jun 20, 2023
@inancgumus inancgumus added the ci label Jun 20, 2023
@inancgumus inancgumus marked this pull request as ready for review June 20, 2023 16:11
Copy link
Collaborator

@ka3de ka3de left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think the changes make sense in any case, although I'm still a little bit confused as to why CI tests started to fail, which makes me doubt whether it is due to a change in GH or a change that we introduced 🤔

@@ -217,6 +217,9 @@ func TestBrowserLogIterationID(t *testing.T) {
}

func TestMultiBrowserPanic(t *testing.T) {
// TODO: This test never works on CI, fix it.
t.Skip("skipping, fix this")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should use an ENV var, set by us in CI, as a flag in order to skip this test or not. In that way at least we will always run this test when testing locally. Do you think that is worth it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense; I was thinking about the same, too. Especially for the disable-gpu one. We can do it whether we're running on Github by checking the GITHUB_ACTIONS environment variable.

chromium/browser_type.go Show resolved Hide resolved
Copy link
Collaborator

@ankur22 ankur22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think it would be good to get some clarification on the changes so that we're aligned 🙂

Thanks for tackling this issue!!

@@ -21,6 +21,19 @@ jobs:
platform: [ubuntu-latest]
runs-on: ${{ matrix.platform }}
steps:
- name: Setup Chrome
Copy link
Collaborator

@ankur22 ankur22 Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting! So we were relying on Chrome already being present in the CI test VMs/Containers up until we started seeing these issues?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's interesting, indeed. It didn't work when I removed it.

export GOMAXPROCS=1
fi
# Run with less concurrency to reduce CI flakiness.
export GOMAXPROCS=1
Copy link
Collaborator

@ankur22 ankur22 Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this change. I think it makes sense since when we see a flakey tests we don't have the resources to resolve them straight away, and they can be quite demanding tests to fix. At the same time, i think we've found some issues in CI that we don't see in our local test runs, which are generally race conditions or deadlocks. I'll go with the consensus on this one 🙂

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I wonder if we should have one set of tests specifically run with higher concurrency, or maybe just leave the current setup and expect it to pass in any case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the one that makes it consistently (eh, to some degree) pass. Otherwise, it constantly fails.

Copy link
Member Author

@inancgumus inancgumus Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I'm testing this PR with several commits (in and out), and adding/removing some/all of the tests, I noticed, the nature of the tests doesn't matter for the outcome of the test runs. When there is concurrency, there is a failure here.

@@ -55,7 +55,7 @@ jobs:
export GOMAXPROCS=1
args=("-p" "1" "-race")
export K6_BROWSER_HEADLESS=true
go test "${args[@]}" -timeout 5m ./...
go test "${args[@]}" -timeout 10m ./...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need such a large timeout?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the concurrency is 1 now, and it was 2 before, I thought it would make sense to double the timeout. When I tested it, tests took about 2x longer.

@ankur22
Copy link
Collaborator

ankur22 commented Jun 21, 2023

which makes me doubt whether it is due to a change in GH or a change that we introduced

We're seeing this CI issue with our tagged releases too. I've tested v0.10.0 and v0.9.0 😞

@inancgumus
Copy link
Member Author

inancgumus commented Jun 21, 2023

Overall I think the changes make sense in any case, although I'm still a little bit confused as to why CI tests started to fail, which makes me doubt whether it is due to a change in GH or a change that we introduced 🤔
LGTM. I think it would be good to get some clarification on the changes so that we're aligned 🙂
Thanks for tackling this issue!!

Sure :)

TBH, I'm as confused as you are :( I believe this issue is caused by GH infrastructure changes and not because of our changes. As I mentioned in my previous edits in the PR description, I've tried with earlier merges. Then I even tried back to v0.8.0. Nothing has changed.

This PR is totally created by experimentation. So I don't have the best answers to your questions :( TLDR: I don't even know myself what/why I'm doing there 😄 I've tried to answer your questions in the comments above.

@inancgumus
Copy link
Member Author

inancgumus commented Jun 21, 2023

Although this one is a fix that demands no cost, #947 seems like a more elegant fix. Closing this one. We can reopen it if needed.

Update:
Out of curiosity, I've tried what happens here with this change. Inspired by @ankur22's ubuntu-pro platform idea in #947.

It worked brilliantly with the addition of ubuntu-pro ($). Otherwise, we need to retry one or two times ($=free).

Even with a concurrency of one, tests finish in ~5m.

Note that workers (CI tasks) are still concurrent. (edited)

@inancgumus inancgumus closed this Jun 21, 2023
I'm just curious to see what happens :-)
@inancgumus inancgumus reopened this Jun 23, 2023
@inancgumus inancgumus closed this Jun 23, 2023
@inancgumus inancgumus deleted the fix/ci-flakiness-reduce-parallelism branch May 7, 2024 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants