CI: make 700-play parallel-safe #23998
base: main
Conversation
(where possible. Not all tests are parallelizable). And, refactor two complicated tests into one. This one is hard to review, sorry. Signed-off-by: Ed Santiago <santiago@redhat.com>
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: edsantiago. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
# GAH! Save ten seconds, but in a horrible way.
# - 'kube down' does not have a -t0 option.
# - Using 'top' in the container, instead of 'sleep 100', results
#   in very weird failures. Seriously weird.
# - 'stop -t0', every once in a while on parallel runs on my
#   laptop (never yet in CI), barfs with 'container is running or
#   paused, refusing to clean up, container state improper'
# Here's hoping that this will silence the flakes.
run_podman '?' stop -t0 $ctrName
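For readers unfamiliar with the helper: the '?' argument to run_podman means "any exit status is acceptable". A minimal sketch of that idiom in plain bash, not podman's actual helper; 'flaky_stop' is a hypothetical stand-in for 'podman stop -t0':

```shell
#!/usr/bin/env bash
# Sketch of tolerating a command's nonzero exit under 'set -e',
# similar in spirit to run_podman '?'.
set -euo pipefail

attempts=0
flaky_stop() {
    # Hypothetical stand-in for 'podman stop -t0': fails the first
    # time with a "state improper"-style error, succeeds afterwards.
    attempts=$((attempts + 1))
    if (( attempts == 1 )); then
        echo "Error: refusing to clean up: container state improper" >&2
        return 125
    fi
    echo "stopped"
}

# The '|| rc=$?' idiom records the exit status without aborting the
# script, which is what the '?' exit-code spec permits the test to do.
rc=0
flaky_stop 2>/dev/null || rc=$?
echo "first attempt: rc=$rc"

rc=0
flaky_stop || rc=$?
echo "second attempt: rc=$rc"
```

The point of '?' in the test is exactly this: record and ignore the status rather than fail the test on a transient stop error.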
Do you have a log for this? What exactly is not working with top?
I have lots of logs for "container state improper"; is there anything useful that can be gleaned from them?
As for 'top': no, I can't find those logs any more. I will try to reproduce.
I mean, podman stop should be idempotent, so it should never error with 'container state improper' here; this is likely something that needs to be fixed in podman.
Using 'top -b', the healthy test passes but unhealthy does not:
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#| FAIL: Container never goes unhealthy
#| expected: !~ -unhealthy
#| actual: 1-starting 2-starting 3-starting 4-starting 5-unhealthy 6-starting 7-starting 8-starting 9-starting 10-starting 11-starting 12-healthy
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I guess my "weird" comment reflects my bafflement as to how the container can ever go healthy, and why 'top' is different from 'sleep'.
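For context on those "N-status" strings: they read like the output of a polling loop of roughly this shape. This is a hypothetical sketch; 'get_health_status' stands in for 'podman inspect --format {{.State.Health.Status}}', and the canned status sequence is invented for illustration:

```shell
#!/usr/bin/env bash
# Sketch of the "N-status" polling pattern seen in the failure output.
# get_health_status is a stand-in for:
#   podman inspect $ctr --format '{{.State.Health.Status}}'
set -euo pipefail

statuses=(starting starting unhealthy)   # canned sequence, for illustration
i=0
get_health_status() {
    echo "${statuses[$i]}"
}

seen=""
for t in $(seq 1 "${#statuses[@]}"); do
    s=$(get_health_status)
    i=$((i + 1))
    seen+="${t}-${s} "               # accumulate "N-status" history
    if [[ "$s" == "unhealthy" ]]; then
        break                        # got the state we were waiting for
    fi
    sleep 0.1                        # real test pauses between inspects
done
echo "$seen"
```

With this reading, each "N-status" token is one inspect sample, which is why a "5-unhealthy" followed by "6-starting" in the earlier log is so surprising.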
Isn't it the inverse in this check, "Container never goes $dontwant"? So the healthy case failed?
The log doesn't make much sense: 5-unhealthy, for it then to go back to starting; this seems like something wrong with health checks.
But yeah, sleep vs top: no idea...
nope:
#| FAIL: Container got to 'unhealthy'
#| expected: =~ -unhealthy\$
#| actual: 1-starting 2-starting 3-starting 4-starting 5-starting 6-starting 7-starting 8-starting 9-starting 10-starting 11-starting 12-starting 13-starting 14-starting 15-starting 16-starting 17-starting 18-starting 19-starting 20-starting 21-starting 22-starting 23-starting 24-starting
Giving up. I'm just monkeying without actually learning anything useful.
How easy is this to reproduce? I can try to instrument some podman code to see if I can figure out what is going on.
"this" being the top
issue? About 50% reproducible. s/sleep 100/top/
and run hack/bats --rootless --tags ci:parallel
. If you mean the stop ... state improper
thing, I have not seen it yet this morning.
Oh: I've never seen the top
thing in CI, only on my 12-core laptop
Here's a log of the 'stop' issue, from Sep 5:
...
# [11:57:04.641519074] # bin/podman kube play /tmp/podman_bats.NjbFo3/play_kube_unhealthy_7iPZ8Q.yaml
# [11:57:05.565840635] Pod:
# 5f7428be75f02b786e801f19c9bb494011c8e12d70a6ae29e809f3463a6fe16f
# Container:
# 8ae0ccceac07f06a879da59b02771f7699abf390003b440c18f98947678dbee7
...
# [11:57:09.139196762] # bin/podman inspect liveness-exec-t383-vn2b6p1n-unhealthy-liveness-ctr-t383-vn2b6p1n-unhealthy --format 5-{{.State.Health.Status}}
# [11:57:09.296619275] 5-unhealthy
#
# [11:57:09.322312911] # bin/podman stop -t0 liveness-exec-t383-vn2b6p1n-unhealthy-liveness-ctr-t383-vn2b6p1n-unhealthy
# [11:57:10.034340238] Error: container 8ae0ccceac07f06a879da59b02771f7699abf390003b440c18f98947678dbee7 is running or paused, refusing to clean up: container state improper
...and, this just in: it finally triggered on a local run. No need to post the log, because it looks exactly the same as above modulo container names and SHAs. Anyhow, the point is, podman stop is barfing in the way you say should not happen. But it's rare.
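For a rare transient failure like this, a test-suite workaround (not a podman fix) could also be a bounded retry instead of ignoring the status outright. A hypothetical sketch; 'do_stop' stands in for 'podman stop -t0 $ctrName' and its fail-once behavior is simulated:

```shell
#!/usr/bin/env bash
set -euo pipefail

calls=0
do_stop() {
    # Stand-in for 'podman stop -t0 $ctrName'; fails once, then
    # succeeds, to simulate the rare "container state improper" error.
    calls=$((calls + 1))
    if (( calls < 2 )); then
        echo "container state improper" >&2
        return 125
    fi
    return 0
}

stop_with_retry() {
    local tries rc
    for tries in 1 2 3; do
        rc=0
        do_stop || rc=$?
        (( rc == 0 )) && return 0
        sleep 0.1                    # brief pause before retrying
    done
    return "$rc"                     # give up after three attempts
}

stop_with_retry
echo "stopped after $calls call(s)"
```

The tradeoff versus run_podman '?': a retry would still surface a persistent stop failure, whereas '?' silences it entirely.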