
CI: make 700-play parallel-safe #23998

Open
wants to merge 1 commit into main
Conversation

edsantiago
Member

(where possible. Not all tests are parallelizable).

And, refactor two complicated tests into one. This one
is hard to review, sorry.

Signed-off-by: Ed Santiago <santiago@redhat.com>

Contributor

openshift-ci bot commented Sep 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label on Sep 18, 2024
Comment on lines +940 to +948
# GAH! Save ten seconds, but in a horrible way.
# - 'kube down' does not have a -t0 option.
# - Using 'top' in the container, instead of 'sleep 100', results
# in very weird failures. Seriously weird.
# - 'stop -t0', every once in a while on parallel runs on my
# laptop (never yet in CI), barfs with 'container is running or
# paused, refusing to clean up, container state improper'
# Here's hoping that this will silence the flakes.
run_podman '?' stop -t0 $ctrName
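(For context: the '?' argument to run_podman means any exit status from the command is accepted here, which is what papers over the flake.)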
Member

Do you have a log for this? What exactly is not working with top?

Member Author

I have lots of logs for "container state improper", is there anything useful that can be gleaned from them?

top: no, I can't find those logs any more. I will try to reproduce.

Member

I mean, podman stop should be idempotent, so it should never error with "container state improper" here; this is likely something that needs to be fixed in podman.
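(For reference, a minimal sketch of what "idempotent" means here; the container name and image are made up, this is not from the test suite:)

    # hypothetical illustration of the expectation
    podman run -d --name demo alpine sleep 100
    podman stop -t0 demo   # container stops, exit status 0
    podman stop -t0 demo   # already stopped: should also exit 0, never "container state improper"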

Member Author

Using top -b, the healthy test passes but unhealthy does not:

#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #|     FAIL: Container never goes unhealthy
   #| expected: !~ -unhealthy
   #|   actual:     1-starting 2-starting 3-starting 4-starting 5-unhealthy 6-starting 7-starting 8-starting 9-starting 10-starting 11-starting 12-healthy
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I guess my "weird" comment reflects my bafflement as to how the container can ever go healthy, and why 'top' behaves differently from 'sleep'.
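(For anyone following along, the status string above is built by polling inspect in a loop, roughly like this; a paraphrase of the test, not the exact code:)

    # rough paraphrase of the health-status polling in the test (not verbatim)
    statuses=""
    for i in $(seq 1 12); do
        run_podman inspect $ctrName --format "$i-{{.State.Health.Status}}"
        statuses+="$output "
        sleep 1
    done
    # the test then asserts $statuses against the expected -healthy / -unhealthy pattern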

Member

Isn't it the inverse in this check, "Container never goes $dontwant"? So the healthy case failed?
The log doesn't make much sense: 5-unhealthy, only to go back to starting afterwards; this seems like something wrong with health checks.
But yeah, sleep vs top, no idea...

Member Author

nope:

#|     FAIL: Container got to 'unhealthy'
#| expected: =~ -unhealthy\$
#|   actual:     1-starting 2-starting 3-starting 4-starting 5-starting 6-starting 7-starting 8-starting 9-starting 10-starting 11-starting 12-starting 13-starting 14-starting 15-starting 16-starting 17-starting 18-starting 19-starting 20-starting 21-starting 22-starting 23-starting 24-starting

Giving up. I'm just monkeying without actually learning anything useful.

Member

How easy is this to reproduce? I can try to instrument some podman code to see if I can figure out what is going on.

Member Author

"this" being the top issue? About 50% reproducible. s/sleep 100/top/ and run hack/bats --rootless --tags ci:parallel. If you mean the stop ... state improper thing, I have not seen it yet this morning.

Oh: I've never seen the top thing in CI, only on my 12-core laptop.
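(Concretely, the reproduction is roughly this; the bats file path is from memory and may be slightly off:)

    # swap the container command in the kube-play health-check test (path assumed)
    sed -i 's/sleep 100/top/' test/system/700-play.bats
    # then run the parallel-tagged system tests rootless
    hack/bats --rootless --tags ci:parallel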

Member Author

Here's a log of the stop issue, from Sep 5:

...
# [11:57:04.641519074] # bin/podman kube play /tmp/podman_bats.NjbFo3/play_kube_unhealthy_7iPZ8Q.yaml
# [11:57:05.565840635] Pod:
# 5f7428be75f02b786e801f19c9bb494011c8e12d70a6ae29e809f3463a6fe16f
# Container:
# 8ae0ccceac07f06a879da59b02771f7699abf390003b440c18f98947678dbee7

...

# [11:57:09.139196762] # bin/podman inspect liveness-exec-t383-vn2b6p1n-unhealthy-liveness-ctr-t383-vn2b6p1n-unhealthy --format 5-{{.State.Health.Status}}
# [11:57:09.296619275] 5-unhealthy
#
# [11:57:09.322312911] # bin/podman stop -t0 liveness-exec-t383-vn2b6p1n-unhealthy-liveness-ctr-t383-vn2b6p1n-unhealthy
# [11:57:10.034340238] Error: container 8ae0ccceac07f06a879da59b02771f7699abf390003b440c18f98947678dbee7 is running or paused, refusing to clean up: container state improper

Member Author

...and, this just in, it finally just triggered on a local run. No need to post log, because it looks exactly the same as above modulo container names and shas. Anyhow, the point is, podman stop is barfing in the way you say should not happen. But it's rare.

Labels
approved, release-note-none