CI: make 700-play parallel-safe #23998
base: main
Conversation
(where possible. Not all tests are parallelizable). And, refactor two complicated tests into one. This one is hard to review, sorry. Signed-off-by: Ed Santiago <santiago@redhat.com>
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: edsantiago. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
# GAH! Save ten seconds, but in a horrible way.
# - 'kube down' does not have a -t0 option.
# - Using 'top' in the container, instead of 'sleep 100', results
#   in very weird failures. Seriously weird.
# - 'stop -t0', every once in a while on parallel runs on my
#   laptop (never yet in CI), barfs with 'container is running or
#   paused, refusing to clean up, container state improper'
# Here's hoping that this will silence the flakes.
run_podman '?' stop -t0 $ctrName
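For readers unfamiliar with the helper: the '?' argument to run_podman means "any exit status is acceptable". A minimal sketch of that idiom in plain bash, not podman's actual helper; 'flaky_stop' is a hypothetical stand-in for 'podman stop -t0':

```shell
#!/usr/bin/env bash
# Sketch of tolerating a command's nonzero exit under 'set -e',
# similar in spirit to run_podman '?'.
set -euo pipefail

attempts=0
flaky_stop() {
    # Hypothetical stand-in for 'podman stop -t0': fails the first
    # time with a "state improper"-style error, succeeds afterwards.
    attempts=$((attempts + 1))
    if (( attempts == 1 )); then
        echo "Error: refusing to clean up: container state improper" >&2
        return 125
    fi
    echo "stopped"
}

# The '|| rc=$?' idiom records the exit status without aborting the
# script, which is what the '?' exit-code spec permits the test to do.
rc=0
flaky_stop 2>/dev/null || rc=$?
echo "first attempt: rc=$rc"

rc=0
flaky_stop || rc=$?
echo "second attempt: rc=$rc"
```

The point of '?' in the test is exactly this: record and ignore the status rather than fail the test on a transient stop error.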
Do you have a log for this? What exactly is not working with top?
I have lots of logs for "container state improper"; is there anything useful that can be gleaned from them?
As for 'top': no, I can't find those logs any more. I will try to reproduce.
I mean, podman stop should be idempotent, so it should never error with 'container state improper' here; this is likely something that needs to be fixed in podman.
Using 'top -b', the healthy test passes but unhealthy does not:
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#| FAIL: Container never goes unhealthy
#| expected: !~ -unhealthy
#| actual: 1-starting 2-starting 3-starting 4-starting 5-unhealthy 6-starting 7-starting 8-starting 9-starting 10-starting 11-starting 12-healthy
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I guess my "weird" comment reflects my bafflement as to how the container can ever go healthy, and why 'top' is different from 'sleep'.
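For context on those "N-status" strings: they read like the output of a polling loop of roughly this shape. This is a hypothetical sketch; 'get_health_status' stands in for 'podman inspect --format {{.State.Health.Status}}', and the canned status sequence is invented for illustration:

```shell
#!/usr/bin/env bash
# Sketch of the "N-status" polling pattern seen in the failure output.
# get_health_status is a stand-in for:
#   podman inspect $ctr --format '{{.State.Health.Status}}'
set -euo pipefail

statuses=(starting starting unhealthy)   # canned sequence, for illustration
i=0
get_health_status() {
    echo "${statuses[$i]}"
}

seen=""
for t in $(seq 1 "${#statuses[@]}"); do
    s=$(get_health_status)
    i=$((i + 1))
    seen+="${t}-${s} "               # accumulate "N-status" history
    if [[ "$s" == "unhealthy" ]]; then
        break                        # got the state we were waiting for
    fi
    sleep 0.1                        # real test pauses between inspects
done
echo "$seen"
```

With this reading, each "N-status" token is one inspect sample, which is why a "5-unhealthy" followed by "6-starting" in the earlier log is so surprising.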
Isn't it the inverse in this check, "Container never goes $dontwant"? So the healthy case failed?
The log doesn't make much sense: 5-unhealthy, for it then to go back to starting; this seems like something wrong with health checks.
But yeah, sleep vs top: no idea...
nope:
#| FAIL: Container got to 'unhealthy'
#| expected: =~ -unhealthy\$
#| actual: 1-starting 2-starting 3-starting 4-starting 5-starting 6-starting 7-starting 8-starting 9-starting 10-starting 11-starting 12-starting 13-starting 14-starting 15-starting 16-starting 17-starting 18-starting 19-starting 20-starting 21-starting 22-starting 23-starting 24-starting
Giving up. I'm just monkeying without actually learning anything useful.
How easy is this to reproduce? I can try to instrument some podman code to see if I can figure out what is going on.
"this" being the top
issue? About 50% reproducible. s/sleep 100/top/
and run hack/bats --rootless --tags ci:parallel
. If you mean the stop ... state improper
thing, I have not seen it yet this morning.
Oh: I've never seen the top
thing in CI, only on my 12-core laptop
Here's a log of the 'stop' issue, from Sep 5:
...
# [11:57:04.641519074] # bin/podman kube play /tmp/podman_bats.NjbFo3/play_kube_unhealthy_7iPZ8Q.yaml
# [11:57:05.565840635] Pod:
# 5f7428be75f02b786e801f19c9bb494011c8e12d70a6ae29e809f3463a6fe16f
# Container:
# 8ae0ccceac07f06a879da59b02771f7699abf390003b440c18f98947678dbee7
...
# [11:57:09.139196762] # bin/podman inspect liveness-exec-t383-vn2b6p1n-unhealthy-liveness-ctr-t383-vn2b6p1n-unhealthy --format 5-{{.State.Health.Status}}
# [11:57:09.296619275] 5-unhealthy
#
# [11:57:09.322312911] # bin/podman stop -t0 liveness-exec-t383-vn2b6p1n-unhealthy-liveness-ctr-t383-vn2b6p1n-unhealthy
# [11:57:10.034340238] Error: container 8ae0ccceac07f06a879da59b02771f7699abf390003b440c18f98947678dbee7 is running or paused, refusing to clean up: container state improper
...and, this just in: it finally triggered on a local run. No need to post the log, because it looks exactly the same as above modulo container names and SHAs. Anyhow, the point is, podman stop is barfing in the way you say should not happen. But it's rare.
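For a rare transient failure like this, a test-suite workaround (not a podman fix) could also be a bounded retry instead of ignoring the status outright. A hypothetical sketch; 'do_stop' stands in for 'podman stop -t0 $ctrName' and its fail-once behavior is simulated:

```shell
#!/usr/bin/env bash
set -euo pipefail

calls=0
do_stop() {
    # Stand-in for 'podman stop -t0 $ctrName'; fails once, then
    # succeeds, to simulate the rare "container state improper" error.
    calls=$((calls + 1))
    if (( calls < 2 )); then
        echo "container state improper" >&2
        return 125
    fi
    return 0
}

stop_with_retry() {
    local tries rc
    for tries in 1 2 3; do
        rc=0
        do_stop || rc=$?
        (( rc == 0 )) && return 0
        sleep 0.1                    # brief pause before retrying
    done
    return "$rc"                     # give up after three attempts
}

stop_with_retry
echo "stopped after $calls call(s)"
```

The tradeoff versus run_podman '?': a retry would still surface a persistent stop failure, whereas '?' silences it entirely.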