Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Game server container crash before Ready, should restart, not move to Unhealthy #956

Closed
markmandel opened this issue Jul 29, 2019 · 6 comments · Fixed by #1099
Closed

Game server container crash before Ready, should restart, not move to Unhealthy #956

markmandel opened this issue Jul 29, 2019 · 6 comments · Fixed by #1099
Assignees
Labels
area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/bug These are bugs.
Milestone

Comments

@markmandel
Copy link
Member

What happened:

If the game server container crashes at any stage, the GameServer moved to Unhealthy straight away.

What you expected to happen:

According to: https://agones.dev/site/docs/guides/health-checking/

If any of the GameServer container fails health checking, or exits before the GameServer moves to Ready then, it is restart as per the restartPolicy (which defaults to “Always”)

How to reproduce it (as minimally and precisely as possible):

We will need a e2e test that has a CRASH command on udp-simple that does a sdk.Exit(1) and then test what happens when a crash occurs before Ready and after Ready.

Anything else we need to know?:

Logic can be found here:
https://github.com/googleforgames/agones/blob/master/pkg/gameservers/health.go#L87

Environment:

  • Agones version: development
  • Kubernetes version (use kubectl version): 1.12
  • Cloud provider or hardware configuration: GKE
  • Install method (yaml/helm): helm
  • Troubleshooting guide log(s): N/A
  • Others:
@markmandel markmandel added kind/bug These are bugs. area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc labels Jul 29, 2019
@Colstuwjx
Copy link

I also met this issue, version table see below:

agones: 0.11.0 
xonotic version: 0.5

BTW, it seems that our gameserver\fleet crd not maintains POD replica count, if I found the pod is unhealthy, and maunally deleted it, things would be broken, and gameserver/fleet would never try to recreate a new POD...

Thanks.

@markmandel
Copy link
Member Author

BTW, it seems that our gameserver\fleet crd not maintains POD replica count, if I found the pod is unhealthy, and maunally deleted it, things would be broken, and gameserver/fleet would never try to recreate a new POD...

We have an e2e test for this, so in theory this shouldn't happen, but if you are finding this an issue - please file a new bug with detailed replication instructions so that we can reproduce and fix. Thank you! 👍

@aLekSer
Copy link
Collaborator

aLekSer commented Aug 12, 2019

I was trying to develop such E2E test. It seems that we need a separate simple-udp server which would part of the time fail to start say randomly one time out of 2 (os.Exit() before Ready).
I was not able to send something to UDP server before Ready because it does not have allocated ports.
Added such check and test fails on if statement:

func SendGameServerUDP(gs *agonesv1.GameServer, msg string) (string, error) {
	if len(gs.Status.Ports) < 1 {
		return "", errors.New("Port was not assigned to the GameServer")
	}
        ..
}

@markmandel
Copy link
Member Author

I think we could make it a command line argument for the simple-udp startup script, and then overwrite the entrypoint in the configuration (rather than a whole new image)

@aLekSer
Copy link
Collaborator

aLekSer commented Aug 14, 2019

I have created a branch which contains a UT with the part of health checks from https://agones.dev/site/docs/guides/health-checking/#health-failure-strategy
https://github.com/aLekSer/agones/tree/restartPolicyBeforeReady
The test itself which will perform os.Exit() before and after Ready state. If we call before Ready than number in restartCount should be >0 .
And note that GameServer moves to Scheduled state first.
https://github.com/aLekSer/agones/blob/86af4e7c3a9b5b2770f832d2d7c216346f5e023a/test/e2e/gameserver_test.go#L37

markmandel added a commit to markmandel/agones that referenced this issue Sep 3, 2019
This provides the CRASH command, to make GameServer crash testing
easier.

This also includes a new param to disable automatically moving to
`Ready` on startup, to enable testing crash events before and after
a `Ready` state has been achieved.

Work on googleforgames#956
roberthbailey pushed a commit that referenced this issue Sep 4, 2019
This provides the CRASH command, to make GameServer crash testing
easier.

This also includes a new param to disable automatically moving to
`Ready` on startup, to enable testing crash events before and after
a `Ready` state has been achieved.

Work on #956
markmandel added a commit to markmandel/agones that referenced this issue Sep 5, 2019
Also moves udp-simple from 0.14=>0.15

More preparation for googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Sep 5, 2019
Also moves udp-simple from 0.14=>0.15

More preparation for googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Sep 5, 2019
Also moves udp-simple from 0.14=>0.15

More preparation for googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Sep 5, 2019
Also moves udp-simple from 0.14=>0.15

More preparation for googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Sep 5, 2019
Also moves udp-simple from 0.14=>0.15

More preparation for googleforgames#956
@markmandel
Copy link
Member Author

FYI - I think I have a handle on this. It's a bit fun and tricky, but slowly taking it apart.

markmandel added a commit to markmandel/agones that referenced this issue Sep 6, 2019
Also moves udp-simple from 0.14=>0.15

More preparation for googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Sep 6, 2019
Also moves udp-simple from 0.14=>0.15

More preparation for googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Sep 10, 2019
Also moves udp-simple from 0.14=>0.15

More preparation for googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Sep 24, 2019
Also moves udp-simple from 0.14=>0.15

More preparation for googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Sep 24, 2019
Also moves udp-simple from 0.14=>0.15

More preparation for googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Sep 24, 2019
Also moves udp-simple from 0.14=>0.15

More preparation for googleforgames#956
roberthbailey pushed a commit that referenced this issue Sep 24, 2019
* E2E test for Unhealthy GameServer on process crash

More preparation for #956

* Include logic to test LastTerminatedState, since
it looks like an update event can skip the State and
move directly to LastTerminatedState.
@markmandel markmandel self-assigned this Sep 24, 2019
@markmandel markmandel added this to the 1.1.0 milestone Sep 24, 2019
markmandel added a commit to markmandel/agones that referenced this issue Sep 25, 2019
This brings our implementation inline with what our health checking
documentation states that we do.

To solve this, we store the currently running GameServer containerID
(which is unique to that running instance), as an annotation on the
GameServer Pod.

When the annotation is not there, we know the Pod is not yet Ready,
so we can ignore it when our Unhealthy check occurs, and restarts
of the GameServer container can happen as per usual.

When the annotation is there, we know to check for failure, but to
avoid ContainerStatus.LastTerminationState polluting the result
since that crash/failure may have happened before the GameServer is
ready, we can compare the current ContainerID to the one stored in the
annotation -- which when equal means we can skip looking at the
LastTerminationState, as it was before the GameServer was marked as
Ready.

Lots of unit and e2e test updates to go with this to test it out.

Closes googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Sep 25, 2019
This brings our implementation inline with what our health checking
documentation states that we do.

To solve this, we store the currently running GameServer containerID
(which is unique to that running instance), as an annotation on the
GameServer Pod.

When the annotation is not there, we know the Pod is not yet Ready,
so we can ignore it when our Unhealthy check occurs, and restarts
of the GameServer container can happen as per usual.

When the annotation is there, we know to check for failure, but to
avoid ContainerStatus.LastTerminationState polluting the result
since that crash/failure may have happened before the GameServer is
ready, we can compare the current ContainerID to the one stored in the
annotation -- which when equal means we can skip looking at the
LastTerminationState, as it was before the GameServer was marked as
Ready.

Lots of unit and e2e test updates to go with this to test it out.

Closes googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Sep 25, 2019
This brings our implementation inline with what our health checking
documentation states that we do.

To solve this, we store the currently running GameServer containerID
(which is unique to that running instance), as an annotation on the
GameServer Pod.

When the annotation is not there, we know the Pod is not yet Ready,
so we can ignore it when our Unhealthy check occurs, and restarts
of the GameServer container can happen as per usual.

When the annotation is there, we know to check for failure, but to
avoid ContainerStatus.LastTerminationState polluting the result
since that crash/failure may have happened before the GameServer is
ready, we can compare the current ContainerID to the one stored in the
annotation -- which when equal means we can skip looking at the
LastTerminationState, as it was before the GameServer was marked as
Ready.

Lots of unit and e2e test updates to go with this to test it out.

Closes googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Oct 1, 2019
This brings our implementation inline with what our health checking
documentation states that we do.

To solve this, we store the currently running GameServer containerID
(which is unique to that running instance), as an annotation on the
GameServer Pod.

When the annotation is not there, we know the Pod is not yet Ready,
so we can ignore it when our Unhealthy check occurs, and restarts
of the GameServer container can happen as per usual.

When the annotation is there, we know to check for failure, but to
avoid ContainerStatus.LastTerminationState polluting the result
since that crash/failure may have happened before the GameServer is
ready, we can compare the current ContainerID to the one stored in the
annotation -- which when equal means we can skip looking at the
LastTerminationState, as it was before the GameServer was marked as
Ready.

Lots of unit and e2e test updates to go with this to test it out.

Closes googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Oct 1, 2019
This brings our implementation inline with what our health checking
documentation states that we do.

To solve this, we store the currently running GameServer containerID
(which is unique to that running instance), as an annotation on the
GameServer Pod.

When the annotation is not there, we know the Pod is not yet Ready,
so we can ignore it when our Unhealthy check occurs, and restarts
of the GameServer container can happen as per usual.

When the annotation is there, we know to check for failure, but to
avoid ContainerStatus.LastTerminationState polluting the result
since that crash/failure may have happened before the GameServer is
ready, we can compare the current ContainerID to the one stored in the
annotation -- which when equal means we can skip looking at the
LastTerminationState, as it was before the GameServer was marked as
Ready.

Lots of unit and e2e test updates to go with this to test it out.

Closes googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Oct 1, 2019
This brings our implementation inline with what our health checking
documentation states that we do.

To solve this, we store the currently running GameServer containerID
(which is unique to that running instance), as an annotation on the
GameServer Pod.

When the annotation is not there, we know the Pod is not yet Ready,
so we can ignore it when our Unhealthy check occurs, and restarts
of the GameServer container can happen as per usual.

When the annotation is there, we know to check for failure, but to
avoid ContainerStatus.LastTerminationState polluting the result
since that crash/failure may have happened before the GameServer is
ready, we can compare the current ContainerID to the one stored in the
annotation -- which when equal means we can skip looking at the
LastTerminationState, as it was before the GameServer was marked as
Ready.

Lots of unit and e2e test updates to go with this to test it out.

Closes googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Oct 2, 2019
This brings our implementation inline with what our health checking
documentation states that we do.

To solve this, we store the currently running GameServer containerID
(which is unique to that running instance), as an annotation on the
GameServer Pod.

When the annotation is not there, we know the Pod is not yet Ready,
so we can ignore it when our Unhealthy check occurs, and restarts
of the GameServer container can happen as per usual.

When the annotation is there, we know to check for failure, but to
avoid ContainerStatus.LastTerminationState polluting the result
since that crash/failure may have happened before the GameServer is
ready, we can compare the current ContainerID to the one stored in the
annotation -- which when equal means we can skip looking at the
LastTerminationState, as it was before the GameServer was marked as
Ready.

Lots of unit and e2e test updates to go with this to test it out.

Closes googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Oct 4, 2019
This brings our implementation inline with what our health checking
documentation states that we do.

This is done by implementing extra checks in the HealthController to
determine if it's appropriate to move to Unhealthy rather than allow
a restart to occur.

Replaced PR googleforgames#1069

Closes googleforgames#956
@markmandel markmandel modified the milestones: 1.1.0, 1.2.0 Oct 22, 2019
markmandel added a commit to markmandel/agones that referenced this issue Oct 24, 2019
This brings our implementation inline with what our health checking
documentation states that we do.

This is done by implementing extra checks in the HealthController to
determine if it's appropriate to move to Unhealthy rather than allow
a restart to occur.

Replaced PR googleforgames#1069

Closes googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Oct 24, 2019
This brings our implementation inline with what our health checking
documentation states that we do.

This is done by implementing extra checks in the HealthController to
determine if it's appropriate to move to Unhealthy rather than allow
a restart to occur.

Replaced PR googleforgames#1069

Closes googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Oct 24, 2019
This brings our implementation inline with what our health checking
documentation states that we do.

This is done by implementing extra checks in the HealthController to
determine if it's appropriate to move to Unhealthy rather than allow
a restart to occur.

Replaced PR googleforgames#1069

Closes googleforgames#956
markmandel added a commit to markmandel/agones that referenced this issue Nov 2, 2019
This brings our implementation inline with what our health checking
documentation states that we do.

This is done by implementing extra checks in the HealthController to
determine if it's appropriate to move to Unhealthy rather than allow
a restart to occur.

Replaced PR googleforgames#1069

Closes googleforgames#956
roberthbailey pushed a commit that referenced this issue Nov 2, 2019
…ter (#1099)

This brings our implementation inline with what our health checking
documentation states that we do.

This is done by implementing extra checks in the HealthController to
determine if it's appropriate to move to Unhealthy rather than allow
a restart to occur.

Replaced PR #1069

Closes #956
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/bug These are bugs.
Projects
None yet
3 participants