Game server container crash before Ready, should restart, not move to Unhealthy #956

markmandel · 2019-07-29T01:15:47Z

What happened:

If the game server container crashes at any stage, the GameServer moved to Unhealthy straight away.

What you expected to happen:

According to: https://agones.dev/site/docs/guides/health-checking/

If any of the GameServer container fails health checking, or exits before the GameServer moves to Ready then, it is restart as per the restartPolicy (which defaults to “Always”)

How to reproduce it (as minimally and precisely as possible):

We will need a e2e test that has a CRASH command on udp-simple that does a sdk.Exit(1) and then test what happens when a crash occurs before Ready and after Ready.

Anything else we need to know?:

Logic can be found here:
https://github.com/googleforgames/agones/blob/master/pkg/gameservers/health.go#L87

Environment:

Agones version: development
Kubernetes version (use kubectl version): 1.12
Cloud provider or hardware configuration: GKE
Install method (yaml/helm): helm
Troubleshooting guide log(s): N/A
Others:

The text was updated successfully, but these errors were encountered:

Colstuwjx · 2019-07-29T01:27:56Z

I also met this issue, version table see below:

agones: 0.11.0 
xonotic version: 0.5

BTW, it seems that our gameserver\fleet crd not maintains POD replica count, if I found the pod is unhealthy, and maunally deleted it, things would be broken, and gameserver/fleet would never try to recreate a new POD...

Thanks.

markmandel · 2019-07-29T01:32:42Z

BTW, it seems that our gameserver\fleet crd not maintains POD replica count, if I found the pod is unhealthy, and maunally deleted it, things would be broken, and gameserver/fleet would never try to recreate a new POD...

We have an e2e test for this, so in theory this shouldn't happen, but if you are finding this an issue - please file a new bug with detailed replication instructions so that we can reproduce and fix. Thank you! 👍

aLekSer · 2019-08-12T15:19:24Z

I was trying to develop such E2E test. It seems that we need a separate simple-udp server which would part of the time fail to start say randomly one time out of 2 (os.Exit() before Ready).
I was not able to send something to UDP server before Ready because it does not have allocated ports.
Added such check and test fails on if statement:

func SendGameServerUDP(gs *agonesv1.GameServer, msg string) (string, error) {
	if len(gs.Status.Ports) < 1 {
		return "", errors.New("Port was not assigned to the GameServer")
	}
        ..
}

markmandel · 2019-08-12T16:17:55Z

I think we could make it a command line argument for the simple-udp startup script, and then overwrite the entrypoint in the configuration (rather than a whole new image)

aLekSer · 2019-08-14T15:43:45Z

I have created a branch which contains a UT with the part of health checks from https://agones.dev/site/docs/guides/health-checking/#health-failure-strategy
https://github.com/aLekSer/agones/tree/restartPolicyBeforeReady
The test itself which will perform os.Exit() before and after Ready state. If we call before Ready than number in restartCount should be >0 .
And note that GameServer moves to Scheduled state first.
https://github.com/aLekSer/agones/blob/86af4e7c3a9b5b2770f832d2d7c216346f5e023a/test/e2e/gameserver_test.go#L37

This provides the CRASH command, to make GameServer crash testing easier. This also includes a new param to disable automatically moving to `Ready` on startup, to enable testing crash events before and after a `Ready` state has been achieved. Work on googleforgames#956

This provides the CRASH command, to make GameServer crash testing easier. This also includes a new param to disable automatically moving to `Ready` on startup, to enable testing crash events before and after a `Ready` state has been achieved. Work on #956

Also moves udp-simple from 0.14=>0.15 More preparation for googleforgames#956

markmandel · 2019-09-06T00:52:50Z

FYI - I think I have a handle on this. It's a bit fun and tricky, but slowly taking it apart.

Also moves udp-simple from 0.14=>0.15 More preparation for googleforgames#956

* E2E test for Unhealthy GameServer on process crash More preparation for #956 * Include logic to test LastTerminatedState, since it looks like an update event can skip the State and move directly to LastTerminatedState.

This brings our implementation inline with what our health checking documentation states that we do. To solve this, we store the currently running GameServer containerID (which is unique to that running instance), as an annotation on the GameServer Pod. When the annotation is not there, we know the Pod is not yet Ready, so we can ignore it when our Unhealthy check occurs, and restarts of the GameServer container can happen as per usual. When the annotation is there, we know to check for failure, but to avoid ContainerStatus.LastTerminationState polluting the result since that crash/failure may have happened before the GameServer is ready, we can compare the current ContainerID to the one stored in the annotation -- which when equal means we can skip looking at the LastTerminationState, as it was before the GameServer was marked as Ready. Lots of unit and e2e test updates to go with this to test it out. Closes googleforgames#956

This brings our implementation inline with what our health checking documentation states that we do. This is done by implementing extra checks in the HealthController to determine if it's appropriate to move to Unhealthy rather than allow a restart to occur. Replaced PR googleforgames#1069 Closes googleforgames#956

…ter (#1099) This brings our implementation inline with what our health checking documentation states that we do. This is done by implementing extra checks in the HealthController to determine if it's appropriate to move to Unhealthy rather than allow a restart to occur. Replaced PR #1069 Closes #956

markmandel added kind/bug These are bugs. area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc labels Jul 29, 2019

aLekSer mentioned this issue Aug 15, 2019

GameServer marked unhealthy immediately when using REST API #1000

Closed

markmandel mentioned this issue Sep 3, 2019

CRASH for simple-udp example #1032

Merged

markmandel added a commit to markmandel/agones that referenced this issue Sep 5, 2019

E2E test for Unhealthy GameServer on process crash

6032601

Also moves udp-simple from 0.14=>0.15 More preparation for googleforgames#956

markmandel mentioned this issue Sep 5, 2019

E2E test for Unhealthy GameServer on process crash #1038

Merged

markmandel added a commit to markmandel/agones that referenced this issue Sep 5, 2019

E2E test for Unhealthy GameServer on process crash

963e6d3

Also moves udp-simple from 0.14=>0.15 More preparation for googleforgames#956

markmandel added a commit to markmandel/agones that referenced this issue Sep 5, 2019

E2E test for Unhealthy GameServer on process crash

2e96028

Also moves udp-simple from 0.14=>0.15 More preparation for googleforgames#956

markmandel added a commit to markmandel/agones that referenced this issue Sep 5, 2019

E2E test for Unhealthy GameServer on process crash

fd03d18

Also moves udp-simple from 0.14=>0.15 More preparation for googleforgames#956

markmandel added a commit to markmandel/agones that referenced this issue Sep 5, 2019

E2E test for Unhealthy GameServer on process crash

92a8513

Also moves udp-simple from 0.14=>0.15 More preparation for googleforgames#956

markmandel added a commit to markmandel/agones that referenced this issue Sep 6, 2019

E2E test for Unhealthy GameServer on process crash

b83c136

Also moves udp-simple from 0.14=>0.15 More preparation for googleforgames#956

markmandel added a commit to markmandel/agones that referenced this issue Sep 6, 2019

E2E test for Unhealthy GameServer on process crash

25302d9

Also moves udp-simple from 0.14=>0.15 More preparation for googleforgames#956

markmandel added a commit to markmandel/agones that referenced this issue Sep 10, 2019

E2E test for Unhealthy GameServer on process crash

7a00252

Also moves udp-simple from 0.14=>0.15 More preparation for googleforgames#956

markmandel added a commit to markmandel/agones that referenced this issue Sep 24, 2019

E2E test for Unhealthy GameServer on process crash

2bd4957

Also moves udp-simple from 0.14=>0.15 More preparation for googleforgames#956

markmandel added a commit to markmandel/agones that referenced this issue Sep 24, 2019

E2E test for Unhealthy GameServer on process crash

70425e6

Also moves udp-simple from 0.14=>0.15 More preparation for googleforgames#956

markmandel added a commit to markmandel/agones that referenced this issue Sep 24, 2019

E2E test for Unhealthy GameServer on process crash

4aefc99

Also moves udp-simple from 0.14=>0.15 More preparation for googleforgames#956

markmandel self-assigned this Sep 24, 2019

markmandel added this to the 1.1.0 milestone Sep 24, 2019

markmandel mentioned this issue Sep 25, 2019

GameServer container restart before Ready, move to Unhealthy state After #1069

Closed

markmandel mentioned this issue Oct 4, 2019

GameServer container restart before Ready, move to Unhealthy state After (v2) #1099

Merged

markmandel modified the milestones: 1.1.0, 1.2.0 Oct 22, 2019

roberthbailey closed this as completed in #1099 Nov 2, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Game server container crash before Ready, should restart, not move to Unhealthy #956

Game server container crash before Ready, should restart, not move to Unhealthy #956

markmandel commented Jul 29, 2019

Colstuwjx commented Jul 29, 2019

markmandel commented Jul 29, 2019

aLekSer commented Aug 12, 2019 •

edited

Loading

markmandel commented Aug 12, 2019

aLekSer commented Aug 14, 2019

markmandel commented Sep 6, 2019

Game server container crash before Ready, should restart, not move to Unhealthy #956

Game server container crash before Ready, should restart, not move to Unhealthy #956

Comments

markmandel commented Jul 29, 2019

Colstuwjx commented Jul 29, 2019

markmandel commented Jul 29, 2019

aLekSer commented Aug 12, 2019 • edited Loading

markmandel commented Aug 12, 2019

aLekSer commented Aug 14, 2019

markmandel commented Sep 6, 2019

aLekSer commented Aug 12, 2019 •

edited

Loading