Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for health check race condition #2351

Merged
merged 1 commit into from
Nov 2, 2021

Conversation

markmandel
Copy link
Member

@markmandel markmandel commented Oct 29, 2021

What type of PR is this?

Uncomment only one /kind <> line, press enter to put that in a new line, and remove leading whitespace from that line:

/kind breaking

/kind bug

/kind cleanup
/kind documentation
/kind feature
/kind hotfix

What this PR does / Why we need it:

The Pod and GameServer data in the Informer cache are eventually consistent and can result in data races.

This fix ensures that they are in sync at the time that the GameServer is moved to Ready, so that Health checking can behave correctly if a crash occurs after the fact.

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

This also includes a combination of smaller fixes/debug for e2e flakiness,
as we narrowed down the problem.

  • Retries on UDP packet sending
  • Failing on UDP packet sending dumps gameserver details
  • Drop parallelism on e2e tests from 32 to 16.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: d2702608-cdfe-4f78-9c15-9c202fa36835

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2351/head:pr_2351 && git checkout pr_2351
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-14418e3

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 5ffee1c4-57fa-49e6-a510-8beef192b424

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2351/head:pr_2351 && git checkout pr_2351
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-e1be871

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: ea9a49ab-9959-4c01-8871-505619ddad0d

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel markmandel changed the title Combining a few debugging ideas Combination of fixes/debug for e2e flakiness Oct 30, 2021
@markmandel markmandel marked this pull request as ready for review October 30, 2021 03:51
@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 456cadbe-706b-4aa5-962a-8e164072639d

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@roberthbailey
Copy link
Member

--- FAIL: TestGameServerAllocationMetaDataPatch (305.22s)
    gameserverallocation_test.go:383: 
        	Error Trace:	gameserverallocation_test.go:383
        	Error:      	Expected nil, but got: &errors.errorString{s:"waiting for {game-server [{udp-port Dynamic <nil> 7654 0 UDP}] {false 0 0 0}  { 0 0} {{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []} {[] [] [{game-server gcr.io/agones-images/simple-game-server:0.4 [] []  [] [] [] {map[cpu:{{30 -3} {<nil>} 30m DecimalSI} memory:{{33554432 0} {<nil>}  BinarySI}] map[cpu:{{30 -3} {<nil>} 30m DecimalSI} memory:{{33554432 0} {<nil>}  BinarySI}]} [] [] nil nil nil nil   IfNotPresent nil false false false}] []  <nil> <nil>  map[]   <nil>  false false false <nil> nil []   nil  [] []  <nil> nil [] <nil> <nil> <nil> map[] [] <nil>}} <nil>} GameServer instance readiness timed out (): waiting for GameServer to be Ready 1635567595/game-server26bwl: timed out waiting for the condition"}
        	Test:       	TestGameServerAllocationMetaDataPatch
    gameserverallocation_test.go:384: 
        	Error Trace:	gameserverallocation_test.go:384
        	Error:      	could not create GameServer
        	Test:       	TestGameServerAllocationMetaDataPatch

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 887524df-56a0-401c-8e73-9296be1d1626

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Copy link
Member Author

So now one can grep the logs to get all the logs for a specific e2e test, and now we start to see something strange:

time="2021-10-30 04:24:31.385" level=info msg="GameServer created, waiting for Ready" gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
time="2021-10-30 04:24:33.802" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
time="2021-10-30 04:24:37.373" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
time="2021-10-30 04:24:40.374" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
time="2021-10-30 04:24:43.573" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
time="2021-10-30 04:24:46.574" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
time="2021-10-30 04:24:49.173" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
time="2021-10-30 04:24:52.172" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
time="2021-10-30 04:24:54.974" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
time="2021-10-30 04:24:58.190" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
time="2021-10-30 04:25:00.973" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
time="2021-10-30 04:25:03.774" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
time="2021-10-30 04:25:06.575" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Unhealthy gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
time="2021-10-30 04:25:08.973" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Unhealthy gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
...
time="2021-10-30 04:25:11.774" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Unhealthy gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
time="2021-10-30 04:25:14.774" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Unhealthy gs=game-server26bwl test=TestGameServerAllocationMetaDataPatch
--- FAIL: TestGameServerAllocationMetaDataPatch (305.22s)
        	Test:       	TestGameServerAllocationMetaDataPatch
        	Test:       	TestGameServerAllocationMetaDataPatch
time="2021-10-30 05:03:46.829" level=info msg="Could not read from address" address="34.127.24.242:7783" error="read udp 192.168.10.6:33435->34.127.24.242:7783: i/o timeout" test=TestFleetRecreateGameServers/gameserver_unhealthy
time="2021-10-30 05:03:46.869" level=info msg="Could not read from address" address="34.127.24.242:7783" error="read udp 192.168.10.6:54095->34.127.24.242:7783: read: connection refused" test=TestFleetRecreateGameServers/gameserver_unhealthy
time="2021-10-30 05:03:46.870" level=info msg="Failed to send UDP packet to GameServer. Dumping Events!" gs=simple-fleet-q4k2q-ncqvg-dqmzb status="{State:Ready Ports:[{Name:udp-port Port:7783}] Address:34.127.24.242 NodeName:gke-e2e-test-cluster-default-684ee24f-41w0 ReservedUntil:<nil> Players:0xc00035fb60}" test=TestFleetRecreateGameServers/gameserver_unhealthy
time="2021-10-30 05:03:46.870" level=info msg="Dumping Events:" kind=GameServer test=TestFleetRecreateGameServers/gameserver_unhealthy
time="2021-10-30 05:03:47.060" level=info msg="Event!" lastTimestamp="2021-10-30 05:02:49 +0000 UTC" message="Port allocated" reason=PortAllocation test=TestFleetRecreateGameServers/gameserver_unhealthy type=Normal
time="2021-10-30 05:03:47.060" level=info msg="Event!" lastTimestamp="2021-10-30 05:02:49 +0000 UTC" message="Pod simple-fleet-q4k2q-ncqvg-dqmzb created" reason=Creating test=TestFleetRecreateGameServers/gameserver_unhealthy type=Normal
time="2021-10-30 05:03:47.060" level=info msg="Event!" lastTimestamp="2021-10-30 05:02:49 +0000 UTC" message="Address and port populated" reason=Scheduled test=TestFleetRecreateGameServers/gameserver_unhealthy type=Normal
time="2021-10-30 05:03:47.061" level=info msg="Event!" lastTimestamp="2021-10-30 05:03:37 +0000 UTC" message="SDK state change" reason=RequestReady test=TestFleetRecreateGameServers/gameserver_unhealthy type=Normal
time="2021-10-30 05:03:47.061" level=info msg="Event!" lastTimestamp="2021-10-30 05:03:37 +0000 UTC" message="SDK.Ready() complete" reason=Ready test=TestFleetRecreateGameServers/gameserver_unhealthy type=Normal
time="2021-10-30 05:03:47.061" level=info msg="Event!" lastTimestamp="2021-10-30 05:03:44 +0000 UTC" message="Issue with Gameserver pod" reason=Unhealthy test=TestFleetRecreateGameServers/gameserver_unhealthy type=Warning
time="2021-10-30 05:03:47.061" level=info msg="Event!" lastTimestamp="2021-10-30 05:03:44 +0000 UTC" message="Deletion started" reason=Shutdown test=TestFleetRecreateGameServers/gameserver_unhealthy type=Normal
time="2021-10-30 05:03:47.061" level=info msg="Event!" lastTimestamp="2021-10-30 05:03:44 +0000 UTC" message="Deleting Pod simple-fleet-q4k2q-ncqvg-dqmzb" reason=Shutdown test=TestFleetRecreateGameServers/gameserver_unhealthy type=Normal

So time to go go hunting through the debug logs we added to the controller I think, and find out why these gameservers are randomly moving to Unhealthy. I'm wondering if something has changed in one of our hacks somewhere (i.e. a string match), and it's being construed as a crash that isn't what we think it is. The GameServer from TestGameServerAllocationMetaDataPatch , game-server26bwl looks like a good subject, since it's only creating a single GameServer in that test.

func TestGameServerAllocationMetaDataPatch(t *testing.T) {
t.Parallel()
ctx := context.Background()
gs := framework.DefaultGameServer(framework.Namespace)
gs.ObjectMeta.Labels = map[string]string{"test": t.Name()}
gs, err := framework.CreateGameServerAndWaitUntilReady(t, framework.Namespace, gs)
if !assert.Nil(t, err) {
assert.FailNow(t, "could not create GameServer")
}
defer framework.AgonesClient.AgonesV1().GameServers(framework.Namespace).Delete(ctx, gs.ObjectMeta.Name, metav1.DeleteOptions{}) // nolint: errcheck
gsa := &allocationv1.GameServerAllocation{ObjectMeta: metav1.ObjectMeta{GenerateName: "allocation-"},
Spec: allocationv1.GameServerAllocationSpec{
Selectors: []allocationv1.GameServerSelector{{LabelSelector: metav1.LabelSelector{MatchLabels: map[string]string{"test": t.Name()}}}},
MetaPatch: allocationv1.MetaPatch{
Labels: map[string]string{"red": "blue"},
Annotations: map[string]string{"dog": "good"},
},
}}
err = wait.PollImmediate(time.Second, 30*time.Second, func() (bool, error) {
gsa, err = framework.AgonesClient.AllocationV1().GameServerAllocations(framework.Namespace).Create(ctx, gsa.DeepCopy(), metav1.CreateOptions{})
if err != nil {
return true, err
}
return allocationv1.GameServerAllocationAllocated == gsa.Status.State, nil
})
if err != nil {
assert.FailNow(t, err.Error())
}
gs, err = framework.AgonesClient.AgonesV1().GameServers(framework.Namespace).Get(ctx, gsa.Status.GameServerName, metav1.GetOptions{})
if assert.Nil(t, err) {
assert.Equal(t, "blue", gs.ObjectMeta.Labels["red"])
assert.Equal(t, "good", gs.ObjectMeta.Annotations["dog"])
}
}

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: d93d0720-070a-44db-8edf-0f03f917f3e9

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Copy link
Member Author

Digging into the controller logs for game-server26bwl in test TestGameServerAllocationMetaDataPatch
https://cloudlogging.app.goo.gl/N2YKQVN1e7uckzgZ7
(TIL you can get a shortlink to a logging query and time space)

Just looking at Events, I can see it going from Created to Ready, but then straight to Unhealthy:

"Event(v1.ObjectReference{Kind:"GameServer", Namespace:"1635567595", Name:"game-server26bwl", UID:"cc0cf02d-08a7-4be1-a3cb-e376688818b0", APIVersion:"agones.dev/v1", ResourceVersion:"7324243", FieldPath:""}): type: 'Normal' reason: 'Creating' Pod game-server26bwl created"
"Event(v1.ObjectReference{Kind:"GameServer", Namespace:"1635567595", Name:"game-server26bwl", UID:"cc0cf02d-08a7-4be1-a3cb-e376688818b0", APIVersion:"agones.dev/v1", ResourceVersion:"7324250", FieldPath:""}): type: 'Normal' reason: 'Scheduled' Address and port populated"
"Event(v1.ObjectReference{Kind:"GameServer", Namespace:"1635567595", Name:"game-server26bwl", UID:"cc0cf02d-08a7-4be1-a3cb-e376688818b0", APIVersion:"agones.dev/v1", ResourceVersion:"7325372", FieldPath:""}): type: 'Normal' reason: 'Ready' SDK.Ready() complete"
"Event(v1.ObjectReference{Kind:"GameServer", Namespace:"1635567595", Name:"game-server26bwl", UID:"cc0cf02d-08a7-4be1-a3cb-e376688818b0", APIVersion:"agones.dev/v1", ResourceVersion:"7325372", FieldPath:""}): type: 'Warning' reason: 'Unhealthy' Issue with Gameserver pod"

Looking just before the move to Unhealthy, I can see the HealthController saying the GameServer is not healthy:

"Issue with GameServer pod, marking as GameServerStateUnhealthy"

Looking before that, I can see the message "Container Failed" - and if we look at the log, we can see that the simple-gameserver seems to have crashed! The intent of the the health check is such that is a crash happens after moving to Ready, the GameServer should be moved to Unhealthy.

{
  "insertId": "9l7sghonrqe6wf83",
  "jsonPayload": {
    "containerStatuses": [
      {
        "ready": true,
        "containerID": "containerd://4d1c06167addfb1f287f7110a0b2d6594aebd4f3fbb49e99e17702031190980e",
        "restartCount": 0,
        "state": {
          "running": {
            "startedAt": "2021-10-30T04:24:59Z"
          }
        },
        "imageID": "gcr.io/agones-images/agones-sdk@sha256:949b060db929eb5ffcf66c961ce259073c00fe38dcfe04c8efe2e96f5ef662fd",
        "name": "agones-gameserver-sidecar",
        "image": "gcr.io/agones-images/agones-sdk:1.19.0-93e5a8f",
        "started": true,
        "lastState": {}
      },
      {
        "started": false,
        "imageID": "gcr.io/agones-images/simple-game-server@sha256:ac3e5a8388677453834f567fe1561eaec0bb3dab4df6243933064823f5c144b6",
        "state": {
          "terminated": {
            "reason": "Error",
            "containerID": "containerd://317fbe44e5a7446bb1b1e4417969ebf562ee9e8236375d0253927841ea7e5b4a",
            "exitCode": 1,
            "startedAt": "2021-10-30T04:24:32Z",
            "finishedAt": "2021-10-30T04:25:02Z"
          }
        },
        "name": "game-server",
        "lastState": {},
        "ready": false,
        "containerID": "containerd://317fbe44e5a7446bb1b1e4417969ebf562ee9e8236375d0253927841ea7e5b4a",
        "image": "gcr.io/agones-images/simple-game-server:0.4",
        "restartCount": 0
      }
    ],
    "message": "Container Failed",
    "source": "*gameservers.HealthController",
    "container": "game-server",
    "gs": "game-server26bwl"
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
      "location": "us-west1-c",
      "namespace_name": "agones-system",
      "cluster_name": "e2e-test-cluster",
      "project_id": "agones-images",
      "pod_name": "agones-controller-7cbc78d57b-7z48d",
      "container_name": "agones-controller"
    }
  },
  "timestamp": "2021-10-30T04:25:04.779309764Z",
  "severity": "DEBUG",
  "labels": {
    "k8s-pod/heritage": "Helm",
    "compute.googleapis.com/resource_name": "gke-e2e-test-cluster-agones-system-578206ef-4cpx",
    "k8s-pod/release": "agones",
    "k8s-pod/agones_dev/role": "controller",
    "k8s-pod/pod-template-hash": "7cbc78d57b",
    "k8s-pod/app": "agones"
  },
  "logName": "projects/agones-images/logs/stderr",
  "receiveTimestamp": "2021-10-30T04:25:06.624093989Z"
}

So the next area of investigation: Why did it crash? (or think it crashed?) Going to need to pull up it's logs.

@markmandel
Copy link
Member Author

Pulling up the logs for the simple-gameserver and I can see it crashes because it initially fails to connect to the SDK.

https://cloudlogging.app.goo.gl/Rcs7V3uHnq99DpMLA
image

Crashing here, since it can't connect to the SDK server:
https://github.com/googleforgames/agones/blob/main/examples/simple-game-server/main.go#L71-L75

There are two questions I have at this point:

  1. Why is this failing? Seems odd, but only sometimes? (Could be the cluster is overloaded????).
    1. Might want to confirm this happens with other GameServer Pods.
    2. There is a pretty solid >=20s or so between the GameServer starting up and the sdk server! (Maybe that's the new behaviour?)
  2. The health checking code is supposed to know that the crash happened before the GameServer was ready (need to check on how that is done, I think it has to do with the containerid annotation we set, maybe that behavior has changed 🤔 ), and if so, let is pass through. What is happening now is that as soon as the GameServer hits ready it sees it as a crash that happened after it became Ready, and therefore moves it to Unhealthy. (Maybe this is a bug that has always been there?????) 🤔 Although I am noting in the log for the pod that it claims it wasn't restarted, just that it's in a terminated state. That's.... weird.

Okay, that at least gives us something to go on.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 5f64ce57-4526-41df-a508-f4a829c97c61

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 453f9900-20f6-4a54-bbeb-7b86911d9271

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@roberthbailey
Copy link
Member

This article is interesting: https://medium.com/@marko.luksa/delaying-application-start-until-sidecar-is-ready-2ec2d21a7b74

I just checked and it appears that we add the sdk sidecar after the game container in the pods we create for a game server. I wonder if switching the order would help mitigate the issue with the sidecar not starting for such a long time.

@roberthbailey
Copy link
Member

The other thing I'm looking into is whether there are any differences in GKE's default cluster settings that have changed between 1.20 and 1.21 that might affect us. So far it looks like the only thing I've seen is the default network settings are different:

WARNING: Currently VPC-native is the default mode during cluster creation for versions greater than 1.21.0-gke.1500. To create advanced routes based clusters, please pass the --no-enable-ip-alias flag

But nothing about container runtimes or node images.

@markmandel
Copy link
Member Author

I just checked and it appears that we add the sdk sidecar after the game container in the pods we create for a game server. I wonder if switching the order would help mitigate the issue with the sidecar not starting for such a long time.

Sounds like a good idea in general. I figure the delay is either (1) first image download or (2) overloaded system (3) combination therein

Agones should handle this anyway, without moving the GameServer to Unhealthy, so it seems like a legitimate bug.

@markmandel
Copy link
Member Author

markmandel commented Nov 1, 2021

--- FAIL: TestGameServerRestartBeforeReadyCrash (211.69s)
    gameserver_test.go:353: 
        	Error Trace:	gameserver_test.go:353
        	Error:      	Received unexpected error:
        	            	timed out waiting for the condition
        	Test:       	TestGameServerRestartBeforeReadyCrash

Is back again. I think this might actually be the best test to dig into the flakiness.

Dumping logs from the test:

time="2021-11-01 15:33:38.287" level=info msg="Waiting for us to have an address to send things to" test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:33:40.898" level=info msg="Waiting for states to match" awaitingState=Scheduled currentState=Scheduled gs=game-serversz62f test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:33:40.898" level=info msg="GameServer created" gs=game-serversz62f test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:33:40.898" level=info msg="Dialing UDP message to address" address="35.197.104.56:7073" test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:33:40.898" level=info msg="crashing, and waiting to see restart" test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:33:43.941" level=info msg="sending message" fields.msg=CRASH gs=game-serversz62f state=Scheduled test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:33:49.720" level=info msg="sending message" fields.msg=CRASH gs=game-serversz62f state=Scheduled test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:33:52.522" level=info msg="successfully crashed. Moving on!" test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:33:55.276" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-serversz62f test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:33:58.876" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-serversz62f test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:34:01.724" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-serversz62f test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:34:04.476" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-serversz62f test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:34:04.477" level=info msg="marking GameServer as ready" test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:34:07.142" level=info msg="sending message" fields.msg=READY gs=game-serversz62f state=Scheduled test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:34:12.520" level=info msg="sending message" fields.msg=READY gs=game-serversz62f state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:34:16.121" level=info msg="sending message" fields.msg=READY gs=game-serversz62f state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
... repeat ...
time="2021-11-01 15:36:13.238" level=info msg="sending message" fields.msg=READY gs=game-serversz62f state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:36:16.230" level=info msg="sending message" fields.msg=READY gs=game-serversz62f state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:37:04.232" level=info msg="sending message" fields.msg=READY gs=game-serversz62f state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:37:07.235" level=info msg="sending message" fields.msg=READY gs=game-serversz62f state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:37:07.322" level=info msg="sending message" fields.msg=READY gs=game-serversz62f state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:37:07.323" level=info msg="crashing again, should be unhealthy" test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:37:07.410" level=info msg="checking final crash state" gs=game-serversz62f state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2021-11-01 15:37:07.410" level=info msg="Unhealthy! We are done!" test=TestGameServerRestartBeforeReadyCrash
--- FAIL: TestGameServerRestartBeforeReadyCrash (211.69s)
        	Test:       	TestGameServerRestartBeforeReadyCrash

This looks like a good example of the test issue, because the GameServer should be able to go to Ready even if it has restarted beforehand.

Dumping logs for the GameServer from the controller:
https://cloudlogging.app.goo.gl/m569cPdnvxL5Q5xK6

And yep, can see that just before the GameServer gets marked as Unhealthy, the system sees the game server container to be terminated (even though it must be running, since it marked the GameServer as Ready.

(stripped some non-relevant fields)

{
    "insertId": "k9hmvq0mowv1v8ed",
    "jsonPayload": {
        "source": "*gameservers.HealthController",
        "gs": "game-serversz62f",
        "message": "skipUnhealthy: Container is terminated, returning false",
        "podStatus": {
            "startTime": "2021-11-01T15:33:38Z",
            "podIP": "10.40.10.245",
            "phase": "Running",
            "podIPs": [
                {
                    "ip": "10.40.10.245"
                }
            ],
            "hostIP": "10.138.15.235",
            "containerStatuses": [
                {
                    "started": true,
                    "image": "gcr.io/agones-images/agones-sdk:1.19.0-ea95f09",
                    "state": {
                        "running": {
                            "startedAt": "2021-11-01T15:33:41Z"
                        }
                    },
                    "name": "agones-gameserver-sidecar",
                    "ready": true,
                    "restartCount": 0,
                    "imageID": "gcr.io/agones-images/agones-sdk@sha256:df654c5772a0d96a9aeac8b909a2e15006bb00880d235fc3158f5e91a5d85f6c",
                    "containerID": "containerd://67a8c632347295ea47bd927561e6fcf79cc737c52e784bf7fdc238aeff69b54b",
                    "lastState": {}
                },
                {
                    "started": false,
                    "restartCount": 1,
                    "state": {
                        "terminated": {
                            "containerID": "containerd://a472c45938511a201af32f456ad5f5e0edf4986e52882af8f51ce966fa3ec1f9",
                            "reason": "Error",
                            "finishedAt": "2021-11-01T15:33:49Z",
                            "startedAt": "2021-11-01T15:33:45Z",
                            "exitCode": 1
                        }
                    },
                    "image": "gcr.io/agones-images/simple-game-server:0.4",
                    "name": "game-server",
                    "imageID": "gcr.io/agones-images/simple-game-server@sha256:ac3e5a8388677453834f567fe1561eaec0bb3dab4df6243933064823f5c144b6",
                    "lastState": {
                        "terminated": {
                            "reason": "Error",
                            "exitCode": 1,
                            "finishedAt": "2021-11-01T15:33:44Z",
                            "containerID": "containerd://610cf7106856d10afcd838a85c53935a8e4cd968cdfc81a2eaa4080422e8c7e7",
                            "startedAt": "2021-11-01T15:33:41Z"
                        }
                    },
                    "ready": false,
                    "containerID": "containerd://a472c45938511a201af32f456ad5f5e0edf4986e52882af8f51ce966fa3ec1f9"
                }
            ],
            "qosClass": "Burstable",
        }
    },
    "timestamp": "2021-11-01T15:34:09.175741350Z",
    "severity": "DEBUG",
    "logName": "projects/agones-images/logs/stderr",
    "receiveTimestamp": "2021-11-01T15:34:11.834098642Z"
}

The GameServer also thinks that what it sees as the currently terminated container is the one that has marked it as Ready

annotations: {
agones.dev/ready-container-id: "containerd://a472c45938511a201af32f456ad5f5e0edf4986e52882af8f51ce966fa3ec1f9"
agones.dev/sdk-version: "1.19.0-ea95f09"
}

This is a little odd, as we do have a check in the code to check that the state is Running. But race conditions do happen.

Let's see what the logs for the Pod look like (stripping Health Pings):
https://cloudlogging.app.goo.gl/nuhJYdNKdJS4teyu5

image

(READY repeats until shutdown, since the GameServer gets stuck in Unhealthy).

An interesting thing, since we retry CRASHing, we actually crash the container twice (which we see in the container status values) -- this probably reads much like the SDK Server starting up slow and forcing a crash.

So feeling pretty confident that I've got a handle on the issue. 🤞🏻

@markmandel
Copy link
Member Author

Reminder: Running tests lots: go test -run TestGameServerRestartBeforeReadyCrash -count 10 -timeout 8h

(I always forget the timeout)

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 72c24792-c67f-4bee-83a8-fe9623a97f74

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2351/head:pr_2351 && git checkout pr_2351
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-178c0ce

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: fc0977b6-1b5a-48ea-b268-50f9306677ec

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Copy link
Member Author

So this failure seems to be something different?

time="2021-11-02 01:33:28.344" level=info msg="waiting for fleet condition" fleet=simple-fleet-s2xkv test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:33:31.328" level=info msg="Checking Fleet Ready replicas" expected=10 fleet=simple-fleet-s2xkv fleetStatus="{Replicas:10 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:<nil>}" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:33:35.506" level=info msg="Checking Fleet Ready replicas" expected=10 fleet=simple-fleet-s2xkv fleetStatus="{Replicas:10 ReadyReplicas:6 ReservedReplicas:0 AllocatedReplicas:0 Players:<nil>}" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:33:40.106" level=info msg="Checking Fleet Ready replicas" expected=10 fleet=simple-fleet-s2xkv fleetStatus="{Replicas:10 ReadyReplicas:6 ReservedReplicas:0 AllocatedReplicas:0 Players:<nil>}" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:33:44.505" level=info msg="Checking Fleet Ready replicas" expected=10 fleet=simple-fleet-s2xkv fleetStatus="{Replicas:10 ReadyReplicas:6 ReservedReplicas:0 AllocatedReplicas:0 Players:<nil>}" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:33:47.305" level=info msg="Checking Fleet Ready replicas" expected=10 fleet=simple-fleet-s2xkv fleetStatus="{Replicas:10 ReadyReplicas:8 ReservedReplicas:0 AllocatedReplicas:0 Players:<nil>}" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:33:49.906" level=info msg="Checking Fleet Ready replicas" expected=10 fleet=simple-fleet-s2xkv fleetStatus="{Replicas:10 ReadyReplicas:8 ReservedReplicas:0 AllocatedReplicas:0 Players:<nil>}" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:33:53.906" level=info msg="Checking Fleet Ready replicas" expected=10 fleet=simple-fleet-s2xkv fleetStatus="{Replicas:10 ReadyReplicas:10 ReservedReplicas:0 AllocatedReplicas:0 Players:<nil>}" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:33:56.784" level=info msg="Could not read from address" address="34.82.230.193:7708" error="read udp 192.168.10.4:35296->34.82.230.193:7708: read: connection refused" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:33:57.788" level=info msg="Could not read from address" address="34.82.230.193:7708" error="read udp 192.168.10.4:58187->34.82.230.193:7708: read: connection refused" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:33:58.788" level=info msg="Could not read from address" address="34.82.230.193:7708" error="read udp 192.168.10.4:53510->34.82.230.193:7708: read: connection refused" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:33:59.796" level=info msg="Could not read from address" address="34.82.230.193:7708" error="read udp 192.168.10.4:40634->34.82.230.193:7708: read: connection refused" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:34:00.787" level=info msg="Could not read from address" address="34.82.230.193:7708" error="read udp 192.168.10.4:55548->34.82.230.193:7708: read: connection refused" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:34:01.787" level=info msg="Could not read from address" address="34.82.230.193:7708" error="read udp 192.168.10.4:53179->34.82.230.193:7708: read: connection refused" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:34:01.788" level=info msg="Could not read from address" address="34.82.230.193:7708" error="read udp 192.168.10.4:38081->34.82.230.193:7708: read: connection refused" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:34:01.789" level=info msg="Failed to send UDP packet to GameServer. Dumping Events!" gs=simple-fleet-s2xkv-6dgph-4jbxr status="{State:Ready Ports:[{Name:udp-port Port:7708}] Address:34.82.230.193 NodeName:gke-e2e-test-cluster-default-684ee24f-zahj ReservedUntil:<nil> Players:<nil>}" test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:34:01.789" level=info msg="Dumping Events:" kind=GameServer test=TestFleetRecreateGameServers/gameserver_shutdown
time="2021-11-02 01:34:01.873" level=info msg="Event!" lastTimestamp="2021-11-02 01:33:28 +0000 UTC" message="Port allocated" reason=PortAllocation test=TestFleetRecreateGameServers/gameserver_shutdown type=Normal
time="2021-11-02 01:34:01.874" level=info msg="Event!" lastTimestamp="2021-11-02 01:33:28 +0000 UTC" message="Pod simple-fleet-s2xkv-6dgph-4jbxr created" reason=Creating test=TestFleetRecreateGameServers/gameserver_shutdown type=Normal
time="2021-11-02 01:34:01.874" level=info msg="Event!" lastTimestamp="2021-11-02 01:33:28 +0000 UTC" message="Address and port populated" reason=Scheduled test=TestFleetRecreateGameServers/gameserver_shutdown type=Normal
time="2021-11-02 01:34:01.874" level=info msg="Event!" lastTimestamp="2021-11-02 01:33:33 +0000 UTC" message="SDK state change" reason=RequestReady test=TestFleetRecreateGameServers/gameserver_shutdown type=Normal
time="2021-11-02 01:34:01.874" level=info msg="Event!" lastTimestamp="2021-11-02 01:33:33 +0000 UTC" message="SDK.Ready() complete" reason=Ready test=TestFleetRecreateGameServers/gameserver_shutdown type=Normal
    --- FAIL: TestFleetRecreateGameServers/gameserver_shutdown (39.63s)

Looks like it's Ready, so I'm not sure why the connection is being refused.

Need to have a look into the logs for the game server, see if anything weird is going on there.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 64d0ec03-7407-4544-97fb-f5a9bea15dd4

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 0a91a8e1-c582-4980-a25f-078827ee5f49

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2351/head:pr_2351 && git checkout pr_2351
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-b787e6f

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: dbafa2b0-c787-4b4c-b3bc-d267cfbbce3d

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2351/head:pr_2351 && git checkout pr_2351
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-6eddb2e

@markmandel markmandel marked this pull request as ready for review November 2, 2021 18:50
@markmandel
Copy link
Member Author

Added some more unit tests, and made some minor tweaks to a few things.

I'm punting the connection issue to another PR. Assuming this otherwise passes on this run, I'll cleanup the PR description and the git history -- but it should be good to review now.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 5e7470a7-bf27-4b24-882e-87b09e9a7238

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2351/head:pr_2351 && git checkout pr_2351
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-daffc50

The Pod and GameServer data in the Informer cache are
eventually consistent and can result in data races.

This fix ensures that they are in sync at the time that
the GameServer is moved to Ready, so that Health checking
can behave correctly if a crash occurs after the fact.

This also includes a combination of smaller fixes/debug for e2e flakiness,
as we narrowed down the problem.

* Retries on UDP packet sending
* Failing on UDP packet sending dumps gameserver details
* Drop parallelism on e2e tests from 32 to 16.
@markmandel markmandel added the kind/bug These are bugs. label Nov 2, 2021
@markmandel markmandel changed the title Combination of fixes/debug for e2e flakiness Fix for health check race condition Nov 2, 2021
@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 706d3421-b8e6-493f-9e9c-4ec9ad15eaa9

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2351/head:pr_2351 && git checkout pr_2351
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-1f532c1

@roberthbailey roberthbailey merged commit 6941142 into googleforgames:main Nov 2, 2021
@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: markmandel, roberthbailey

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [markmandel,roberthbailey]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants