Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug with hung GameServer resource on Kubernetes 1.10 #278

Merged

Conversation

markmandel
Copy link
Member

It looks like the foregroundDeletion finalizer that gets added to the GameServer when a deletion with foreground propogation is requested - when it gets removed, by the Kubernetes system, the ResourceVersion/Generation
doesn't get incremented.

This means when we do an Update it will go through (usually it fails if the ResourceVersion/Generation has changed since the last update), rather than failing and going back to re-sync.

This can cause a race condition where the Update can basically put back the foregroundDeletion finalizer, if it gets removed between retrieving the GameServer object and doing the Update() operation.

We actually don't need the foregroundDeletion propogation on Delete(), as we only want to keep the GameServer around until the backing Pod is gone, so we can actually just turn this off instead in the code.

As a note - also explored using a Patch() operation to try and solve this problem. AFAIK, a patch (and specifically a JSONPatch - other patch types didn't seem better), can only do index based removal from a list of items.

This could just present other types of race conditions, as finalizers could be added/removed by other systems at the same time as well, changing the index to the finalizer we want at the same time.

Not sure if this is a bug in the Kubernetes system, or working as intended, but this solves the issue.

It looks like the `foregroundDeletion` finalizer that gets added to the
GameServer when a deletion with foreground propogation is requested -
when it gets removed, by the Kubernetes system, the ResourceVersion/Generation
doesn't get incremented.

This means when we do an `Update` it will go through (usually it fails if the
ResourceVersion/Generation has changed since the last update), rather than
failing and going back to re-sync.

This can cause a race condition where the `Update` can basically put back
the `foregroundDeletion` finalizer, if it gets removed between retrieving the
`GameServer` object and doing the `Update()` operation.

We actually don't need the `foregroundDeletion` propogation on `Delete()`, as
we only want to keep the `GameServer` around until the backing `Pod` is gone,
so we can actually just turn this off instead in the code.

As a note - also explored using a `Patch()` operation to try and solve this
problem. AFAIK, a patch (and specifically a JSONPatch - other patch types
didn't seem better), can only do index based removal from a list of items.

This could just present other types of race conditions, as finalizers could
be added/removed by other systems at the same time as well, changing the index
to the finalizer we want at the same time.

Not sure if this is a bug in the Kubernetes system, or working as intended,
but this solves the issue.
@markmandel markmandel added the kind/bug These are bugs. label Jun 30, 2018
@markmandel markmandel added this to the 0.3.0 milestone Jun 30, 2018
@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: bef30346-09c3-4006-bf6c-d848c07ce943

The following development artifacts have been built, and will exist for the next 30 days:

markmandel added a commit to markmandel/agones that referenced this pull request Jul 4, 2018
Update codegen library to 1.10.5, and regen the CRD clients.

This may be the cause of googleforgames#278 (but that change needed to happen anyway)
markmandel added a commit that referenced this pull request Jul 4, 2018
Update codegen library to 1.10.5, and regen the CRD clients.

This may be the cause of #278 (but that change needed to happen anyway)
@cyriltovena
Copy link
Collaborator

cyriltovena commented Jul 5, 2018

I'm reading this https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/, then i'll look at your code

@cyriltovena
Copy link
Collaborator

cyriltovena commented Jul 5, 2018

We actually don't need the foregroundDeletion propogation on Delete(), as we only want to keep the GameServer around until the backing Pod is gone, so we can actually just turn this off instead in the code.

So if I understand well we don't rely on k8s deleting dependent resources, because we're deleting the pods ourselves.

Copy link
Collaborator

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@markmandel
Copy link
Member Author

Thanks for the approve.
To be clear - this is actually because of Finalizers - which is what blocks the GameServer resource from being deleted before the Pod gets deleted, even on background object collection (which is the default).

I'm also running into other people who look to have potentially run into the same bug with Foreground deletion and finalisers, so I'm actually genuinely wondering if it's a real bug in K8s -- either way, we don't need the foreground deletion (since we handle the actual lifecycle of the GameServer ourselves with the finaliser) so we can remove it for now and skirt the issue.

Hope that makes sense.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 8876f7dd-3055-4f9f-98f4-8ec42d1afb53

The following development artifacts have been built, and will exist for the next 30 days:

@markmandel markmandel merged commit d3e9d7e into googleforgames:master Jul 9, 2018
@markmandel markmandel deleted the bug/shutdown-gameserver branch July 9, 2018 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug These are bugs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants