
[MRG] Readiness and Liveness probes re-added #1361

Merged
merged 4 commits into jupyterhub:master on Aug 20, 2019

Conversation

@consideRatio (Member) commented Aug 16, 2019

Addresses parts of issue #1357 and finally implements PR #1004, which I merged and then reverted in #1356 because having a livenessProbe was problematic when we only allowed a 30 second startup window.

Having a livenessProbe is certainly not out of the question, but it may need to allow a much longer startup time. Its key purpose would be to recover from a memory leak or similar, and those tend to occur much later anyhow.

tmshn and others added 3 commits August 16, 2019 14:52
The hub pod may lack time to start up properly. This is mainly because
the hub pod will inspect its previously known pods and ensure they exist
before starting to listen on endpoints like /hub/health. This inspection
can currently take up to 30 seconds, assuming the default value of
c.KubeSpawner.http_timeout. Without this configuration, the liveness
probes could conclude that the hub pod needs restarting after 3 failed
attempts made at 10 second intervals, so perhaps after 20 seconds.

This commit ensures we at least survive the initial 30 seconds.
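For context, the timing arithmetic above maps onto the standard Kubernetes probe fields. A minimal sketch of such a livenessProbe, assuming the hub's /hub/health endpoint on port 8081 (the port is an assumption here, not a value confirmed in this thread):

    # Sketch only: a livenessProbe that waits out the hub's ~30 second
    # startup inspection before its failure budget starts counting.
    livenessProbe:
      httpGet:
        path: /hub/health
        port: 8081             # assumed hub API port; verify against the chart
      initialDelaySeconds: 30  # survive the initial startup window
      periodSeconds: 10        # probe every 10 seconds
      failureThreshold: 3      # restart only after 3 consecutive failures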
@consideRatio consideRatio changed the title Readiness liveness probes Readiness and Liveness probes re-added Aug 16, 2019
@manics (Member) commented Aug 16, 2019

Maybe something for a future PR: the CI test script has several workarounds to ensure the hub is ready and to work around race conditions.

Presumably all these sections can be removed and replaced with just one loop that checks the deployment status is ready?

@betatim (Member) commented Aug 16, 2019

I left one big comment on the liveness probe for the hub. The other probes look good to me.

@consideRatio (Member, Author)

@manics ah yes, nice! We could wait for the hub and proxy to be ready according to the k8s readinessProbes.

Like this!

$ kubectl wait pod --for=condition=Ready --selector "component in (hub, proxy)"
pod/hub-68854b5749-fn89h condition met
pod/proxy-6d7dbdbf47-sljz2 condition met

@consideRatio consideRatio changed the title Readiness and Liveness probes re-added [MR] Readiness and Liveness probes re-added Aug 19, 2019
@consideRatio consideRatio changed the title [MR] Readiness and Liveness probes re-added [MRG] Readiness and Liveness probes re-added Aug 19, 2019
@consideRatio (Member, Author) commented Aug 19, 2019

The only downside of this change, as far as I know, is that we may delay the hub's first successful responses by up to 10 seconds, since the hub won't receive traffic until it is considered ready, and readiness is polled every 10 seconds.
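To illustrate the 10 second figure, a minimal sketch of the readiness side (again assuming the /hub/health endpoint and port, which are not spelled out in this thread):

    # Sketch only: with periodSeconds: 10, a hub that becomes healthy just
    # after a probe can wait up to 10 seconds for the next poll before it
    # is marked Ready and starts receiving traffic.
    readinessProbe:
      httpGet:
        path: /hub/health
        port: 8081       # assumed hub API port
      periodSeconds: 10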

The key benefit I see with this PR is that it prepares us for higher availability work, such as avoiding service disruptions of the hub pod during Helm chart upgrades: a new hub pod could start up alongside the old one and only receive traffic once it is ready, for example.

Speaking of which... My hub pod shuts down, and then starts up. I wonder if I could instead make it a rolling update since I have an external DB.

@betatim (Member) commented Aug 20, 2019

I'll merge this now.


Speaking of which... My hub pod shuts down, and then starts up. I wonder if I could instead make it a rolling update since I have an external DB.

I am not sure. Reading all your other comments/issues and your quest for HA hubs: I think that jupyterhub (the program) assumes it is the only one reading and writing to the database. Removing that assumption is (I think) what needs to happen before we can change the upgrade strategy and other HA related things. If the upgrade strategy wasn't Recreate, you could have a situation where the new hub starts, reads from the DB, does some other stuff, and then becomes ready. In the meantime (somewhere between the new hub reading from the DB and becoming ready) the old hub writes something to the DB. The new hub won't expect, or be set up to handle, this, and things will become inconsistent (the state in the brain of the hub is different from the state in the DB).
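For reference, this single-writer assumption is what the Recreate strategy encodes; a minimal Deployment fragment illustrating the field (a sketch, not the chart's literal manifest):

    # Sketch only: Recreate tears down the old hub pod before starting the
    # new one, so only one hub ever reads/writes the database at a time.
    # RollingUpdate would briefly run old and new hubs against the same DB.
    spec:
      strategy:
        type: Recreate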

@betatim betatim merged commit 2d435d6 into jupyterhub:master Aug 20, 2019
@manics (Member) commented Aug 21, 2019

Just had another thought: the probes are applied to the container, so could you move the database upgrade to an initContainer?

An added benefit, especially if full HA support is added, is that it'll give more control over the upgrade process.
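A rough sketch of what that might look like in the hub Deployment's pod spec (the container name, image tag, and reuse of the hub image are assumptions; jupyterhub upgrade-db is JupyterHub's schema upgrade command):

    # Hypothetical initContainer: run the DB schema upgrade to completion
    # before the hub container starts, so the probes only ever see a hub
    # running against a fully migrated database.
    initContainers:
      - name: hub-upgrade-db
        image: jupyterhub/k8s-hub:<tag>        # assumed: same image as the hub
        command: ["jupyterhub", "upgrade-db"]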

@consideRatio (Member, Author)

@manics ooooh yes an excellent idea! Could you repost that idea as a new issue?
