
[MRG] Readiness and Liveness probes re-added #1361

Merged
merged 4 commits into jupyterhub:master on Aug 20, 2019

Conversation

@consideRatio (Member) commented Aug 16, 2019

Addresses parts of issue #1357 and finally implements PR #1004, which I merged and then reverted in #1356 because having a livenessProbe was problematic when we only allowed a 30 second startup window.

Having a livenessProbe is certainly not out of the question, but it may need to allow a much longer startup time. Its key purpose would be to recover from a memory leak or similar, and those tend to occur much later anyhow.

tmshn and others added 3 commits August 16, 2019 14:52
The hub pod may lack time to start up properly. This is mainly because
the hub pod will inspect its previously known pods and ensure they exist
before starting to listen on endpoints like /hub/health. This inspection
can currently take up to 30 seconds, assuming the default value of
c.KubeSpawner.http_timeout. Without this configuration, the liveness
probes could conclude that the hub pod needs restarting after 3 failed
attempts made at 10 second intervals, so perhaps after 20 seconds.

This commit ensures we at least survive the initial 30 seconds.
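For context, the timing arithmetic above maps onto the standard Kubernetes probe fields. A minimal sketch of such a livenessProbe, assuming the hub's /hub/health endpoint on port 8081 (the port is an assumption here, not a value confirmed in this thread):

    # Sketch only: a livenessProbe that waits out the hub's ~30 second
    # startup inspection before its failure budget starts counting.
    livenessProbe:
      httpGet:
        path: /hub/health
        port: 8081             # assumed hub API port; verify against the chart
      initialDelaySeconds: 30  # survive the initial startup window
      periodSeconds: 10        # probe every 10 seconds
      failureThreshold: 3      # restart only after 3 consecutive failures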
@consideRatio consideRatio changed the title Readiness liveness probes Readiness and Liveness probes re-added Aug 16, 2019
@manics (Member) commented Aug 16, 2019

Maybe something for a future PR: the CI test script has several workarounds to ensure the hub is ready and to work around race conditions.

Presumably all these sections can be removed and replaced with just one loop that checks the deployment status is ready?

@betatim (Member) commented Aug 16, 2019

I left one big comment on the liveness probe for the hub. The other probes look good to me.

@consideRatio (Member, Author)

@manics ah yes, nice! We could wait for the hub and proxy to be ready according to the k8s readinessProbes.

Like this!

$ kubectl wait pod --for=condition=Ready --selector "component in (hub, proxy)"
pod/hub-68854b5749-fn89h condition met
pod/proxy-6d7dbdbf47-sljz2 condition met

@consideRatio consideRatio changed the title Readiness and Liveness probes re-added [MR] Readiness and Liveness probes re-added Aug 19, 2019
@consideRatio consideRatio changed the title [MR] Readiness and Liveness probes re-added [MRG] Readiness and Liveness probes re-added Aug 19, 2019
@consideRatio (Member, Author) commented Aug 19, 2019

The only downside of this change, as far as I know, is that we may delay the hub's first successful responses by up to 10 seconds, since the hub won't receive traffic until it is considered ready, and readiness is polled every 10 seconds.
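To illustrate the 10 second figure, a minimal sketch of the readiness side (again assuming the /hub/health endpoint and port, which are not spelled out in this thread):

    # Sketch only: with periodSeconds: 10, a hub that becomes healthy just
    # after a probe can wait up to 10 seconds for the next poll before it
    # is marked Ready and starts receiving traffic.
    readinessProbe:
      httpGet:
        path: /hub/health
        port: 8081       # assumed hub API port
      periodSeconds: 10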

The key benefit I see with this PR is that it prepares us for higher availability work, such as avoiding service disruptions of the hub pod during Helm chart upgrades: a new hub pod could start up alongside the old one and only receive traffic once it is ready, for example.

Speaking of which... My hub pod shuts down, and then starts up. I wonder if I could instead make it a rolling update since I have an external DB.

@betatim (Member) commented Aug 20, 2019

I'll merge this now.


Speaking of which... My hub pod shuts down, and then starts up. I wonder if I could instead make it a rolling update since I have an external DB.

I am not sure. Reading all your other comments/issues and your quest for HA hubs: I think that jupyterhub (the program) assumes it is the only one reading and writing to the database. Removing that assumption is (I think) what needs to happen before we can change the upgrade strategy and other HA related things. If the upgrade strategy wasn't Recreate, you could have a situation where the new hub starts, reads from the DB, does some other stuff, and then becomes ready. In the meantime (somewhere between the new hub reading from the DB and becoming ready) the old hub writes something to the DB. The new hub won't expect, or be set up to handle, this, and things will become inconsistent (the state in the brain of the hub is different from the state in the DB).
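For reference, this single-writer assumption is what the Recreate strategy encodes; a minimal Deployment fragment illustrating the field (a sketch, not the chart's literal manifest):

    # Sketch only: Recreate tears down the old hub pod before starting the
    # new one, so only one hub ever reads/writes the database at a time.
    # RollingUpdate would briefly run old and new hubs against the same DB.
    spec:
      strategy:
        type: Recreate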

@betatim betatim merged commit 2d435d6 into jupyterhub:master Aug 20, 2019
@manics (Member) commented Aug 21, 2019

Just had another thought: the probes are applied to the container, so could you move the database upgrade to an initContainer?

An added benefit, especially if full HA support is added, is that it'll give more control over the upgrade process.
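A rough sketch of what that might look like in the hub Deployment's pod spec (the container name, image tag, and reuse of the hub image are assumptions; jupyterhub upgrade-db is JupyterHub's schema upgrade command):

    # Hypothetical initContainer: run the DB schema upgrade to completion
    # before the hub container starts, so the probes only ever see a hub
    # running against a fully migrated database.
    initContainers:
      - name: hub-upgrade-db
        image: jupyterhub/k8s-hub:<tag>        # assumed: same image as the hub
        command: ["jupyterhub", "upgrade-db"]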

@consideRatio (Member, Author)

@manics ooooh yes an excellent idea! Could you repost that idea as a new issue?
