Readiness/Liveness setup challenge #1357
I think the issue is that the readiness probe of the proxy, which redirects to the hub's readiness check, and the hub pod's own liveness/readiness checks are problematic.
Hmmmm yes okay so I got it to work again later now... Why? I think my hub pod failed to enter a responsive state because it got stuck waiting for response from singleuser servers, thereby restarting due to failed liveness checks, and got stuck again etc. So why does it get stuck waiting for the singleuser servers like this etc? I figure a good option for now could perhaps be to enable these liveness probes optionally or at least have a flag to disable them. |
Do you know if your hub was completely unresponsive, so that even if the health checks were disabled it'd still be unusable, or if the health check was giving an incorrect response?
I'm not sure yet @manics, it is a bit hard to reproduce, and trying will probably cause some user disruption. I'll live-test these probes further. @manics I've now dug into the code base, and can conclude that tornado handlers are not initialized until various other methods have completed, and those can take 30+ seconds. I'm now quite confident my issue arose from this. So, the key point would be to resolve that.
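The situation described above can be sketched as a toy model in plain Python (this is not JupyterHub code; `ToyHub`, `initialize`, and `health` are made-up names): a server whose health endpoint cannot answer at all until a slow initialization step has finished, so any probe arriving before that simply fails.

```python
import time


class ToyHub:
    """Toy model of a server whose handlers only work after init completes."""

    def __init__(self):
        self.ready = False

    def initialize(self, init_seconds: float) -> None:
        # Simulates slow startup work (e.g. waiting on unreachable
        # singleuser servers) before the web handlers are registered.
        time.sleep(init_seconds)
        self.ready = True

    def health(self) -> int:
        # A liveness probe hitting this before init finishes gets no answer.
        if not self.ready:
            raise ConnectionError("hub not listening yet")
        return 200
```

If the probe's failure budget is shorter than `initialize()` takes, the pod is killed and restarted before it ever becomes responsive, which repeats indefinitely.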
## Suggested fixes

### Fix 1 - /health should go to /hub/health

If we don't, we end up with lots of redirects. I think this is fine no matter what, but there is no point in bouncing around. I've also learned that if the redirect is to the same pod, it will be followed, but if the redirect is to another pod, it won't be followed after k8s 1.14.1 I think. This could cause issues in the future; I'm not 100% sure. Anyhow, let's avoid it by directly requesting /hub/health instead of /health. Hmmm, but at the same time... is the

### Fix 2 - adjust liveness check params

Details about liveness/readiness probes can be found in the docs here. I think the failure I ran into relates to quite hard constraints on these: the failure was caused by not having a ready hub pod within 30 seconds (3 failure periods * 10 seconds per period + 0 initial seconds). I think increasing the liveness probe's failure threshold to 5 may be enough to resolve my issue. The default values for the liveness/readiness probes are:

```go
expectedProbe := v1.Probe{
    InitialDelaySeconds: 0,
    TimeoutSeconds:      1,
    PeriodSeconds:       10,
    SuccessThreshold:    1,
    FailureThreshold:    3,
}
```

### Fix 3 - adjust timeouts blocking hub's tornado endpoint startup

Before tornado starts serving anything, other functions have to complete first, and they can block. The default value of `http_timeout` is documented here:

https://jupyterhub-kubespawner.readthedocs.io/en/latest/spawner.html

I think this is why I ran into issues. My hub pod got stuck waiting for a response from a server that it could not reach, and after a while the liveness probe restarted the hub pod. So, changing `http_timeout` will have a great impact on hub pod startup time, which in turn affects whether the liveness probe settings for the hub can make sense.

## Questions raised

### Q1 - can users with started singleuser servers continue to work with the hub pod down?

If we couple the proxy to the hub, the proxy-public service returns "service unavailable" whenever the hub pod isn't responding. Do we really want this? Is it possible for the proxy to let users keep working while the hub pod is down? I can imagine that the hub pod needs to verify authentication and authorization for access, but I can also imagine that the proxy pod can know access is OK, at least for a while, assuming some cookie is set. Hmmm... I created Q3 to represent this question.

### Q2 - do we avoid circular issues?

If the hub relies on the proxy to set up, and the proxy isn't reachable because its readiness probe references the hub, then we have a serious stability issue. I investigated this to some degree, and I think we are good. I figure that when the hub starts, it runs initialize() and then start(), where it actually binds to the network interface; traffic can then start to arrive and the hub becomes responsive. During initialize(), it runs checks on all preexisting routes it knows of from its stored state. After that, within start() but after it has started listening for requests, it updates the proxy to route properly within the check_routes() function and sets up a periodic call to this function. From what I understand, the verification calls in
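To make the probe arithmetic in Fix 2 concrete, here is a small sketch (plain Python; the function name is mine) of the worst-case time the kubelet allows before restarting a container that never passes its liveness probe, ignoring probe timeouts and scheduling jitter:

```python
def restart_window(initial_delay: int = 0, period: int = 10,
                   failure_threshold: int = 3) -> int:
    """Approximate worst-case seconds before a never-healthy container is
    restarted: initialDelaySeconds + periodSeconds * failureThreshold."""
    return initial_delay + period * failure_threshold


print(restart_window())                      # chart defaults: 30 seconds
print(restart_window(failure_threshold=5))   # bumped threshold: 50 seconds
```

With the defaults above, a hub that needs more than ~30 seconds of blocking startup work will be killed before it ever answers; raising `failureThreshold` to 5 extends the budget to ~50 seconds.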
Answer: I think we avoid circular issues where the hub pod relies on the proxy pod and vice versa, but I'm not confident about that. We should be fine if we configure the readiness probe of the proxy to use the /hub/health endpoint of the hub.

### Q3 - what is the authorization flow for a user asking the proxy for access to their server?

Will the proxy always ask the hub whether the user has access to the user pod, or will it sometimes use cached credentials?

### Q4 - is the proxy decoupling itself from traffic?

It is my understanding that the proxy pod will redirect the user directly to the pod, but I'm a bit confused here... Hmmm okay, I tried accessing myhub.example.com/user/some-other-user. That worked, and I ended up at JupyterHub being asked for permission to provide that user server with information about my already-logged-in user. This means that the proxy will proxy no matter what, and if the user server isn't provided with a suitable cookie granting access, then that's the user server's concern. So, I think the proxy is channeling all traffic through itself, and therefore I see no reason for the user server to be inaccessible if the hub pod is down for a while.

From this I conclude that I think the readiness of the proxy should not be tied to the hub pod being up. I think the readiness of the proxy pod could be allowed to be

Hmmm... So I think pods are accessible without the readiness probe being OK, but services won't send traffic there until it is. Hmmm... I'm thinking about switchover scenarios when we do a rolling upgrade of proxy pods, for example, and one starts up alongside another... Hmmm... Inspecting the proxy pod, it starts up with a default route to the hub, but it also needs to be actively configured by the hub after its startup. But, the hub is really only configuring the proxy k8s service, which will route to one of the already-ready proxy pods!

Conclusions so far:
If we could instead configure the routes externally to the pods, and let the pods use that information, then we would avoid such challenges. We could also avoid them by having the proxy pods poll the hub for configuration about routes. Hmm... Anyhow, at the moment we need new proxy pods to be able to get configured at all in the first place, and that at least requires them to be ready. To avoid circular issues, they would need to self-configure, or at least configure without needing a web request to reach them before they are ready. I've learned that if we don't define a readiness probe, pods will be ready as soon as all their containers have started.
https://jupyterhub.readthedocs.io/en/stable/reference/separate-proxy.html implies the proxy will allow access to a singleuser pod if the hub subsequently goes down. I guess the problem is that the proxy service is actually two services, one for the hub and one for the servers. I think
The other way to look at it is that the proxy is "just a proxy": each individual singleuser server and the hub pod are all separate services, so the proxy health check should reflect only whether the proxy is available, regardless of the backend services. This means it should present a nice error page to the user if the hub isn't available.
## Current suggestion

## Long term suggestions

We should ensure the proxy pods are configured more reliably. They should be configured even when there is more than one of them, and they should be able to be configured while not yet ready and not yet accepting traffic from a service. They could, for example, become ready once they have read available routes from a ConfigMap or similar.
I'm not confident about what goes on here, but the default helm value we provide has a

With this in mind, I guess we should configure the probes to not specifically visit the path /hub/health but instead
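As a reference point, a hub-pod liveness probe along the lines of Fix 1 and Fix 2 might look roughly like this (an illustrative sketch only, with hypothetical port name and values; as noted above, the path may need to incorporate the configured base URL rather than being hard-coded):

```yaml
livenessProbe:
  httpGet:
    path: /hub/health    # may need a base-URL prefix, per the note above
    port: http           # hypothetical named port on the hub container
  initialDelaySeconds: 0
  periodSeconds: 10
  failureThreshold: 5    # raised from the default of 3, per Fix 2
```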
@consideRatio yes, that sounds right!
This works fine after bumping to latest JupyterHub in #1422 |
I recently merged #1004, which had successful CI tests, but then reverted the merge in #1356 after concluding my upgrade failed.

I'm not confident about what is going on. My hub and proxy pods weren't entering a Ready state after this PR was merged by me. I'll investigate things further, but for now I figure I'll revert the PR so it does not cause disruptions for others like me.
I wonder if this is related to:

- `scheme: HTTPS` — UPDATE: No, I'm quite confident this isn't it after trying to access the endpoints from another pod. Using http should be fine.
- The `Host:` header, as this controller routes based on the web request's host header. UPDATE: No, I think k8s should approve any response between 200 and <400; also, it says timeout.
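The status-code claim in the second bullet can be stated precisely: the kubelet treats any HTTP status from 200 up to, but not including, 400 as probe success, so redirects alone do not fail a probe (the function name below is mine):

```python
def http_probe_success(status_code: int) -> bool:
    """Kubelet's documented HTTP probe rule: any status in [200, 400)
    counts as success, so a 302 redirect does not fail the probe."""
    return 200 <= status_code < 400


print(http_probe_success(302))  # True: a redirect alone won't fail the check
print(http_probe_success(503))  # False: server errors fail the probe
```

So if a probe fails with a timeout, as reported above, the problem is the endpoint not answering at all, not the status code it returns.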