user pods severely impacted after hub pod restart #3229
Comments
this could also be tangentially related to jupyterhub/jupyterhub#4544 for orphaned resource cleanup...
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗
@shaneknapp this could be jupyterhub/kubespawner#786 which is fixed by jupyterhub/kubespawner#742. Try kubespawner master in your hub image and see if that fixes this?
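(for anyone landing here later: a minimal sketch of what "try kubespawner master" can look like, assuming a z2jh-style custom hub image where the install happens in a Dockerfile RUN step; the @main ref is an assumption, substitute whatever branch or commit you want to test)
# install kubespawner from git instead of the pinned release
# (hypothetical: bake this into your hub image's Dockerfile as a RUN step)
pip install --no-cache-dir git+https://github.com/jupyterhub/kubespawner@main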
@yuvipanda @consideRatio i did some testing and it LOOKS like bumping kubespawner to ...
another win! bumping our kubespawner install dist to ...
more testing, and this time: no user pods were interrupted in any way after hub restart. i'll do some more testing and see if i can reproduce what i saw yesterday. unsure if it makes a difference, but i was testing yesterday on chrome, macos, m2 proc. today on my silly win11 desktop workstation. both times using chrome.
@consideRatio @yuvipanda i see that this was merged into the 2i2c repo... did it fix things, or have you not entered 'here be dragons' territory yet? :)
@shaneknapp for the token cookie thing, does the version of jupyterhub in the user image match what's in the hub pod? And is ...
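(a quick way to compare those versions, sketched with placeholder namespace/pod names; assumes the hub deployment is named `hub` per z2jh defaults and that jupyterhub is on PATH in the user image:)
# jupyterhub version in the hub pod
kubectl -n <hubname>-prod exec deploy/hub -- jupyterhub --version
# jupyterhub version in a running user pod
kubectl -n <hubname>-prod exec jupyter-<username> -- jupyterhub --version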
lemme try to reproduce on my lappy486.
ok, i did a bunch more testing on both my mac and windows workstation (log in as others via admin, kill hub pod, frantically click 'save' on 5 diff notebooks) and i didn't get any filesystem errors. simultaneously, we merged a PR to prod (on hubs w/many current users) about 45m ago and i haven't seen anything in the logs, nor are there any orphaned pods... i think we're good to go?
alright, a little more testing and a couple of trepidatious deployments later and this definitely fixes things. thanks @yuvipanda @consideRatio for help and getting kubespawner 6.1.0 out the door! we'll revert our kubespawner changes once 6.1.0 is live.
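(for the record, reverting the git pin once the release lands is a one-line swap; `jupyterhub-kubespawner` is the pypi dist name, and exactly where this pin lives in your hub image is up to you:)
# swap the git install back to the 6.1.0 release once it's on pypi
pip install --no-cache-dir jupyterhub-kubespawner==6.1.0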
Bug description
this is a weird one, but after spending yesterday confirming w/the team here at uc berkeley, we are pretty certain this Is Really Bad[tm].
when restarting a hub pod (we're on GCP), any logged-in users have a high chance of their single user server pod losing its 'connection' to the hub and underlying filesystem, as well as beginning to behave erratically (more detail below). once the idle culler kills the process, that user pod becomes orphaned in either the 'Completed' or 'OOMKilled' state. these orphaned pods pile up (sometimes hitting close to 100) and won't ever be cleaned up unless we manually delete them. if the user logs back in, the orphaned pod is reclaimed and finally goes away.
this is Really Bad[tm] because a hub pod restarting shouldn't have an impact on any users' pods, and the underlying home dir filesystem going away is particularly troubling.
outside of the end-user impact, this also has a non-zero impact on our team: we can no longer safely deploy any changes to prod, as students and instructors losing their work is probably the worst thing that can happen. instead, we're deploying early in the morning to minimize impact... but the impact is still there.
the logs show that the PVCs w/homedirs couldn't be remounted, that active users were marked idle and culled during the restart, and a variety of HTTP 424, 405, 404, and 304 status codes. plenty of weirdness in there...
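(for reference, a sketch of how to pull those logs for grepping; the deployment name `hub` is the z2jh default, and the grep patterns are rough guesses you'd adjust to your actual log format:)
# dump hub logs (add --previous to get the pre-restart container's logs)
kubectl -n <hubname>-prod logs deploy/hub --timestamps > hub-restart.log
# rough passes over the interesting bits
grep -E ' (424|405|404|304) ' hub-restart.log   # the http status codes we saw
grep -i cull hub-restart.log                    # idle culler activity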
How to reproduce
kubectl -n <hubname>-staging delete pod hub-5b6d7849dd-kx77z
kubectl get pods --all-namespaces | tr -s ' ' | cut -d ' ' -f 1,2,4 | grep -e Completed -e OOMKilled
to clean up any leftover pods, we run:
kubectl get pods --all-namespaces | tr -s ' ' | cut -d ' ' -f 1,2,4 | grep -e Completed -e OOMKilled | xargs -n3 bash -c 'kubectl --namespace=$0 delete pod $1'
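(same cleanup, broken out with comments for readability -- a sketch that assumes orphaned pods show up with a Completed or OOMKilled status, same as the one-liner above:)
#!/usr/bin/env bash
# find pods stuck in Completed/OOMKilled across all namespaces and delete them
kubectl get pods --all-namespaces --no-headers \
  | awk '$4 == "Completed" || $4 == "OOMKilled" {print $1, $2}' \
  | while read -r ns pod; do
      kubectl --namespace="$ns" delete pod "$pod"
    done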
Expected behaviour
users are unaware of, and not impacted by, hub pod restarts.
Actual behaviour
the erratic behavior described above (user servers losing their connection to the hub and home directory filesystem, then eventually being culled and orphaned) happens for both classic notebooks and jupyterlab.
Your personal set up
UC Berkeley datahub: https://data.berkeley.edu/datahub
https://github.com/berkeley-dsep-infra/datahub/
running on GCP
GKE 1.25.11-gke.1700
z2jh installation
jupyterhub 4.0.2
jupyterlab 4.0.4
ubuntu 22.04 LTS
Logs
here are some logs w/redacted usernames. i can provide more if necessary...
[hub-restart-stderr.json.gz](https://github.com/jupyterhub/jupyterhub/files/12703798/hub-restart-stderr.json.gz)
[user-errors-post-restart.json.gz](https://github.com/jupyterhub/jupyterhub/files/12703797/user-errors-post-restart.json.gz)
@yuvipanda this is pretty legit.... :\