
user pods severely impacted after hub pod restart #3229

Closed
shaneknapp opened this issue Sep 22, 2023 · 13 comments

@shaneknapp

shaneknapp commented Sep 22, 2023

Bug description

this is a weird one, but after spending yesterday confirming w/the team here at uc berkeley we are pretty certain this Is Really Bad[tm].

when restarting a hub pod (we're on GCP), any logged in users have a high chance of their single user server pod losing its 'connection' to the hub and underlying filesystem, as well as beginning to behave erratically (more detail below). once the idle culler kills the process, that user pod becomes orphaned in either the 'Completed' or 'OOMKilled' state. these orphaned pods pile up (sometimes hitting close to 100), and unless we manually delete them they won't ever be cleaned up. if the user logs back in, the orphaned pod is reclaimed and finally goes away.

the observed erratic behavior on a single user server when a hub pod restarts:

  1. nothing happens, and you're able to continue working w/o issue (to the best of my understanding... i was the only one who didn't seem to be impacted by this while testing, but i might have missed something).
  2. your server freezes, and you need to restart it. any work done in a notebook etc that wasn't saved before this is lost.
  3. your server allows you to continue working, but throws up errors about the filesystem going away.
  4. your server allows you to continue working, but only notifies you that there are issues saving via a small status message of 'failed' that's next to the small 'autosave' status box at the top of the notebook.

this is Really Bad[tm] because a hub pod restarting shouldn't have an impact on any users' pods, and the underlying home dir filesystem going away is particularly troubling.

outside of the end-user impact, this also has a non-zero impact on our team: we can no longer safely deploy any changes to prod, as students and instructors losing their work is probably the worst thing that can happen. instead, we're deploying early in the morning to minimize impact... but the impact is still there.

the logs show that the PVCs w/homedirs couldn't be remounted and that active users were marked idle and killed during the restart, along with a variety of http 424, 405, 404, and 304 error codes. plenty of weirdness in there...

How to reproduce

  1. have a few people log in to a hub (we had 4 people) or via the admin panel, start a few different users' servers.
  2. restart the hub pod (i did this via `kubectl -n <hubname>-staging delete pod hub-5b6d7849dd-kx77z`).
  3. wait a couple of minutes....
  4. when the hub process restarts, any combination of the above four behaviors will commence.
  5. after 30m, any servers experiencing behaviors 2, 3, or 4 will be culled and then become orphaned in either the Completed or OOMKilled state (this can be shown via `kubectl get pods --all-namespaces | tr -s ' ' | cut -d ' ' -f 1,2,4 | grep -e Completed -e OOMKilled`).

to clean up any pods left over in step 5, we run `kubectl get pods --all-namespaces | tr -s ' ' | cut -d ' ' -f 1,2,4 | grep -e Completed -e OOMKilled | xargs -n3 bash -c 'kubectl --namespace=$0 delete pod $1'`
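
(for reference, here's that cleanup split out into a small script — a minimal sketch that just respells the one-liner above; the `--no-headers` flag is added to skip the header row, and it assumes the only pods stuck in Completed/OOMKilled are orphaned user pods that are safe to delete:)

```bash
#!/usr/bin/env bash
# delete orphaned user pods stuck in the Completed or OOMKilled state
# (same logic as the one-liner above, split out for readability)

kubectl get pods --all-namespaces --no-headers \
  | tr -s ' ' \
  | cut -d ' ' -f 1,2,4 \
  | grep -e Completed -e OOMKilled \
  | while read -r namespace pod status; do
      echo "deleting ${pod} (${status}) in namespace ${namespace}"
      kubectl --namespace="${namespace}" delete pod "${pod}"
    done
```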

Expected behaviour

users are unaware of, and not impacted by, hub pod restarts.

Actual behaviour

the observed erratic behavior on a single user server when a hub pod restarts:

  1. nothing happens, and you're able to continue working w/o issue.
  2. your server freezes, and you need to restart it. any work done in a notebook etc that wasn't saved before this is lost.
  3. your server allows you to continue working, but throws up errors about the filesystem going away.
  4. your server allows you to continue working, but only notifies you that there are issues saving via a small status message of 'failed' that's next to the small 'autosave' status box at the top of the notebook.

this happens for both classic notebooks and jupyterlab.

Your personal set up

UC Berkeley datahub: https://data.berkeley.edu/datahub
https://github.com/berkeley-dsep-infra/datahub/
running on GCP
GKE 1.25.11-gke.1700
z2jh installation
jupyterhub 4.0.2
jupyterlab 4.0.4
ubuntu 22.04LTS

Logs

here are some logs w/redacted usernames. i can provide more if necessary...

[hub-restart-stderr.json.gz](https://github.com/jupyterhub/jupyterhub/files/12703798/hub-restart-stderr.json.gz)
[user-errors-post-restart.json.gz](https://github.com/jupyterhub/jupyterhub/files/12703797/user-errors-post-restart.json.gz)

@yuvipanda this is pretty legit.... :\

shaneknapp added the bug label Sep 22, 2023
@shaneknapp
Author

this also could be tangentially related to jupyterhub/jupyterhub#4544 for orphaned resource cleanup...

@welcome

welcome bot commented Sep 22, 2023

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

manics transferred this issue from jupyterhub/jupyterhub Sep 22, 2023
@yuvipanda
Collaborator

@shaneknapp this could be jupyterhub/kubespawner#786 which is fixed by jupyterhub/kubespawner#742. Try kubespawner master in your hub image and see if that fixes this?
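
(for anyone following along, a minimal sketch of what that bump could look like — this assumes a pip-based hub image; the exact image/build layout is site-specific and not something kubespawner prescribes:)

```bash
# in the hub image build: install kubespawner from the main branch
# instead of the released version
pip install --upgrade "git+https://github.com/jupyterhub/kubespawner.git@main#egg=jupyterhub-kubespawner"

# then, from the running hub pod, confirm what actually got installed
pip list | grep kubespawner
```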

@shaneknapp
Author

shaneknapp commented Sep 24, 2023

@yuvipanda @consideRatio i did some testing and it LOOKS like bumping kubespawner to main fixes things. this is now live on the berkeley jupyterhubs. we'll do some more testing this week and report back.

@shaneknapp
Author

> @yuvipanda @consideRatio i did some testing and it LOOKS like bumping kubespawner to main fixes things. this is now live on the berkeley jupyterhubs. we'll do some more testing this week and report back.

```
(dh) ➜  datahub git:(use-master-kubespawner) kubectl exec -it hub-cd4799b-wgmbf -n data100-prod -- /bin/bash
jovyan@hub-cd4799b-wgmbf:/srv/jupyterhub$ pip list | grep kubespawner
jupyterhub-kubespawner           6.0.1.dev0
```

@shaneknapp
Author

ok..... more testing revealed that things behave better w/the updated kubespawner but i'm still experiencing some issues.

after a hub restart, two out of four test users were able to carry on w/o any issues: no disconnections from the hub in the logs, no filesystem errors etc.

however, the other two users were unable to save for ~2 minutes and lost the connection to the underlying homedir filesystem.

this was manifested in the notebook like so:

[screenshot omitted: Screenshot 2023-09-25 at 10 20 34 AM]

and the logs said that the token in the cookie expired:

{
  "textPayload": "[W 2023-09-25 10:20:14.748 SingleUserNotebookApp auth:739] Token stored in cookie may have expired",
  "insertId": "wevmob1pgtrjuz2h",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "project_id": "ucb-datahub-2018",
      "pod_name": "jupyter-REDACTED",
      "container_name": "notebook",
      "namespace_name": "logodev-staging",
      "cluster_name": "fall-2019",
      "location": "us-central1"
    }
  },
  "timestamp": "2023-09-25T17:20:14.748574469Z",
  "severity": "ERROR",
  "labels": {
    "k8s-pod/heritage": "jupyterhub",
    "k8s-pod/hub_jupyter_org/network-access-hub": "true",
    "k8s-pod/component": "singleuser-server",
    "compute.googleapis.com/resource_name": "gke-fall-2019-user-small-courses-2023-f5d5ad27-tz52",
    "k8s-pod/chart": "jupyterhub-3.0.2",
    "k8s-pod/hub_jupyter_org/servername": "",
    "k8s-pod/release": "logodev-staging",
    "k8s-pod/app": "jupyterhub"
  },
  "logName": "projects/ucb-datahub-2018/logs/stderr",
  "receiveTimestamp": "2023-09-25T17:20:15.740063406Z"
}
{
  "textPayload": "[W 2023-09-25 10:20:14.748 SingleUserNotebookApp handlers:649] wrote error: 'Forbidden'",
  "insertId": "rkg7cn89t13qh366",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "pod_name": "jupyter-REDACTED",
      "namespace_name": "logodev-staging",
      "location": "us-central1",
      "container_name": "notebook",
      "project_id": "ucb-datahub-2018",
      "cluster_name": "fall-2019"
    }
  },
  "timestamp": "2023-09-25T17:20:14.749128740Z",
  "severity": "ERROR",
  "labels": {
    "k8s-pod/hub_jupyter_org/network-access-hub": "true",
    "k8s-pod/app": "jupyterhub",
    "k8s-pod/chart": "jupyterhub-3.0.2",
    "k8s-pod/release": "logodev-staging",
    "compute.googleapis.com/resource_name": "gke-fall-2019-user-small-courses-2023-f5d5ad27-tz52",
    "k8s-pod/component": "singleuser-server",
    "k8s-pod/hub_jupyter_org/servername": "",
    "k8s-pod/heritage": "jupyterhub"
  },
  "logName": "projects/ucb-datahub-2018/logs/stderr",
  "receiveTimestamp": "2023-09-25T17:20:15.740063406Z"
}
{
  "textPayload": "[W 2023-09-25 10:20:14.749 SingleUserNotebookApp log:191] 403 GET /user/REDACTED/api/metrics/v1 (@10.0.3.195) 1.78ms",
  "insertId": "hkegipdr47pf81dy",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "container_name": "notebook",
      "pod_name": "jupyter-REDACTED",
      "project_id": "ucb-datahub-2018",
      "cluster_name": "fall-2019",
      "location": "us-central1",
      "namespace_name": "logodev-staging"
    }
  },
  "timestamp": "2023-09-25T17:20:14.749632890Z",
  "severity": "ERROR",
  "labels": {
    "compute.googleapis.com/resource_name": "gke-fall-2019-user-small-courses-2023-f5d5ad27-tz52",
    "k8s-pod/app": "jupyterhub",
    "k8s-pod/component": "singleuser-server",
    "k8s-pod/heritage": "jupyterhub",
    "k8s-pod/release": "logodev-staging",
    "k8s-pod/hub_jupyter_org/servername": "",
    "k8s-pod/chart": "jupyterhub-3.0.2",
    "k8s-pod/hub_jupyter_org/network-access-hub": "true"
  },
  "logName": "projects/ucb-datahub-2018/logs/stderr",
  "receiveTimestamp": "2023-09-25T17:20:15.740063406Z"
}

@shaneknapp
Author

another win! bumping our kubespawner install dist to main also got rid of culled pods ending up orphaned in the OOMKilled or Completed state!

@shaneknapp
Author

more testing, and this time: no user pods were interrupted in any way after hub restart. i'll do some more testing and see if i can reproduce what i saw yesterday.

unsure if it makes a difference, but yesterday i was testing on macos (m2 proc) and today on my silly win11 desktop workstation, both times using chrome.

@shaneknapp
Author

@consideRatio @yuvipanda i see that this was merged into the 2i2c repo... did it fix things, or have you not entered 'here be dragons' territory yet? :)

@yuvipanda
Collaborator

@shaneknapp for the token cookie thing, does the version of jupyterhub in the user image match what's in the hub pod? And is JUPYTERHUB_SINGLEUSER_APP still set? It could be set to NotebookApp to force use of the older jupyter_server setup.
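
(a minimal sketch of how to check both of those against a live deployment — the namespace and pod names below are placeholders, not taken from the logs above:)

```bash
# compare the jupyterhub version in the hub pod vs. the user image
kubectl -n <hubname>-prod exec hub-<hash> -- pip show jupyterhub | grep Version
kubectl -n <hubname>-prod exec jupyter-<username> -- pip show jupyterhub | grep Version

# see whether JUPYTERHUB_SINGLEUSER_APP is set in the user pod, and to what
# (prints nothing and exits non-zero if it isn't set)
kubectl -n <hubname>-prod exec jupyter-<username> -- printenv JUPYTERHUB_SINGLEUSER_APP
```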

@shaneknapp
Author

> @shaneknapp for the token cookie thing, does the version of jupyterhub in the user image match what's in the hub pod? And is JUPYTERHUB_SINGLEUSER_APP still set? It could be set to NotebookApp to force use of the older jupyter_server setup.

lemme try to reproduce on my lappy486.

@shaneknapp
Author

ok, i did a bunch more testing on both my mac and windows workstation (log in as others via admin, kill hub pod, frantically click 'save' on 5 diff notebooks) and i didn't get any filesystem errors.

simultaneously, we merged a PR to prod (on hubs w/many current users) about 45m ago and i haven't seen anything in the logs, nor are there any orphaned pods...

i think we're good to go?

@shaneknapp
Author

alright, a little more testing and a couple of trepidatious deployments later, and this definitely fixes things. thanks @yuvipanda @consideRatio for the help and for getting kubespawner 6.1.0 out the door!

we'll revert our kubespawner changes once 6.1.0 is live.
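
(the revert itself should be small — a minimal sketch, again assuming a pip-based hub image:)

```bash
# once the release is on PyPI, swap the git checkout back to a pinned release
pip install "jupyterhub-kubespawner==6.1.0"
pip list | grep kubespawner   # should now report 6.1.0
```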
