
user pods severely impacted after hub pod restart #3229

Closed
shaneknapp opened this issue Sep 22, 2023 · 13 comments

@shaneknapp

shaneknapp commented Sep 22, 2023

Bug description

this is a weird one, but after spending yesterday confirming w/the team here at uc berkeley we are pretty certain this Is Really Bad[tm].

when restarting a hub pod (we're on GCP), any logged in users have a high chance of their single user server pod losing its 'connection' to the hub and underlying filesystem, as well as beginning to behave erratically (more detail below). once the idle culler kills the process, that user pod becomes orphaned in either the 'Completed' or 'OOMKilled' state. these orphaned pods pile up (sometimes hitting close to 100), and unless we manually delete them they won't ever be cleaned up. if the user logs back in, the orphaned pod is reclaimed and finally goes away.

the observed erratic behavior on a single user server when a hub pod restarts:

  1. nothing happens, and you're able to continue working w/o issue (to the best of my understanding... i was the only one who didn't seem to be impacted by this while testing, but i might have missed something).
  2. your server freezes, and you need to restart it. any work done in a notebook etc that wasn't saved before this is lost.
  3. your server allows you to continue working, but throws up errors about the filesystem going away.
  4. your server allows you to continue working, but only notifies you that there are issues saving via a small status message of 'failed' that's next to the small 'autosave' status box at the top of the notebook.

this is Really Bad[tm] because a hub pod restarting shouldn't have an impact on any users' pods, and the underlying home dir filesystem going away is particularly troubling.

outside of the end-user impact, this also has a non-zero impact on our team: we can no longer safely deploy any changes to prod, as students and instructors losing their work is probably the worst thing that can happen. instead, we're deploying early in the morning to minimize impact... but the impact is still there.

the logs show that the PVCs w/homedirs couldn't be remounted and that active users were marked idle and killed during the restart, along with a variety of http 424, 405, 404, and 304 error codes. plenty of weirdness in there...

How to reproduce

  1. have a few people log in to a hub (we had 4 people) or via the admin panel, start a few different users' servers.
  2. restart the hub pod (i did this via `kubectl -n <hubname>-staging delete pod hub-5b6d7849dd-kx77z`).
  3. wait a couple of minutes....
  4. when the hub process restarts, any combination of the above four behaviors will commence.
  5. after 30m, any servers experiencing behaviors 2, 3, or 4 will be culled and then become orphaned in either the Completed or OOMKilled state (this can be shown via `kubectl get pods --all-namespaces | tr -s ' ' | cut -d ' ' -f 1,2,4 | grep -e Completed -e OOMKilled`).

to clean up any pods left over in step 5, we run `kubectl get pods --all-namespaces | tr -s ' ' | cut -d ' ' -f 1,2,4 | grep -e Completed -e OOMKilled | xargs -n3 bash -c 'kubectl --namespace=$0 delete pod $1'`
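
(for reference, here's that cleanup split out into a small script — a minimal sketch that just respells the one-liner above; the `--no-headers` flag is added to skip the header row, and it assumes the only pods stuck in Completed/OOMKilled are orphaned user pods that are safe to delete:)

```bash
#!/usr/bin/env bash
# delete orphaned user pods stuck in the Completed or OOMKilled state
# (same logic as the one-liner above, split out for readability)

kubectl get pods --all-namespaces --no-headers \
  | tr -s ' ' \
  | cut -d ' ' -f 1,2,4 \
  | grep -e Completed -e OOMKilled \
  | while read -r namespace pod status; do
      echo "deleting ${pod} (${status}) in namespace ${namespace}"
      kubectl --namespace="${namespace}" delete pod "${pod}"
    done
```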

Expected behaviour

users are unaware of, and not impacted by, hub pod restarts.

Actual behaviour

the observed erratic behavior on a single user server when a hub pod restarts:

  1. nothing happens, and you're able to continue working w/o issue.
  2. your server freezes, and you need to restart it. any work done in a notebook etc that wasn't saved before this is lost.
  3. your server allows you to continue working, but throws up errors about the filesystem going away.
  4. your server allows you to continue working, but only notifies you that there are issues saving via a small status message of 'failed' that's next to the small 'autosave' status box at the top of the notebook.

this happens for both classic notebooks and jupyterlab.

Your personal set up

UC Berkeley datahub: https://data.berkeley.edu/datahub
https://github.com/berkeley-dsep-infra/datahub/
running on GCP
GKE 1.25.11-gke.1700
z2jh installation
jupyterhub 4.0.2
jupyterlab 4.0.4
ubuntu 22.04LTS

Logs

here are some logs w/redacted usernames. i can provide more if necessary...

[hub-restart-stderr.json.gz](https://github.com/jupyterhub/jupyterhub/files/12703798/hub-restart-stderr.json.gz)
[user-errors-post-restart.json.gz](https://github.com/jupyterhub/jupyterhub/files/12703797/user-errors-post-restart.json.gz)

@yuvipanda this is pretty legit.... :\

shaneknapp added the bug label Sep 22, 2023
@shaneknapp
Author

this also could be tangentially related to jupyterhub/jupyterhub#4544 for orphaned resource cleanup...

@welcome

welcome bot commented Sep 22, 2023

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

manics transferred this issue from jupyterhub/jupyterhub Sep 22, 2023
@yuvipanda
Collaborator

@shaneknapp this could be jupyterhub/kubespawner#786 which is fixed by jupyterhub/kubespawner#742. Try kubespawner master in your hub image and see if that fixes this?
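
(for anyone following along, a minimal sketch of what that bump could look like — this assumes a pip-based hub image; the exact image/build layout is site-specific and not something kubespawner prescribes:)

```bash
# in the hub image build: install kubespawner from the main branch
# instead of the released version
pip install --upgrade "git+https://github.com/jupyterhub/kubespawner.git@main#egg=jupyterhub-kubespawner"

# then, from the running hub pod, confirm what actually got installed
pip list | grep kubespawner
```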

@shaneknapp
Author

shaneknapp commented Sep 24, 2023

@yuvipanda @consideRatio i did some testing and it LOOKS like bumping kubespawner to main fixes things. this is now live on the berkeley jupyterhubs. we'll do some more testing this week and report back.

@shaneknapp
Author

> @yuvipanda @consideRatio i did some testing and it LOOKS like bumping kubespawner to main fixes things. this is now live on the berkeley jupyterhubs. we'll do some more testing this week and report back.

```
(dh) ➜  datahub git:(use-master-kubespawner) kubectl exec -it hub-cd4799b-wgmbf -n data100-prod -- /bin/bash
jovyan@hub-cd4799b-wgmbf:/srv/jupyterhub$ pip list | grep kubespawner
jupyterhub-kubespawner           6.0.1.dev0
```

@shaneknapp
Author

ok..... more testing revealed that things behave better w/the updated kubespawner but i'm still experiencing some issues.

after a hub restart, two out of four test users were able to carry on w/o any issues: no disconnections from the hub in the logs, no filesystem errors etc.

however, the other two users were unable to save for ~2 minutes and lost the connection to the underlying homedir filesystem.

this was manifested in the notebook like so:

[screenshot omitted: Screenshot 2023-09-25 at 10 20 34 AM]

and the logs said that the token in the cookie expired:

{
  "textPayload": "[W 2023-09-25 10:20:14.748 SingleUserNotebookApp auth:739] Token stored in cookie may have expired",
  "insertId": "wevmob1pgtrjuz2h",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "project_id": "ucb-datahub-2018",
      "pod_name": "jupyter-REDACTED",
      "container_name": "notebook",
      "namespace_name": "logodev-staging",
      "cluster_name": "fall-2019",
      "location": "us-central1"
    }
  },
  "timestamp": "2023-09-25T17:20:14.748574469Z",
  "severity": "ERROR",
  "labels": {
    "k8s-pod/heritage": "jupyterhub",
    "k8s-pod/hub_jupyter_org/network-access-hub": "true",
    "k8s-pod/component": "singleuser-server",
    "compute.googleapis.com/resource_name": "gke-fall-2019-user-small-courses-2023-f5d5ad27-tz52",
    "k8s-pod/chart": "jupyterhub-3.0.2",
    "k8s-pod/hub_jupyter_org/servername": "",
    "k8s-pod/release": "logodev-staging",
    "k8s-pod/app": "jupyterhub"
  },
  "logName": "projects/ucb-datahub-2018/logs/stderr",
  "receiveTimestamp": "2023-09-25T17:20:15.740063406Z"
}
{
  "textPayload": "[W 2023-09-25 10:20:14.748 SingleUserNotebookApp handlers:649] wrote error: 'Forbidden'",
  "insertId": "rkg7cn89t13qh366",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "pod_name": "jupyter-REDACTED",
      "namespace_name": "logodev-staging",
      "location": "us-central1",
      "container_name": "notebook",
      "project_id": "ucb-datahub-2018",
      "cluster_name": "fall-2019"
    }
  },
  "timestamp": "2023-09-25T17:20:14.749128740Z",
  "severity": "ERROR",
  "labels": {
    "k8s-pod/hub_jupyter_org/network-access-hub": "true",
    "k8s-pod/app": "jupyterhub",
    "k8s-pod/chart": "jupyterhub-3.0.2",
    "k8s-pod/release": "logodev-staging",
    "compute.googleapis.com/resource_name": "gke-fall-2019-user-small-courses-2023-f5d5ad27-tz52",
    "k8s-pod/component": "singleuser-server",
    "k8s-pod/hub_jupyter_org/servername": "",
    "k8s-pod/heritage": "jupyterhub"
  },
  "logName": "projects/ucb-datahub-2018/logs/stderr",
  "receiveTimestamp": "2023-09-25T17:20:15.740063406Z"
}
{
  "textPayload": "[W 2023-09-25 10:20:14.749 SingleUserNotebookApp log:191] 403 GET /user/REDACTED/api/metrics/v1 (@10.0.3.195) 1.78ms",
  "insertId": "hkegipdr47pf81dy",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "container_name": "notebook",
      "pod_name": "jupyter-REDACTED",
      "project_id": "ucb-datahub-2018",
      "cluster_name": "fall-2019",
      "location": "us-central1",
      "namespace_name": "logodev-staging"
    }
  },
  "timestamp": "2023-09-25T17:20:14.749632890Z",
  "severity": "ERROR",
  "labels": {
    "compute.googleapis.com/resource_name": "gke-fall-2019-user-small-courses-2023-f5d5ad27-tz52",
    "k8s-pod/app": "jupyterhub",
    "k8s-pod/component": "singleuser-server",
    "k8s-pod/heritage": "jupyterhub",
    "k8s-pod/release": "logodev-staging",
    "k8s-pod/hub_jupyter_org/servername": "",
    "k8s-pod/chart": "jupyterhub-3.0.2",
    "k8s-pod/hub_jupyter_org/network-access-hub": "true"
  },
  "logName": "projects/ucb-datahub-2018/logs/stderr",
  "receiveTimestamp": "2023-09-25T17:20:15.740063406Z"
}

@shaneknapp
Author

another win! bumping our kubespawner install dist to main also got rid of culled pods ending up orphaned in the OOMKilled or Completed state!

@shaneknapp
Author

more testing, and this time: no user pods were interrupted in any way after hub restart. i'll do some more testing and see if i can reproduce what i saw yesterday.

unsure if it makes a difference, but yesterday i was testing on macos (m2 proc) and today on my silly win11 desktop workstation, both times using chrome.

@shaneknapp
Author

@consideRatio @yuvipanda i see that this was merged into the 2i2c repo... did it fix things, or have you not entered 'here be dragons' territory yet? :)

@yuvipanda
Collaborator

@shaneknapp for the token cookie thing, does the version of jupyterhub in the user image match what's in the hub pod? And is JUPYTERHUB_SINGLEUSER_APP still set? It could be set to NotebookApp to force use of the older jupyter_server setup.
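
(a minimal sketch of how to check both of those against a live deployment — the namespace and pod names below are placeholders, not taken from the logs above:)

```bash
# compare the jupyterhub version in the hub pod vs. the user image
kubectl -n <hubname>-prod exec hub-<hash> -- pip show jupyterhub | grep Version
kubectl -n <hubname>-prod exec jupyter-<username> -- pip show jupyterhub | grep Version

# see whether JUPYTERHUB_SINGLEUSER_APP is set in the user pod, and to what
# (prints nothing and exits non-zero if it isn't set)
kubectl -n <hubname>-prod exec jupyter-<username> -- printenv JUPYTERHUB_SINGLEUSER_APP
```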

@shaneknapp
Author

> @shaneknapp for the token cookie thing, does the version of jupyterhub in the user image match what's in the hub pod? And is JUPYTERHUB_SINGLEUSER_APP still set? It could be set to NotebookApp to force use of the older jupyter_server setup.

lemme try to reproduce on my lappy486.

@shaneknapp
Author

ok, i did a bunch more testing on both my mac and windows workstation (log in as others via admin, kill hub pod, frantically click 'save' on 5 diff notebooks) and i didn't get any filesystem errors.

simultaneously, we merged a PR to prod (on hubs w/many current users) about 45m ago and i haven't seen anything in the logs, nor are there any orphaned pods...

i think we're good to go?

@shaneknapp
Author

alright, a little more testing and a couple of trepidatious deployments later, and this definitely fixes things. thanks @yuvipanda @consideRatio for the help and for getting kubespawner 6.1.0 out the door!

we'll revert our kubespawner changes once 6.1.0 is live.
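
(the revert itself should be small — a minimal sketch, again assuming a pip-based hub image:)

```bash
# once the release is on PyPI, swap the git checkout back to a pinned release
pip install "jupyterhub-kubespawner==6.1.0"
pip list | grep kubespawner   # should now report 6.1.0
```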
