Recreate as CHP proxy pod's deployment strategy #1401

consideRatio
Member

Using a rolling update by default for the proxy pod was a mistake on our
part, because of how JupyterHub and the CHP proxy interact. In
check_routes, add_route, etc., JupyterHub assumes that it is speaking to
one specific CHP proxy server, but different servers can respond if the
proxy pod goes through a rolling update during an upgrade.

For example, consider a hub pod doing a recreate upgrade and a proxy pod
doing a rolling upgrade. The new hub pod could become ready before the
new proxy pod, start speaking with the old proxy pod, and later, at a
crucial point, start speaking with the new one. If the switch to the new
pod happens at the wrong time, you may end up with a failure to get
responses from user pods that are verified to be running, and those pods
are then deleted.
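
The race described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyProxy`, `ToyService`, `missing_routes`, and `simulate_rolling_update` are hypothetical names invented for illustration, not JupyterHub or CHP code, and the model assumes each CHP pod keeps its routing table in its own memory, so a freshly started pod knows none of the routes that were added to the old one.

```python
class ToyProxy:
    """Hypothetical stand-in for one CHP pod with its own in-memory routes."""

    def __init__(self, name):
        self.name = name
        self.routes = {}


class ToyService:
    """Hypothetical stand-in for the endpoint the hub talks to.

    During a rolling update, the pod behind it can change without the
    hub noticing.
    """

    def __init__(self, backend):
        self.backend = backend

    def add_route(self, path, target):
        self.backend.routes[path] = target

    def get_all_routes(self):
        return dict(self.backend.routes)


def missing_routes(service, running_users):
    """Return the users that have no route on whichever pod answers."""
    routes = service.get_all_routes()
    return sorted(u for u in running_users if f"/user/{u}/" not in routes)


def simulate_rolling_update(running_users):
    old_pod = ToyProxy("proxy-old")
    new_pod = ToyProxy("proxy-new")
    service = ToyService(old_pod)

    # The new hub pod is ready first and registers routes on the OLD pod.
    for user in running_users:
        service.add_route(f"/user/{user}/", f"http://{user}-pod:8888")
    before = missing_routes(service, running_users)

    # The rolling update finishes: requests now reach the NEW pod,
    # whose routing table is empty.
    service.backend = new_pod
    after = missing_routes(service, running_users)
    return before, after


before, after = simulate_rolling_update(["alice", "bob"])
print(before)  # []
print(after)   # ['alice', 'bob']
```

In this model the routes are not lost because of anything the hub did wrong; they simply live on a pod that no longer answers, which is the kind of inconsistency the hub is not written to expect.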

So, this commit hopes to fix a sneaky bug where user pods are deleted
during upgrades in which the proxy pod is also updated!
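
As a rough sketch of what the change amounts to at the Kubernetes level, the snippet below switches a Deployment manifest, represented as a plain Python dict, from `RollingUpdate` to `Recreate`. The `spec.strategy.type` field and its two values are part of the Kubernetes Deployment API; the helper `use_recreate_strategy` and the example manifest are hypothetical and are not the chart's actual template or values.yaml layout.

```python
import copy


def use_recreate_strategy(deployment):
    """Return a copy of a Deployment manifest dict that uses Recreate."""
    patched = copy.deepcopy(deployment)
    spec = patched.setdefault("spec", {})
    # Replace the whole strategy mapping so that no rollingUpdate settings
    # linger: Kubernetes rejects rollingUpdate when the type is Recreate.
    spec["strategy"] = {"type": "Recreate"}
    return patched


# Illustrative manifest only; names and values are assumptions.
proxy_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "proxy"},
    "spec": {
        "replicas": 1,
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0},
        },
    },
}

patched = use_recreate_strategy(proxy_deployment)
print(patched["spec"]["strategy"])  # {'type': 'Recreate'}
```

With `Recreate`, the old pod is terminated before the new one is created, so the hub can never reach two different proxy pods during one upgrade. The cost is a short window in which no proxy pod is running.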


Note that with the traefik proxy, which would store its state in a key-value store, this may not be a problem, but we don't use the traefik proxy yet.

Member

@betatim betatim left a comment

Nice catch! Looks good to me modulo the two nits.

Knock knock
Race condition
Who's there?

😂

@consideRatio consideRatio force-pushed the proxy-doesnt-support-rollingupgrades branch from 902a032 to 4bb76c7 Compare September 10, 2019 11:36
@consideRatio consideRatio merged commit 3a5b37a into jupyterhub:master Sep 10, 2019
consideRatio added a commit to consideRatio/zero-to-jupyterhub-k8s that referenced this pull request Sep 10, 2019
Upgrades from previous state to this would fail without this fix of the
issue caused by removing the fix in this PR: jupyterhub#1401
consideRatio added a commit that referenced this pull request Sep 10, 2019
@consideRatio
Member Author

After this change, it is typical to see errors like the following on the hub while it starts up, until the proxy becomes ready again.

[E 2019-09-10 13:40:47.342 JupyterHub app:2482]
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 2480, in launch_instance_async
        await self.start()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 2405, in start
        await self.proxy.check_routes(self.users, self._service_map)
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/proxy.py", line 62, in locked_method
        return await method(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/proxy.py", line 315, in check_routes
        routes = await self.get_all_routes()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/proxy.py", line 804, in get_all_routes
        resp = await self.api_request('', client=client)
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/proxy.py", line 773, in api_request
        result = await client.fetch(req)
    tornado.curl_httpclient.CurlError: HTTP 599: Connection timed out after 20001 milliseconds
    
[D 2019-09-10 13:40:47.344 JupyterHub application:647] Exiting application: jupyterhub
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<JupyterHub.launch_instance_async() done, defined at /usr/local/lib/python3.6/dist-packages/jupyterhub/app.py:2477> exception=SystemExit(1,)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 2480, in launch_instance_async
    await self.start()
  File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 2405, in start
    await self.proxy.check_routes(self.users, self._service_map)
  File "/usr/local/lib/python3.6/dist-packages/jupyterhub/proxy.py", line 62, in locked_method
    return await method(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/jupyterhub/proxy.py", line 315, in check_routes
    routes = await self.get_all_routes()
  File "/usr/local/lib/python3.6/dist-packages/jupyterhub/proxy.py", line 804, in get_all_routes
    resp = await self.api_request('', client=client)
  File "/usr/local/lib/python3.6/dist-packages/jupyterhub/proxy.py", line 773, in api_request
    result = await client.fetch(req)
tornado.curl_httpclient.CurlError: HTTP 599: Connection timed out after 20001 milliseconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 2492, in launch_instance
    loop.start()
  File "/usr/local/lib/python3.6/dist-packages/tornado/platform/asyncio.py", line 148, in start
    self.asyncio_loop.run_forever()
  File "/usr/lib/python3.6/asyncio/base_events.py", line 438, in run_forever
    self._run_once()
  File "/usr/lib/python3.6/asyncio/base_events.py", line 1451, in _run_once
    handle._run()
  File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
  File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 2483, in launch_instance_async
    self.exit(1)
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 648, in exit
    sys.exit(exit_status)
SystemExit: 1

@limeicao

Hi guys, have you solved this problem: tornado.curl_httpclient.CurlError: HTTP 599: Connection timed out after 20001 milliseconds? I have run into the same problem and it has bothered me for several days. Please give me some suggestions. Waiting for your response.

@consideRatio
Member Author

Yes, use the latest version, 0.10.6 of the helm chart.
