
Implement graceful backoff for Postgres #328

Closed
vpetersson opened this issue Jul 22, 2019 · 20 comments · Fixed by #373

Comments

@vpetersson
Contributor

We're getting flooded in Sentry because of periodic issues connecting to Postgres (example).

While in a perfect world we'd always be able to connect, in reality there will be instances where Postgres is unreachable (network issues etc.).

Let's implement a graceful backoff to limit the Sentry issues.

@a-martynovich
Contributor

@vpetersson The example you provided is the middleware written specifically for K8s health check. It probes all db connections and throws if at least one fails.

Do you have any other examples of connection issues happening in other places? I believe connection hot-switching is built-in and happens automatically.
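
For reference, a minimal sketch of the kind of health-check view being described, assuming it simply probes every configured connection (the view name and responses here are illustrative, not our actual code):

```python
# Illustrative sketch only: a K8s-style health check that probes every
# configured database connection and fails if any of them is unreachable.
from django.db import connections
from django.db.utils import OperationalError
from django.http import HttpResponse, HttpResponseServerError


def health_check(request):
    for alias in connections:
        try:
            # Opens the connection if needed; raises OperationalError
            # if the database cannot be reached.
            connections[alias].ensure_connection()
        except OperationalError:
            return HttpResponseServerError(f"database unavailable: {alias}")
    return HttpResponse("ok")
```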

@vpetersson
Contributor Author

> @vpetersson The example you provided is the middleware written specifically for K8s health check. It probes all db connections and throws if at least one fails.
>
> Do you have any other examples of connection issues happening in other places? I believe connection hot-switching is built-in and happens automatically.

Yes, you're right. This is for the Kubernetes health check, and that's by design. This is what is likely going to detect these errors first, which is why it is the first line of defense.

I haven’t read the actual source code for this health check, but my point is that we need to patch Django’s native database calls with a graceful back-off, such that it works for both the health check and for other functions.
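
To make that concrete, here's a rough sketch of what I mean; the monkey-patching approach and the retry numbers are just an assumption to illustrate the idea, not a worked-out design:

```python
# Illustrative sketch only: retry transient connection failures inside
# Django's own connection setup instead of raising them immediately.
import time

from django.db import utils
from django.db.backends.base import base

_original_ensure_connection = base.BaseDatabaseWrapper.ensure_connection


def ensure_connection_with_retry(self):
    delay = 1.0
    for attempt in range(5):  # placeholder retry count
        try:
            return _original_ensure_connection(self)
        except utils.OperationalError:
            if attempt == 4:
                raise  # give up; the error still reaches Sentry
            time.sleep(delay)
            delay *= 2  # exponential back-off between attempts


base.BaseDatabaseWrapper.ensure_connection = ensure_connection_with_retry
```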

@a-martynovich
Contributor

@vpetersson What do you mean by graceful back-off?

According to Django docs, if a db connection fails during a request, the next request will get a new connection opened. Is that what we need?

@vpetersson
Contributor Author

I guess technically speaking, what I was referring to was exponential backoff. Meaning, if the first attempt fails, it will increase the retry period between attempts and eventually give up after a few minutes, perhaps.

@a-martynovich
Contributor

@vpetersson Isn't that what k8s health/readiness checks are for? I mean, if a node is unhealthy because it can't connect to the db, then k8s will use another node while this node gets restarted (excuse me if I'm using the wrong terminology, but I hope you see my point).

If we use backoff, a user request may get stuck for a long time (or even time out) because it will be handled by the single failing node. Backoff may or may not solve the issue; we don't know that. Quickly switching to another, healthy node is a faster option, I think.

@vpetersson
Contributor Author

> @vpetersson Isn't that what k8s health/readiness checks are for? I mean, if a node is unhealthy because it can't connect to the db, then k8s will use another node while this node gets restarted (excuse me if I'm using the wrong terminology, but I hope you see my point).
>
> If we use backoff, a user request may get stuck for a long time (or even time out) because it will be handled by the single failing node. Backoff may or may not solve the issue; we don't know that. Quickly switching to another, healthy node is a faster option, I think.

This is a bit more complex. Essentially, in a 12 factor app (we're probably closer to a monolith than a 12 factor app, but it still has some good insights), you should assume that every dependency can go down at any given moment. It is then up to the application to gracefully retry using some kind of exponential back-off algorithm.

You're right that health/readiness checks are designed largely for this; however, I'm not sure they are a silver bullet. We still need a bit more intelligence in our app layer to handle these things, I think.

The problem with just assuming it's a node issue is that it may not be that. It could be a cluster-wide issue (e.g. CloudSQL could be down, or there could be a routing issue, etc.). Hence, having some level of retry logic for perhaps up to one minute or so would help a bit.

@a-martynovich
Contributor

What will happen if a web request gets handled by a node which cannot connect to SQL and we have this backoff implemented? A timeout will eventually happen. Is it better than failing with an error? I don't think so.

Also what will happen to health checks with this backoff in place? They will just get slower because they will hang too. Is that a good thing?

We need to implement a handoff, not backoff. That is, if one node (pod) can't handle a request, it should be passed to another healthy node (pod).

Fast failing is better than slow failing; I hope we agree on that. If we have a global SQL issue, we don't know when it's going to be resolved, so retrying will probably just end in a timeout (slow failing). If we have a node- or cluster-wide SQL issue, then we switch to another node or cluster and resolve it, or fail if none of the nodes can connect to SQL (fast failing).

@vpetersson
Contributor Author

Ok, let's take a step back here. In some contexts, like with tests, it's better to fail fast. In this particular scenario, I'm not sure if that's ideal. If for whatever reason a pod cannot communicate with the CloudSQL instance for 20 seconds, we don't necessarily want the pods to be flagged as damaged and be scheduled for re-spawning as this is a rather expensive process.

Imagine for instance there is a glitch in the network and we're unable to speak to CloudSQL for 30 seconds. In the current "fail fast" scenario, due to the health checks, Kubernetes will essentially assume all pods are broken and set them to be rescheduled. Now imagine that we have a cluster of 15 nodes, each running 3 instances of each type (admin, api, mtls-api). We will then have 135 pods hammering away at the same time. Scale this up further, and we have something like a stampede issue on our hands.

If we on the other hand had an exponential back-off with some jitter, we could get around this. I'm not suggesting that they should retry forever, but perhaps up to 120s (this needs to be aligned with the health checks in Kubernetes).
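
For illustration only (the numbers and the helper are placeholders, not a concrete proposal), a back-off with full jitter and a hard overall budget could look roughly like this:

```python
# Illustrative sketch only: exponential back-off with full jitter, capped so
# the whole retry window stays within roughly 120 seconds.
import random
import time

from django.db.utils import OperationalError


def retry_with_backoff(operation, base_delay=1.0, max_delay=30.0, budget=120.0):
    deadline = time.monotonic() + budget
    delay = base_delay
    while True:
        try:
            return operation()
        except OperationalError:
            if time.monotonic() >= deadline:
                raise  # budget exhausted: give up and surface the error
            # Full jitter: a random sleep up to the current delay, so many
            # pods retrying at once don't hammer CloudSQL in lockstep.
            time.sleep(random.uniform(0, min(delay, max_delay)))
            delay *= 2
```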

Does that make sense?

@a-martynovich
Contributor

@vpetersson Health checks - all clear now.

But what should happen to a web or mtls request if it encounters SQL issues? Should it wait until a connection is re-established or should it fail fast?

@vpetersson
Contributor Author

I think we should wait for up to 120 seconds or so. It will be a better user experience than having it fail fast.

@rptrchv
Contributor

rptrchv commented Aug 7, 2019

@vpetersson We can try this on the prod: https://github.com/jdelic/django-dbconn-retry. Locally I didn't manage to reproduce the real issue, so I still don't know whether it's useful or not.
If it's ok to try it on the prod - let me know and I'll install the app.
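
For reference, my understanding (worth double-checking against the project's README) is that enabling it should just be a settings change:

```python
# settings.py - presumed setup for django-dbconn-retry; verify against its docs
INSTALLED_APPS = [
    # ... existing apps ...
    "django_dbconn_retry",
]
```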

@vpetersson
Contributor Author

@rptrchv That project doesn't seem very well maintained (nor popular). I always worry about adopting such modules/libraries. Is there another one we can look at?

@vpetersson
Contributor Author

Here's the type of errors that we need to mitigate:
https://sentry.io/share/issue/4231fd0479584a12a496561d82925a6b/

@rptrchv
Contributor

rptrchv commented Aug 19, 2019

@vpetersson A Connection refused error simply means Postgresql is not available at the particular host/port.
I have a strong feeling it happens during deployment, due to the postgres docker image(s) being rebuilt.
In order to check this hypothesis I need timestamps of several (5-10) of the most recent Sentry warnings for the Connection refused error. Note, though, that the shared Sentry issue report does not contain timestamps at all.

@rptrchv
Contributor

rptrchv commented Aug 19, 2019

@vpetersson I see now that you don't use docker for Postgresql but CloudSQL instead. The question still stands, though - I need timestamps of the several most recent Connection refused errors to check my hypothesis that it's somehow connected with the deployment procedure.

@vpetersson
Contributor Author

@rptrchv I can say with confidence that it isn't related to the deployment, as it happens randomly. What it is related to, however, is the fact that we are using ephemeral nodes for Kubernetes. As such, the node lifespan is 24 hours, after which they are replaced. While we already run two instances of the proxy for redundancy purposes, there seem to be situations where both are unavailable for a short period, which is why we need retry logic.

(Clarification: During deploys, the CloudSQL proxy is not recreated)

@vpetersson
Contributor Author

vpetersson commented Aug 26, 2019

@vpetersson
Contributor Author

vpetersson commented Sep 3, 2019

Yep, getting more of these:
https://sentry.io/share/issue/fee02f6efe7542b2ae320a15a63fe3c0/

@a-martynovich
Contributor

The "No route to host" error is not handled with graceful backoff, but it's easy to add it to the list.
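
For example, if the retry logic decides what to retry by matching the error text, adding it would look roughly like this (the helper and the existing entries are illustrative, not our actual code):

```python
# Illustrative sketch only: treat "No route to host" as a retryable
# connection error alongside the ones already handled.
RETRYABLE_ERROR_FRAGMENTS = (
    "could not connect to server",
    "Connection refused",
    "No route to host",  # newly added
)


def is_retryable_error(exc):
    message = str(exc)
    return any(fragment in message for fragment in RETRYABLE_ERROR_FRAGMENTS)
```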

@vpetersson
Contributor Author

Closing this out as it seems like #396 resolved the last issues.
