
Implement graceful backoff for Postgres #328

Closed
vpetersson opened this issue Jul 22, 2019 · 20 comments · Fixed by #373

Comments

@vpetersson
Contributor

We're getting flooded in Sentry because of periodic issues connecting to Postgres (example).

While in a perfect world we'd always be able to connect, in reality there will be instances where Postgres is unreachable (network issues etc.).

Let's implement a graceful backoff to limit the Sentry issues.

@a-martynovich
Contributor

@vpetersson The example you provided is the middleware written specifically for K8s health check. It probes all db connections and throws if at least one fails.

Do you have any other examples of connection issues happening in other places? I believe connection hot-switching is built-in and happens automatically.
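
For reference, a minimal sketch of the kind of health-check view being described, assuming it simply probes every configured connection (the view name and responses here are illustrative, not our actual code):

```python
# Illustrative sketch only: a K8s-style health check that probes every
# configured database connection and fails if any of them is unreachable.
from django.db import connections
from django.db.utils import OperationalError
from django.http import HttpResponse, HttpResponseServerError


def health_check(request):
    for alias in connections:
        try:
            # Opens the connection if needed; raises OperationalError
            # if the database cannot be reached.
            connections[alias].ensure_connection()
        except OperationalError:
            return HttpResponseServerError(f"database unavailable: {alias}")
    return HttpResponse("ok")
```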

@vpetersson
Contributor Author

> @vpetersson The example you provided is the middleware written specifically for K8s health check. It probes all db connections and throws if at least one fails.
>
> Do you have any other examples of connection issues happening in other places? I believe connection hot-switching is built-in and happens automatically.

Yes, you're right. This is for the Kubernetes health check, and that's by design. This is what is likely going to detect these errors first, which is why it is the first line of defense.

I haven’t read the actual source code for this health check, but my point is that we need to patch Django’s native database calls with a graceful back-off, such that it works for both the health check and for other functions.
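
To make that concrete, here's a rough sketch of what I mean; the monkey-patching approach and the retry numbers are just an assumption to illustrate the idea, not a worked-out design:

```python
# Illustrative sketch only: retry transient connection failures inside
# Django's own connection setup instead of raising them immediately.
import time

from django.db import utils
from django.db.backends.base import base

_original_ensure_connection = base.BaseDatabaseWrapper.ensure_connection


def ensure_connection_with_retry(self):
    delay = 1.0
    for attempt in range(5):  # placeholder retry count
        try:
            return _original_ensure_connection(self)
        except utils.OperationalError:
            if attempt == 4:
                raise  # give up; the error still reaches Sentry
            time.sleep(delay)
            delay *= 2  # exponential back-off between attempts


base.BaseDatabaseWrapper.ensure_connection = ensure_connection_with_retry
```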

@a-martynovich
Contributor

@vpetersson What do you mean by graceful back-off?

According to Django docs, if a db connection fails during a request, the next request will get a new connection opened. Is that what we need?

@vpetersson
Contributor Author

I guess technically speaking, what I was referring to was exponential backoff. Meaning, if the first attempt fails, it will increase the retry period between attempts and eventually give up after a few minutes, perhaps.

@a-martynovich
Contributor

@vpetersson Isn't that what k8s health/readiness checks are for? I mean, if a node is unhealthy because it can't connect to the db, then k8s will use another node while this node gets restarted (excuse me if I'm using the wrong terminology, but I hope you see my point).

If we use backoff, a user request may get stuck for a long time (or even time out) because it will be handled by the single failing node. Backoff may or may not solve the issue; we don't know that. Quickly switching to another, healthy node is a faster option, I think.

@vpetersson
Contributor Author

> @vpetersson Isn't that what k8s health/readiness checks are for? I mean, if a node is unhealthy because it can't connect to the db, then k8s will use another node while this node gets restarted (excuse me if I'm using the wrong terminology, but I hope you see my point).
>
> If we use backoff, a user request may get stuck for a long time (or even time out) because it will be handled by the single failing node. Backoff may or may not solve the issue; we don't know that. Quickly switching to another, healthy node is a faster option, I think.

This is a bit more complex. Essentially, in a 12 factor app (we're probably closer to a monolith than a 12 factor app, but it still has some good insights), you should assume that every dependency can go down at any given moment. It is then up to the application to gracefully retry using some kind of exponential back-off algorithm.

You're right that health/readiness checks are designed largely for this; however, I'm not sure they are a silver bullet. We still need a bit more intelligence in our app layer to handle these things, I think.

The problem with just assuming it's a node issue is that it may not be that. It could be a cluster-wide issue (e.g. CloudSQL could be down, or there could be a routing issue, etc.). Hence, having some level of retry logic for perhaps up to one minute or so would help a bit.

@a-martynovich
Contributor

What will happen if a web request gets handled by a node which cannot connect to SQL and we have this backoff implemented? A timeout will eventually happen. Is it better than failing with an error? I don't think so.

Also what will happen to health checks with this backoff in place? They will just get slower because they will hang too. Is that a good thing?

We need to implement a handoff, not backoff. That is, if one node (pod) can't handle a request, it should be passed to another healthy node (pod).

Fast failing is better than slow failing; I hope we agree on that. If we have a global SQL issue, we don't know when it's going to be resolved, so retrying will probably just end in a timeout (slow failing). If we have a node- or cluster-wide SQL issue, then we switch to another node or cluster and resolve it, or fail if none of the nodes can connect to SQL (fast failing).

@vpetersson
Contributor Author

Ok, let's take a step back here. In some contexts, like with tests, it's better to fail fast. In this particular scenario, I'm not sure if that's ideal. If for whatever reason a pod cannot communicate with the CloudSQL instance for 20 seconds, we don't necessarily want the pods to be flagged as damaged and be scheduled for re-spawning as this is a rather expensive process.

Imagine for instance there is a glitch in the network and we're unable to speak to CloudSQL for 30 seconds. In the current "fail fast" scenario, due to the health checks, Kubernetes will essentially assume all pods are broken and set them to be rescheduled. Now imagine that we have a cluster of 15 nodes, each running 3 instances of each type (admin, api, mtls-api). We will then have 135 pods hammering away at the same time. Scale this up further, and we have something like a stampede issue on our hands.

If we on the other hand had an exponential back-off with some jitter, we could get around this. I'm not suggesting that they should retry forever, but perhaps up to 120s (this needs to be aligned with the health checks in Kubernetes).
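
For illustration only (the numbers and the helper are placeholders, not a concrete proposal), a back-off with full jitter and a hard overall budget could look roughly like this:

```python
# Illustrative sketch only: exponential back-off with full jitter, capped so
# the whole retry window stays within roughly 120 seconds.
import random
import time

from django.db.utils import OperationalError


def retry_with_backoff(operation, base_delay=1.0, max_delay=30.0, budget=120.0):
    deadline = time.monotonic() + budget
    delay = base_delay
    while True:
        try:
            return operation()
        except OperationalError:
            if time.monotonic() >= deadline:
                raise  # budget exhausted: give up and surface the error
            # Full jitter: a random sleep up to the current delay, so many
            # pods retrying at once don't hammer CloudSQL in lockstep.
            time.sleep(random.uniform(0, min(delay, max_delay)))
            delay *= 2
```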

Does that make sense?

@a-martynovich
Contributor

@vpetersson Health checks - all clear now.

But what should happen to a web or mtls request if it encounters SQL issues? Should it wait until a connection is re-established or should it fail fast?

@vpetersson
Contributor Author

I think we should wait for up to 120 seconds or so. It will be a better user experience than having it fail fast.

@rptrchv
Contributor

rptrchv commented Aug 7, 2019

@vpetersson We can try this on the prod: https://github.com/jdelic/django-dbconn-retry. Locally I didn't manage to reproduce the real issue, so I still don't know whether it's useful or not.
If it's ok to try it on the prod - let me know and I'll install the app.
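
For reference, my understanding (worth double-checking against the project's README) is that enabling it should just be a settings change:

```python
# settings.py - presumed setup for django-dbconn-retry; verify against its docs
INSTALLED_APPS = [
    # ... existing apps ...
    "django_dbconn_retry",
]
```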

@vpetersson
Contributor Author

@rptrchv That project doesn't seem very well maintained (nor popular). I always worry about adopting such modules/libraries. Is there another one we can look at?

@vpetersson
Contributor Author

Here's the type of errors that we need to mitigate:
https://sentry.io/share/issue/4231fd0479584a12a496561d82925a6b/

@rptrchv
Contributor

rptrchv commented Aug 19, 2019

@vpetersson A Connection refused error simply means Postgresql is not available at the particular host/port.
I have a strong feeling it happens during deployment, due to the postgres docker image(s) being rebuilt.
In order to check this hypothesis I need timestamps of several (5-10) of the most recent Sentry warnings for the Connection refused error. Note, though, that the shared Sentry issue report does not contain timestamps at all.

@rptrchv
Contributor

rptrchv commented Aug 19, 2019

@vpetersson I see now that you don't use docker for Postgresql but CloudSQL instead. The question still stands, though - I need timestamps of the several most recent Connection refused errors to check my hypothesis that it's somehow connected with the deployment procedure.

@vpetersson
Contributor Author

@rptrchv I can say with confidence that it isn't related to the deployment, as it happens randomly. What it is related to, however, is the fact that we are using ephemeral nodes for Kubernetes. As such, the node lifespan is 24 hours, after which they are replaced. While we already run two instances of the proxy for redundancy purposes, there seem to be situations where both are unavailable for a short period, which is why we need retry logic.

(Clarification: During deploys, the CloudSQL proxy is not recreated)

@vpetersson
Contributor Author

vpetersson commented Aug 26, 2019

@vpetersson
Contributor Author

vpetersson commented Sep 3, 2019

Yep, getting more of these:
https://sentry.io/share/issue/fee02f6efe7542b2ae320a15a63fe3c0/

@a-martynovich
Contributor

The "No route to host" error is not handled with graceful backoff, but it's easy to add it to the list.
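
For example, if the retry logic decides what to retry by matching the error text, adding it would look roughly like this (the helper and the existing entries are illustrative, not our actual code):

```python
# Illustrative sketch only: treat "No route to host" as a retryable
# connection error alongside the ones already handled.
RETRYABLE_ERROR_FRAGMENTS = (
    "could not connect to server",
    "Connection refused",
    "No route to host",  # newly added
)


def is_retryable_error(exc):
    message = str(exc)
    return any(fragment in message for fragment in RETRYABLE_ERROR_FRAGMENTS)
```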

@vpetersson
Contributor Author

Closing this out as it seems like #396 resolved the last issues.
