
retry on serialization errors #1342

Merged
merged 6 commits on Nov 21, 2018

Conversation

Contributor

@vito vito commented Nov 15, 2018

Fixes #1341

Turns out it's easy enough to just capture the specific serialization error code and retry the transaction.

Tested this locally - the storage conformance tests pass (they failed when removing serialization altogether), and our integration tests no longer fail on concurrent logins.
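
(For readers skimming this thread, a minimal sketch of the approach being discussed, assuming lib/pq; the execTx name and structure are placeholders, not the exact code in this PR, and the real code also sets the serializable isolation level.)

package storage

import (
    "database/sql"

    "github.com/lib/pq"
)

// execTx runs fn inside a transaction and retries the whole transaction
// whenever Postgres reports a serialization failure (SQLSTATE 40001).
func execTx(db *sql.DB, fn func(*sql.Tx) error) error {
    for {
        tx, err := db.Begin()
        if err != nil {
            return err
        }
        if err := fn(tx); err != nil {
            tx.Rollback()
            if pqErr, ok := err.(*pq.Error); ok && pqErr.Code == "40001" {
                continue // serialization failure; retry the whole transaction
            }
            return err
        }
        if err := tx.Commit(); err != nil {
            if pqErr, ok := err.(*pq.Error); ok && pqErr.Code == "40001" {
                continue // the commit itself can also fail with 40001
            }
            return err
        }
        return nil
    }
}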

Contributor

@srenatus srenatus left a comment

Thank you for contributing -- some comments inline.

To be honest, I don't completely understand the issue yet. If, as indicated in your issue, everything works fine without the serialization requirement, since the ids are always distinct, why can't that be serialized? 🤔


err = tx.Commit()
if err != nil {
if pqErr, ok := err.(*pq.Error); ok && pqErr.Code == "40001" {
Contributor

[nit] Can we do something like

pqErr.Code.Name() == "serialization_failure"

using Name and the string representation? It would be a tad more explicit, I suppose.

Contributor Author

Ah yeah that's much better.
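
(For reference, the two spellings check the same condition; lib/pq maps SQLSTATE 40001 to the symbolic name, so the check becomes, roughly:)

// Equivalent to pqErr.Code == "40001", just using lib/pq's symbolic name.
if pqErr, ok := err.(*pq.Error); ok && pqErr.Code.Name() == "serialization_failure" {
    // retry the transaction
}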

return err
}
defer tx.Rollback()
for {
Contributor

❓ This is addressing an edge case, but can we limit the number of retries? Also, do we care that it's hammering or does this need some wait times?

Contributor Author

Hmmm, to be honest I'm not sure. The Postgres docs just say "retry" but don't say anything about backoff and attempts.

I'm a little nervous about limiting it as it could potentially mean bubbling up the error, given enough concurrency. I agree that it's scary to have it unbounded, but so long as we're getting serialization failures, I'm not sure there's any resolution other than to retry until it works (which should eventually happen as requests calm down). It helps that we're at least catching this very specific error code and only retrying on that one.

Not sure about the hammering/wait times. In my experience it's sufficient to just try again immediately to resolve concurrency "disputes", since they tend to resolve fairly quickly, and the net effect is usually just a few more queries (which the database should be able to handle). If it were a failure that could come from load or network connectivity, I'd totally agree, but this almost feels more like regular handling of database behavior. 🤷‍♂️
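
(If a cap ever does seem warranted, a bounded variant could look roughly like this; the retry cap, the sleep, and the run callback are made up for the sketch and are not part of this PR.)

// Illustrative only: a bounded retry loop with a short, arbitrary backoff.
func execTxBounded(run func() error) error {
    const maxRetries = 5
    for i := 0; ; i++ {
        err := run() // one attempt at the whole transaction
        if err == nil {
            return nil
        }
        pqErr, ok := err.(*pq.Error)
        if !ok || pqErr.Code.Name() != "serialization_failure" || i >= maxRetries {
            return err
        }
        time.Sleep(time.Duration(i+1) * 10 * time.Millisecond) // linear backoff
    }
}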

}
defer tx.Rollback()
for {
tx, err := db.Begin()
Contributor

I wonder if it would be worth switching to BeginTx(context.TODO(), sql.LevelSerializable), which would, I think, mean we can ditch the explicit SET TRANSACTION ISOLATION LEVEL SERIALIZABLE.

Contributor

I suppose if we did that we could also go all-in on contexts and do something like

ctx, cancel := context.WithCancel(context.TODO())
defer cancel() // will rollback transaction
tx, err := db.BeginTx(ctx, nil /* or some proper level? */)
if err != nil {
  return err
}
if err := fn(tx); err != nil {
  return err
}
err = tx.Commit()
// etc
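
(For completeness, the fully spelled-out form with the isolation level passed through TxOptions, which is what makes the explicit SET TRANSACTION ISOLATION LEVEL SERIALIZABLE statement unnecessary; a sketch of the standard database/sql API, not the final code in this PR:)

ctx, cancel := context.WithCancel(context.TODO())
defer cancel() // cancelling the context rolls back the transaction if it hasn't been committed

tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
if err != nil {
  return err
}
if err := fn(tx); err != nil {
  return err
}
return tx.Commit()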

Contributor Author

Neat, didn't know that was a thing! I'll try it out.

@vito vito force-pushed the pr/retry-on-pg-serialization-error branch from 6f3a4db to 2c294ca on November 16, 2018 18:08
Contributor Author

vito commented Nov 16, 2018

@srenatus Yeah, I'm a bit confused by it too. Maybe the failures have something to do with concurrent reads, not just writes? There's also the periodic auth request garbage-collection.

Contributor Author

@vito vito left a comment

I think I found a problem. Let's hold off on merging for now - will have to take another swing at this.

storage/sql/sql.go
Contributor Author

vito commented Nov 19, 2018

@srenatus Alright, I've pushed another commit that refactors all the error wrapping out of the ExecTx fn, and I left a comment warning future developers not to wrap errors in there. Without this last commit, my initial fix only applied some of the time, since the failure could happen either on the individual queries in fn or on the Commit at the end, which was the only place I had added the check.

Along the way I tried to keep the error wrapping/handling consistent across all the conn methods. This change ended up being kinda big, but if it makes you feel any better there were tests that failed along the way, so the test coverage seems decent. I didn't end up having to change any tests, since no user-facing behavior changed once I fixed the things the tests told me to.

With this all cleaned up, I've got our integration tests successfully running in a loop with ~23 concurrent logins.
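
(To illustrate the wrapping problem described above — a sketch, not the actual diff; the receiver, callback type, and query are stand-ins. A fmt.Errorf inside the callback replaces the *pq.Error, so the type assertion in the retry check can never match; wrapping only after ExecTx returns keeps the driver error visible to that check.)

// Inside the ExecTx callback: return the driver error unwrapped.
err := c.ExecTx(func(tx *trans) error {
    _, err := tx.Exec(`UPDATE auth_request SET ...`) // illustrative query
    return err // wrapping here, e.g. fmt.Errorf("update: %v", err), would hide *pq.Error
})
// Wrap only after ExecTx has had its chance to retry serialization failures.
if err != nil {
    return fmt.Errorf("update auth request: %v", err)
}
return nil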


if err := fn(tx); err != nil {
if pqErr, ok := err.(*pq.Error); ok && pqErr.Code.Name() == "serialization_failure" {
// serialization error; retry
Contributor

In the initial bug report you mentioned

The Postgres docs say "applications using this level must be prepared to retry transactions due to serialization failures", which Dex doesn't seem to be doing, and it doesn't feel right to force the client to retry API/login requests due to this, and I don't think they'd even have enough information to recognize that it's a serialization failure to retry in the first place.

Can you add a link to those docs here?

Contributor Author

Sure!

@@ -114,7 +114,7 @@ import:
 - package: github.com/mattn/go-sqlite3
   version: 3fb7a0e792edd47bf0cf1e919dfc14e2be412e15
 - package: github.com/lib/pq
-  version: 50761b0867bd1d9d069276790bcd4a3bccf2324a
+  version: 9eb73efc1fcc404148b56765b0d3f61d9a5ef8ee
Contributor

❓ There's a lot going on between these two commits -- is there anything in particular that you're after, or is this a chore-type update?

Contributor Author

This was necessary for BeginTx with transaction level support: lib/pq@ed1a26d

@@ -74,6 +75,11 @@ var (
}

if err := fn(tx); err != nil {
if pqErr, ok := err.(*pq.Error); ok && pqErr.Code.Name() == "serialization_failure" {
Contributor

💭 So, the serialization_failure error potentially comes back from the fn(tx), and not only from the Commit below, as previously thought? Also, do we know if it ever comes back from the tx.Commit()?

Contributor Author

Yes, I saw it come from both. I believe it just comes down to timing.

@srenatus
Contributor

@vito thanks for hanging on to this 👍 looks like in the meantime, some other PRs have gotten into this PR's way (conflicts).

Contributor Author

vito commented Nov 20, 2018

@srenatus Yep, just working on fixing the dependencies up now and adding a comment per @ericchiang's suggestion. Will have it ready soon.

@vito vito force-pushed the pr/retry-on-pg-serialization-error branch 2 times, most recently from 1811c16 to 6d3dbc8 on November 20, 2018 15:03
Contributor Author

vito commented Nov 20, 2018

OK, all rebased and good to go!

Contributor

@srenatus srenatus left a comment

LGTM. I still don't understand the issue as well as I'd like, but this seems to be an improvement. I've yet to see these failures in the wild, but your test suite agrees, so 👍

💭 I wonder if your tests depend on your infrastructure/stack, or could become part of github.com/dexidp/dex, too? 😉

//
// "applications using this level must be prepared to retry transactions due to
// serialization failures"
func isRetryableSerializationFailure(err error) bool {
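
(A minimal version of that helper's body, assuming lib/pq; a sketch, so the function in the actual change may differ slightly:)

// isRetryableSerializationFailure reports whether err is Postgres'
// SQLSTATE 40001, which lib/pq exposes under the name "serialization_failure".
func isRetryableSerializationFailure(err error) bool {
    if pqErr, ok := err.(*pq.Error); ok {
        return pqErr.Code.Name() == "serialization_failure"
    }
    return false
}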

@srenatus
Contributor

ℹ️ This is probably good to go, but I'd like to take a dive into PG doc land to wrap my head around this some more. However, I'll do that tomorrow morning, probably. In the meantime, there's not going to be a release out of thin air, so I'm fine either way -- shall we merge this now or later?

Contributor Author

vito commented Nov 20, 2018

@srenatus Doesn't matter to me, feel free to take your time. We've still got a few things on our fork so we won't be blocked on a merge either way.

also use a context for the rollback, which is a bit cleaner since it
only results in one 'defer', rather than N from the loop

prior to this change, many of the functions in the ExecTx callback would
wrap the error before returning it. this made it impossible to check
for the error code.

instead, the error wrapping has been moved to be external to the
`ExecTx` callback, so that the error code can be checked and
serialization failures can be retried.
@vito vito force-pushed the pr/retry-on-pg-serialization-error branch from 6d3dbc8 to 85dd068 on November 20, 2018 15:51
@srenatus
Contributor

I've tried replicating the failures. And, while I was able to trigger the condition using concurrent updates, I have no idea why these happen. This isn't entirely satisfying, but since the internet says that there doesn't have to be a serialization failure for these retryable errors to occur, and applications need to be aware of this, I think this is good to go.

The for {} here makes me a little nervous, but since we error out if some other error is returned, this should not loop forever.

Contributor

srenatus commented Nov 21, 2018

And, while I was able to trigger the condition using concurrent updates, I have no idea why these happen.

Oh, forgot to mention: using this branch, those failures, while still in PG's logs (of course), didn't matter anymore, as the retry was successful. 👍

ℹ️ What I've done locally is quick and dirty, so, nothing to commit at this moment.

@vito vito deleted the pr/retry-on-pg-serialization-error branch November 27, 2018 16:50
srenatus added a commit that referenced this pull request Dec 2, 2018
mmrath pushed a commit to mmrath/dex that referenced this pull request Sep 2, 2019