quickfix #1355: only try three times #1356
Conversation
Signed-off-by: Stephan Renatus <srenatus@chef.io>
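The quickfix caps the retry loop at three attempts instead of retrying forever. A minimal sketch of the shape of that change (the names here are hypothetical stand-ins, not dex's actual storage code, and the sentinel error stands in for however the driver flags a serialization failure):

```go
package main

import (
	"errors"
	"fmt"
)

// errSerialization stands in for whatever error the SQL storage
// driver uses to signal a Postgres serialization failure.
var errSerialization = errors.New("pq: could not serialize access")

// retry runs fn up to three times, as this PR's quickfix does, and
// gives up after that instead of looping indefinitely.
func retry(fn func() error) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if !errors.Is(err, errSerialization) {
			return err // only serialization failures are worth retrying
		}
	}
	return fmt.Errorf("giving up after 3 attempts: %w", err)
}

func main() {
	calls := 0
	err := retry(func() error {
		calls++
		if calls < 3 {
			return errSerialization // transient: fails twice, then succeeds
		}
		return nil
	})
	fmt.Println(calls, err) // → 3 <nil>
}
```

The cap trades liveness for boundedness: a genuinely unsatisfiable conflict now surfaces as an error after three tries rather than spinning, which is the concern raised in the next comment.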
Fair enough! I can test this out Monday (on vacation atm) but I'm a little worried this limit will be too low. It was super easy to trigger the failure before the retry logic was added. I can also try to work on a test to verify concurrent logins.
@vito Sure, enjoy your time off. We're just not able to cut a release right now, but that's OK. I'm having a hard time coming up with a way to differentiate legitimate serialization failures from retryable ones that'll eventually succeed... might try to find out what other folks do there.
@srenatus Ok, so here's my theory. I don't think there's really such a thing as legitimate vs. retryable serialization failures - I think that was a symptom of the tests, because they literally force an unsatisfiable concurrency situation, which I would expect is unrealistic compared to actual API usage. I'm gonna go out on a limb here and suggest that all serialization failures should be retryable; otherwise it's a potential bug in the code, and part of the point of setting the serializable transaction level would be to notice and fix those before they end up in the wild.

The tests passed for me because, as you found, my implementation leaked connections, eventually exhausting the server connection pool and resulting in an error. All the tests checked for was that an error was hit, so, well, they passed.

But I think the real fix here is to fix the connection leak by calling

So far I've confirmed the connection leak theory by adding a
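The exact calls mentioned above were lost in the page scrape, but the leak mechanism described (each query checking out a pool connection that is never returned, until the pool runs dry) can be sketched with a toy pool; everything here is hypothetical illustration, not dex's code:

```go
package main

import "fmt"

// pool simulates database/sql's connection pool: Query checks out a
// connection, Close returns it. Forgetting Close leaks connections
// until the pool is exhausted -- the failure mode suspected above.
type pool struct{ inUse, max int }

type rows struct{ p *pool }

func (p *pool) Query() (*rows, error) {
	if p.inUse == p.max {
		return nil, fmt.Errorf("pool exhausted (%d connections in use)", p.inUse)
	}
	p.inUse++
	return &rows{p: p}, nil
}

// Close returns the connection to the pool.
func (r *rows) Close() { r.p.inUse-- }

func main() {
	p := &pool{max: 2}
	for i := 0; i < 3; i++ {
		r, err := p.Query()
		if err != nil {
			fmt.Println(err) // → pool exhausted (2 connections in use)
			return
		}
		_ = r // BUG: no r.Close() -- each iteration leaks a connection
	}
}
```

With a `defer r.Close()` (or an explicit `Close` on every path) after each successful `Query`, the loop completes; without it the pool runs dry, which would explain why the tests saw an error and "passed" for the wrong reason.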
@vito thanks for looking into this. I'm wondering about this, too
-- when would this happen? I think it could happen if you "choose" two different connectors at the same time. When you see the connector list, like
the first link goes to

Now, if you'd do the two requests concurrently, this would lead to a serialization failure that isn't entirely wrong, and can't succeed on retry (unless I'm missing something). 👉 Pushed a commit here illustrating the situation a little: srenatus@ad2e530. (Also, it would be a condition that can be triggered by an unauthenticated user 🤔)

What do you think? I guess there are a few options here -- for example, there has been evidence that restructuring the authn request code could be a good idea (#646, #1292).
@srenatus I would expect that one transaction at a time would succeed until they all eventually succeed via their retry logic, in some (non-deterministic) order. The docs seem to indicate this:

(from https://www.postgresql.org/docs/current/transaction-iso.html#XACT-SERIALIZABLE)

If somehow all transactions end up failing, resulting in a retry-loop "live-lock", perhaps there's something wrong with how Dex is using Postgres transactions for this? (I don't know what this would look like - maybe holding them open too long and having multiple points at which a transaction may fail based on different data being referenced/updated?) I haven't run your test change there yet, but I'm guessing you're seeing it fail (or hang)?
Yeah, depending on whether I'll

What's our desired behaviour, though? I'd think if the two-connectors-selected situation is happening, I would want the first one to succeed (whichever is first) and the second one to fail. In some way we need to ensure there's no cross-over, or mixup, of connectors. You cannot start an LDAP flow, do something to have the connector ID switch to SAML, and finish the LDAP flow. Or you can do that, but the result must not look like you had logged in using SAML 😉

@ericchiang Could you chime in here? Do you know what edge cases this was supposed to catch?
Fair; I really don't know enough about the full Dex API and auth flow to know what kinds of guarantees are critical for consistency reasons, and whether there are good reasons to fail user-initiated requests due to concurrency. So I'd appreciate @ericchiang's input here too. 🙂

I typically expect last-write-wins; to me, two simultaneous requests with different intents have no single "correct" resolution, because the outcome is subject to network jitter. But I could totally see why someone would want to start one login flow and then start a different one soon after, expecting the last one to work - they may have mistakenly initiated the first request via a simple misclick and just gone back to start over. This case is more trivially serializable, of course, but I'd guess supporting that would mean going with last-write-wins if it really comes down to the wire.
Just to clarify about the tests. The tests just try to make sure that dex does the right thing in the case of concurrency. When they were written, no storage implementations retried, so if there was a racy call, at least one of them should fail. Since you've added some retries for optimistic concurrency, the test now passes. Which is the right behavior now, but the test is still locked into the old assumption. We should probably change the test.

I'm actually more concerned about the tx leak and HEAD having a failing test. To get HEAD back to a good state, can we:
I'm fine with retrying on any error. Maybe we should bake that into the storage layer so all implementations do that? |
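Baking the retry into the storage layer could look like a wrapper that decorates any implementation; `Storage` below is a hypothetical one-method stand-in for dex's much larger interface, and the attempt count is just illustrative:

```go
package main

import (
	"errors"
	"fmt"
)

// Storage is a one-method stand-in for dex's storage interface.
type Storage interface {
	UpdateAuthRequest(id string) error
}

// retrying decorates any Storage so every call is retried a bounded
// number of times -- the "bake it into the storage layer" idea, so
// each implementation doesn't reimplement its own loop.
type retrying struct {
	Storage
	attempts int
}

func (r retrying) UpdateAuthRequest(id string) error {
	var err error
	for i := 0; i < r.attempts; i++ {
		if err = r.Storage.UpdateAuthRequest(id); err == nil {
			return nil
		}
	}
	return err // still failing after all attempts
}

// flaky fails a fixed number of times before succeeding, simulating
// transient serialization failures.
type flaky struct{ failures int }

func (f *flaky) UpdateAuthRequest(id string) error {
	if f.failures > 0 {
		f.failures--
		return errors.New("serialization failure")
	}
	return nil
}

func main() {
	s := retrying{Storage: &flaky{failures: 2}, attempts: 3}
	fmt.Println(s.UpdateAuthRequest("req-1")) // → <nil> (succeeded on the third try)
}
```

One method is shown here; for the real interface every method would need the same treatment (or code generation), which is the main cost of this approach versus retrying only at known conflict points.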
…lization-error Revert "retry on serialization errors" This will come back, as outlined here: #1356 (comment)
The code of this PR is not useful anymore -- I'm keeping it open regardless, to track the #1356 (comment) progress above. |
…-serialization-error Revert "retry on serialization errors" This will come back, as outlined here: dexidp#1356 (comment)
cc @vito 😉