Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloud Spanner Client should prevent nested transactions #2361

Closed
vkedia opened this issue Jun 6, 2017 · 31 comments
Closed

Cloud Spanner Client should prevent nested transactions #2361

vkedia opened this issue Jun 6, 2017 · 31 comments
Assignees
Labels
api: spanner Issues related to the Spanner API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release.

Comments

@vkedia
Copy link
Contributor

vkedia commented Jun 6, 2017

Cloud Spanner does not support nested transactions. But the client library does allow users to call Database.runTransaction inside of the callback that is provided to an outer Database.runTransaction. This is misleading since users might believe that this will behave like a nested transaction but in reality these will be two independent transactions. This is confusing and a source of bugs. Specifically the inner transaction might succeed while the outer might ABORT, in which case the callback will be rerun thus rerunning the inner transaction.
We should prevent this from happening by detecting and raising an error if someone calls Database.runTransaction inside the callback provided to Database.runTransaction. Note that this needs to be done before Beta since this is a known breaking change.
Also note that if we come across a use case where we do want to allow calling a nested runTransaction, we should be able to enable that in future without making a breaking change.

cc @jgeewax @bjwatson

@vkedia vkedia added api: spanner Issues related to the Spanner API. priority: p0 Highest priority. Critical issue. P0 implies highest priority. labels Jun 6, 2017
@callmehiphop
Copy link
Contributor

@stephenplusplus correct me if I'm wrong, but I'm not sure this is technically feasible? We can block transactions from being started in the event that one is already running, but I think that would imply that a user would not be able to do parallel transactions which sounds like it could be an unwanted side effect.

@bjwatson
Copy link

bjwatson commented Jun 6, 2017

Specifically there's a problem with retrying a block of transaction code that contains a nested transaction. If you have two parallel transactions in different retry blocks, or don't automatically retry either of them, then I don't think there's any problem.

@vkedia
Copy link
Contributor Author

vkedia commented Jun 6, 2017

Yes we do not want to prevent parallel transactions. I am not sure how this would be achieved in node.js since its all asynchronous. In other languages we can store something in thread local. In Go, there is some context object being passed around so we can track it in that.
Specifically what we want to disallow is someone calling database.runTransaction from the callback that is given to runTransaction.
@jgeewax Any ideas on how this would be achieved?

@stephenplusplus
Copy link
Contributor

I'm also not able to think of a way to technically achieve this at the moment. Some thoughts that crossed my mind:

  • Sandboxing the context within a transaction, such that database.runTransaction is overwritten with a function that returns a "nesting not allowed" error. This could be confusing and might not be able to handle if the user built a convenience wrapper around our database.runTransaction function.

  • Reading the execution path of an invocation of runTransaction() to see if "runTransaction" came up twice, in which case, it would be denied. I've had luck with node-stack-trace up until I mimicked the async nature of the execution path we take.

Any other thoughts are welcome.

@bjwatson
Copy link

@swcloud Can you help @stephenplusplus and @vkedia figure out how to fix this problem for Node Spanner? It's Beta blocking.

@bjwatson bjwatson added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed priority: p0 Highest priority. Critical issue. P0 implies highest priority. labels Jun 21, 2017
@swcloud
Copy link
Contributor

swcloud commented Jun 21, 2017

With a requirement to make the solution future proof, this is even harder. Adding @ofrobots to see if he has some idea.

@ofrobots
Copy link
Contributor

ofrobots commented Jun 22, 2017

What we really need here is tracking of asynchronous context, similar to what we do in Stackdriver Trace through the use of continuation-local-storage. Then, at the time Database.runTransaction was called you could ask the question whether the 'current execution context' has a pending transaction and throw an error if so.

The main problem is there does not exist a completely reliable mechanism in JavaScript to do async context propagation. Arbitrary userspace code can break async context propagation (examples, more details). We just ran into a real world case of context loss this earlier this week.

This is not a problem for diagnostic tools like Stackdriver Trace, where occasional incorrectness is not fatal. However, for a database library like Spanner, I would suspect we need 100% reliability. We are bound by the fundamental limitations of the JavaScript execution model as it stands today.

I would be very wary of the sandboxing as suggested in this comment for the reasons already mentioned. I do not think we can reliably prevent the code (or the functions called from that code) from getting its hands onto an unsandboxed runTransaction.

I do have one potential suggestion though. Is it acceptable for us to restrict the JS API visible to the callback of runTransaction?

What I am thinking is that we could impose that the contents of the runTransaction callback will be run in a new 'V8 Context' (through vm.runInNewContext). We would be able to control what values are visible inside the new context (e.g. transaction, and other values we control will be available, but things like require etc. will not be available.) Would approach like this be feasible here? I'd be happy to brainstorm on this more – I will need to make myself somewhat more familiar with the Spanner API.

This would be relevant to @matthewloring as well.

@matthewloring
Copy link
Contributor

I think the vm.runInNewContext approach could be made to work. However, I don't think we want to run the callback passed to runTransaction in a new context because this would prevent valid serial transactions of the form:

database.runTransaction((err, t1) => {
  t1.end(() => {
    database.runTransaction((err, t2) => {});
  });
});

One alternative would be for runTransaction to take two callbacks: executeTransaction and continuation. The executeTransaction callback could have the transaction handle and an exit continuation injected into its context leading to code that would like like:

database.runTransaction(() => {
  transaction.queryAndWhatnot();
  done();
}, () => {
  // continue with application logic
});

This could probably be implemented something like:

Database.prototype.runTransaction = function(executeTransaction, continuation) {
  const t = constructTransaction();
  const sandbox = { transaction: t, done: () => { t.end(continuation); } };
  vm.runInNewContext('(' + executeTransaction.toString() + ')();', sandbox);
};

Injecting variables "magically" like transaction and done above is not standard practice so this would require careful documentation. This approach will also cause a performance hit as the function must be re-parsed each time and is likely to be optimized. However, this has the benefit of separating code that may be re-executed on an aborted transaction from the program continuation. In general, it is not expected behavior in node for callbacks to be executed multiple times (even in recoverable failure scenarios). It seems like the current API makes it easy to unintentionally re-run stateful operations.

Introducing a transaction builder could also help this problem. Transactions could be constructed before the runTransaction call allowing the callback passed to runTransaction to be run exactly once preventing any issues associated with re-executing JS code. This would probably require more effort than the new context approach.

@matthewloring
Copy link
Contributor

To clarify, is the callback passed to transaction.end only invoked after the transaction succeeds or could it also be invoked multiple times if the transaction aborts?

@stephenplusplus
Copy link
Contributor

stephenplusplus commented Jun 22, 2017

I actually had vm.runInNewContext in mind when I suggested "sandboxing"-- the thought being, the user has a pre-defined collection of variables they have access to within it, but can't get out. I believe that is the only technical solution available, but I think the complexity of it, and the surprises that go along with it, are a cost that is too high.

Introducing a transaction builder could also help this problem. Transactions could be constructed before the runTransaction call allowing the callback passed to runTransaction to be run exactly once preventing any issues associated with re-executing JS code. This would probably require more effort than the new context approach.

That is what I'd lean towards. Something like:

var database = instance.database('my-db')
var transaction = database.transaction()

transaction.run(function(err) {
  // ..valid transaction commands...

  transaction.run(function(err) {})
  // throws: "Transaction already in progress. Nested transactions are not allowed."
})

This way, surprises are suppressed, and the user is not alienated from the environment they've configured for their application.

Would this work?

The user could still do:

var database = instance.database('my-db')
var transaction1 = database.transaction()

transaction1.run(function(err) {
  // ..valid transaction commands...

  var transaction2 = database.transaction()
  transaction2.run(function(err) {})
  // Would not throw, because `transaction2` doesn't know about `transaction1`
})

@ofrobots
Copy link
Contributor

@stephenplusplus how would you prevent one from writing something like this:

var database = instance.database('my-db')
var transaction = database.transaction()

transaction.run(function(err) {
  // ..valid transaction commands...

  setImmediate( () => {
    var t = database.transaction();
    t.run(...);
  });
})

@stephenplusplus
Copy link
Contributor

Can't, I mentioned that at the bottom. The goal was to make that API use feel disconnected to the point the user would intuitively feel wrong about it. But, if I didn't hit the mark, maybe @matthewloring could show the code that supports his suggestion.

@ofrobots
Copy link
Contributor

Ah. My browser wasn't showing the edited version of your comment.

@ofrobots
Copy link
Contributor

I am not sure if the possibility of accidental nested transactions is acceptable. In real world cases, I suspect it would happen somewhat like this:

transaction.run(function(err) {
  // ..valid transaction commands...
 
  otherFile.doSomething(); // could it start a transaction?
})

I don't think we should give the user an API that requires them to prove correctness of their program at a non-local level.

@stephenplusplus
Copy link
Contributor

We're in a battle with an API requirement that doesn't suit the language. This means our solutions will be a battle of the lesser evil. In a new context, the user can't use their project's own dependencies or even their own functions they've defined, etc. That's a very restricting environment we are forcing on all users, and one that will inevitably be a cause of confusion, resulting in SO/GH issues that lead us back to another discussion about finding a better alternative. I don't consider "users assuming a nested transaction works, when it actually doesn't" worth such a harsh solution that affects even the users of our library who understand proper API usage (the assumed majority, right?).

@bjwatson
Copy link

I setup a meeting for all of us to discuss this issue. The earliest time I could find available on our calendars is next Tuesday at 4:30p PST. Does that work for either of you @stephenplusplus or @callmehiphop?

@ofrobots
Copy link
Contributor

ofrobots commented Jun 23, 2017

I see a spectrum of options here, in order of increasing guarantees of correctness:

  1. Do nothing, and provide warnings documentation that users should not call runTransaction from the code reachable from the callback function. The onus is on the user to read the docs and to prove to themselves that they are not going to accidentally start a nested transaction. I don't think any of us would like this solution.
  2. As suggested here by @stephenplusplus, within the callback function, adjust the API on the transaction object to throw if run is called while the transaction is still pending. This will catch the most obvious of the user's errors – probably the cases where they were starting transactions within visible distance of each other. The user can still shoot themselves in the foot, if they somehow get their hands onto a different transaction object. The onus is on the user to prove to themselves that they are not doing this. I am concerned that we are giving a false sense of security to the user – and that it would be very easy to write code that is wrong.
  3. We use something like CLS or Zones in runTransaction. This will ensure, with a fairly high degree of confidence that we can detect – and prevent – nested calls to runTransaction. However, this will not (can not) catch all of the cases as it is possible to 'confuse' CLS and Zones through some programming patterns. We document this risk to the user. This is very low cost to implement and provides some confidence to us that most, but not all, of the mistakes by the user will be caught. I think this is a better option than 2.
  4. We admit that the strong guarantee is not possible in the language. We force the user to a restrictive pattern, whether it is a pre-compiled query syntax, or a runInNewContext style restriction. We have taken the burden of ensuring that the user cannot make a mistake, but the programming pattern may be extremely onerous/inconvenient for the user, as pointed out.

Ultimately the answer depends quite a bit upon the end users, and what they might value from the Spanner API. Some folks might be willing to tradeoff convenience for strong guarantees – others might not. I don't have a good feeling for this right now.

@ofrobots
Copy link
Contributor

ofrobots commented Jun 23, 2017

Actually thinking a bit more about 3, I think we may be able to detect the cases where we might be in a blind spot.

Specifically, when transaction.end is called, and we detect that we have lost the CLS/Zones context, we can warn the user about that our automatic detection of nested transactions is unable to function due to the programming patterns in use by their application. This might actually help us improve the accuracy of CLS & Zones techniques as well by finding/fixing code from the JS ecosystem that does user-space queuing.

I am not sure we can get to provable guarantees of safety however, but I think we should be able to build a pretty high confidence.

/cc @matthewloring: Thoughts?

@matthewloring
Copy link
Contributor

If we stored the current transaction or some flag on the CLS/Zone context there are a few possible behaviors:

  1. At the start of a runTransaction we do not observe a current transaction in the context. If context is correct, we correctly start a transaction. If context is "confused" we incorrectly start a nested transaction that may be re-run.

  2. At the start of a runTransaction we observe a current transaction in the context. If the context is correct, we correctly prevent the transaction. If the context is "confused" we incorrectly prevent a valid transaction from executing.

The CLS/Zones approach scares me because a misattributed context that happens to have a transaction defined will prevent valid transactions from executing. This seems like an unacceptable failure mode (an incomplete solution may be alright, an unsound one likely is not).

@ofrobots
Copy link
Contributor

ofrobots commented Jun 23, 2017

You listed two scenarios in how context can be 'confused'. Let's call them 'context loss' and 'context misattribution'. I'll make a few assertions:

  1. In cases where we observe a current (misattributed) transaction, and the transaction has already ended, we can immediately detect this to be a misattributed context. There will be cases where context only gets misattributed truly concurrently – we cannot detect these, true.
  2. In cases where we reach a runTransaction but have lost context – we would allow a nested transaction to start in that case. Eventually we will reach the end of the outer transaction and will have an opportunity to observe that we are calling end but have lost context. This gives us an opportunity to warn the user about context loss. There are going to be cases where context is always 'found' before we reach end – we will not be able to warn the user in these scenarios.
  3. Context confusion behaviour of a program is a dynamic property of the program. However, in most cases I would expect that a user would be able to observe context confusion in their test and staging environments. This has been my experience with context confusion so far. Yes, there are going to be dynamic paths that are only hit in production, and it is possible that context confusion only occurs on these production specific paths – and never anywhere else – and that the context confusion happens to manifest only in a fashion where our safeguards (above) cannot detect it. We won't be able to catch those cases.

The question ultimately becomes whether this is acceptable risk.

Perhaps we need to offer both this solution along with a safe sandboxing, albeit extremely onerous, solution and let users manage their risk.

@bjwatson
Copy link

Great discussion. If we can settle this issue here, then I'll cancel the meeting for next week.

I agree with the principle that we should err on the side of negative errors (i.e. not catching a nested transaction, rather than erroneously catching a non-nested transaction). It's probably okay if positive errors (the latter case) occur due to some really unusual design pattern, but if we can detect and warn about this pattern, then I think we'll be okay.

What do you think @vkedia and @swcloud?

@vkedia
Copy link
Contributor Author

vkedia commented Jun 23, 2017

+1 to what @bjwatson said. We should never erroneously prevent non nested transaction. And letting some nested transaction slip through the crack is ok. Intent here is not to strictly prohibit nested transactions but to prevent it in the common cases.

@stephenplusplus
Copy link
Contributor

@ofrobots @matthewloring given these points above, which solution sounds like the best path?

@bjwatson
Copy link

I rescheduled this meeting for tomorrow morning so that @matthewloring can attend.

@danoscarmike
Copy link
Contributor

@stephenplusplus @bjwatson ping. What's the status of this issue?

@bjwatson
Copy link

bjwatson commented Jul 6, 2017

We decided that the best we can do in Node is warn the user about nested transactions when we believe one has occurred. We cannot block it, because it's too hard / impossible to avoid false positives (i.e. a transaction that appears to be nested, but really isn't).

Since adding warnings is not really a breaking change, we're not blocking Beta on this issue.

@danoscarmike
Copy link
Contributor

Ack. TY.

@stephenplusplus
Copy link
Contributor

stephenplusplus commented Aug 7, 2017

I just want to get a clear "TODO" for this issue. I think it was settled that a warning would be our only action here, but in a Node.js application, are we talking about printing to console.log or adding something to our documentation? With console.log, it's generally not a good practice to interrupt a terminal's output from a low-level library like ours. Let me know if I misunderstood!

@bjwatson
Copy link

bjwatson commented Aug 7, 2017

@stephenplusplus At a minimum, we need to ensure our documentation is clear about this. We were also talking about logging a warning if we detect a likely violation.

Thanks for pointing out the language/cultural issue with console.log. Do you think it would be better to use something like the debug module?

@stephenplusplus
Copy link
Contributor

The debug module sounds like a good idea 👍

@stephenplusplus
Copy link
Contributor

PR opened at #2517 -- please take a look!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner Issues related to the Spanner API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release.
Projects
None yet
Development

No branches or pull requests

10 participants