
Concurrency control in core-data #26325

Closed

adamziel opened this issue Oct 20, 2020 · 7 comments
Labels: [Package] Core data, [Package] Data, [Package] Edit Site, [Package] Edit Widgets, [Type] Bug


adamziel commented Oct 20, 2020

The problem

While investigating #22127, I discovered what appears to be a massive problem with API interactions in core-data.

There seems to be no concept of concurrency control.

A very simple example: if I call saveEntityRecord twice on the same record, it will spark two concurrent POST requests. One of them wins, but the client doesn't know which one. Similarly, I can trigger saveEntityRecord and deleteEntityRecord at the same time, and the result will vary depending on the exact timing. That's not very common at the moment, but it becomes obvious in #22127.
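To make that concrete, here is the kind of call pattern that triggers the race. The 'book' post type and the payloads are made up for illustration:

```js
// Illustrative only: assumes a hypothetical 'book' post type exposed at /wp/v2/books.
import { dispatch } from '@wordpress/data';

const { saveEntityRecord, deleteEntityRecord } = dispatch( 'core' );

// Both saves return immediately and issue their requests concurrently;
// the store ends up reflecting whichever response arrives last.
saveEntityRecord( 'postType', 'book', { id: 1, title: 'Draft A' } );
saveEntityRecord( 'postType', 'book', { id: 1, title: 'Draft B' } );

// Mixing a save and a delete is even less predictable:
saveEntityRecord( 'postType', 'book', { id: 1, title: 'Draft C' } );
deleteEntityRecord( 'postType', 'book', 1 );
```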

Let's talk about a common problem. Consider this minimal component that renders some data retrieved using getEntityRecords and saves changes using saveEntityRecord:

https://github.com/samueljseay/gutenberg/blob/c852bee7ce33498c3ee7faca743fac9e473bb03c/test-plugins/core-data/js/index.js#L1-L57
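For readers who don't want to follow the link, here is a rough sketch of what such a component could look like. It is not the linked file verbatim; the 'book' entity, its read meta flag, and the checkbox UI are assumptions:

```js
// Rough sketch, not the linked test plugin: renders books fetched with
// getEntityRecords and toggles a (hypothetical) meta flag with saveEntityRecord.
import { compose } from '@wordpress/compose';
import { withSelect, withDispatch } from '@wordpress/data';

function BookList( { books, toggleRead } ) {
	if ( ! books ) {
		return 'Loading…';
	}
	return books.map( ( book ) => (
		<label key={ book.id }>
			<input
				type="checkbox"
				checked={ !! book.meta.read }
				onChange={ () => toggleRead( book ) }
			/>
			{ book.title.rendered }
		</label>
	) );
}

export default compose( [
	withSelect( ( select ) => ( {
		books: select( 'core' ).getEntityRecords( 'postType', 'book' ),
	} ) ),
	withDispatch( ( dispatch ) => ( {
		toggleRead( book ) {
			dispatch( 'core' ).saveEntityRecord( 'postType', 'book', {
				id: book.id,
				meta: { read: ! book.meta.read },
			} );
		},
	} ) ),
] )( BookList );
```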

Fetching data

What does the resolution flow look like? To answer that question, I will use a chart, since reasoning about timing issues without one is just too hard. Time flows downwards:

[Screenshot: fetch/resolution flow chart (2020-10-21 11:07:15)]

Easy peasy: first, withSelect only gets an empty list because nothing is stored yet; then the resolver kicks in asynchronously, talks to the API, updates the store, and the store re-runs the withSelect handler once the data is available.

Saving data

Each time the user clicks a checkbox, the component dispatches saveEntityRecord() and triggers the following chain of events:

[Screenshot: save flow chart (2020-10-21 11:14:29)]

There's already a problem here: saveEntityRecord() calls getRawEntityRecord(), which doesn't consider the record with a specific id to be resolved even though a list of records from /wp/v2/books was loaded earlier. Instead, a resolver is triggered and an explicit GET request is issued to /wp/v2/books/1 - that happens around the same time as the PUT request that persists the changes. Depending on the timing of the two, the GET results could override the last RECEIVE_ITEMS triggered by saveEntityRecord(). In that case, the user would see some flickering on the screen, and the store would end up with stale data.

Fetching and saving data combined

What is really interesting, though, is what happens when we combine fetching AND saving. Let's take a look:

[Screenshot: combined fetch-and-save flow chart (2020-10-21 11:19:11)]

Woah, that's a lot of arrows and boxes! What happens on that chart is:

  1. We fetch the data first - just like on the very first chart in this issue.
  2. The user triggers a save after the data is loaded - same as the second chart.
  3. The transparent boxes again show the interactions with withSelect.

This is pretty fragile - there are multiple requests started around the same time, and they may be resolved in any order. If GET /wp/v2/books is resolved last, the store is stuck with stale data until the page is refreshed.

Batch processing is affected even more, as having more entity records in the mix means more timing issues.

Possible solutions

Atomic operations

I really don't like how saveEntityRecord() triggers a bunch of side effects while everything is still up in the air. And even if saveEntityRecord didn't, the user could. The point is that writes and reads mix together in unpredictable ways.

A quick, immediate solution

We could prevent concurrent conflicting operations:

  • By not resolving while a write operation is in progress
  • By using locking (a shared lock and an exclusive lock, maybe?). Selectors would "just work", but API reads (resolvers) and API writes would never run clashing operations at the same time (see the sketch below).
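To illustrate the locking idea, here is a minimal shared/exclusive lock sketch. The helper names are hypothetical and this is not meant as a final implementation:

```js
// Minimal read/write lock sketch. Resolvers (API reads) take the shared lock;
// API writes take the exclusive one, so reads never interleave with a write.
function createRwLock() {
	let readers = 0;
	let writeInProgress = null; // Promise of the in-flight write, if any.

	const tick = () => new Promise( ( resolve ) => setTimeout( resolve, 0 ) );

	return {
		async withSharedLock( task ) {
			// Wait for any in-flight write, then run alongside other readers.
			while ( writeInProgress ) {
				await writeInProgress.catch( () => {} );
			}
			readers++;
			try {
				return await task();
			} finally {
				readers--;
			}
		},
		async withExclusiveLock( task ) {
			// Wait for the previous writer, then let active readers drain.
			while ( writeInProgress ) {
				await writeInProgress.catch( () => {} );
			}
			writeInProgress = ( async () => {
				while ( readers > 0 ) {
					await tick();
				}
				try {
					return await task();
				} finally {
					writeInProgress = null;
				}
			} )();
			return writeInProgress;
		},
	};
}
```

A resolver would wrap its apiFetch() call in withSharedLock(), while saveEntityRecord() and deleteEntityRecord() would wrap their whole flow in withExclusiveLock(), so clashing operations on the same record never overlap.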

Also, to reduce the number of factors in the mix, we could apply optimistic updates a bit differently. Namely, instead of using receiveEntityRecords, we could leverage editedEntityRecord by adding one more edit signifying a "checkpoint". It would work like this:

  1. Edits made before the checkpoint are frozen for as long as the checkpoint is in place.
  2. A successful save discards all the edits before the checkpoint and replaces the entity record with the server response.
  3. A failed save simply removes the checkpoint; no rollback is needed.
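To make the checkpoint idea more concrete, here is a rough sketch over a simplified local model (a saved base record plus a stack of edits). The function names are illustrative, not existing core-data reducers:

```js
// Simplified model: state = { record, edits }, where edits is a list of
// partial objects applied on top of the saved record.
const CHECKPOINT = Symbol( 'checkpoint' );

function startSave( state ) {
	// Freeze everything edited so far behind a checkpoint marker.
	return { ...state, edits: [ ...state.edits, CHECKPOINT ] };
}

function finishSave( state, serverRecord ) {
	// Successful save: drop edits up to (and including) the checkpoint and
	// adopt the server response as the new saved record.
	const index = state.edits.indexOf( CHECKPOINT );
	return {
		record: serverRecord,
		edits: state.edits.slice( index + 1 ),
	};
}

function failSave( state ) {
	// Failed save: just remove the checkpoint; the pending edits stay as-is,
	// so no rollback is needed.
	const index = state.edits.indexOf( CHECKPOINT );
	return {
		...state,
		edits: [
			...state.edits.slice( 0, index ),
			...state.edits.slice( index + 1 ),
		],
	};
}

// Reading the "edited" record applies the pending edits on top of the base.
function getEditedRecord( { record, edits } ) {
	return edits
		.filter( ( edit ) => edit !== CHECKPOINT )
		.reduce( ( acc, edit ) => ( { ...acc, ...edit } ), record );
}
```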

One other solution is to assume core/data is a low-level API that just does what it's told and shift the burden of concurrency control onto the consumer. As in: assume it's the developer's responsibility to avoid conflicting select() and dispatch() calls. Even in this scenario, core/data gets in its own way by triggering resolvers when it shouldn't. Also, the pitfalls are very generic so the logic implemented by each and every consumer would be almost the same. An alternative would be to build a higher-level API on top of core/data that would understand locking and concurrency.

Long term considerations

The above would solve the problem for now and may even suffice for a bit longer. It has some downsides though:

  1. Stacking multiple conflicting operations may take a long time to process
  2. It doesn't address conflict resolution, e.g. the server receiving conflicting updates from another user. This would be really nice to address, but it doesn't seem to be a blocker for now. These explorations could be part of the multi-user concurrent editing project too - it's a closely related problem and could potentially require rethinking how core-data works. Here's a bunch of related acronyms that could come in handy later on: MVCC, OT, CRDT.

While 1 could potentially be addressed by 2, there is another simple solution: squash enqueued operations when possible. E.g. if there are two updates waiting to be processed, we could perform just one. Three updates and a delete? Perform only the delete.
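For illustration, squashing could look roughly like this, assuming a hypothetical per-record queue of { type, changes } operations:

```js
// Collapse a per-record queue of pending operations before processing it.
function squashQueue( queue ) {
	// A delete anywhere in the queue makes the preceding updates moot.
	if ( queue.some( ( op ) => op.type === 'delete' ) ) {
		return [ { type: 'delete' } ];
	}
	// Otherwise merge all pending updates into a single one.
	const changes = queue.reduce(
		( acc, op ) => ( { ...acc, ...op.changes } ),
		{}
	);
	return queue.length > 1 ? [ { type: 'update', changes } ] : queue;
}
```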

Re-use fetched data

  • Selecting a specific entity record could re-use one fetched earlier from the list API endpoint. As in: getEntityRecords( 'postType', 'book' ); followed by getEntityRecord( 'postType', 'book', 4 ); could ideally trigger just a single request - the first one to /wp/v2/books. Ideally that would be the case even if the list request is still in progress (a rough sketch follows after this list).
  • Alternatively, resolving a list of entity records could also resolve the specific records it contains, although I'm not sure if that would fix the problem entirely.
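A rough sketch of the first idea, using a hypothetical in-memory cache shared between a list resolver and a single-record resolver - not the actual core-data resolvers:

```js
import apiFetch from '@wordpress/api-fetch';

// Hypothetical cache shared by both resolvers.
const recordsById = new Map();

async function resolveBookList() {
	const books = await apiFetch( { path: '/wp/v2/books' } );
	books.forEach( ( book ) => recordsById.set( book.id, book ) );
	return books;
}

async function resolveBook( id ) {
	// Re-use a record already delivered by the list request instead of
	// issuing a second GET to /wp/v2/books/<id>.
	if ( recordsById.has( id ) ) {
		return recordsById.get( id );
	}
	const book = await apiFetch( { path: `/wp/v2/books/${ id }` } );
	recordsById.set( id, book );
	return book;
}
```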

cc @youknowriad @mcsf @gziolo @draganescu @noisysocks @talldan @tellthemachines @kevin940726 @jorgefilipecosta @mtias @samueljseay @ellatrix @TimothyBJacobs

adamziel added the [Package] Data and [Package] Core data labels Oct 20, 2020

adamziel commented Oct 20, 2020

Also, I'm acting as a reporter here, but I'm happy to spin up a PR (or many PRs) exploring a solution.

adamziel changed the title from "Race conditions in resolution logic" to "Race conditions in core/data" Oct 20, 2020
adamziel changed the title from "Race conditions in core/data" to "Race conditions in core-data" Oct 20, 2020

adamziel commented Oct 21, 2020

I considered different strategies for removing the interference between resolvers and did a pairwise analysis of how different types of API operations interact when they run concurrently:

Concurrent API operations on two entity records

Same type, different IDs

[Screenshot: pairwise analysis, same type, different IDs (2020-10-21 12:26:11)]

Same type, same ID

[Screenshot: pairwise analysis, same type, same ID (2020-10-21 12:58:17)]

Record operations vs list operations

[Screenshot: pairwise analysis, record vs list operations (2020-10-21 12:54:44)]

"Partial results" here means we need to re-request the data after the writes have finished:

[Screenshot: pairwise analysis, partial results (2020-10-21 12:27:14)]

Some record vs list scenarios are "optimistically negotiable" with various degrees of complexity. For example:

  • Updating a record and fetching a list of results concurrently could work if we optimistically mask any server results with local changes until the update is finished. Keeping track of version numbers or timestamps would make it more reliable.
  • Similarly, deleting a record and fetching a list concurrently could work if we keep track of deletions in progress and filter out the corresponding records in selectors (see the sketch after this list).
  • Deleting a record and fetching a page (number=2, size=10) of results concurrently could work if we're okay with having only 9 results for a moment and then re-fetching the page once the delete is complete. There is some added complexity if a request for page 3 was completed after the delete was finished but before re-requesting page 2 was completed. Using a cursor for pagination would make it easier.
  • Inserting a record while fetching a list of results - we could guess where the new record fits in the list of results returned by the server. I'm not sure how reliable this one would be, though.
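As an illustration of the second bullet above, a selector could mask deletions in progress roughly like this (the state shape is hypothetical, not an existing core-data selector):

```js
// Hypothetical state shape: state.records and state.deletionsInProgress are
// both keyed by entity kind and name.
function getVisibleRecords( state, kind, name ) {
	const records = state.records[ kind ]?.[ name ] ?? [];
	const deleting = state.deletionsInProgress[ kind ]?.[ name ] ?? new Set();
	// Hide records whose DELETE request is still in flight so a concurrent
	// list fetch cannot resurrect them in the UI.
	return records.filter( ( record ) => ! deleting.has( record.id ) );
}
```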

That being said, I don't think the initial fix needs to include any of that.

adamziel changed the title from "Race conditions in core-data" to "Race conditions and data consistency problems in core-data" Oct 21, 2020
adamziel changed the title from "Race conditions and data consistency problems in core-data" to "Race conditions and lack of data consistency in core-data" Oct 21, 2020
adamziel changed the title from "Race conditions and lack of data consistency in core-data" to "Concurrency control in core-data" Oct 21, 2020
adamziel added the [Package] Edit Widgets and [Package] Edit Site labels Oct 21, 2020
adamziel self-assigned this Oct 21, 2020

noisysocks commented:

Tough problem! Thanks for the clear write-up, though.

Are there any other libraries that we can look to for inspiration?

It sounds like we need to make it so that, instead of calling apiFetch() directly, actions like getEntityRecord() and saveEntityRecord() add an apiFetch() request to a queue. Each entity has its own queue. Requests in the queue are performed one at a time. Some subsequences of requests in the queue can be done in parallel, e.g. multiple GET requests. Some subsequences of requests in the queue can be collapsed into one request, e.g. multiple PUT /books/1 requests.
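A bare-bones sketch of what such a per-entity queue could look like; the shape is hypothetical, and collapsing duplicate requests or running adjacent GETs in parallel is left out for brevity:

```js
// Each entity record gets its own queue; queued requests run one at a time.
function createEntityQueue() {
	let tail = Promise.resolve();

	return {
		enqueue( makeRequest ) {
			// Chain each request after the previous one so operations on the
			// same entity never overlap, regardless of success or failure.
			const result = tail.then( makeRequest, makeRequest );
			tail = result.catch( () => {} );
			return result;
		},
	};
}

// One queue per entity record, e.g. keyed by 'postType/book/1'.
const queues = new Map();

function enqueueForRecord( key, makeRequest ) {
	if ( ! queues.has( key ) ) {
		queues.set( key, createEntityQueue() );
	}
	return queues.get( key ).enqueue( makeRequest );
}
```

Actions would then call something like enqueueForRecord( 'postType/book/1', () => apiFetch( /* … */ ) ) instead of calling apiFetch() directly.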

Regarding "Stacking multiple conflicting operations may take a long time to process":

How common is this? Maybe it's not such a bad trade-off.


adamziel commented Oct 22, 2020

@noisysocks my thoughts exactly! Only I wouldn't enqueue just network requests but everything asynchronous or affected by asynchronicity. If only the network requests themselves are stacked, there are fewer possible outcomes, but there are still many. For example:

[Screenshot: example interference chart with only network requests queued (2020-10-22 14:21:41)]

Note how the failed operation overwrote the result of the successful one. I can come up with more examples like that. The point is that to get rid of interference, I would consider treating entire segments of code as "critical sections" or "atomic operations":

[Screenshot: critical sections / atomic operations chart (2020-10-22 14:35:37)]

This makes things conceptually simple by taking all the asynchronicity out of the equation - interfering operations are always executed serially and the outcomes are easily predictable as nothing else updates parts of the state they depend on.
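As a sketch of what an "atomic" save could look like when wrapped in an exclusive lock (re-using the lock helper sketched earlier in this thread; the apiFetch call and the books endpoint are illustrative):

```js
import apiFetch from '@wordpress/api-fetch';

// 'lock' is the createRwLock() instance sketched earlier; /wp/v2/books is the
// hypothetical example entity.
async function saveBookAtomically( lock, edits ) {
	return lock.withExclusiveLock( async () => {
		// Nothing else - no resolver reads, no other writes on this record -
		// runs between entering this block and leaving it.
		return apiFetch( {
			path: `/wp/v2/books/${ edits.id }`,
			method: 'PUT',
			data: edits,
		} );
	} );
}
```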


adamziel commented:

I'm exploring the idea of atomic operations and locks in #26389


adamziel commented Nov 3, 2020

Surfacing this comment here:

#26627 and #26575 improved the situation here. There is much less flickering, although there is still some of it as seen below:

[Screen recording: "core data issue", showing the remaining flickering]

I think locks would be the ultimate solution here. The downside is that they would add a layer of complexity to an already complex core-data so I'm still trying to come up with some alternatives. But for reference, see the same interaction with #26389 applied:

[Screen recordings: the same interaction with #26389 applied (2020-11-03 12:41)]

gziolo added the [Type] Bug label Nov 8, 2020

adamziel commented:

#26389 addressed the bulk of this problem. It would still be amazing to implement a lock-less solution or be more "optimistic" about different operations, but the bug part of this issue is now fixed 🎉
