
[WIP]: feat(gatsby): use lokijs for in-memory database #9338

Closed
wants to merge 35 commits

Conversation

Moocar
Contributor

@Moocar Moocar commented Oct 24, 2018

Draft. Don't merge!! Feedback requested. Questions are at the bottom.

Summary

One of the most expensive bootstrap phases in Gatsby for large sites is the GraphQL querying phase. There are two aspects to this:

  1. Custom field resolvers (e.g. sharp). These will hopefully be improved by running queries in parallel (Research running queries in parallel across cores #8400)
  2. Queries are brute force and don't use indexing

This PR addresses 2, querying.

Gatsby stores all its nodes in a redux namespace (`nodes`). To query, it uses sift. Sift doesn't have any concept of indexing, so it performs a brute-force search, iterating over all nodes until it finds a match (or matches). The same applies to sorting. This is very slow on large sites.

This PR speeds up query performance for 10k page sites by about 2x and 20k sites by about 5x. See recording.md for results.

Loki

The main change is to replace the redux `nodes` namespace with lokijs, and then at query time, use loki to query instead of sift. In redux, nodes are stored in one giant array, whereas with loki, I have divided them up into one collection per node type (`internal.type`). This is because almost everywhere in Gatsby, we want to query the nodes of a single type, and almost never all nodes in the system. It means that plugins that currently filter through all nodes looking for those of a certain type can instead call `getNodesByType()`, which is an O(1) operation.
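The per-type collection idea can be sketched in plain JS (illustrative only, using a `Map` rather than loki's actual collection objects):

```javascript
// Illustrative sketch, not Gatsby's actual code: one collection
// per `internal.type` makes a type lookup O(1), versus an O(n)
// scan over one giant array of all nodes.
const collections = new Map()

function createNode(node) {
  const type = node.internal.type
  if (!collections.has(type)) {
    collections.set(type, [])
  }
  collections.get(type).push(node)
}

function getNodesByType(type) {
  // Return an empty array for unknown types so callers don't
  // have to handle a null case.
  return collections.get(type) || []
}

createNode({ id: `1`, internal: { type: `MarkdownRemark` } })
createNode({ id: `2`, internal: { type: `ContentfulAsset` } })
getNodesByType(`ContentfulAsset`).length // 1
```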

Differences

Like sift, loki's query syntax matches MongoDB's. There is one major difference, though: sift can query nested fields using nested data structures, e.g. `{ foo: { bar: { eq: "1" } } }`, whereas loki uses dotted string paths, e.g. `{ "foo.bar": { eq: "1" } }`.
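A rough sketch of translating between the two shapes (a hypothetical helper, not part of this PR; mongo-style filters spell the operators `$eq`, `$gt`, etc.):

```javascript
// Hypothetical helper: flatten a nested sift/mongo-style filter
// into loki's dotted-path form. An object whose keys are all
// operators (`$eq`, `$gt`, ...) is treated as a leaf.
function toDottedPaths(filter, prefix = ``) {
  const result = {}
  for (const [key, value] of Object.entries(filter)) {
    const path = prefix ? `${prefix}.${key}` : key
    const isPlainObject =
      value !== null && typeof value === `object` && !Array.isArray(value)
    const isOperatorLeaf =
      isPlainObject && Object.keys(value).every(k => k.startsWith(`$`))
    if (isPlainObject && !isOperatorLeaf) {
      // Recurse into nested fields, extending the dotted path
      Object.assign(result, toDottedPaths(value, path))
    } else {
      result[path] = value
    }
  }
  return result
}

toDottedPaths({ foo: { bar: { $eq: `1` } } })
// → { "foo.bar": { $eq: "1" } }
```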

Nested indexes

Loki doesn't support indexes on nested (dotted) fields. I've addressed this in techfort/LokiJS#718. Hopefully it will be merged soon; if not, we can create a fork and publish that for now.

Persistence

Loki is an in-memory DB but supports file serialization. It offers the same autosave functionality that we've built into `redux/index.js:saveState()`, but is supposedly even faster. If we move pages, dataDependencies etc. into loki, we should be able to remove most of that functionality.

Another added benefit is that saves only occur if data is dirty.

Query Benchmark

Added a benchmark project to test query performance.

Questions

  • I've called the loki create/update/delete functions from inside the nodes reducer. Any thoughts on that? I can't think of anything "wrong" with it, but it feels icky. I'm tempted to instead bypass redux entirely, call `db/index.js` directly from `actions.js`, and then emit a `CREATE_NODE` event.
  • Assuming there is agreement on this PR, should we move pages, componentDataDependencies and other redux namespaces into loki instead? Benefits would include dirty flagged persistence, and potential querying speed up
  • I've included a flag `useQueryIndex` which can be set on the query context. This tells loki to create indexes for all fields in the query. It allows users to create indexes where applicable, which for most sites will only matter for node groups larger than about 1000 nodes. The other option is that we do something more intelligent to auto-set indexes. E.g. flag groups of nodes that are > 1000, and if a query occurs against them, index the query's fields.
  • All nodes inserted into loki have a $loki field. And nodes are mutable, so that field will leak into the rest of Gatsby. It hasn't been an issue so far, but I'm curious if folks think it will be a problem. We could remove the field after query results are returned, but we'd have to clone the whole object which would be a performance penalty.
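If we did decide to strip loki's bookkeeping fields from results, the cost is the clone, e.g. (hypothetical helper):

```javascript
// Hypothetical: shallow-clone a result document minus loki's
// bookkeeping fields (`$loki`, `meta`). On large result sets,
// this clone is the performance penalty mentioned above.
function stripLokiMeta(doc) {
  const { $loki, meta, ...rest } = doc
  return rest
}

stripLokiMeta({ id: `1`, title: `hi`, $loki: 42, meta: { revision: 0 } })
// → { id: "1", title: "hi" }
```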

Note: the tests are failing. I'm working on fixing them, but wanted to get the PR up now.

@Moocar Moocar requested review from a team as code owners October 24, 2018 02:01
@DSchau DSchau changed the title use lokijs for in-memory DB(Draft, don't merge) [WIP]: use lokijs for in-memory DB(Draft, don't merge) Oct 24, 2018
@DSchau DSchau changed the title [WIP]: use lokijs for in-memory DB(Draft, don't merge) [WIP]: feat(gatsby): use lokijs for in-memory database Oct 24, 2018
Contributor

@DSchau DSchau left a comment

This looks good! A lot going on here, so I'd need to do a deeper dive before we'd get this merged, and obviously because it's a WIP!

Left some comments, mostly on lodash.

- loki without indexes is overall slightly faster than master, except when there are many types
- loki with indexes is about 2x faster on sites with 10k pages, and 5x faster with 20k pages. But is ever so slightly slower when those pages are split across 100 types.

Overall, loki is a big win for sites with lots of pages of the same type. For smaller sites, the difference is negligible.
Contributor

A caching PR I recently worked on had similar findings, which I think is totally reasonable. If it's mostly the same for smaller sites but it makes Gatsby much more scalable I think that's a big win.

I honestly wonder if using lokijs consistently (e.g. for cache and internal Gatsby utilities) would be worthwhile. I'd lean towards yes!

Contributor

The fs fallback is interesting, since I used a similar approach for the cache! Basically I kept 100-200 items in memory, and then it fell back to the filesystem as its cache, which seems to work pretty well for large sites.

Contributor

@Moocar did you measure how much memory is used in these different benchmarks?

@DSchau same question?

Contributor

@DSchau DSchau Oct 24, 2018

@KyleAMathews not explicitly (although I can), but I used the upper bound at which it ran out of memory as the basis for choosing the number of items. It's dependent on machine setup, the size of the items being cached, etc. too.

Contributor Author

@DSchau Thanks for all the feedback! For caching, using loki would be a step up from using pure arrays, but an even bigger win would probably be to use a real caching library that also provides persistence. One that has LRU, paging to disk, and other cache-specific semantics, assuming one exists! (I haven't looked)

Contributor Author

@KyleAMathews Memory usage should be less since we're actually able to remove some caches. But I'll double check. What's your favorite memory profiling tool?

Contributor

That's what I assume too but it's nice to know for sure.

There's probably something on npm but you can also add a setInterval in the code and log out the memory usage every .1 seconds or whatever. You could also log out different events too e.g. when query running starts etc. so you could visualize things easily.
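That suggestion could look something like this (a sketch; the phase names and CSV shape are made up for illustration):

```javascript
// Sketch of the suggestion above: periodically log heap usage,
// tagged with the current build phase, so memory can be graphed
// over time. Phase names here are illustrative.
function startMemoryLogger(intervalMs = 100) {
  let phase = `bootstrap`
  const timer = setInterval(() => {
    const { heapUsed, rss } = process.memoryUsage()
    // timestamp,phase,heapUsed,rss — easy to load into a spreadsheet
    console.log(`${Date.now()},${phase},${heapUsed},${rss}`)
  }, intervalMs)
  return {
    setPhase: name => { phase = name }, // e.g. when query running starts
    stop: () => clearInterval(timer),
  }
}
```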

const nodes = getNodes()

if (!nodes.some(n => n.internal.type === `ContentfulAsset`)) {
const nodes = getNodesByType(`ContentfulAsset`)
Contributor

I'll look through the implementation of this as I go through, but would this be worth returning an empty array or something so we can resolve the empty case more easily?

Contributor

Yeah it would save us all of those checks below

Contributor Author

Yep! Good call. I'll return the empty collection

})
} catch (e) {
report.error(
`Error starting DB. Perhaps try deleting ${path.dirname(dbSaveFile)}`
Contributor

🤔 probably worth just doing it automatically, or at least prompting?

Contributor Author

My thoughts too! Glad you agree. I'll implement it

@@ -6,6 +6,7 @@ const crypto = require(`crypto`)
const glob = require(`glob`)
const { store } = require(`../../redux`)
const existsSync = require(`fs-exists-cached`).sync
const createNodeId = require(`../../utils/create-node-id`)
Contributor

@pieh we need a gatsby-utils stat :) Something one of us can work on soon!

*/
function deleteAllCollections() {
if (db) {
_.forEach(db.listCollections(), collInfo => {
Contributor

Might be worth benchmarking _.forEach vs. native forEach (in general, not just here!). I tend to prefer to defer to native array methods, but I may be unique in that regard.

I think that it's a safer bet to presume they'll be optimized very efficiently, if they're not already!

Contributor Author

Cool, I'm actually a relative newcomer to js so any tips like this are greatly appreciated. I may have gone a bit lodash mad in this PR as I come from a functional background. Is this the native forEach you mean? https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...of

Contributor

Silly question, is db.listCollections() an object or an array? I presumed array, but on second glance, it could actually be an object?

If it's the latter, then this _.forEach may actually be preferable!

I'm about to board a plane, but I'll add some more feedback on this tomorrow. Nice work!!

Contributor Author

Sorry, forgot to reply. It's an array. I've been using for (const foo of someArray) and I'm definitely preferring it

* by node.id
*/
function updateNode(node, oldNode) {
invariant(node.internal, `node has no "internal" field`)
Contributor

I like the use of invariant here (and elsewhere!)

Contributor Author

Yeah, my approach with invariants like this is to be quite liberal in their application, and then perhaps remove them in the future if this code becomes very stable.

if (!oldNode) {
oldNode = getNode(node.id)
}
const updateNode = _.merge(oldNode, node)
Contributor

Is this doing anything an Object.assign wouldn't do? (again, sorry, I just don't always love lodash!)

Contributor

Should we actually merge oldNode with the new one? This might cause weird effects, e.g. fields deleted from a node would not actually be deleted in the db.

Contributor Author

@pieh Great catch again. I just looked at the redux implementation and all it does is overwrite the existing node in redux. Not sure why I did a merge here. I'll change to overwrite.
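For reference, the effect pieh describes can be shown with a naive deep merge (illustrative, standing in for `_.merge`; node shapes are made up):

```javascript
// Illustrative: a deep merge keeps fields the new node dropped,
// whereas a plain overwrite reflects the deletion.
function naiveDeepMerge(target, source) {
  for (const [key, value] of Object.entries(source)) {
    if (value !== null && typeof value === `object` && !Array.isArray(value)) {
      target[key] = naiveDeepMerge(target[key] || {}, value)
    } else {
      target[key] = value
    }
  }
  return target
}

const oldNode = { id: `1`, fields: { slug: `/a/`, draft: true } }
const newNode = { id: `1`, fields: { slug: `/a/` } } // `draft` was deleted

const merged = naiveDeepMerge(JSON.parse(JSON.stringify(oldNode)), newNode)
merged.fields.draft // still true: the deletion was silently undone
// An overwrite (what the redux implementation does) drops it as intended.
```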

*/
function getNodesByType(typeName) {
const coll = db.getCollection(typeName)
if (!coll) return null
Contributor

Yeah, maybe just return empty array here? Makes it easier for a consumer of this "API" to handle the empty and null cases the same way.

Contributor Author

addressed in other comment

Contributor

@pieh pieh left a comment

This looks very promising! It would be great if we could keep the current implementation while this is being worked on, and inject the new one behind some kind of feature flag.

}

function chooseQueryEngine(queryArgs) {
if (hasPluginFields(queryArgs)) {
Contributor

This will definitely need to be more robust. Our filter schema support is not really that great, but even now we can filter by linked nodes with the ___NODE convention (which needs a custom resolver to access the underlying node).

Contributor Author

Great catch. I didn't notice the linked field example. Can you think of any other types of fields that might need to be resolved too? How about mapping fields?

When @KyleAMathews and I were discussing this PR, we decided that the vast majority of sites would not be using pluginFields. I assume they also aren't using linked fields. But I could be wrong in that assumption

Contributor

We currently don't support filtering on mapped fields (that's what I meant when I said support isn't great), but it's something we would definitely want to be able to do (here's one of the issues where people mention they can't really filter by mapped fields: #7251; I have a WIP branch for this: pieh@ccd110c).

For now, only ___NODE fields would be a problem. Worth looking into date or local File fields, which do have custom resolvers for output, but we don't create a correct filter schema for those right now.

When we create schema type fields with custom resolvers we should track those fields (similar to how you track fields from plugins)

Contributor Author

I hate the "resolve all the things" approach of run-sift, but I get why it exists. It's a shame, because the second we have to resolve all nodes, we lose any ability to perform index based lookups.

For plugin filter fields, I think this will always be the case since their resolvers can do anything. But I wonder if there's an elegant solution for linked and mapped nodes. If we were using a graph database, then we would already have the link setup.

One way or another, we'll need to find a proper solution eventually, because I imagine many sites will want to take advantage of linked and mapped filtering.


@Moocar
Contributor Author

Moocar commented Oct 24, 2018

@pieh I love the idea of feature flagging this. I'm kicking myself for not writing all this code with that in mind. I'll have a think about how hard it would be to do that

// if we can query by pure data or whether we have to resolve the
// fields first. See `./run-query.js`
_.forEach(inferredFieldNames, fieldName => {
pluginFieldTracking.add(fieldName)
Contributor

I think we should namespace field names. If one node type uses a resolver for field `foo`, we don't want to revert to sift when we query another type and use a field with the same `foo` name.
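A sketch of the namespacing pieh suggests (names here are illustrative, not the PR's actual code):

```javascript
// Illustrative: track plugin-resolved fields keyed by node type,
// so a `foo` resolver on one type doesn't force sift for every
// type that happens to have a `foo` field.
const pluginFieldTracking = new Set()

function trackPluginField(typeName, fieldName) {
  pluginFieldTracking.add(`${typeName}.${fieldName}`)
}

function hasPluginFields(typeName, filterFields) {
  return filterFields.some(f => pluginFieldTracking.has(`${typeName}.${f}`))
}

trackPluginField(`MarkdownRemark`, `foo`)
hasPluginFields(`MarkdownRemark`, [`foo`]) // true — must fall back to sift
hasPluginFields(`ContentfulAsset`, [`foo`]) // false — loki can handle it
```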

// Allow page creators to specify that they want indexes
// automatically created for their pages.
if (context.useQueryIndex) {
ensureIndexes(coll, lokiArgs)
Contributor

How expensive is this when indexes have already been applied? We will definitely run the same query (with a different context) when we use the same template for multiple pages.

This also has the potential to keep unneeded indexes around, for example when we modify filters (remove some fields from a filter), but I'm not sure if it's worth keeping track of that.

Contributor Author

@pieh loki handles the case of the index already existing so there's no performance penalty.

But yes, this will keep old indexes around if filters are modified. My solution here for applying queries is very basic, but I think there's a lot more we could do here over time (see original PR description). My suggestion would be to keep as is, but iterate on this based on feedback
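For context, deriving index candidates from a mongo-style filter could look like this (a sketch only; not the PR's actual `ensureIndexes` implementation):

```javascript
// Illustrative: every top-level dotted key in a flattened
// mongo-style filter is a candidate field for an index;
// operator keys like `$and` / `$or` are skipped.
function indexFields(lokiArgs) {
  return Object.keys(lokiArgs).filter(key => !key.startsWith(`$`))
}

indexFields({ "frontmatter.date": { $lte: `2018-10-24` }, $and: [] })
// → ["frontmatter.date"]
```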

pieh pushed a commit that referenced this pull request Oct 29, 2018
In preparation for #9338, I've added a new nodes db module that can be implemented by an appropriate db nodes backend, which is only redux as of now. The loki PR will eventually be implemented so that it can be feature-flagged on/off using this functionality.

I also added `getNodesByType` since it makes sense even without loki.
@kakadiadarpan kakadiadarpan mentioned this pull request Oct 30, 2018
pieh pushed a commit that referenced this pull request Oct 31, 2018
**Builds on #9416. Merge that one before this.**

In addition to performing the queries, `run-sift.js` also currently converts the results into connections (if it's a connection query). But as part of #9338, we will be using `loki` to perform queries too, so we would need to duplicate the connection logic in both query engines.

This PR pulls the connection logic out of run-sift and into `build-node-connections`, which makes sense to me, as the query engine doesn't need to know about connections, only whether it should return one result or many (thus the `firstOnly` flag).

Another benefit is that the query engine no longer needs to understand page dependencies.
pieh pushed a commit that referenced this pull request Nov 29, 2018
This PR adds a feature flag `GATSBY_DB_NODES` that can be used to change the storage engine for the Gatsby data layer (`nodes`):

- `redux` (default), which uses the existing redux implementation and `sift` for querying
- `loki`, which uses the loki in-memory database to store and query

This PR re-implements functionality in #9338, but with all tests passing and addressing previous feedback. It should also be easier to review since it builds on several refactorings. 

Some things to note:

1. I [submitted a PR to lokijs](techfort/LokiJS#718) which still hasn't been merged, though the author says he'll start working on it soon. Therefore, in the interim, I've published [@moocar/lokijs](https://www.npmjs.com/package/@moocar/lokijs).
1. I haven't implemented auto indexing of query fields yet. I'll attack that next.
1. I suggest we ask the community to try out the feature flag once it's merged to get feedback. All the tests pass, but this is such a big change that we'll want to test it gradually
1. While loki uses the same mongo query language as sift, they do have different behavior. Most of my time on this PR was spent ensuring that loki behaves **exactly** like sift. See [db/loki/nodes-query.js](https://github.com/gatsbyjs/gatsby/blob/cddbe893a4ce638babb1cbe5e5da4c13b6f5e57d/packages/gatsby/src/db/loki/nodes-query.js). But there's a chance a few edge cases have slipped through the cracks.
1. The feature flag works with the tests too: `GATSBY_DB_NODES=loki yarn test`. We should perhaps look into running this on all PRs.
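Backend selection from the flag described above could be as simple as this (a sketch; the actual module wiring in Gatsby is omitted):

```javascript
// Illustrative: pick the nodes backend from the GATSBY_DB_NODES
// environment variable, defaulting to redux and rejecting
// unknown values early.
function chooseNodesDb(env = process.env) {
  const backend = env.GATSBY_DB_NODES || `redux`
  if (backend !== `redux` && backend !== `loki`) {
    throw new Error(`Unknown GATSBY_DB_NODES value: ${backend}`)
  }
  return backend
}

chooseNodesDb({}) // "redux"
chooseNodesDb({ GATSBY_DB_NODES: `loki` }) // "loki"
```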
@pieh
Contributor

pieh commented Nov 30, 2018

I'll close this PR, as I think everything from here was already merged in more incremental chunks over the past couple of weeks.

@pieh pieh closed this Nov 30, 2018