
[WIP]: feat(gatsby): use lokijs for in-memory database #9338

Closed
wants to merge 35 commits

Conversation

Moocar
Contributor

@Moocar Moocar commented Oct 24, 2018

Draft. Don't merge!! Feedback requested. Questions are at the bottom.

Summary

One of the most expensive bootstrap phases in Gatsby for large sites is the GraphQL querying phase. There are two aspects to this:

  1. Custom field resolvers (e.g. sharp). These will hopefully be improved by running queries in parallel (Research running queries in parallel across cores #8400)
  2. Queries are brute force and don't use indexing

This PR addresses 2, querying.

Gatsby stores all its nodes in a redux namespace (`nodes`). To query, it uses sift. Sift doesn't have any concept of indexing, so it performs a brute-force search, iterating over all nodes until it finds a match (or matches). The same applies to sorting. This is very slow on large sites.

This PR speeds up query performance for 10k page sites by about 2x and 20k sites by about 5x. See recording.md for results.

Loki

The main change is to replace the redux `nodes` namespace with lokijs, and then at query time, use loki to query instead of sift. In redux, nodes are stored in one giant array, whereas with loki, I have divided them up into one collection per node type (`internal.type`). This is because almost everywhere in Gatsby, we want to query the nodes of a single type, and almost never all nodes in the system. It means that plugins that currently filter through all nodes looking for those of a certain type can instead call `getNodesByType()`, which is an O(1) operation.
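The per-type collection idea can be sketched in plain JS (illustrative only, using a `Map` rather than loki's actual collection objects):

```javascript
// Illustrative sketch, not Gatsby's actual code: one collection
// per `internal.type` makes a type lookup O(1), versus an O(n)
// scan over one giant array of all nodes.
const collections = new Map()

function createNode(node) {
  const type = node.internal.type
  if (!collections.has(type)) {
    collections.set(type, [])
  }
  collections.get(type).push(node)
}

function getNodesByType(type) {
  // Return an empty array for unknown types so callers don't
  // have to handle a null case.
  return collections.get(type) || []
}

createNode({ id: `1`, internal: { type: `MarkdownRemark` } })
createNode({ id: `2`, internal: { type: `ContentfulAsset` } })
getNodesByType(`ContentfulAsset`).length // 1
```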

Differences

Like sift, loki's query syntax matches MongoDB's. There is one major difference, though: sift can query nested fields using nested data structures, e.g. `{ foo: { bar: { eq: "1" } } }`, whereas loki uses dotted string paths, e.g. `{ "foo.bar": { eq: "1" } }`.
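A rough sketch of translating between the two shapes (a hypothetical helper, not part of this PR; mongo-style filters spell the operators `$eq`, `$gt`, etc.):

```javascript
// Hypothetical helper: flatten a nested sift/mongo-style filter
// into loki's dotted-path form. An object whose keys are all
// operators (`$eq`, `$gt`, ...) is treated as a leaf.
function toDottedPaths(filter, prefix = ``) {
  const result = {}
  for (const [key, value] of Object.entries(filter)) {
    const path = prefix ? `${prefix}.${key}` : key
    const isPlainObject =
      value !== null && typeof value === `object` && !Array.isArray(value)
    const isOperatorLeaf =
      isPlainObject && Object.keys(value).every(k => k.startsWith(`$`))
    if (isPlainObject && !isOperatorLeaf) {
      // Recurse into nested fields, extending the dotted path
      Object.assign(result, toDottedPaths(value, path))
    } else {
      result[path] = value
    }
  }
  return result
}

toDottedPaths({ foo: { bar: { $eq: `1` } } })
// → { "foo.bar": { $eq: "1" } }
```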

Nested indexes

Loki doesn't support indexes on nested (dotted) fields. I've addressed this in techfort/LokiJS#718. Hopefully it will be merged soon; if not, we can create a fork and publish that for now.

Persistence

Loki is an in-memory DB but supports file serialization. It offers the same autosave functionality that we've built into `redux/index.js:saveState()`, but is supposedly even faster. If we move pages, dataDependencies etc. into loki, we should be able to remove most of that functionality.

Another added benefit is that saves only occur if data is dirty.

Query Benchmark

Added a benchmark project to test query performance.

Questions

  • I've called the loki create/update/delete functions from inside the nodes reducer. Any thoughts on that? I can't think of anything "wrong" with it, but it feels icky. I'm tempted to instead bypass redux entirely, call `db/index.js` directly from `actions.js`, and then emit a `CREATE_NODE` event.
  • Assuming there is agreement on this PR, should we move pages, componentDataDependencies and other redux namespaces into loki instead? Benefits would include dirty flagged persistence, and potential querying speed up
  • I've included a flag `useQueryIndex` which can be set on the query context. This tells loki to create indexes for all fields in the query. It allows users to create indexes where applicable, which for most sites will only matter for node groups larger than about 1000 nodes. The other option is that we do something more intelligent to auto-set indexes. E.g. flag groups of nodes that are > 1000, and if a query occurs against them, index the query's fields.
  • All nodes inserted into loki have a $loki field. And nodes are mutable, so that field will leak into the rest of Gatsby. It hasn't been an issue so far, but I'm curious if folks think it will be a problem. We could remove the field after query results are returned, but we'd have to clone the whole object which would be a performance penalty.
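If we did decide to strip loki's bookkeeping fields from results, the cost is the clone, e.g. (hypothetical helper):

```javascript
// Hypothetical: shallow-clone a result document minus loki's
// bookkeeping fields (`$loki`, `meta`). On large result sets,
// this clone is the performance penalty mentioned above.
function stripLokiMeta(doc) {
  const { $loki, meta, ...rest } = doc
  return rest
}

stripLokiMeta({ id: `1`, title: `hi`, $loki: 42, meta: { revision: 0 } })
// → { id: "1", title: "hi" }
```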

Note: the tests are failing. I'm working on fixing them, but wanted to get the PR up now.

@Moocar Moocar requested review from a team as code owners October 24, 2018 02:01
@DSchau DSchau changed the title use lokijs for in-memory DB(Draft, don't merge) [WIP]: use lokijs for in-memory DB(Draft, don't merge) Oct 24, 2018
@DSchau DSchau changed the title [WIP]: use lokijs for in-memory DB(Draft, don't merge) [WIP]: feat(gatsby): use lokijs for in-memory database Oct 24, 2018
Contributor

@DSchau DSchau left a comment

This looks good! A lot going on here, so I'd need to do a deeper dive before we'd get this merged, and obviously because it's a WIP!

Left some comments, mostly on lodash.

- loki without indexes is overall slightly faster than master, except when there are many types
- loki with indexes is about 2x faster on sites with 10k pages, and 5x faster with 20k pages. But is ever so slightly slower when those pages are split across 100 types.

Overall, loki is a big win for sites with lots of pages of the same type. For smaller sites, the difference is negligible.
Contributor

A caching PR I recently worked on had similar findings, which I think is totally reasonable. If it's mostly the same for smaller sites but it makes Gatsby much more scalable I think that's a big win.

I honestly wonder if using lokijs consistently (e.g. for cache and internal Gatsby utilities) would be worthwhile. I'd lean towards yes!

Contributor

The fs fallback is interesting, since I used a similar approach for the cache! Basically I kept 100-200 items in memory, and then it fell back to the filesystem as its cache, which seems to work pretty well for large sites.

Contributor

@Moocar did you measure how much memory is used in these different benchmarks?

@DSchau same question?

Contributor

@DSchau DSchau Oct 24, 2018

@KyleAMathews not explicitly (although I can), but I used the upper bound at which it ran out of memory as the basis for choosing the number of items. It's dependent on machine setup, the size of the items being cached, etc. too.

Contributor Author

@DSchau Thanks for all the feedback! For caching, using loki would be a step up from using pure arrays, but an even bigger win would probably be to use a real caching library that also provides persistence. One that has LRU, paging to disk, and other cache-specific semantics, assuming one exists! (I haven't looked)

Contributor Author

@KyleAMathews Memory usage should be less since we're actually able to remove some caches. But I'll double check. What's your favorite memory profiling tool?

Contributor

That's what I assume too but it's nice to know for sure.

There's probably something on npm but you can also add a setInterval in the code and log out the memory usage every .1 seconds or whatever. You could also log out different events too e.g. when query running starts etc. so you could visualize things easily.
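That suggestion could look something like this (a sketch; the phase names and CSV shape are made up for illustration):

```javascript
// Sketch of the suggestion above: periodically log heap usage,
// tagged with the current build phase, so memory can be graphed
// over time. Phase names here are illustrative.
function startMemoryLogger(intervalMs = 100) {
  let phase = `bootstrap`
  const timer = setInterval(() => {
    const { heapUsed, rss } = process.memoryUsage()
    // timestamp,phase,heapUsed,rss — easy to load into a spreadsheet
    console.log(`${Date.now()},${phase},${heapUsed},${rss}`)
  }, intervalMs)
  return {
    setPhase: name => { phase = name }, // e.g. when query running starts
    stop: () => clearInterval(timer),
  }
}
```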

const nodes = getNodes()

if (!nodes.some(n => n.internal.type === `ContentfulAsset`)) {
const nodes = getNodesByType(`ContentfulAsset`)
Contributor

I'll look through the implementation of this as I go through, but would this be worth returning an empty array or something so we can resolve the empty case more easily?

Contributor

Yeah it would save us all of those checks below

Contributor Author

Yep! Good call. I'll return the empty collection

})
} catch (e) {
report.error(
`Error starting DB. Perhaps try deleting ${path.dirname(dbSaveFile)}`
Contributor

🤔 probably worth just doing it automatically, or at least prompting?

Contributor Author

My thoughts too! Glad you agree. I'll implement it

@@ -6,6 +6,7 @@ const crypto = require(`crypto`)
const glob = require(`glob`)
const { store } = require(`../../redux`)
const existsSync = require(`fs-exists-cached`).sync
const createNodeId = require(`../../utils/create-node-id`)
Contributor

@pieh we need a gatsby-utils stat :) Something one of us can work on soon!

*/
function deleteAllCollections() {
if (db) {
_.forEach(db.listCollections(), collInfo => {
Contributor

Might be worth benchmarking _.forEach vs. native forEach (in general, not just here!). I tend to prefer to defer to native array methods, but I may be unique in that regard.

I think that it's a safer bet to presume they'll be optimized very efficiently, if they're not already!

Contributor Author

Cool, I'm actually a relative newcomer to js so any tips like this are greatly appreciated. I may have gone a bit lodash mad in this PR as I come from a functional background. Is this the native forEach you mean? https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...of

Contributor

Silly question, is db.listCollections() an object or an array? I presumed array, but on second glance, it could actually be an object?

If it's the latter, then this _.forEach may actually be preferable!

I'm about to board a plane, but I'll add some more feedback on this tomorrow. Nice work!!

Contributor Author

Sorry, forgot to reply. It's an array. I've been using for (const foo of someArray) and I'm definitely preferring it

* by node.id
*/
function updateNode(node, oldNode) {
invariant(node.internal, `node has no "internal" field`)
Contributor

I like the use of invariant here (and elsewhere!)

Contributor Author

Yeah, my approach with invariants like this is to be quite liberal in their application, and then perhaps remove them in the future if this code becomes very stable.

if (!oldNode) {
oldNode = getNode(node.id)
}
const updateNode = _.merge(oldNode, node)
Contributor

Is this doing anything an Object.assign wouldn't do? (again, sorry, I just don't always love lodash!)

Contributor

Should we actually merge oldNode with the new one? This might cause weird effects, e.g. fields deleted from a node would not actually be deleted in the db.

Contributor Author

@pieh Great catch again. I just looked at the redux implementation and all it does is overwrite the existing node in redux. Not sure why I did a merge here. I'll change to overwrite.
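For reference, the effect pieh describes can be shown with a naive deep merge (illustrative, standing in for `_.merge`; node shapes are made up):

```javascript
// Illustrative: a deep merge keeps fields the new node dropped,
// whereas a plain overwrite reflects the deletion.
function naiveDeepMerge(target, source) {
  for (const [key, value] of Object.entries(source)) {
    if (value !== null && typeof value === `object` && !Array.isArray(value)) {
      target[key] = naiveDeepMerge(target[key] || {}, value)
    } else {
      target[key] = value
    }
  }
  return target
}

const oldNode = { id: `1`, fields: { slug: `/a/`, draft: true } }
const newNode = { id: `1`, fields: { slug: `/a/` } } // `draft` was deleted

const merged = naiveDeepMerge(JSON.parse(JSON.stringify(oldNode)), newNode)
merged.fields.draft // still true: the deletion was silently undone
// An overwrite (what the redux implementation does) drops it as intended.
```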

*/
function getNodesByType(typeName) {
const coll = db.getCollection(typeName)
if (!coll) return null
Contributor

Yeah, maybe just return empty array here? Makes it easier for a consumer of this "API" to handle the empty and null cases the same way.

Contributor Author

addressed in other comment

Contributor

@pieh pieh left a comment

This looks very promising! It would be great if we could keep the current implementation while this is being worked on, and inject the new one behind some kind of feature flag.

}

function chooseQueryEngine(queryArgs) {
if (hasPluginFields(queryArgs)) {
Contributor

This will definitely need to be more robust. Our filter schema support is not really that great, but even now we can filter by linked nodes with the ___NODE convention (which needs a custom resolver to access the underlying node).

Contributor Author

Great catch. I didn't notice the linked field example. Can you think of any other types of fields that might need to be resolved too? How about mapping fields?

When @KyleAMathews and I were discussing this PR, we decided that the vast majority of sites would not be using pluginFields. I assume they also aren't using linked fields. But I could be wrong in that assumption

Contributor

We currently don't support filtering on mapped fields (that's what I meant when I said support isn't great), but it's something we would definitely want to be able to do (here's one of the issues where people mention they can't really filter by mapped fields: #7251; I have a WIP branch for this: pieh@ccd110c).

For now, only ___NODE fields would be a problem. Worth looking into date or local File fields, which do have custom resolvers for output, but we don't create a correct filter schema for those right now.

When we create schema type fields with custom resolvers we should track those fields (similar to how you track fields from plugins)

Contributor Author

I hate the "resolve all the things" approach of run-sift, but I get why it exists. It's a shame, because the second we have to resolve all nodes, we lose any ability to perform index based lookups.

For plugin filter fields, I think this will always be the case since their resolvers can do anything. But I wonder if there's an elegant solution for linked and mapped nodes. If we were using a graph database, then we would already have the link setup.

One way or another, we'll need to find a proper solution eventually, because I imagine many sites will want to take advantage of linked and mapped filtering.


@Moocar
Contributor Author

Moocar commented Oct 24, 2018

@pieh I love the idea of feature flagging this. I'm kicking myself for not writing all this code with that in mind. I'll have a think about how hard it would be to do that

// if we can query by pure data or whether we have to resolve the
// fields first. See `./run-query.js`
_.forEach(inferredFieldNames, fieldName => {
pluginFieldTracking.add(fieldName)
Contributor

I think we should namespace field names. If one node type uses a resolver for field `foo`, we don't want to revert to sift when we query another type and use a field with the same `foo` name.
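A sketch of the namespacing pieh suggests (names here are illustrative, not the PR's actual code):

```javascript
// Illustrative: track plugin-resolved fields keyed by node type,
// so a `foo` resolver on one type doesn't force sift for every
// type that happens to have a `foo` field.
const pluginFieldTracking = new Set()

function trackPluginField(typeName, fieldName) {
  pluginFieldTracking.add(`${typeName}.${fieldName}`)
}

function hasPluginFields(typeName, filterFields) {
  return filterFields.some(f => pluginFieldTracking.has(`${typeName}.${f}`))
}

trackPluginField(`MarkdownRemark`, `foo`)
hasPluginFields(`MarkdownRemark`, [`foo`]) // true — must fall back to sift
hasPluginFields(`ContentfulAsset`, [`foo`]) // false — loki can handle it
```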

// Allow page creators to specify that they want indexes
// automatically created for their pages.
if (context.useQueryIndex) {
ensureIndexes(coll, lokiArgs)
Contributor

How expensive is this when indexes have already been applied? We will definitely run the same query (with a different context) when we use the same template for multiple pages.

This also has the potential to keep unneeded indexes around, for example when we modify filters (remove some fields from a filter), but I'm not sure if it's worth keeping track of that.

Contributor Author

@pieh loki handles the case of the index already existing so there's no performance penalty.

But yes, this will keep old indexes around if filters are modified. My solution here for applying queries is very basic, but I think there's a lot more we could do here over time (see original PR description). My suggestion would be to keep as is, but iterate on this based on feedback
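For context, deriving index candidates from a mongo-style filter could look like this (a sketch only; not the PR's actual `ensureIndexes` implementation):

```javascript
// Illustrative: every top-level dotted key in a flattened
// mongo-style filter is a candidate field for an index;
// operator keys like `$and` / `$or` are skipped.
function indexFields(lokiArgs) {
  return Object.keys(lokiArgs).filter(key => !key.startsWith(`$`))
}

indexFields({ "frontmatter.date": { $lte: `2018-10-24` }, $and: [] })
// → ["frontmatter.date"]
```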

pieh pushed a commit that referenced this pull request Oct 29, 2018
In preparation for #9338, I've added a new nodes db module that can be implemented by an appropriate db nodes backend, which is only redux as of now. The loki PR will eventually be implemented so that it can be feature-flagged on/off using this functionality.

I also added `getNodesByType` since it makes sense even without loki.
@kakadiadarpan kakadiadarpan mentioned this pull request Oct 30, 2018
pieh pushed a commit that referenced this pull request Oct 31, 2018
**Builds on #9416. Merge that one before this.**

In addition to performing the queries, `run-sift.js` also currently converts the results into connections (if it's a connection query). But as part of #9338, we will be using `loki` to perform queries too, so we would need to duplicate the connection logic in both query engines.

This PR pulls the connection logic out of run-sift and into `build-node-connections`, which makes sense to me, as the query engine doesn't need to know about connections, only whether it should return one result or many (thus the `firstOnly` flag).

Another benefit is that the query engine no longer needs to understand page dependencies.
pieh pushed a commit that referenced this pull request Nov 29, 2018
This PR adds a feature flag `GATSBY_DB_NODES` that can be used to change the storage engine for the Gatsby data layer (`nodes`):

- `redux` (default), which uses the existing redux implementation and `sift` for querying
- `loki`, which uses the loki in-memory database to store and query

This PR re-implements functionality in #9338, but with all tests passing and addressing previous feedback. It should also be easier to review since it builds on several refactorings. 

Some things to note:

1. I [submitted a PR to lokijs](techfort/LokiJS#718) which still hasn't been merged, though the author says he'll start working on it soon. Therefore, in the interim, I've published [@moocar/lokijs](https://www.npmjs.com/package/@moocar/lokijs).
1. I haven't implemented auto indexing of query fields yet. I'll attack that next.
1. I suggest we ask the community to try out the feature flag once it's merged to get feedback. All the tests pass, but this is such a big change that we'll want to test it gradually
1. While loki uses the same mongo query language as sift, they do have different behavior. Most of my time on this PR was spent ensuring that loki behaves **exactly** like sift. See [db/loki/nodes-query.js](https://github.com/gatsbyjs/gatsby/blob/cddbe893a4ce638babb1cbe5e5da4c13b6f5e57d/packages/gatsby/src/db/loki/nodes-query.js). But there's a chance a few edge cases have slipped through the cracks.
1. The feature flag works with the tests too: `GATSBY_DB_NODES=loki yarn test`. We should perhaps look into running this on all PRs.
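Backend selection from the flag described above could be as simple as this (a sketch; the actual module wiring in Gatsby is omitted):

```javascript
// Illustrative: pick the nodes backend from the GATSBY_DB_NODES
// environment variable, defaulting to redux and rejecting
// unknown values early.
function chooseNodesDb(env = process.env) {
  const backend = env.GATSBY_DB_NODES || `redux`
  if (backend !== `redux` && backend !== `loki`) {
    throw new Error(`Unknown GATSBY_DB_NODES value: ${backend}`)
  }
  return backend
}

chooseNodesDb({}) // "redux"
chooseNodesDb({ GATSBY_DB_NODES: `loki` }) // "loki"
```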
@pieh
Contributor

pieh commented Nov 30, 2018

I'll close this PR, as I think everything from here was already merged in more incremental chunks over the past couple of weeks.

@pieh pieh closed this Nov 30, 2018