
All node removal in LokiJS takes a very long time (affects page creation and already cached builds) #17935

Closed
me4502 opened this issue Sep 27, 2019 · 16 comments · Fixed by #23838
Assignees
Labels
type: bug An issue or pull request relating to a bug in Gatsby

Comments

@me4502
Contributor

me4502 commented Sep 27, 2019

Description

I've been experiencing random major slowdowns in the "source and transform nodes" step that don't appear to match the other posted issues. After managing to reproduce it with the debugger running, I've found that deleting stale nodes from LokiJS is the cause of the slowdown. When a large number of nodes become stale, this takes a very long time, roughly 120ms per node. With 20,000 stale nodes that easily adds up to ~2,400s, which is about how long I'm seeing this step take whenever stale nodes need cleaning up.

If there are a lot of stale nodes, would it make sense to just clear all of them and start again? In my case, quite literally over 99% of the nodes are stale.
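The numbers above suggest the cost is per-removal bookkeeping rather than the removal itself. A minimal sketch (not LokiJS's actual code) of the difference between removing nodes one at a time with an index re-sort after each removal, versus the "clear and start again" bulk approach suggested here:

```javascript
// Sketch, not LokiJS's actual code: removing one node at a time and
// re-sorting an index after each removal is O(removals * n log n),
// while a single filter pass plus one rebuild is O(n log n) total.
function removeOneByOne(data, staleIds) {
  for (const id of staleIds) {
    const i = data.findIndex((n) => n.id === id); // O(n) scan
    data.splice(i, 1);                            // O(n) shift
    data.sort((a, b) => a.id - b.id);             // O(n log n) re-sort per removal
  }
  return data;
}

function removeBulk(data, staleIds) {
  const stale = new Set(staleIds);
  const kept = data.filter((n) => !stale.has(n.id)); // single O(n) pass
  kept.sort((a, b) => a.id - b.id);                  // one O(n log n) rebuild
  return kept;
}

const nodes = Array.from({ length: 5000 }, (_, i) => ({ id: i }));
const staleIds = Array.from({ length: 4000 }, (_, i) => i); // >99% stale, as reported
const kept = removeBulk(nodes.slice(), staleIds);
console.log(kept.length); // 1000
```

Both produce the same surviving set; only the amount of index maintenance differs, which is why bulk clearing wins when most nodes are stale.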

Steps to reproduce

  • Use LokiJS
  • Have a lot of nodes
  • Do whatever it is that causes nodes to become stale

Expected result

I'd expect a large quantity of stale nodes not to impact build time any more than not using the cache at all.

Actual result

Stale node cleanup takes ~40 minutes.

Environment

  System:
    OS: macOS 10.15
    CPU: (12) x64 Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
    Shell: 3.2.57 - /bin/bash
  Binaries:
    Node: 11.15.0 - ~/.nvm/versions/node/v11.15.0/bin/node
    Yarn: 1.17.3 - ~/.nvm/versions/node/v11.15.0/bin/yarn
    npm: 6.7.0 - ~/.nvm/versions/node/v11.15.0/bin/npm
  Languages:
    Python: 2.7.16 - /usr/local/bin/python
  Browsers:
    Chrome: 77.0.3865.90
    Safari: 13.0.1
  npmPackages:
    gatsby-plugin-svgr: ^2.0.2 => 2.0.2 

Here's a Chrome CPU profile that shows a tiny part of this - I stopped it after a bit to keep file size down, but the profile would just continue like this for 40 minutes. https://drive.google.com/file/d/1QmkUEt3dNUYsRRv4Z1Y6NpCiuS39mfdK/view?usp=sharing

@me4502
Contributor Author

me4502 commented Oct 2, 2019

#11747 - I'd say this fits into that umbrella issue

@me4502
Contributor Author

me4502 commented Oct 2, 2019

Just did a little bit of looking into it: the stale nodes it finds are the SitePage entries for every single page. They're flagged as stale because pages aren't created during the sourceNodes step.

@me4502 me4502 changed the title Stale node removal in LokiJS takes a very long time All node removal in LokiJS takes a very long time (affects page creation and already cached builds) Oct 4, 2019
@me4502
Contributor Author

me4502 commented Oct 4, 2019

After investigating why my computer kept shutting down due to the watchdog timeout, I found that for some reason Node appears to block every thread when creating pages / sourcing nodes with LokiJS. When running a profile, this slowdown appears to be the problem. Luckily the shutdown doesn't happen while profiling, as the profiler must allow the watchdog to tick.

It did appear, however, that 40 of the 55 seconds createPages took were spent removing old nodes from LokiJS. The main slowdown is that LokiJS entirely re-sorts the index whenever a node is removed from a collection.

I wonder if it'd make sense to have a "bulk changes" option in LokiJS, so that it can add and remove all of the stale / old nodes at once.
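A hypothetical sketch of what such a "bulk changes" mode could look like (this is not an actual LokiJS API; the class and method names here are made up): defer index maintenance while a batch of inserts/removals runs, then rebuild the index exactly once at the end.

```javascript
// Hypothetical sketch of a "bulk changes" mode, not an actual LokiJS API:
// index maintenance is skipped inside a batch and done once at the end.
class IndexedCollection {
  constructor() {
    this.data = [];
    this.sortedIndex = []; // kept sorted by id outside of batch mode
    this.inBatch = false;
  }
  insert(doc) {
    this.data.push(doc);
    if (!this.inBatch) this.rebuildIndex(); // per-op rebuild outside a batch
  }
  removeWhere(pred) {
    this.data = this.data.filter((d) => !pred(d));
    if (!this.inBatch) this.rebuildIndex();
  }
  rebuildIndex() {
    this.sortedIndex = this.data.map((d) => d.id).sort((a, b) => a - b);
  }
  batch(fn) {
    this.inBatch = true;
    try {
      fn(this);
    } finally {
      this.inBatch = false;
      this.rebuildIndex(); // single rebuild for the whole batch
    }
  }
}

const col = new IndexedCollection();
col.batch((c) => {
  for (let i = 0; i < 5; i++) c.insert({ id: 5 - i });
  c.removeWhere((d) => d.id > 3);
});
console.log(col.sortedIndex); // [1, 2, 3]
```

With n operations on a collection of n documents, this turns n index rebuilds into one, which is exactly the cost profile the stale-node cleanup needs.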

@LekoArts LekoArts added the type: question or discussion Issue discussing or asking a question about Gatsby label Oct 17, 2019
@gatsbot gatsbot bot added the stale? Issue that may be closed soon due to the original author not responding any more. label Nov 7, 2019
@gatsbot

gatsbot bot commented Nov 7, 2019

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It’s been at least 20 days since the last update here.

If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!

As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks for being a part of the Gatsby community! 💪💜

@me4502 me4502 added not stale and removed stale? Issue that may be closed soon due to the original author not responding any more. labels Nov 7, 2019
@me4502
Contributor Author

me4502 commented Nov 7, 2019

This is still an issue :)

@LekoArts LekoArts added type: bug An issue or pull request relating to a bug in Gatsby and removed type: question or discussion Issue discussing or asking a question about Gatsby labels Nov 13, 2019
@vladar
Contributor

vladar commented Dec 25, 2019

CC @pvdz

@pvdz
Contributor

pvdz commented Jan 9, 2020

The main slowdown here is that it entirely re-sorts the index whenever a node is removed from a collection.

Sounds like a big-O problem that scales really badly and wouldn't be too hard to fix or work around (on our side). Can you point to the source?

I'm very interested in this but currently knee deep into investigating a different scaling slowdown (in the run queries step).

@pvdz
Contributor

pvdz commented Feb 12, 2020

@me4502 while I still haven't been able to look into this (and I'm not sure when I will be), we are considering deprecating Loki and merging whatever is still relevant into master somehow. This is not a "tomorrow" thing, don't worry. The recent improvements to filter caching seem to take away a lot of the pain that Loki tried to address, and the dual system is becoming a burden to maintain.

From what I understand you have quite a bit of knowledge in this whole area, in particular regarding scaling. Can you tell me (from your POV) what Loki brings to the table at this point and how it helps at scale?

@me4502
Contributor Author

me4502 commented Feb 12, 2020

@pvdz So currently we've switched away from Loki due to this issue, but we still see noticeable declines in query performance. I'm not sure exactly why, but Loki was able to run queries roughly 8x faster than Sift can right now.

The reason we switched away is that this issue made using a cache impossible, and re-running sharp image optimisation on every build is a more significant slowdown than not using Loki.

@pvdz
Contributor

pvdz commented Feb 13, 2020

@me4502 interesting, can you share the filter(s) and/or sort(s) being used in this case? We intend to gather some general stats on how people use filters and sorts so we can implement better heuristics; I personally have no clue what the common cases are. Right now it's only building up "JIT" caches for flat filters.

@me4502
Contributor Author

me4502 commented Feb 13, 2020

@pvdz We've changed our site a fair bit since then, but the queries we use a lot are:

  • Filtering for list contents with in
  • Multi-value sorting
  • Skip/limit for pagination
  • A few elemMatch queries

And recently we switched to using eq: $id in a lot of places due to the other performance boost you did for lookups by node ID.
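The query shapes listed in this comment can be sketched as plain JS over an array of nodes (the data and field names here are made up for illustration; real Gatsby queries are GraphQL, and these are just the semantics of each operator):

```javascript
// The query shapes above as plain JS predicates (illustrative data only).
const nodes = [
  { id: "a", tags: ["js", "web"], date: 2, title: "B", items: [{ kind: "img" }] },
  { id: "b", tags: ["css"],       date: 2, title: "A", items: [{ kind: "txt" }] },
  { id: "c", tags: ["js"],        date: 1, title: "C", items: [{ kind: "img" }] },
];

// `in` filter: keep nodes whose tags intersect a given list
const wanted = ["js"];
const byTag = nodes.filter((n) => n.tags.some((t) => wanted.includes(t)));

// multi-value sort: by date descending, then title ascending
const sorted = [...byTag].sort(
  (a, b) => b.date - a.date || a.title.localeCompare(b.title)
);

// skip/limit pagination
const page = sorted.slice(0, 2); // skip: 0, limit: 2

// `elemMatch`: at least one array element matching a sub-filter
const withImages = nodes.filter((n) => n.items.some((i) => i.kind === "img"));

// `eq: $id`: a single-node lookup, the fast path this comment mentions
const byId = nodes.find((n) => n.id === "c");
console.log(page.map((n) => n.id), withImages.length, byId.id);
```

Each of these is a linear scan when unindexed, which is why repeated queries benefit so much from the cached fast paths discussed in this thread.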

@pvdz
Contributor

pvdz commented Feb 14, 2020

other performance boost you did for lookups by node ID.

This should now apply to any key as long as the filter is "flat" (one property per object of the filter) and an eq. We do plan to update this heuristic to support more operators (neq, in, etc.) as well as to extend it to filters that are not flat. Ultimately, I guess, the index should be built up for any filter that is used more than once, since the one-time overhead of building the index more than makes up for not having to iterate the entire set over and over again. Anyways, that's future work.
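The "JIT" cache idea described here can be sketched as follows (this is a minimal illustration of the technique, not Gatsby's actual implementation): on the first eq query against a flat key, build a Map once; later eq queries on that key become O(1) lookups instead of full scans.

```javascript
// Sketch of a JIT index for flat `eq` filters, not Gatsby's actual code.
const indexCache = new Map(); // key -> Map(value -> nodes[])

function runEqFilter(nodes, key, value) {
  let index = indexCache.get(key);
  if (!index) {
    // First query on this key: one O(n) pass to build the index.
    index = new Map();
    for (const node of nodes) {
      const v = node[key];
      if (!index.has(v)) index.set(v, []);
      index.get(v).push(node);
    }
    indexCache.set(key, index);
  }
  // Every later query on this key is an O(1) lookup.
  return index.get(value) || [];
}

const nodes = [
  { id: "1", type: "Post" },
  { id: "2", type: "Page" },
  { id: "3", type: "Post" },
];
console.log(runEqFilter(nodes, "type", "Post").length); // 2
console.log(runEqFilter(nodes, "id", "2")[0].type);     // "Page"
```

The build cost amortises exactly as described: it pays off for any key queried more than once, which is why extending the heuristic beyond eq is attractive.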

Anyways, that's a bit OT. I plan to take a look at Loki, no ETA.

@pvdz pvdz self-assigned this Feb 14, 2020
@pvdz
Contributor

pvdz commented Feb 20, 2020

A quick check shows that the time is being eaten by InsertionSort, called in turn from QuickSort.

I don't think Loki has ever been tested at this scale; I don't know what's going on.

This is the .sort() that's killing perf. Both index.values and this.data are the full set of nodes, an ever-growing set at 100k nodes. After an hour it had reached 40k nodes here, and "run to point" does not quickly reach the index.dirty line, so it's actually "stuck" in that sort. Intriguing.

I'll have a closer look later but it seems to me like this is the problem during the source and transform nodes step.

This is Collection.prototype.ensureIndex in node_modules/lokijs/src/lokijs.js.
The Comparators.lt in node_modules/gatsby/src/db/loki/custom-comparators.js isn't doing anything weird tbh, just a bunch of comparisons on numbers. It should be fast.
Assuming the sort is stable, a single sort of 40k elements shouldn't take anywhere near a minute.
But I need to look further into it.

[screenshots: profiler output and the offending sort in lokijs.js]

@pvdz
Contributor

pvdz commented Apr 22, 2020

FYI: We are planning to drop Loki entirely in the next few weeks. The fixes I have merged and am merging for filters resolve most of the issues that Loki aimed to work around. Beyond that, Loki doesn't scale well either, and it causes a dual path in our codebase with a high maintenance burden. Additionally, Loki usage is relatively low, because it was never formally supported or announced.

All in all, enough reasons to drop it.

If you're still highly depending on Loki please reach out to me directly, or anywhere else, so we can discuss the use case and triage why we can't solve the problem without Loki.

@me4502
Contributor Author

me4502 commented Apr 22, 2020

Thanks. We stopped using Loki temporarily due to this issue, but with improvements to the current query system we're seeing less of a need for it. Running queries isn't the slow point of our builds now, so it's not as necessary.

@pvdz
Contributor

pvdz commented Apr 23, 2020

@me4502 I'm very happy to hear that :)

From what I understand you have large sites. I'd be happy to triage some problems with you some time. No promises but I guess it can't hurt either. Ping me if you're interested in that.
