Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guide plugin authors to write source plugins that correctly interact with the cache #11785

Closed
johno opened this issue Feb 15, 2019 · 15 comments
Closed
Labels
type: documentation An issue or pull request for improving or updating Gatsby's documentation

Comments

@johno
Copy link
Contributor

johno commented Feb 15, 2019

This is related to #11747 as an isolated, reproducible case of caching issues. cc/ @pieh @DSchau

Description

Numerous plugins in the ecosystem aren't interlinking a node and the parent on onCreateNode which causes nodes in the cache to suddenly go missing. I stumbled across this issue working on a proof of concept with gatsby-source-instagram-all but I'm sure this affects plenty of other plugins.

It's not currently intuitive how to properly set up data sourcing in ways that interact well with the cache.

These issues are unfortunate because they create distrust in the cache (rightfully so). Logging warnings and helping to guide plugin authors towards the right path will help alleviate caching woes.

Steps to reproduce

Use a plugin like the reproduction and stop, then start, the server.

Expected result

Restarting the server and running the build should work.

Actual result

Restarting the server and running the build require clearing the cache.

Steps to fix

Currently, the steps to fix involves rm -rf .cache 🙀.

To solve the fundamental issue here will require two steps (I think)

• We should issue a warning in onCreateNode if there's no node passed as a parent and link to a doc that elaborates on the issue
• We should have a doc for plugin authors that describe the Gatsby caching system and how nodes need to be interlinked to be properly cached and avoid errors

I'm anticipating that the doc should be in the style of a guide for plugin authors that might be sourcing nodes since the existing docs (that I can find) are more low-level but I'll defer to @marcysutton here.

Related

@johno johno changed the title Guide plugin authors to write plugins that correctly interact with the cache Guide plugin authors to write source plugins that correctly interact with the cache Feb 15, 2019
@KyleAMathews
Copy link
Contributor

Oh interesting! Yeah, not setting a parent in onCreateNode when creating a node is a definite no no. I'm struggling to think of when that'd ever actually make sense... if we can't think of any, we should just hard error there (or maybe warn initially to not break people's sites).

@KyleAMathews
Copy link
Contributor

Another thing we could check for is if all (or most, say 80%+) or a plugin's nodes are deleted startup due to being stale. That'd be easy to calculate and again should be extremely uncommon so we could issue a strong warning that the plugin probably has a bug.

@pieh
Copy link
Contributor

pieh commented Feb 15, 2019

Yeah, not setting a parent in onCreateNode when creating a node is a definite no no.

Screenshot transformer does that - in very similar way as linked plugin (because both use createRemoteFileNode that doesn't set parent). The difference is screenshot plugin also touches those file nodes so they don't get garbage collected. We definitely need to add ability to set parent to createRemoteFileNode helper.

@KyleAMathews
Copy link
Contributor

Ah yeah, ok so it's a weakness in the helper. Oh and we also shouldn't then delete nodes that are linked to from other nodes 🤔

That's the real problem. File nodes don't need a parent. They just need not deleted. If I'm remembering the stale node algorithm correctly, we only check for parents.

@KyleAMathews
Copy link
Contributor

This is a pretty major bug. The stale collection stuff is our GC step but since we're not collecting all references, we're deleting too much.

https://github.com/gatsbyjs/gatsby/blob/7d31fe75f8d73bb52e7f10384033429bafbc8186/packages/gatsby/src/utils/source-nodes.js

I'm not sure though how to cheaply find ___NODE references since recursing through every node would be very expensive.

@pieh
Copy link
Contributor

pieh commented Feb 15, 2019

I'm not sure though how to cheaply find ___NODE references since recursing through every node would be very expensive.

Not ideal, but we do recurse through each node when we produce example value for schema inference. But using that has potential to easily introduce regressions when changing seemingly unrelated code.

I honestly am not sure we should do that (shield ___NODE linked nodes from being garbage collected) - especially with onCreateNode callback that is really uniquely fitting for transformer plugins. The file node that is created and linked there is really derived from "parent" node so it makes sense to enforce that this File node is child.

I guess maybe instead of changing createRemoteFileNode to add support for setting parent, we could adjust createParentChildLink action to set both children on parent, and parent on child?

actions.createParentChildLink = (
{ parent, child }: { parent: any, child: any },
plugin?: Plugin
) => {
// Update parent
parent.children.push(child.id)
parent.children = _.uniq(parent.children)
return {
type: `ADD_CHILD_NODE_TO_PARENT_NODE`,
plugin,
payload: parent,
}
}

That + warning/error containing instructions to use that action seems like it would solve this without adding overhead of recursing nodes to find ___NODE fields

@KyleAMathews
Copy link
Contributor

Hmm thought of a cheap way to do the fix. Just track which nodes are created in onCreateNode from which node. This would be an implicit parent/child relationship even if the data wasn't modeled that way. We could wrap the createNode action to save this. Then in the GC step, we could do a deep search on the remaining parent nodes for links.

@sidharthachatterjee
Copy link
Contributor

So I just tried to reproduce this with gatsby-source-instagram-all and can't seem to be able to. It is consistently working for me.

redux-state.json is written in .cache correctly and onCreateNode is being called everytime the server is started.

Any ideas?

@sidharthachatterjee
Copy link
Contributor

Turns out the Instagram API returns tags for an image in random order every time. So content digest is different every time! That's why they were getting marked as dirty and couldn't reproduce.

@sidharthachatterjee
Copy link
Contributor

Hmm thought of a cheap way to do the fix. Just track which nodes are created in onCreateNode from which node. This would be an implicit parent/child relationship even if the data wasn't modeled that way. We could wrap the createNode action to save this. Then in the GC step, we could do a deep search on the remaining parent nodes for links.

This was published in

- gatsby-remark-graphviz@1.0.7
 - gatsby-source-drupal@3.0.24
 - gatsby-source-filesystem@2.0.21
 - gatsby-source-shopify@2.0.15
 - gatsby-source-wordpress@3.0.37
 - gatsby-transformer-screenshot@2.0.12
 - gatsby@2.1.11

@johno
Copy link
Contributor Author

johno commented Feb 20, 2019

Going to close this since rather than guiding we've been able to fix it with code itself. Thanks @KyleAMathews, @pieh and @sidharthachatterjee!

@pieh
Copy link
Contributor

pieh commented Apr 1, 2019

Reopening this, because this was fixed in single instance (createRemoteFileNode) and not generally, so users can still hit this - as demonstrated in #12981

@pieh pieh reopened this Apr 1, 2019
@gatsbot gatsbot bot added the stale? Issue that may be closed soon due to the original author not responding any more. label Apr 22, 2019
@gatsbot
Copy link

gatsbot bot commented Apr 22, 2019

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It’s been at least 20 days since the last update here.

If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!

Thanks for being a part of the Gatsby community! 💪💜

@KyleAMathews KyleAMathews added not stale and removed stale? Issue that may be closed soon due to the original author not responding any more. labels Apr 22, 2019
@smerth
Copy link

smerth commented Jun 8, 2019

Hi there,

I have an issue and I think it might fit here. I wrote a plugin called gatsby-source-gh-readme, which you can find on NPM gatsby-source-gh-readme and the repo is on github here gatsby-source-gh-readme.

The plugin makes a graphql call to the github graphql api and should pull in the file contents of the README.md file at the root of each project in a github user's account and then create a node (of mediaType: "text/markdown") for each in Gatsby.

The plugin works, for each README.md file Gatsby creates a node of type markdown and then those node are processed by the remark plugins just like the blog post markdown files...

However, frequently (almost every time,) I need to clear the cache before running gatsby develop or the build will fail. If I clear the cache prior to running gatsby develop the build succeeds.

I prepare the data for createNode() with this function:

// Helper function that processes a repository node to match Gatsby's node structure
  const processRepo = repo => {
    const nodeId = createNodeId(`repo-readme-${repo.id}`);
    const readme = repo.readme;
    const nodeData = Object.assign({}, repo, {
      id: `repo-readme-${nodeId}`,
      parent: null,
      children: [],
      internal: {
        mediaType: "text/markdown",
        type: `GithubReadme`,
        content: readme,
        contentDigest: createContentDigest(repo)
      }
    });
    return nodeData;
  };

you can see that parent is null and children is an empty array. I don't understand what parent should be, or indeed what children should be…

Apart from the fact that there is probably a better way to write this plugin, it does seem to be example of what your talking about in this issue.


Apart from flagging this as a potential example of the problem discussed in this issue. I have some general questions about this plugin that are not related:

  • when writing a plugin how do I decide what parent, or child should be?

  • how could I write this to make use of graphql from gatsby itself, it seems silly to use another library to make the call to the github api, but I couldn't find a way to use gatsby's graphql from inside the plugin.

  • how can I get feedback on how I wrote this plugin? I am sure there is a better way to write this...

@danabrit danabrit added type: documentation An issue or pull request for improving or updating Gatsby's documentation type: feature or enhancement Issue that is not a bug and requests the addition of a new feature or enhancement. and removed type: dx labels Jul 6, 2020
@maiertech
Copy link

Here is an example of a theme that creates tag nodes that disappear from the cache for no obvious reason: reflexjs/reflexjs#39.

@LekoArts LekoArts removed the type: feature or enhancement Issue that is not a bug and requests the addition of a new feature or enhancement. label Mar 1, 2021
@LekoArts LekoArts closed this as completed May 4, 2021
@gatsbyjs gatsbyjs locked and limited conversation to collaborators May 4, 2021
@LekoArts LekoArts removed the not stale label May 7, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
type: documentation An issue or pull request for improving or updating Gatsby's documentation
Projects
None yet
Development

No branches or pull requests