Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Webapp upgrade protocol: Disable HTTP caching and reload other browser tabs to prevent fatal errors after new deployments. #1822

Merged
merged 52 commits into from
Oct 2, 2024

Conversation

adamziel
Copy link
Collaborator

@adamziel adamziel commented Sep 28, 2024

Motivation for the change, related issues

Solves fatal Playground breakages after new version deployments by adopting the following version upgrade protocol:

  • Playground version is upgraded as early as possible after a new release
  • HTTP cache is skipped when fetching new assets
  • Stale Playground tabs are forcibly refreshed

Related to #1821

The problem

Playground got affected by HTTP caching and ended up loading assets from both the old release and the new release. This broke the app's dependency graph and led to fatal errors.

See the visualisation below. When Playground v184 is released, the app will only work properly if all the loaded assets come from v184:

371781531-608a780e-60b8-4ed4-969a-d7497c7500a7

The solution

This PR ensures HTTP cache is skipped for assets that are cached offline. This isn't perfect as the browser will sometimes download the same file twice, but it's much better than breaking the app. We'll explore making the most out of both cache layers in the future.

Here's a rundown of the caching strategy implemented in this PR:

  • Playground version is upgraded as early as possible after a new release
  • HTTP cache is skipped

Playground version is upgraded as early as possible after a new release

New service workers call .skipWaiting(), immediately claim all the clients
that were controlled by the previous service worker, and forcibly refreshes
them.

Why?

Because Playground fetches new resources asynchronously and on demand. However,
deploying a new webapp version of the app destroys the resources referenced in
the previous webapp version. Therefore, we can't allow the previous version
to run when a new version becomes available.

Push notifications

It would be supremely useful to proactively notify the webapp after a fresh deployment.
Playground doesn't do that yet but it likely will in the future.

HTTP cache is skipped

Playground relies on the Cache only strategy. It loads assets from
the network, caches them, and serves them from the cache. The assumption
is that all network requests yield the most recent version of the remote file.

This helps us avoid the HTTP cache problem.

Cache layers

We're dealing with the following cache layers:

  • HTTP cache in the browser
  • CacheStorage in the service worker
  • Edge Cache on playground.wordpress.net

HTTP cache in the browser

This service worker skips the browser HTTP cache for all network requests. This is because
the HTTP cache caused a particularly nasty problem in Playground deployments.

Installing a new service worker purged the CacheStorage and requested a new set of assets
from the network. However, some of these requests were served from the HTTP cache. As a
result, Playground would start loading a mix of old and new assets and quickly error out.
What made it worse is that this broken state was cached in CacheStorage, breaking Playground
for weeks until the cache was refreshed.

See #1822 for more details.

CacheStorage in the service worker

This servive worker uses a Cache only strategy to ensure all the loaded assets
come from the same webapp build.

The Cache only strategy means Playground only loads each assets from
the network once, caches it, and serves it from the cache from that point on.

The only times Playground reaches to the network are:

  • Before the service worker is installed.
  • When the service worker is being activated.
  • On CacheStorage cache miss occurs.

Edge Cache on playground.wordpress.net

The remote server (playground.wordpress.net) has an Edge Cache that's populated with
all static assets on every webapp deployment. All the assets served by playground.wordpress.net
at any point in time come from the same build and are consistent with each other. The
deployment process is atomic-ish so the server should never expose a mix of old and new
assets.

However, what if a new webapp version is deployed right when someone downloaded 10 out of
27 static assets required to boot Playground?

Right now, they'd end up in an undefined state and likely see an error. Then, on a page refresh,
they'd pick up a new service worker that would purge the stale assets and boot the new webapp
version.

This is not a big problem for now, but it's also not the best user experience. This can be
eventually solved with push notifications. A new deployment would notify all the active
clients to upgrade and pick up the new assets.

Other changes

In addition, this PR:

  • Adds E2E tests for app deployments and offline mode
  • Updates CI to run Playwright tests:
    • Firefox
    • Safari
    • Chrome
  • Fixes a few paper cuts
    • Fixed: Boot halted when OPFS isn't available due to error/success hooks never running (4e0ef74)
    • Fixed: "Save in this browser" option stays available even when there's no OPFS support (f6225a9)

Paths not taken

  • Relying on build-time hashes in the filenames for all caching. We can't rely on that for the most important routes: /, /index.html, /remote.html, /sw.js – they need stable URLs for multiple reasons.
  • A different caching strategy, such as network falling back to cache.

Caveats and follow-up work

  • Let's find a way to leverage HTTP cache without breaking the offline cache.
  • There's no way to recover from a deployment happening during a page load – let's fix it.
  • A new service worker forcibly reloads other browser tabs and destroys their in-memory context. Let's solve it by storing temporary sites in OPFS.

Testing Instructions (or ideally a Blueprint)

CI.

Yes, it sounds like a lame testing plan fur such a profound change. However, almost none of these changes can be tested in a local dev environment and a large part of this work was about covering app deployment in our E2E tests.

If you want to try these tests locally and see what they do, you'll need this special setup:

$ npx nx e2e:playwright:prepare-app-deploy-and-offline-mode playground-website
$ npx playwright test --config=packages/playground/website/playwright/playwright.config.ts ./packages/playground/website/playwright/e2e/deployment.spec.ts --ui

Related resources

@adamziel adamziel requested a review from a team as a code owner September 28, 2024 22:39
@bgrgicak bgrgicak self-assigned this Sep 30, 2024
@adamziel adamziel mentioned this pull request Sep 30, 2024
@bgrgicak
Copy link
Collaborator

I can confirm that test are failing in the E2E environment. When I try it manually locally in Chrome it works, if I disable Update on refresh_ in the SW settings, it stops working.

I will explore different ways of busting cache tomorrow.

@brandonpayton
Copy link
Member

I am able to reproduce the Playwright failures locally. After loading from the "new" files, the Playground iframe looks blank:
Screenshot 2024-10-01 at 12 10 37 AM

I'm not sure why yet.

One thing I wonder is how much this could be interacting with the regular browser cache. If that is a possible confounding factor, what if we use the Service Worker to add cache-disabling response headers to our responses to the browser? Could that effectively eliminate that layer so that the Service Worker would have full control of all cached app resources?

@brandonpayton
Copy link
Member

As an aside, Playwright's UI mode is handy. I like it!

npx nx run playground-website:e2e:playwright -- --ui

It gives you full control of which tests to run and a nice, browsable display of results and screen captures along the way.

@bgrgicak
Copy link
Collaborator

bgrgicak commented Oct 1, 2024

As an aside, Playwright's UI mode is handy. I like it!

You can also add --debug for extra errors.

@adamziel
Copy link
Collaborator Author

adamziel commented Oct 1, 2024

@brandonpayton HTTP caching is spot on! I updated the tests to:

  • Run once with HTTP caching disabled – this passes
  • Run once with HTTP caching enabled – this fails

I also added a /switch-to-new-version endpoint to avoid messing with the server process in the middle of the test.

Next, I logged the output from the python server for both runs: https://gist.github.com/adamziel/ff50b5fb945840c64b06b83ea373a936.

With caching disabled, the browser requests the same file a few times over. No surprises there.

However, with caching enabled, the browser only re-requests the following files:

127.0.0.1 - - 🕙 "GET /switch-to-new-version HTTP/1.1" 200 -
127.0.0.1 - - 🕙 "GET /sw.js HTTP/1.1" 200 -
127.0.0.1 - - 🕙 "GET /wp-6.6/wordpress-static.zip HTTP/1.1" 200 -
127.0.0.1 - - 🕙 "GET / HTTP/1.1" 200 -
127.0.0.1 - - 🕙 "GET /assets/index-191909dc.js HTTP/1.1" 200 -
127.0.0.1 - - 🕙 "GET /assets/index-338819e8.css HTTP/1.1" 200 -
127.0.0.1 - - 🕙 "GET /assets/main-25d144ec.css HTTP/1.1" 200 -
127.0.0.1 - - 🕙 "GET /assets/main-cf01976e.js HTTP/1.1" 200 -
127.0.0.1 - - 🕙 "GET /assets/client-22d3bbd1.js HTTP/1.1" 200 -
127.0.0.1 - - 🕙 "GET /assets/config-3ed3fcde.js HTTP/1.1" 200 -
127.0.0.1 - - 🕙 "GET /sw.js HTTP/1.1" 304 -
127.0.0.1 - - 🕙 code 404, message File not found
127.0.0.1 - - 🕙 "GET /assets/wp-6.6-917ecf17.zip HTTP/1.1" 404 -
127.0.0.1 - - 🕙 code 404, message File not found
127.0.0.1 - - 🕙 "GET /assets/php_8_3-4d8e97bf.js HTTP/1.1" 404 -
127.0.0.1 - - 🕙 "GET /sw.js HTTP/1.1" 304 -

It doesn't retrieve the updated /index.html, remote.html, or /assets-required-for-offline-mode.json.

@adamziel
Copy link
Collaborator Author

adamziel commented Oct 1, 2024

I also found that replacing cachedFetch() (that caches the responses in a named Cache instance) with just fetch() solves the problem as well – even when HTTP caching is enabled.

@adamziel
Copy link
Collaborator Author

adamziel commented Oct 1, 2024

I started logging Cache name every time a response is served from offline Cache.

  • When HTTP caching is enabled, the same cache is used before and after the new app version is deployed.
  • When HTTP caching is disabled, only the new cache ID is used after the new app version is deployed.

That would happen without refreshing the service worker, but the problem is solved if I only replace cachedFetch() with fetch() in the new service worker and leave the old one unchanged.

@adamziel
Copy link
Collaborator Author

adamziel commented Oct 1, 2024

Relevant article: https://web.dev/articles/service-worker-caching-and-http-caching

@adamziel
Copy link
Collaborator Author

adamziel commented Oct 1, 2024

The test passes when the server sends no-cache headers for /remote.html and /index.html, even if all the other files are HTTP-cached. Why?

@brandonpayton
Copy link
Member

The test passes when the server sends no-cache headers for /remote.html and /index.html, even if all the other files are HTTP-cached. Why?

@adamziel, maybe because both use build-specific script resources:

  • index.html asks for <script type="module" crossorigin src="/assets/index-f9b131ee.js"></script>
  • remote.html asks for <script type="module" crossorigin src="/assets/wordpress-5b0e2fb1.js"></script>

@brandonpayton
Copy link
Member

brandonpayton commented Oct 1, 2024

And once the browser sees those pages updated and loads those scripts, it finds that those scripts reference dependencies it has never cached, and it requests them from the web server.

@brandonpayton
Copy link
Member

@brandonpayton HTTP caching is spot on! I updated the tests to:

  • Run once with HTTP caching disabled – this passes
  • Run once with HTTP caching enabled – this fails

This is great to know!

@adamziel
Copy link
Collaborator Author

adamziel commented Oct 2, 2024

Good point about the http cache overall though. Hm, maybe we could have a special naming scheme for resources with content hash, such as index-h[bg4def]h.js and keep the http cache active for matching urls only?

@bgrgicak
Copy link
Collaborator

bgrgicak commented Oct 2, 2024

How would it work in the offline mode work without defaulting to the service worker cache? Or in an online mode but on a shaky network?

For offline it's simple try { networkFetch } catch { cacheFetch }.

Shaky networks are tricky, because we shouldn't mix networkFetch and cacheFetch otherwise Playground might break.

What if?

  • we create a timestamp on every website deploy and use it as the version
  • we fetch it as early as we can on boot and use that for cache control in the service worker
    • all files that are fetched before the version number should use the network first approach to caching in the service worker and have cache headers disabled
    • all files that are fetched after that use the cache first approach in the service worker and have cache headers enabled

What's missing here?

buildVersion comes from a custom vite plugin that we control and reflects either the git ref of the current commit or the current date.

Good point. In that case we should find a solution for these files (source). Should they have some form of cache busting? If yes can it be a dynamic name or version query string, if not should we disable cache headers.

  "/builder/builder.html",
  "/favicon.ico",
  "/gutenberg.css",
  "/gutenberg.html",
  "/index.html",
  "/logo-192.png",
  "/logo-256.png",
  "/logo-384.png",
  "/logo-512.png",
  "/manifest.json",
  "/ogimage.png",
  "/previewer.css",
  "/site-manager-background.svg",
  "/wordpress-importer.zip",
  "/wordpress.html",
  "/wordpress.svg",
  "/remote.html",
  "/sw.js",

Even if Vite updates it, we don't explicitly fetch it every time, but allow the browser to decide based on cache headers.

This was related to the comment about why disabling cache headers fixes caching.

From what I understand, if remote.html for example is fetched every time by disabling cache headers.
It will get the latest buildVersion and update service worker cache if needed.

This matches what you found out . The test passes when the server sends no-cache headers for /remote.html and /index.html

@bgrgicak
Copy link
Collaborator

bgrgicak commented Oct 2, 2024

Hm, maybe we could have a special naming scheme for resources with content hash, such as index-h[bg4def]h.js and keep the http cache active for matching urls only?

How would that be different from what we do with assets-required-for-offline-mode.json?

@adamziel
Copy link
Collaborator Author

adamziel commented Oct 2, 2024

Hm, maybe we could have a special naming scheme for resources with content hash, such as index-h[bg4def]h.js and keep the http cache active for matching urls only?

How would that be different from what we do with assets-required-for-offline-mode.json?

  1. It would also apply to files not explicitly listed in that json.
  2. It would apply to files from the JSON list that are fetched before the JSON list is downloaded.

I'll reply to the other comment soon

@adamziel
Copy link
Collaborator Author

adamziel commented Oct 2, 2024

Also, I'm tempted to get this in and move further http cache improvements to another pr. We'd make the webapp a bit slower to reload, but I don't see other downsides. Thoughts?

@adamziel
Copy link
Collaborator Author

adamziel commented Oct 2, 2024

For offline it's simple try { networkFetch } catch { cacheFetch }.
Shaky networks are tricky, because we shouldn't mix networkFetch and cacheFetch otherwise Playground might break.

Hm. Let's break it down and see what comes out. We have the following ways of triggering network requests in Playground:

  • Direct document requests, e.g. I type in playground.wordpress.net or playground.wordpress.net/wordpress.html in my address bar and press enter.
  • HTML tags, e.g.<img>, <link>, <script>, <iframe> etc.
  • CSS properties, e.g.background-image and webfonts
  • JavaScript calls, both implicit and explicit, e.g. fetch(), new Worker(), iframe.src = '';, serviceWorker.register(), cache.add(), and likely many other ones

All of these can happen in the following contexts:

  • Vanilla – First Playground page load, before the service worker is in place
  • With service worker in place – after it's installed
  • Within <iframe> on another domain
  • ...probably a few more...

I'm not sure:

  • How to reliably cover all of these with try { networkFetch } catch { cacheFetch }
  • What to do when the 7th import fails assuming we've loaded the previous 6 from the network
  • What would happen if the offline status changes while the app is open

My gut says we need access to a consistent dependency graph throughout the entire time the Playground app is open. The version on the server may change while we're working with Playground, and when that happens we can only do two things:

  1. Keep sourcing what we have from the offline cache – and we probably won't have everything, e.g. php_7_2.wasm
  2. Reload the page and fetch the new app version

@adamziel
Copy link
Collaborator Author

adamziel commented Oct 2, 2024

The offline cookbook discusses a lot of strategies. You've described network falling back to cache. Here's an excerpt from the description:

This means you give online users the most up-to-date content, but offline users get an older cached version. If the network request succeeds you'll most likely want to update the cache entry.

Most up-to-date content is a problem. Playground can only work if we ship all content from the new version or all content from the old version, but not fresh content from the new version, old content from cache.

However, this method has flaws. If the user has an intermittent or slow connection they'll have to wait for the network to fail before they get the perfectly acceptable content already on their device. This can take an extremely long time and is a frustrating user experience. See the next pattern, Cache then network, for a better solution.

This is a large downside.

I don't see a reliable way of doing any other strategy than Cache only:

Ideal for: anything you'd consider static to a particular "version" of your site. You should have cached these in the install event, so you can depend on them being there.

I'm happy to be wrong here. Do you think we could do it in another way?

@adamziel
Copy link
Collaborator Author

adamziel commented Oct 2, 2024

Good point. In that case we should find a solution for these files (source). Should they have some form of cache busting? If yes can it be a dynamic name or version query string, if not should we disable cache headers.

At least a few items on that list cannot have a dynamic name. AFAIR the service worker script must have a stable name. All html files need a stable URL for people linking to them and embedding them. I think manifest.json also needs to be stable, but I'm not 100% sure about that one. We need a caching strategy that works end to end without relying on the names. To me, the hashes in assets names are helpful as a debugging aid but not as a reliable caching strategy.

@bgrgicak
Copy link
Collaborator

bgrgicak commented Oct 2, 2024

Also, I'm tempted to get this in and move further http cache improvements to another pr. We'd make the webapp a bit slower to reload, but I don't see other downsides. Thoughts?

Let's do that. This PR is blocking all other development, and from the discussion, it doesn't seem like we will be able to resolve everything quickly.

@bgrgicak
Copy link
Collaborator

bgrgicak commented Oct 2, 2024

Ideal for: anything you'd consider static to a particular "version" of your site. You should have cached these in the install event, so you can depend on them being there.

That's us. In that case we fundamentals right in trunk. All requests go through cache until it's busted.

Now how to handle busting?

The most reliable way I can think of is to use the cached version, check if there is a new one, if yes offer the user to reload and update.

The only other option I can think of is to first load the version, check if it's different, bust cache, and reload if it is.
This sounds more seamless, but it also creates a bottleneck, every page load would wait for the version to be fetched before doing anything else.

What would be some other strategies?

@adamziel
Copy link
Collaborator Author

adamziel commented Oct 2, 2024

The most reliable way I can think of is to use the cached version, check if there is a new one, if yes offer the user to reload and update.

Yeah that's our only option for now. Technically, this PR gets us there. The new service worker handles cache invalidation and upgrading active clients to the new version as follows:

  1. Load the old app version
  2. Fetch the service worker
  3. New service worker found, run the activation hooks
  4. The activation hook purge the cache and reload every browser tab

Ideally we'd have something on the server to tell an active client "hey, there's a new version, you need to upgrade". Something like a websocket or a push notification. TBD.

Aha, page reloading is not much of a choice, see this comment. The best we can do is force-reload and ensure no in-memory state got destroyed in the process.

@adamziel
Copy link
Collaborator Author

adamziel commented Oct 2, 2024

I think this is good to go once the e2e tests pass. Or even if they don't pass, assuming the same things are broken on trunk.

What do you think @brandonpayton? The only downside is we don't have a flow for preserving in-memory state in other browser tab. This PR would just reload them and destroy their in-memory state. Fixing that would require writing throwaway code for another day, so I suggest merging this destructive strategy and then spending that day on saving temp sites in OPFS.

@adamziel adamziel changed the title Explore reliable website deployments (service worker cache clearing) [Webapp] Version upgrade protocol: Disable HTTP caching and reload other browser tabs to prevent fatal errors after new deployments. Oct 2, 2024
@adamziel adamziel changed the title [Webapp] Version upgrade protocol: Disable HTTP caching and reload other browser tabs to prevent fatal errors after new deployments. Webapp upgrade protocol: Disable HTTP caching and reload other browser tabs to prevent fatal errors after new deployments. Oct 2, 2024
@adamziel
Copy link
Collaborator Author

adamziel commented Oct 2, 2024

A summary of failed tests:

  • Safari: 1 flaky test and 1 legitimate failure when requesting /wp-admin instead of /wp-admin/ – perhaps we need the client-side heuristics to add the trailing slash at least for the wp-admin URL after all. That problem is there in trunk, too.
  • Firefox: 2 multisite-related failures. I cannot reproduce them locally. They may or may not be legitimate failures. We don't have a staging environment yet so the only way to find out is deploying and testing that firefox scenario in production. We should get a staging setup really soon.

I'm going to merge this PR with the failing tests without commenting them out. Let them remind us about these suspicious scenarios and let's fix them in the next PR or two.

@adamziel adamziel merged commit 41649c1 into trunk Oct 2, 2024
7 of 9 checks passed
@adamziel adamziel deleted the reliable-deployments-sw branch October 2, 2024 19:31
@brandonpayton
Copy link
Member

What do you think @brandonpayton? The only downside is we don't have a flow for preserving in-memory state in other browser tab. This PR would just reload them and destroy their in-memory state. Fixing that would require writing throwaway code for another day, so I suggest merging this destructive strategy and then spending that day on saving temp sites in OPFS.

Agreed. I think that's sound reasoning.

@brandonpayton
Copy link
Member

Note: We've disabled the website deployment workflow until temp sites are created in OPFS by default. Then we can be more confident in the Service Worker forcing page reloads when deployments occur.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants