increasing end-of-crank time consumed in testnet4.5 vattp trace, promises not getting GCed? XS bug? #3839

Closed
warner opened this issue Sep 16, 2021 · 3 comments · Fixed by #3906
Assignees: warner
Labels: bug (Something isn't working), SwingSet (package: SwingSet), testnet-problem (problem found during incentivized testnet), xsnap (the XS execution tool)

Comments

warner commented Sep 16, 2021

I'm investigating performance data from the recent phase4.5 ("metering") testnet, specifically vattp (v14). The code in this vat is super-simple, and does not accumulate any significant state over time. Yet for some reason the time it took for each crank increased significantly over the week-long testnet period.

This is a graph of the total delivery time (slogfile deliver event until deliver-result), vs the delivery number, for just the deliverInboundAck() messages:
[Graph: vattp deliverInboundAck: total delivery time vs. delivery number]
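(For reference, the per-delivery numbers behind a graph like this can be pulled out of the slogfile with a short script. A minimal sketch, assuming the slogfile is newline-delimited JSON whose entries carry `type`, `time` (seconds), and `vatID` fields; the exact field names are an assumption here, not something this issue documents:)

    // sketch: total delivery time (deliver -> deliver-result) per crank for one vat
    // (restricting to just deliverInboundAck() deliveries would additionally
    // require inspecting the delivery body, omitted here)
    import fs from 'fs';
    import readline from 'readline';

    async function deliveryTimes(slogPath, vatID) {
      const rl = readline.createInterface({ input: fs.createReadStream(slogPath) });
      let deliverTime = null;
      let deliveryNum = 0;
      for await (const line of rl) {
        const entry = JSON.parse(line);
        if (entry.vatID !== vatID) continue;
        if (entry.type === 'deliver') {
          deliverTime = entry.time;
        } else if (entry.type === 'deliver-result' && deliverTime !== null) {
          const ms = (entry.time - deliverTime) * 1000;
          console.log(`${deliveryNum},${ms.toFixed(2)}`); // CSV: delivery number, ms
          deliveryNum += 1;
          deliverTime = null;
        }
      }
    }

    deliveryTimes('testnet4.5.slog', 'v14').catch(err => console.error(err));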

The code for deliverInboundAck() is:

    deliverInboundAck(name, ack) {
      const o = getRemote(name).outbound;
      let num = o.highestRemoved + 1;
      while (num <= o.highestAdded && num <= ack) {
        D(mailbox).remove(name, num);
        o.highestRemoved = num;
        num += 1;
      }
    },

It's just retrieving a high-water mark number from a Map for the named remote, and making one remove device invocation for each message number between that number and the incoming ack number, then updating the high-water mark. The observed metering-visible activity (computrons) is a nearly-deterministic function of the number of messages that get removed, over the entire testnet period, so at least the JS engine doesn't think the amount of work is increasing over time.

When I graph the amount of time that elapses between the initial deliver event and the kernel-side end of the last syscall, I see this shape:

[Graph: vattp deliverInboundAck: time from deliver event to last syscall]

which tells me:

1. it is not growing significantly over time (maybe 1ms total)
2. there are multiple evenly-spaced horizontal bands, which probably represent the small extra time taken to remove 0/1/2/etc old messages; each removal adds maybe 1ms
3. there's some additional delay (about 10ms) that comes and goes, for hours or days at a time
4. the total time spent in this early phase is fairly small, no more than 30ms or so
5. therefore the 150ms-200ms spent on each delivery by the end of the testnet is all happening after the last syscall returns

I'm fairly sure that the extra time is being spent during the forced GC that liveslots does at end-of-crank (the same thing we're aiming to do less frequently with #3767). However, given that vattp is not doing anything to justify an increased heap size (no objects are being passed around), my hypothesis is that something inside liveslots is causing the heap to grow, and the only thing I can think of is a bunch of result promises.
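(To illustrate why a silently growing heap would show up exactly this way: a forced full GC walks all live data, so its cost scales with whatever has accumulated, even if the vat code itself does constant work per crank. A rough sketch of that effect, using Node with `--expose-gc` as a stand-in for the end-of-crank GC that liveslots forces in xsnap:)

    // run with: node --expose-gc gc-growth.mjs
    // Each "crank" leaks a fixed number of objects; the forced GC after each
    // crank gets slower as the retained set grows, even though per-crank work
    // is constant -- the same shape as the testnet graph.
    const retained = [];
    for (let crank = 1; crank <= 20; crank += 1) {
      for (let i = 0; i < 100_000; i += 1) {
        retained.push({ i }); // stand-in for leaked promises/collection entries
      }
      const start = process.hrtime.bigint();
      globalThis.gc(); // forced full GC, like liveslots does at end-of-crank
      const ms = Number(process.hrtime.bigint() - start) / 1e6;
      console.log(`crank ${crank}: live=${retained.length} gc=${ms.toFixed(1)}ms`);
    }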

There are three pathways through vattp that get called a significant number of times (i.e. more than once per provisioned remote). Two (deliverInboundMessages and deliverInboundAck) are triggered by the mailbox device. They are "sendOnly", and do not have result promises. deliverInboundMessages emits a single eventual-send and a device invocation for each non-duplicate message:

    deliverInboundMessages(name, newMessages) {
      const i = getRemote(name).inbound;
      newMessages.forEach(m => {
        const [num, body] = m;
        if (num > i.highestDelivered) {
          // TODO: SO() / sendOnly()
          E(i.receiver).receive(body);
          i.highestDelivered = num;
          D(mailbox).ackInbound(name, num);
        }
      });
    },

(as noted, we want to change that i.receiver~.receive() to use sendOnly once we have support for it, because we don't really care about the result promise).
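(A minimal sketch of what the body of that `if` might look like after the change, assuming the `E.sendOnly()` form from the eventual-send package; the exact spelling once tildot/sendOnly support lands may differ:)

          if (num > i.highestDelivered) {
            // sendOnly: deliver the message without allocating a result promise
            E.sendOnly(i.receiver).receive(body);
            i.highestDelivered = num;
            D(mailbox).ackInbound(name, num);
          }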

deliverInboundAck does a device invocation, but no eventual-sends.

The third pathway is on the outbound side, when comms tells vattp to transmit a message to the remote:

      const transmitter = Far('transmitter', {
        transmit(msg) {
          const o = r.outbound;
          const num = o.highestAdded + 1;
          D(mailbox).add(name, num, msg);
          o.highestAdded = num;
        },
      });

This does not cause any eventual sends, and the comms message is sendOnly (because comms does not use liveslots).

So the only place that's obviously creating a result promise is the ~.receive() call. Comms should be resolving that (to undefined) a moment later, and liveslots is supposed to forget about resolved promises. But if something is broken in that pathway, maybe pendingPromises is growing, causing heap growth, and subsequent GC slowdowns.

My plan is to build a benchmark environment that creates a vattp vat (without a kernel), sets up a remote, then feeds a long stream of deliverInboundMessages / deliverInboundAck / transmit messages in, and watches how the execution time changes over time. If that demonstrates the same problem that we recorded on the testnet, the next step is to start instrumenting liveslots, specifically pendingPromises, to see if it is growing over time.
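(A rough sketch of the shape of that driver loop; `makeVattpWorker()` and the `make*` delivery helpers are hypothetical placeholders for whatever the kernel-less harness ends up providing, not existing APIs:)

    // hypothetical harness: spin up xsnap + liveslots + vattp with no kernel,
    // then feed it synthetic deliveries and watch per-delivery wall-clock time
    async function runVattpBenchmark(steps = 50_000) {
      const worker = await makeVattpWorker();             // hypothetical
      await worker.dispatch(makeAddRemoteDelivery('r1')); // hypothetical setup
      for (let n = 1; n <= steps; n += 1) {
        const start = process.hrtime.bigint();
        await worker.dispatch(makeDeliverInboundMessages('r1', [[n, `msg ${n}`]]));
        await worker.dispatch(makeDeliverInboundAck('r1', n));
        await worker.dispatch(makeTransmit('r1', `out ${n}`));
        const ms = Number(process.hrtime.bigint() - start) / 1e6;
        if (n % 1000 === 0) {
          console.log(`delivery ${n}: ${ms.toFixed(2)} ms`);
        }
      }
    }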

warner added the bug (Something isn't working) and SwingSet (package: SwingSet) labels Sep 16, 2021
warner self-assigned this Sep 16, 2021
warner commented Sep 17, 2021

I've confirmed that the increase in runtime still occurs in a stripped-down environment (no kernel, just vattp and synthesized deliverInboundMessages/deliverInboundAck deliveries). I've also confirmed that neither pendingPromises nor slotToVal are growing over time.

I'm starting to suspect something in XS is failing to mark a Promise as unreferenced, since there's no obvious userspace strong collection of Promises, and a similar bug was fixed in XS several months ago.

warner commented Sep 17, 2021

I'm pretty sure this is a problem in XS, and I've filed Moddable-OpenSource/moddable#701 with steps to reproduce for the Moddable folks to take a look at.

warner changed the title from "increasing end-of-crank time consumed in testnet4.5 vattp trace, maybe pendingPromises not getting pruned" to "increasing end-of-crank time consumed in testnet4.5 vattp trace, promises not getting GCed? XS bug?" Sep 20, 2021
warner added a commit that referenced this issue Sep 29, 2021
* fix a major memory leak: 64 bytes per Map `delete()`, 32 per Set `delete()`
  * should: closes #3839
* unfortunately Map/Set deletion is now O(N) not O(1)
* possibly fix #3877 "cannot read (corrupted?) snapshot"

Note that this breaks snapshot compatibility, and probably metering
compatibility.

closes #3889
warner added a commit that referenced this issue Sep 30, 2021
We upgrade the XS submodule to the latest version:
Moddable-OpenSource/moddable@10cc52e

This fixes a major memory leak: 64 bytes per Map `delete()`, 32 per Set
`delete()`. We believe this should: closes #3839

Unfortunately Map/Set deletion is now O(N) not O(1).

This version of XS also fixes a bug that might be the cause of #3877 "cannot
read (corrupted?) snapshot", but we're not sure.

Note that this breaks snapshot compatibility (snapshots created before this
version cannot be loaded by code after this version). It might also
break metering compatibility, but the chances seem low enough that we decided
to leave the metering version alone.

closes #3889
warner commented Sep 30, 2021

For reference, the most obvious problem turned out to be unrelated to Promises. XS had a memory leak in which deleting an entry from a Set cost 32 bytes, and 64 bytes for a Map. The ultimate cause was the tombstone entries in the insertion-order linked list used to ensure iterators behaved correctly in the face of deletion. The new version of XS fixes this, although at the cost of O(N) deletion operations.
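(Not the actual Moddable repro, but a minimal sketch of the churn pattern that exposes the leak: the Map never holds more than one entry, yet on the affected XS versions each delete() left a tombstone behind in the insertion-order list:)

    // On affected XS versions, heap usage grows by roughly 64 bytes per
    // iteration (about 32 bytes for the equivalent Set churn) even though the
    // Map stays tiny. On the fixed XS, heap usage stays flat, but delete() is
    // now O(N) in the collection size.
    const m = new Map();
    for (let i = 0; i < 1_000_000; i += 1) {
      m.set(i, i);
      m.delete(i);
    }
    console.log(m.size); // 0: the leak was invisible from the JS side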

I ran a simulation of vattp behavior (delivering synthetic messages to an XS worker hosting the vattp code) and observed mostly flat runtime. I ran it for about 10% of the total testnet traffic (the original trace showed a clear linear increase in runtime over that range). There remained some occasional jank that grew linearly, maybe once every 5k deliveries, which I don't understand, but I'm willing to believe it was happening on the V8 side that was driving the simulation. I also ran a much smaller test (100 steps) and inspected XS's the->currentHeapCount to confirm that it remained flat (the original XS worker grew by a couple of slots for each delivery).

I ought to run a more extensive test in which I replay the entire testnet4.5 trace into a new-XS worker and examine the memory and time usage, but I'm more likely to wait until we do a deeper performance investigation of some other vat.

dckc added the testnet-problem (problem found during incentivized testnet) and xsnap (the XS execution tool) labels Oct 6, 2021