
maybe remove per-vat (xsnap) memory limit? #5953

Open
warner opened this issue Aug 13, 2022 · 0 comments
Labels
enhancement, SwingSet, vaults_triage, xsnap

Comments


warner commented Aug 13, 2022

What is the Problem Being Solved?

In today's discussion about DoS attacks/defenses, one concern was an attacker provoking a vat into allocating too much memory (perhaps slowly), and triggering our 2GiB memory limit:

https://github.com/agoric-labs/xsnap-pub/blob/3e671651ed6d8a58491b500c9e7359494ff7da7d/xsnap/sources/xsnapPlatform.c#L290-L291

When xsnap sees the engine try to malloc past this limit, it exits with XS_NOT_ENOUGH_MEMORY_EXIT, aka ExitCode.E_NOT_ENOUGH_MEMORY. The vat manager reports this as a delivery error, which tells the kernel to terminate the vat (in-consensus).
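For orientation, the enforcement is roughly the following shape. This is a self-contained model, not xsnap's actual code: the function and variable names and the numeric exit value are placeholders, and the real check lives in xsnapPlatform.c, where it calls fxAbort(the, XS_NOT_ENOUGH_MEMORY_EXIT).

```c
#include <stdio.h>
#include <stdlib.h>

/* placeholder numeric value; the real worker exits with
 * XS_NOT_ENOUGH_MEMORY_EXIT, aka ExitCode.E_NOT_ENOUGH_MEMORY */
#define NOT_ENOUGH_MEMORY_EXIT 7

static size_t allocationLimit = 2ull * 1024 * 1024 * 1024; /* ~2 GiB default */
static size_t allocatedSpace = 0;

/* every chunk allocation is charged against the per-machine ceiling */
static void* allocateChunk(size_t size) {
    void* p = malloc(size);
    if (p != NULL)
        allocatedSpace += size;
    if (p == NULL || (allocationLimit && allocatedSpace > allocationLimit)) {
        /* in xsnap, this is where the platform code aborts the machine */
        fprintf(stderr, "not enough memory\n");
        exit(NOT_ENOUGH_MEMORY_EXIT);
    }
    return p;
}

int main(void) {
    free(allocateChunk(1024)); /* well under the ceiling: succeeds */
    return 0;
}
```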

In general, any hard resource limit that kills the process (under the theory that the vat did it deliberately) will convert a provoked-growth bug into a kill-the-vat authority. An attacker who manages to send a few extra arguments with each message (perhaps unused positional arguments, which JS functions ignore unless special schema enforcement is added) might cause the vat to consume extra memory to hold them. Over time, that might grow the memory footprint to the point that it trips the ceiling. If, at that point, the worker is killed, we'll have given the attacker more authority than we intended.

Some pathways we're examining for that consumption:

  • promise vrefs (vpids) sent into a vat ("imported promises"), for which liveslots must create a Promise for userspace, and must retain the resolve/reject functions (and thus effectively the Promise itself) until the kernel does a dispatch.notify to resolve it
    • (it wouldn't help to track the promise with a WeakRef because we cannot sense whether userspace has done a .then or await on it, so liveslots must assume userspace is watching until resolved)
    • but note that a null upgrade of the vat will free this memory, because Promises do not survive an upgrade, nor would the userspace code that might have been following them
  • object vrefs (technically "voids" but for obvious reasons we never use that term) sent into a vat, which create a Presence
  • flat data, which gets unserialized into Arrays and records and numbers and strings

We have a size limit at the perimeter of the kernel (specifically the maximum cosmos signed-transaction size). That imposes a limit on the largest message that can make it into the bridge device, or the mailbox device. Framing, comms-protocol serialization, and other transformations are applied or removed, leading to some sort of limit (maybe larger, maybe smaller) on the size and slot count of the capdata that arrives in dispatch.deliver or dispatch.notify. Then liveslots and marshal turn this into JS Objects and Promises and Arrays, etc., whose RAM usage is some function of the capdata sizes. If we could compute this overall transformation function, we could pick a signed-cosmos-txn limit X, and then state that the worst-case RAM usage that can be provoked by any single message is Y.

If Y is larger than the per-vat xsnap allocation limit, a single message could kill the vat during deserialization. If userspace and/or liveslots holds on to data over time, a series of sub-X-sized cosmos messages could kill the vat.

We don't (yet) know what that Y = f(X) function is. We expect it's roughly linear in each of several inputs, probably more accurately expressed as a * body.length + b * slots.filter(isObjectID).length + c * slots.filter(isPromiseID).length. And the slope is probably at least 10, since [],[],[], takes only 3 bytes per empty Array (each of which probably occupies 4-8 32-bit words in RAM), and likewise for {},{},{},. The cosmos limit is currently at least several MB, maybe 10MB (to accommodate contract bundles). When we finish #4564 and #4811 and friends, txns of this size will still be allowed, but the component messages inside will be limited according to type, and the messages which go to swingset will be limited to much smaller values (maybe 1MB, maybe more like 10kB).
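As a strawman, that estimate could be computed like the sketch below. Everything here is a placeholder: the coefficients a, b, c are unknowns to be replaced by measurement, and the names are invented for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* hypothetical shape of Y = f(X) for a single delivery */
typedef struct {
    size_t bodyBytes;    /* length of the capdata body string */
    size_t objectSlots;  /* slots holding object vrefs (each becomes a Presence) */
    size_t promiseSlots; /* slots holding vpids (each becomes a retained Promise) */
} CapdataShape;

static size_t estimateRamBytes(CapdataShape x) {
    const size_t a = 10;  /* assumed: >=10x expansion of body bytes into heap */
    const size_t b = 64;  /* assumed: bytes retained per Presence */
    const size_t c = 128; /* assumed: bytes per Promise plus resolve/reject pair */
    return a * x.bodyBytes + b * x.objectSlots + c * x.promiseSlots;
}

int main(void) {
    CapdataShape x = { 1000, 5, 2 }; /* 1kB body, 5 objects, 2 promises */
    printf("worst-case estimate: %zu bytes\n", estimateRamBytes(x));
    return 0;
}
```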

So we aren't confident that the perimeter limit is tight enough. And we might not be able to prevent liveslots from retaining some data. And userspace might not check for surprising extra properties on inbound objects that it stores. So the attack can be spread over multiple messages. The net result is that we might not be able to prevent attackers from inflating (perhaps slowly) the XS memory footprint of an innocent vat to the point that it passes a fixed limit.

But, if the vat is not intentionally using that extra data, then the Unix xsnap process which hosts the vat won't be touching it. The details are tricky, because the GC mark() phase must walk some portion of all objects, but for some forms of data (especially large strings), the application won't touch those pages. And the standard unix virtual memory system is pretty good at not spending RAM on unused pages.

So our thought is "thrash before crash". If we allow the xsnap process to grow large, but most of the pages are unused, the host OS will push those pages out to swap, preserving the RAM for only the active set. If the host has enough swap space, we can accommodate several inflated vats.

Description of the Design

I'm thinking:

  • change xsnapPlatform.c fxCreateMachinePlatform() to increase the->allocationLimit to 20 GiB or 100 GiB.

    • alternatively, I think we might be able to set the limit after initialization, in xsnap-worker.c after the xsCreateMachine() call, in response to an argv parameter (sketched after this list)
    • I believe the value is stored in the heap snapshot
    • it might be possible to change this dynamically
    • changes should be part of consensus, since the XS_NOT_ENOUGH_MEMORY_EXIT error is treated as part of consensus
  • tell validators they should provision a meaningful amount of swap

    • maybe this can wait, if it looks like our other defenses are working
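A minimal sketch of the argv approach, assuming a hypothetical -l <bytes> option: the option name, and whether the engine tolerates applying the limit after machine creation, are assumptions, not confirmed xsnap behavior.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char* argv[]) {
    /* default matches the current ~2 GiB ceiling; the proposal raises it */
    unsigned long long allocationLimit = 2ull * 1024 * 1024 * 1024;

    /* hypothetical "-l <bytes>" override, parsed like xsnap-worker's argv loop */
    for (int argi = 1; argi < argc; argi++) {
        if (!strcmp(argv[argi], "-l") && (argi + 1 < argc)) {
            allocationLimit = strtoull(argv[++argi], NULL, 10);
        }
    }

    /* in the real worker, this value would be applied to the machine after
     * xsCreateMachine(), e.g. by setting the->allocationLimit */
    printf("allocationLimit = %llu bytes\n", allocationLimit);
    return 0;
}
```

Since XS_NOT_ENOUGH_MEMORY_EXIT is in-consensus, whatever value this option carries must itself come from consensus state, not from per-validator configuration.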

Security Considerations

A hard memory limit chooses one side of a tradeoff:

  • enforce a limit:
    • good: host OS memory usage is protected (although the Linux OOM killer protects this too)
    • good: actually buggy/malicious vats/contracts only hurt themselves
    • bad: vats are more vulnerable to being killed by an attacker (provoked into using too much memory)
  • do not enforce a limit:
    • good: harder for attacker to provoke the vat
    • bad: attacker might trigger OOM killer and take out the whole validator
    • bad: buggy/malicious vats can kill the whole validator

Since we don't have third-party contracts in MN-1, any buggy/malicious contracts are our own fault, and halting the whole chain is better than an in-consensus termination of some important vat.

Test Plan

A unit test which creates an xsnap worker configured with various allocationLimits, evaluates code to build a string just below that limit (successfully), then allocates a little bit more (which should cause the worker to exit with ExitCode.E_NOT_ENOUGH_MEMORY).

We already have a test to exceed the limit, but it works by allocating more and more until the limit is reached. We need to test that the configured limit is respected.

To avoid undue stress on the CI host (or developers' laptops), the test shouldn't exercise limits much larger than 2GB.
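A rough model of that test's shape, for illustration only: the real test would drive an actual xsnap worker configured with a given allocationLimit and assert on its exit code, and the 64 MiB stand-in limit here is just to keep the model cheap.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    const size_t limit = 64 * 1024 * 1024; /* small stand-in for 2 GiB */
    size_t used = 0;

    /* step 1: build a buffer just below the limit; should succeed */
    size_t below = limit - 4096;
    char* s = malloc(below);
    if (!s) return 1;
    memset(s, 'x', below);
    used += below;
    printf("allocated %zu bytes, under the limit\n", used);

    /* step 2: a little more pushes past the limit; the real worker
     * should exit with ExitCode.E_NOT_ENOUGH_MEMORY at this point */
    used += 8192;
    if (used > limit) {
        fprintf(stderr, "worker would exit with E_NOT_ENOUGH_MEMORY\n");
        free(s);
        return 2;
    }
    free(s);
    return 0;
}
```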
