Add cosmos-sdk vat snapshot/transcript retention configuration #10032

Merged
merged 7 commits into from
Sep 6, 2024

Conversation

gibson042
Member

Ref #9174
Fixes #9387
Fixes #9386

TODO:

Description

Adds two consensus-independent cosmos-sdk swingset configuration options for propagation in AG_COSMOS_INIT: vat-snapshot-retention ("debug" vs. "operational") and vat-transcript-retention ("archival" vs. "operational" vs. "default"), with values chosen to correspond with artifactMode. The former defaults to "operational" and the latter defaults to "default", which infers a value from cosmos-sdk pruning to allow simple configuration of archiving nodes.

It also updates the semantics of TranscriptStore keepTranscripts: false configuration to remove items from only the previously-current span rather than from all previous spans when rolling over (to avoid expensive database churn). Removal of older items can be accomplished by reloading from an export that does not include them.
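To make the new rollover semantics concrete, here is a minimal in-memory sketch. It is not the swing-store implementation: `newToyState`, `openToyTranscriptStore`, `addItem`, and `rolloverSpan` are hypothetical names chosen for illustration, and the real store is backed by SQLite.

```javascript
// Toy model only; all names here are hypothetical, not the swing-store API.
const newToyState = () => ({
  items: new Map(), // transcript position -> item
  spans: [], // closed (historical) span records
  current: { startPos: 0, endPos: 0 },
});

function openToyTranscriptStore(state, { keepTranscripts }) {
  return {
    addItem(item) {
      state.items.set(state.current.endPos, item);
      state.current.endPos += 1;
    },
    rolloverSpan() {
      const closed = state.current;
      state.spans.push(closed);
      state.current = { startPos: closed.endPos, endPos: closed.endPos };
      if (!keepTranscripts) {
        // Remove items from the span that was just closed, and only that span.
        for (let pos = closed.startPos; pos < closed.endPos; pos += 1) {
          state.items.delete(pos);
        }
      }
    },
  };
}

// A store that first ran with keepTranscripts: true ...
const state = newToyState();
const archival = openToyTranscriptStore(state, { keepTranscripts: true });
archival.addItem('a');
archival.addItem('b');
archival.rolloverSpan();

// ... and is later reopened with keepTranscripts: false.
const operational = openToyTranscriptStore(state, { keepTranscripts: false });
operational.addItem('c');
operational.rolloverSpan();

// Items from the older historical span are left in place.
const remainingPositions = [...state.items.keys()];
```

In this sketch the items at positions 0 and 1 survive the second rollover, which mirrors the description above: dropping them would require reloading from an export that does not include them.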

Security Considerations

I don't think this changes any relevant security posture.

Scaling Considerations

This will reduce the SQLite disk usage for any node that is not explicitly configured to retain snapshots and/or transcripts. The latter in particular is expected to have significant benefits for mainnet (as noted in #9174, about 116 GB ÷ 147 GB ≈ 79% of the database on 2024-03-29 was vat transcript items).

Documentation Considerations

The new fields are documented in our default TOML template, and captured in a JSDoc type on the JavaScript side.

Testing Considerations

This PR extends TranscriptStore coverage to include keepTranscripts true vs. false, but I don't see a good way to cover Go→JS propagation other than manually (which I have done). It should be possible to add testing for the use and validation of resolvedConfig in AG_COSMOS_INIT handling, but IMO that is best saved for after completion of split-brain (to avoid issues with same-process Go–JS entanglement).

Upgrade Considerations

This is all kernel code that can be used at any node restart (i.e., because the configuration is consensus-independent, it doesn't even need to wait for a chain software upgrade). But we should mention the new cosmos-sdk configuration in release notes, because it won't be added to existing app.toml files already in use.

slogfile = "{{ .Swingset.SlogFile }}"

# The maximum number of vats that the SwingSet kernel will bring online. A lower number
# requires less memory but may have a negative performance impact if vats need to
# be frequently paged out to remain under this limit.
max_vats_online = {{ .Swingset.MaxVatsOnline }}
max-vats-online = {{ .Swingset.MaxVatsOnline }}
Member Author

Contributor

Good catch!

Member

mhofman Sep 5, 2024

packages/cosmic-swingset/src/chain-main.js (outdated, resolved)
Comment on lines 83 to 93
// Transcripts are broken up into "spans", delimited by heap snapshots.
// For every vatID, there will be exactly one current span with isCurrent=1,
// and zero or more non-current (historical) spans with isCurrent=null.
// If we take a heap snapshot after the first hundred deliveries and again
// after the second hundred (i.e., after zero-indexed deliveries 99 and 199),
// and have not yet performed a delivery after the second snapshot, we'll have
// two historical spans (one with startPos=0 and endPos=100, the second with
// startPos=100 and endPos=200) and a single empty current span with
// startPos=200 and endPos=200. After we perform the next delivery, the
// single current span will still have startPos=200 but will now have
// endPos=201.
Member Author

@warner I updated this to avoid mixing zero-indexed and one-indexed values; please look it over for correctness.

Member

Yeah, I think that's correct. Notes:

  • "exactly one current span with isCurrent=1" admits the thought that maybe some "current spans" don't have isCurrent=1. Maybe say "exactly one current span (i.e. isCurrent=1)" ? Ditto for the second clause.
  • Yep, my previous wording was confused: if the first span included a delivery=100, then it should have had endPos=101, because we increment the span record's .endPos just after adding the item.

Member

Do we actually ever have empty spans? I thought the first entry of a span was to include a "load snapshot" entry. Maybe the first span can be empty, but I think that has a load entry as well?

I believe that means pos are not strictly equivalent to deliveries.
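For readers following this thread, the bookkeeping in the code comment can be sketched as below. This follows the comment's simplified model of one transcript item per delivery; as noted just above, real spans may also contain non-delivery entries, so positions need not equal delivery counts. `makeSpanTracker` and its methods are hypothetical names, not part of swing-store.

```javascript
// Hypothetical helper illustrating startPos/endPos/isCurrent bookkeeping.
function makeSpanTracker(vatID) {
  const spans = [{ vatID, startPos: 0, endPos: 0, isCurrent: 1 }];
  const current = () => spans.find(span => span.isCurrent === 1);
  return {
    // Simplified: each delivery adds exactly one item to the current span,
    // and endPos is incremented just after the item is added.
    deliver() {
      current().endPos += 1;
    },
    // A heap snapshot closes the current span and opens an empty one.
    snapshot() {
      const old = current();
      old.isCurrent = null;
      spans.push({ vatID, startPos: old.endPos, endPos: old.endPos, isCurrent: 1 });
    },
    spans: () => spans.map(span => ({ ...span })),
  };
}

const tracker = makeSpanTracker('v1');
for (let delivery = 0; delivery < 200; delivery += 1) {
  tracker.deliver();
  // Snapshot after zero-indexed deliveries 99 and 199.
  if (delivery === 99 || delivery === 199) tracker.snapshot();
}
const beforeNextDelivery = tracker.spans();
tracker.deliver();
const afterNextDelivery = tracker.spans();
```

Under these assumptions, `beforeNextDelivery` holds two historical spans (0 to 100 and 100 to 200) plus an empty current span at 200, and `afterNextDelivery` shows the current span growing to endPos 201.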

Member Author

This file is best reviewed while ignoring whitespace changes.

Contributor

siarhei-agoric left a comment

Looks good to me.

Member

mhofman commented Sep 5, 2024

vat-transcript-retention ("archival" vs. "operational" vs. "default") cosmos-sdk swingset configuration (values chosen to correspond with artifactMode)

Should we add a "replay" option for consistency? I suppose that without a plan for deletion, that could be implemented the same as "archival" for now.

Member

mhofman commented Sep 5, 2024

I don't see a good way to cover Go→JS propagation other than manually (which I have done).

One test would be to bootstrap a node in "operational" mode and verify that you cannot export the swing-store in "replay" or "archival" mode (before the config change, you could).

Member

warner left a comment

Looks good. I don't know Go well enough to approve that side (please wait for @mhofman's r+), but the JS side looks OK.

golang/cosmos/x/swingset/config.go (outdated, resolved)
packages/cosmic-swingset/src/chain-main.js (outdated, resolved)

Member

mhofman left a comment

Overall this is good. Some minor feedback, feel free to merge once addressed.

# If relative, it is interpreted against the application home directory
# (e.g., ~/.agoric).
# May be overridden by a SLOGFILE environment variable, which if relative is
Member

overridden? I thought the config took precedence if present?

Member Author

Nope, CLI option > environment variable > config file: https://github.com/spf13/viper?tab=readme-ov-file#why-viper

Viper uses the following precedence order. Each item takes precedence over the item below it:

  • explicit call to Set
  • flag
  • env
  • config
  • key/value store
  • default
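As a quick illustration of that ordering, the toy resolver below walks the sources from highest to lowest precedence. It is not Viper's API (Viper is a Go library); `resolveSetting`, the source names, and the sample values are all assumptions made for this sketch.

```javascript
// Hypothetical resolver: mirrors only the precedence order quoted above.
const SOURCES_BY_PRECEDENCE = [
  'set', // explicit call to Set
  'flag',
  'env',
  'config',
  'keyValueStore',
  'default',
];

function resolveSetting(key, sources) {
  for (const name of SOURCES_BY_PRECEDENCE) {
    const source = sources[name];
    if (source !== undefined && key in source) {
      return { value: source[key], from: name };
    }
  }
  return { value: undefined, from: undefined };
}

// With both an environment variable and a config file entry present,
// the environment variable wins.
const resolved = resolveSetting('slogfile', {
  env: { slogfile: '/tmp/from-env.slog' },
  config: { slogfile: 'from-config.slog' },
  default: { slogfile: '' },
});
```

Here `resolved.from` is `'env'`, which matches the template comment's statement that SLOGFILE may override the config file value.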

packages/cosmic-swingset/src/chain-main.js (outdated, resolved)
Comment on lines 58 to 59
# * "default": determined by ` + "`pruning`" + ` ("archival" if ` + "`pruning`" + ` is
# "nothing", otherwise "operational")
Member

Note to self: this is sufficient. Even though a pruning strategy of "nothing" internally sets KeepRecent to 0, that value is not otherwise valid (2 is the enforced minimum).
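The "default" rule quoted above can be sketched as follows. In this PR the resolution happens on the Go side, so this JavaScript version is only an illustrative transliteration; `resolveVatTranscriptRetention` is a hypothetical name and the error text is not the actual message.

```javascript
// Illustrative only: the real resolution is implemented in Go.
const VAT_TRANSCRIPT_RETENTION_VALUES = ['archival', 'operational', 'default'];

function resolveVatTranscriptRetention(configured = 'default', pruning) {
  if (!VAT_TRANSCRIPT_RETENTION_VALUES.includes(configured)) {
    throw Error(
      `unsupported vat-transcript-retention ${JSON.stringify(configured)}`,
    );
  }
  if (configured !== 'default') {
    // An explicit "archival" or "operational" is used as-is.
    return configured;
  }
  // "default" is inferred from the cosmos-sdk `pruning` setting.
  return pruning === 'nothing' ? 'archival' : 'operational';
}

const forArchiveNode = resolveVatTranscriptRetention('default', 'nothing');
const forPrunedNode = resolveVatTranscriptRetention('default', 'everything');
const explicitChoice = resolveVatTranscriptRetention('archival', 'everything');
```

The point of the design is that a node already configured with `pruning = "nothing"` becomes an archiving node for transcripts without any extra setting.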

// See CustomAppConfig in ../../daemon/cmd/root.go.
type ExtendedConfig struct {
	serverconfig.Config `mapstructure:",squash"`
	Swingset SwingsetConfig `mapstructure:"swingset"`
Member

Oh yikes, I missed the missing mapstructure in the last PR review. How did it even work? Does it default to a lowercasing of the struct property?

Member Author

Yeah, I think so.

packages/cosmic-swingset/src/chain-main.js (outdated, resolved)
});
const { transcriptStore } = kernelStorage;
const { commit, close } = hostStorage;
const testTranscriptStore = test.macro({
Member

Wondering whether we should (and whether there's an easy way to) test that older historical spans are not removed when disabling keepTranscripts.

Member Author

gibson042 Sep 6, 2024

We'd need to import data that already includes a historical span, which is possible but not trivial. Given that, it didn't seem worth it because I think we don't actually care that they are left around.

golang/cosmos/x/swingset/config.go (outdated, resolved)
golang/cosmos/x/swingset/config.go (resolved)
// last snapshot of their vat)
// * "default": determined by `pruning` ("archival" if `pruning` is
// "nothing", otherwise "operational")
VatTranscriptRetention string `mapstructure:"vat-transcript-retention" json:"vatTranscriptRetention,omitempty"`
Member

I'm not sure omitempty makes sense here. I'd argue this ends up required at the JS/golang interface.

Member Author

Eh, I'd rather not send an empty string only for this field if that ever comes up. If it truly were required, we'd enforce that on the JS side.

Member

I don't see how it could ever be empty given the resolution logic

swingsetConfig;
const keepSnapshots =
  vatSnapshotRetention === 'debug' ||
  (!vatSnapshotRetention && ['1', 'true'].includes(XSNAP_KEEP_SNAPSHOTS));
Member

I'm not sure how I feel about the env variable "overriding" an explicit "operational" retention in the config. I think that's fine, but maybe a comment here making the intention explicit would help.

Member Author

That actually wouldn't happen, because of the !vatSnapshotRetention guard. But at any rate, I've updated both keepSnapshots and keepTranscripts to more clearly apply fallback logic only when the config value is falsy.
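One way to spell out that fallback is sketched below. This is not necessarily the code that landed: `resolveKeepSnapshots` is a hypothetical helper, while `vatSnapshotRetention` and `XSNAP_KEEP_SNAPSHOTS` are the names from the snippet under review.

```javascript
// Hypothetical helper: the config value decides when present, and the
// environment variable is consulted only when the config value is falsy.
function resolveKeepSnapshots({ vatSnapshotRetention }, env) {
  if (vatSnapshotRetention) {
    return vatSnapshotRetention === 'debug';
  }
  return ['1', 'true'].includes(env.XSNAP_KEEP_SNAPSHOTS);
}

// An explicit "operational" is not overridden by the environment variable.
const explicitOperational = resolveKeepSnapshots(
  { vatSnapshotRetention: 'operational' },
  { XSNAP_KEEP_SNAPSHOTS: '1' },
);

// With no config value, the environment variable is the fallback.
const envFallback = resolveKeepSnapshots({}, { XSNAP_KEEP_SNAPSHOTS: 'true' });
```

This form returns the same results as the original boolean expression, but makes the "config first, environment second" intent visible in the control flow.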

cloudflare-workers-and-pages bot commented Sep 6, 2024

Deploying agoric-sdk with Cloudflare Pages

Latest commit: 8b3a6d4
Status: ✅ Deploy successful!
Preview URL: https://5a300033.agoric-sdk.pages.dev
Branch Preview URL: https://gibson-9174-transcript-span.agoric-sdk.pages.dev

gibson042 force-pushed the gibson-9174-transcript-span-retention branch 2 times, most recently from 9bd39ae to f61c1c1, September 6, 2024 02:43
gibson042 added the automerge:rebase (Automatically rebase updates, then merge) label Sep 6, 2024
gibson042 force-pushed the gibson-9174-transcript-span-retention branch from f61c1c1 to 8b3a6d4, September 6, 2024 14:11
mergify bot merged commit d6f50e3 into master Sep 6, 2024
80 checks passed
mergify bot deleted the gibson-9174-transcript-span-retention branch September 6, 2024 14:45
gibson042 added a commit that referenced this pull request Sep 12, 2024
mergify bot added a commit that referenced this pull request Sep 12, 2024
…lid value error message (#10081)

Ref #10032

## Description
```diff
-value for vat-transcript-retention must be in ["archival" "operational"]
+value for vat-transcript-retention must be in ["archival" "operational" "default"]
```

### Security Considerations
n/a

### Scaling Considerations
n/a

### Documentation Considerations
n/a

### Testing Considerations
n/a

### Upgrade Considerations
n/a