Formally define and consistently use "CapData [record]"/"Smallcaps [encoding]"/etc. #1478

Open
gibson042 opened this issue Jan 28, 2023 · 8 comments
Labels
kriskowal-review-2024-01 Issues that kriskowal wants to bring to the attention of the team for review as of January, 2024

Comments

@gibson042
Contributor

gibson042 commented Jan 28, 2023

Now that we have an implicit definition of "CapData" as a { body: string, slots: Copyable[] } structure with constraints on body (that will hopefully be relaxed), we should make that explicit (which may involve naming the @qclass-based encoding that preceded smallcaps).

From #1402 (review):

This PR is implicitly defining CapData as a record structure that currently encompasses a body string and a slots array, in which the encoding of body is implicitly either "Smallcaps" [1] if the first character is "#" or [unnamed original encoding] [2] otherwise.

Footnotes

  1. JSON text in which arbitrary pass-by-copy source data has been made representable mostly by translation of JSON-incompatible values into "#…" strings
  2. JSON text in which arbitrary pass-by-copy source data has been made representable mostly by translation of JSON-incompatible values into objects with @qclass members
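
As a rough illustration of the first-character rule described in the quote above, the dispatch could be sketched as follows. The helper names are hypothetical and not part of @endo/marshal, and the Smallcaps sample string reflects an assumption about what that encoding produces for `[undefined, 0n]`.

```typescript
// Hypothetical helpers for illustration only; not part of @endo/marshal.
type BodyEncoding = 'smallcaps' | 'qclass';

// Classify a CapData body string by its first character:
// "#" marks Smallcaps, anything else is the older @qclass-based encoding.
function detectBodyEncoding(body: string): BodyEncoding {
  return body.charAt(0) === '#' ? 'smallcaps' : 'qclass';
}

// Both encodings carry JSON text; Smallcaps just prefixes it with "#".
function parseBodyJson(body: string): unknown {
  const text = detectBodyEncoding(body) === 'smallcaps' ? body.slice(1) : body;
  return JSON.parse(text);
}

// Assumed Smallcaps form of [undefined, 0n].
const smallcapsBody = '#["#undefined","+0"]';
// @qclass form of the same data.
const qclassBody = '[{"@qclass":"undefined"},{"@qclass":"bigint","digits":"0"}]';

console.log(detectBodyEncoding(smallcapsBody));
console.log(detectBodyEncoding(qclassBody));
```

This only works because no valid JSON text starts with "#", so the two encodings cannot be confused.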
@mhofman
Contributor

mhofman commented Jan 28, 2023

I still believe that even {body: Copyable, slots: string[]} is too restrictive a definition for future CapData. When relaxing from using a serialized body, I'm not convinced we won't want to pick a different property name than body to hold the encoded but not serialized body.

I think, however, that slots will not change, and that routing layers will often be able to inspect/update slots while passing through the other properties, whatever their encoding / serialization is. We should, however, make it clear that any other property could hold "copyable" data, not just strings.

If we want, we can place a restriction that any future body-related property will have its name start with body, and that properties with other names represent invalid CapData.

I suggest the following types to describe CapData:

type Copiable = string | number | boolean | null | Copiable[] | { [key: string]: Copiable };
type JSONFirstCharacter = '{' | '[' | 'n' | 't' | 'f' | '"';

type CapData<
    Slot extends string,
    BodyContent extends Copiable = Copiable,
    BodyKeySuffix extends string = '',
> = { [bodyKey in `body${BodyKeySuffix}`]: BodyContent } & { slots: Slot[] };

type SerializedSmallCapData<Slot extends string> = CapData<Slot, `#${string}`>;
type SerializedQClassCapData<Slot extends string> = CapData<Slot, `${JSONFirstCharacter}${string}`>;

type SmallCapData<Slot extends string> = CapData<Slot, Copiable, 'SmallCap'>;

@erights
Contributor

erights commented Jan 29, 2023

slots are not necessarily strings. For example, sometimes they are the remotable itself.

@mhofman
Contributor

mhofman commented Jan 29, 2023

I've always thought of CapData as an encoding that enables transport over a serialized channel. I suppose we could define a general-purpose CapData as allowing remotables in slots, and a specific SerializableCapData restricting the slots to string variants. However, I fail to see the cases where this would be useful vs. passing a copy of the original, unencoded data.

@dckc
Contributor

dckc commented Apr 1, 2023

what of bigints and undefined?

@mhofman
Contributor

mhofman commented Apr 1, 2023

> what of bigints and undefined?

They are not JSON serializable and are encoded as strings by SmallCap and QClass Cap encoding.

More precisely, they are encoded as strings by Smallcaps and as objects by the older @qclass-based encoding.

import '@endo/init';
import { makeMarshal } from '@endo/marshal';

// With no options, makeMarshal() produces the older @qclass-based encoding.
const marshaller = makeMarshal();
marshaller.toCapData(harden([undefined, 0n]));
// => { body: '[{"@qclass":"undefined"},{"@qclass":"bigint","digits":"0"}]', slots: [] }
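
To make the "strings vs. objects" distinction concrete, here is a minimal sketch that recognizes just these two special values in already-parsed JSON from either encoding. It is not the marshal implementation: the @qclass shapes are taken from the output above, while the Smallcaps string forms ("#undefined", and a leading "+" or "-" for bigints) are an assumption about that encoding. All names are hypothetical.

```typescript
// Illustrative only: covers just `undefined` and bigints, not the full encodings.
type SpecialLeaf =
  | { kind: 'undefined' }
  | { kind: 'bigint'; digits: string }
  | { kind: 'other' };

// Older encoding: special values are objects tagged with an "@qclass" member.
function classifyQClassLeaf(leaf: unknown): SpecialLeaf {
  if (typeof leaf === 'object' && leaf !== null) {
    const record = leaf as { [key: string]: unknown };
    const qclass = record['@qclass'];
    if (qclass === 'undefined') return { kind: 'undefined' };
    const digits = record['digits'];
    if (qclass === 'bigint' && typeof digits === 'string') {
      return { kind: 'bigint', digits };
    }
  }
  return { kind: 'other' };
}

// Smallcaps (assumed forms): special values are strings with a leading marker.
function classifySmallcapsLeaf(leaf: unknown): SpecialLeaf {
  if (typeof leaf === 'string') {
    if (leaf === '#undefined') return { kind: 'undefined' };
    if (/^[+-][0-9]+$/.test(leaf)) {
      // Drop the "+" of non-negative values; keep the "-" of negative ones.
      const digits = leaf.charAt(0) === '+' ? leaf.slice(1) : leaf;
      return { kind: 'bigint', digits };
    }
  }
  return { kind: 'other' };
}

const qclassLeaves: unknown[] = JSON.parse(
  '[{"@qclass":"undefined"},{"@qclass":"bigint","digits":"0"}]',
);
const smallcapsLeaves: unknown[] = JSON.parse('["#undefined","+0"]');

const fromQClass = qclassLeaves.map(leaf => classifyQClassLeaf(leaf));
const fromSmallcaps = smallcapsLeaves.map(leaf => classifySmallcapsLeaf(leaf));
console.log(JSON.stringify(fromQClass) === JSON.stringify(fromSmallcaps));
```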

@warner
Contributor

warner commented Aug 4, 2023

All the CapData type definitions I've seen have been parameterized by the type of slot, so swingset knows about CapData<Kref> and CapData<Vref> (both of which are strings). The Board's serializer generates something different (boardIDs, but still a string), while some of its internals have an unusual need for marshallers that deal in non-string "slots" (which I think is what @erights was referring to), as a form of intermediate representation. Certainly anything that crosses a vat-like domain boundary will need slots that can be looked up in a c-list, and anything that crosses a memory-space boundary or network connection will need data-like slots.
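
A minimal sketch of what slot-parameterized CapData and a c-list lookup might look like. This is not swingset's actual code: `CList`, `translateSlots`, and the sample kref/vref strings are hypothetical, chosen only to show that translation touches slots and leaves the body alone.

```typescript
// Hypothetical names throughout; this is not swingset's actual code.
type CapData<Slot> = { body: string; slots: Slot[] };
type Kref = string;
type Vref = string;
// A c-list modeled as a plain lookup table from kernel refs to vat refs.
type CList = { [kref: string]: Vref };

function translateSlots(capData: CapData<Kref>, clist: CList): CapData<Vref> {
  const slots = capData.slots.map(kref => {
    if (!Object.prototype.hasOwnProperty.call(clist, kref)) {
      throw new Error('no c-list entry for ' + kref);
    }
    return clist[kref];
  });
  // The body refers to slots by index, so it passes through untouched.
  return { body: capData.body, slots };
}

const kernelForm: CapData<Kref> = {
  body: '[{"@qclass":"slot","index":0},{"@qclass":"slot","index":1}]',
  slots: ['ko12', 'kp34'],
};
const vatForm = translateSlots(kernelForm, { ko12: 'o-5', kp34: 'p-7' });
console.log(vatForm.slots);
```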

If the marshaller grows additional methods to merely transform a passable object graph into a JSON-serializable object tree, instead of going all the way to a string, I'd like to see that named .encode/.decode instead of expanding the return type of .serialize to be something that doesn't involve a string.

Although.. I see a fair point that could be made, that .serialize doesn't currently return a string, but returns { body: string, slots: SlotType[] }, and that a serialize() fully deserving of the name would return just a string. We got here because marshaller.serialize does exactly as much as we needed for swingset's syscall.send (and then virtual-object state serialization), and refrains from doing too much: going all the way to a single string would make the kernel's c-list translation way harder.

I see the value of a marshaller which does slightly less work. It would transfer some responsibility onto the caller (assuming they'll eventually need to get the data into a bytestring, which is almost always true), but it would also let them avoid some amount of duplicate encoding (their one JSON.stringify can be spent on a structure that includes all the syscall and command-describing metadata, as well as the object-tree .body replacement).

I'm not sure how I feel about the idea of expanding body to be more than a string, while still calling it "CapData". We have a ton of code which relies upon it being a single string, for sure, and I don't see a lot of value (but do see a lot of complexity) in adding extra properties and making the recipient figure out what flavor they're receiving. Maybe we define a new type like "CapDataTree" or something, with a property not named body to hold the JSON-serializable (but not yet serialized) non-slot data. The new .encode() method could return this type, leaving .serialize() to return the standard string-.body-ied CapData. Callers who are going to do more serialization anyways can use encode.
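
To sketch the split being floated here: `CapDataTree` is only a name suggested above, and the `tree` property, `serializeFromTree`, and `buildMessage` below are hypothetical placeholders rather than any existing marshal API. The point is that a string-bodied form can be layered on top of the tree form, while callers who stringify anyway can skip the inner encoding.

```typescript
// Hypothetical shapes: "tree" is a placeholder for a property not named "body".
type JsonTree = string | number | boolean | null | JsonTree[] | { [key: string]: JsonTree };
type SerializedCapData<Slot> = { body: string; slots: Slot[] };
type CapDataTree<Slot> = { tree: JsonTree; slots: Slot[] };

// A serialize()-style result can be derived from an encode()-style result:
// same slots, with the tree stringified into body.
function serializeFromTree<Slot>(encoded: CapDataTree<Slot>): SerializedCapData<Slot> {
  return { body: JSON.stringify(encoded.tree), slots: encoded.slots };
}

// A caller that will stringify anyway can embed the tree in its own message,
// spending a single JSON.stringify on the whole structure.
function buildMessage(method: string, args: CapDataTree<string>): string {
  return JSON.stringify({ method, args });
}

const encoded: CapDataTree<string> = { tree: ['hello', { count: 3 }], slots: [] };

// Going through the string body nests one JSON text inside another.
const viaSerialize = JSON.stringify({ method: 'send', args: serializeFromTree(encoded) });
// Going through the tree yields one flat JSON text.
const viaEncode = buildMessage('send', encoded);

console.log(viaSerialize);
console.log(viaEncode);
```

The nested version has to escape every quote of the inner body, which is the duplicate encoding mentioned above.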

The main downside I can think of is that it weakens the canonical-serialization story a bit. With serialize generating a .body string, sorted-properties or whatever are entirely the responsibility of Marshal. If the caller takes over that responsibility, it might be harder to convince ourselves that we'll get consistent results, or those results will depend upon more components than before (I'm thinking of things like insertion order for properties: would .encode guarantee a consistent insertion order, to facilitate a subsequent JSON.stringify getting them in the right order? would it even matter?).

@mhofman
Contributor

mhofman commented Oct 4, 2023

> I'm not sure how I feel about the idea of expanding body to be more than a string, while still calling it "CapData". We have a ton of code which relies upon it being a single string, for sure, and I don't see a lot of value (but do see a lot of complexity) in adding extra properties and making the recipient figure out what flavor they're receiving. Maybe we define a new type like "CapDataTree" or something, with a property not named body to hold the JSON-serializable (but not yet serialized) non-slot data. The new .encode() method could return this type, leaving .serialize() to return the standard string-.body-ied CapData. Callers who are going to do more serialization anyways can use encode.

In the structure I suggest, we're using a key named body# to hold the small cap serializable object. body would remain a string for q-class or small cap serialized capdata. As such there is no more overload in the meaning or type of body.

The main requirement is that code handling "cap data" in the general form passes through any non slot property as-is, to make it easier to evolve our notion of "cap data" over time.
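
A minimal sketch of that pass-through requirement, assuming a routing layer that only understands `slots`. The `AnyCapData` type and `mapSlots` helper are hypothetical, and the `body#` sample merely mirrors the key name mentioned above.

```typescript
// Hypothetical sketch: every property other than "slots" is opaque to this layer.
type AnyCapData<Slot> = { slots: Slot[] } & { [key: string]: unknown };

// Translate slots and copy every other property through as-is,
// whatever its name, encoding, or serialization.
function mapSlots<A, B>(
  capData: AnyCapData<A>,
  translate: (slot: A) => B,
): AnyCapData<B> {
  const { slots, ...rest } = capData;
  return { ...rest, slots: slots.map(slot => translate(slot)) };
}

// Works for today's string body...
const serialized: AnyCapData<string> = { body: '#[]', slots: ['ko12'] };
// ...and for a not-yet-serialized variant under a different key.
const unserialized: AnyCapData<string> = { 'body#': ['hello'], slots: ['ko12'] };

const toVref = (kref: string): string => (kref === 'ko12' ? 'o-5' : kref);
const routedSerialized = mapSlots(serialized, toVref);
const routedUnserialized = mapSlots(unserialized, toVref);
console.log(routedSerialized, routedUnserialized);
```

Because the helper never names the body property, it keeps working if the notion of "cap data" grows new body variants.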

> The main downside I can think of is that it weakens the canonical-serialization story a bit

Correct, we shouldn't rely on comparing a serialized form unless that serialization has a canonical representation (like djson). However, marshal does not care about property key order when decoding, and does in fact sort keys when encoding, so we're assured the key order of a plain JSON.stringify is deterministic.
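
To illustrate why sorting keys at encode time makes a later plain JSON.stringify deterministic, here is a small sketch. `sortKeysDeep` is a hypothetical stand-in and does not reproduce marshal's actual ordering rules; it only shows that rebuilding objects in a fixed key order removes any dependence on the caller's insertion order.

```typescript
type JsonValue = string | number | boolean | null | JsonValue[] | { [key: string]: JsonValue };

// Rebuild every object with its keys inserted in sorted order, recursively.
function sortKeysDeep(value: JsonValue): JsonValue {
  if (Array.isArray(value)) {
    return value.map(item => sortKeysDeep(item));
  }
  if (typeof value === 'object' && value !== null) {
    const sorted: { [key: string]: JsonValue } = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = sortKeysDeep(value[key]);
    }
    return sorted;
  }
  return value;
}

// Same data, different insertion orders.
const first: JsonValue = { b: 1, a: { d: 2, c: 3 } };
const second: JsonValue = { a: { c: 3, d: 2 }, b: 1 };

// Without sorting, the JSON texts differ; with sorting, they match.
console.log(JSON.stringify(first) === JSON.stringify(second));
console.log(JSON.stringify(sortKeysDeep(first)) === JSON.stringify(sortKeysDeep(second)));
```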

@kriskowal
Member

Have these formal definitions been documented and, if so, where?

@kriskowal kriskowal added the kriskowal-review-2024-01 Issues that kriskowal wants to bring to the attention of the team for review as of January, 2024 label Jan 8, 2024