Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove streaming API, fill in spec text, etc #27

Merged
merged 8 commits into from
Nov 1, 2023
Merged

Remove streaming API, fill in spec text, etc #27

merged 8 commits into from
Nov 1, 2023

Conversation

bakkot
Copy link
Collaborator

@bakkot bakkot commented Oct 26, 2023

This removes the streaming API per #13 (comment).

I've also filled out the details of the spec text; documented base64 variations in the RFCs, in other languages, and in existing base64 libraries; settled all the outstanding questions about the API; and updated the polyfill and added tests for it.

cc @phoddie

@bakkot bakkot mentioned this pull request Oct 26, 2023
test-polyfill.mjs Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
Comment on lines +67 to +69
1. NOTE: When _strict_ is *false*, the algorithm below is equivalent to the <a href="https://infra.spec.whatwg.org/#forgiving-base64-decode">forgiving-base64 decode</a> operation in HTML.
1. If _strict_ is *false*, then
1. Remove all occurrences of U+0009 (TAB), U+000A (LF), U+000C (FF), U+000D (CR), and U+0020 (SPACE) from _input_.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: The WHATWG algorithm (and thus atob) does not consider U+000B VERTICAL TABULATION to be whitespace, but ECMA-262 in general does, as does e.g. base64 --decode --ignore-garbage (which isn't saying much though, because --ignore-garbage discards any non-alphabet character and only U+000A LINE FEED is ignored without it).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

262 has a much bigger set of whitespace. I'd prefer to match WHATWG unless there's a specific reason to deviate.

spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated
1. Append U+0041 (LATIN CAPITAL LETTER A) to _input_ twice.
1. Else if _lastChunkSize_ is 3, then
1. Append U+0041 (LATIN CAPITAL LETTER A) to _input_.
1. Let _bytes_ be the unique (possibly empty) sequence of bytes such that applying the base64 encoding specified in section 4 of RFC 4648 to that sequence would produce _input_.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revisiting this step, solving a puzzle seems a bit too opaque for what is fundamentally a simple algorithm. Maybe it would be better to specify things more directly?

    1. Let _inputLength_ be the length of _input_.
    1. If _inputLength_ modulo 4 is 0, then
      1. NOTE: If _input_ is valid, it might end with one or two padding characters that must be removed before comparing its contents against the standard base64 alphabet.
      1. If _input_ is not empty and the last element of _input_ is U+003D (EQUALS SIGN), remove the last element of _input_.
      1. If _input_ is not empty and the last element of _input_ is U+003D (EQUALS SIGN), remove the last element of _input_.
      1. Set _inputLength_ to the length of _input.
    1. Else,
      1. If _strict_ is *true*, throw a *SyntaxError* exception.
    1. If _input_ contains any elements which are not also elements of the standard base64 alphabet, throw a *SyntaxError* exception.
    1. Let _lastChunkSize_ be _inputLength_ modulo 4.
    1. If _lastChunkSize_ is 1, then
      1. Throw a *SyntaxError* exception.
    1. Else if _lastChunkSize_ is 2, then
      1. Append U+0041 (LATIN CAPITAL LETTER A) to _input_ twice.
    1. Else if _lastChunkSize_ is 3, then
      1. Append U+0041 (LATIN CAPITAL LETTER A) to _input_.
    1. Let _paddedInputLength_ be the length of _input_.
    1. Let _bytes_ be an empty List of octets.
    1. Let _buffer_ be 0.
    1. Let _i_ be 0.
    1. Repeat, while _i_ &lt; _paddedInputLength_,
      1. Set _buffer_ to (_buffer_ × 64) + StringIndexOf(CodePointsToString(standard base64 alphabet), _input_[_i_], 0).
      1. If _i_ modulo 4 is 3, then
        1. Append floor(_buffer_ / 256<sup>2</sup>) to _bytes_.
        1. Append floor((_buffer_ modulo 256<sup>2</sup>) / 256) to _bytes_.
        1. Append (_buffer_ modulo 256) to _bytes_.
        1. Set _buffer_ to 0.
      1. Set _i_ to _i_ + 1.
    1. Let _byteLength_ be the length of _bytes_.
    1. If _lastChunkSize_ is 2, then
      1. If _strict_ is *true* and _bytes_[_byteLength_ - 2] is not 0, throw a *SyntaxError* exception.
      1. Set _byteLength_ to _byteLength_ - 2.
    1. Else if _lastChunkSize_ is 3, then
      1. If _strict_ is *true* and _bytes_[_byteLength_ - 1] is not 0, throw a *SyntaxError* exception.
      1. Set _byteLength_ to _byteLength_ - 1.
    1. Let _result_ be ? AllocateTypedArray(*"Uint8Array"*, %Uint8Array%, %Uint8Array.prototype%, _byteLength_).
    1. Set the value at each index of _result_.[[ViewedArrayBuffer]].[[ArrayBufferData]] to the value at the corresponding index of _bytes_.
    1. Return _result_.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems much more opaque to me. What we mean is just do bog-standard base64 decoding. We shouldn't make implementations figure that that we've written down the standard base64 decoding algorithm, which they'll then ignore and use their own existing base64 decoding logic; we should just tell them to do base64 decoding.

The only reason it's phrased somewhat awkwardly is because the RFC doesn't actually specify a decoding algorithm, only encoding. But despite that awkwardness the current text seems much better to me than an explicit algorithm. The world does not need the base64 decoding algorithm written out yet again.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The world does not need the base64 decoding algorithm written out yet again.

Perhaps not, but "the <value> such that applying <encoding algorithm> would produce input" is hardly a good reference to any prior example. And from another perspective, ECMA-262 explicitly specifies all kinds of common specialized algorithms, including near-trivial ones like UTF16EncodeCodePoint and longer ones like Encode, and also a number of non-specialized algorithms such as StringIndexOf and StringPad—so what makes base64 special?

Copy link
Collaborator Author

@bakkot bakkot Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps not, but "the such that applying <encoding algorithm> would produce input" is hardly a good reference to any prior example.

... Isn't it? I mean, I'm open to other wordings for the thing I'm getting at but I don't really think there's much possibility for confusion with that wording. I am not going to write down the base64 algorithm again.

so what makes base64 special?

None of those algorithms are widely implemented under a common name, specified elsewhere in a document easy to incorporate by reference, and unambiguous. This one is.

(I guess UTF16EncodeCodePoint might meet all those criteria, but that one is small enough that there's very little you need to read to figure out that it's doing the obvious thing.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this one be made into a link?

Copy link

@gibson042 gibson042 Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of those algorithms are widely implemented under a common name, specified elsewhere in a document easy to incorporate by reference, and unambiguous. This one is.

Actually, many of them are. For example, UTF16EncodeCodePoint is RFC 2781 section 2.1 and encodeURIComponent is the WHATWG URL Standard UTF-8 percent-encode … using the component percent-encode set.

Copy link
Collaborator Author

@bakkot bakkot Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be OK swapping out our encodeURIComponent with a reference to WHATWG. But I would not exactly describe it as widely implemented, at least not nearly as much as base64.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceded. I still really dislike algorithm steps that are basically "solve this puzzle" or "find a preimage", but base64 decoding is in fact well understood so in practice implementations will "figure that that we've written down obliquely referenced the standard base64 decoding algorithm [implied by RFC 4648 but not written therein], which they'll then ignore and use their own existing base64 decoding logic". To that end, one final suggestion:

Suggested change
1. Let _bytes_ be the unique (possibly empty) sequence of bytes such that applying the base64 encoding specified in section 4 of RFC 4648 to that sequence would produce _input_.
1. Let _bytes_ be the unique (possibly empty) sequence of bytes resulting from decoding _input_ as base64 (such that applying the base64 encoding specified in section 4 of RFC 4648 to _bytes_ would produce _input_).

Copy link
Collaborator Author

@bakkot bakkot Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, done.

Also made the references to RFC 4648 be links per @ljharb.

1. If _characters_ cannot result from applying the base64 encoding specified in section 4 of RFC 4648 to some sequence of bytes, throw a *SyntaxError* exception.
1. Let _bytes_ be the unique sequence of bytes such that applying the base64 encoding specified in section 4 of RFC 4648 to that sequence would produce _characters_.
1. Let _strict_ be ToBoolean(? Get(_opts_, *"strict"*)).
1. NOTE: The order of validation and decoding in the algorithm below is not observable. Implementations are encouraged to perform them in whatever order is most efficient, possibly interleaving validation and stripping whitespace with decoding, as long as the behaviour is observably equivalent.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
1. NOTE: The order of validation and decoding in the algorithm below is not observable. Implementations are encouraged to perform them in whatever order is most efficient, possibly interleaving validation and stripping whitespace with decoding, as long as the behaviour is observably equivalent.
1. NOTE: The order of validation and decoding in the algorithm below is not observable except that both must fully complete before invocation of AllocateTypedArray. Implementations are encouraged to perform them in whatever order is most efficient, possibly interleaving validation and stripping whitespace with decoding, as long as the behaviour is observably equivalent.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ehhh, I don't think that's necessary. This is a non-normative note.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the impracticality of covering this with tests, I consider it important to emphasize somewhere in the spec text that it actually is required to fully validate arbitrarily-long input before doing something that may produce a different kind of observable error (e.g., passing to Uint8Array.fromBase64 input that is a very long string of valid characters followed by an invalid character [such as from "A".repeat(1e9) + "~"] must throw a SyntaxError rather than a RangeError).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling for excessively-long inputs is not generally well-specified at all. Many other algorithms implicitly ask implementations to allocate memory during execution of the algorithm, which is always a fallible operation, and do not specify what happens when that fails, even though in practice it implementations generally throw a RangeError. I don't want to imply there's more restrictions on this algorithm than any other. So I don't want to write down they can't throw a RangeError in the middle of this algorithm; they can, in practice, throw a RangeError in the middle of basically any algorithm.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, it does seem common for observable behavior of implementations to correspond with validity checks preempting allocation failures (see below), and I would like the spec to be clear about that everywhere. But point taken regarding this algorithm not being distinguished in that dimension.

$ eshost -se 'encodeURIComponent("€".repeat(4e8)+"\uDBFF").length'
#### ChakraCore

URIError: The URI to be encoded contains an invalid character

#### JavaScriptCore


#### Moddable XS

Error: memory full

#### SpiderMonkey

URIError: malformed URI sequence

#### V8

URIError: URI malformed

(the JavaScriptCore blank seems a silent failure similar to that of e.g. encodeURIComponent("€".repeat(4e8)).length—more like XS than like SpiderMonkey and V8)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, definitely agreed that it would be nice to be clear everywhere. I just don't want to do it in this place and not others, when there's nothing really special about this one. Rectifying memory handling is a larger project.

I'm also not totally sure that requiring validity checks up front is what we'd actually want - the case of excessively long inputs is quite rare, and requiring an extra pass prior to decoding just so the right error can get thrown in those cases seems like it would be a shame.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not totally sure that requiring validity checks up front is what we'd actually want - the case of excessively long inputs is quite rare, and requiring an extra pass prior to decoding just so the right error can get thrown in those cases seems like it would be a shame.

Sure. And that behavior doesn't require an extra pass, but it does require at least one pass (e.g., an implementation could check the remainder of input for validity before deciding to expose allocation failure as an error). Regardless, this discussion at least is settled as non-specific to base64.

ljharb added a commit to es-shims/es-arraybuffer-base64 that referenced this pull request Oct 31, 2023
ljharb added a commit to es-shims/es-arraybuffer-base64 that referenced this pull request Oct 31, 2023
@bakkot
Copy link
Collaborator Author

bakkot commented Nov 1, 2023

Since it's been a week I'm going to go ahead and merge this so the proposal is in a more coherent state, but we can continue discussion here or in a new issue.

@bakkot bakkot merged commit 4a1e7e8 into main Nov 1, 2023
1 check passed
@bakkot bakkot deleted the update branch November 1, 2023 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants