Remove streaming API, fill in spec text, etc #27

bakkot · 2023-10-26T06:26:46Z

This removes the streaming API per #13 (comment).

I've also filled out the details of the spec text; documented base64 variations in the RFCs, in other languages, and in existing base64 libraries; settled all the outstanding questions about the API; and updated the polyfill and added tests for it.

cc @phoddie

test-polyfill.mjs

playground/index-raw.html

spec.html

gibson042 · 2023-10-28T18:28:42Z

spec.html

+    1. NOTE: When _strict_ is *false*, the algorithm below is equivalent to the <a href="https://infra.spec.whatwg.org/#forgiving-base64-decode">forgiving-base64 decode</a> operation in HTML.
+    1. If _strict_ is *false*, then
+      1. Remove all occurrences of U+0009 (TAB), U+000A (LF), U+000C (FF), U+000D (CR), and U+0020 (SPACE) from _input_.


Observation: The WHATWG algorithm (and thus atob) does not consider U+000B VERTICAL TABULATION to be whitespace, but ECMA-262 in general does, as does e.g. base64 --decode --ignore-garbage (which isn't saying much though, because --ignore-garbage discards any non-alphabet character and only U+000A LINE FEED is ignored without it).

262 has a much bigger set of whitespace. I'd prefer to match WHATWG unless there's a specific reason to deviate.

spec.html

gibson042 · 2023-10-29T18:21:23Z

spec.html

+      1. Append U+0041 (LATIN CAPITAL LETTER A) to _input_ twice.
+    1. Else if _lastChunkSize_ is 3, then
+      1. Append U+0041 (LATIN CAPITAL LETTER A) to _input_.
+    1. Let _bytes_ be the unique (possibly empty) sequence of bytes such that applying the base64 encoding specified in section 4 of RFC 4648 to that sequence would produce _input_.


Revisiting this step, solving a puzzle seems a bit too opaque for what is fundamentally a simple algorithm. Maybe it would be better to specify things more directly?

1. Let _inputLength_ be the length of _input_. 1. If _inputLength_ modulo 4 is 0, then 1. NOTE: If _input_ is valid, it might end with one or two padding characters that must be removed before comparing its contents against the standard base64 alphabet. 1. If _input_ is not empty and the last element of _input_ is U+003D (EQUALS SIGN), remove the last element of _input_. 1. If _input_ is not empty and the last element of _input_ is U+003D (EQUALS SIGN), remove the last element of _input_. 1. Set _inputLength_ to the length of _input. 1. Else, 1. If _strict_ is *true*, throw a *SyntaxError* exception. 1. If _input_ contains any elements which are not also elements of the standard base64 alphabet, throw a *SyntaxError* exception. 1. Let _lastChunkSize_ be _inputLength_ modulo 4. 1. If _lastChunkSize_ is 1, then 1. Throw a *SyntaxError* exception. 1. Else if _lastChunkSize_ is 2, then 1. Append U+0041 (LATIN CAPITAL LETTER A) to _input_ twice. 1. Else if _lastChunkSize_ is 3, then 1. Append U+0041 (LATIN CAPITAL LETTER A) to _input_. 1. Let _paddedInputLength_ be the length of _input_. 1. Let _bytes_ be an empty List of octets. 1. Let _buffer_ be 0. 1. Let _i_ be 0. 1. Repeat, while _i_ < _paddedInputLength_, 1. Set _buffer_ to (_buffer_ × 64) + StringIndexOf(CodePointsToString(standard base64 alphabet), _input_[_i_], 0). 1. If _i_ modulo 4 is 3, then 1. Append floor(_buffer_ / 256<sup>2</sup>) to _bytes_. 1. Append floor((_buffer_ modulo 256<sup>2</sup>) / 256) to _bytes_. 1. Append (_buffer_ modulo 256) to _bytes_. 1. Set _buffer_ to 0. 1. Set _i_ to _i_ + 1. 1. Let _byteLength_ be the length of _bytes_. 1. If _lastChunkSize_ is 2, then 1. If _strict_ is *true* and _bytes_[_byteLength_ - 2] is not 0, throw a *SyntaxError* exception. 1. Set _byteLength_ to _byteLength_ - 2. 1. Else if _lastChunkSize_ is 3, then 1. If _strict_ is *true* and _bytes_[_byteLength_ - 1] is not 0, throw a *SyntaxError* exception. 1. Set _byteLength_ to _byteLength_ - 1. 1. Let _result_ be ? AllocateTypedArray(*"Uint8Array"*, %Uint8Array%, %Uint8Array.prototype%, _byteLength_). 1. Set the value at each index of _result_.[[ViewedArrayBuffer]].[[ArrayBufferData]] to the value at the corresponding index of _bytes_. 1. Return _result_.

That seems much more opaque to me. What we mean is just do bog-standard base64 decoding. We shouldn't make implementations figure that that we've written down the standard base64 decoding algorithm, which they'll then ignore and use their own existing base64 decoding logic; we should just tell them to do base64 decoding.

The only reason it's phrased somewhat awkwardly is because the RFC doesn't actually specify a decoding algorithm, only encoding. But despite that awkwardness the current text seems much better to me than an explicit algorithm. The world does not need the base64 decoding algorithm written out yet again.

The world does not need the base64 decoding algorithm written out yet again.

Perhaps not, but "the <value> such that applying <encoding algorithm> would produce input" is hardly a good reference to any prior example. And from another perspective, ECMA-262 explicitly specifies all kinds of common specialized algorithms, including near-trivial ones like UTF16EncodeCodePoint and longer ones like Encode, and also a number of non-specialized algorithms such as StringIndexOf and StringPad—so what makes base64 special?

Perhaps not, but "the such that applying <encoding algorithm> would produce input" is hardly a good reference to any prior example.

... Isn't it? I mean, I'm open to other wordings for the thing I'm getting at but I don't really think there's much possibility for confusion with that wording. I am not going to write down the base64 algorithm again.

so what makes base64 special?

None of those algorithms are widely implemented under a common name, specified elsewhere in a document easy to incorporate by reference, and unambiguous. This one is.

(I guess UTF16EncodeCodePoint might meet all those criteria, but that one is small enough that there's very little you need to read to figure out that it's doing the obvious thing.)

Can this one be made into a link?

None of those algorithms are widely implemented under a common name, specified elsewhere in a document easy to incorporate by reference, and unambiguous. This one is.

Actually, many of them are. For example, UTF16EncodeCodePoint is RFC 2781 section 2.1 and encodeURIComponent is the WHATWG URL Standard UTF-8 percent-encode … using the component percent-encode set.

I'd be OK swapping out our encodeURIComponent with a reference to WHATWG. But I would not exactly describe it as widely implemented, at least not nearly as much as base64.

Conceded. I still really dislike algorithm steps that are basically "solve this puzzle" or "find a preimage", but base64 decoding is in fact well understood so in practice implementations will "figure that that we've ~~written down~~ obliquely referenced the standard base64 decoding algorithm [implied by RFC 4648 but not written therein], which they'll then ignore and use their own existing base64 decoding logic". To that end, one final suggestion:

Suggested change

1. Let _bytes_ be the unique (possibly empty) sequence of bytes such that applying the base64 encoding specified in section 4 of RFC 4648 to that sequence would produce _input_.

1. Let _bytes_ be the unique (possibly empty) sequence of bytes resulting from decoding _input_ as base64 (such that applying the base64 encoding specified in section 4 of RFC 4648 to _bytes_ would produce _input_).

SGTM, done.

Also made the references to RFC 4648 be links per @ljharb.

gibson042 · 2023-10-29T18:26:06Z

spec.html

-      1. If _characters_ cannot result from applying the base64 encoding specified in section 4 of RFC 4648 to some sequence of bytes, throw a *SyntaxError* exception.
-      1. Let _bytes_ be the unique sequence of bytes such that applying the base64 encoding specified in section 4 of RFC 4648 to that sequence would produce _characters_.
+    1. Let _strict_ be ToBoolean(? Get(_opts_, *"strict"*)).
+    1. NOTE: The order of validation and decoding in the algorithm below is not observable. Implementations are encouraged to perform them in whatever order is most efficient, possibly interleaving validation and stripping whitespace with decoding, as long as the behaviour is observably equivalent.


Suggested change

1. NOTE: The order of validation and decoding in the algorithm below is not observable. Implementations are encouraged to perform them in whatever order is most efficient, possibly interleaving validation and stripping whitespace with decoding, as long as the behaviour is observably equivalent.

1. NOTE: The order of validation and decoding in the algorithm below is not observable except that both must fully complete before invocation of AllocateTypedArray. Implementations are encouraged to perform them in whatever order is most efficient, possibly interleaving validation and stripping whitespace with decoding, as long as the behaviour is observably equivalent.

Ehhh, I don't think that's necessary. This is a non-normative note.

Given the impracticality of covering this with tests, I consider it important to emphasize somewhere in the spec text that it actually is required to fully validate arbitrarily-long input before doing something that may produce a different kind of observable error (e.g., passing to Uint8Array.fromBase64 input that is a very long string of valid characters followed by an invalid character [such as from "A".repeat(1e9) + "~"] must throw a SyntaxError rather than a RangeError).

Handling for excessively-long inputs is not generally well-specified at all. Many other algorithms implicitly ask implementations to allocate memory during execution of the algorithm, which is always a fallible operation, and do not specify what happens when that fails, even though in practice it implementations generally throw a RangeError. I don't want to imply there's more restrictions on this algorithm than any other. So I don't want to write down they can't throw a RangeError in the middle of this algorithm; they can, in practice, throw a RangeError in the middle of basically any algorithm.

FWIW, it does seem common for observable behavior of implementations to correspond with validity checks preempting allocation failures (see below), and I would like the spec to be clear about that everywhere. But point taken regarding this algorithm not being distinguished in that dimension.

$ eshost -se 'encodeURIComponent("€".repeat(4e8)+"\uDBFF").length' #### ChakraCore URIError: The URI to be encoded contains an invalid character #### JavaScriptCore #### Moddable XS Error: memory full #### SpiderMonkey URIError: malformed URI sequence #### V8 URIError: URI malformed

(the JavaScriptCore blank seems a silent failure similar to that of e.g. encodeURIComponent("€".repeat(4e8)).length—more like XS than like SpiderMonkey and V8)

Yeah, definitely agreed that it would be nice to be clear everywhere. I just don't want to do it in this place and not others, when there's nothing really special about this one. Rectifying memory handling is a larger project.

I'm also not totally sure that requiring validity checks up front is what we'd actually want - the case of excessively long inputs is quite rare, and requiring an extra pass prior to decoding just so the right error can get thrown in those cases seems like it would be a shame.

I'm also not totally sure that requiring validity checks up front is what we'd actually want - the case of excessively long inputs is quite rare, and requiring an extra pass prior to decoding just so the right error can get thrown in those cases seems like it would be a shame.

Sure. And that behavior doesn't require an extra pass, but it does require at least one pass (e.g., an implementation could check the remainder of input for validity before deciding to expose allocation failure as an error). Regardless, this discussion at least is settled as non-specific to base64.

See tc39/proposal-arraybuffer-base64#27

bakkot · 2023-11-01T15:42:15Z

Since it's been a week I'm going to go ahead and merge this so the proposal is in a more coherent state, but we can continue discussion here or in a new issue.

bakkot added 5 commits October 25, 2023 13:37

update specification

8ac24b4

update ecmarkup

a3eb39b

update polyfill & playground, add tests

950cd75

add streaming implementation

b169fea

update readme and FAQ

0e4fc1e

bakkot mentioned this pull request Oct 26, 2023

Streaming / multi-shot #13

Closed

michaelficarra reviewed Oct 28, 2023

View reviewed changes

test-polyfill.mjs Outdated Show resolved Hide resolved

michaelficarra reviewed Oct 28, 2023

View reviewed changes

playground/index-raw.html Outdated Show resolved Hide resolved

michaelficarra reviewed Oct 28, 2023

View reviewed changes

spec.html Outdated Show resolved Hide resolved

michaelficarra reviewed Oct 28, 2023

View reviewed changes

spec.html Outdated Show resolved Hide resolved

gibson042 reviewed Oct 28, 2023

View reviewed changes

spec.html Outdated Show resolved Hide resolved

spec.html Outdated Show resolved Hide resolved

spec.html Outdated Show resolved Hide resolved

spec.html Outdated Show resolved Hide resolved

gibson042 mentioned this pull request Oct 28, 2023

How does the base64 chosen here compare to atob/btoa? #5

Closed

address comments

7d1b5cc

bakkot force-pushed the update branch from 2488941 to 7d1b5cc Compare October 29, 2023 00:56

gibson042 reviewed Oct 29, 2023

View reviewed changes

bakkot added 2 commits October 30, 2023 11:48

tweak wording

2aa320f

make references to RFC links

09e7e81

ljharb added a commit to es-shims/es-arraybuffer-base64 that referenced this pull request Oct 31, 2023

Remove "partial" things

9687555

See tc39/proposal-arraybuffer-base64#27

ljharb added a commit to es-shims/es-arraybuffer-base64 that referenced this pull request Oct 31, 2023

Remove "partial" things

168f8b2

See tc39/proposal-arraybuffer-base64#27

bakkot merged commit 4a1e7e8 into main Nov 1, 2023
1 check passed

bakkot deleted the update branch November 1, 2023 15:42

bakkot mentioned this pull request Nov 1, 2023

Support for writing to an existing buffer #26

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove streaming API, fill in spec text, etc #27

Remove streaming API, fill in spec text, etc #27

bakkot commented Oct 26, 2023

gibson042 Oct 28, 2023

bakkot Oct 28, 2023

gibson042 Oct 29, 2023

bakkot Oct 29, 2023

gibson042 Oct 30, 2023

bakkot Oct 30, 2023 •

edited

Loading

ljharb Oct 30, 2023

gibson042 Oct 30, 2023 •

edited

Loading

bakkot Oct 30, 2023 •

edited

Loading

gibson042 Oct 30, 2023

bakkot Oct 30, 2023 •

edited

Loading

gibson042 Oct 29, 2023

bakkot Oct 29, 2023

gibson042 Oct 30, 2023

bakkot Oct 30, 2023

gibson042 Oct 30, 2023

bakkot Oct 30, 2023

gibson042 Oct 30, 2023

bakkot commented Nov 1, 2023

	1. Let _bytes_ be the unique (possibly empty) sequence of bytes such that applying the base64 encoding specified in section 4 of RFC 4648 to that sequence would produce _input_.
	1. Let _bytes_ be the unique (possibly empty) sequence of bytes resulting from decoding _input_ as base64 (such that applying the base64 encoding specified in section 4 of RFC 4648 to _bytes_ would produce _input_).

	1. NOTE: The order of validation and decoding in the algorithm below is not observable. Implementations are encouraged to perform them in whatever order is most efficient, possibly interleaving validation and stripping whitespace with decoding, as long as the behaviour is observably equivalent.
	1. NOTE: The order of validation and decoding in the algorithm below is not observable except that both must fully complete before invocation of AllocateTypedArray. Implementations are encouraged to perform them in whatever order is most efficient, possibly interleaving validation and stripping whitespace with decoding, as long as the behaviour is observably equivalent.

Remove streaming API, fill in spec text, etc #27

Remove streaming API, fill in spec text, etc #27

Conversation

bakkot commented Oct 26, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bakkot Oct 30, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gibson042 Oct 30, 2023 • edited Loading

Choose a reason for hiding this comment

bakkot Oct 30, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bakkot Oct 30, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bakkot commented Nov 1, 2023

bakkot Oct 30, 2023 •

edited

Loading

gibson042 Oct 30, 2023 •

edited

Loading

bakkot Oct 30, 2023 •

edited

Loading

bakkot Oct 30, 2023 •

edited

Loading