tc39 · bakkot · Nov 1, 2023 · Oct 25, 2023 · Oct 25, 2023 · Oct 26, 2023
diff --git a/README.md b/README.md
@@ -6,7 +6,7 @@ It is currently at Stage 2 of [the TC39 process](https://tc39.es/process-documen
 
 Try it out on [the playground](https://tc39.github.io/proposal-arraybuffer-base64/).
 
-Initial spec text is available [here](https://tc39.github.io/proposal-arraybuffer-base64/spec/).
+Spec text is available [here](https://tc39.github.io/proposal-arraybuffer-base64/spec/).
 
 ## Basic API
 
@@ -32,27 +32,65 @@ This would add `Uint8Array.prototype.toBase64`/`Uint8Array.prototype.toHex` and
 
 ## Options
 
-An options bag argument for the base64 methods could allow specifying additional details such as the alphabet (to include at least `base64` and `base64url`), whether to generate / enforce padding, and how to handle whitespace.
+An options bag argument for the base64 methods allows specifying the alphabet as either `base64` or `base64url`.
 
-## Streaming API
+When decoding, the options bag also allows specifying `strict: false` (the default) or `strict: true`. When using `strict: false`, whitespace is legal and padding is optional. When using `strict: true`, whitespace is forbidden and standard padding (including any overflow bits in the last character being 0) is enforced - i.e., only [canonical](https://datatracker.ietf.org/doc/html/rfc4648#section-3.5) encodings are allowed.
 
-Additional `toPartialBase64` and `fromPartialBase64` methods would allow working with chunks of base64, at the cost of more complexity. See [the playground](https://tc39.github.io/proposal-arraybuffer-base64/) linked above for examples.
+## Streaming
 
-Streaming versions of the hex APIs are not included since they are straightforward to do manually.
+There is no support for streaming. However, it is [relatively straightforward to do efficiently in userland](./stream.mjs) on top of this API, with support for all the same options as the underlying functions.
 
-See [issue #13](https://github.com/tc39/proposal-arraybuffer-base64/issues/13) for discussion.
+## FAQ
 
-## Questions
+### What variation exists among base64 implementations in standards, in other languages, and in existing JavaScript libraries?
 
-### Should these be asynchronous?
+I have a [whole page on that](./base64.md), with tables and footnotes and everything. There is relatively little room for variation, but languages and libraries manage to explore almost all of the room there is.
 
-In practice most base64'd data I encounter is on the order of hundreds of bytes (e.g. SSH keys), which can be encoded and decoded extremely quickly. It would be a shame to require Promises to deal with such data, I think, especially given that the alternatives people currently use all appear to be synchronous.
+To summarize, base64 encoders can vary in the following ways:
+
+- Standard or URL-safe alphabet
+- Whether `=` is included in output
+- Whether to add linebreaks after a certain number of characters
+
+and decoders can vary in the following ways:
+
+- Standard or URL-safe alphabet
+- Whether `=` is required in input, and how to handle malformed padding (e.g. extra `=`)
+- Whether to fail on non-zero padding bits
+- Whether lines must be of a limited length
+- How non-base64-alphabet characters are handled (sometimes with special handling for only a subset, like whitespace)
+
+### What alphabets are supported?
+
+For base64, you can specify either base64 or base64url for both the encoder and the decoder.
+
+For hex, both lowercase and uppercase characters (including mixed within the same string) will decode successfully. Output is always lowercase.
+
+### How is `=` padding handled?
+
+Padding is always generated. The base64 decoder does not require it to be present unless `strict: true` is specified; however, if it is present, it must be well-formed (i.e., once stripped of whitespace the length of the string must be a multiple of 4, and there can be 1 or 2 padding `=` characters).
 
-Possibly we should have asynchronous versions for working with large data. That is not currently included. For the moment you can use the streaming API to chunk the work.
+### How are the extra padding bits handled?
+
+If the length of your input data isn't exactly a multiple of 3 bytes, then encoding it will use either 2 or 3 base64 characters to encode the final 1 or 2 bytes. Since each base64 character is 6 bits, this means you'll be using either 12 or 18 bits to represent 8 or 16 bits, which means you have an extra 4 or 2 bits which don't encode anything.
+
+Per [the RFC](https://datatracker.ietf.org/doc/html/rfc4648#section-3.5), decoders MAY reject input strings where the padding bits are non-zero. Here, non-zero padding bits are silently ignored when `strict: false` (the default), and are an error when `strict: true`.
+
+### How is whitespace handled?
+
+The encoders do not output whitespace. The hex decoder does not allow it as input. The base64 decoder allows [ASCII whitespace](https://infra.spec.whatwg.org/#ascii-whitespace) anywhere in the string as long as `strict: true` is not specified.
+
+### How are other characters handled?
+
+The presence of any other characters causes an exception.
+
+### Why are these synchronous?
+
+In practice most base64'd data I encounter is on the order of hundreds of bytes (e.g. SSH keys), which can be encoded and decoded extremely quickly. It would be a shame to require Promises to deal with such data, I think, especially given that the alternatives people currently use all appear to be synchronous.
 
-### What other encodings should be included, if any?
+### Why just these encodings?
 
-I think base64 and hex are the only encodings which make sense, and those are currently included.
+While other string encodings exist, none are nearly as commonly used as these two.
 
 See issues [#7](https://github.com/tc39/proposal-arraybuffer-base64/issues/7), [#8](https://github.com/tc39/proposal-arraybuffer-base64/issues/8), and [#11](https://github.com/tc39/proposal-arraybuffer-base64/issues/11).
 

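Reviewer sketch (not part of the diff): the README's description of padding bits and of `strict: true` can be made concrete with a small check for canonical input. This is a minimal illustration under stated assumptions, not the proposal's algorithm: the name `isCanonicalBase64` is hypothetical, it handles only the standard alphabet, and it relies on nothing from the proposed API.

```javascript
// Hypothetical helper, not part of the proposal: reports whether a string is a
// canonical base64 encoding in the sense of RFC 4648 section 3.5.
const STANDARD_ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

function isCanonicalBase64(input) {
  // Canonical input is fully padded, so its length is a multiple of 4.
  if (input.length % 4 !== 0) return false;
  // Only alphabet characters followed by at most two `=`; no whitespace.
  const match = /^([A-Za-z0-9+/]*)(={0,2})$/.exec(input);
  if (match === null) return false;
  const [, body, padding] = match;
  if (padding.length === 0) return true;
  // One `=`: 18 bits carry 16 bits of data, leaving 2 unused bits.
  // Two `=`: 12 bits carry 8 bits of data, leaving 4 unused bits.
  const unusedBits = padding.length * 2;
  const lastValue = STANDARD_ALPHABET.indexOf(body[body.length - 1]);
  // The unused low bits of the final character must all be zero.
  return (lastValue & ((1 << unusedBits) - 1)) === 0;
}
```

For example, `isCanonicalBase64('AQ==')` is true, while `'AR=='` (same byte, but non-zero padding bits), `'AQ'` (missing padding), and `'AQ ==\n'` (whitespace) are all rejected.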
diff --git a/base64.md b/base64.md
@@ -0,0 +1,88 @@
+# Notes on Base64 as it exists
+
+Towards an implementation in JavaScript.
+
+## The RFCs
+
+There are two RFCs which are still generally relevant in modern times: [4648](https://datatracker.ietf.org/doc/html/rfc4648), which defines the base64 and base64url encodings (alongside base32 and base16), and [2045](https://datatracker.ietf.org/doc/html/rfc2045#section-6.8), which defines [MIME](https://en.wikipedia.org/wiki/MIME) and includes a base64 encoding.
+
+RFC 4648 is "the base64 RFC". It obsoletes RFC [3548](https://datatracker.ietf.org/doc/html/rfc3548).
+
+- It defines both the standard (`+/`) and url-safe (`-_`) alphabets.
+- "Implementations MUST include appropriate pad characters at the end of encoded data unless the specification referring to this document explicitly states otherwise." Certain malformed padding MAY be ignored.
+- "Decoders MAY chose to reject an encoding if the pad bits have not been set to zero"
+- "Implementations MUST reject the encoded data if it contains characters outside the base alphabet when interpreting base-encoded data, unless the specification referring to this document explicitly states otherwise."
+
+RFC 2045 is not usually relevant, but it's worth summarizing its behavior anyway:
+
+- Only the standard (`+/`) alphabet is supported.
+- It defines only an encoding. The encoding is specified to include `=`. No direction is given for decoders which encounter data which is not padded with `=`, or which has non-zero padding bits. In practice, decoders seem to ignore both.
+- "Any characters outside of the base64 alphabet are to be ignored in base64-encoded data."
+- MIME requires lines of length at most 76 characters, separated by CRLF.
+
+RFCs [1421](https://datatracker.ietf.org/doc/html/rfc1421) and [7468](https://datatracker.ietf.org/doc/html/rfc7468), which define "Privacy-Enhanced Mail" and related things (think `-----BEGIN PRIVATE KEY-----`), are basically identical to the above except that they mandate lines of exactly 64 characters, with only the last line allowed to be shorter.
+
+RFC [4880](https://datatracker.ietf.org/doc/html/rfc4880#section-6) defines OpenPGP messages and is just the RFC 2045 format plus a checksum. In practice, only whitespace is ignored, not all non-base64 characters.
+
+No other variations are contemplated in any other RFC or implementation that I'm aware of. That is, we have the following ways that base64 encoders can vary:
+
+- Standard or URL-safe alphabet
+- Whether `=` is included in output
+- Whether to add linebreaks after a certain number of characters
+
+and the following ways that base64 decoders can vary:
+
+- Standard or URL-safe alphabet
+- Whether `=` is required in input, and how to handle malformed padding (e.g. extra `=`)
+- Whether to fail on non-zero padding bits
+- Whether lines must be of a limited length
+- How non-base64-alphabet characters are handled (sometimes with special handling for only a subset, like whitespace)
+
+## Programming languages
+
+Note that neither C++ nor Rust has built-in base64 support. In C++ the Boost library is quite common in large projects and parts sometimes get pulled into the standard library, and in Rust the [base64 crate](https://docs.rs/base64/latest/base64/) is the clear choice of everyone, so I'm mentioning those as well.
+
+"✅ / ⚙️" means the default is yes but it's configurable. A bare "⚙️" means it's configurable and there is no default.
+
+|                     | supports urlsafe | `=`s in output | whitespace in output | can omit `=`s in input | can have non-zero padding bits | can have arbitrary characters in input | can have whitespace in input |
+| ------------------- | ---------------- | -------------- | -------------------- | ---------------------- | ------------------------------ | -------------------------------------- | ---------------------------- |
+| C++ (Boost)         | ❌               | ❌             | ❌                   | ?[^cpp]                | ?                              | ❌                                     | ❌                           |
+| Ruby                | ✅               | ✅ / ⚙️[^ruby] | ✅ / ⚙️[^ruby2]      | ✅ / ⚙️                | ✅ / ⚙️                        | ❌                                     | ✅ / ⚙️                      |
+| Python              | ✅               | ✅             | ❌                   | ❌                     | ✅                             | ✅ / ⚙️                                | ✅ / ⚙️                      |
+| Rust (base64 crate) | ✅               | ⚙️             | ❌                   | ⚙️                     | ⚙️                             | ❌                                     | ❌                           |
+| Java                | ✅               | ✅ / ⚙️        | ❌ / ⚙️[^java]       | ✅                     | ✅                             | ❌                                     | ❌ / ⚙️                      |
+| Go                  | ✅               | ✅             | ❌                   | ✅ / ⚙️                | ✅ / ⚙️                        | ❌                                     | ✅[^go]                      |
+| C#                  | ❌               | ✅             | ❌                   | ❌                     | ✅                             | ❌                                     | ✅                           |
+| PHP                 | ❌               | ✅             | ❌                   | ✅                     | ✅                             | ✅ / ⚙️                                | ✅ / ⚙️                      |
+| Swift               | ❌               | ✅             | ❌ / ⚙️              | ❌                     | ✅                             | ❌ / ⚙️                                | ❌ / ⚙️                      |
+
+[^cpp]: Boost adds extra null bytes to the output when padding is present, and treats non-zero padding bits as meaningful (i.e. it produces more output when they are present)
+[^ruby]: Ruby only allows configuring padding with the urlsafe alphabet
+[^ruby2]: Ruby adds linebreaks every 60 characters
+[^java]: Java allows MIME-format output, with `\r\n` sequences after every 76 characters of output
+[^go]: Go only allows linebreaks specifically
+
+## JS libraries
+
+Only including libraries with at least a million downloads per week and at least 100 distinct dependents.
+
+|                             | supports urlsafe  | `=`s in output | whitespace in output | can omit `=`s in input | can have non-zero padding bits | can have arbitrary characters in input | can have whitespace in input |
+| --------------------------- | ----------------- | -------------- | -------------------- | ---------------------- | ------------------------------ | -------------------------------------- | ---------------------------- |
+| `atob`/`btoa`               | ❌                | ✅             | ❌                   | ✅                     | ✅                             | ❌                                     | ✅                           |
+| Node's Buffer               | ✅[^node]         | ✅             | ❌                   | ✅                     | ✅                             | ✅                                     | ✅                           |
+| base64-js (38m/wk)          | ✅ (for decoding) | ✅             | ❌                   | ❌                     | ✅                             | ❌[^base64-js]                         | ❌                           |
+| @smithy/util-base64 (8m/wk) | ❌                | ✅             | ❌                   | ❌                     | ✅                             | ❌                                     | ❌                           |
+| crypto-js (6m/wk)           | ✅                | ✅             | ❌                   | ✅                     | ✅                             | ❌[^crypto-js]                         | ❌                           |
+| js-base64 (5m/wk)           | ✅                | ✅ / ⚙️        | ❌                   | ✅                     | ✅                             | ✅                                     | ✅                           |
+| base64-arraybuffer (4m/wk)  | ❌                | ✅             | ❌                   | ✅                     | ✅                             | ❌[^base64-arraybuffer]                | ❌                           |
+| base64url (2m/wk)           | ✅                | ❌ / ⚙️        | ❌                   | ✅                     | ✅                             | ✅                                     | ✅                           |
+| base-64 (2m/wk)             | ❌                | ✅             | ❌                   | ✅                     | ✅                             | ✅                                     | ✅                           |
+
+[^node]: Node allows mixing alphabets within the same string in input
+[^base64-js]: Illegal characters are interpreted as `A`
+[^crypto-js]: Illegal characters are interpreted as `A`
+[^base64-arraybuffer]: Illegal characters are interpreted as `A`
+
+## "Whitespace"
+
+In all of the above, "whitespace" means only _ASCII_ whitespace. I don't think anyone has special handling for non-ASCII Unicode whitespace.
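Reviewer sketch (not part of the diff): several of the decoder variations tabulated above (alphabet, optional `=`, ASCII whitespace) can be bridged by normalizing input before handing it to a stricter decoder. The helper below is a hypothetical illustration, not taken from any of the listed libraries. It is deliberately lenient: it accepts mixed alphabets and any run of trailing `=`, which a careful decoder might prefer to reject.

```javascript
// Hypothetical helper: rewrites URL-safe, unpadded, or whitespace-laden input
// into the standard alphabet with `=` padding.
function toStandardPadded(input) {
  // Drop ASCII whitespace only (tab, LF, FF, CR, space).
  const stripped = input.replace(/[\t\n\f\r ]/g, '');
  // Map the URL-safe characters onto their standard counterparts.
  const swapped = stripped.replace(/-/g, '+').replace(/_/g, '/');
  // Discard whatever padding is present; it is recomputed below.
  const unpadded = swapped.replace(/=+$/, '');
  // A single leftover character holds only 6 bits, which is less than a byte.
  if (unpadded.length % 4 === 1) {
    throw new SyntaxError('base64 input cannot have length 4n + 1');
  }
  return unpadded + '='.repeat((4 - (unpadded.length % 4)) % 4);
}
```

With this, `toStandardPadded('-_8')` yields `'+/8='` and `toStandardPadded('aG k=\n')` yields `'aGk='`. The function does not validate the remaining characters, so any other illegal input is passed through for the downstream decoder to handle.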
diff --git a/package-lock.json b/package-lock.json
diff --git a/package.json b/package.json
@@ -3,14 +3,14 @@
   "name": "proposal-arraybuffer-base64",
   "scripts": {
     "build-playground": "mkdir -p dist && cp playground/* dist && node scripts/static-highlight.js playground/index-raw.html > dist/index.html && rm dist/index-raw.html",
-    "build-spec": "mkdir -p dist/spec && ecmarkup --lint-spec --strict --load-biblio @tc39/ecma262-biblio --verbose --js-out dist/spec/ecmarkup.js --css-out dist/spec/ecmarkup.css spec.html dist/spec/index.html",
+    "build-spec": "mkdir -p dist/spec && ecmarkup --lint-spec --strict --load-biblio @tc39/ecma262-biblio --verbose spec.html --assets-dir dist/spec dist/spec/index.html",
     "build": "npm run build-playground && npm run build-spec",
     "format": "emu-format --write spec.html",
     "check-format": "emu-format --check spec.html"
   },
   "dependencies": {
-    "@tc39/ecma262-biblio": "2.1.2553",
-    "ecmarkup": "^17.0.0",
+    "@tc39/ecma262-biblio": "2.1.2653",
+    "ecmarkup": "^18.0.0",
     "jsdom": "^21.1.1",
     "prismjs": "^1.29.0"
   }