Remove streaming API, fill in spec text, etc #27

Merged: 8 commits, Nov 1, 2023
62 changes: 50 additions & 12 deletions README.md
@@ -6,7 +6,7 @@ It is currently at Stage 2 of [the TC39 process](https://tc39.es/process-documen

Try it out on [the playground](https://tc39.github.io/proposal-arraybuffer-base64/).

Initial spec text is available [here](https://tc39.github.io/proposal-arraybuffer-base64/spec/).
Spec text is available [here](https://tc39.github.io/proposal-arraybuffer-base64/spec/).

## Basic API

@@ -32,27 +32,65 @@ This would add `Uint8Array.prototype.toBase64`/`Uint8Array.prototype.toHex` and

## Options

An options bag argument for the base64 methods could allow specifying additional details such as the alphabet (to include at least `base64` and `base64url`), whether to generate / enforce padding, and how to handle whitespace.
An options bag argument for the base64 methods allows specifying the alphabet as either `base64` or `base64url`.
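
As a rough illustration of what the alphabet choice changes: the two alphabets differ only in the characters used for values 62 and 63 (`+/` versus `-_`), so an already-encoded string can be translated between them. The helper names below are hypothetical and are not part of this proposal.

```javascript
// Hypothetical helpers (not part of the proposal) that translate an
// already-encoded string between the two alphabets.

// Standard alphabet -> URL-safe alphabet: '+' becomes '-', '/' becomes '_'.
function toBase64Url(standard) {
  return standard.replace(/\+/g, '-').replace(/\//g, '_');
}

// URL-safe alphabet -> standard alphabet: '-' becomes '+', '_' becomes '/'.
function fromBase64Url(urlSafe) {
  return urlSafe.replace(/-/g, '+').replace(/_/g, '/');
}
```

Padding is left untouched by these helpers; only the two differing characters are swapped.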

## Streaming API
When decoding, the options bag also allows specifying `strict: false` (the default) or `strict: true`. When using `strict: false`, whitespace is legal and padding is optional. When using `strict: true`, whitespace is forbidden and standard padding (including any overflow bits in the last character being 0) is enforced; that is, only [canonical](https://datatracker.ietf.org/doc/html/rfc4648#section-3.5) encodings are allowed.

Additional `toPartialBase64` and `fromPartialBase64` methods would allow working with chunks of base64, at the cost of more complexity. See [the playground](https://tc39.github.io/proposal-arraybuffer-base64/) linked above for examples.
## Streaming

Streaming versions of the hex APIs are not included since they are straightforward to do manually.
There is no support for streaming. However, it is [relatively straightforward to do efficiently in userland](./stream.mjs) on top of this API, with support for all the same options as the underlying functions.
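
To give a feel for the userland approach, here is a minimal sketch of chunked encoding. It is not the contents of `stream.mjs`; `encodeBase64` is a stand-in for any one-shot encoder, and `ChunkedEncoder` is a hypothetical name. The idea is to encode only whole 3-byte groups from each chunk and carry the remaining 0 to 2 bytes over to the next call, so padding appears only at the very end.

```javascript
const ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Stand-in for a one-shot encoder (assumption: always emits `=` padding).
function encodeBase64(bytes) {
  let out = '';
  for (let i = 0; i < bytes.length; i += 3) {
    const remaining = bytes.length - i;
    const b0 = bytes[i];
    const b1 = remaining > 1 ? bytes[i + 1] : 0;
    const b2 = remaining > 2 ? bytes[i + 2] : 0;
    out += ALPHABET[b0 >> 2];
    out += ALPHABET[((b0 & 0x03) << 4) | (b1 >> 4)];
    out += remaining > 1 ? ALPHABET[((b1 & 0x0f) << 2) | (b2 >> 6)] : '=';
    out += remaining > 2 ? ALPHABET[b2 & 0x3f] : '=';
  }
  return out;
}

// Hypothetical chunked encoder built on top of the one-shot encoder.
class ChunkedEncoder {
  constructor() {
    this.leftover = new Uint8Array(0);
  }

  // Encode as many whole 3-byte groups as possible; keep the rest for later.
  push(chunk) {
    const combined = new Uint8Array(this.leftover.length + chunk.length);
    combined.set(this.leftover, 0);
    combined.set(chunk, this.leftover.length);
    const usable = combined.length - (combined.length % 3);
    this.leftover = combined.slice(usable);
    return encodeBase64(combined.subarray(0, usable));
  }

  // Flush the final 0 to 2 bytes; this is the only place padding can appear.
  finish() {
    const out = encodeBase64(this.leftover);
    this.leftover = new Uint8Array(0);
    return out;
  }
}
```

Concatenating the strings returned by each `push` call and the final `finish` call gives the same result as encoding all of the bytes at once.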

See [issue #13](https://github.com/tc39/proposal-arraybuffer-base64/issues/13) for discussion.
## FAQ

## Questions
### What variation exists among base64 implementations in standards, in other languages, and in existing JavaScript libraries?

### Should these be asynchronous?
I have a [whole page on that](./base64.md), with tables and footnotes and everything. There is relatively little room for variation, but languages and libraries manage to explore almost all of the room there is.

In practice most base64'd data I encounter is on the order of hundreds of bytes (e.g. SSH keys), which can be encoded and decoded extremely quickly. It would be a shame to require Promises to deal with such data, I think, especially given that the alternatives people currently use all appear to be synchronous.
To summarize, base64 encoders can vary in the following ways:

- Standard or URL-safe alphabet
- Whether `=` is included in output
- Whether to add linebreaks after a certain number of characters

and decoders can vary in the following ways:

- Standard or URL-safe alphabet
- Whether `=` is required in input, and how to handle malformed padding (e.g. extra `=`)
- Whether to fail on non-zero padding bits
- Whether lines must be of a limited length
- How non-base64-alphabet characters are handled (sometimes with special handling for only a subset, like whitespace)

### What alphabets are supported?

For base64, you can specify either base64 or base64url for both the encoder and the decoder.

For hex, both lowercase and uppercase characters (including mixed within the same string) will decode successfully. Output is always lowercase.
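
The hex behavior described above can be approximated in plain JavaScript. This is only a sketch with hypothetical function names; the choice of `SyntaxError` for bad input is an assumption made for the example.

```javascript
// Hypothetical userland approximation of the hex behavior described above.

// Encoding always produces lowercase output, two characters per byte.
function bytesToHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
}

// Decoding accepts lowercase, uppercase, or mixed-case input.
// Anything else (including whitespace or an odd length) is rejected.
function hexToBytes(hex) {
  if (hex.length % 2 !== 0 || !/^[0-9a-fA-F]*$/.test(hex)) {
    throw new SyntaxError('invalid hex string'); // error type is an assumption
  }
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(hex.slice(2 * i, 2 * i + 2), 16);
  }
  return out;
}
```

For example, decoding `DeadBEEF` and re-encoding the result gives `deadbeef`.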

### How is `=` padding handled?

Padding is always generated. The base64 decoder does not require it to be present unless `strict: true` is specified; however, if it is present, it must be well-formed (i.e., once stripped of whitespace the length of the string must be a multiple of 4, and there can be 1 or 2 padding `=` characters).
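
The padding rule above can be sketched as a small check. The function names are hypothetical, and the check covers only whitespace and `=` placement; it does not validate the alphabet or anything else about the input.

```javascript
// Hypothetical helpers illustrating the padding rule; not part of the proposal.

// Remove ASCII whitespace: tab, line feed, form feed, carriage return, space.
function stripAsciiWhitespace(input) {
  return input.replace(/[\t\n\f\r ]/g, '');
}

function paddingIsAcceptable(input, { strict = false } = {}) {
  const stripped = stripAsciiWhitespace(input);
  // In strict mode whitespace is forbidden outright.
  if (strict && stripped.length !== input.length) return false;

  const firstPad = stripped.indexOf('=');
  if (firstPad === -1) {
    // No padding present. The lax decoder accepts this; strict mode only
    // accepts it when no padding was needed (a whole number of 4-char groups).
    return strict ? stripped.length % 4 === 0 : true;
  }

  // Padding present: it must be exactly 1 or 2 trailing `=` characters,
  // and the whitespace-free length must be a multiple of 4.
  const padding = stripped.slice(firstPad);
  return stripped.length % 4 === 0 && (padding === '=' || padding === '==');
}
```

Under this sketch `AQ==` and `AQ` both pass in the default mode, while `AQ=`, `AQ===`, and `A=Q=` fail in either mode.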

Possibly we should have asynchronous versions for working with large data. That is not currently included. For the moment you can use the streaming API to chunk the work.
### How are the extra padding bits handled?

If the length of your input data isn't exactly a multiple of 3 bytes, then encoding it will use either 2 or 3 base64 characters to encode the final 1 or 2 bytes. Since each base64 character is 6 bits, this means you'll be using either 12 or 18 bits to represent 8 or 16 bits, which means you have an extra 4 or 2 bits which don't encode anything.

Per [the RFC](https://datatracker.ietf.org/doc/html/rfc4648#section-3.5), decoders MAY reject input strings where the padding bits are non-zero. Here, non-zero padding bits are silently ignored when `strict: false` (the default), and are an error when `strict: true`.
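
To make the arithmetic concrete, here is a sketch that decodes just the final 2 or 3 characters of a base64 string (with any `=` already removed). `decodeFinalChunk` is a hypothetical helper written for this explanation, and the `SyntaxError` type is an assumption.

```javascript
const ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Hypothetical helper: decode a final chunk of 2 or 3 base64 characters.
function decodeFinalChunk(chars, { strict = false } = {}) {
  const values = Array.from(chars, (c) => ALPHABET.indexOf(c));
  if (values.length < 2 || values.length > 3 || values.includes(-1)) {
    throw new SyntaxError('expected 2 or 3 base64 characters');
  }

  // 2 characters = 12 bits = 1 byte + 4 spare bits.
  // 3 characters = 18 bits = 2 bytes + 2 spare bits.
  const spareBits = values.length === 2 ? 4 : 2;
  const spare = values[values.length - 1] & ((1 << spareBits) - 1);
  if (strict && spare !== 0) {
    throw new SyntaxError('non-zero padding bits');
  }

  // Concatenate the 6-bit values, then drop the spare bits.
  const bits = values.reduce((acc, v) => (acc << 6) | v, 0) >> spareBits;
  return values.length === 2 ? [bits] : [bits >> 8, bits & 0xff];
}
```

With this sketch, `AQ` and `AR` both decode to the single byte `1` in the default mode, because they differ only in the 4 spare bits; with `strict: true`, `AR` is rejected.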

### How is whitespace handled?

The encoders do not output whitespace. The hex decoder does not allow it as input. The base64 decoder allows [ASCII whitespace](https://infra.spec.whatwg.org/#ascii-whitespace) anywhere in the string as long as `strict: true` is not specified.

### How are other characters handled?

The presence of any other characters causes an exception.

### Why are these synchronous?

In practice most base64'd data I encounter is on the order of hundreds of bytes (e.g. SSH keys), which can be encoded and decoded extremely quickly. It would be a shame to require Promises to deal with such data, I think, especially given that the alternatives people currently use all appear to be synchronous.

### What other encodings should be included, if any?
### Why just these encodings?

I think base64 and hex are the only encodings which make sense, and those are currently included.
While other string encodings exist, none are nearly as commonly used as these two.

See issues [#7](https://github.com/tc39/proposal-arraybuffer-base64/issues/7), [#8](https://github.com/tc39/proposal-arraybuffer-base64/issues/8), and [#11](https://github.com/tc39/proposal-arraybuffer-base64/issues/11).

88 changes: 88 additions & 0 deletions base64.md
@@ -0,0 +1,88 @@
# Notes on Base64 as it exists

Towards an implementation in JavaScript.

## The RFCs

There are two RFCs which are still generally relevant in modern times: [4648](https://datatracker.ietf.org/doc/html/rfc4648), which defines the base64 and base64url encodings (along with base32 and base16), and [2045](https://datatracker.ietf.org/doc/html/rfc2045#section-6.8), which defines [MIME](https://en.wikipedia.org/wiki/MIME) and includes a base64 encoding.

RFC 4648 is "the base64 RFC". It obsoletes RFC [3548](https://datatracker.ietf.org/doc/html/rfc3548).

- It defines both the standard (`+/`) and url-safe (`-_`) alphabets.
- "Implementations MUST include appropriate pad characters at the end of encoded data unless the specification referring to this document explicitly states otherwise." Certain malformed padding MAY be ignored.
- "Decoders MAY chose to reject an encoding if the pad bits have not been set to zero"
- "Implementations MUST reject the encoded data if it contains characters outside the base alphabet when interpreting base-encoded data, unless the specification referring to this document explicitly states otherwise."

RFC 2045 is not usually relevant, but it's worth summarizing its behavior anyway:

- Only the standard (`+/`) alphabet is supported.
- It defines only an encoding. The encoding is specified to include `=`. No direction is given for decoders which encounter data which is not padded with `=`, or which has non-zero padding bits. In practice, decoders seem to ignore both.
- "Any characters outside of the base64 alphabet are to be ignored in base64-encoded data."
- MIME requires lines of length at most 76 characters, separated by CRLF.
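
As a small illustration of the MIME line-length rule, the following sketch wraps an already-encoded base64 string into CRLF-separated lines. `wrapMime` is a hypothetical helper, not taken from any library.

```javascript
// Hypothetical helper: split base64 text into lines of at most `lineLength`
// characters, joined with CRLF as MIME requires.
function wrapMime(base64, lineLength = 76) {
  const lines = [];
  for (let i = 0; i < base64.length; i += lineLength) {
    lines.push(base64.slice(i, i + lineLength));
  }
  return lines.join('\r\n');
}
```

The same helper with `lineLength` set to 64 produces the line length used by the PEM-style formats.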

RFCs [1421](https://datatracker.ietf.org/doc/html/rfc1421) and [7468](https://datatracker.ietf.org/doc/html/rfc7468), which define "Privacy-Enhanced Mail" and related things (think `-----BEGIN PRIVATE KEY-----`), are basically identical to the above except that they mandate lines of exactly 64 characters (the last line may be shorter).

RFC [4880](https://datatracker.ietf.org/doc/html/rfc4880#section-6) defines OpenPGP messages and is just the RFC 2045 format plus a checksum. In practice, only whitespace is ignored, not all non-base64 characters.

No other variations are contemplated in any other RFC or implementation that I'm aware of. That is, we have the following ways that base64 encoders can vary:

- Standard or URL-safe alphabet
- Whether `=` is included in output
- Whether to add linebreaks after a certain number of characters

and the following ways that base64 decoders can vary:

- Standard or URL-safe alphabet
- Whether `=` is required in input, and how to handle malformed padding (e.g. extra `=`)
- Whether to fail on non-zero padding bits
- Whether lines must be of a limited length
- How non-base64-alphabet characters are handled (sometimes with special handling for only a subset, like whitespace)

## Programming languages

Note that neither C++ nor Rust has built-in base64 support. In C++ the Boost library is quite common in large projects and parts sometimes get pulled into the standard library, and in Rust the [base64 crate](https://docs.rs/base64/latest/base64/) is the clear choice of everyone, so I'm mentioning those as well.

"✅ / ⚙️" means the default is yes but it's configurable. A bare "⚙️" means it's configurable and there is no default.

| | supports urlsafe | `=`s in output | whitespace in output | can omit `=`s in input | can have non-zero padding bits | can have arbitrary characters in input | can have whitespace in input |
| ------------------- | ---------------- | -------------- | -------------------- | ---------------------- | ------------------------------ | -------------------------------------- | ---------------------------- |
| C++ (Boost) | ❌ | ❌ | ❌ | ?[^cpp] | ? | ❌ | ❌ |
| Ruby | ✅ | ✅ / ⚙️[^ruby] | ✅ / ⚙️[^ruby2] | ✅ / ⚙️ | ✅ / ⚙️ | ❌ | ✅ / ⚙️ |
| Python | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ / ⚙️ | ✅ / ⚙️ |
| Rust (base64 crate) | ✅ | ⚙️ | ❌ | ⚙️ | ⚙️ | ❌ | ❌ |
| Java | ✅ | ✅ / ⚙️ | ❌ / ⚙️[^java] | ✅ | ✅ | ❌ | ❌ / ⚙️ |
| Go | ✅ | ✅ | ❌ | ✅ / ⚙️ | ✅ / ⚙️ | ❌ | ✅[^go] |
| C# | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ |
| PHP | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ / ⚙️ | ✅ / ⚙️ |
| Swift | ❌ | ✅ | ❌ / ⚙️ | ❌ | ✅ | ❌ / ⚙️ | ❌ / ⚙️ |

[^cpp]: Boost adds extra null bytes to the output when padding is present, and treats non-zero padding bits as meaningful (i.e. it produces more output when they are present)
[^ruby]: Ruby only allows configuring padding with the urlsafe alphabet
[^ruby2]: Ruby adds linebreaks every 60 characters
[^java]: Java allows MIME-format output, with `\r\n` sequences after every 76 characters of output
[^go]: Go only allows linebreaks specifically

## JS libraries

Only including libraries with at least a million downloads per week and at least 100 distinct dependents.

| | supports urlsafe | `=`s in output | whitespace in output | can omit `=`s in input | can have non-zero padding bits | can have arbitrary characters in input | can have whitespace in input |
| --------------------------- | ----------------- | -------------- | -------------------- | ---------------------- | ------------------------------ | -------------------------------------- | ---------------------------- |
| `atob`/`btoa` | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ |
| Node's Buffer | ✅[^node] | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| base64-js (38m/wk) | ✅ (for decoding) | ✅ | ❌ | ❌ | ✅ | ❌[^base64-js] | ❌ |
| @smithy/util-base64 (8m/wk) | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
| crypto-js (6m/wk) | ✅ | ✅ | ❌ | ✅ | ✅ | ❌[^crypto-js] | ❌ |
| js-base64 (5m/wk) | ✅ | ✅ / ⚙️ | ❌ | ✅ | ✅ | ✅ | ✅ |
| base64-arraybuffer (4m/wk) | ❌ | ✅ | ❌ | ✅ | ✅ | ❌[^base64-arraybuffer] | ❌ |
| base64url (2m/wk) | ✅ | ❌ / ⚙️ | ❌ | ✅ | ✅ | ✅ | ✅ |
| base-64 (2m/wk) | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |

[^node]: Node allows mixing alphabets within the same string in input
[^base64-js]: Illegal characters are interpreted as `A`
[^crypto-js]: Illegal characters are interpreted as `A`
[^base64-arraybuffer]: Illegal characters are interpreted as `A`

## "Whitespace"

In all of the above, "whitespace" means only _ASCII_ whitespace. I don't think anyone has special handling for non-ASCII Unicode whitespace.
17 changes: 9 additions & 8 deletions package-lock.json

6 changes: 3 additions & 3 deletions package.json
@@ -3,14 +3,14 @@
"name": "proposal-arraybuffer-base64",
"scripts": {
"build-playground": "mkdir -p dist && cp playground/* dist && node scripts/static-highlight.js playground/index-raw.html > dist/index.html && rm dist/index-raw.html",
"build-spec": "mkdir -p dist/spec && ecmarkup --lint-spec --strict --load-biblio @tc39/ecma262-biblio --verbose --js-out dist/spec/ecmarkup.js --css-out dist/spec/ecmarkup.css spec.html dist/spec/index.html",
"build-spec": "mkdir -p dist/spec && ecmarkup --lint-spec --strict --load-biblio @tc39/ecma262-biblio --verbose spec.html --assets-dir dist/spec dist/spec/index.html",
"build": "npm run build-playground && npm run build-spec",
"format": "emu-format --write spec.html",
"check-format": "emu-format --check spec.html"
},
"dependencies": {
"@tc39/ecma262-biblio": "2.1.2553",
"ecmarkup": "^17.0.0",
"@tc39/ecma262-biblio": "2.1.2653",
"ecmarkup": "^18.0.0",
"jsdom": "^21.1.1",
"prismjs": "^1.29.0"
}