Support for writing to an existing buffer #26

bakkot · 2023-06-26T21:03:43Z

Fixes #21 (at least for base64).

This adds into, outputOffset, and inputOffset parameters, as well as read and written return values, to the fromPartialBase64 static method.

It also tweaks the extra parameter so we can keep track of individual extra bits, not just characters; thanks @jridgewell for the suggestion. Doing so allows us to completely fill passed buffers, even when they have lengths which are not a multiple of 3, which is a useful property. In addition, this lets us get rid of the more parameter for this method.

This only solves the problem for base64-string->Uint8Array. The other direction doesn't need it, since we don't have StringBuilder or any equivalent. But we probably do want to solve the problem for hex->Uint8Array, which will mean adding a fromPartialHex static method.

Also the names are bad but see #25 for that.

Here's the two complete examples from the playground in this PR:

Single input:

let input = 'SGVsbG8gV29ybGQ=';
let buffer = new Uint8Array(4);
let outputOffset = 0;
let inputOffset = 0;
let extra, written, read;

while (inputOffset < input.length) {
  0, { extra, written, read } = Uint8Array.fromPartialBase64(input, {
    extra,
    into: buffer,
    inputOffset,
    outputOffset,
  });

  inputOffset += read;
  outputOffset += written;

  if (outputOffset === buffer.length) {
    // output is full; consume it
    console.log([...buffer]);
    outputOffset = 0;
  }
}
if (outputOffset > 0) {
  console.log([...buffer].slice(0, outputOffset));
}

Chunked input:

let chunks = ['VGhpcyB', 'pcyBzb2', '1lIGV4YW1w', 'bGUgZGF0YS4='];
let output = new Uint8Array(4);
let outputOffset = 0;
let extra;
for (let chunk of chunks) {
  let written, read;
  let inputOffset = 0;

  while (inputOffset < chunk.length) {
    0, { extra, written, read } = Uint8Array.fromPartialBase64(chunk, {
      extra,
      into: output,
      inputOffset,
      outputOffset,
    });

    inputOffset += read;
    outputOffset += written;

    if (outputOffset === output.length) {
      // output is full; consume it
      console.log([...output]);
      outputOffset = 0;
    }
  }
}
if (outputOffset > 0) {
  console.log([...output].slice(0, outputOffset));
}

domenic · 2023-06-27T04:49:23Z

On the web we've tried to encourage people to use array buffer views when they want a (buffer, offset, length) tuple. This has mostly been successful, e.g. in the Encoding and Streams APIs. Some people argued that GPU APIs were so performance-critical that they required positional arguments to avoid the garbage (i.e. someMethod(buffer, offset, length)).

What is proposed here seems less good. I'd encourage you to align with TextEncoder's encodeInto as much as possible, or ReadableStreamBYOBReader's read() if you want to support non-Uint8Arrays.

bakkot · 2023-06-27T05:01:34Z

@domenic Do you mean having just into: Uint8Array instead of both into: Uint8Array, outputOffset: number? That's what I did originally, but a couple people in the linked thread suggested having an explicit offset parameter (here and the following comment). But I'd be happy to simplify in that way if implementations are on board.

It does make using it to decode long strings a little bit more awkward, mostly because you have to use subarray and do a calculation to get the final chunk. Concretely, that first example would become

let input = 'SGVsbG8gV29ybGQ=';
let buffer = new Uint8Array(4);
let inputOffset = 0;
let currentView = buffer;
let extra, written, read;

while (inputOffset < input.length) {
  0, { extra, written, read } = Uint8Array.fromPartialBase64(input, {
    extra,
    into: buffer,
    inputOffset,
  });

  inputOffset += read;
  currentView = currentView.subarray(written);
  console.log(written, currentView.length)

  if (currentView.length === 0) {
    // output is full; consume it
    console.log([...buffer]);
    currentView = buffer;
  }
}
if (currentView.length < buffer.length) {
  console.log([...buffer.subarray(0, buffer.length - currentView.length)]);
}

bakkot · 2023-06-27T05:04:58Z

I'd also be happy to get rid of inputOffset if implementations aren't worried about the cost of users doing str.slice(read) every iteration (which I would expect to be relatively cheap).

jridgewell · 2023-06-27T05:35:08Z

I think @domenic's suggestion is the into: Uint8Array and outputOffset: number (not inputOffset) being tied into a single into: Uint8Array:

while (inputOffset < input.length) {
  0, { extra, written, read } = Uint8Array.fromPartialBase64(input, {
    extra,
-   into: buffer,
-   outputOffset,
+   into: buffer.subarray(outputOffset),
    inputOffset,
  });

  inputOffset += read;
  outputOffset += written;

  if (outputOffset === buffer.length) {
    // output is full; consume it
    console.log([...buffer]);
    outputOffset = 0;
  }
}

Which is not really better, IMHO. Allocating a throwaway buffer view just to pass it to a method that's going to destruct it back into the real [buffer, offset, length] just seems unnecessary when I already have to keep track of all everything anyways (can't get rid of buffer because we don't want to reallocate another large buffer, outputOffset remains so we can subarray the buffer. Even the view = view.subarray(written) is just another variable instead tracking outputOffset.

jridgewell

I don't think we can remove more, because there's no guarantee that the stream ended in a consistent state. There could still unflushed bits left in extra, and the sample code's just ignoring them.

bakkot · 2023-06-27T05:42:32Z

Unflushed bits in extra at the end of the stream are analagous (or identical) to extra non-zero bits in the final characters in a base64 string, and most implementations will silently ignore those - e.g. atob('YQ==') === atob('YW=='). I generally like to be pedantic about padding, but it would be a break from the usual treatment, and the user can throw if extra?.bits is nonzero after decoding, if they want to. I don't think it's worth complicating the API (and adding another call) just to enforce padding here.

domenic · 2023-06-27T08:39:54Z

I think I am suggesting the following:

const { extra, written, read } = Uint8Array.fromPartialBase64Into(input.substring(inputOffset), new Uint8Array(buffer, outputOffset), { extra });

Although this would be even better if you had a proper class to hold your stateful extra bit:

const decoder = new Base64Decoder();

const { written, read } = decoder.decodeInto(input.substring(inputOffset), new Uint8Array(buffer, outputOffset));

Then it would be fully congruent with existing APIs.

The intuition here is to use the data structures we have already in JS (substrings and Uint8Arrays) to represent spans (over strings and buffers, respectively) instead of packaging up everything into a single options object.

Additionally I think it's a good idea if the into variant is a separate method, since its behavior is so different.

bakkot · 2023-06-27T15:39:01Z

The class-based version is very painful, so I'm not going to go with that unless there's absolutely no other way forward. Other changes I'm neutral-to-positive on, but we'll see what the rest of the committee thinks.

jridgewell

Unflushed bits in extra at the end of the stream are analagous (or identical) to extra non-zero bits in the final characters in a base64 string, and most implementations will silently ignore those - e.g. atob('YQ==') === atob('YW==').

Ahh, I thought padding always referred to the = chars and didn't realize they apply to partial bits as well.

Although this would be even better if you had a proper class to hold your stateful extra bit:

I would also prefer a stateful class, but I've written encoder/decoders both ways and either is acceptable.

The intuition here is to use the data structures we have already in JS (substrings and Uint8Arrays) to represent spans (over strings and buffers, respectively) instead of packaging up everything into a single options object.

I think the beauty of the current approach is that you could do either and both would work. inputOffset and outputOffset both default to zero, so applying a view to your input/buffer and omitting the two offsets gets you exactly the API you want. And if you don't want to allocate throwaway objects, you can pass the offsets.

domenic · 2023-06-28T01:32:37Z

I think the concern over throwaway objects is misplaced; this kind of operation will generally not be performed in a hot loop, and even in hot loops we have not seen good evidence that such objects cause slowdowns due to modern generational GCs.

Jarred-Sumner · 2023-07-03T04:54:57Z

I think the concern over throwaway objects is misplaced; this kind of operation will generally not be performed in a hot loop, and even in hot loops we have not seen good evidence that such objects cause slowdowns due to modern generational GCs.

In JSC/Safari's implementation, accessing the underlying ArrayBuffer of a TypedArray for the first time internally clones the data into a different heap and sets the internal type to be a WastefulTypedArray. I haven't read the equivalent code for V8.

That means .subarray is expensive for typed arrays created without initially passing an ArrayBuffer instance in the constructor (at least in JSC)

You can see the implementation of slowDownAndWasteMemory and the buffer getter

bakkot · 2023-11-01T15:43:29Z

Closing as not planned, since we're probably not doing streaming at all (#27) and just keeping the simple API.

bakkot · 2023-12-13T05:53:27Z

Revived in #33. V8 continues to be concerned with throwaway objects, and per Jarred's comment it sounds like it matters to JSC as well, so I've included an outputOffset parameter. Please continue discussion on the linked thread.

bakkot added 4 commits June 25, 2023 23:05

wip rewrite decoding logic

93db57c

wip argument validation

b11ec58

update playground

ca7722c

address todos

d5f90d0

bakkot mentioned this pull request Jun 26, 2023

encoding into an existing buffer #21

Closed

jridgewell suggested changes Jun 27, 2023

View reviewed changes

jridgewell approved these changes Jun 27, 2023

View reviewed changes

bakkot closed this Nov 1, 2023

ljharb deleted the write-to-existing branch November 1, 2023 16:43

bakkot mentioned this pull request Dec 15, 2023

allow writing to an existing Uint8Array, and add lastChunkHandling option #33

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for writing to an existing buffer #26

Support for writing to an existing buffer #26

bakkot commented Jun 26, 2023 •

edited

Loading

domenic commented Jun 27, 2023

bakkot commented Jun 27, 2023 •

edited

Loading

bakkot commented Jun 27, 2023

jridgewell commented Jun 27, 2023 •

edited

Loading

jridgewell left a comment

bakkot commented Jun 27, 2023

domenic commented Jun 27, 2023

bakkot commented Jun 27, 2023 •

edited

Loading

jridgewell left a comment

domenic commented Jun 28, 2023

Jarred-Sumner commented Jul 3, 2023 •

edited

Loading

bakkot commented Nov 1, 2023

bakkot commented Dec 13, 2023

Support for writing to an existing buffer #26

Support for writing to an existing buffer #26

Conversation

bakkot commented Jun 26, 2023 • edited Loading

domenic commented Jun 27, 2023

bakkot commented Jun 27, 2023 • edited Loading

bakkot commented Jun 27, 2023

jridgewell commented Jun 27, 2023 • edited Loading

jridgewell left a comment

Choose a reason for hiding this comment

bakkot commented Jun 27, 2023

domenic commented Jun 27, 2023

bakkot commented Jun 27, 2023 • edited Loading

jridgewell left a comment

Choose a reason for hiding this comment

domenic commented Jun 28, 2023

Jarred-Sumner commented Jul 3, 2023 • edited Loading

bakkot commented Nov 1, 2023

bakkot commented Dec 13, 2023

bakkot commented Jun 26, 2023 •

edited

Loading

bakkot commented Jun 27, 2023 •

edited

Loading

jridgewell commented Jun 27, 2023 •

edited

Loading

bakkot commented Jun 27, 2023 •

edited

Loading

Jarred-Sumner commented Jul 3, 2023 •

edited

Loading