Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for writing to an existing buffer #26

Closed
wants to merge 4 commits into from
Closed

Conversation

bakkot
Copy link
Collaborator

@bakkot bakkot commented Jun 26, 2023

Fixes #21 (at least for base64).

This adds into, outputOffset, and inputOffset parameters, as well as read and written return values, to the fromPartialBase64 static method.

It also tweaks the extra parameter so we can keep track of individual extra bits, not just characters; thanks @jridgewell for the suggestion. Doing so allows us to completely fill passed buffers, even when they have lengths which are not a multiple of 3, which is a useful property. In addition, this lets us get rid of the more parameter for this method.

This only solves the problem for base64-string->Uint8Array. The other direction doesn't need it, since we don't have StringBuilder or any equivalent. But we probably do want to solve the problem for hex->Uint8Array, which will mean adding a fromPartialHex static method.

Also the names are bad but see #25 for that.

Here's the two complete examples from the playground in this PR:

Single input:

let input = 'SGVsbG8gV29ybGQ=';
let buffer = new Uint8Array(4);
let outputOffset = 0;
let inputOffset = 0;
let extra, written, read;

while (inputOffset < input.length) {
  0, { extra, written, read } = Uint8Array.fromPartialBase64(input, {
    extra,
    into: buffer,
    inputOffset,
    outputOffset,
  });

  inputOffset += read;
  outputOffset += written;

  if (outputOffset === buffer.length) {
    // output is full; consume it
    console.log([...buffer]);
    outputOffset = 0;
  }
}
if (outputOffset > 0) {
  console.log([...buffer].slice(0, outputOffset));
}

Chunked input:

let chunks = ['VGhpcyB', 'pcyBzb2', '1lIGV4YW1w', 'bGUgZGF0YS4='];
let output = new Uint8Array(4);
let outputOffset = 0;
let extra;
for (let chunk of chunks) {
  let written, read;
  let inputOffset = 0;

  while (inputOffset < chunk.length) {
    0, { extra, written, read } = Uint8Array.fromPartialBase64(chunk, {
      extra,
      into: output,
      inputOffset,
      outputOffset,
    });

    inputOffset += read;
    outputOffset += written;

    if (outputOffset === output.length) {
      // output is full; consume it
      console.log([...output]);
      outputOffset = 0;
    }
  }
}
if (outputOffset > 0) {
  console.log([...output].slice(0, outputOffset));
}

@domenic
Copy link
Member

domenic commented Jun 27, 2023

On the web we've tried to encourage people to use array buffer views when they want a (buffer, offset, length) tuple. This has mostly been successful, e.g. in the Encoding and Streams APIs. Some people argued that GPU APIs were so performance-critical that they required positional arguments to avoid the garbage (i.e. someMethod(buffer, offset, length)).

What is proposed here seems less good. I'd encourage you to align with TextEncoder's encodeInto as much as possible, or ReadableStreamBYOBReader's read() if you want to support non-Uint8Arrays.

@bakkot
Copy link
Collaborator Author

bakkot commented Jun 27, 2023

@domenic Do you mean having just into: Uint8Array instead of both into: Uint8Array, outputOffset: number? That's what I did originally, but a couple people in the linked thread suggested having an explicit offset parameter (here and the following comment). But I'd be happy to simplify in that way if implementations are on board.

It does make using it to decode long strings a little bit more awkward, mostly because you have to use subarray and do a calculation to get the final chunk. Concretely, that first example would become

let input = 'SGVsbG8gV29ybGQ=';
let buffer = new Uint8Array(4);
let inputOffset = 0;
let currentView = buffer;
let extra, written, read;

while (inputOffset < input.length) {
  0, { extra, written, read } = Uint8Array.fromPartialBase64(input, {
    extra,
    into: buffer,
    inputOffset,
  });

  inputOffset += read;
  currentView = currentView.subarray(written);
  console.log(written, currentView.length)

  if (currentView.length === 0) {
    // output is full; consume it
    console.log([...buffer]);
    currentView = buffer;
  }
}
if (currentView.length < buffer.length) {
  console.log([...buffer.subarray(0, buffer.length - currentView.length)]);
}

@bakkot
Copy link
Collaborator Author

bakkot commented Jun 27, 2023

I'd also be happy to get rid of inputOffset if implementations aren't worried about the cost of users doing str.slice(read) every iteration (which I would expect to be relatively cheap).

@jridgewell
Copy link
Member

jridgewell commented Jun 27, 2023

I think @domenic's suggestion is the into: Uint8Array and outputOffset: number (not inputOffset) being tied into a single into: Uint8Array:

while (inputOffset < input.length) {
  0, { extra, written, read } = Uint8Array.fromPartialBase64(input, {
    extra,
-   into: buffer,
-   outputOffset,
+   into: buffer.subarray(outputOffset),
    inputOffset,
  });

  inputOffset += read;
  outputOffset += written;

  if (outputOffset === buffer.length) {
    // output is full; consume it
    console.log([...buffer]);
    outputOffset = 0;
  }
}

Which is not really better, IMHO. Allocating a throwaway buffer view just to pass it to a method that's going to destruct it back into the real [buffer, offset, length] just seems unnecessary when I already have to keep track of all everything anyways (can't get rid of buffer because we don't want to reallocate another large buffer, outputOffset remains so we can subarray the buffer. Even the view = view.subarray(written) is just another variable instead tracking outputOffset.

Copy link
Member

@jridgewell jridgewell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can remove more, because there's no guarantee that the stream ended in a consistent state. There could still unflushed bits left in extra, and the sample code's just ignoring them.

@bakkot
Copy link
Collaborator Author

bakkot commented Jun 27, 2023

Unflushed bits in extra at the end of the stream are analagous (or identical) to extra non-zero bits in the final characters in a base64 string, and most implementations will silently ignore those - e.g. atob('YQ==') === atob('YW=='). I generally like to be pedantic about padding, but it would be a break from the usual treatment, and the user can throw if extra?.bits is nonzero after decoding, if they want to. I don't think it's worth complicating the API (and adding another call) just to enforce padding here.

@domenic
Copy link
Member

domenic commented Jun 27, 2023

I think I am suggesting the following:

const { extra, written, read } = Uint8Array.fromPartialBase64Into(input.substring(inputOffset), new Uint8Array(buffer, outputOffset), { extra });

Although this would be even better if you had a proper class to hold your stateful extra bit:

const decoder = new Base64Decoder();

const { written, read } = decoder.decodeInto(input.substring(inputOffset), new Uint8Array(buffer, outputOffset));

Then it would be fully congruent with existing APIs.

The intuition here is to use the data structures we have already in JS (substrings and Uint8Arrays) to represent spans (over strings and buffers, respectively) instead of packaging up everything into a single options object.

Additionally I think it's a good idea if the into variant is a separate method, since its behavior is so different.

@bakkot
Copy link
Collaborator Author

bakkot commented Jun 27, 2023

The class-based version is very painful, so I'm not going to go with that unless there's absolutely no other way forward. Other changes I'm neutral-to-positive on, but we'll see what the rest of the committee thinks.

Copy link
Member

@jridgewell jridgewell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unflushed bits in extra at the end of the stream are analagous (or identical) to extra non-zero bits in the final characters in a base64 string, and most implementations will silently ignore those - e.g. atob('YQ==') === atob('YW==').

Ahh, I thought padding always referred to the = chars and didn't realize they apply to partial bits as well.

Although this would be even better if you had a proper class to hold your stateful extra bit:

I would also prefer a stateful class, but I've written encoder/decoders both ways and either is acceptable.

The intuition here is to use the data structures we have already in JS (substrings and Uint8Arrays) to represent spans (over strings and buffers, respectively) instead of packaging up everything into a single options object.

I think the beauty of the current approach is that you could do either and both would work. inputOffset and outputOffset both default to zero, so applying a view to your input/buffer and omitting the two offsets gets you exactly the API you want. And if you don't want to allocate throwaway objects, you can pass the offsets.

@domenic
Copy link
Member

domenic commented Jun 28, 2023

I think the concern over throwaway objects is misplaced; this kind of operation will generally not be performed in a hot loop, and even in hot loops we have not seen good evidence that such objects cause slowdowns due to modern generational GCs.

@Jarred-Sumner
Copy link

Jarred-Sumner commented Jul 3, 2023

I think the concern over throwaway objects is misplaced; this kind of operation will generally not be performed in a hot loop, and even in hot loops we have not seen good evidence that such objects cause slowdowns due to modern generational GCs.

In JSC/Safari's implementation, accessing the underlying ArrayBuffer of a TypedArray for the first time internally clones the data into a different heap and sets the internal type to be a WastefulTypedArray. I haven't read the equivalent code for V8.

That means .subarray is expensive for typed arrays created without initially passing an ArrayBuffer instance in the constructor (at least in JSC)

You can see the implementation of slowDownAndWasteMemory and the buffer getter

@bakkot
Copy link
Collaborator Author

bakkot commented Nov 1, 2023

Closing as not planned, since we're probably not doing streaming at all (#27) and just keeping the simple API.

@bakkot bakkot closed this Nov 1, 2023
@ljharb ljharb deleted the write-to-existing branch November 1, 2023 16:43
@bakkot
Copy link
Collaborator Author

bakkot commented Dec 13, 2023

Revived in #33. V8 continues to be concerned with throwaway objects, and per Jarred's comment it sounds like it matters to JSC as well, so I've included an outputOffset parameter. Please continue discussion on the linked thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

encoding into an existing buffer
4 participants