Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

doc: mark String decoder as legacy #39301

Open
1 task
jimmywarting opened this issue Jul 7, 2021 · 22 comments
Open
1 task

doc: mark String decoder as legacy #39301

jimmywarting opened this issue Jul 7, 2021 · 22 comments
Labels
doc Issues and PRs related to the documentations. string_decoder Issues and PRs related to the string_decoder subsystem.

Comments

@jimmywarting
Copy link

jimmywarting commented Jul 7, 2021

📗 API Reference Docs Problem

https://nodejs.org/docs/latest-v16.x/api/string_decoder.html

Description

Like with util inherit and querystring can you also mark string_decoder as legacy and say something in terms of:

The String decoder is considered Legacy. While it is still maintained, new code should use the TextDecoder API instead.

Would be best for cross platform coding... Maybe even possible deprecate it?

  • I would like to work on this issue and submit a pull request.
@jimmywarting jimmywarting added the doc Issues and PRs related to the documentations. label Jul 7, 2021
@aduh95
Copy link
Contributor

aduh95 commented Jul 8, 2021

I think TextDecoder encoding support is limited to utf-8 while String decoder has also support for utf-16, but maybe that's fine.

@targos
Copy link
Member

targos commented Jul 8, 2021

I think TextDecoder encoding support is limited to utf-8 while String decoder has also support for utf-16, but maybe that's fine.

TextDecoder supports a lot of different encodings: https://nodejs.org/dist/latest-v16.x/docs/api/util.html#util_encodings_supported_by_default_with_full_icu_data

@aduh95
Copy link
Contributor

aduh95 commented Jul 8, 2021

My bad, I mixed it with TextEncoder! Please disregard my previous comment.

@mojavelinux
Copy link

In a quick test, TextDecoder is an order of magnitude slower to decode a Uint8Array to a String than StringDecoder using Node 16. And I can't find anything that works faster than StringDecoder. Please don't deprecate StringDecoder until the performance is at least equivalent.

@jimmywarting
Copy link
Author

jimmywarting commented Jul 10, 2021

@mojavelinux would you mind sharing a performence test case? ... with test results?

We maybe won't go as far as deprecating it and eventually removing it.
Maybe we will just mark it as legacy and say it's best avoided if you write cross compatible code in the docs.

Have you also considered that TextEncoder takes longer cuz it can solve some problems StringDecoder dose not handle? It maybe solves something better that StringDecoder dose not do, like BOM for instance

Also curios how the browserified version compares to native TextDecoder, this is the main reason why i want to see less use StringDecoder... To stop ppl from exporting node modules into browsers and increasing the bundle size too much and writing better cross compatible code

string_decoder depends on Buffer so depending on https://www.npmjs.com/package/string_decoder in the browser adds a hole bunch of bloatware to your bundle, and the browser Buffer is not as fast as node's Buffer

@jimmywarting
Copy link
Author

jimmywarting commented Jul 10, 2021

const { StringDecoder } = require('string_decoder');

const data = new Uint8Array([239, 187, 191, 49]) // BOM + 1(49)

// fails 
JSON.parse(new StringDecoder('utf8').write(data))
JSON.parse(Buffer.from(data.buffer).toString())

// works
JSON.parse(new TextDecoder().decode(data))
new Response(data).json()

I belive fetch decodes data similar to how TextDecoder dose it, more correctly according to a spec...
related: node-fetch/node-fetch#541

@mojavelinux
Copy link

mojavelinux commented Jul 10, 2021

A pretty basic test can demonstrate the order of magnitude change:

const iterations = 10000000

const stringDecoder = new (require('string_decoder').StringDecoder)()
const textDecoder = new (require('util').TextDecoder)()

const SAMPLE = new Uint8Array([112,97,103,101,45,111,110,101,46,97,100,111,99])

function variationA (arr) {
  return stringDecoder.write(arr)
}

function variationB (arr) {
  return textDecoder.decode(arr)
}

let result
const variation = process.argv[2] === 'A' ? variationA : variationB
console.time(variation.name)
for (let i = 0; i < iterations; i++) {
  result = variation(SAMPLE)
}
console.timeEnd(variation.name)

On my machine, the results are as follows:

node perf.js A
//=> variationA: 940.728ms
node perf.js B
//=> variationB: 8.244s

In Chrome, I get:

variationB: 5491ms

It maybe solves something better that StringDecoder dose not do, like BOM for instance

Perhaps I don't need that behavior ;) I also don't think BOM processing warrants an order of magnitude change difference in execution time.

@jimmywarting
Copy link
Author

jimmywarting commented Jul 10, 2021

In Chrome, I get:
variationB: 5491ms

Missing variationA in browser

@mojavelinux
Copy link

I don't know what the API in the browser is for StringDecoder.

@jimmywarting
Copy link
Author

I don't know what the API in the browser is for StringDecoder.

something like:

npx browserify file.js > out.js

// file.js
x = require('string_decoder')

@mojavelinux
Copy link

I'm fine with removing the public use of StringDecoder. Perhaps under the covers in Node.js, TextDecoder can use the StringDecoder code when the encoding is utf-8 and ignoreBOM is true. I just don't want to have to bear such a tremendous drop in performance as it will affect the performance of my application (which calls this method hundreds of thousands of times).

@mojavelinux
Copy link

mojavelinux commented Jul 10, 2021

npx browserify file.js > out.js

All that's doing is testing the implementation in Node.js in a browser. I wouldn't expect there to be any difference since it's the same code (and the same JavaScript engine). I thought you were comparing to a native implementation in the browser.

@mojavelinux
Copy link

mojavelinux commented Jul 10, 2021

As I suspected, the performance is just as good in the browser for StringDecoder. In Chrome, I get:

variationA: 1389ms

What we see is that the performance of the native TextDecoder in Chrome is better than the TextDecoder in Node.js. But there is still a measurable difference when compared to StringDecoder.

@targos
Copy link
Member

targos commented Jul 10, 2021

A quick run with the profiler shows most of the time is spent in one function:
image

Actually, another run shows something different...

@jimmywarting
Copy link
Author

jimmywarting commented Jul 10, 2021

I modified the test a bit and run the test on some webpages with a realistic larger sample set

const SAMPLE = new TextEncoder().encode(document.body.innerText)

it shows variantA is much slower... (in chrome after being browserified)

const iterations = 1000

const stringDecoder = new (require('string_decoder').StringDecoder)()
const textDecoder = (new TextDecoder())

const SAMPLE = new TextEncoder().encode(document.body.innerText)

function variationA (arr) {
  return stringDecoder.write(arr)
}

function variationB (arr) {
  return textDecoder.decode(arr)
}

let result


console.time(variationA.name)
for (let i = 0; i < iterations; i++) {
  result = variationA(SAMPLE)
}
console.timeEnd(variationA.name)


console.time(variationB.name)
for (let i = 0; i < iterations; i++) {
  result = variationB(SAMPLE)
}
console.timeEnd(variationB.name)

on nodejs.org innerText

variationA: 31.10400390625 ms
variationB: 1.6748046875 ms

@jimmywarting
Copy link
Author

jimmywarting commented Jul 10, 2021

here is my browser bundle: https://pastebin.com/DDMKi3yH (for those who do not want to run browserify and wants to test it out for themself) the test are at the bottom, the fact that it requires ~2400 lines of code in the browser and run much slower is exactly the reason why i want to discourage use of it

EDIT:

here is a cdn solution:

const {StringDecoder} = await import('https://jspm.dev/string_decoder')
const iterations = 1000
const stringDecoder = new StringDecoder()
const textDecoder = new TextDecoder()

const SAMPLE = new TextEncoder().encode(document.body.innerText)

function variationA (arr) {
  return stringDecoder.write(arr)
}

function variationB (arr) {
  return textDecoder.decode(arr)
}

let result


console.time(variationA.name)
for (let i = 0; i < iterations; i++) {
  result = variationA(SAMPLE)
}
console.timeEnd(variationA.name)


console.time(variationB.name)
for (let i = 0; i < iterations; i++) {
  result = variationB(SAMPLE)
}
console.timeEnd(variationB.name)

@mojavelinux
Copy link

I can confirm that TextDecoder does seem to perform better for larger sample sizes. Perhaps there is a problem with the implementation that it performs so poorly for a small sample size. I don't know enough at this point to begin to be able to explain what's going on. These are certainly useful observations.

@targos
Copy link
Member

targos commented Jul 10, 2021

I did other runs (of #39301 (comment)) with the code from master instead of a release and here's what I get:

image

The 60% of "self time" in TextDecoder.decode actually come from the call to the C++ function here:

void ConverterObject::Decode(const FunctionCallbackInfo<Value>& args) {

@jasnell
Copy link
Member

jasnell commented Jul 10, 2021

@mojavelinux:

Please don't deprecate StringDecode...

Marking an API as legacy is not the same as deprecating it. APIs that are marked legacy are unlikely to ever change or be deprecated.

@mojavelinux
Copy link

Thanks for pointing that out. Sorry for creating any confusion by mixing up the terminology.

@targos targos added encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. string_decoder Issues and PRs related to the string_decoder subsystem. and removed encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. labels Aug 9, 2021
@zcbenz
Copy link
Contributor

zcbenz commented Jul 18, 2024

The node:string_decoder has a feature that it holds the bytes until a full unicode character can be printed, which I don't think TextDecoder provides:

When a Buffer instance is written to the StringDecoder instance, an internal buffer is used to ensure that the decoded string does not contain any incomplete multibyte characters. These are held in the buffer until the next call to stringDecoder.write() or until stringDecoder.end() is called.

This feature is very convenient when printing outputs byte by byte, for example when implementing the streaming interface of LLM inference.

@targos
Copy link
Member

targos commented Jul 18, 2024

You can do that with TextDecoder as well:

> var d = new TextDecoder
undefined
> d.decode(Uint8Array.of(0xe2), { stream: true })
''
> d.decode(Uint8Array.of(0x82), { stream: true })
''
> d.decode(Uint8Array.of(0xAC), { stream: true })
'€'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc Issues and PRs related to the documentations. string_decoder Issues and PRs related to the string_decoder subsystem.
Projects
None yet
Development

No branches or pull requests

6 participants