Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: improve buffer.transcode performance #54153

Merged

Conversation

anonrig
Copy link
Member

@anonrig anonrig commented Aug 1, 2024

Let's improve buffer.transcode performance for UTF8 to UTF16le and UTF16le to UTF8 encodings. I'm working on similar implementation on workerd at cloudflare/workerd#2462.

Benchmark CI: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1591/

                                                                                          confidence improvement accuracy (*)   (**)  (***)
buffers/buffer-transcode.js n=100000 length=1 toEncoding='ascii' fromEncoding='latin1'                   -0.73 %       ±1.12% ±1.49% ±1.94%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='ascii' fromEncoding='ucs2'             ***      2.23 %       ±1.18% ±1.57% ±2.05%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='ascii' fromEncoding='utf8'                      0.50 %       ±1.16% ±1.55% ±2.01%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='latin1' fromEncoding='ascii'                   -1.05 %       ±1.13% ±1.51% ±1.97%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='latin1' fromEncoding='ucs2'                     0.45 %       ±0.76% ±1.01% ±1.31%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='latin1' fromEncoding='utf8'                    -0.59 %       ±1.50% ±2.00% ±2.61%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='ucs2' fromEncoding='ascii'             ***     26.86 %       ±1.08% ±1.44% ±1.87%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='ucs2' fromEncoding='latin1'            ***     28.85 %       ±0.82% ±1.09% ±1.43%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='ucs2' fromEncoding='utf8'                *      0.82 %       ±0.77% ±1.03% ±1.34%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='utf8' fromEncoding='ascii'               *     -1.15 %       ±1.13% ±1.51% ±1.96%
buffers/buffer-transcode.js n=100000 length=1 toEncoding='utf8' fromEncoding='latin1'                    -0.80 %       ±1.20% ±1.60% ±2.08%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='ascii' fromEncoding='latin1'                   0.01 %       ±1.21% ±1.61% ±2.10%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='ascii' fromEncoding='ucs2'            ***      1.57 %       ±0.63% ±0.84% ±1.09%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='ascii' fromEncoding='utf8'                    -0.12 %       ±1.04% ±1.39% ±1.81%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='latin1' fromEncoding='ascii'                  -0.66 %       ±1.00% ±1.33% ±1.73%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='latin1' fromEncoding='ucs2'                   -0.47 %       ±0.66% ±0.88% ±1.14%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='latin1' fromEncoding='utf8'                    0.47 %       ±1.44% ±1.92% ±2.50%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='ucs2' fromEncoding='ascii'            ***     27.27 %       ±0.86% ±1.15% ±1.50%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='ucs2' fromEncoding='latin1'           ***     28.94 %       ±1.06% ±1.41% ±1.85%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='ucs2' fromEncoding='utf8'                     -0.87 %       ±0.92% ±1.23% ±1.60%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='utf8' fromEncoding='ascii'                    -0.23 %       ±1.08% ±1.44% ±1.87%
buffers/buffer-transcode.js n=100000 length=10 toEncoding='utf8' fromEncoding='latin1'             *     -1.51 %       ±1.19% ±1.58% ±2.06%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='ascii' fromEncoding='latin1'          *     -1.00 %       ±0.81% ±1.07% ±1.39%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='ascii' fromEncoding='ucs2'          ***      1.79 %       ±0.36% ±0.49% ±0.64%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='ascii' fromEncoding='utf8'                  -0.60 %       ±0.79% ±1.05% ±1.36%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='latin1' fromEncoding='ascii'                 0.04 %       ±0.65% ±0.86% ±1.12%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='latin1' fromEncoding='ucs2'                  0.02 %       ±0.38% ±0.51% ±0.68%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='latin1' fromEncoding='utf8'           *     -0.83 %       ±0.82% ±1.09% ±1.42%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='ucs2' fromEncoding='ascii'          ***     17.87 %       ±0.69% ±0.93% ±1.22%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='ucs2' fromEncoding='latin1'         ***      8.76 %       ±0.71% ±0.95% ±1.24%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='ucs2' fromEncoding='utf8'           ***     27.92 %       ±0.62% ±0.82% ±1.07%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='utf8' fromEncoding='ascii'                   0.26 %       ±0.34% ±0.46% ±0.60%
buffers/buffer-transcode.js n=100000 length=1000 toEncoding='utf8' fromEncoding='latin1'                 -0.01 %       ±0.46% ±0.61% ±0.79%

@anonrig anonrig requested a review from jasnell August 1, 2024 02:12
@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. needs-ci PRs that need a full CI run. labels Aug 1, 2024
@anonrig anonrig changed the title Improve transcode performance src: improve buffer.transcode performance Aug 1, 2024
@anonrig anonrig requested a review from lemire August 1, 2024 02:15
@nodejs-github-bot

This comment was marked as outdated.

@anonrig anonrig requested a review from mcollina August 1, 2024 02:19
@nodejs-github-bot
Copy link
Collaborator

Copy link

codecov bot commented Aug 1, 2024

Codecov Report

Attention: Patch coverage is 64.51613% with 11 lines in your changes missing coverage. Please review.

Project coverage is 87.32%. Comparing base (2226155) to head (aab9d64).
Report is 10 commits behind head on main.

Files Patch % Lines
src/node_i18n.cc 64.51% 7 Missing and 4 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #54153   +/-   ##
=======================================
  Coverage   87.32%   87.32%           
=======================================
  Files         648      648           
  Lines      182321   182297   -24     
  Branches    34973    34969    -4     
=======================================
- Hits       159210   159195   -15     
- Misses      16376    16384    +8     
+ Partials     6735     6718   -17     
Files Coverage Δ
src/node_i18n.cc 73.06% <64.51%> (-2.83%) ⬇️

... and 24 files with indirect coverage changes

@nodejs-github-bot

This comment was marked as outdated.

@anonrig anonrig force-pushed the improve-transcode-performance branch from 956e92c to 1c4812e Compare August 1, 2024 20:22
@nodejs-github-bot

This comment was marked as outdated.

@anonrig anonrig force-pushed the improve-transcode-performance branch 2 times, most recently from 5f9f633 to 514b0b2 Compare August 1, 2024 20:43
@nodejs-github-bot

This comment was marked as outdated.

@anonrig anonrig force-pushed the improve-transcode-performance branch from 514b0b2 to 63092af Compare August 3, 2024 17:11
@nodejs-github-bot

This comment was marked as outdated.

@anonrig anonrig force-pushed the improve-transcode-performance branch 4 times, most recently from 63a202d to 7b2b153 Compare August 3, 2024 17:46
@nodejs-github-bot

This comment was marked as outdated.

@anonrig anonrig force-pushed the improve-transcode-performance branch from 7b2b153 to 3071797 Compare August 4, 2024 15:07
@nodejs-github-bot
Copy link
Collaborator

Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@targos
Copy link
Member

targos commented Aug 6, 2024

Still fails on big endian platforms

@richardlau
Copy link
Member

@anonrig Try

diff --git a/src/node_i18n.cc b/src/node_i18n.cc
index fab3d7c1c8..a3dee5ee69 100644
--- a/src/node_i18n.cc
+++ b/src/node_i18n.cc
@@ -178,7 +178,7 @@ MaybeLocal<Object> TranscodeLatin1ToUcs2(Environment* env,
                                          UErrorCode* status) {
   MaybeStackBuffer<UChar> destbuf(source_length);
   auto actual_length =
-      simdutf::convert_latin1_to_utf16(source, source_length, destbuf.out());
+      simdutf::convert_latin1_to_utf16le(source, source_length, destbuf.out());
   if (actual_length == 0) {
     *status = U_INVALID_CHAR_FOUND;
     return {};
@@ -222,7 +222,7 @@ MaybeLocal<Object> TranscodeUcs2FromUtf8(Environment* env,
       simdutf::utf16_length_from_utf8(source, source_length);
   MaybeStackBuffer<UChar> destbuf(expected_utf16_length);
   auto actual_length =
-      simdutf::convert_utf8_to_utf16(source, source_length, destbuf.out());
+      simdutf::convert_utf8_to_utf16le(source, source_length, destbuf.out());

   if (actual_length == 0) {
     *status = U_INVALID_CHAR_FOUND;
@@ -239,12 +239,12 @@ MaybeLocal<Object> TranscodeUtf8FromUcs2(Environment* env,
                                          const size_t source_length,
                                          UErrorCode* status) {
   const size_t length_in_chars = source_length / sizeof(UChar);
-  size_t expected_utf8_length = simdutf::utf8_length_from_utf16(
+  size_t expected_utf8_length = simdutf::utf8_length_from_utf16le(
       reinterpret_cast<const char16_t*>(source), length_in_chars);

   MaybeStackBuffer<char> destbuf(expected_utf8_length);
   auto actual_length =
-      simdutf::convert_utf16_to_utf8(reinterpret_cast<const char16_t*>(source),
+      simdutf::convert_utf16le_to_utf8(reinterpret_cast<const char16_t*>(source),
                                      length_in_chars,
                                      destbuf.out());

This is based on V8 internally storing utf16 as little endian (even on big endian platforms). e.g.

node/doc/api/buffer.md

Lines 149 to 152 in 0260bbe

* `'utf16le'` (alias: `'utf-16le'`): Multi-byte encoded Unicode characters.
Unlike `'utf8'`, each character in the string will be encoded using either 2
or 4 bytes. Node.js only supports the [little-endian][endianness] variant of
[UTF-16][].

case UCS2: return "utf16le";

@anonrig anonrig force-pushed the improve-transcode-performance branch from c198be0 to b1f95c5 Compare August 6, 2024 19:10
@nodejs-github-bot
Copy link
Collaborator

@lemire
Copy link
Member

lemire commented Aug 6, 2024

@richardlau Good catch. The generic utf16 is system-dependent.

@anonrig anonrig force-pushed the improve-transcode-performance branch from b1f95c5 to d712db2 Compare August 7, 2024 16:16
@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

@RedYetiDev RedYetiDev added buffer Issues and PRs related to the buffer subsystem. performance Issues and PRs related to the performance of Node.js. labels Aug 9, 2024
@richardlau
Copy link
Member

CI: https://ci.nodejs.org/job/node-test-pull-request/60981/

The added benchmark fails with the withoutintl build (configure --without-intl).

00:18:35 not ok 4059 benchmark/test-benchmark-buffer
00:18:35   ---
00:18:35   duration_ms: 3419.22200
00:18:35   severity: fail
00:18:35   exitcode: 1
00:18:35   stack: |-
00:18:35     /home/iojs/build/workspace/node-test-commit-linux-containered/benchmark/buffers/buffer-transcode.js:24
00:18:35         const dest = buffer.transcode(input, fromEncoding, toEncoding);
00:18:35                             ^
00:18:35     
00:18:35     TypeError: buffer.transcode is not a function
00:18:35         at main (/home/iojs/build/workspace/node-test-commit-linux-containered/benchmark/buffers/buffer-transcode.js:24:25)
00:18:35         at /home/iojs/build/workspace/node-test-commit-linux-containered/benchmark/common.js:54:9
00:18:35         at process.processTicksAndRejections (node:internal/process/task_queues:85:11)
00:18:35     
00:18:35     Node.js v23.0.0-pre
00:18:35     node:assert:126
00:18:35       throw new AssertionError(obj);
00:18:35       ^
00:18:35     
00:18:35     AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
00:18:35     
00:18:35     1 !== 0
00:18:35     
00:18:35         at ChildProcess.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux-containered/test/common/benchmark.js:28:12)
00:18:35         at ChildProcess.emit (node:events:520:28)
00:18:35         at ChildProcess._handle.onexit (node:internal/child_process:294:12) {
00:18:35       generatedMessage: true,
00:18:35       code: 'ERR_ASSERTION',
00:18:35       actual: 1,
00:18:35       expected: 0,
00:18:35       operator: 'strictEqual'
00:18:35     }
00:18:35     
00:18:35     Node.js v23.0.0-pre
00:18:35   ...

@anonrig anonrig force-pushed the improve-transcode-performance branch from d712db2 to dd548de Compare August 18, 2024 19:22
@anonrig anonrig force-pushed the improve-transcode-performance branch from dd548de to aab9d64 Compare August 18, 2024 19:23
@anonrig
Copy link
Member Author

anonrig commented Aug 18, 2024

Thank you @richardlau for all the help.

@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

@anonrig anonrig added the commit-queue Add this label to land a pull request using GitHub Actions. label Aug 20, 2024
@nodejs-github-bot nodejs-github-bot removed the commit-queue Add this label to land a pull request using GitHub Actions. label Aug 20, 2024
@nodejs-github-bot nodejs-github-bot merged commit 7c76fa0 into nodejs:main Aug 20, 2024
52 checks passed
@nodejs-github-bot
Copy link
Collaborator

Landed in 7c76fa0

RafaelGSS pushed a commit that referenced this pull request Aug 25, 2024
PR-URL: #54153
Reviewed-By: Daniel Lemire <daniel@lemire.me>
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Minwoo Jung <nodecorelab@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
@RafaelGSS RafaelGSS mentioned this pull request Aug 25, 2024
targos pushed a commit that referenced this pull request Sep 21, 2024
PR-URL: #54153
Reviewed-By: Daniel Lemire <daniel@lemire.me>
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Minwoo Jung <nodecorelab@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. needs-ci PRs that need a full CI run. performance Issues and PRs related to the performance of Node.js.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants