Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

url: fix surrogate handling in encodeAuth() #11161

Closed
wants to merge 2 commits into from

Conversation

TimothyGu
Copy link
Member

@TimothyGu TimothyGu commented Feb 4, 2017

Currently, the legacy URL stringifier miscalculates the offset of an extra surrogate, causing the high surrogate to be included unescaped:

> url.format({ auth: '\uD83D\uDE00', hostname: 'a' })
'%F0%9F%98%80�@a'
> Array.from(url.format({ auth: '\uD83D\uDE00', hostname: 'a' })).map(ch => ch.charCodeAt(0).toString(16)).join(' ')
'25 46 30 25 39 46 25 39 38 25 38 30 de00 40 61'
//%  F  0  %  9  F  %  9  8  %  8  0    �  @  a

Also included in the PR is a commit that moves the function, which is only used in the legacy url module, to that file. Certain generic percent encoding handling functions and tables are also moved to a separate lib/internal/querystring.js.

CI: https://ci.nodejs.org/job/node-test-pull-request/6212/

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

url, querystring

@nodejs-github-bot nodejs-github-bot added build Issues and PRs related to build files or the CI. dont-land-on-v4.x querystring Issues and PRs related to the built-in querystring module. url Issues and PRs related to the legacy built-in url module. whatwg-url Issues and PRs related to the WHATWG URL implementation. labels Feb 4, 2017
const hexTable = [];
for (var i = 0; i < 256; ++i)
hexTable[i] = '%' + ((i < 16 ? '0' : '') + i.toString(16)).toUpperCase();

// These characters do not need escaping when generating query strings:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving other private functions to internal/querystring too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only moved this array because it is used in multiple places (querystring and url). The only other entity in querystring that is somewhat shared is ParsedQueryString, a copy of which is in url (as ParsedQueryString as well) and in internal/url (as StorageObject). I've moved this as well. (There is also an analog of it in events as EventHandlers, but I kept it as it is since it'd be weird for that module to import from internal/querystring).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think placing private bits intointernal would make the boundaries clearer, whether its shared or not, like BufferList for _stream_readable.js (only used once there).

lib/url.js Outdated
@@ -948,3 +948,67 @@ function spliceOne(list, index) {
list[i] = list[k];
list.pop();
}

function encodeAuth(str) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this originally came from querystring.escape(), but that function's implementation has recently improved performance-wise and we should probably incorporate those changes here (specifically the ASCII value checking -- it uses a lookup table now).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. 3% improvement on url/legacy-vs-whatwg-url-serialize.js with type=auth method=legacy n=5e5.

@@ -15,6 +15,7 @@ const QueryString = module.exports = {
decode: parse
};
const Buffer = require('buffer').Buffer;
const hexTable = require('internal/querystring').hexTable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be more consistent about how we import single items like this. Here we're using the pre-ES6 style, but in other places we are destructuring like const { hexTable } = require('internal/querystring');

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially followed the style of existing imports in the file (see Buffer above), but since I'm importing multiple items from this module now I started using destructuring.

@TimothyGu
Copy link
Member Author

@joyeecheung and @mscdex, comments addressed.

New CI: https://ci.nodejs.org/job/node-test-pull-request/6213/

@mscdex, I also factored out the entire percent encoding function so that it can be shared between querystring's qsEscape() and url's encodeAuth. Only trivial performance drop is observed in querystring.

Benchmark results
                                                                            improvement confidence      p.value
 querystring/querystring-stringify.js n=300000 type="encodelast"                -1.59 %            0.3240629986
 querystring/querystring-stringify.js n=300000 type="encodemany"                -0.23 %            0.6514424297
 querystring/querystring-stringify.js n=300000 type="noencode"                  -2.94 %          * 0.0165429734
 url/legacy-vs-whatwg-url-serialize.js n=500000 method="legacy" type="auth"      3.80 %        *** 0.0006922195

@TimothyGu
Copy link
Member Author

}
}

// TODO support returning arbitrary buffers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off-topic: what does this TODO mean? Take a buffer and escape its content?

Copy link
Member Author

@TimothyGu TimothyGu Feb 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume it means using a user-provided buffer. However, at the same time, this comment goes all the way back to 57d8172 (v0.3.2), so whether this TODO is still applicable isn't clear by any means.

Anyone is welcome to initiate a formal deprecation process of these exports in another PR.


// QueryString.escape() replaces encodeURIComponent()
// http://www.ecma-international.org/ecma-262/5.1/#sec-15.1.3.4
function percentEscape(noEscape, str) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: s/noEscape/noEscapeTable would be more intuitive, noEscape sounds like a boolean..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks.

StorageObject,
percentEscape,
unescapeBuffer
} = require('internal/querystring');
const QueryString = module.exports = {
unescapeBuffer,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off-topic: unescapeBuffer, encode and decode are all removed from the docs, mind adding a comment?
(Not sure about which stage of deprecation they are in)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encode and decode were added in 4ce100f (v0.1.99). unescapeBuffer is slightly newer, but by not much, being added in 6ff12c4 (v0.3.2). None of them have ever been documented.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm..mind adding a comment about that they have never been documented, so technically not public APIs, just exposed by accident?

Copy link
Contributor

@mscdex mscdex Feb 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd almost rather just document them IMHO, at least unescapeBuffer() since it's not an alias to an already documented method. I've personally had uses for unescapeBuffer() in my own projects. *shrug*

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to deprecations or anything, just now that we are refactoring things, might as well :P

@mscdex
Copy link
Contributor

mscdex commented Feb 6, 2017

I just benchmarked the current changes in this PR and there is a 3-4% performance regression (with high "confidence" ratings) in the querystring-stringify benchmarks. With that in mind, I think we should avoid extracting the logic into a separate function (percentEscape()), unless something can be done to avoid the regressions. I don't know if binding the function with the table once in each module will make any difference or not, but it's another idea to try I suppose.

@jasnell
Copy link
Member

jasnell commented Feb 6, 2017

Ouch, yeah, with numbers like that I have to agree.

@TimothyGu
Copy link
Member Author

@mscdex, okay for now I left percent encoding alone in their respective files.

CI: https://ci.nodejs.org/job/node-test-pull-request/6362/

StorageObject.prototype = Object.create(null);

// a safe fast alternative to decodeURIComponent
function unescapeBuffer(s, decodeSpaces) {
Copy link
Contributor

@mscdex mscdex Feb 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be moved back to lib/querystring.js as well, since it's not used anywhere else, along with unhexTable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one reason to keep it in internal is that unescapeBuffer is exported to user land by accident, putting it in internal should make it clear(if there is no performance regression caused by this though).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should just document unescapeBuffer(), since it can be useful (and I've used it in some of my projects before for decoding binary/latin1 strings).

Copy link
Member

@joyeecheung joyeecheung Feb 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so probably move it back to lib/querystring.js for now, and let another PR sort this issue out, no matter we want to make it really internal or start to document it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say yes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, this PR does not modify the exports on querystring: unescapeBuffer() is still exported on require('querystring'), just the physical location is changed.

That said, I thought this function would be of interest for other modules like url as well, which was why I moved it to internal/url.js in the first place. As it doesn't seem like this decision is popular, I've reverted it. PTAL.

return out.slice(0, outIndex);
}

exports.hexTable = hexTable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be switched to the module.exports = { ... } syntax instead (for faster property access)?

@TimothyGu
Copy link
Member Author

@TimothyGu
Copy link
Member Author

TimothyGu commented Feb 14, 2017

@mscdex, any more comments? Note, I am not against documenting qs.unescapeBuffer(), but it falls outside the scope of this PR which is really about fixing a corner case in url.format().

@mscdex
Copy link
Contributor

mscdex commented Feb 14, 2017

@TimothyGu I recognize that, I was just advocating for returning it back to lib/querystring.js.

Current changes LGTM.

@TimothyGu TimothyGu removed the whatwg-url Issues and PRs related to the WHATWG URL implementation. label Feb 14, 2017
TimothyGu added a commit to TimothyGu/node that referenced this pull request Feb 14, 2017
Also factor out common parts in querystring and url.

PR-URL: nodejs#11161
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
@TimothyGu
Copy link
Member Author

Landed in c6b12d0.

@TimothyGu TimothyGu closed this Feb 14, 2017
@TimothyGu TimothyGu deleted the url-fix branch February 14, 2017 20:15
TimothyGu added a commit to TimothyGu/node that referenced this pull request Feb 18, 2017
Also factor out common parts in querystring and url.

Backport-of: nodejs#11161
italoacasas pushed a commit to italoacasas/node that referenced this pull request Feb 20, 2017
Also factor out common parts in querystring and url.

Backport-of: nodejs#11161
italoacasas pushed a commit that referenced this pull request Feb 22, 2017
Also factor out common parts in querystring and url.

Backport-of: #11161
TimothyGu added a commit to TimothyGu/node that referenced this pull request Mar 4, 2017
Currently, the legacy URL stringifier miscalculates the offset of an
extra surrogate, causing the high surrogate to be included unescaped:

    > url.format({ auth: '\uD83D\uDE00', hostname: 'a' })
    '//%F0%9F%98%80�@A'

PR-URL: nodejs#11387
Backport-of: nodejs#11161
MylesBorins pushed a commit that referenced this pull request Mar 8, 2017
Currently, the legacy URL stringifier miscalculates the offset of an
extra surrogate, causing the high surrogate to be included unescaped:

    > url.format({ auth: '\uD83D\uDE00', hostname: 'a' })
    '//%F0%9F%98%80�@A'

PR-URL: #11387
Backport-of: #11161
MylesBorins pushed a commit that referenced this pull request Mar 9, 2017
Currently, the legacy URL stringifier miscalculates the offset of an
extra surrogate, causing the high surrogate to be included unescaped:

    > url.format({ auth: '\uD83D\uDE00', hostname: 'a' })
    '//%F0%9F%98%80�@A'

PR-URL: #11387
Backport-of: #11161
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Issues and PRs related to build files or the CI. querystring Issues and PRs related to the built-in querystring module. url Issues and PRs related to the legacy built-in url module.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants