Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

url: add urlSearchParams.sort() #11098

Closed
wants to merge 2 commits into from

Conversation

TimothyGu
Copy link
Member

@TimothyGu TimothyGu commented Feb 1, 2017

The URL Standard requires sort() to be stable, which precludes us from using the V8-native sort() function.

I originally wanted to make the function a simple insertion sort only, but I realized it had the potential for a DoS attack because of its nature of being O(n2).

The algorithm for the function follows the norm: insertion sort for small arrays, divide-and-conquer (in this case, a stable merge sort) for larger arrays. In this specific case, the cutoff is set at 59 items, though if performance improvements are made to either sort the cutoff may be adjusted.

Fixes: #10760
Ref: whatwg/url#26
Ref: whatwg/url#199
Ref: web-platform-tests/wpt#4531

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines
Affected core subsystem(s)

url

@nodejs-github-bot nodejs-github-bot added dont-land-on-v4.x whatwg-url Issues and PRs related to the WHATWG URL implementation. labels Feb 1, 2017
@jasnell jasnell requested a review from mscdex February 1, 2017 16:48
@jasnell
Copy link
Member

jasnell commented Feb 1, 2017

@mscdex ... I would appreciate you taking a look at this from a perf point of view

Copy link
Member

@jasnell jasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there

with a [stable sorting algorithm][], so relative order between name-value pairs
with the same name is preserved.

This method can be used, in particular, to increase cache hits.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example may be helpful here.

}

// arbitrary number found through testing
if (len < 118) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should look to benchmark this on a number of different systems to see if this number holds up.

const {URL, URLSearchParams} = require('url');

[
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps also include a test with an empty key? e.g. z=a&=b&c=d

Copy link
Member

@joyeecheung joyeecheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sorting bit looks like C++ code itself really, and it creates two holey arrays, so probably worth just implement those in C++?


update(this[context], this);

function merge(out, start, mid, end, lBuffer, rBuffer) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be moved outside of sort

@TimothyGu
Copy link
Member Author

TimothyGu commented Feb 1, 2017

@joyeecheung said:

so probably worth just implement those in C++?

Just tried doing something like that:

C++ diff
diff --git a/src/node_url.cc b/src/node_url.cc
index 0d5e695..3ab7108 100644
--- a/src/node_url.cc
+++ b/src/node_url.cc
@@ -1389,6 +1389,49 @@ namespace url {
                             v8::NewStringType::kNormal).ToLocalChecked());
   }

+  static void SortParams(const FunctionCallbackInfo<Value>& args) {
+    Environment* env = Environment::GetCurrent(args);
+    CHECK_GE(args.Length(), 1);
+    CHECK(args[0]->IsArray());
+
+    Local<Array> src = args[0].As<Array>();
+    uint32_t len = src->Length();
+
+    std::vector<std::string> a;
+    Copy(env, src, &a);
+    std::vector<std::string> tmp(len);
+
+    for (uint32_t step = 2; step < len; step *= 2) {
+      for (uint32_t start = 0; start < len - 2; start += 2 * step) {
+        uint32_t mid = start + step;
+        uint32_t end = mid + step;
+        end = end < len ? end : len;
+        if (mid > end)
+          continue;
+
+        uint32_t l = start;
+        uint32_t r = mid;
+        uint32_t t = start;
+        while (l < mid && r < end) {
+          if (a[l] <= a[r]) {
+            tmp[t++] = a[l++];
+            tmp[t++] = a[l++];
+          } else {
+            tmp[t++] = a[r++];
+            tmp[t++] = a[r++];
+          }
+        }
+        while (l < mid)
+          tmp[t++] = a[l++];
+        while (r < end)
+          tmp[t++] = a[r++];
+      }
+      tmp.swap(a);
+    }
+
+    args.GetReturnValue().Set(Copy(env, a));
+  }
+
   static void Init(Local<Object> target,
                    Local<Value> unused,
                    Local<Context> context,
@@ -1398,6 +1441,7 @@ namespace url {
     env->SetMethod(target, "encodeAuth", EncodeAuthSet);
     env->SetMethod(target, "domainToASCII", DomainToASCII);
     env->SetMethod(target, "domainToUnicode", DomainToUnicode);
+    env->SetMethod(target, "sortParams", SortParams);

 #define XX(name, _) NODE_DEFINE_CONSTANT(target, name);
     FLAGS(XX)

It's much slower than the JS version: in fact, the two Copys alone are several times slower than the JS sort(). I suspect I might be doing something gravely wrong, but I'm not very interested in pushing this further. Feel free to try to improve on this though.

@TimothyGu
Copy link
Member Author

@jasnell said:

+    // arbitrary number found through testing
+    if (len < 118) {

We should look to benchmark this on a number of different systems to see if this number holds up.

Agreed. I've made a simple benchmark specifically for this purpose.

I'm unfortunately not too well-versed in data analysis languages, so my method of measuring is rather crude. A regression model is probably better-fit for this purpose.

How to run
# create `merge` and `insertion` binaries, with insertion and merge sort blocks resp.
# commented out

# wait. it's gonna run for a while (~16 minutes)
$ node benchmark/compare.js --old ./insertion --new ./merge --filter rrrr --runs 10 url > sort.csv
$ Rscript benchmark/compare.R < sort
My results
                                                      improvement confidence      p.value
 url/url-searchparams-sort-rrrr.js n=100000 type="02"    -27.63 %        *** 2.033434e-04
 url/url-searchparams-sort-rrrr.js n=100000 type="03"    -36.96 %        *** 5.757754e-07
 url/url-searchparams-sort-rrrr.js n=100000 type="04"    -43.56 %        *** 1.783939e-07
 url/url-searchparams-sort-rrrr.js n=100000 type="05"    -45.19 %        *** 7.799419e-06
 url/url-searchparams-sort-rrrr.js n=100000 type="06"    -50.19 %        *** 1.354454e-04
 url/url-searchparams-sort-rrrr.js n=100000 type="07"    -42.88 %        *** 2.282991e-05
 url/url-searchparams-sort-rrrr.js n=100000 type="08"    -44.04 %        *** 1.352008e-06
 url/url-searchparams-sort-rrrr.js n=100000 type="09"    -54.74 %        *** 4.523282e-05
 url/url-searchparams-sort-rrrr.js n=100000 type="10"    -49.68 %        *** 8.312344e-09
 url/url-searchparams-sort-rrrr.js n=100000 type="12"    -54.26 %        *** 3.827682e-15
 url/url-searchparams-sort-rrrr.js n=100000 type="14"    -46.84 %        *** 7.272478e-06
 url/url-searchparams-sort-rrrr.js n=100000 type="16"    -31.54 %        *** 1.889203e-05
 url/url-searchparams-sort-rrrr.js n=100000 type="18"    -27.55 %        *** 2.764037e-06
 url/url-searchparams-sort-rrrr.js n=100000 type="20"    -26.76 %        *** 4.446811e-05
 url/url-searchparams-sort-rrrr.js n=100000 type="22"    -25.42 %        *** 9.297027e-04
 url/url-searchparams-sort-rrrr.js n=100000 type="24"    -24.46 %         ** 1.730617e-03
 url/url-searchparams-sort-rrrr.js n=100000 type="26"    -16.97 %         ** 2.375029e-03
 url/url-searchparams-sort-rrrr.js n=100000 type="28"    -19.62 %        *** 4.361287e-04
 url/url-searchparams-sort-rrrr.js n=100000 type="30"    -18.92 %          * 2.231800e-02
 url/url-searchparams-sort-rrrr.js n=100000 type="32"    -10.70 %        *** 3.871433e-04
 url/url-searchparams-sort-rrrr.js n=100000 type="34"    -23.76 %        *** 2.304507e-04
 url/url-searchparams-sort-rrrr.js n=100000 type="36"     -5.79 %            2.411327e-01
 url/url-searchparams-sort-rrrr.js n=100000 type="38"    -20.30 %         ** 7.036384e-03
 url/url-searchparams-sort-rrrr.js n=100000 type="40"    -10.63 %            7.832040e-02
 url/url-searchparams-sort-rrrr.js n=100000 type="42"      6.09 %            1.150079e-01
 url/url-searchparams-sort-rrrr.js n=100000 type="44"     -1.04 %            7.316006e-01
 url/url-searchparams-sort-rrrr.js n=100000 type="46"    -14.07 %          * 1.335973e-02
 url/url-searchparams-sort-rrrr.js n=100000 type="48"     -6.45 %          * 4.874718e-02
 url/url-searchparams-sort-rrrr.js n=100000 type="50"     11.42 %        *** 7.072396e-04
 url/url-searchparams-sort-rrrr.js n=100000 type="52"      7.85 %          * 1.622392e-02
 url/url-searchparams-sort-rrrr.js n=100000 type="54"     21.93 %         ** 1.438206e-03
 url/url-searchparams-sort-rrrr.js n=100000 type="56"     16.34 %         ** 9.748436e-03
 url/url-searchparams-sort-rrrr.js n=100000 type="58"      6.05 %          * 3.993576e-02
 url/url-searchparams-sort-rrrr.js n=100000 type="60"     28.51 %        *** 1.892919e-04
 url/url-searchparams-sort-rrrr.js n=100000 type="62"     30.69 %         ** 8.003372e-03
 url/url-searchparams-sort-rrrr.js n=100000 type="64"     56.55 %        *** 1.434437e-06
 url/url-searchparams-sort-rrrr.js n=100000 type="66"     17.48 %        *** 1.588903e-04
 url/url-searchparams-sort-rrrr.js n=100000 type="68"     65.50 %        *** 2.107887e-07
 url/url-searchparams-sort-rrrr.js n=100000 type="70"     38.83 %        *** 3.505777e-07
 url/url-searchparams-sort-rrrr.js n=100000 type="72"      5.23 %            7.135306e-02
 url/url-searchparams-sort-rrrr.js n=100000 type="74"     56.92 %        *** 5.389008e-08
 url/url-searchparams-sort-rrrr.js n=100000 type="76"     74.65 %        *** 5.051242e-10
 url/url-searchparams-sort-rrrr.js n=100000 type="78"     24.76 %        *** 3.956218e-06
 url/url-searchparams-sort-rrrr.js n=100000 type="80"     30.39 %        *** 9.959930e-05

@TimothyGu
Copy link
Member Author

@TimothyGu TimothyGu added the semver-minor PRs that contain new features and should be released in the next minor version. label Feb 3, 2017
@TimothyGu
Copy link
Member Author

Added Web Platform Tests and rebased.

New CI: https://ci.nodejs.org/job/node-test-pull-request/6186/

Copy link
Member

@jasnell jasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there.

doc/api/url.md Outdated
params.sort();
console.log(params.toString());
// Prints query%5B%5D=123&query%5B%5D=abc&type=search
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be better to swap the query[]=123 and query[]=abc positions to show that the sort will not affect the ordering of the abc and 123 positions given that 123 would typically appear before abc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Done.

@TimothyGu
Copy link
Member Author

@mscdex, have you had a chance yet to look at this? Frankly, I'm more than happy with its performance, as my makeshift insertion/merge sort hybrid is able to outperform Array.prototype.sort() (which isn't even stable).

} else {
// Bottom-up iterative stable merge sort
const lBuffer = Array(len);
const rBuffer = Array(len);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: new Array

Copy link
Contributor

@mscdex mscdex Feb 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @targos.

@joyeecheung
Copy link
Member

@TimothyGu There is a cost of memory though, Array.prototype.sort is in-place.

// 8 elements
short: 'm&t&d&c&z&v&a&n',
// 88 elements
long: 'g&r&t&h&s&r&d&w&b&n&h&k&x&m&k&h&o&e&x&c&c&g&e&b&p&p&s&n&j&b&y&z&' +
Copy link
Contributor

@mscdex mscdex Feb 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't be 88 parameters because only unique keys are kept by querystring.parse(). One suggestion is to use unique keys first:

long: 'g&r&t&h&s&rr&d&w&b&n&hh&k&x&m&kk&hhh&o&e&xx&c&cc&gg&ee&bb&' +
      'p&pp&ss&nn&j&bbb&y&z&u&l&oo&rrr&ww&a&uu&ll&mm&f&jj&q&ppp&ff&' +
      'eee&yy&eeee&nnn&eeeee&lll&mmm&www&uuu&wwww&tt&nnnn&ttt&qq&v&' +
      'yyy&ccc&ooo&kkk&fff&jjj&i&llll&mmmm&ggg&jjjj&dd&ii&zz&qqq&pppp&' +
      'xxx&qqqq&qqqqq&ddd&nnnnn&yyyy&wwwww&gggg&iii&vv&rrrr'

and then add this before the timed loop to create duplicates:

if (conf.type === 'long') {
  // Make `array` contain duplicate keys, useful for benchmarking stable-ness
  // of sorting
  for (i = 0; i < array.length; i += 2) {
    if (array[i].length > 1)
      array[i] = array[i][0];
  }
}

This should trigger the current alternate code path that uses merge sort.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but querystring.parse returns something like { g: [ '', '' ] } in this case, and this object is converted to [ 'g', '', 'g', '' ]. Either way, I added a small ad-hoc parser to skip querystring.parse, and to also maintain the order of the input.

Copy link
Contributor

@mscdex mscdex Feb 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's just a bug in the current querystring.parse() in master. For example, querystring.parse('a&a&a&a') currently returns { a: [ '', '' ] } instead of { a: [ '', '', '', '' ] }. This bug is probably due to 4e259b2 which may/may not be fixed by #11171. At any rate, using unique keys in the beginning will avoid such possible bugs.

almostsorted: 'a&b&c&d&e&f&g&i&h&j&k&l&m&n&o&p&q&r&s&t&u&w&v&x&y&z',
reversed: 'z&y&x&w&v&u&t&s&r&q&p&o&n&m&l&k&j&i&h&g&f&e&d&c&b&a',
random: 'm&t&d&c&z&v&a&n&p&y&u&o&h&l&f&j&e&q&b&i&s&x&k&w&r&g',
// 8 elements
Copy link
Contributor

@mscdex mscdex Feb 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better name here would be 'search parameters' or similar. The term 'elements' can be confusing if you're familiar with the underlying URLSearchParams implementation (e.g. does 'elements' mean number of pairs or params[searchParams].length?).


const bench = common.createBenchmark(main, {
type: Object.keys(inputs),
n: [1e5]
Copy link
Contributor

@mscdex mscdex Feb 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a little low IMHO. The results came back awfully fast on my machine, which could mean V8 didn't have enough time to properly optimize functions. For example, I used ~5e6 when testing type=long.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to 1e6.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the results come back too fast there is a possibility that this got loop-invariant code-motioned. I usually get alarmed when I see op/s come close to the op/s of an empty loop on my machine.

var j;
for (j = i - 2; j >= 0; j -= 2) {
var tmpKey = a[j];
var tmpVal = a[j + 1];
Copy link
Contributor

@mscdex mscdex Feb 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove these two temporary variables (use the values as-is where needed), you will get a small perf boost.

@mscdex
Copy link
Contributor

mscdex commented Feb 5, 2017

Without diving into the multitude of sort algorithms out there to refresh my memory, these changes are probably fine, except for a few minor nits.

One unrelated nit I noticed is that because the whatwg URL module uses querystring, the order of parameters is not preserved. From what I read the spec does not explicitly state the order in which parameters are returned (for example in params.toString()). At least Chrome preserves the original order (as long as .sort() is not used of course), but node's implementation groups duplicate keys together. I just thought I'd throw that out there.

@mscdex
Copy link
Contributor

mscdex commented Feb 5, 2017

I should also say I'm a bit confused about the "cutoff" being 59. In the code the conditional is len < 90 which would mean 45 pairs. Either way, 90 or 45 are not 59. It's not obvious which count the "cutoff" is taking into account. @TimothyGu Did you perhaps change the "cutoff" value after your original post?

@TimothyGu
Copy link
Member Author

@mscdex, yes I changed the cutoff after moving merge() out of sort(), which helped the performance a bit. Sorry for not making that clear. I'll look at your comments and see if there's anything more I can do.

@TimothyGu
Copy link
Member Author

@joyeecheung

There is a cost of memory though, Array.prototype.sort is in-place.

For most of the cases (when there are fewer than 50 queries), the in-place selection sort is used. For >= 50, merge sort at most makes two more copies of the params array, so O(n) space complexity. (It is also the algorithm used by SpiderMonkey's Array.prototype.sort.) At the end of the day, chances are, sort() probably isn't the function using up the most memory in an app either way.

@TimothyGu
Copy link
Member Author

TimothyGu commented Feb 5, 2017

@mscdex
Copy link
Contributor

mscdex commented Feb 5, 2017

For >= 50, merge sort at most makes two more copies of the params array, so O(n) space complexity.

You mean O(2n)?

@TimothyGu
Copy link
Member Author

You mean O(2n)?

Coefficients are irrelevant in Big O notation 😺

const searchParams = require('internal/url').searchParamsSymbol;
const input = inputs[conf.type];
const n = conf.n | 0;
const params = new URLSearchParams;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO () should be added after the constructor name.

@TimothyGu
Copy link
Member Author

@jasnell, @targos, @mscdex, can I get some LGTMs please :)

@mscdex
Copy link
Contributor

mscdex commented Feb 8, 2017

@TimothyGu
Copy link
Member Author

Rebased. Going to apply on Monday if there are no more objections.

CI: https://ci.nodejs.org/job/node-test-pull-request/6356/

@TimothyGu
Copy link
Member Author

Landed in 02d1e32 and 781eb90.

@TimothyGu TimothyGu closed this Feb 14, 2017
TimothyGu added a commit that referenced this pull request Feb 14, 2017
PR-URL: #11098
Fixes: #10760
Ref: whatwg/url#26
Ref: whatwg/url#199
Ref: web-platform-tests/wpt#4531
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
TimothyGu added a commit that referenced this pull request Feb 14, 2017
PR-URL: #11098
Fixes: #10760
Ref: whatwg/url#26
Ref: whatwg/url#199
Ref: web-platform-tests/wpt#4531
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
@TimothyGu TimothyGu deleted the urlsearchparams-sort branch February 14, 2017 04:50
TimothyGu added a commit to TimothyGu/node that referenced this pull request Feb 18, 2017
TimothyGu added a commit to TimothyGu/node that referenced this pull request Feb 18, 2017
italoacasas pushed a commit to italoacasas/node that referenced this pull request Feb 20, 2017
italoacasas pushed a commit to italoacasas/node that referenced this pull request Feb 20, 2017
italoacasas pushed a commit that referenced this pull request Feb 22, 2017
italoacasas pushed a commit that referenced this pull request Feb 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-minor PRs that contain new features and should be released in the next minor version. whatwg-url Issues and PRs related to the WHATWG URL implementation.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants