Cache conscious hashmap table #36692

Merged: 1 commit, Oct 14, 2016

arthurprs
Contributor

@arthurprs arthurprs commented Sep 24, 2016

Right now the internal HashMap representation is 3 unzipped arrays, hhhkkkvvv. I propose changing it to hhhkvkvkv (in further iterations, kvkvkvhhh may allow in-place growth). A previous attempt is at #21973.

benefits

This layout is generally more cache conscious, as it makes the value immediately accessible after a key matches. Keeping the hash array separate is a no-brainer because of how the RH (Robin Hood) algorithm works, and that's unchanged.

Lookups: Upon a successful match in the hash array, the code can check the key and immediately access the value in the same or the next cache line (effectively saving an L[1,2,3] miss compared to the current layout).
Inserts/Deletes/Resize: Moving values in the table (robin hooding it) is faster because it touches consecutive cache lines and uses fewer instructions.
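
The lookup path described above can be sketched roughly as follows. This is a simplified, hypothetical table, not the std implementation: `Table`, `EMPTY`, `insert` and `get` are names made up for illustration, it uses plain linear probing instead of Robin Hood displacement, it never grows, and it stores pairs as `Option<(K, V)>` to stay in safe code.

```rust
// Hypothetical, simplified sketch of the hhhkvkvkv layout (not the std code).
const EMPTY: u64 = 0; // a stored hash of 0 marks an empty bucket

struct Table<K, V> {
    hashes: Vec<u64>,           // hhh...: scanned first during probing
    pairs: Vec<Option<(K, V)>>, // kvkvkv...: key and value sit side by side
    len: usize,
}

impl<K: PartialEq, V> Table<K, V> {
    fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        Table {
            hashes: vec![EMPTY; capacity],
            pairs: (0..capacity).map(|_| None).collect(),
            len: 0,
        }
    }

    // Linear-probing insert; the caller supplies the hash to keep the sketch small.
    fn insert(&mut self, hash: u64, key: K, value: V) {
        assert!(self.len + 1 < self.hashes.len(), "this sketch never grows");
        let hash = hash | (1 << 63); // set the top bit so a real hash is never EMPTY
        let mask = self.hashes.len() - 1;
        let mut i = (hash as usize) & mask;
        loop {
            if self.hashes[i] == EMPTY {
                self.hashes[i] = hash;
                self.pairs[i] = Some((key, value));
                self.len += 1;
                return;
            }
            if self.hashes[i] == hash {
                if let Some((k, v)) = self.pairs[i].as_mut() {
                    if *k == key {
                        *v = value; // same key: overwrite the value in place
                        return;
                    }
                }
            }
            i = (i + 1) & mask;
        }
    }

    // Probe the hash array; only on a hash match touch the pair array,
    // where the value is adjacent to the key that was just compared.
    fn get(&self, hash: u64, key: &K) -> Option<&V> {
        let hash = hash | (1 << 63);
        let mask = self.hashes.len() - 1;
        let mut i = (hash as usize) & mask;
        while self.hashes[i] != EMPTY {
            if self.hashes[i] == hash {
                if let Some((k, v)) = self.pairs[i].as_ref() {
                    if k == key {
                        return Some(v);
                    }
                }
            }
            i = (i + 1) & mask;
        }
        None
    }
}

fn main() {
    let mut t = Table::new(8);
    t.insert(3, "a", 10);
    t.insert(11, "b", 20); // 11 and 3 map to the same slot in an 8-bucket table
    assert_eq!(t.get(3, &"a"), Some(&10));
    assert_eq!(t.get(11, &"b"), Some(&20));
    assert_eq!(t.get(3, &"c"), None);
}
```

In the old hhhkkkvvv layout, the same `get` would index a separate values array after the key comparison, which is the extra potential cache miss mentioned above.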

Some supporting benchmarks (besides the ones below) for the benefits of this layout can also be seen here: http://www.reedbeta.com/blog/2015/01/12/data-oriented-hash-table/

drawbacks

The obvious drawback is that padding can be wasted between the key and value. Because of that, keys(), values() and contains() can consume more cache and be slower.

Total wasted padding between items (C being the capacity of the table):

  • Old layout: C * (K-K padding) + C * (V-V padding)
  • Proposed: C * (K-V padding) + C * (V-K padding)

In practice, padding between K-K and V-V can be smaller than K-V and V-K. The overhead is capped(ish) at sizeof u64 - 1, so we can actually measure the worst case (u8 at the end of the key type and a value with alignment of 1, hardly the average case in practice).
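
A minimal sketch of how the padding terms above can be measured with `std::mem::size_of`. The helper names (`padding_bytes`, `old_bytes_per_bucket`, `new_bytes_per_bucket`, `overhead_percent`) are made up for illustration, the model counts only one u64 hash plus the key and value per bucket, and the asserted sizes assume a typical 64-bit target where `u64` is 8-byte aligned.

```rust
use std::mem::size_of;

/// Padding the compiler adds when K and V are stored together as a pair.
fn padding_bytes<K, V>() -> usize {
    size_of::<(K, V)>() - size_of::<K>() - size_of::<V>()
}

/// Old layout (hhhkkkvvv): one u64 hash, one K and one V, each in its own array.
fn old_bytes_per_bucket<K, V>() -> usize {
    size_of::<u64>() + size_of::<K>() + size_of::<V>()
}

/// Proposed layout (hhhkvkvkv): one u64 hash plus one (K, V) pair.
fn new_bytes_per_bucket<K, V>() -> usize {
    size_of::<u64>() + size_of::<(K, V)>()
}

/// Extra memory of the proposed layout relative to the old one, in percent.
fn overhead_percent<K, V>() -> f64 {
    let old = old_bytes_per_bucket::<K, V>() as f64;
    let new = new_bytes_per_bucket::<K, V>() as f64;
    (new / old - 1.0) * 100.0
}

fn main() {
    println!("(u64, u16): {} padding bytes, {:.0}% overhead",
             padding_bytes::<u64, u16>(), overhead_percent::<u64, u16>());
    println!("(u64, u32): {} padding bytes, {:.0}% overhead",
             padding_bytes::<u64, u32>(), overhead_percent::<u64, u32>());
    println!("(u64, u64): {} padding bytes, {:.0}% overhead",
             padding_bytes::<u64, u64>(), overhead_percent::<u64, u64>());
}
```

Note that `size_of::<K>()` already includes any K-K padding, so this model only isolates the extra K-V padding introduced by pairing.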

Starting from the worst case the memory overhead is:

  • HashMap<u64, u8> 46% memory overhead. (aka worst case)
  • HashMap<u64, u16> 33% memory overhead.
  • HashMap<u64, u32> 20% memory overhead.
  • HashMap<T, T> 0% memory overhead
  • Worst case based on sizeof K + sizeof V:

| x              |  16    |  24    |  32    |  64   |  128  |
|----------------|--------|--------|--------|-------|-------|
| (8+x+7)/(8+x)  |  1.29  |  1.22  |  1.18  |  1.1  |  1.05 |
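
The worst-case ratio in the table can be recomputed with a few lines. This is only a sketch of the formula quoted above: `worst_case_ratio` is a hypothetical helper name, the 8 is taken to be the u64 hash and the 7 the maximum padding (sizeof u64 - 1).

```rust
/// Worst-case memory ratio (proposed / old) for x = sizeof K + sizeof V,
/// following the formula (8 + x + 7) / (8 + x) from the table.
fn worst_case_ratio(x: u32) -> f64 {
    let x = f64::from(x);
    (8.0 + x + 7.0) / (8.0 + x)
}

fn main() {
    for &x in &[16u32, 24, 32, 64, 128] {
        println!("x = {:3}  ratio = {:.2}", x, worst_case_ratio(x));
    }
}
```

The ratio approaches 1 as the entry grows, since the 7 bytes of padding are amortized over a larger pair.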

benchmarks

I have a test repo for running the benchmarks: https://github.com/arthurprs/hashmap2/tree/layout

 ➜  hashmap2 git:(layout) ✗ cargo benchcmp hhkkvv:: hhkvkv:: bench.txt
 name                            hhkkvv:: ns/iter  hhkvkv:: ns/iter  diff ns/iter   diff % 
 grow_10_000                     922,064           783,933               -138,131  -14.98% 
 grow_big_value_10_000           1,901,909         1,171,862             -730,047  -38.38% 
 grow_fnv_10_000                 443,544           418,674                -24,870   -5.61% 
 insert_100                      2,469             2,342                     -127   -5.14% 
 insert_1000                     23,331            21,536                  -1,795   -7.69% 
 insert_100_000                  4,748,048         3,764,305             -983,743  -20.72% 
 insert_10_000                   321,744           290,126                -31,618   -9.83% 
 insert_int_bigvalue_10_000      749,764           407,547               -342,217  -45.64% 
 insert_str_10_000               337,425           334,009                 -3,416   -1.01% 
 insert_string_10_000            788,667           788,262                   -405   -0.05% 
 iter_keys_100_000               394,484           374,161                -20,323   -5.15% 
 iter_keys_big_value_100_000     402,071           620,810                218,739   54.40% 
 iter_values_100_000             424,794           373,004                -51,790  -12.19% 
 iterate_100_000                 424,297           389,950                -34,347   -8.10% 
 lookup_100_000                  189,997           186,554                 -3,443   -1.81% 
 lookup_100_000_bigvalue         192,509           189,695                 -2,814   -1.46% 
 lookup_10_000                   154,251           145,731                 -8,520   -5.52% 
 lookup_10_000_bigvalue          162,315           146,527                -15,788   -9.73% 
 lookup_10_000_exist             132,769           128,922                 -3,847   -2.90% 
 lookup_10_000_noexist           146,880           144,504                 -2,376   -1.62% 
 lookup_1_000_000                137,167           132,260                 -4,907   -3.58% 
 lookup_1_000_000_bigvalue       141,130           134,371                 -6,759   -4.79% 
 lookup_1_000_000_bigvalue_unif  567,235           481,272                -85,963  -15.15% 
 lookup_1_000_000_unif           589,391           453,576               -135,815  -23.04% 
 merge_shuffle                   1,253,357         1,207,387              -45,970   -3.67% 
 merge_simple                    40,264,690        37,996,903          -2,267,787   -5.63% 
 new                             6                 5                           -1  -16.67% 
 with_capacity_10e5              3,214             3,256                       42    1.31%
➜  hashmap2 git:(layout) ✗ cargo benchcmp hhkkvv:: hhkvkv:: bench.txt                                           
 name                           hhkkvv:: ns/iter  hhkvkv:: ns/iter  diff ns/iter   diff % 
 iter_keys_100_000              391,677           382,839                 -8,838   -2.26% 
 iter_keys_1_000_000            10,797,360        10,209,898            -587,462   -5.44% 
 iter_keys_big_value_100_000    414,736           662,255                247,519   59.68% 
 iter_keys_big_value_1_000_000  10,147,837        12,067,938           1,920,101   18.92% 
 iter_values_100_000            440,445           377,080                -63,365  -14.39% 
 iter_values_1_000_000          10,931,844        9,979,173             -952,671   -8.71% 
 iterate_100_000                428,644           388,509                -40,135   -9.36% 
 iterate_1_000_000              11,065,419        10,042,427          -1,022,992   -9.24%

@rust-highfive
Collaborator

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

@@ -371,8 +370,7 @@ impl<K, V, M> EmptyBucket<K, V, M>
     pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
         unsafe {
             *self.raw.hash = hash.inspect();
-            ptr::write(self.raw.key as *mut K, key);
-            ptr::write(self.raw.val as *mut V, value);
+            ptr::write(self.raw.pair as *mut (K, V), (key, value));
Member

It would feel more natural to have two writes here and skip making a tuple. Does it matter either direction for performance?

Contributor Author

@arthurprs arthurprs Sep 24, 2016

I looked at the disassembly, and the end result seems to be the same.

For (usize, usize), both versions compile to MOVDQU.

For (usize, [u64; 10]), it's:

    pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
        unsafe {
            *self.raw.hash = hash.inspect();
    ec1d:       49 89 39                mov    %rdi,(%r9)
    ec20:       48 8b 8d 10 fe ff ff    mov    -0x1f0(%rbp),%rcx
    ec27:       49 89 0c 24             mov    %rcx,(%r12)
    ec2b:       0f 28 85 50 ff ff ff    movaps -0xb0(%rbp),%xmm0
    ec32:       41 0f 11 44 24 48       movups %xmm0,0x48(%r12)
    ec38:       0f 28 85 10 ff ff ff    movaps -0xf0(%rbp),%xmm0
    ec3f:       0f 28 8d 20 ff ff ff    movaps -0xe0(%rbp),%xmm1
    ec46:       0f 28 95 30 ff ff ff    movaps -0xd0(%rbp),%xmm2
    ec4d:       0f 28 9d 40 ff ff ff    movaps -0xc0(%rbp),%xmm3
    ec54:       41 0f 11 5c 24 38       movups %xmm3,0x38(%r12)
    ec5a:       41 0f 11 54 24 28       movups %xmm2,0x28(%r12)
    ec60:       41 0f 11 4c 24 18       movups %xmm1,0x18(%r12)
    ec66:       41 0f 11 44 24 08       movups %xmm0,0x8(%r12)
    ec6c:       4c 8b 75 b8             mov    -0x48(%rbp),%r14
            let pair_mut = self.raw.pair as *mut (K, V);
            ptr::write(&mut (*pair_mut).0, key);
            ptr::write(&mut (*pair_mut).1, value);
    pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
        unsafe {
            *self.raw.hash = hash.inspect();
    ec1d:       49 89 39                mov    %rdi,(%r9)
    ec20:       48 8b 8d 10 fe ff ff    mov    -0x1f0(%rbp),%rcx
    ec27:       49 89 0c 24             mov    %rcx,(%r12)
    ec2b:       0f 28 85 50 ff ff ff    movaps -0xb0(%rbp),%xmm0
    ec32:       41 0f 11 44 24 48       movups %xmm0,0x48(%r12)
    ec38:       0f 28 85 10 ff ff ff    movaps -0xf0(%rbp),%xmm0
    ec3f:       0f 28 8d 20 ff ff ff    movaps -0xe0(%rbp),%xmm1
    ec46:       0f 28 95 30 ff ff ff    movaps -0xd0(%rbp),%xmm2
    ec4d:       0f 28 9d 40 ff ff ff    movaps -0xc0(%rbp),%xmm3
    ec54:       41 0f 11 5c 24 38       movups %xmm3,0x38(%r12)
    ec5a:       41 0f 11 54 24 28       movups %xmm2,0x28(%r12)
    ec60:       41 0f 11 4c 24 18       movups %xmm1,0x18(%r12)
    ec66:       41 0f 11 44 24 08       movups %xmm0,0x8(%r12)
    ec6c:       4c 8b 75 b8             mov    -0x48(%rbp),%r14
            let pair_mut = self.raw.pair as *mut (K, V);
            ptr::write(pair_mut, (key, value));

For (String, usize), it's:

    pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
        unsafe {
            *self.raw.hash = hash.inspect();
    f670:   4d 89 20                mov    %r12,(%r8)
    f673:   48 8b 45 90             mov    -0x70(%rbp),%rax
    f677:   49 89 06                mov    %rax,(%r14)
    f67a:   f3 41 0f 7f 46 08       movdqu %xmm0,0x8(%r14)
    f680:   49 89 5e 18             mov    %rbx,0x18(%r14)
    f684:   48 8b 5d c8             mov    -0x38(%rbp),%rbx
            let pair_mut = self.raw.pair as *mut (K, V);
            ptr::write(pair_mut, (key, value));
    pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
        unsafe {
            *self.raw.hash = hash.inspect();
    f670:   4d 89 20                mov    %r12,(%r8)
    f673:   48 8b 45 90             mov    -0x70(%rbp),%rax
    f677:   49 89 06                mov    %rax,(%r14)
    f67a:   f3 41 0f 7f 46 08       movdqu %xmm0,0x8(%r14)
    f680:   49 89 5e 18             mov    %rbx,0x18(%r14)
    f684:   48 8b 5d c8             mov    -0x38(%rbp),%rbx
            let pair_mut = self.raw.pair as *mut (K, V);
            // ptr::write(pair_mut, (key, value));
            ptr::write(&mut (*pair_mut).0, key);
            ptr::write(&mut (*pair_mut).1, value);

Member

Nice. As usual, the compiler is very smart.
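
For readers who want to reproduce the comparison, here is a self-contained sketch of the two write styles discussed in this thread. `put_pair` and `put_fields` are hypothetical free functions standing in for the body of `put`, and the field-wise variant uses `ptr::addr_of_mut!` rather than the `&mut (*pair_mut).0` form shown in the listings, so that no reference to uninitialized memory is created.

```rust
use std::mem::MaybeUninit;
use std::ptr;

/// Write key and value as one tuple.
///
/// Safety: `slot` must be valid for writes and properly aligned.
unsafe fn put_pair<K, V>(slot: *mut (K, V), key: K, value: V) {
    unsafe { ptr::write(slot, (key, value)) }
}

/// Write key and value with two separate field writes.
///
/// Safety: `slot` must be valid for writes and properly aligned.
unsafe fn put_fields<K, V>(slot: *mut (K, V), key: K, value: V) {
    unsafe {
        ptr::write(ptr::addr_of_mut!((*slot).0), key);
        ptr::write(ptr::addr_of_mut!((*slot).1), value);
    }
}

fn main() {
    let mut a = MaybeUninit::<(String, usize)>::uninit();
    let mut b = MaybeUninit::<(String, usize)>::uninit();
    unsafe {
        put_pair(a.as_mut_ptr(), String::from("k"), 1);
        put_fields(b.as_mut_ptr(), String::from("k"), 1);
        // Both slots are now fully initialized and hold equal pairs.
        assert_eq!(a.assume_init(), b.assume_init());
    }
}
```

Both variants leave the slot in the same state; whether they also compile to the same instructions depends on the compiler version and the types involved, as the disassembly above shows for the cases that were checked.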

@bluss
Member

bluss commented Sep 24, 2016

This subset is the .contains_key() benchmarks (retrieving the key but not the value).
The slowdown of a few percent reproduces on my machine too (mine is closer to 2%).

 lookup_10_000_exist         127,965           131,684                  3,719    2.91% 
 lookup_10_000_noexist       143,792           143,362                   -430   -0.30% 

So this would be the drawback, where the old layout had better cache usage. It seems ok to give this up in return for the rest?

@arthurprs
Contributor Author

.keys() and .values() should be slower in this layout, but I can't reproduce it.

@arthurprs
Contributor Author

arthurprs commented Sep 24, 2016

Results for x86

➜  hashmap2 git:(layout) ✗ cargo benchcmp hhkkvv:: hhkvkv:: x86.txt
 name                        hhkkvv:: ns/iter  hhkvkv:: ns/iter  diff ns/iter   diff % 
 grow_10_000                 1,298,744         1,197,093             -101,651   -7.83% 
 grow_big_value_10_000       4,285,679         3,887,095             -398,584   -9.30% 
 grow_fnv_10_000             434,184           419,664                -14,520   -3.34% 
 insert_100                  5,256             4,897                     -359   -6.83% 
 insert_1000                 47,448            47,906                     458    0.97% 
 insert_100_000              6,955,971         6,586,020             -369,951   -5.32% 
 insert_10_000               544,478           530,413                -14,065   -2.58% 
 insert_int_bigvalue_10_000  1,441,801         1,178,893             -262,908  -18.23% 
 insert_str_10_000           631,572           596,395                -35,177   -5.57% 
 insert_string_10_000        1,413,129         1,384,202              -28,927   -2.05% 
 iter_keys_10_000            56,995            55,921                  -1,074   -1.88% 
BUSTED iter_keys_big_value_10_000  67,816            60,087                  -7,729  -11.40% 
 iter_values_10_000          62,525            55,809                  -6,716  -10.74% 
 iterate_10_000              62,070            53,937                  -8,133  -13.10% 
 lookup_100_000              334,076           313,012                -21,064   -6.31% 
 lookup_100_000_bigvalue     325,324           319,972                 -5,352   -1.65% 
 lookup_10_000               270,232           263,861                 -6,371   -2.36% 
 lookup_10_000_bigvalue      288,415           270,581                -17,834   -6.18% 
 lookup_10_000_exist         252,338           248,224                 -4,114   -1.63% 
 lookup_10_000_noexist       273,254           272,914                   -340   -0.12% 
 lookup_1_000_000            262,000           259,096                 -2,904   -1.11% 
 lookup_1_000_000_bigvalue   275,820           265,966                 -9,854   -3.57% 
 merge_shuffle               1,664,975         1,542,400             -122,575   -7.36% 
 merge_simple                47,805,889        36,244,422         -11,561,467  -24.18% 
 new                         10                9                           -1  -10.00% 
 with_capacity_10e5          2,496             2,555                       59    2.36%

x86 again (with usize hashes from #36595, thus 31 hash bits)
This one is interesting because I was expecting a regression, but that's not the case.

➜  hashmap2 git:(layout) ✗ cargo benchcmp hhkkvv:: hhkvkv:: x86.txt                 
 name                        hhkkvv:: ns/iter  hhkvkv:: ns/iter  diff ns/iter   diff % 
 grow_10_000                 1,274,315         1,123,181             -151,134  -11.86% 
 grow_big_value_10_000       4,303,715         4,018,353             -285,362   -6.63% 
 grow_fnv_10_000             382,259           352,470                -29,789   -7.79% 
 insert_100                  4,923             4,792                     -131   -2.66% 
 insert_1000                 46,183            44,468                  -1,715   -3.71% 
 insert_100_000              7,096,014         6,078,014           -1,018,000  -14.35% 
 insert_10_000               517,265           507,215                -10,050   -1.94% 
 insert_int_bigvalue_10_000  1,401,856         1,175,129             -226,727  -16.17% 
 insert_str_10_000           598,338           586,628                -11,710   -1.96% 
 insert_string_10_000        1,365,544         1,358,503               -7,041   -0.52% 
 iter_keys_10_000            59,629            53,578                  -6,051  -10.15% 
BUSTED iter_keys_big_value_10_000  73,169            65,105                  -8,064  -11.02% 
 iter_values_10_000          81,068            52,079                 -28,989  -35.76% 
 iterate_10_000              85,855            53,962                 -31,893  -37.15% 
 lookup_100_000              313,490           299,432                -14,058   -4.48% 
 lookup_100_000_bigvalue     309,488           302,861                 -6,627   -2.14% 
 lookup_10_000               256,165           250,370                 -5,795   -2.26% 
 lookup_10_000_bigvalue      270,559           256,912                -13,647   -5.04% 
 lookup_10_000_exist         249,432           241,687                 -7,745   -3.11% 
 lookup_10_000_noexist       272,390           272,683                    293    0.11% 
 lookup_1_000_000            261,079           252,781                 -8,298   -3.18% 
 lookup_1_000_000_bigvalue   265,090           253,114                -11,976   -4.52% 
 merge_shuffle               1,580,698         1,423,032             -157,666   -9.97% 
 merge_simple                44,202,722        29,126,884         -15,075,838  -34.11% 
 new                         9                 9                            0    0.00% 
 with_capacity_10e5          1,264             1,349                       85    6.72%

@bluss
Member

bluss commented Sep 24, 2016

Maybe with bigger hashmaps? To make sure it's well out of the cpu cache size.

@arthurprs
Contributor Author

arthurprs commented Sep 24, 2016

After the 3000th look, I finally saw that the iter_keys_big_value benchmark was busted. Here are several others for good measure:

➜  hashmap2 git:(layout) ✗ cargo benchcmp hhkkvv:: hhkvkv:: bench.txt                                           
 name                           hhkkvv:: ns/iter  hhkvkv:: ns/iter  diff ns/iter   diff % 
 iter_keys_100_000              391,677           382,839                 -8,838   -2.26% 
 iter_keys_1_000_000            10,797,360        10,209,898            -587,462   -5.44% 
 iter_keys_big_value_100_000    414,736           662,255                247,519   59.68% 
 iter_keys_big_value_1_000_000  10,147,837        12,067,938           1,920,101   18.92% 
 iter_values_100_000            440,445           377,080                -63,365  -14.39% 
 iter_values_1_000_000          10,931,844        9,979,173             -952,671   -8.71% 
 iterate_100_000                428,644           388,509                -40,135   -9.36% 
 iterate_1_000_000              11,065,419        10,042,427          -1,022,992   -9.24%

Contributor

@durka durka left a comment

If the potential downside is wasted space, shouldn't there be some memory benchmarks as well?

 ///
-/// This design uses less memory and is a lot faster than the naive
+/// This design uses is a lot faster than the naive
Contributor

"uses is"

 ///
-/// This design uses less memory and is a lot faster than the naive
+/// This design uses is a lot faster than the naive
 /// `Vec<Option<u64, K, V>>`, because we don't pay for the overhead of an
Contributor

is this supposed to say Vec<Option<(u64, K, V)>>?

@@ -48,12 +48,14 @@ const EMPTY_BUCKET: u64 = 0;
 /// which will likely map to the same bucket, while not being confused
 /// with "empty".
 ///
-/// - All three "arrays represented by pointers" are the same length:
+/// - All two "arrays represented by pointers" are the same length:
Contributor

"Both"?

Contributor Author

Thanks, I fixed all three.

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Sep 26, 2016
@alexcrichton
Member

cc @pczarn

@arthurprs
Contributor Author

@Veedrac PTAL

@Veedrac
Contributor

Veedrac commented Sep 29, 2016

@arthurprs Seems like a solid improvement.

@alexcrichton
Member

@rfcbot fcp merge

Looks like we've got solid wins all around to consider merging?

@rfcbot

rfcbot commented Oct 3, 2016

FCP proposed with disposition to merge. Review requested from:

No concerns currently listed.
See this document for info about what commands tagged team members can give me.

@aturon
Member

aturon commented Oct 4, 2016

I'm happy to go along with the experts here.

@bluss
Member

bluss commented Oct 5, 2016

Has anyone evaluated this on a real workload? The first one that comes to mind is of course rustc.

@arthurprs
Contributor Author

I'm not familiar enough with the bootstrap process, but if somebody provides some guidance, I could do it.

@bluss
Member

bluss commented Oct 5, 2016

Tip from simulacrum: we can use https://github.com/rust-lang-nursery/rustc-benchmarks to test the impact on rustc. Rustc building itself is a heavier (and more important?) benchmark, but I don't know exactly what to time there.

@alexcrichton
Member

@arthurprs short of timing an execution of make, applying your patch, and timing it again, there's not a great way to benchmark the bootstrap. I'd be fine assuming that this'll be a win, and we can always revert if it causes a regression. We'll just want to keep a close eye on the online numbers.

@pczarn
Contributor

pczarn commented Oct 5, 2016

@arthurprs You can run make TIME_PASSES=1 before and after the patch, then compare the results side by side. Keep in mind that compilations of libstd may not be comparable because the patch changes libstd's code.

I agree with changing the memory layout. However, the tradeoffs are subtle. The benefits and drawbacks of this change depend on circumstances such as the sizes of keys and values.

There is one more drawback that you didn't describe in detail. Let's say the user wants to iterate through HashMap's keys. The user will access every key, which will waste some memory and cache bandwidth on loading the map's values. So neither layout is truly cache conscious. Both are cache conscious in different ways. Of course you have to decide if the efficiency of the keys() and values() iterators is important enough to give the change to the layout a second thought.
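
A back-of-the-envelope sketch of the keys() drawback described above, counting how many cache lines a key-only scan touches in each layout. The 64-byte line size, the helper `key_scan_lines`, and the choice of `u64` keys with `[u64; 10]` values (borrowed from the disassembly discussion earlier in this thread) are assumptions for illustration; the model ignores the hash array, prefetching and empty buckets.

```rust
use std::mem::size_of;

const CACHE_LINE: usize = 64; // assumption: common x86-64 cache line size

/// Rough number of cache lines touched when reading `n` keys that are
/// `stride` bytes apart. Once the stride reaches a full line, every key
/// lands on its own line.
fn key_scan_lines(n: usize, stride: usize) -> usize {
    if stride >= CACHE_LINE {
        n
    } else {
        (n * stride + CACHE_LINE - 1) / CACHE_LINE
    }
}

fn main() {
    type K = u64;
    type V = [u64; 10];
    let n = 100_000;

    // Old layout: keys are packed back to back in their own array.
    let old = key_scan_lines(n, size_of::<K>());
    // Proposed layout: each key is followed by its value, so the stride is the pair size.
    let new = key_scan_lines(n, size_of::<(K, V)>());

    println!("old layout: {} lines, proposed layout: {} lines", old, new);
}
```

With small values the two strides are close and the difference mostly disappears, which fits the observation that only the big-value key iteration benchmarks regressed.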

I think the benefits outweigh the drawbacks, because accessing single map entries is very common.

@arthurprs
Contributor Author

I don't think those tests will be feasible on my laptop, especially considering the trial and error involved.

I think the benefits far outweigh the drawbacks. There's potential to waste some padding, but in the real world that's frequently not the case (try using GitHub search in the rust repo and skim some pages). We shouldn't optimize for keys() and values(), and those will definitely take a hit (as per the benchmarks).

@bors
Contributor

bors commented Oct 7, 2016

☔ The latest upstream changes (presumably #36753) made this pull request unmergeable. Please resolve the merge conflicts.

@brson brson added the relnotes Marks issues that should be documented in the release notes of the next release. label Oct 10, 2016
@brson
Contributor

brson commented Oct 10, 2016

Nice work @arthurprs !

bors added a commit that referenced this pull request Oct 12, 2016
Rollup of 10 pull requests

- Successful merges: #36692, #36743, #36762, #36991, #37023, #37050, #37056, #37064, #37066, #37067
- Failed merges:
@bors
Contributor

bors commented Oct 12, 2016

⌛ Testing commit c5068a4 with merge e33334f...

@bors
Contributor

bors commented Oct 12, 2016

💔 Test failed - auto-linux-64-nopt-t

@arthurprs
Contributor Author

I'll fix it.

@arthurprs
Contributor Author

Travis is happy again.

@bluss
Member

bluss commented Oct 12, 2016

@arthurprs Have you looked into why the buildbot tests failed (log link)? The failure is in the big test suite, and I don't see the PR changing anything there. It was unfortunately green on Travis before, and the buildbot build still failed.

@bluss
Member

bluss commented Oct 12, 2016

It's the nopt builder, so presumably related to debug assertions?

@arthurprs
Contributor Author

Yes, I should have said "CI should be happy".
I had to use a couple of wrapping ops to make sure the overflow happened at a specific spot on debug builds.
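
As a hedged illustration of that idea (not the code from this PR): plain `*` and `+` panic on overflow in debug builds, while the `wrapping_*` and `overflowing_*` methods never panic, so the overflow can be reported at one explicit place. `table_bytes` and its parameters are hypothetical names.

```rust
/// Hypothetical helper: total bytes for `capacity` buckets, or None on overflow.
fn table_bytes(capacity: usize, hash_size: usize, pair_size: usize) -> Option<usize> {
    // wrapping_mul never panics, even with debug assertions enabled,
    // so intermediate sizes may silently wrap here.
    let hashes_bytes = capacity.wrapping_mul(hash_size);
    let pairs_bytes = capacity.wrapping_mul(pair_size);
    let (total, overflowed) = hashes_bytes.overflowing_add(pairs_bytes);

    // The single, explicit spot where overflow is detected.
    let exact = hash_size
        .checked_add(pair_size)
        .and_then(|bucket| capacity.checked_mul(bucket));

    if overflowed || exact.is_none() {
        None
    } else {
        Some(total)
    }
}

fn main() {
    assert_eq!(table_bytes(8, 8, 16), Some(192));
    assert_eq!(table_bytes(usize::MAX, 8, 16), None);
}
```

The behavior is then identical in debug and release builds, which is the property a no-optimization builder with debug assertions would exercise.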

@alexcrichton
Member

@bors: r+

@bors
Contributor

bors commented Oct 13, 2016

📌 Commit c435821 has been approved by alexcrichton

@bors
Contributor

bors commented Oct 13, 2016

⌛ Testing commit c435821 with merge 2e0a3dc...

@bors
Contributor

bors commented Oct 13, 2016

💔 Test failed - auto-linux-cross-opt

@arthurprs
Contributor Author

I'm not sure it's related to the PR.

@alexcrichton
Member

@bors: retry

@bors
Contributor

bors commented Oct 14, 2016

⌛ Testing commit c435821 with merge 2353987...

@bors
Contributor

bors commented Oct 14, 2016

💔 Test failed - auto-win-gnu-32-opt-rustbuild

@arielb1
Contributor

arielb1 commented Oct 14, 2016

error: pretty-printing failed in round 0 revision None
status: exit code: 3221225477 (STATUS_ACCESS_VIOLATION)

@arielb1
Contributor

arielb1 commented Oct 14, 2016

@bors retry

@bors
Contributor

bors commented Oct 14, 2016

⌛ Testing commit c435821 with merge 40cd1fd...

bors added a commit that referenced this pull request Oct 14, 2016
Cache conscious hashmap table

@bors bors merged commit c435821 into rust-lang:master Oct 14, 2016
@rfcbot

rfcbot commented Oct 17, 2016

All relevant subteam members have reviewed. No concerns remain.

@rfcbot

rfcbot commented Oct 24, 2016

It has been one week since all blocks to the FCP were resolved.
