
Use a translated bitmap to enable more efficient marking #2927

Merged
60 commits merged into master from gabor/translate on Dec 1, 2021

Conversation

@ggreif (Contributor) commented Nov 22, 2021

Since the mark-and-compact GC is currently less performant than the copying one, it makes sense to tweak its performance. Here I take the idea from @ulan and run with it. Basically, we now pass the object's absolute word number to get_bit/set_bit and thus avoid some arithmetic, as well as passing heap_base along.
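
To make the mechanism concrete, here is an illustrative before/after of the bit-index computation (hypothetical helper names, not the actual motoko-rts code):

```rust
// Illustrative before/after of the marking call (hypothetical helper names,
// not the actual motoko-rts signatures). WORD_SIZE is 4 on wasm32.
const WORD_SIZE: u32 = 4;

// Before: the mark-bit index is relative to heap_base, so heap_base has to be
// passed along (or reloaded) at every call site.
fn mark_bit_index_old(obj: u32, heap_base: u32) -> u32 {
    (obj - heap_base) / WORD_SIZE
}

// After: the bitmap base pointer is translated back once at GC start, so the
// object's absolute word number can be used directly.
fn mark_bit_index_new(obj: u32) -> u32 {
    obj / WORD_SIZE
}
```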

Marking was sped up by 4 instructions. This is compounded by the fact that we no longer pass the heap base to the marking routine mark_object:

  • mark_object: 120 instructions now vs. 123 before
  • every call site of mark_object: 1 instruction less.

This also speeds up the iteration over the bitmap after marking (BitmapIter::next now has 92 instructions, compared to the former 110). In particular, the inner loop is now more compact.

Fixes #2892.

Benchmarks will be included in #2927 when available.

Further optimisation opportunities

  • show that due to 2-word object allocation the bitmap can be halved (not clear how)
  • pass tag (and first word?) to object allocator — Gabor/tag on alloc #2998
  • when marking, use the skewed pointer and use pointer shifting, buffer padding tricks
  • inner loop in BitmapIter::next can be eliminated.

@osa1 (Contributor) commented Nov 23, 2021

As far as I understand, the idea here is to move the bitmap address back, so that when we compute an object's bit index with object_addr / WORD_SIZE and add (the byte part of) that to the bitmap address, we land on the right byte in the bitmap.

This way we don't need to allocate bits for the static heap, as suggested in #2892. It's faster and doesn't come with a space overhead. I like it.
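
A minimal sketch of that translation, assuming heap_base is 32-byte aligned so the back-translation is byte-exact (illustrative names, not the actual motoko-rts API):

```rust
// Minimal sketch of a translated mark bitmap. WORD_SIZE is 4 on wasm32.
const WORD_SIZE: u32 = 4;

struct MarkBitmap {
    // Points heap_base / WORD_SIZE / 8 bytes *before* the allocated bitmap
    // payload, so an object's absolute word number indexes it directly.
    translated_base: *mut u8,
}

impl MarkBitmap {
    unsafe fn new(bitmap_payload: *mut u8, heap_base: u32) -> Self {
        debug_assert_eq!(heap_base % 32, 0); // one bitmap byte covers 8 words = 32 bytes of heap
        let back_offset = (heap_base / WORD_SIZE / 8) as usize;
        MarkBitmap { translated_base: bitmap_payload.sub(back_offset) }
    }

    // `word_idx` is the object's absolute word number, i.e. obj_addr / WORD_SIZE.
    unsafe fn set_bit(&mut self, word_idx: u32) {
        let byte = self.translated_base.add((word_idx / 8) as usize);
        *byte |= 1 << (word_idx % 8);
    }

    unsafe fn get_bit(&self, word_idx: u32) -> bool {
        let byte = self.translated_base.add((word_idx / 8) as usize);
        *byte & (1 << (word_idx % 8)) != 0
    }
}
```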

@github-actions bot commented Nov 23, 2021

Comparing from 7c07c21 to 2989412:
In terms of gas, no changes are observed in 3 tests.
In terms of size, 3 tests regressed and the mean change is +0.0%.

@ggreif (Contributor, Author) commented Nov 23, 2021

Currently I am fighting with the heap's alignment when doing the tests. Here is the principal problem:
When heap_base is not divisible by 32, then adding heap_base % WORD_SIZE (effectively) to the mark location (set_mark's argument) can cause two effects:

  1. marking can happen beyond the allocated mark blob
  2. mark bits appear in shifted positions when iterated over in 64-bit chunks.

The latter will cause the sweep phase to reclaim the wrong objects (or reach into the middle of them, looking for heap tags). (The former could easily be compensated for by bumping the mark blob's size by one.)
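
For concreteness, a tiny illustration of the alignment requirement (hypothetical addresses, not PR code):

```rust
// The mark bit of the word at heap_base has index heap_base / WORD_SIZE, which
// starts a fresh bitmap byte only when heap_base is a multiple of
// 8 * WORD_SIZE = 32. Otherwise the heap's mark bits sit at shifted positions
// within the bitmap bytes (and the last few can spill past the end of the blob).
const WORD_SIZE: u32 = 4;

fn first_mark_bit_offset(heap_base: u32) -> u32 {
    (heap_base / WORD_SIZE) % 8
}

fn main() {
    assert_eq!(first_mark_bit_offset(0x10000), 0); // divisible by 32: byte-exact
    assert_eq!(first_mark_bit_offset(0x10004), 1); // not divisible by 32: shifted by one bit
}
```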

Aligning the heap_base on the IC is pretty much trivial, but in the test environment it is trickier, since a Vec<u8>'s payload can end up being allocated at any address that is divisible by 8.

One could take the mis-alignment into account when finding the addressed objects from the bitmap, but that would add computational overhead to the sweep phase and we don't want that.


UPDATE: I finally settled the matter by allocating a bigger vector, so that we can find a comfortable location for the intended heap inside it that is, by construction, properly aligned.
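
A minimal sketch of that over-allocation trick (illustrative, not the actual motoko-rts-tests code):

```rust
// Reserve extra bytes and start the simulated heap at the first 32-byte-aligned
// address inside the backing vector; the Vec's buffer does not move when the
// Vec value is moved, so the computed offset stays valid.
fn make_aligned_heap(size: usize) -> (Vec<u8>, usize) {
    let backing = vec![0u8; size + 31];      // enough slack to slide up to 31 bytes
    let addr = backing.as_ptr() as usize;
    let offset = (32 - addr % 32) % 32;      // bytes to skip to reach 32-byte alignment
    (backing, offset)                        // heap occupies backing[offset..offset + size]
}

fn main() {
    let (backing, offset) = make_aligned_heap(1 << 20);
    assert_eq!((backing.as_ptr() as usize + offset) % 32, 0);
}
```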

@crusso (Contributor) commented Nov 23, 2021

Looking good! Thanks for grasping the nettle!

@@ -24,13 +24,18 @@ pub(crate) static mut LAST_HP: u32 = 0;

 // Provided by generated code
 extern "C" {
-    pub(crate) fn get_heap_base() -> u32;
+    fn get_heap_base() -> u32;
@ggreif (Contributor, Author) commented Nov 23, 2021

@osa1 which macro generates this function? Maybe we can generate get_aligned_heap_base as well?

A contributor replied:

This function is generated by the codegen.

@osa1 (Contributor) left a comment

Thanks @ggreif, I really like this idea.

Since the whole point here is performance, it would be good to see benchmark results.

One thing that is worse in this version than the previous one is the iteration over bitmap words. It can be improved as explained in my inline comments. I think the performance may still be better than master as it is, but the bitmap can be a few MiBs (1 MiB of bitmap addresses 32 MiB of heap). For a heap near full, it will be near 134M, which requires ~16M 64-bit word iterations. So I think improving cycles there may be worth it.
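
For reference, those figures follow directly from the one-bit-per-word layout; a quick check, assuming a full 4 GiB wasm32 heap:

```rust
// Quick check of the sizes above, assuming a full 4 GiB wasm32 heap.
fn main() {
    let heap_bytes: u64 = 4 << 30;          // 4 GiB
    let bitmap_bytes = heap_bytes / 4 / 8;  // one mark bit per 4-byte word
    let word_iters = bitmap_bytes / 8;      // 64-bit words the iterator visits
    println!("{} bitmap bytes (~134M), {} word iterations (~16.8M)", bitmap_bytes, word_iters);
}
```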

@ggreif ggreif marked this pull request as ready for review November 24, 2021 15:00
@ggreif ggreif requested review from ulan, crusso and osa1 November 24, 2021 15:20
@ggreif ggreif changed the title from "WIP: translated bitmap" to "Use a translated bitmap to enable more efficient marking" on Nov 24, 2021
@ggreif (Contributor, Author) commented Nov 29, 2021

The PR looks good to me, but I think I'll need to defer to someone who actually speaks Rust.

@ulan can you have a look?

@osa1 (Contributor) left a comment

Thanks @ggreif. Added inline comments.

ggreif and others added 8 commits on November 30, 2021 (several co-authored by Ömer Sinan Ağacan <omeragacan@gmail.com>), including "review feedback".
@ggreif ggreif requested a review from osa1 November 30, 2021 19:44
@osa1 (Contributor) left a comment

LGTM, but it would be good to see benchmarks. Marking should be more efficient now, but I can't tell whether the new bitmap iterator will be faster or slower, and what its effect on the overall GC will be.

It would be helpful to see the changes in the PR description: for example, the number of instructions in marking or in one iteration of the bitmap iterator, or, even better, actual benchmark results.

What happens to the TODOs in the PR description? If they're for another PR, should we remove them? They will end up in the commit message when this is merged.

                    return bit_idx;
                } else {
                    let shift_amt = self.current_word.trailing_zeros();
                    self.current_word >>= shift_amt;
                    self.bits_left -= shift_amt;
                    self.current_bit_idx += shift_amt;
@ggreif (Contributor, Author) commented:

Wow, we missed an optimisation! No inner loop is necessary, just an if, since at this point we know that the next bit is set (and self.current_word != 0), so we can anticipate the next decisions. We can just advance and return.

@ggreif (Contributor, Author) commented Dec 1, 2021

Here is a quick implementation of the concept, for posterity.

diff --git a/rts/motoko-rts/src/gc/mark_compact/bitmap.rs b/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
index 2607a4baa..234f74de8 100644
--- a/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
+++ b/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
@@ -167,8 +167,8 @@ impl BitmapIter {
 
         // Outer loop iterates 64-bit words
         loop {
-            // Inner loop iterates bits in the current word
-            while self.current_word != 0 {
+            // Inner conditional examines the least significant bit(s) in the current word
+            if self.current_word != 0 {
                 if self.current_word & 0b1 != 0 {
                     let bit_idx = self.current_bit_idx;
                     self.current_word >>= 1;
@@ -177,7 +177,10 @@ impl BitmapIter {
                 } else {
                     let shift_amt = self.current_word.trailing_zeros();
                     self.current_word >>= shift_amt;
-                    self.current_bit_idx += shift_amt;
+                    self.current_word >>= 1;
+                    let bit_idx = self.current_bit_idx + shift_amt;
+                    self.current_bit_idx = bit_idx + 1;
+                    return bit_idx;
                 }
             }

I won't commit this, as the double shift looks ugly. Maybe I can come up with a more elegant way. But this already passes the tests.
NB: the double shift is necessary because ctz can return 63, and 63 + 1 == 64 shifts are undefined behaviour; Rust traps on this.
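
A small standalone illustration of the two-step shift (not PR code):

```rust
// `trailing_zeros` can return 63 (only the top bit set); a single shift by
// 63 + 1 == 64 would overflow the shift amount (a panic/trap in Rust), so the
// shift is done in two steps: first by the dynamic count, then by a constant 1.
fn advance_past_found_bit(word: u64) -> u64 {
    debug_assert!(word != 0);
    let shift_amt = word.trailing_zeros();
    (word >> shift_amt) >> 1
}

fn main() {
    assert_eq!(advance_past_found_bit(1u64 << 63), 0);     // shift_amt == 63: still fine
    assert_eq!(advance_past_found_bit(0b1010_0000), 0b10); // bits 5 and 7 set: bit 7 remains, shifted down
}
```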

@ggreif ggreif added the automerge-squash ("When ready, merge (using squash)") label on Dec 1, 2021
@ggreif (Contributor, Author) commented Dec 1, 2021

> LGTM, but it would be good to see benchmarks. Marking should be more efficient now, but I can't tell whether the new bitmap iterator will be faster or slower, and what its effect on the overall GC will be.

Yes, there can be surprises, although unlikely. Benchmarks will be added to the MR. I thought I could do that today, but unfortunately I had a canine emergency.

> It would be helpful to see the changes in the PR description: for example, the number of instructions in marking or in one iteration of the bitmap iterator, or, even better, actual benchmark results.

Added instruction balances.

> What happens to the TODOs in the PR description? If they're for another PR, should we remove them? They will end up in the commit message when this is merged.

I nuked the TODOs as these were meant to track my progress, and they are now irrelevant.

@mergify mergify bot merged commit a7868cd into master Dec 1, 2021
@mergify mergify bot deleted the gabor/translate branch December 1, 2021 18:46
@mergify mergify bot removed the automerge-squash label Dec 1, 2021
@ggreif (Contributor, Author) commented Dec 8, 2021

As promised, here come the performance counters for the change:

  • baseline is release 0.6.16
  • compared against 0.6.16 with a7868cd reverted
[nix-shell:~/motoko]$ tail compacting-baseline_perf.csv compacting-a7868cd1f13_perf.csv  
==> compacting-baseline_perf.csv <==
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29548225,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29572634,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29318871,224,37,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29305196,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29326984,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29186706,223,26,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29178490,225,23,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29159337,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29173610,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,32425028,224,200,16

==> compacting-a7868cd1f13_perf.csv <==
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30624396,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30649781,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30397508,224,37,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30384817,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30407620,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30268366,223,26,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30261620,225,23,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30243489,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30258762,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,33511297,224,200,16

A few quick percentages:

[nix-shell:~/motoko]$ ghc -e '33511297/32425028*100'
103.35009425435192

[nix-shell:~/motoko]$ ghc -e '30258762/29173610*100'
103.71963565702016

[nix-shell:~/motoko]$ ghc -e '30624396/29548225*100'
103.6420834077174

We would regress more than 3% (on performing pure GC) with this PR reverted!


Here is a more intuitive graph (with the methodology from #2967):
[graph: would-regress]
With more profiles present, the GC effort clearly dominates the profile-addition work.

mergify bot pushed a commit that referenced this pull request Dec 9, 2021
This PR aims to improve the scanning of the marking bitmap. When the lowest bit in the current (64-bit) word is unset, we now swallow all the trailing zeros, update the bit position and the current word, and return the former.
This amounts to unrolling the inner loop once, because we know that we'll encounter another set bit in the current word.
Unrolling comes at the cost of a few more instructions in the else leg, but since we `return`, we can eliminate the inner `loop` altogether, which is a win, especially for sparse bitmaps.

## Benchmarks

This optimisation was hinted at in #2927, but not implemented there due to the lack of benchmarking data. Now the `cancan` profile creation benchmark is available, and the GC-relevant cycle-count improvement is
``` shell
[nix-shell:~/motoko]$ ghc -e "100-28402346/29159337*100"
2.5960501090954153
```
about 2.6% compared to the baseline 0.6.16 release. More benchmark data and a graph are added in #2952.

## Implementation concerns

We have to use two shifts (once with a dynamic count and once with a constant count) because adding one to the dynamic count could result in a shift of 64 bits, which is undefined behaviour, and Rust traps on that.

We could also refrain from testing the lowest bit and go for the `ctz` directly, but that could result in worse code generated by `wasmtime` (?). OTOH that would probably eliminate the `if`, and branchless code is good!
N.B.: For the branchless optimisation the benchmarks look promising, but I prefer to merge this first.
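
For the record, a hedged sketch of that branchless variant (illustrative only, not the merged iterator):

```rust
// Skip the lowest-bit test and use trailing_zeros unconditionally: it returns
// 0 when the lowest bit is already set, so one code path handles both cases.
// Returns (remaining word, new cursor position, index of the found mark bit).
fn next_mark_bit(current_word: u64, current_bit_idx: u32) -> (u64, u32, u32) {
    debug_assert!(current_word != 0);
    let shift_amt = current_word.trailing_zeros();  // 0 if bit 0 is set
    let bit_idx = current_bit_idx + shift_amt;      // absolute index of the found bit
    let rest = (current_word >> shift_amt) >> 1;    // two shifts, since shift_amt may be 63
    (rest, bit_idx + 1, bit_idx)
}

fn main() {
    // Bits 0 and 5 set: the first call finds bit 0, the second finds bit 5.
    let (rest, cursor, found) = next_mark_bit(0b10_0001, 0);
    assert_eq!(found, 0);
    let (_, _, found) = next_mark_bit(rest, cursor);
    assert_eq!(found, 5);
}
```
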
@ggreif ggreif added the opportunity ("More optimisation opportunities inside") label on Jun 20, 2022

Successfully merging this pull request may close these issues.

Marking in mark-compact GC can be optimized (one-two instructions per live object)