
Use a translated bitmap to enable more efficient marking #2927

Merged
60 commits merged into master from gabor/translate on Dec 1, 2021

Conversation

@ggreif (Contributor) commented Nov 22, 2021

Since the mark-and-compact GC is currently less performant than the copying one, it makes sense to tweak its performance. Here I take the idea from @ulan and run with it. Basically, we now pass the object's absolute word number to get_bit/set_bit and thus avoid some arithmetic, as well as passing heap_base along.
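
To make the mechanism concrete, here is an illustrative before/after of the bit-index computation (hypothetical helper names, not the actual motoko-rts code):

```rust
// Illustrative before/after of the marking call (hypothetical helper names,
// not the actual motoko-rts signatures). WORD_SIZE is 4 on wasm32.
const WORD_SIZE: u32 = 4;

// Before: the mark-bit index is relative to heap_base, so heap_base has to be
// passed along (or reloaded) at every call site.
fn mark_bit_index_old(obj: u32, heap_base: u32) -> u32 {
    (obj - heap_base) / WORD_SIZE
}

// After: the bitmap base pointer is translated back once at GC start, so the
// object's absolute word number can be used directly.
fn mark_bit_index_new(obj: u32) -> u32 {
    obj / WORD_SIZE
}
```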

Marking was sped up by 4 instructions. This is compounded by the fact that we no longer pass the heap base to the marking routine mark_object:

  • mark_object: 120 instructions now vs. 123 before
  • every call site of mark_object: 1 instruction less.

This also speeds up the iteration over the bitmap after marking (BitmapIter::next now has 92 instructions, compared to the former 110). In particular, the inner loop is now more compact.

Fixes #2892.

Benchmarks will be included in #2927 when available.

Further optimisation opportunities

  • show that due to 2-word object allocation the bitmap can be halved (not clear how)
  • pass tag (and first word?) to object allocator — Gabor/tag on alloc #2998
  • when marking, use the skewed pointer and use pointer shifting, buffer padding tricks
  • inner loop in BitmapIter::next can be eliminated.

@osa1 (Contributor) commented Nov 23, 2021

As far as I understand, the idea here is to move the bitmap address back, so that when we compute an object's bit index with object_addr / WORD_SIZE and add (the byte part of) that to the bitmap address, we land on the right byte in the bitmap.

This way we don't need to allocate bits for the static heap, as suggested in #2892. It's faster and doesn't come with a space overhead. I like it.
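
A minimal sketch of that translation, assuming heap_base is 32-byte aligned so the back-translation is byte-exact (illustrative names, not the actual motoko-rts API):

```rust
// Minimal sketch of a translated mark bitmap. WORD_SIZE is 4 on wasm32.
const WORD_SIZE: u32 = 4;

struct MarkBitmap {
    // Points heap_base / WORD_SIZE / 8 bytes *before* the allocated bitmap
    // payload, so an object's absolute word number indexes it directly.
    translated_base: *mut u8,
}

impl MarkBitmap {
    unsafe fn new(bitmap_payload: *mut u8, heap_base: u32) -> Self {
        debug_assert_eq!(heap_base % 32, 0); // one bitmap byte covers 8 words = 32 bytes of heap
        let back_offset = (heap_base / WORD_SIZE / 8) as usize;
        MarkBitmap { translated_base: bitmap_payload.sub(back_offset) }
    }

    // `word_idx` is the object's absolute word number, i.e. obj_addr / WORD_SIZE.
    unsafe fn set_bit(&mut self, word_idx: u32) {
        let byte = self.translated_base.add((word_idx / 8) as usize);
        *byte |= 1 << (word_idx % 8);
    }

    unsafe fn get_bit(&self, word_idx: u32) -> bool {
        let byte = self.translated_base.add((word_idx / 8) as usize);
        *byte & (1 << (word_idx % 8)) != 0
    }
}
```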

@github-actions bot commented Nov 23, 2021

Comparing from 7c07c21 to 2989412:
In terms of gas, no changes are observed in 3 tests.
In terms of size, 3 tests regressed and the mean change is +0.0%.

@ggreif (Contributor, Author) commented Nov 23, 2021

Currently I am fighting with the heap's alignment when doing the tests. Here is the principal problem:
When heap_base is not divisible by 32, then adding heap_base % WORD_SIZE (effectively) to the mark location (set_mark's argument) can cause two effects:

  1. marking can happen beyond the allocated mark blob
  2. mark bits appear in shifted positions when iterated over in 64-bit chunks.

The latter will cause the sweep phase to reclaim the wrong objects (or reach into the middle of them, looking for heap tags). (The former could easily be compensated for by bumping the mark blob's size by one.)
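
For concreteness, a tiny illustration of the alignment requirement (hypothetical addresses, not PR code):

```rust
// The mark bit of the word at heap_base has index heap_base / WORD_SIZE, which
// starts a fresh bitmap byte only when heap_base is a multiple of
// 8 * WORD_SIZE = 32. Otherwise the heap's mark bits sit at shifted positions
// within the bitmap bytes (and the last few can spill past the end of the blob).
const WORD_SIZE: u32 = 4;

fn first_mark_bit_offset(heap_base: u32) -> u32 {
    (heap_base / WORD_SIZE) % 8
}

fn main() {
    assert_eq!(first_mark_bit_offset(0x10000), 0); // divisible by 32: byte-exact
    assert_eq!(first_mark_bit_offset(0x10004), 1); // not divisible by 32: shifted by one bit
}
```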

Aligning the heap_base on the IC is pretty much trivial, but in the test environment it is trickier, since a Vec<u8>'s payload can end up being allocated at any address that is divisible by 8.

One could take the mis-alignment into account when finding the addressed objects from the bitmap, but that would add computational overhead to the sweep phase and we don't want that.


UPDATE: I finally settled the matter by allocating a bigger vector, so that we can find a comfortable location for the intended heap inside it that is, by construction, properly aligned.
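
A minimal sketch of that over-allocation trick (illustrative, not the actual motoko-rts-tests code):

```rust
// Reserve extra bytes and start the simulated heap at the first 32-byte-aligned
// address inside the backing vector; the Vec's buffer does not move when the
// Vec value is moved, so the computed offset stays valid.
fn make_aligned_heap(size: usize) -> (Vec<u8>, usize) {
    let backing = vec![0u8; size + 31];      // enough slack to slide up to 31 bytes
    let addr = backing.as_ptr() as usize;
    let offset = (32 - addr % 32) % 32;      // bytes to skip to reach 32-byte alignment
    (backing, offset)                        // heap occupies backing[offset..offset + size]
}

fn main() {
    let (backing, offset) = make_aligned_heap(1 << 20);
    assert_eq!((backing.as_ptr() as usize + offset) % 32, 0);
}
```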

@crusso (Contributor) commented Nov 23, 2021

Looking good! Thanks for grasping the nettle!

@@ -24,13 +24,18 @@ pub(crate) static mut LAST_HP: u32 = 0;

 // Provided by generated code
 extern "C" {
-    pub(crate) fn get_heap_base() -> u32;
+    fn get_heap_base() -> u32;
@ggreif (Contributor, Author) commented Nov 23, 2021

@osa1 which macro generates this function? Maybe we can generate get_aligned_heap_base as well?

A contributor replied:

This function is generated by the codegen.

@osa1 (Contributor) left a comment

Thanks @ggreif, I really like this idea.

Since the whole point here is performance, it would be good to see benchmark results.

One thing that is worse in this version than the previous one is the iteration over bitmap words. It can be improved as explained in my inline comments. I think the performance may still be better than master as it is, but the bitmap can be a few MiBs (1 MiB of bitmap addresses 32 MiB of heap). For a heap near full, it will be near 134M, which requires ~16M 64-bit word iterations. So I think improving cycles there may be worth it.
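
For reference, those figures follow directly from the one-bit-per-word layout; a quick check, assuming a full 4 GiB wasm32 heap:

```rust
// Quick check of the sizes above, assuming a full 4 GiB wasm32 heap.
fn main() {
    let heap_bytes: u64 = 4 << 30;          // 4 GiB
    let bitmap_bytes = heap_bytes / 4 / 8;  // one mark bit per 4-byte word
    let word_iters = bitmap_bytes / 8;      // 64-bit words the iterator visits
    println!("{} bitmap bytes (~134M), {} word iterations (~16.8M)", bitmap_bytes, word_iters);
}
```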

@ggreif ggreif marked this pull request as ready for review November 24, 2021 15:00
@ggreif ggreif requested review from ulan, crusso and osa1 November 24, 2021 15:20
@ggreif ggreif changed the title from "WIP: translated bitmap" to "Use a translated bitmap to enable more efficient marking" on Nov 24, 2021
@ggreif (Contributor, Author) commented Nov 29, 2021

The PR looks good to me, but I think I'll need to defer to someone who actually speaks Rust.

@ulan can you have a look?

@osa1 (Contributor) left a comment

Thanks @ggreif. Added inline comments.

ggreif and others added 8 commits on November 30, 2021 (several co-authored by Ömer Sinan Ağacan <omeragacan@gmail.com>), including "review feedback".
@ggreif ggreif requested a review from osa1 November 30, 2021 19:44
@osa1 (Contributor) left a comment

LGTM, but it would be good to see benchmarks. Marking should be more efficient now, but I can't tell whether the new bitmap iterator will be faster or slower, and what its effect on the overall GC will be.

It would be helpful to see the changes in the PR description: for example, the number of instructions in marking or in one iteration of the bitmap iterator, or, even better, actual benchmark results.

What happens to the TODOs in the PR description? If they're for another PR, should we remove them? They will end up in the commit message when this is merged.

                    return bit_idx;
                } else {
                    let shift_amt = self.current_word.trailing_zeros();
                    self.current_word >>= shift_amt;
                    self.bits_left -= shift_amt;
                    self.current_bit_idx += shift_amt;
@ggreif (Contributor, Author) commented:

Wow, we missed an optimisation! No inner loop is necessary, just an if, since at this point we know that the next bit is set (and self.current_word != 0), so we can anticipate the next decisions. We can just advance and return.

@ggreif (Contributor, Author) commented Dec 1, 2021

Here is a quick implementation of the concept, for posterity.

diff --git a/rts/motoko-rts/src/gc/mark_compact/bitmap.rs b/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
index 2607a4baa..234f74de8 100644
--- a/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
+++ b/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
@@ -167,8 +167,8 @@ impl BitmapIter {
 
         // Outer loop iterates 64-bit words
         loop {
-            // Inner loop iterates bits in the current word
-            while self.current_word != 0 {
+            // Inner conditional examines the least significant bit(s) in the current word
+            if self.current_word != 0 {
                 if self.current_word & 0b1 != 0 {
                     let bit_idx = self.current_bit_idx;
                     self.current_word >>= 1;
@@ -177,7 +177,10 @@ impl BitmapIter {
                 } else {
                     let shift_amt = self.current_word.trailing_zeros();
                     self.current_word >>= shift_amt;
-                    self.current_bit_idx += shift_amt;
+                    self.current_word >>= 1;
+                    let bit_idx = self.current_bit_idx + shift_amt;
+                    self.current_bit_idx = bit_idx + 1;
+                    return bit_idx;
                 }
             }

I won't commit this, as the double shift looks ugly. Maybe I can come up with a more elegant way. But this already passes the tests.
NB: the double shift is necessary because ctz can return 63, and 63 + 1 == 64 shifts are undefined behaviour; Rust traps on this.
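
A small standalone illustration of the two-step shift (not PR code):

```rust
// `trailing_zeros` can return 63 (only the top bit set); a single shift by
// 63 + 1 == 64 would overflow the shift amount (a panic/trap in Rust), so the
// shift is done in two steps: first by the dynamic count, then by a constant 1.
fn advance_past_found_bit(word: u64) -> u64 {
    debug_assert!(word != 0);
    let shift_amt = word.trailing_zeros();
    (word >> shift_amt) >> 1
}

fn main() {
    assert_eq!(advance_past_found_bit(1u64 << 63), 0);     // shift_amt == 63: still fine
    assert_eq!(advance_past_found_bit(0b1010_0000), 0b10); // bits 5 and 7 set: bit 7 remains, shifted down
}
```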

@ggreif ggreif added the automerge-squash ("When ready, merge (using squash)") label on Dec 1, 2021
@ggreif (Contributor, Author) commented Dec 1, 2021

> LGTM, but it would be good to see benchmarks. Marking should be more efficient now, but I can't tell whether the new bitmap iterator will be faster or slower, and what its effect on the overall GC will be.

Yes, there can be surprises, although unlikely. Benchmarks will be added to the MR. I thought I could do that today, but unfortunately I had a canine emergency.

> It would be helpful to see the changes in the PR description: for example, the number of instructions in marking or in one iteration of the bitmap iterator, or, even better, actual benchmark results.

Added instruction balances.

> What happens to the TODOs in the PR description? If they're for another PR, should we remove them? They will end up in the commit message when this is merged.

I nuked the TODOs as these were meant to track my progress, and they are now irrelevant.

@mergify mergify bot merged commit a7868cd into master Dec 1, 2021
@mergify mergify bot deleted the gabor/translate branch December 1, 2021 18:46
@mergify mergify bot removed the automerge-squash label Dec 1, 2021
@ggreif (Contributor, Author) commented Dec 8, 2021

As promised, here come the performance counters for the change:

  • baseline is release 0.6.16
  • compared against 0.6.16 with a7868cd reverted
[nix-shell:~/motoko]$ tail compacting-baseline_perf.csv compacting-a7868cd1f13_perf.csv  
==> compacting-baseline_perf.csv <==
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29548225,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29572634,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29318871,224,37,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29305196,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29326984,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29186706,223,26,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29178490,225,23,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29159337,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29173610,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,32425028,224,200,16

==> compacting-a7868cd1f13_perf.csv <==
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30624396,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30649781,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30397508,224,37,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30384817,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30407620,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30268366,223,26,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30261620,225,23,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30243489,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,30258762,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,33511297,224,200,16

A few quick percentages:

[nix-shell:~/motoko]$ ghc -e '33511297/32425028*100'
103.35009425435192

[nix-shell:~/motoko]$ ghc -e '30258762/29173610*100'
103.71963565702016

[nix-shell:~/motoko]$ ghc -e '30624396/29548225*100'
103.6420834077174

We would regress more than 3% (on performing pure GC) with this PR reverted!


Here is a more intuitive graph (with the methodology from #2967):
[graph: would-regress]
With more profiles present, the GC effort clearly dominates the profile-addition work.

mergify bot pushed a commit that referenced this pull request Dec 9, 2021
This PR aims to improve the scanning of the marking bitmap. When the lowest bit in the current (64-bit) word is unset, we now swallow all the trailing zeros, update the bit position and the current word, and return the former.
This amounts to unrolling the inner loop once, because we know that we'll encounter another set bit in the current word.
Unrolling comes at the cost of a few more instructions in the else leg, but since we `return`, we can eliminate the inner `loop` altogether, which is a win, especially for sparse bitmaps.

## Benchmarks

This optimisation was hinted at in #2927, but not implemented there due to the lack of benchmarking data. Now the `cancan` profile creation benchmark is available, and the GC-relevant cycle-count improvement is
``` shell
[nix-shell:~/motoko]$ ghc -e "100-28402346/29159337*100"
2.5960501090954153
```
about 2.6% compared to the baseline 0.6.16 release. More benchmark data and a graph are added in #2952.

## Implementation concerns

We have to use two shifts (once with a dynamic count and once with a constant count) because adding one to the dynamic count could result in a shift of 64 bits, which is undefined behaviour, and Rust traps on that.

We could also refrain from testing the lowest bit and go for the `ctz` directly, but that could result in worse code generated by `wasmtime` (?). OTOH that would probably eliminate the `if`, and branchless code is good!
N.B.: For the branchless optimisation the benchmarks look promising, but I prefer to merge this first.
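
For the record, a hedged sketch of that branchless variant (illustrative only, not the merged iterator):

```rust
// Skip the lowest-bit test and use trailing_zeros unconditionally: it returns
// 0 when the lowest bit is already set, so one code path handles both cases.
// Returns (remaining word, new cursor position, index of the found mark bit).
fn next_mark_bit(current_word: u64, current_bit_idx: u32) -> (u64, u32, u32) {
    debug_assert!(current_word != 0);
    let shift_amt = current_word.trailing_zeros();  // 0 if bit 0 is set
    let bit_idx = current_bit_idx + shift_amt;      // absolute index of the found bit
    let rest = (current_word >> shift_amt) >> 1;    // two shifts, since shift_amt may be 63
    (rest, bit_idx + 1, bit_idx)
}

fn main() {
    // Bits 0 and 5 set: the first call finds bit 0, the second finds bit 5.
    let (rest, cursor, found) = next_mark_bit(0b10_0001, 0);
    assert_eq!(found, 0);
    let (_, _, found) = next_mark_bit(rest, cursor);
    assert_eq!(found, 5);
}
```
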
@ggreif ggreif added the opportunity ("More optimisation opportunities inside") label on Jun 20, 2022

Successfully merging this pull request may close these issues.

Marking in mark-compact GC can be optimized (one-two instructions per live object)