Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(index): lightweight structure #57

Merged
merged 42 commits into from
Aug 27, 2024
Merged

perf(index): lightweight structure #57

merged 42 commits into from
Aug 27, 2024

Conversation

cmdoret
Copy link
Member

@cmdoret cmdoret commented Aug 21, 2024

Proposed Changes

So far we used ntriples as the type index serialization, and loaded it into a HashMap<subject_uri: String, type_uri: String>. This has two drawbacks:

  • high memory usage: for each instance, we store the full URI of instance and its type
  • limitation with multi-typed instance: hashmap keys are unique, but RDF can have one instance with multiple types. This was not supported.

This PR changes the index structure to HashMap<subject_hash: u64, type_indices: SmallVec<[u64; 1]>>`. This has following advantages:

  • type URIs are stored only once in the index file header, then each reference requires a single integer to point to it.
  • Fixed length hashes of instance URI are stored in the index instead of arbitrarily long URIs
  • Associating each instance with a collection of type indices supports multi-typed instances
  • SmallVec allows to store a fixed length array on stack (length 1 here) and only use the heap if more items are inserted.

Additionally the default channel size of the log is reduced from 5000000 to 1000, as it was causing an overhead of 500MB RAM to log a single line.

Types of Changes

What types of changes does your code introduce? Put an x in the boxes that
apply

  • A bug fix (non-breaking change which fixes an issue).
  • A new feature (non-breaking change which adds functionality).
  • A breaking change (fix or feature that would cause existing
    functionality to not work as expected).
  • A non-productive update (documentation, tooling, etc. if none of the
    other choices apply).

Checklist

Put an x in the boxes that apply. You can also fill these out after creating
the PR. If you're unsure about any of them, don't hesitate to ask. We're here to
help! This is simply a reminder of what we are going to look for before merging
your code.

  • I have read the
    CONTRIBUTING
    guidelines.
  • I have added tests that prove my fix is effective or that my feature
    works.
  • I have added the necessary documentation (if appropriate).

Further Comments

Benchmark

Results below were generated using the benchmarking script added by this PR in tools/bench:

Timings

Run time compared using hyperfine

Indexing

Command Mean [s] Min [s] Max [s] Relative
main 36.678 ± 4.001 33.849 39.508 1.00
perf/index 39.616 ± 1.134 38.814 40.417 1.08 ± 0.12

Pseudonymization

Command Mean [s] Min [s] Max [s] Relative
main 62.365 ± 1.897 61.024 63.707 1.08 ± 0.05
perf/index 57.484 ± 1.814 56.201 58.767 1.00

Memory

Heap memory usage compared using heaptrack

Indexing

main: peak heap memory consumption: 520.13M
perf/index: peak heap memory consumption: 344.50M

image

Pseudonymization

main: peak heap memory consumption: 1.02G
perf/index: peak heap memory consumption: 2.11G

image

The initial memory peak appears to be caused be deserialization via serde_yml

image

Once the whole index is loaded, we plateau at 265MB which is great!

@cmdoret
Copy link
Member Author

cmdoret commented Aug 26, 2024

Swapping serde_yml -> serde_json for the index fixed the memory consumption issue, here are the new benchmark results after:

  • swapping to serde_json
  • storing hashes as u64 instead of String

Timings

Run time compared using hyperfine

Indexing

Command Mean [s] Min [s] Max [s] Relative
main 43.131 ± 7.867 37.568 48.694 1.10 ± 0.21
perf/index 39.142 ± 2.539 37.346 40.937 1.00

Pseudonymization

Command Mean [s] Min [s] Max [s] Relative
main 64.090 ± 1.421 63.085 65.095 1.00
perf/index 80.817 ± 1.881 79.487 82.148 1.26 ± 0.04

Memory

Heap memory usage compared using heaptrack

Indexing

main: peak heap memory consumption: 520.13M
perf/index: peak heap memory consumption: 208.24M

Pseudonymization

main: peak heap memory consumption: 1.02G
perf/index: peak heap memory consumption: 208.06M

Memory consumption is no longer peaking on index deserialization, and the plateau is significantly lower than in main.
image

@cmdoret cmdoret marked this pull request as ready for review August 26, 2024 11:37
/// Stores a mapping from hashed instance uri to their types
#[derive(Serialize, Deserialize)]
pub struct TypeIndex {
pub types: Vec<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Its not clear from reading the doc string, what types contains and
what SmallVec contains (be specific).

@cmdoret cmdoret self-assigned this Aug 26, 2024
@cmdoret cmdoret linked an issue Aug 26, 2024 that may be closed by this pull request
Copy link
Contributor

@gabyx gabyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome: Some considerations about from_iter and the insert function
Nice job also with the benchmark!!! 💯

src/index.rs Show resolved Hide resolved
src/index.rs Outdated Show resolved Hide resolved
src/index.rs Outdated Show resolved Hide resolved
src/index.rs Outdated Show resolved Hide resolved
src/index.rs Outdated Show resolved Hide resolved
src/index.rs Outdated Show resolved Hide resolved
tools/bench/benchmark.sh Outdated Show resolved Hide resolved
tools/bench/benchmark.sh Outdated Show resolved Hide resolved
tools/bench/benchmark.sh Outdated Show resolved Hide resolved
tools/bench/benchmark.sh Outdated Show resolved Hide resolved
@cmdoret cmdoret requested a review from gabyx August 26, 2024 23:33
@cmdoret
Copy link
Member Author

cmdoret commented Aug 27, 2024

I took the path of least resistance (and consistency?) in this PR by using <uri> everywhere, including in the rules config.
Here are the new benchmark results after implementing all other suggestions:

date: 2024-08-27

Comparing perf/index against main.

Timings

Run time compared using hyperfine

Indexing

Command Mean [s] Min [s] Max [s] Relative
main 30.673 ± 0.142 30.532 30.902 1.00
perf/index 31.706 ± 2.472 30.355 36.111 1.03 ± 0.08

Pseudonymization

Command Mean [s] Min [s] Max [s] Relative
main 30.295 ± 0.536 29.588 30.884 1.00
perf/index 30.842 ± 0.339 30.337 31.219 1.02 ± 0.02

Memory

Heap memory usage compared using heaptrack

Indexing

main: peak heap memory consumption: 520.13M
perf/index: peak heap memory consumption: 208.24M

Pseudonymization

main: peak heap memory consumption: 1.02G
perf/index: peak heap memory consumption: 208.06M

@cmdoret cmdoret merged commit 01418d2 into main Aug 27, 2024
7 checks passed
@cmdoret cmdoret deleted the perf/index branch August 27, 2024 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

perf(index): reduce memory usage
2 participants