perf(index): lightweight structure #57

cmdoret · 2024-08-21T12:19:10Z

Proposed Changes

Until now, the type index was serialized as N-Triples and loaded into a HashMap<subject_uri: String, type_uri: String>. This has two drawbacks:

High memory usage: for each instance, we store the full URI of the instance and the full URI of its type.
No support for multi-typed instances: hashmap keys are unique, but in RDF one instance can have multiple types. This case was not supported.
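The second drawback can be reproduced in a few lines. This is a minimal sketch, not code from the PR: the helper `build_old_index` and the example.org URIs are hypothetical and only illustrate how a `HashMap<String, String>` keeps a single type per subject.

```rust
use std::collections::HashMap;

/// Build an old-style index: one type URI per subject URI.
/// (Hypothetical helper for illustration, not the PR's code.)
fn build_old_index(pairs: &[(&str, &str)]) -> HashMap<String, String> {
    let mut index = HashMap::new();
    for (subject, type_uri) in pairs {
        // A later pair with the same subject replaces the earlier value.
        index.insert(subject.to_string(), type_uri.to_string());
    }
    index
}

fn main() {
    // The same instance is declared with two types.
    let index = build_old_index(&[
        ("<http://example.org/alice>", "<http://example.org/Person>"),
        ("<http://example.org/alice>", "<http://example.org/Employee>"),
    ]);
    // Only the last type survives for the subject.
    println!(
        "{} entry, type = {:?}",
        index.len(),
        index.get("<http://example.org/alice>")
    );
}
```

Because `HashMap::insert` overwrites the value stored under an existing key, the first type is silently dropped.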

This PR changes the index structure to HashMap<subject_hash: u64, type_indices: SmallVec<[u64; 1]>>. This has the following advantages:

Type URIs are stored only once in the index file header, and each reference then needs only a single integer to point to one.
Fixed-length hashes of instance URIs are stored in the index instead of arbitrarily long URIs.
Associating each instance with a collection of type indices supports multi-typed instances.
SmallVec stores up to a fixed number of items inline (1 here) and only allocates on the heap if more items are inserted.
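The layout described above could look roughly like the following. This is a sketch under stated assumptions rather than the PR's implementation: `TypeIndexSketch` and its methods are hypothetical names, `Vec<usize>` stands in for `SmallVec<[u64; 1]>` to keep the example free of external crates, and `DefaultHasher` is an arbitrary choice of hasher.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Illustrative stand-in for the index described above.
/// `Vec<usize>` replaces `SmallVec<[u64; 1]>` (assumption, to stay std-only).
pub struct TypeIndexSketch {
    /// Each distinct type URI, stored once.
    pub types: Vec<String>,
    /// Hash of the subject URI -> positions into `types`.
    map: HashMap<u64, Vec<usize>>,
}

impl TypeIndexSketch {
    pub fn new() -> Self {
        Self { types: Vec::new(), map: HashMap::new() }
    }

    /// Hash a URI with a short-lived hasher (the hasher choice is an assumption).
    fn hash(uri: &str) -> u64 {
        let mut hasher = DefaultHasher::new();
        uri.hash(&mut hasher);
        hasher.finish()
    }

    /// Record that `subject` has type `type_uri`.
    pub fn insert(&mut self, subject: &str, type_uri: &str) {
        // Reuse the position of a known type, or append a new one.
        // A linear scan is fine while the number of distinct types is small.
        let idx = match self.types.iter().position(|t| t == type_uri) {
            Some(i) => i,
            None => {
                self.types.push(type_uri.to_string());
                self.types.len() - 1
            }
        };
        let entry = self.map.entry(Self::hash(subject)).or_default();
        if !entry.contains(&idx) {
            entry.push(idx);
        }
    }

    /// Return all type URIs recorded for `subject`, if any.
    pub fn get(&self, subject: &str) -> Option<Vec<&str>> {
        self.map
            .get(&Self::hash(subject))
            .map(|ids| ids.iter().map(|&i| self.types[i].as_str()).collect())
    }
}

fn main() {
    let mut index = TypeIndexSketch::new();
    index.insert("<http://example.org/alice>", "<http://example.org/Person>");
    index.insert("<http://example.org/alice>", "<http://example.org/Employee>");
    println!("{:?}", index.get("<http://example.org/alice>"));
}
```

One trade-off of keying on a hash instead of the URI is that two subjects whose hashes collide would share an entry. With 64-bit hashes this is unlikely, but the sketch does not detect it.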

Additionally, the default channel size of the logger is reduced from 5000000 to 1000, as it was causing an overhead of 500MB of RAM to log a single line.
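As a rough back-of-the-envelope check of the numbers above, assuming the reserved memory scales linearly with the channel size and taking 500MB as 500,000,000 bytes (both are assumptions, and the helper names are hypothetical):

```rust
/// Bytes per channel slot implied by an observed overhead, assuming linear scaling.
fn bytes_per_slot(observed_bytes: u64, slots: u64) -> u64 {
    observed_bytes / slots
}

/// Estimated reserved bytes for a given channel size.
fn reserved_bytes(slots: u64, per_slot: u64) -> u64 {
    slots * per_slot
}

fn main() {
    // Figures taken from the description: ~500MB for 5,000,000 slots.
    let per_slot = bytes_per_slot(500_000_000, 5_000_000);
    let new_total = reserved_bytes(1_000, per_slot);
    println!("~{per_slot} bytes per slot, ~{new_total} bytes reserved at 1000 slots");
}
```

Under these assumptions each slot costs about 100 bytes, so a channel size of 1000 reserves on the order of 100KB.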

Types of Changes

What types of changes does your code introduce? Put an x in the boxes that
apply

A bug fix (non-breaking change which fixes an issue).
A new feature (non-breaking change which adds functionality).
A breaking change (fix or feature that would cause existing
functionality to not work as expected).
A non-productive update (documentation, tooling, etc. if none of the
other choices apply).

Checklist

Put an x in the boxes that apply. You can also fill these out after creating
the PR. If you're unsure about any of them, don't hesitate to ask. We're here to
help! This is simply a reminder of what we are going to look for before merging
your code.

I have read the
CONTRIBUTING
guidelines.
I have added tests that prove my fix is effective or that my feature
works.
I have added the necessary documentation (if appropriate).

Further Comments

Benchmark

Results below were generated using the benchmarking script added by this PR in tools/bench:

Timings

Run time compared using hyperfine

Indexing

Command	Mean [s]	Min [s]	Max [s]	Relative
`main`	36.678 ± 4.001	33.849	39.508	1.00
`perf/index`	39.616 ± 1.134	38.814	40.417	1.08 ± 0.12

Pseudonymization

Command	Mean [s]	Min [s]	Max [s]	Relative
`main`	62.365 ± 1.897	61.024	63.707	1.08 ± 0.05
`perf/index`	57.484 ± 1.814	56.201	58.767	1.00

Memory

Heap memory usage compared using heaptrack

Indexing

main: peak heap memory consumption: 520.13M
perf/index: peak heap memory consumption: 344.50M

Pseudonymization

main: peak heap memory consumption: 1.02G
perf/index: peak heap memory consumption: 2.11G

The initial memory peak appears to be caused by deserialization via serde_yml.

Once the whole index is loaded, we plateau at 265MB which is great!

cmdoret · 2024-08-26T11:36:39Z

Swapping serde_yml -> serde_json for the index fixed the memory consumption issue. Here are the new benchmark results after:

swapping to serde_json
storing hashes as u64 instead of String

Timings

Run time compared using hyperfine

Indexing

Command	Mean [s]	Min [s]	Max [s]	Relative
`main`	43.131 ± 7.867	37.568	48.694	1.10 ± 0.21
`perf/index`	39.142 ± 2.539	37.346	40.937	1.00

Pseudonymization

Command	Mean [s]	Min [s]	Max [s]	Relative
`main`	64.090 ± 1.421	63.085	65.095	1.00
`perf/index`	80.817 ± 1.881	79.487	82.148	1.26 ± 0.04

Memory

Heap memory usage compared using heaptrack

Indexing

main: peak heap memory consumption: 520.13M
perf/index: peak heap memory consumption: 208.24M

Pseudonymization

main: peak heap memory consumption: 1.02G
perf/index: peak heap memory consumption: 208.06M

Memory consumption no longer peaks on index deserialization, and peak heap usage is significantly lower than on main (208.24M vs. 520.13M for indexing, 208.06M vs. 1.02G for pseudonymization).

gabyx · 2024-08-26T12:05:08Z

src/index.rs

+/// Stores a mapping from hashed instance uri to their types
+#[derive(Serialize, Deserialize)]
+pub struct TypeIndex {
+    pub types: Vec<String>,


suggestion: It's not clear from reading the doc string what types contains and what the SmallVec contains (be specific).

gabyx

Awesome: Some considerations about from_iter and the insert function
Nice job also with the benchmark!!! 💯

src/index.rs

tools/bench/benchmark.sh

cmdoret · 2024-08-27T00:31:51Z

I took the path of least resistance (and consistency?) in this PR by using <uri> everywhere, including in the rules config.
Here are the new benchmark results after implementing all other suggestions:

date: 2024-08-27

Comparing perf/index against main.

Timings

Run time compared using hyperfine

Indexing

Command	Mean [s]	Min [s]	Max [s]	Relative
`main`	30.673 ± 0.142	30.532	30.902	1.00
`perf/index`	31.706 ± 2.472	30.355	36.111	1.03 ± 0.08

Pseudonymization

Command	Mean [s]	Min [s]	Max [s]	Relative
`main`	30.295 ± 0.536	29.588	30.884	1.00
`perf/index`	30.842 ± 0.339	30.337	31.219	1.02 ± 0.02

Memory

Heap memory usage compared using heaptrack

Indexing

main: peak heap memory consumption: 520.13M
perf/index: peak heap memory consumption: 208.24M

Pseudonymization

main: peak heap memory consumption: 1.02G
perf/index: peak heap memory consumption: 208.06M

cmdoret added 15 commits August 21, 2024 17:08

feat(index): add dedicated struct+impl

4cdbe79

feat: serde-based index creation

b001c5d

refactor(rules): use index struct [WIP]

fe101a2

feat(rules): type pred matching

16ab28d

style: formatting

9811272

refactor(index): make hasher short-lived

af6fe3f

feat(io): index parser

5f2af98

feat(pseudo): use new index struct

cc31ae6

refactor: better index function naming

4c823fe

test(rules): use new index in rules tests

4032b3b

test(rules): update index macro to use TypeIndex

2fab215

refactor(index): from_map -> from_iter

8cca4c1

chore: rm unused import

32e711c

style: formatting

f612fe6

test(pseudo): update test index

e69a9d1

cmdoret force-pushed the perf/index branch from 80eaaef to e69a9d1 Compare August 21, 2024 15:10

cmdoret added 11 commits August 22, 2024 02:55

chore(deps): add smallvec

d487733

refactor(index): use smallvec

f7d60f4

perf(log): reduce reserved mem

28e3615

test(rules): fix default for index macro

95d6563

perf(bench): add benchmark script

cad8af7

perf(bench): can select binary profile

4f9146c

chore(bench): release as default profile

afd0844

perf(index): hashes as u64 instead of String

7d5ebeb

docs(bench): comment script

0fc975c

chore(deps): add serde_json

132f688

refactor(index): yaml -> json

c496272

cmdoret marked this pull request as ready for review August 26, 2024 11:37

style: format+clippy recommendations

d38acb7

gabyx reviewed Aug 26, 2024

cmdoret self-assigned this Aug 26, 2024

cmdoret linked an issue Aug 26, 2024 that may be closed by this pull request

perf(index): reduce memory usage #55

Closed

gabyx requested changes Aug 26, 2024

cmdoret added 6 commits August 26, 2024 19:25

refactor(index): simplify impl, always use nt uri serialization

5b7d68f

test(rules): simplify macro, use uri.to_string()

0284072

tests(data): uri -> <uri>

cfb1616

test(index): use <uri> in cases

08b62ee

refactor(rules): use uri.to_string()

5f7ad25

test(rules): use <uri> in cases

4642153

cmdoret mentioned this pull request Aug 26, 2024

perf(index): hash-based type lookup #58

Open

cmdoret added 8 commits August 26, 2024 22:03

style: lint+format

a2c3092

docs(index): better comments

c4e2c77

refactor(index): return &str from TypeIndex.get

f1284fd

style(bench): shellcheck suggestions

c80493c

chore(bench): add nix devshell

9ea47fb

fix(bench): git clone in nix

42cbdcb

refactor(bench): function-based benchmark script

56fc66b

docs(bench): document benchmark execution

b3545d1

cmdoret requested a review from gabyx August 26, 2024 23:33

fix(bench): syntax

8331759

cmdoret merged commit 01418d2 into main Aug 27, 2024
7 checks passed

cmdoret deleted the perf/index branch August 27, 2024 21:31
