
Introducing a loading layer in FAISS native engine. #2139

Open · wants to merge 7 commits into main
Conversation

@0ctopus13prime (Contributor) commented Sep 24, 2024

Description

This PR is the first commit toward making the loading layer in native engines available.
Please refer to issue #2033 for more details.

Related Issues

Resolves #2033

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
@navneet1v (Collaborator) left a comment

Overall the code looks good to me. A couple of things:

  1. Please add unit tests for the C++ layer.
  2. Add a reference to the benchmarks performed using this code.


final int indexSize = (int) directory.fileLength(logicalIndexPath);

try (IndexInput readStream = directory.openInput(logicalIndexPath, IOContext.READONCE)) {
Collaborator:

Please check what memory spike we observe when openInput is called.

Contributor (author):

I need more input on this. Which issue are you referring to?

Collaborator:

I am not referring to any issue here. My question is: in your tests, when the queries are run, can we graph the memory spikes during the query? Since IndexInput.openInput maps files into memory, I want to know whether we see any spike and how large that spike actually is.

@@ -286,6 +286,7 @@ private Map<Integer, Float> doANNSearch(
try {
indexAllocation = nativeMemoryCacheManager.get(
new NativeMemoryEntryContext.IndexEntryContext(
reader.directory(),
Collaborator:

We have added the Directory code here; are we going to remove the dependency on FSDirectory in upcoming PRs?

Contributor (author):

I think you're referring to using FSDirectory to get indexPath, right?
If so, yes, that's the plan! We must deprecate FileWatcher first, then move on to cutting out FSDirectory entirely.

We pass Directory here so that the internal loading strategy can create an IndexInput from it in case no vector index has been loaded yet.

src/main/java/org/opensearch/knn/jni/JNIService.java (outdated review thread, resolved)
@navneet1v added the Enhancements (Increases software capabilities beyond original client specifications), indexing, indexing-improvements, and backport 2.x labels, and removed the indexing and indexing-improvements labels, on Sep 24, 2024
Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
@0ctopus13prime (Contributor, author):

Appendix 1. Performance Details

| Metric | Task | Baseline | Candidate | Diff (B - C) | Diff ((B - C) / B) | Unit |
|---|---|---|---|---|---|---|
| Cumulative indexing time of primary shards | | 30.5506 | 31.8307 | -1.2801 | -0.0419 | min |
| Min cumulative indexing time across primary shards | | 0.00012 | 0.00025 | -0.00013 | -1.14285 | min |
| Median cumulative indexing time across primary shards | | 15.2753 | 15.9153 | -0.64 | -0.0419 | min |
| Max cumulative indexing time across primary shards | | 30.5505 | 31.8304 | -1.2799 | -0.04189 | min |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | n/a | min |
| Min cumulative indexing throttle time across primary shards | | 0 | 0 | 0 | n/a | min |
| Median cumulative indexing throttle time across primary shards | | 0 | 0 | 0 | n/a | min |
| Max cumulative indexing throttle time across primary shards | | 0 | 0 | 0 | n/a | min |
| Cumulative merge time of primary shards | | 182.972 | 200.509 | -17.537 | -0.09585 | min |
| Cumulative merge count of primary shards | | 113 | 107 | 6 | 0.0531 | |
| Min cumulative merge time across primary shards | | 0 | 0 | 0 | n/a | min |
| Median cumulative merge time across primary shards | | 91.4861 | 100.254 | -8.7679 | -0.09584 | min |
| Max cumulative merge time across primary shards | | 182.972 | 200.509 | -17.537 | -0.09585 | min |
| Cumulative merge throttle time of primary shards | | 1.33108 | 1.67535 | -0.34427 | -0.25864 | min |
| Min cumulative merge throttle time across primary shards | | 0 | 0 | 0 | n/a | min |
| Median cumulative merge throttle time across primary shards | | 0.66554 | 0.83768 | -0.17213 | -0.25864 | min |
| Max cumulative merge throttle time across primary shards | | 1.33108 | 1.67535 | -0.34427 | -0.25864 | min |
| Cumulative refresh time of primary shards | | 1.25782 | 1.2378 | 0.02002 | 0.01592 | min |
| Cumulative refresh count of primary shards | | 92 | 91 | 1 | 0.01087 | |
| Min cumulative refresh time across primary shards | | 0.000317 | 0.00042 | -0.0001 | -0.31579 | min |
| Median cumulative refresh time across primary shards | | 0.62891 | 0.6189 | 0.01001 | 0.01591 | min |
| Max cumulative refresh time across primary shards | | 1.2575 | 1.23738 | 0.02012 | 0.016 | min |
| Cumulative flush time of primary shards | | 11.9259 | 11.4079 | 0.518 | 0.04343 | min |
| Cumulative flush count of primary shards | | 61 | 58 | 3 | 0.04918 | |
| Min cumulative flush time across primary shards | | 0 | 0 | 0 | n/a | min |
| Median cumulative flush time across primary shards | | 5.96296 | 5.70395 | 0.25901 | 0.04344 | min |
| Max cumulative flush time across primary shards | | 11.9259 | 11.4079 | 0.518 | 0.04343 | min |
| Total Young Gen GC time | | 0.288 | 0.301 | -0.013 | -0.04514 | s |
| Total Young Gen GC count | | 18 | 17 | 1 | 0.05556 | |
| Total Old Gen GC time | | 0 | 0 | 0 | n/a | s |
| Total Old Gen GC count | | 0 | 0 | 0 | n/a | |
| Store size | | 29.8163 | 25.8162 | 4.0001 | 0.13416 | GB |
| Translog size | | 5.82E-07 | 5.82E-07 | 0 | 0 | GB |
| Heap used for segments | | 0 | 0 | 0 | n/a | MB |
| Heap used for doc values | | 0 | 0 | 0 | n/a | MB |
| Heap used for terms | | 0 | 0 | 0 | n/a | MB |
| Heap used for norms | | 0 | 0 | 0 | n/a | MB |
| Heap used for points | | 0 | 0 | 0 | n/a | MB |
| Heap used for stored fields | | 0 | 0 | 0 | n/a | MB |
| Segment count | | 2 | 2 | 0 | 0 | |
| Min Throughput | custom-vector-bulk | 5948.76 | 5462.82 | 485.94 | 0.08169 | docs/s |
| Mean Throughput | custom-vector-bulk | 10542.8 | 10833.4 | -290.6 | -0.02756 | docs/s |
| Median Throughput | custom-vector-bulk | 9988.27 | 10243.3 | -255.03 | -0.02553 | docs/s |
| Max Throughput | custom-vector-bulk | 17833.4 | 17447.1 | 386.3 | 0.02166 | docs/s |
| 50th percentile latency | custom-vector-bulk | 72.8052 | 54.5385 | 18.2667 | 0.2509 | ms |
| 90th percentile latency | custom-vector-bulk | 161.131 | 153.681 | 7.45 | 0.04624 | ms |
| 99th percentile latency | custom-vector-bulk | 295.278 | 294.875 | 0.403 | 0.00136 | ms |
| 99.9th percentile latency | custom-vector-bulk | 1602.35 | 1623.76 | -21.41 | -0.01336 | ms |
| 99.99th percentile latency | custom-vector-bulk | 2419.76 | 3373.36 | -953.6 | -0.39409 | ms |
| 100th percentile latency | custom-vector-bulk | 2745.97 | 8909.9 | -6163.93 | -2.24472 | ms |
| 50th percentile service time | custom-vector-bulk | 72.8052 | 54.5385 | 18.2667 | 0.2509 | ms |
| 90th percentile service time | force-merge-segments | 161.131 | 153.681 | 7.45 | 0.04624 | ms |
| 99th percentile service time | force-merge-segments | 295.278 | 294.875 | 0.403 | 0.00136 | ms |
| 99.9th percentile service time | force-merge-segments | 1602.35 | 1623.76 | -21.41 | -0.01336 | ms |
| 99.99th percentile service time | force-merge-segments | 2419.76 | 3373.36 | -953.6 | -0.39409 | ms |
| 100th percentile service time | force-merge-segments | 2745.97 | 8909.9 | -6163.93 | -2.24472 | ms |
| error rate | force-merge-segments | 0 | 0 | 0 | n/a | % |
| Min Throughput | force-merge-segments | 0 | 0 | 0 | n/a | ops/s |
| Mean Throughput | warmup-indices | 0 | 0 | 0 | n/a | ops/s |
| Median Throughput | warmup-indices | 0 | 0 | 0 | n/a | ops/s |
| Max Throughput | warmup-indices | 0 | 0 | 0 | n/a | ops/s |
| 100th percentile latency | warmup-indices | 6.63E+06 | 7.22E+06 | -590670 | -0.08915 | ms |
| 100th percentile service time | warmup-indices | 6.63E+06 | 7.22E+06 | -590670 | -0.08915 | ms |
| error rate | warmup-indices | 0 | 0 | 0 | n/a | % |
| Min Throughput | warmup-indices | 0.27 | 0.25 | 0.02 | 0.07407 | ops/s |
| Mean Throughput | prod-queries | 0.27 | 0.25 | 0.02 | 0.07407 | ops/s |
| Median Throughput | prod-queries | 0.27 | 0.25 | 0.02 | 0.07407 | ops/s |
| Max Throughput | prod-queries | 0.27 | 0.25 | 0.02 | 0.07407 | ops/s |
| 100th percentile latency | prod-queries | 3656.95 | 3944.55 | -287.6 | -0.07864 | ms |
| 100th percentile service time | prod-queries | 3656.95 | 3944.55 | -287.6 | -0.07864 | ms |
| error rate | prod-queries | 0 | 0 | 0 | n/a | % |
| Min Throughput | prod-queries | 0.72 | 0.77 | -0.05 | -0.06944 | ops/s |
| Mean Throughput | prod-queries | 3.26 | 5.51 | -2.25 | -0.69018 | ops/s |
| Median Throughput | prod-queries | 0.72 | 0.77 | -0.05 | -0.06944 | ops/s |
| Max Throughput | prod-queries | 8.33 | 14.97 | -6.64 | -0.79712 | ops/s |
| 50th percentile latency | prod-queries | 8.07474 | 8.12443 | -0.04969 | -0.00615 | ms |
| 90th percentile latency | prod-queries | 8.96103 | 9.02022 | -0.05919 | -0.00661 | ms |
| 99th percentile latency | prod-queries | 25.3968 | 24.7477 | 0.6491 | 0.02556 | ms |
| 100th percentile latency | prod-queries | 1382.39 | 1290.87 | 91.52 | 0.0662 | ms |
| 50th percentile service time | prod-queries | 8.07474 | 8.12443 | -0.04969 | -0.00615 | ms |
| 90th percentile service time | prod-queries | 8.96103 | 9.02022 | -0.05919 | -0.00661 | ms |
| 99th percentile service time | prod-queries | 25.3968 | 24.7477 | 0.6491 | 0.02556 | ms |
| 100th percentile service time | prod-queries | 1382.39 | 1290.87 | 91.52 | 0.0662 | ms |
| error rate | prod-queries | 0 | 0 | 0 | n/a | % |
| Mean recall@k | prod-queries | 0.29 | 0.24 | 0.05 | 0.17241 | |
| Mean recall@1 | prod-queries | 0.42 | 0.4 | 0.02 | 0.04762 | |

@0ctopus13prime (Contributor, author):

> Overall the code looks good to me. A couple of things:
>
> 1. Please add unit tests for the C++ layer.
> 2. Add a reference to the benchmarks performed using this code.

@navneet1v Addressed your comments. Can you take a look?

@navneet1v (Collaborator):

@0ctopus13prime please ensure that all CI checks are successful.

@0ctopus13prime (Contributor, author):

It seems KNNCircuitBreakerIT is failing...
I will fix the test and update the commit.
Thanks.

…meter.

Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
Comment on lines +224 to +226
if (KNNEngine.NMSLIB == knnEngine) {
throw new UnsupportedOperationException("Loading NMSLIB with a read stream is not supported at the moment");
}
Contributor:

Seems redundant

Contributor (author):

I had it here to be conservative.
I also thought it would help the reader understand more easily why only FAISS appears in this method.
But yes, NMSLIB won't be passed down to here.

Contributor:

The thing is, the last throw is just dead code as it stands now. I would recommend removing it if the case is handled correctly upstream, in favor of reducing code, since there is no ambiguity here.

@0ctopus13prime (Contributor, author) commented Oct 2, 2024:

I'm not sure I understand "the last throw is just dead code as it stands now".
If we throw there, will the exception be eaten silently?
Hmm... I think KNNWeight would trigger this, no?

jni/include/faiss_stream_support.h (review thread resolved)
jni/tests/faiss_stream_support_test.cpp (review thread resolved)
@@ -547,6 +547,34 @@ void knn_jni::JNIUtil::SetByteArrayRegion(JNIEnv *env, jbyteArray array, jsize s
this->HasExceptionInStack(env, "Unable to set byte array region");
}

jobject knn_jni::JNIUtil::GetObjectField(JNIEnv * env, jobject obj, jfieldID fieldID) {
Contributor:

These seem to be just wrappers over JNIEnv. Is this done for testing purposes? Can we avoid this maintenance?

Contributor (author):

Yes, it is for testing purposes.
I could not come up with a better solution than having it here.
Also, I think the motivation for introducing this class was testing as well, right?

Contributor:

Yeah, but most of the methods here have added logic; it would be nice if we could mock JNIEnv, but it should be fine if that's not possible. One of the reasons for this file is also caching, which is handled by statics in this PR.

Contributor (author):

Unfortunately, in order for gtest to mock a class, its methods have to be virtual...
JNIEnv is a struct, and its defined APIs are not virtual methods.
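
For readers following along, here is a minimal sketch of the wrapper pattern being discussed (the interface and mock names are illustrative, not necessarily the PR's exact ones): gMock can only mock virtual methods, and JNIEnv's functions are non-virtual, so calls are routed through a thin virtual interface that tests can mock while production code forwards to the real JNIEnv.

```cpp
#include <gmock/gmock.h>
#include <jni.h>

// Virtual interface: gMock can only mock virtual methods, so JNIEnv calls
// go through this thin layer instead of being made directly.
class JNIUtilInterface {
public:
    virtual ~JNIUtilInterface() = default;
    virtual jobject GetObjectField(JNIEnv* env, jobject obj, jfieldID fieldID) = 0;
};

// Production implementation: a pass-through to JNIEnv.
class JNIUtil : public JNIUtilInterface {
public:
    jobject GetObjectField(JNIEnv* env, jobject obj, jfieldID fieldID) override {
        return env->GetObjectField(obj, fieldID);
    }
};

// Test double: generated by gMock from the virtual interface.
class MockJNIUtil : public JNIUtilInterface {
public:
    MOCK_METHOD(jobject, GetObjectField, (JNIEnv*, jobject, jfieldID), (override));
};
```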

// Loads an index from a reader implementing faiss::IOReader
//
// Returns a pointer to the loaded index
jlong LoadIndexWithStream(faiss::IOReader* ioReader);
Contributor:

Let's move these methods to the index service?

Contributor (author):

This is the counterpart of LoadIndex, which will be deprecated soon.
I tried not to modify the overall structure too much.

@shatejas (Contributor) commented Oct 2, 2024:

LoadIndex is a pending refactor/move; if there is enough bandwidth, we should put the new code in the index service to start with.

@heemin32 thoughts?

Contributor (author):

Can I make those changes in the upcoming PR?
It will be raised right after this one: streaming support for NMSLIB.
I will ask you to review it.

Collaborator:

Better to have it in the index service. Having it in the next PR is fine with me.

Contributor (author):

Sounds good!
I will move the method into the index service in the next PR, where I will also ask you to review the refactoring.
Thank you both.
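
For orientation, this is roughly the contract LoadIndexWithStream builds on: faiss::read_index can deserialize an index from any faiss::IOReader whose operator() supplies bytes. Below is a minimal sketch using an in-memory source; the class is illustrative only (the PR's actual reader pulls bytes from Lucene's IndexInput through JNI), but the faiss-facing contract is the same.

```cpp
#include <faiss/impl/io.h>
#include <faiss/index_io.h>
#include <algorithm>
#include <cstring>
#include <vector>

// Illustrative IOReader backed by an in-memory buffer. faiss calls
// operator() to fetch size * nitems bytes and expects the number of
// complete items actually read in return.
class InMemoryIOReader : public faiss::IOReader {
public:
    explicit InMemoryIOReader(std::vector<char> data)
        : data_(std::move(data)), offset_(0) {}

    size_t operator()(void* ptr, size_t size, size_t nitems) override {
        if (size == 0) {
            return 0;
        }
        const size_t remaining = data_.size() - offset_;
        const size_t items = std::min(nitems, remaining / size);
        std::memcpy(ptr, data_.data() + offset_, items * size);
        offset_ += items * size;
        return items;
    }

private:
    std::vector<char> data_;
    size_t offset_;
};

// Usage: faiss deserializes the index straight from the reader.
// InMemoryIOReader reader(serializedIndexBytes);
// faiss::Index* index = faiss::read_index(&reader);
```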

Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
@0ctopus13prime (Contributor, author):

Loading Time Comparison

The numbers below were measured with time curl -X GET http://localhost:9200/_plugins/_knn/warmup/target_index.
I ran two experiments loading a FAISS vector index with different buffer sizes:

  1. After dropping all file caches from memory (sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches').
  2. With the file cache in memory.

Conclusion:
The buffer size in InputIndexWithBuffer does not impact loading time, so there is no reason to use more than a 4KB buffer. If anything, a larger buffer costs more space and adds time inside the JNI critical section.

When an index file has just been created and is not yet in the system cache, there is no meaningful difference in loading time between the baseline and the streaming fashion.
When the file is already in the system cache, the baseline (the one using fread) is slightly faster than the streaming fashion (3.584 s vs 4.664 s).
But considering that this case is rare (except when rebooting an engine during a rolling restart), and that in most cases a newly created vector index will be loaded, I don't think it will seriously deteriorate overall performance.
Once an index is loaded, queries are processed against the in-memory data structure, so there was no search performance difference between the baseline and the streaming version. (Refer to the table above for more details.)
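
To make the buffer-size trade-off concrete, here is a sketch of the kind of chunked copy loop where it matters (the mediator object and readBytesMethod are hypothetical stand-ins for the PR's actual Java-side reader): each chunk costs one JNI round trip plus a critical section while the array is pinned, so a larger buffer means fewer round trips but longer pinning, and past a few KB neither term dominates the disk read itself.

```cpp
#include <jni.h>
#include <cstring>

// Hypothetical chunked copy: Java fills `buffer` (e.g. via IndexInput#readBytes),
// then the bytes are copied out inside a JNI critical section. Returns the
// number of bytes copied, or 0 at end of stream.
size_t CopyChunkFromJava(JNIEnv* env, jobject mediator, jmethodID readBytesMethod,
                         jbyteArray buffer, jint bufferSize, char* dst) {
    // One JNI round trip per chunk: ask Java to fill the buffer.
    const jint nread = env->CallIntMethod(mediator, readBytesMethod, buffer, bufferSize);
    if (nread <= 0) {
        return 0;
    }
    // Critical section: pin the array, copy, release without writing back.
    void* src = env->GetPrimitiveArrayCritical(buffer, nullptr);
    if (src == nullptr) {
        return 0;  // Pinning failed; a real implementation would raise an error.
    }
    std::memcpy(dst, src, static_cast<size_t>(nread));
    env->ReleasePrimitiveArrayCritical(buffer, src, JNI_ABORT);
    return static_cast<size_t>(nread);
}
```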

Experiment

Index size: 6.4 GB

1. Baseline (Using fread)

  1. After cache drop : 51.097 seconds
  2. With cache : 3.584 seconds

2. Using Stream

2.1. 4KB

  1. After cache drop : 51.354 seconds
  2. With cache : 4.664 seconds

2.2. 64KB

  1. After cache drop : 51.491 seconds
  2. With cache : 4.318 seconds

2.3. 1M

  1. After cache drop : 51.518 seconds
  2. With cache : 4.201 seconds

Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
@0ctopus13prime (Contributor, author) commented Oct 2, 2024:

@navneet1v
Hi Navneet, I ran a benchmark (including both bulk ingestion and searching) and could not find any evidence of a memory peak during searching.
Please let me know if this looks good to you, and then we can merge!
Thank you.

Streaming

[Screenshot: 2024-10-01 at 7:24:46 PM]

Baseline

[Screenshot: 2024-10-01 at 7:28:20 PM]

Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
@0ctopus13prime (Contributor, author):

Hi @shatejas, I included unit tests in the commit!
Could you take a look?

@navneet1v (Collaborator) commented Oct 2, 2024:

> (Quoting the "Loading Time Comparison" comment above in full.)

Thanks @0ctopus13prime, these were some really good benchmarks, and I like the way you performed them. I really like the idea of dropping the file from the page cache and then running the benchmarks.

But I am not completely sold on the conclusion of sticking to 4KB as the buffer size. I believe we should move towards a somewhat higher buffer size to ensure that we stay as close as possible to cached performance. We have a similar buffer when we transfer vectors from the Java heap to off-heap memory; there we use 1% of the heap. Can we run one test with a higher value, like 1% of the heap on a JVM with a 32GB heap (roughly 328MB)?

Another thing: please paste these benchmarks on the RFC too. These are really good benchmarks.

@navneet1v (Collaborator):

> (Quoting the memory benchmark comment and screenshots above.)

I am a little surprised by this. Could you share more details about your testing strategy and the metrics-gathering technique used to build the graphs above?

Labels: backport 2.x, Enhancements (Increases software capabilities beyond original client specifications)

Successfully merging this pull request may close these issues:

[RFC] Introducing Loading/Writing Layer in Native KNN Engines