
Introducing a loading layer in FAISS native engine. #2139

Open · wants to merge 7 commits into main
Conversation

@0ctopus13prime (Contributor) commented Sep 24, 2024

Description

This PR is the first commit toward making the loading layer in native engines available.
Please refer to issue #2033 for more details.

Related Issues

Resolves #2033

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
@navneet1v (Collaborator) left a comment

Overall the code looks good to me. A couple of things:

  1. Please add unit tests for the C++ layer.
  2. Add a reference to the benchmarks performed using this code.


final int indexSize = (int) directory.fileLength(logicalIndexPath);

try (IndexInput readStream = directory.openInput(logicalIndexPath, IOContext.READONCE)) {
Collaborator:

Please check what memory spike we observe when openInput is called.

Contributor (author):

I need more input on this. Which issue are you referring to?

Collaborator:

I am not referring to any issue here. My question is: in your tests, when the queries are run, can we graph the memory spikes during the query? Since IndexInput.openInput maps files into memory, I want to know whether we see any spike and how large that spike actually is.

@@ -286,6 +286,7 @@ private Map<Integer, Float> doANNSearch(
try {
indexAllocation = nativeMemoryCacheManager.get(
new NativeMemoryEntryContext.IndexEntryContext(
reader.directory(),
Collaborator:

We have added the Directory code here; are we going to remove the dependency on FSDirectory in upcoming PRs?

Contributor (author):

I think you're referring to using FSDirectory to get indexPath, right?
If so, yes, that's the plan! We must deprecate FileWatcher first, then move on to cutting out FSDirectory entirely.

We pass Directory here so that the internal loading strategy can create an IndexInput from it in case no vector index has been loaded yet.

src/main/java/org/opensearch/knn/jni/JNIService.java (outdated review thread, resolved)
@navneet1v added the Enhancements (Increases software capabilities beyond original client specifications), indexing, indexing-improvements, and backport 2.x labels, and removed the indexing and indexing-improvements labels, on Sep 24, 2024
Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
@0ctopus13prime (Contributor, author):

Appendix 1. Performance Details

| Metric | Task | Baseline | Candidate | Diff (B - C) | Diff ((B - C) / B) | Unit |
|---|---|---|---|---|---|---|
| Cumulative indexing time of primary shards | | 30.5506 | 31.8307 | -1.2801 | -0.0419 | min |
| Min cumulative indexing time across primary shards | | 0.00012 | 0.00025 | -0.00013 | -1.14285 | min |
| Median cumulative indexing time across primary shards | | 15.2753 | 15.9153 | -0.64 | -0.0419 | min |
| Max cumulative indexing time across primary shards | | 30.5505 | 31.8304 | -1.2799 | -0.04189 | min |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | n/a | min |
| Min cumulative indexing throttle time across primary shards | | 0 | 0 | 0 | n/a | min |
| Median cumulative indexing throttle time across primary shards | | 0 | 0 | 0 | n/a | min |
| Max cumulative indexing throttle time across primary shards | | 0 | 0 | 0 | n/a | min |
| Cumulative merge time of primary shards | | 182.972 | 200.509 | -17.537 | -0.09585 | min |
| Cumulative merge count of primary shards | | 113 | 107 | 6 | 0.0531 | |
| Min cumulative merge time across primary shards | | 0 | 0 | 0 | n/a | min |
| Median cumulative merge time across primary shards | | 91.4861 | 100.254 | -8.7679 | -0.09584 | min |
| Max cumulative merge time across primary shards | | 182.972 | 200.509 | -17.537 | -0.09585 | min |
| Cumulative merge throttle time of primary shards | | 1.33108 | 1.67535 | -0.34427 | -0.25864 | min |
| Min cumulative merge throttle time across primary shards | | 0 | 0 | 0 | n/a | min |
| Median cumulative merge throttle time across primary shards | | 0.66554 | 0.83768 | -0.17213 | -0.25864 | min |
| Max cumulative merge throttle time across primary shards | | 1.33108 | 1.67535 | -0.34427 | -0.25864 | min |
| Cumulative refresh time of primary shards | | 1.25782 | 1.2378 | 0.02002 | 0.01592 | min |
| Cumulative refresh count of primary shards | | 92 | 91 | 1 | 0.01087 | |
| Min cumulative refresh time across primary shards | | 0.000317 | 0.00042 | -0.0001 | -0.31579 | min |
| Median cumulative refresh time across primary shards | | 0.62891 | 0.6189 | 0.01001 | 0.01591 | min |
| Max cumulative refresh time across primary shards | | 1.2575 | 1.23738 | 0.02012 | 0.016 | min |
| Cumulative flush time of primary shards | | 11.9259 | 11.4079 | 0.518 | 0.04343 | min |
| Cumulative flush count of primary shards | | 61 | 58 | 3 | 0.04918 | |
| Min cumulative flush time across primary shards | | 0 | 0 | 0 | n/a | min |
| Median cumulative flush time across primary shards | | 5.96296 | 5.70395 | 0.25901 | 0.04344 | min |
| Max cumulative flush time across primary shards | | 11.9259 | 11.4079 | 0.518 | 0.04343 | min |
| Total Young Gen GC time | | 0.288 | 0.301 | -0.013 | -0.04514 | s |
| Total Young Gen GC count | | 18 | 17 | 1 | 0.05556 | |
| Total Old Gen GC time | | 0 | 0 | 0 | n/a | s |
| Total Old Gen GC count | | 0 | 0 | 0 | n/a | |
| Store size | | 29.8163 | 25.8162 | 4.0001 | 0.13416 | GB |
| Translog size | | 5.82E-07 | 5.82E-07 | 0 | 0 | GB |
| Heap used for segments | | 0 | 0 | 0 | n/a | MB |
| Heap used for doc values | | 0 | 0 | 0 | n/a | MB |
| Heap used for terms | | 0 | 0 | 0 | n/a | MB |
| Heap used for norms | | 0 | 0 | 0 | n/a | MB |
| Heap used for points | | 0 | 0 | 0 | n/a | MB |
| Heap used for stored fields | | 0 | 0 | 0 | n/a | MB |
| Segment count | | 2 | 2 | 0 | 0 | |
| Min Throughput | custom-vector-bulk | 5948.76 | 5462.82 | 485.94 | 0.08169 | docs/s |
| Mean Throughput | custom-vector-bulk | 10542.8 | 10833.4 | -290.6 | -0.02756 | docs/s |
| Median Throughput | custom-vector-bulk | 9988.27 | 10243.3 | -255.03 | -0.02553 | docs/s |
| Max Throughput | custom-vector-bulk | 17833.4 | 17447.1 | 386.3 | 0.02166 | docs/s |
| 50th percentile latency | custom-vector-bulk | 72.8052 | 54.5385 | 18.2667 | 0.2509 | ms |
| 90th percentile latency | custom-vector-bulk | 161.131 | 153.681 | 7.45 | 0.04624 | ms |
| 99th percentile latency | custom-vector-bulk | 295.278 | 294.875 | 0.403 | 0.00136 | ms |
| 99.9th percentile latency | custom-vector-bulk | 1602.35 | 1623.76 | -21.41 | -0.01336 | ms |
| 99.99th percentile latency | custom-vector-bulk | 2419.76 | 3373.36 | -953.6 | -0.39409 | ms |
| 100th percentile latency | custom-vector-bulk | 2745.97 | 8909.9 | -6163.93 | -2.24472 | ms |
| 50th percentile service time | custom-vector-bulk | 72.8052 | 54.5385 | 18.2667 | 0.2509 | ms |
| 90th percentile service time | force-merge-segments | 161.131 | 153.681 | 7.45 | 0.04624 | ms |
| 99th percentile service time | force-merge-segments | 295.278 | 294.875 | 0.403 | 0.00136 | ms |
| 99.9th percentile service time | force-merge-segments | 1602.35 | 1623.76 | -21.41 | -0.01336 | ms |
| 99.99th percentile service time | force-merge-segments | 2419.76 | 3373.36 | -953.6 | -0.39409 | ms |
| 100th percentile service time | force-merge-segments | 2745.97 | 8909.9 | -6163.93 | -2.24472 | ms |
| error rate | force-merge-segments | 0 | 0 | 0 | n/a | % |
| Min Throughput | force-merge-segments | 0 | 0 | 0 | n/a | ops/s |
| Mean Throughput | warmup-indices | 0 | 0 | 0 | n/a | ops/s |
| Median Throughput | warmup-indices | 0 | 0 | 0 | n/a | ops/s |
| Max Throughput | warmup-indices | 0 | 0 | 0 | n/a | ops/s |
| 100th percentile latency | warmup-indices | 6.63E+06 | 7.22E+06 | -590670 | -0.08915 | ms |
| 100th percentile service time | warmup-indices | 6.63E+06 | 7.22E+06 | -590670 | -0.08915 | ms |
| error rate | warmup-indices | 0 | 0 | 0 | n/a | % |
| Min Throughput | warmup-indices | 0.27 | 0.25 | 0.02 | 0.07407 | ops/s |
| Mean Throughput | prod-queries | 0.27 | 0.25 | 0.02 | 0.07407 | ops/s |
| Median Throughput | prod-queries | 0.27 | 0.25 | 0.02 | 0.07407 | ops/s |
| Max Throughput | prod-queries | 0.27 | 0.25 | 0.02 | 0.07407 | ops/s |
| 100th percentile latency | prod-queries | 3656.95 | 3944.55 | -287.6 | -0.07864 | ms |
| 100th percentile service time | prod-queries | 3656.95 | 3944.55 | -287.6 | -0.07864 | ms |
| error rate | prod-queries | 0 | 0 | 0 | n/a | % |
| Min Throughput | prod-queries | 0.72 | 0.77 | -0.05 | -0.06944 | ops/s |
| Mean Throughput | prod-queries | 3.26 | 5.51 | -2.25 | -0.69018 | ops/s |
| Median Throughput | prod-queries | 0.72 | 0.77 | -0.05 | -0.06944 | ops/s |
| Max Throughput | prod-queries | 8.33 | 14.97 | -6.64 | -0.79712 | ops/s |
| 50th percentile latency | prod-queries | 8.07474 | 8.12443 | -0.04969 | -0.00615 | ms |
| 90th percentile latency | prod-queries | 8.96103 | 9.02022 | -0.05919 | -0.00661 | ms |
| 99th percentile latency | prod-queries | 25.3968 | 24.7477 | 0.6491 | 0.02556 | ms |
| 100th percentile latency | prod-queries | 1382.39 | 1290.87 | 91.52 | 0.0662 | ms |
| 50th percentile service time | prod-queries | 8.07474 | 8.12443 | -0.04969 | -0.00615 | ms |
| 90th percentile service time | prod-queries | 8.96103 | 9.02022 | -0.05919 | -0.00661 | ms |
| 99th percentile service time | prod-queries | 25.3968 | 24.7477 | 0.6491 | 0.02556 | ms |
| 100th percentile service time | prod-queries | 1382.39 | 1290.87 | 91.52 | 0.0662 | ms |
| error rate | prod-queries | 0 | 0 | 0 | n/a | % |
| Mean recall@k | prod-queries | 0.29 | 0.24 | 0.05 | 0.17241 | |
| Mean recall@1 | prod-queries | 0.42 | 0.4 | 0.02 | 0.04762 | |

@0ctopus13prime (Contributor, author):

> Overall the code looks good to me. A couple of things:
>
> 1. Please add unit tests for the C++ layer.
> 2. Add a reference to the benchmarks performed using this code.

@navneet1v Addressed your comments. Can you take a look?

@navneet1v (Collaborator):

@0ctopus13prime please ensure that all CI checks are successful.

@0ctopus13prime (Contributor, author):

It seems KNNCircuitBreakerIT is failing...
I will fix the test and update the commit.
Thanks.

…meter.

Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
Comment on lines +224 to +226
if (KNNEngine.NMSLIB == knnEngine) {
throw new UnsupportedOperationException("Loading NMSLIB with a read stream is not supported at the moment");
}
Contributor:

Seems redundant

Contributor (author):

I had it here to be conservative.
I also thought it would help the reader understand more easily why only FAISS appears in this method.
But yes, NMSLIB won't be passed down to here.

Contributor:

The thing is, the last throw is just dead code as it stands now. I would recommend removing it if the case is handled correctly upstream, in favor of reducing code, since there is no ambiguity here.

@0ctopus13prime (Contributor, author) commented Oct 2, 2024:

I'm not sure I understand "the last throw is just dead code as it stands now".
If we throw there, will the exception be eaten silently?
Hmm... I think KNNWeight would trigger this, no?

jni/include/faiss_stream_support.h (review thread resolved)
jni/tests/faiss_stream_support_test.cpp (review thread resolved)
@@ -547,6 +547,34 @@ void knn_jni::JNIUtil::SetByteArrayRegion(JNIEnv *env, jbyteArray array, jsize s
this->HasExceptionInStack(env, "Unable to set byte array region");
}

jobject knn_jni::JNIUtil::GetObjectField(JNIEnv * env, jobject obj, jfieldID fieldID) {
Contributor:

These seem to be just wrappers over JNIEnv. Is this done for testing purposes? Can we avoid this maintenance?

Contributor (author):

Yes, it is for testing purposes.
I could not come up with a better solution than having it here.
Also, I think the motivation for introducing this class was testing as well, right?

Contributor:

Yeah, but most of the methods here have added logic; it would be nice if we could mock JNIEnv, but it should be fine if that's not possible. One of the reasons for this file is also caching, which is handled by statics in this PR.

Contributor (author):

Unfortunately, in order for gtest to mock a class, its methods have to be virtual...
JNIEnv is a struct, and its defined APIs are not virtual methods.
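
For readers following along, here is a minimal sketch of the wrapper pattern being discussed (the interface and mock names are illustrative, not necessarily the PR's exact ones): gMock can only mock virtual methods, and JNIEnv's functions are non-virtual, so calls are routed through a thin virtual interface that tests can mock while production code forwards to the real JNIEnv.

```cpp
#include <gmock/gmock.h>
#include <jni.h>

// Virtual interface: gMock can only mock virtual methods, so JNIEnv calls
// go through this thin layer instead of being made directly.
class JNIUtilInterface {
public:
    virtual ~JNIUtilInterface() = default;
    virtual jobject GetObjectField(JNIEnv* env, jobject obj, jfieldID fieldID) = 0;
};

// Production implementation: a pass-through to JNIEnv.
class JNIUtil : public JNIUtilInterface {
public:
    jobject GetObjectField(JNIEnv* env, jobject obj, jfieldID fieldID) override {
        return env->GetObjectField(obj, fieldID);
    }
};

// Test double: generated by gMock from the virtual interface.
class MockJNIUtil : public JNIUtilInterface {
public:
    MOCK_METHOD(jobject, GetObjectField, (JNIEnv*, jobject, jfieldID), (override));
};
```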

// Loads an index from a reader implementing faiss::IOReader
//
// Returns a pointer to the loaded index
jlong LoadIndexWithStream(faiss::IOReader* ioReader);
Contributor:

Let's move these methods to the index service?

Contributor (author):

This is the counterpart of LoadIndex, which will be deprecated soon.
I tried not to modify the overall structure too much.

@shatejas (Contributor) commented Oct 2, 2024:

LoadIndex is a pending refactor/move; if there is enough bandwidth, we should put the new code in the index service to start with.

@heemin32 thoughts?

Contributor (author):

Can I make those changes in the upcoming PR?
It will be raised right after this one: streaming support for NMSLIB.
I will ask you to review it.

Collaborator:

Better to have it in the index service. Having it in the next PR is fine with me.

Contributor (author):

Sounds good!
I will move the method into the index service in the next PR, where I will also ask you to review the refactoring.
Thank you both.
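
For orientation, this is roughly the contract LoadIndexWithStream builds on: faiss::read_index can deserialize an index from any faiss::IOReader whose operator() supplies bytes. Below is a minimal sketch using an in-memory source; the class is illustrative only (the PR's actual reader pulls bytes from Lucene's IndexInput through JNI), but the faiss-facing contract is the same.

```cpp
#include <faiss/impl/io.h>
#include <faiss/index_io.h>
#include <algorithm>
#include <cstring>
#include <vector>

// Illustrative IOReader backed by an in-memory buffer. faiss calls
// operator() to fetch size * nitems bytes and expects the number of
// complete items actually read in return.
class InMemoryIOReader : public faiss::IOReader {
public:
    explicit InMemoryIOReader(std::vector<char> data)
        : data_(std::move(data)), offset_(0) {}

    size_t operator()(void* ptr, size_t size, size_t nitems) override {
        if (size == 0) {
            return 0;
        }
        const size_t remaining = data_.size() - offset_;
        const size_t items = std::min(nitems, remaining / size);
        std::memcpy(ptr, data_.data() + offset_, items * size);
        offset_ += items * size;
        return items;
    }

private:
    std::vector<char> data_;
    size_t offset_;
};

// Usage: faiss deserializes the index straight from the reader.
// InMemoryIOReader reader(serializedIndexBytes);
// faiss::Index* index = faiss::read_index(&reader);
```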

Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
@0ctopus13prime (Contributor, author):

Loading Time Comparison

The numbers below were measured with time curl -X GET http://localhost:9200/_plugins/_knn/warmup/target_index.
I ran two experiments loading a FAISS vector index with different buffer sizes:

  1. After dropping all file caches from memory (sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches').
  2. With the file cache in memory.

Conclusion:
The buffer size in InputIndexWithBuffer does not impact loading time, so there is no reason to use more than a 4KB buffer. If anything, a larger buffer costs more space and adds time inside the JNI critical section.

When an index file has just been created and is not yet in the system cache, there is no meaningful difference in loading time between the baseline and the streaming fashion.
When the file is already in the system cache, the baseline (the one using fread) is slightly faster than the streaming fashion (3.584 s vs 4.664 s).
But considering that this case is rare (except when rebooting an engine during a rolling restart), and that in most cases a newly created vector index will be loaded, I don't think it will seriously deteriorate overall performance.
Once an index is loaded, queries are processed against the in-memory data structure, so there was no search performance difference between the baseline and the streaming version. (Refer to the table above for more details.)
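
To make the buffer-size trade-off concrete, here is a sketch of the kind of chunked copy loop where it matters (the mediator object and readBytesMethod are hypothetical stand-ins for the PR's actual Java-side reader): each chunk costs one JNI round trip plus a critical section while the array is pinned, so a larger buffer means fewer round trips but longer pinning, and past a few KB neither term dominates the disk read itself.

```cpp
#include <jni.h>
#include <cstring>

// Hypothetical chunked copy: Java fills `buffer` (e.g. via IndexInput#readBytes),
// then the bytes are copied out inside a JNI critical section. Returns the
// number of bytes copied, or 0 at end of stream.
size_t CopyChunkFromJava(JNIEnv* env, jobject mediator, jmethodID readBytesMethod,
                         jbyteArray buffer, jint bufferSize, char* dst) {
    // One JNI round trip per chunk: ask Java to fill the buffer.
    const jint nread = env->CallIntMethod(mediator, readBytesMethod, buffer, bufferSize);
    if (nread <= 0) {
        return 0;
    }
    // Critical section: pin the array, copy, release without writing back.
    void* src = env->GetPrimitiveArrayCritical(buffer, nullptr);
    if (src == nullptr) {
        return 0;  // Pinning failed; a real implementation would raise an error.
    }
    std::memcpy(dst, src, static_cast<size_t>(nread));
    env->ReleasePrimitiveArrayCritical(buffer, src, JNI_ABORT);
    return static_cast<size_t>(nread);
}
```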

Experiment

Index size: 6.4 GB

1. Baseline (Using fread)

  1. After cache drop : 51.097 seconds
  2. With cache : 3.584 seconds

2. Using Stream

2.1. 4KB

  1. After cache drop : 51.354 seconds
  2. With cache : 4.664 seconds

2.2. 64KB

  1. After cache drop : 51.491 seconds
  2. With cache : 4.318 seconds

2.3. 1M

  1. After cache drop : 51.518 seconds
  2. With cache : 4.201 seconds

Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
@0ctopus13prime (Contributor, author) commented Oct 2, 2024:

@navneet1v
Hi Navneet, I ran a benchmark (including both bulk ingestion and searching) and could not find any evidence of a memory peak during searching.
Please let me know if this looks good to you, and then we can merge!
Thank you.

Streaming

[Screenshot: 2024-10-01 at 7:24:46 PM]

Baseline

[Screenshot: 2024-10-01 at 7:28:20 PM]

Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
@0ctopus13prime (Contributor, author):

Hi @shatejas, I included unit tests in the commit!
Could you take a look?

@navneet1v (Collaborator) commented Oct 2, 2024:

> (Quoting the "Loading Time Comparison" comment above in full.)

Thanks @0ctopus13prime, these were some really good benchmarks, and I like the way you performed them. I really like the idea of dropping the file from the page cache and then running the benchmarks.

But I am not completely sold on the conclusion of sticking to 4KB as the buffer size. I believe we should move towards a somewhat higher buffer size to ensure that we stay as close as possible to cached performance. We have a similar buffer when we transfer vectors from the Java heap to off-heap memory; there we use 1% of the heap. Can we run one test with a higher value, like 1% of the heap on a JVM with a 32GB heap (roughly 328MB)?

Another thing: please paste these benchmarks on the RFC too. These are really good benchmarks.

@navneet1v (Collaborator):

> (Quoting the memory benchmark comment and screenshots above.)

I am a little surprised by this. Could you share more details about your testing strategy and the metrics-gathering technique used to build the graphs above?

Labels: backport 2.x, Enhancements (Increases software capabilities beyond original client specifications)

Successfully merging this pull request may close these issues:

[RFC] Introducing Loading/Writing Layer in Native KNN Engines