
LUCENE-10382: Support filtering in KnnVectorQuery #656

Merged: 21 commits into apache:main from the hnsw-filter branch, Feb 17, 2022

Conversation

jtibshirani (Member):

This PR adds support for a query filter in KnnVectorQuery. First, we gather the
query results for each leaf as a bit set. Then the HNSW search skips over the
non-matching documents (using the same approach as for live docs). To prevent
HNSW search from visiting too many documents when the filter is very selective,
we short-circuit if HNSW has already visited more than the number of documents
that match the filter, and execute an exact search instead. This bounds the
number of visited documents at roughly 2x the cost of just running the exact
filter, while in most cases HNSW completes successfully and does a lot better.

Co-authored-by: Joel Bernstein <jbernste@apache.org>
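The strategy described above can be sketched in standalone form. All names below (AdaptiveKnnSketch, approximateSearch, exactSearch) are illustrative stand-ins, not Lucene's actual API: run the approximate search with a visited-node budget equal to the filter's cardinality, and switch to an exact scan over the filter matches if that budget runs out.

```java
import java.util.*;

class AdaptiveKnnSketch {
    // Result of an approximate search: top docs plus whether the visit budget ran out.
    record Result(int[] topDocs, boolean terminatedEarly) {}

    // Stand-in for HNSW: succeeds only if it can finish within the visit budget.
    static Result approximateSearch(List<Integer> candidates, int k, int visitedLimit) {
        if (candidates.size() > visitedLimit) {
            return new Result(new int[0], true); // budget exhausted; results unusable
        }
        int[] top = candidates.stream().limit(k).mapToInt(Integer::intValue).toArray();
        return new Result(top, false);
    }

    // Stand-in for the brute-force scan over documents matching the filter.
    static int[] exactSearch(List<Integer> filterMatches, int k) {
        return filterMatches.stream().limit(k).mapToInt(Integer::intValue).toArray();
    }

    static int[] search(List<Integer> allCandidates, List<Integer> filterMatches, int k) {
        // Budget = number of filter matches: if HNSW would visit more nodes than the
        // filter has matches, the exact scan is cheaper anyway.
        int visitedLimit = filterMatches.size();
        Result r = approximateSearch(allCandidates, k, visitedLimit);
        return r.terminatedEarly() ? exactSearch(filterMatches, k) : r.topDocs();
    }
}
```

Because the aborted approximate search visits at most `visitedLimit` nodes and the exact scan visits at most the same number, the total stays bounded at roughly 2x the filter cost, matching the bound claimed in the description.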


jtibshirani commented Feb 8, 2022

I tried out the idea of stopping the HNSW search early if it visits too many docs. To test, I modified KnnGraphTester to create acceptDocs uniformly at random with a certain selectivity, then measured recall and QPS. Here are the results on glove-100-angular (~1.2 million docs) with a filter selectivity of 0.01:

Baseline

k        Recall    VisitedDocs     QPS  
10        0.774       15957     232.083
50        0.930       63429      58.994
80        0.958       95175      42.470
100       0.967      118891      35.203
500       0.997     1176237       8.136
800       0.999     1183514       5.571

PR

k        Recall    VisitedDocs     QPS
10        1.000       22908     190.286
50        1.000       23607     152.224
80        1.000       23608     148.036
100       1.000       23608     145.381
500       1.000       23608     138.903
800       1.000       23608     137.882

Since the filter is so selective, HNSW always visits more than 1% of the docs. The adaptive logic in the PR decides to stop the search and switch to an exact search, which bounds the visited docs at 2%. For k=10 this makes the QPS a little worse, but overall prevents QPS from degrading (with the side benefit of perfect recall). I also tested with less restrictive filters, and in these cases the fallback just doesn't kick in, so the QPS remains the same as before.

Overall I like this approach because it doesn't require us to fiddle with thresholds or expose new parameters. It could also help make HNSW more robust in "pathological" cases where even when the filter is not that selective, all the nearest vectors to a query happen to be filtered away.

DocIdSetIterator acceptIterator = null;
int visitedLimit = Integer.MAX_VALUE;

if (acceptDocs instanceof BitSet acceptBitSet) {
jtibshirani (Member Author):

This is a temporary hack since I wasn't sure about the right design. I could see a couple possibilities:

  1. Add a new BitSet filter parameter to searchNearestVectors, keeping the fallback logic within the HNSW classes.
  2. Add a new int visitedLimit parameter to LeafReader#searchNearestVectors. Pull the "exact search" logic up into KnnVectorQuery.

Which option is better probably depends on how other algorithms would handle filtering (which I am not sure about), and also if we think visitedLimit is useful in other contexts.

I also played around with having searchNearestVectors take a Collector and using CollectionTerminatedException... but couldn't really see how this made sense.

Contributor:

I think I have a preference for option 2

  • This feels like a high-level query planning decision, which belongs more to the query API than to the codec API.
  • My gut feeling is that a limit on the number of considered candidates is something that would be generalizable to most NN algorithms.
  • Queries might have better options than a BitSet at times, e.g. if the filter is a IndexSortSortedNumericDocValuesRangeQuery, then you could have both a Bits and DocIdSetIterator view of the matches that do not require materializing a BitSet.
  • Vectors are currently not handled by ExitableDirectoryReader. Option 1 would require adding a BitSet wrapper, while we'd like to keep the number of sub classes of BitSet to exactly 2, a case that the JVM handles better. With option 2 we could go with just a Bits wrapper that would check the timeout whenever Bit#get is called?
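The Bits-wrapper idea in the last bullet could be sketched as follows. This is a hypothetical stand-in, not ExitableDirectoryReader's actual code, and the Bits interface here is a local simplification of Lucene's: delegate get() but check a deadline on every call, so a long-running vector search can be aborted without adding a BitSet subclass.

```java
class TimeLimitingBitsSketch {
    // Simplified stand-in for org.apache.lucene.util.Bits.
    interface Bits {
        boolean get(int index);
        int length();
    }

    // Thrown when the wrapped Bits is consulted after the deadline has passed.
    static class TimeExceededException extends RuntimeException {}

    // Wrap a Bits so that every membership check also enforces a deadline.
    static Bits timeLimited(Bits in, long deadlineNanos) {
        return new Bits() {
            @Override
            public boolean get(int index) {
                if (System.nanoTime() > deadlineNanos) {
                    throw new TimeExceededException();
                }
                return in.get(index);
            }

            @Override
            public int length() {
                return in.length();
            }
        };
    }
}
```

The appeal of this shape is that the timeout check rides along with a call the search must already make for every candidate, so no extra hook into the HNSW code is needed.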

Contributor:

I'm not super-familiar with other algorithms, but it does make sense to me that any approximate algorithm is going to have a "tuning" knob that increases recall in exchange for increased cost. This was the idea behind the now-defunct "fanout" parameter we had in the earlier version of the vector search API. So -- it makes sense to me that we are now bringing back some measure of control over this tuning, albeit in a different form.

Contributor:

@jpountz as always brings up interesting points! - I had no idea we were concerned about the number of subclasses of BitSet, nor was I aware of ExitableDirectoryReader! But I wonder if that should determine the approach here -- should we rely on Bits-based termination, or should we instrument VectorValues?

jtibshirani (Member Author), Feb 8, 2022:

These are all great points. The reasons to prefer option 2 make sense to me (although I'm also not clear on the best strategy for supporting ExitableDirectoryReader). I had a similar intuition to @msokolov that visitedLimit feels like a cost-tradeoff parameter similar to efSearch/ fanout... but I don't yet see how to bridge the gap between these two concepts.

In any case, I feel pretty good about adding a parameter visitedLimit for now. The concept indeed seems general, and we have room to further generalize it later if needed (maybe an approximate costLimit?) or revise it.

msokolov (Contributor) left a comment:

It's gratifying to see the theory worked out in practice. +1 to expose searchExact and allow the Query to call it if the filter is selective. I suppose one case this wouldn't cover cleanly would be when there is a very large number of deleted docs, although that seems kind of pathological and perhaps not worth designing for?

int numVisited = 0;

int doc;
while ((doc = acceptIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
Contributor:

Should we call advance(vectorValues.docID()) here to enable skipping?

jtibshirani (Member Author):

Good point, I'll rework this.
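The skipping suggested above amounts to a leapfrog between the filter iterator and the vector values, so that each side can skip past the other instead of the loop stepping one doc at a time. A sketch, using a simple sorted-array stand-in for DocIdSetIterator (names are illustrative, not Lucene classes):

```java
import java.util.*;

class LeapfrogSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Minimal stand-in for DocIdSetIterator over a sorted list of doc IDs.
    static class SortedDocs {
        final int[] docs;
        int pos = -1;
        SortedDocs(int... docs) { this.docs = docs; }
        int docID() { return pos < 0 ? -1 : pos >= docs.length ? NO_MORE_DOCS : docs[pos]; }
        int nextDoc() { pos++; return docID(); }
        int advance(int target) { // move to the first doc >= target
            while (docID() < target) nextDoc();
            return docID();
        }
    }

    // Visit only docs that match the filter AND have a vector, skipping on both sides.
    static List<Integer> intersect(SortedDocs accept, SortedDocs vectors) {
        List<Integer> out = new ArrayList<>();
        int doc = accept.nextDoc();
        while (doc != NO_MORE_DOCS) {
            int vDoc = vectors.advance(doc);
            if (vDoc == doc) {
                out.add(doc);
                doc = accept.nextDoc();
            } else {
                doc = accept.advance(vDoc); // skip filter matches that have no vector
            }
        }
        return out;
    }
}
```

With advance() on both iterators, the cost is driven by the sparser of the two sets rather than by the filter iterator alone.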

jtibshirani (Member Author):

Thanks for reviewing. I'll work on another iteration and ping you when it's out of "draft" status. One clarification first...

+1 to expose searchExact and allow the Query to call it if it is selective.

I was thinking we would implement exact search within KnnVectorQuery itself, instead of exposing it through LeafReader. I think we already have all the pieces we need through the VectorValues API. What do you think?


msokolov commented Feb 9, 2022

I was thinking we would implement exact search within KnnVectorQuery itself, instead of exposing it through LeafReader. I think we already have all the pieces we need through the VectorValues API. What do you think?

Ah, I misunderstood. I suppose then in theory the logic can work with any vector encoding, which would be good, rather than having to reproduce the same brute-force approach in each of them.

@jtibshirani jtibshirani force-pushed the hnsw-filter branch 2 times, most recently from 73c06fc to a456a76 Compare February 10, 2022 06:41
@jtibshirani jtibshirani marked this pull request as ready for review February 10, 2022 08:04
@@ -147,6 +165,11 @@ NeighborQueue searchLevel(
continue;
}

numVisited++;
if (numVisited > visitedLimit) {
throw new CollectionTerminatedException();
jtibshirani (Member Author):

This may be an abuse of CollectionTerminatedException. Another idea would be to try to pass back the information that the search was terminated early in TopDocs.TotalHits (but this also doesn't seem ideal).

IndexSearcher indexSearcher = new IndexSearcher(reader);
bitSets = new BitSet[reader.leaves().size()];
indexSearcher.search(filter, new BitSetCollector(bitSets));
indexSearcher.search(filter, filterCollector);
Contributor:

For another day, but I am realizing that we have no opportunity to make use of per-segment concurrency here, as we ordinarily do in IndexSearcher.search(). To do so, we'd need to consider some API change, though. Perhaps instead of using rewrite for this, we could make use of Query's two-phase iteration mode of operation. Just a thought for later; I'll go open an issue elsewhere.

return ctx.reader().searchNearestVectors(field, target, kPerLeaf, acceptDocs, visitedLimit);
} catch (
@SuppressWarnings("unused")
CollectionTerminatedException e) {
Contributor:

I could go either way with this one. I tend to lean towards using TopDocs.totalHits.value since we already use it for returning visited counts; we could return with a null or maybe empty scoreDocs in that case? Or perhaps there could be a use case for returning the "best effort" results obtained by visiting a limited subset of the graph, and we should in fact marshal up the results. Generally I don't favor using Exceptions for expected behavior, but also I think if we do choose this pattern we should create a new Exception type just for this case.

mayya-sharipova (Contributor), Feb 10, 2022:

I agree, and I also prefer not to throw an exception if possible; throwing an exception is expensive compared with simply returning a value.

jtibshirani (Member Author):

I agree, it's nice to avoid using exceptions for normal control flow. I'm not too concerned from a performance perspective though, exceptions aren't thrown in a "hot loop" and I didn't see a perf hit in testing.

If we go the route of using TopDocs, I'd prefer to avoid 'null' since that's a bit overloaded (indicates the field is missing or does not have vectors). Brainstorming ideas:

  • Just return EMPTY_TOPDOCS.
  • Still return best score docs and the visited count. But use EQUAL_TO for TotalHits.Relation if the search completed normally, otherwise use GREATER_THAN_OR_EQUAL_TO.
  • Use a special subtype of TopDocs instead, which has an explicit "complete" flag?

Contributor:

I very much like the idea of "a special subtype of TopDocs with an explicit 'complete' flag".

jtibshirani (Member Author):

@msokolov @jpountz @mayya-sharipova this is ready for another look. Notable changes:

  • When computing the filter results, only include documents that actually contain a vector. This gives an accurate estimate of the filter selectivity. To support this I introduced KnnVectorFieldExistsQuery, which seemed useful in its own right.
  • I stopped using CollectionTerminatedException to indicate that the search hit the visited limit. Instead, we pass the information in TopDocs through TotalHits. The value is always the number of visited docs, but the relation is GREATER_THAN_OR_EQUAL_TO iff the search stopped early. This is kind of arbitrary but felt natural -- I'm very open to suggestions here! It's a fairly low-level API and it's marked experimental, so there is also room to refine it later. This update does not change the output of KnnVectorQuery.
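The convention in the second bullet can be sketched with simplified stand-ins for Lucene's TotalHits types (names illustrative): the value carries the visited count, and the relation alone signals whether the search stopped at the limit.

```java
class TotalHitsConventionSketch {
    enum Relation { EQUAL_TO, GREATER_THAN_OR_EQUAL_TO }

    // Simplified stand-in for org.apache.lucene.search.TotalHits.
    record TotalHits(long value, Relation relation) {}

    // value = visited docs; relation = GREATER_THAN_OR_EQUAL_TO iff stopped early.
    static TotalHits forVisited(long numVisited, boolean stoppedEarly) {
        return new TotalHits(
            numVisited,
            stoppedEarly ? Relation.GREATER_THAN_OR_EQUAL_TO : Relation.EQUAL_TO);
    }

    // A caller can distinguish a complete search without a separate flag.
    static boolean completedNormally(TotalHits hits) {
        return hits.relation() == Relation.EQUAL_TO;
    }
}
```

This keeps the signal out of exceptions and out of a null sentinel, at the cost of overloading the relation field, which is why the thread treats it as open to later refinement.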

msokolov (Contributor) left a comment:

Just a couple of minor comments. Thanks!

* <p>The returned {@link TopDocs} will contain a {@link ScoreDoc} for each nearest neighbor, in
* order of their similarity to the query vector (decreasing scores). The {@link TotalHits}
* contains the number of documents visited during the search. If the search stopped early because
* it hit {@code visitedLimit}, it is indicated through the relation {@code
Contributor:

Would it be enough to know that TopDocs.totalHits.value==visitedLimit? Do we need to use the relation as a sentinel?

jtibshirani (Member Author):

It's possible (but unlikely) that the search completed normally with exactly numVisited == visitedLimit. The visitedLimit is inclusive. To me it felt more solid and obvious to use the relation.

jtibshirani (Member Author):

Thanks for the review! I'll wait for the others to take a look too. I'm working on adding kNN with filtering to luceneutil.

mayya-sharipova (Contributor) left a comment:

@jtibshirani Thanks Julie, this is a great enhancement! I've left some minor comments, but overall this looks fantastic!

KnnVectorQuery query = new KnnVectorQuery("field", randomVector(dimension), 5, filter);
TopDocs results = searcher.search(query, numDocs);
assertEquals(TotalHits.Relation.EQUAL_TO, results.totalHits.relation);
assertEquals(5, results.totalHits.value);
Contributor:

How do we know that we used the exact search? Are we judging by the equality of results.totalHits.value and results.scoreDocs.length? I guess in most cases this is true.

Another idea is always use TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO for the approximate search results as returned in KnnVectorQuery.searchLeaf:

TopDocs results = approximateSearch(ctx, acceptDocs, visitedLimit);
if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO) {
  return <results with Relation.GREATER_THAN_OR_EQUAL_TO>;
} else {
  ...
}

jtibshirani (Member Author):

Thanks for catching this. I actually got confused here and wrote test assertions that are misleading. Since KnnVectorQuery is rewritten to DocAndScoreQuery, none of the information about visited nodes is preserved. Therefore we can't tell if exact or approximate search was used. I will rework this test.

I will open a follow-up issue to discuss this. I don't feel like we have a perfect grasp on what total hits should mean in the context of kNN search, especially since it differs between LeafReader#searchNearestVectors and the output of KnnVectorQuery.

Contributor:

Thanks for the explanation, I missed a part about rewriting to DocAndScoreQuery.

mayya-sharipova (Contributor) left a comment:

@jtibshirani Thanks Julie, the new way to assert exact search in TestKnnVectorQuery LGTM.

@jtibshirani jtibshirani merged commit 8ca3725 into apache:main Feb 17, 2022
@jtibshirani jtibshirani deleted the hnsw-filter branch February 17, 2022 19:35
jtibshirani added a commit that referenced this pull request Feb 17, 2022
wjp719 added a commit to wjp719/lucene that referenced this pull request Feb 23, 2022
* main:
  migrate to temurin (apache#697)
  LUCENE-10424: Optimize the "everything matches" case for count query in PointRangeQuery (apache#691)
  LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori
  Remove deprecated constructors in Nori (apache#695)
  LUCENE-10400: revise binary dictionaries' constructor in nori (apache#693)
  LUCENE-10408: Fix vector values iteration bug (apache#690)
  Temporarily mute TestKnnVectorQuery#testRandomWithFilter
  LUCENE-10382: Support filtering in KnnVectorQuery (apache#656)
  LUCENE-10084: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery when all docs have the field (apache#677)
  Add CHANGES entry for LUCENE-10398
  LUCENE-10398: Add static method for getting Terms from LeafReader (apache#678)
  LUCENE-10408 Better encoding of doc Ids in vectors (apache#649)
  LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count. (apache#685)
dantuzi pushed a commit to SeaseLtd/lucene that referenced this pull request Mar 10, 2022
6 participants