
LUCENE-10382: Support filtering in KnnVectorQuery #656

Merged: 21 commits into apache:main from the hnsw-filter branch, Feb 17, 2022

Conversation

jtibshirani (Member):

This PR adds support for a query filter in KnnVectorQuery. First, we gather the
query results for each leaf as a bit set. Then the HNSW search skips over the
non-matching documents (using the same approach as for live docs). To prevent
HNSW search from visiting too many documents when the filter is very selective,
we short-circuit if HNSW has already visited more than the number of documents
that match the filter, and execute an exact search instead. This bounds the
number of visited documents at roughly 2x the cost of just running the exact
filter, while in most cases HNSW completes successfully and does a lot better.

Co-authored-by: Joel Bernstein <jbernste@apache.org>
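The strategy described above can be sketched in standalone form. All names below (AdaptiveKnnSketch, approximateSearch, exactSearch) are illustrative stand-ins, not Lucene's actual API: run the approximate search with a visited-node budget equal to the filter's cardinality, and switch to an exact scan over the filter matches if that budget runs out.

```java
import java.util.*;

class AdaptiveKnnSketch {
    // Result of an approximate search: top docs plus whether the visit budget ran out.
    record Result(int[] topDocs, boolean terminatedEarly) {}

    // Stand-in for HNSW: succeeds only if it can finish within the visit budget.
    static Result approximateSearch(List<Integer> candidates, int k, int visitedLimit) {
        if (candidates.size() > visitedLimit) {
            return new Result(new int[0], true); // budget exhausted; results unusable
        }
        int[] top = candidates.stream().limit(k).mapToInt(Integer::intValue).toArray();
        return new Result(top, false);
    }

    // Stand-in for the brute-force scan over documents matching the filter.
    static int[] exactSearch(List<Integer> filterMatches, int k) {
        return filterMatches.stream().limit(k).mapToInt(Integer::intValue).toArray();
    }

    static int[] search(List<Integer> allCandidates, List<Integer> filterMatches, int k) {
        // Budget = number of filter matches: if HNSW would visit more nodes than the
        // filter has matches, the exact scan is cheaper anyway.
        int visitedLimit = filterMatches.size();
        Result r = approximateSearch(allCandidates, k, visitedLimit);
        return r.terminatedEarly() ? exactSearch(filterMatches, k) : r.topDocs();
    }
}
```

Because the aborted approximate search visits at most `visitedLimit` nodes and the exact scan visits at most the same number, the total stays bounded at roughly 2x the filter cost, matching the bound claimed in the description.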


jtibshirani commented Feb 8, 2022

I tried out the idea of stopping the HNSW search early if it visits too many docs. To test, I modified KnnGraphTester to create acceptDocs uniformly at random with a certain selectivity, then measured recall and QPS. Here are the results on glove-100-angular (~1.2 million docs) with a filter selectivity of 0.01:

Baseline

k        Recall    VisitedDocs     QPS  
10        0.774       15957     232.083
50        0.930       63429      58.994
80        0.958       95175      42.470
100       0.967      118891      35.203
500       0.997     1176237       8.136
800       0.999     1183514       5.571

PR

k        Recall    VisitedDocs     QPS
10        1.000       22908     190.286
50        1.000       23607     152.224
80        1.000       23608     148.036
100       1.000       23608     145.381
500       1.000       23608     138.903
800       1.000       23608     137.882

Since the filter is so selective, HNSW always visits more than 1% of the docs. The adaptive logic in the PR decides to stop the search and switch to an exact search, which bounds the visited docs at 2%. For k=10 this makes the QPS a little worse, but overall prevents QPS from degrading (with the side benefit of perfect recall). I also tested with less restrictive filters, and in these cases the fallback just doesn't kick in, so the QPS remains the same as before.

Overall I like this approach because it doesn't require us to fiddle with thresholds or expose new parameters. It could also help make HNSW more robust in "pathological" cases where even when the filter is not that selective, all the nearest vectors to a query happen to be filtered away.

DocIdSetIterator acceptIterator = null;
int visitedLimit = Integer.MAX_VALUE;

if (acceptDocs instanceof BitSet acceptBitSet) {
jtibshirani (Member Author):

This is a temporary hack since I wasn't sure about the right design. I could see a couple possibilities:

  1. Add a new BitSet filter parameter to searchNearestVectors, keeping the fallback logic within the HNSW classes.
  2. Add a new int visitedLimit parameter to LeafReader#searchNearestVectors. Pull the "exact search" logic up into KnnVectorQuery.

Which option is better probably depends on how other algorithms would handle filtering (which I am not sure about), and also if we think visitedLimit is useful in other contexts.

I also played around with having searchNearestVectors take a Collector and using CollectionTerminatedException... but couldn't really see how this made sense.

Contributor:

I think I have a preference for option 2

  • This feels like a high-level query planning decision, which belongs more to the query API than to the codec API.
  • My gut feeling is that a limit on the number of considered candidates is something that would be generalizable to most NN algorithms.
  • Queries might have better options than a BitSet at times, e.g. if the filter is a IndexSortSortedNumericDocValuesRangeQuery, then you could have both a Bits and DocIdSetIterator view of the matches that do not require materializing a BitSet.
  • Vectors are currently not handled by ExitableDirectoryReader. Option 1 would require adding a BitSet wrapper, while we'd like to keep the number of sub classes of BitSet to exactly 2, a case that the JVM handles better. With option 2 we could go with just a Bits wrapper that would check the timeout whenever Bit#get is called?
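The Bits-wrapper idea in the last bullet could be sketched as follows. This is a hypothetical stand-in, not ExitableDirectoryReader's actual code, and the Bits interface here is a local simplification of Lucene's: delegate get() but check a deadline on every call, so a long-running vector search can be aborted without adding a BitSet subclass.

```java
class TimeLimitingBitsSketch {
    // Simplified stand-in for org.apache.lucene.util.Bits.
    interface Bits {
        boolean get(int index);
        int length();
    }

    // Thrown when the wrapped Bits is consulted after the deadline has passed.
    static class TimeExceededException extends RuntimeException {}

    // Wrap a Bits so that every membership check also enforces a deadline.
    static Bits timeLimited(Bits in, long deadlineNanos) {
        return new Bits() {
            @Override
            public boolean get(int index) {
                if (System.nanoTime() > deadlineNanos) {
                    throw new TimeExceededException();
                }
                return in.get(index);
            }

            @Override
            public int length() {
                return in.length();
            }
        };
    }
}
```

The appeal of this shape is that the timeout check rides along with a call the search must already make for every candidate, so no extra hook into the HNSW code is needed.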

Contributor:

I'm not super-familiar with other algorithms, but it does make sense to me that any approximate algorithm is going to have a "tuning" knob that increases recall in exchange for increased cost. This was the idea behind the now-defunct "fanout" parameter we had in the earlier version of the vector search API. So -- it makes sense to me that we are now bringing back some measure of control over this tuning, albeit in a different form.

Contributor:

@jpountz as always brings up interesting points! - I had no idea we were concerned about the number of subclasses of BitSet, nor was I aware of ExitableDirectoryReader! But I wonder if that should determine the approach here -- should we rely on Bits-based termination, or should we instrument VectorValues?

jtibshirani (Member Author), Feb 8, 2022:

These are all great points. The reasons to prefer option 2 make sense to me (although I'm also not clear on the best strategy for supporting ExitableDirectoryReader). I had a similar intuition to @msokolov that visitedLimit feels like a cost-tradeoff parameter similar to efSearch/ fanout... but I don't yet see how to bridge the gap between these two concepts.

In any case, I feel pretty good about adding a parameter visitedLimit for now. The concept indeed seems general, and we have room to further generalize it later if needed (maybe an approximate costLimit?) or revise it.

msokolov (Contributor) left a comment:

It's gratifying to see the theory worked out in practice. +1 to expose searchExact and allow the Query to call it if the filter is selective. I suppose one case this wouldn't cover cleanly would be when there is a very large number of deleted docs, although that seems kind of pathological and perhaps not worth designing for?

int numVisited = 0;

int doc;
while ((doc = acceptIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
Contributor:

Should we call advance(vectorValues.docID()) here to enable skipping?

jtibshirani (Member Author):

Good point, I'll rework this.
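The skipping suggested above amounts to a leapfrog between the filter iterator and the vector values, so that each side can skip past the other instead of the loop stepping one doc at a time. A sketch, using a simple sorted-array stand-in for DocIdSetIterator (names are illustrative, not Lucene classes):

```java
import java.util.*;

class LeapfrogSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Minimal stand-in for DocIdSetIterator over a sorted list of doc IDs.
    static class SortedDocs {
        final int[] docs;
        int pos = -1;
        SortedDocs(int... docs) { this.docs = docs; }
        int docID() { return pos < 0 ? -1 : pos >= docs.length ? NO_MORE_DOCS : docs[pos]; }
        int nextDoc() { pos++; return docID(); }
        int advance(int target) { // move to the first doc >= target
            while (docID() < target) nextDoc();
            return docID();
        }
    }

    // Visit only docs that match the filter AND have a vector, skipping on both sides.
    static List<Integer> intersect(SortedDocs accept, SortedDocs vectors) {
        List<Integer> out = new ArrayList<>();
        int doc = accept.nextDoc();
        while (doc != NO_MORE_DOCS) {
            int vDoc = vectors.advance(doc);
            if (vDoc == doc) {
                out.add(doc);
                doc = accept.nextDoc();
            } else {
                doc = accept.advance(vDoc); // skip filter matches that have no vector
            }
        }
        return out;
    }
}
```

With advance() on both iterators, the cost is driven by the sparser of the two sets rather than by the filter iterator alone.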

jtibshirani (Member Author):

Thanks for reviewing. I'll work on another iteration and ping you when it's out of "draft" status. One clarification first...

+1 to expose searchExact and allow the Query to call it if it is selective.

I was thinking we would implement exact search within KnnVectorQuery itself, instead of exposing it through LeafReader. I think we already have all the pieces we need through the VectorValues API. What do you think?


msokolov commented Feb 9, 2022

I was thinking we would implement exact search within KnnVectorQuery itself, instead of exposing it through LeafReader. I think we already have all the pieces we need through the VectorValues API. What do you think?

Ah, I misunderstood. I suppose then in theory the logic can work with any vector encoding, which would be good, rather than having to reproduce the same brute-force approach in each of them.

@jtibshirani jtibshirani force-pushed the hnsw-filter branch 2 times, most recently from 73c06fc to a456a76 Compare February 10, 2022 06:41
@jtibshirani jtibshirani marked this pull request as ready for review February 10, 2022 08:04
@@ -147,6 +165,11 @@ NeighborQueue searchLevel(
continue;
}

numVisited++;
if (numVisited > visitedLimit) {
throw new CollectionTerminatedException();
jtibshirani (Member Author):

This may be an abuse of CollectionTerminatedException. Another idea would be to try to pass back the information that the search was terminated early in TopDocs.TotalHits (but this also doesn't seem ideal).

IndexSearcher indexSearcher = new IndexSearcher(reader);
bitSets = new BitSet[reader.leaves().size()];
indexSearcher.search(filter, new BitSetCollector(bitSets));
indexSearcher.search(filter, filterCollector);
Contributor:

For another day, but I am realizing that we have no opportunity to make use of per-segment concurrency here, as we ordinarily do in IndexSearcher.search(). To do so, we'd need to consider some API change, though. Perhaps instead of using rewrite for this, we could make use of Query's two-phase iteration mode of operation. Just a thought for later; I'll go open an issue elsewhere.

return ctx.reader().searchNearestVectors(field, target, kPerLeaf, acceptDocs, visitedLimit);
} catch (
@SuppressWarnings("unused")
CollectionTerminatedException e) {
Contributor:

I could go either way with this one. I tend to lean towards using TopDocs.totalHits.value since we already use it for returning visited counts; we could return with a null or maybe empty scoreDocs in that case? Or perhaps there could be a use case for returning the "best effort" results obtained by visiting a limited subset of the graph, and we should in fact marshal up the results. Generally I don't favor using Exceptions for expected behavior, but also I think if we do choose this pattern we should create a new Exception type just for this case.

mayya-sharipova (Contributor), Feb 10, 2022:

I agree, and I also prefer not to throw an exception if possible; throwing an exception is expensive compared with simply returning a value.

jtibshirani (Member Author):

I agree, it's nice to avoid using exceptions for normal control flow. I'm not too concerned from a performance perspective though, exceptions aren't thrown in a "hot loop" and I didn't see a perf hit in testing.

If we go the route of using TopDocs, I'd prefer to avoid 'null' since that's a bit overloaded (indicates the field is missing or does not have vectors). Brainstorming ideas:

  • Just return EMPTY_TOPDOCS.
  • Still return best score docs and the visited count. But use EQUAL_TO for TotalHits.Relation if the search completed normally, otherwise use GREATER_THAN_OR_EQUAL_TO.
  • Use a special subtype of TopDocs instead, which has an explicit "complete" flag?

Contributor:

I very much like the idea of "a special subtype of TopDocs with an explicit 'complete' flag".

jtibshirani (Member Author):

@msokolov @jpountz @mayya-sharipova this is ready for another look. Notable changes:

  • When computing the filter results, only include documents that actually contain a vector. This gives an accurate estimate of the filter selectivity. To support this I introduced KnnVectorFieldExistsQuery, which seemed useful in its own right.
  • I stopped using CollectionTerminatedException to indicate that the search hit the visited limit. Instead, we pass the information in TopDocs through TotalHits. The value is always the number of visited docs, but the relation is GREATER_THAN_OR_EQUAL_TO iff the search stopped early. This is kind of arbitrary but felt natural -- I'm very open to suggestions here! It's a fairly low-level API and it's marked experimental, so there is also room to refine it later. This update does not change the output of KnnVectorQuery.
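The convention in the second bullet can be sketched with simplified stand-ins for Lucene's TotalHits types (names illustrative): the value carries the visited count, and the relation alone signals whether the search stopped at the limit.

```java
class TotalHitsConventionSketch {
    enum Relation { EQUAL_TO, GREATER_THAN_OR_EQUAL_TO }

    // Simplified stand-in for org.apache.lucene.search.TotalHits.
    record TotalHits(long value, Relation relation) {}

    // value = visited docs; relation = GREATER_THAN_OR_EQUAL_TO iff stopped early.
    static TotalHits forVisited(long numVisited, boolean stoppedEarly) {
        return new TotalHits(
            numVisited,
            stoppedEarly ? Relation.GREATER_THAN_OR_EQUAL_TO : Relation.EQUAL_TO);
    }

    // A caller can distinguish a complete search without a separate flag.
    static boolean completedNormally(TotalHits hits) {
        return hits.relation() == Relation.EQUAL_TO;
    }
}
```

This keeps the signal out of exceptions and out of a null sentinel, at the cost of overloading the relation field, which is why the thread treats it as open to later refinement.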

msokolov (Contributor) left a comment:

Just a couple of minor comments. Thanks!

* <p>The returned {@link TopDocs} will contain a {@link ScoreDoc} for each nearest neighbor, in
* order of their similarity to the query vector (decreasing scores). The {@link TotalHits}
* contains the number of documents visited during the search. If the search stopped early because
* it hit {@code visitedLimit}, it is indicated through the relation {@code
Contributor:

Would it be enough to know that TopDocs.totalHits.value==visitedLimit? Do we need to use the relation as a sentinel?

jtibshirani (Member Author):

It's possible (but unlikely) that the search completed normally with exactly numVisited == visitedLimit. The visitedLimit is inclusive. To me it felt more solid and obvious to use the relation.

jtibshirani (Member Author):

Thanks for the review! I'll wait for the others to take a look too. I'm working on adding kNN with filtering to luceneutil.

mayya-sharipova (Contributor) left a comment:

@jtibshirani Thanks Julie, this is a great enhancement! I've left some minor comments, but overall this looks fantastic!

KnnVectorQuery query = new KnnVectorQuery("field", randomVector(dimension), 5, filter);
TopDocs results = searcher.search(query, numDocs);
assertEquals(TotalHits.Relation.EQUAL_TO, results.totalHits.relation);
assertEquals(5, results.totalHits.value);
Contributor:

How do we know that we used the exact search? Are we judging by the equality of results.totalHits.value and results.scoreDocs.length? I guess in most cases this is true.

Another idea is always use TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO for the approximate search results as returned in KnnVectorQuery.searchLeaf:

TopDocs results = approximateSearch(ctx, acceptDocs, visitedLimit);
if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO) {
  return <results with Relation.GREATER_THAN_OR_EQUAL_TO>;
} else {
  ...
}

jtibshirani (Member Author):

Thanks for catching this. I actually got confused here and wrote test assertions that are misleading. Since KnnVectorQuery is rewritten to DocAndScoreQuery, none of the information about visited nodes is preserved. Therefore we can't tell if exact or approximate search was used. I will rework this test.

I will open a follow-up issue to discuss this. I don't feel like we have a perfect grasp on what total hits should mean in the context of kNN search, especially since it differs between LeafReader#searchNearestVectors and the output of KnnVectorQuery.

Contributor:

Thanks for the explanation, I missed a part about rewriting to DocAndScoreQuery.

mayya-sharipova (Contributor) left a comment:

@jtibshirani Thanks Julie, the new way to assert exact search in TestKnnVectorQuery LGTM.

@jtibshirani jtibshirani merged commit 8ca3725 into apache:main Feb 17, 2022
@jtibshirani jtibshirani deleted the hnsw-filter branch February 17, 2022 19:35
jtibshirani added a commit that referenced this pull request Feb 17, 2022
wjp719 added a commit to wjp719/lucene that referenced this pull request Feb 23, 2022
* main:
  migrate to temurin (apache#697)
  LUCENE-10424: Optimize the "everything matches" case for count query in PointRangeQuery (apache#691)
  LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori
  Remove deprecated constructors in Nori (apache#695)
  LUCENE-10400: revise binary dictionaries' constructor in nori (apache#693)
  LUCENE-10408: Fix vector values iteration bug (apache#690)
  Temporarily mute TestKnnVectorQuery#testRandomWithFilter
  LUCENE-10382: Support filtering in KnnVectorQuery (apache#656)
  LUCENE-10084: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery when all docs have the field (apache#677)
  Add CHANGES entry for LUCENE-10398
  LUCENE-10398: Add static method for getting Terms from LeafReader (apache#678)
  LUCENE-10408 Better encoding of doc Ids in vectors (apache#649)
  LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count. (apache#685)
dantuzi pushed a commit to SeaseLtd/lucene that referenced this pull request Mar 10, 2022
6 participants