LUCENE-10382: Support filtering in KnnVectorQuery #656
Conversation
I tried out the idea of stopping the HNSW search early if it visits too many docs. To test, I modified the benchmark and compared the Baseline against this PR.

Since the filter is so selective, HNSW always visits more than 1% of the docs. The adaptive logic in the PR decides to stop the search and switch to an exact search, which bounds the visited docs at 2%. Overall I like this approach because it doesn't require us to fiddle with thresholds or expose new parameters. It could also help make HNSW more robust in "pathological" cases where, even when the filter is not that selective, all the nearest vectors to a query happen to be filtered away.
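The adaptive strategy described above can be sketched outside of Lucene. This is a minimal toy version, not the PR's actual code: all names (`FilteredKnnSketch`, `shouldFallBackToExact`, `exactSearch`) are hypothetical, the similarity is a plain dot product, and the "filter" is just a boolean array standing in for a bit set. The idea is that the approximate search gets a visited budget equal to the filter's cardinality, and if the budget is blown, a brute-force scan over just the filtered docs takes over:

```java
import java.util.PriorityQueue;

public class FilteredKnnSketch {

  /** The adaptive rule: give up on HNSW once it has visited more docs than match the filter. */
  static boolean shouldFallBackToExact(int numVisited, int filterCost) {
    return numVisited > filterCost;
  }

  static float dotProduct(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Brute-force top-k over only the docs the filter accepts, best first. */
  static int[] exactSearch(float[][] vectors, boolean[] accept, float[] query, int k) {
    // Min-heap on score: the root is the weakest current hit and is evicted first.
    PriorityQueue<float[]> heap = new PriorityQueue<>((x, y) -> Float.compare(x[1], y[1]));
    for (int doc = 0; doc < vectors.length; doc++) {
      if (accept[doc] == false) {
        continue; // the filter rejects this doc, never score it
      }
      heap.offer(new float[] {doc, dotProduct(query, vectors[doc])});
      if (heap.size() > k) {
        heap.poll();
      }
    }
    int[] topDocs = new int[heap.size()];
    for (int i = topDocs.length - 1; i >= 0; i--) {
      topDocs[i] = (int) heap.poll()[0];
    }
    return topDocs;
  }

  public static void main(String[] args) {
    float[][] vectors = {{1, 0}, {0, 1}, {0.9f, 0.1f}, {0.5f, 0.5f}};
    boolean[] accept = {false, true, true, true}; // doc 0 is filtered out
    int[] top2 = exactSearch(vectors, accept, new float[] {1, 0}, 2);
    System.out.println(top2[0] + " " + top2[1]); // nearest accepted docs, best first
  }
}
```

The cost bound follows directly: the approximate phase visits at most `filterCost` docs before giving up, and the exact phase scores exactly the `filterCost` accepted docs, hence roughly 2x the cost of just running the filter.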
DocIdSetIterator acceptIterator = null;
int visitedLimit = Integer.MAX_VALUE;

if (acceptDocs instanceof BitSet acceptBitSet) {
This is a temporary hack since I wasn't sure about the right design. I could see a couple possibilities:

- Add a new `BitSet filter` parameter to `searchNearestVectors`, keeping the fallback logic within the HNSW classes.
- Add a new `int visitedLimit` parameter to `LeafReader#searchNearestVectors`, and pull the "exact search" logic up into `KnnVectorQuery`.

Which option is better probably depends on how other algorithms would handle filtering (which I am not sure about), and also on whether we think `visitedLimit` is useful in other contexts.

I also played around with having `searchNearestVectors` take a `Collector` and using `CollectionTerminatedException`... but couldn't really see how this made sense.
I think I have a preference for option 2:

- This feels like a high-level query planning decision, which belongs more to the query API than to the codec API.
- My gut feeling is that a limit on the number of considered candidates is something that would be generalizable to most NN algorithms.
- Queries might have better options than a BitSet at times; e.g. if the filter is an `IndexSortSortedNumericDocValuesRangeQuery`, then you could have both a `Bits` and a `DocIdSetIterator` view of the matches that do not require materializing a BitSet.
- Vectors are currently not handled by `ExitableDirectoryReader`. Option 1 would require adding a BitSet wrapper, while we'd like to keep the number of subclasses of `BitSet` to exactly 2, a case that the JVM handles better. With option 2 we could go with just a `Bits` wrapper that would check the timeout whenever `Bits#get` is called?
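The `Bits` wrapper mentioned in the last bullet can be sketched in a few lines. This is a toy illustration, not `ExitableDirectoryReader`'s actual code: the `Bits` interface here is a minimal stand-in for Lucene's, and `withDeadline` is a hypothetical name. Every `get()` call checks a clock against a deadline, so a timeout surfaces inside the vector search without any new `BitSet` subclass:

```java
import java.util.function.LongSupplier;

public class TimedBits {

  /** Minimal stand-in for Lucene's Bits interface. */
  interface Bits {
    boolean get(int index);

    int length();
  }

  /** Wraps a delegate so that every get() call checks a deadline first. */
  static Bits withDeadline(Bits delegate, LongSupplier clockMillis, long deadlineMillis) {
    return new Bits() {
      @Override
      public boolean get(int index) {
        if (clockMillis.getAsLong() > deadlineMillis) {
          throw new RuntimeException("query timed out");
        }
        return delegate.get(index);
      }

      @Override
      public int length() {
        return delegate.length();
      }
    };
  }

  public static void main(String[] args) {
    boolean[] accepted = {true, false, true};
    Bits raw =
        new Bits() {
          @Override
          public boolean get(int index) {
            return accepted[index];
          }

          @Override
          public int length() {
            return accepted.length;
          }
        };
    long[] fakeClock = {0}; // mutable "current time" for the demo
    Bits timed = withDeadline(raw, () -> fakeClock[0], 10);
    System.out.println(timed.get(0)); // before the deadline: delegates normally
    fakeClock[0] = 11;                // move past the deadline
    try {
      timed.get(1);
    } catch (RuntimeException e) {
      System.out.println("timed out");
    }
  }
}
```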
I'm not super-familiar with other algorithms, but it does make sense to me that any approximate algorithm is going to have a "tuning" knob that increases recall in exchange for increased cost. This was the idea behind the now-defunct "fanout" parameter we had in the earlier version of the vector search API. So -- it makes sense to me that we are now bringing back some measure of control over this tuning, albeit in a different form.
@jpountz as always brings up interesting points! I had no idea we were concerned about the number of subclasses of BitSet, nor was I aware of ExitableDirectoryReader! But I wonder if that should determine the approach here -- should we rely on Bits-based termination, or should we instrument `VectorValues`?
These are all great points. The reasons to prefer option 2 make sense to me (although I'm also not clear on the best strategy for supporting ExitableDirectoryReader). I had a similar intuition to @msokolov that `visitedLimit` feels like a cost-tradeoff parameter similar to efSearch/fanout... but I don't yet see how to bridge the gap between these two concepts.

In any case, I feel pretty good about adding a parameter `visitedLimit` for now. The concept indeed seems general, and we have room to further generalize it later if needed (maybe an approximate `costLimit`?) or revise it.
It's gratifying to see the theory worked out in practice. +1 to expose searchExact and allow the Query to call it if it is selective. I suppose one case this wouldn't cover cleanly would be when there is a very large number of deleted docs, although that seems kind of pathological and perhaps not worth designing for?
int numVisited = 0;

int doc;
while ((doc = acceptIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
Should we call `advance(vectorValues.docID())` here to enable skipping?
Good point, I'll rework this.
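The skipping suggested above is the classic "leapfrog" intersection of two sorted doc-ID streams: instead of stepping the filter iterator one doc at a time, either side can jump ahead with `advance()`. A self-contained toy version follows, where `SortedDocs` is a hypothetical stand-in for `DocIdSetIterator`, not Lucene code:

```java
import java.util.ArrayList;
import java.util.List;

public class Leapfrog {

  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Minimal DocIdSetIterator-like cursor over a sorted doc-ID array. */
  static class SortedDocs {
    private final int[] docs;
    private int pos = -1;

    SortedDocs(int... docs) {
      this.docs = docs;
    }

    int docID() {
      return pos >= docs.length ? NO_MORE_DOCS : docs[pos];
    }

    /** Advances past the current position to the first doc >= target. */
    int advance(int target) {
      pos++;
      while (pos < docs.length && docs[pos] < target) {
        pos++;
      }
      return docID();
    }
  }

  /** Docs present in both streams, found by mutual advance() (leapfrog join). */
  static List<Integer> intersect(SortedDocs a, SortedDocs b) {
    List<Integer> matches = new ArrayList<>();
    int docA = a.advance(0);
    int docB = b.advance(0);
    while (docA != NO_MORE_DOCS && docB != NO_MORE_DOCS) {
      if (docA == docB) {
        matches.add(docA);
        docA = a.advance(docA + 1);
      } else if (docA < docB) {
        docA = a.advance(docB); // skip ahead instead of stepping one doc at a time
      } else {
        docB = b.advance(docA);
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    List<Integer> both = intersect(new SortedDocs(1, 3, 5, 7, 9), new SortedDocs(2, 3, 7, 8));
    System.out.println(both); // docs accepted by both streams
  }
}
```

When one stream is much sparser than the other (a selective filter against dense vector values), this lets the dense side skip whole runs of doc IDs rather than testing each one.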
Thanks for reviewing. I'll work on another iteration and ping you when it's out of "draft" status. One clarification first...
I was thinking we would implement exact search within
Ah I misunderstood. I suppose then in theory the logic can work with any Vector encoding, which would be good, rather than having to reproduce the same brute force approach in each of them.
Force-pushed 73c06fc to a456a76 (compare)
@@ -147,6 +165,11 @@ NeighborQueue searchLevel(
        continue;
      }

      numVisited++;
      if (numVisited > visitedLimit) {
        throw new CollectionTerminatedException();
This may be an abuse of `CollectionTerminatedException`. Another idea would be to try to pass back the information that the search was terminated early in `TopDocs.TotalHits` (but this also doesn't seem ideal).
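The exception-free alternative raised here can be sketched as a search loop that returns a small result object carrying an explicit "incomplete" marker instead of throwing. This is a toy illustration with hypothetical names (`GraphSearchResult`, `searchWithBudget`), not the PR's code:

```java
public class VisitedLimitSketch {

  /** Result carrying the visited count plus an explicit completeness flag. */
  record GraphSearchResult(int numVisited, boolean incomplete) {}

  /**
   * Visits up to visitedLimit candidates (inclusive) and reports early
   * termination through the result object rather than an exception.
   */
  static GraphSearchResult searchWithBudget(int numCandidates, int visitedLimit) {
    int numVisited = 0;
    for (int candidate = 0; candidate < numCandidates; candidate++) {
      if (numVisited + 1 > visitedLimit) {
        return new GraphSearchResult(numVisited, true); // budget exhausted: stop early
      }
      numVisited++;
      // ... score the candidate and enqueue its neighbors here ...
    }
    return new GraphSearchResult(numVisited, false);
  }

  public static void main(String[] args) {
    System.out.println(searchWithBudget(100, 10)); // stopped early
    System.out.println(searchWithBudget(5, 10));   // ran to completion
  }
}
```

The caller then branches on the flag (falling back to exact search, say) with ordinary control flow, which is what the later discussion about encoding completeness in `TopDocs` converges toward.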
  IndexSearcher indexSearcher = new IndexSearcher(reader);
  bitSets = new BitSet[reader.leaves().size()];
- indexSearcher.search(filter, new BitSetCollector(bitSets));
+ indexSearcher.search(filter, filterCollector);
For another day, but I am realizing that we have no opportunity to make use of per-segment concurrency here, as we ordinarily do in `IndexSearcher.search()`. To do so, we'd need to consider some API change though. Perhaps instead of using `rewrite` for this, we could make use of `Query`'s two-phase iteration mode of operation. Just a thought for later - I'll go open an issue elsewhere.
  return ctx.reader().searchNearestVectors(field, target, kPerLeaf, acceptDocs, visitedLimit);
} catch (
    @SuppressWarnings("unused")
    CollectionTerminatedException e) {
I could go either way with this one. I tend to lean towards using `TopDocs.totalHits.value` since we already use it for returning visited counts; we could return with a null or maybe empty `scoreDocs` in that case? Or perhaps there could be a use case for returning the "best effort" results obtained by visiting a limited subset of the graph, and we should in fact marshal up the results. Generally I don't favor using Exceptions for expected behavior, but also I think if we do choose this pattern we should create a new Exception type just for this case.
I agree and also prefer not to throw an Exception if possible; it is an expensive operation to throw an Exception in comparison with just returning a value.
I agree, it's nice to avoid using exceptions for normal control flow. I'm not too concerned from a performance perspective though, exceptions aren't thrown in a "hot loop" and I didn't see a perf hit in testing.
If we go the route of using `TopDocs`, I'd prefer to avoid 'null' since that's a bit overloaded (it indicates the field is missing or does not have vectors). Brainstorming ideas:

- Just return `EMPTY_TOPDOCS`.
- Still return best score docs and the visited count, but use `EQUAL_TO` for `TotalHits.Relation` if the search completed normally, otherwise use `GREATER_THAN_OR_EQUAL_TO`.
- Use a special subtype of `TopDocs` instead, which has an explicit "complete" flag?
I very much liked the idea of "a special subtype of TopDocs instead, which has an explicit 'complete' flag".
@msokolov @jpountz @mayya-sharipova this is ready for another look. Notable changes:
Just a couple of minor comments. Thanks!
* <p>The returned {@link TopDocs} will contain a {@link ScoreDoc} for each nearest neighbor, in
* order of their similarity to the query vector (decreasing scores). The {@link TotalHits}
* contains the number of documents visited during the search. If the search stopped early because
* it hit {@code visitedLimit}, it is indicated through the relation {@code
Would it be enough to know that `TopDocs.totalHits.value == visitedLimit`? Do we need to use the relation as a sentinel?
It's possible (but unlikely) that the search completed normally with exactly `numVisited == visitedLimit`. The `visitedLimit` is inclusive. To me it felt more solid and obvious to use the relation.
Thanks for the review! I'll wait for the others to take a look too. I'm working on adding kNN with filtering to luceneutil.
@jtibshirani Thanks Julie, this is a great enhancement! I've left some minor comments, but overall this looks fantastic!
KnnVectorQuery query = new KnnVectorQuery("field", randomVector(dimension), 5, filter);
TopDocs results = searcher.search(query, numDocs);
assertEquals(TotalHits.Relation.EQUAL_TO, results.totalHits.relation);
assertEquals(results.totalHits.value, 5);
How do we know that we used the exact search? Are we judging by the equality of `results.totalHits.value` and `results.scoreDocs.length`? I guess in most cases this is true.

Another idea is to always use `TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO` for the approximate search results as returned in `KnnVectorQuery.searchLeaf`:

TopDocs results = approximateSearch(ctx, acceptDocs, visitedLimit);
if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO) {
  return <results with Relation.GREATER_THAN_OR_EQUAL_TO>;
} else {
Thanks for catching this. I actually got confused here and wrote test assertions that are misleading. Since `KnnVectorQuery` is rewritten to `DocAndScoreQuery`, none of the information about visited nodes is preserved. Therefore we can't tell if exact or approximate search was used. I will rework this test.

I will open a follow-up issue to discuss this. I don't feel like we have a perfect grasp on what total hits should mean in the context of kNN search, especially since it differs between `LeafReader#searchNearestVectors` and the output of `KnnVectorQuery`.
Thanks for the explanation, I missed the part about rewriting to `DocAndScoreQuery`.
@jtibshirani Thanks Julie, the new way to assert exact search in `TestKnnVectorQuery` LGTM.
This PR adds support for a query filter in KnnVectorQuery. First, we gather the query results for each leaf as a bit set. Then the HNSW search skips over the non-matching documents (using the same approach as for live docs). To prevent HNSW search from visiting too many documents when the filter is very selective, we short-circuit if HNSW has already visited more than the number of documents that match the filter, and execute an exact search instead. This bounds the number of visited documents at roughly 2x the cost of just running the exact filter, while in most cases HNSW completes successfully and does a lot better. Co-authored-by: Joel Bernstein <jbernste@apache.org>
* main: migrate to temurin (apache#697) LUCENE-10424: Optimize the "everything matches" case for count query in PointRangeQuery (apache#691) LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori Remove deprecated constructors in Nori (apache#695) LUCENE-10400: revise binary dictionaries' constructor in nori (apache#693) LUCENE-10408: Fix vector values iteration bug (apache#690) Temporarily mute TestKnnVectorQuery#testRandomWithFilter LUCENE-10382: Support filtering in KnnVectorQuery (apache#656) LUCENE-10084: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery when all docs have the field (apache#677) Add CHANGES entry for LUCENE-10398 LUCENE-10398: Add static method for getting Terms from LeafReader (apache#678) LUCENE-10408 Better encoding of doc Ids in vectors (apache#649) LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count. (apache#685)