
[DISCUSS] Should there be a threshold-based vector search API? #12579

Closed
kaivalnp opened this issue Sep 20, 2023 · 6 comments

@kaivalnp (Contributor) commented Sep 20, 2023

Context

Almost all vector search algorithms focus on getting the topK highest-scoring results for a given query vector. This, however, may not be the best objective to pursue, because:

  • Some queries are better suited to vector search than others: indexed vectors may not be uniformly distributed, with some parts of the vector space denser than others. If the query lies in a sparse part of the space, the topK results will be low-scoring (and not relevant to the query). We may want some kind of score threshold, below which a vector is not considered a "match"?
  • Conversely, if the query lies in a dense part of the vector space, we may not want to limit vector search to topK when there are more (slightly lower-scoring) results still above the threshold. Considering a query and a vector a "match" if their similarity score is above the threshold, we may want ALL possible matches?

I found the same thing being discussed in #11946 as well. However, I'm not aware of any algorithms designed for this objective (or how common a use-case this may be)

Proposal

If this turns out to be a valid use-case, HNSW may not be the best algorithm for this new objective (it was designed for getting the K-Nearest Neighbors). At a high level, the current algorithm is (sketched in code after this list):

  • Maintain a queue of topK results (initially empty) and an unbounded queue of candidates (initially containing the entry node). Both of these are priority queues in descending order of similarity scores
  • Starting from the best (highest-scoring) candidate, insert it into the result queue if (1) there are fewer than topK results, or (2) it scores higher than an existing result (removing the least-scoring result in that case)
  • If a node is added as a result, consider all its neighbors as further candidates
  • The search stops when the best scoring candidate is worse than the least scoring result
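
To make the loop concrete, here is a minimal Java sketch of these steps (the `score` and `neighbors` parameters are illustrative stand-ins for Lucene's `HnswGraphSearcher` internals, not its real API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.function.IntToDoubleFunction;

class TopKGraphSearch {
  /**
   * Best-first topK search over the bottom layer, following the steps above.
   * The score and neighbors parameters are illustrative stand-ins for
   * Lucene's HnswGraphSearcher internals, not its real API.
   */
  static List<Integer> searchTopK(
      int entryNode, int topK, IntToDoubleFunction score, Map<Integer, int[]> neighbors) {
    Comparator<Integer> byScoreDesc =
        Comparator.comparingDouble((Integer n) -> score.applyAsDouble(n)).reversed();
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byScoreDesc); // best candidate first
    PriorityQueue<Integer> results = new PriorityQueue<>(byScoreDesc.reversed()); // worst result on top
    Set<Integer> visited = new HashSet<>();
    visited.add(entryNode);
    candidates.add(entryNode);

    while (!candidates.isEmpty()) {
      int best = candidates.poll();
      // Stop when the best candidate scores worse than the least-scoring result
      if (results.size() == topK
          && score.applyAsDouble(best) <= score.applyAsDouble(results.peek())) {
        break;
      }
      if (results.size() == topK) {
        results.poll(); // evict the least-scoring result
      }
      results.add(best); // the node became a result: explore its neighbors
      for (int n : neighbors.getOrDefault(best, new int[0])) {
        if (visited.add(n)) {
          candidates.add(n);
        }
      }
    }
    List<Integer> sorted = new ArrayList<>(results);
    sorted.sort(byScoreDesc);
    return sorted;
  }
}
```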

For the new objective, the underlying graphs created by HNSW may still be useful: each node is connected to its nearest (and diverse) neighbors, so we have some sense of direction / ordering on which node to explore next

Can we instead change the graph traversal (and result collection) criteria such that:

  • Consider a node as a candidate if it exceeds a lower, traversal-specific score threshold
  • Consider a candidate as a result if it exceeds the actual query-result similarity score threshold

The lower graph-traversal threshold acts as a buffer, retaining results whose path from the entry point passes through nodes below the actual threshold (it is treated as a search-time hyper-parameter, chosen according to the underlying vector space and queries)

The upper levels of the search remain the same (find the single best entry point for the next layer); this new criterion applies only to the lowest layer (which contains all nodes), to collect an unbounded number of results above the threshold

This also eliminates the need to maintain results and candidates as priority queues: we can keep them as simple lists (avoiding the overhead of heapify calls that maintain the min-max heap property) and sort them just once at the end, as in the sketch below
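
Putting the two thresholds together, a minimal sketch under the same illustrative stand-ins (a sketch of the proposal, not Lucene's actual implementation):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.IntToDoubleFunction;

class ThresholdGraphSearch {
  /**
   * Threshold-based traversal sketch using plain lists instead of heaps.
   * Assumes traversalThreshold <= resultThreshold; same illustrative
   * stand-ins as before, not Lucene's actual implementation.
   */
  static List<Integer> searchAboveThreshold(
      int entryNode,
      double traversalThreshold,
      double resultThreshold,
      IntToDoubleFunction score,
      Map<Integer, int[]> neighbors) {
    Deque<Integer> candidates = new ArrayDeque<>(); // no heap ordering required
    List<Integer> results = new ArrayList<>();
    Set<Integer> visited = new HashSet<>();
    visited.add(entryNode);
    candidates.add(entryNode);

    while (!candidates.isEmpty()) {
      int node = candidates.poll();
      double s = score.applyAsDouble(node);
      if (s < traversalThreshold) {
        continue; // below the traversal buffer: do not explore past this node
      }
      if (s >= resultThreshold) {
        results.add(node); // an actual "match" for the query
      }
      for (int n : neighbors.getOrDefault(node, new int[0])) {
        if (visited.add(n)) {
          candidates.add(n);
        }
      }
    }
    // Sort once at the end instead of maintaining heap order throughout
    results.sort(Comparator.comparingDouble((Integer n) -> score.applyAsDouble(n)).reversed());
    return results;
  }
}
```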

Looking for inputs on both the use-case and the tweaked algorithm.

@msokolov (Contributor)

Interesting - so this is kind of like a noisy radius search in high dimensions? It makes sense to me intuitively, since we don't generally expect searches to have the same number of results. If I search for asd89HH8!@ I may get only very low-quality results, perhaps none that would exceed the threshold / fall within the radius boundary, whereas if I search for something that shares terms in common with many documents, I would expect vector-based search to find them all. It's kind of weird from an IR perspective that we just pick some query-independent K. I wonder if there is some precedent for this in the semantic search literature?

I found one system that supports a threshold: https://typesense.org/docs/0.25.0/api/vector-search.html#distance-threshold but I don't think we support that yet in KnnVector*Query, do we? Perhaps we ought to lead with that in order to introduce the utility of radius searching, since you can simulate this with a threshold + large K, right?

@kaivalnp (Contributor, Author)

Thanks @msokolov, this nicely summarizes what I'm trying to say!

https://typesense.org/docs/0.25.0/api/vector-search.html#distance-threshold

I took a look, and it seems like they first perform kNN searches and then remove results below the threshold (like a post-filtering step, as you mentioned)

since you can simulate this with a threshold + large K

Yes, this would be the easiest way to find all (or a large number of) vectors within a radius using the existing Knn[Byte|Float]VectorQuery classes -- something like the sketch below
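
For example, something along these lines (the "vector" field name and the parameter values are made up for illustration):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

class RadiusViaTopK {
  /**
   * Approximate a radius search by over-fetching with a large K, then
   * post-filtering on score ("vector" and the threshold are illustrative;
   * this is not an exact radius search).
   */
  static List<ScoreDoc> radiusViaTopK(
      IndexSearcher searcher, float[] queryVector, float threshold, int largeK)
      throws IOException {
    TopDocs topDocs =
        searcher.search(new KnnFloatVectorQuery("vector", queryVector, largeK), largeK);
    List<ScoreDoc> withinRadius = new ArrayList<>();
    for (ScoreDoc sd : topDocs.scoreDocs) {
      if (sd.score >= threshold) { // keep only hits above the similarity threshold
        withinRadius.add(sd);
      }
    }
    return withinRadius;
  }
}
```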

Interestingly, the threshold-based search proposed above may not need any invasive changes thanks to the amazing KnnCollector API added in #12434, so I wonder if we can start there directly?

Opening a PR to discuss next steps.

@benwtrent (Member)

@kaivalnp yes, KnnCollector should be used for something like this :) Glad it's useful!

One of the tricky things I can see is that it's possible the bottom-layer entry point is so bad that it itself isn't within the traversal_similarity. Consequently, recall will suffer abysmally.

But, a quick overview of your PR shows that this type of logic fits pretty easily. I look forward to the performance and recall tests!

@benwtrent (Member)

@kaivalnp one other thing to think about is https://weaviate.io/blog/weaviate-1-20-release#autocut

I wonder if we could do something similar by dynamically adjusting the "traversal_similarity" to indicate that exploring more would be useless.

@kaivalnp (Contributor, Author)

one other thing to think about is https://weaviate.io/blog/weaviate-1-20-release#autocut

Interesting! They seem to make "cuts" when there is an abnormal jump in scores, where an "abnormal jump" means the score of the next candidate does not match the gradual decrease in scores so far (for example: if we have collected 10 vectors with max and min scores of 1.0 and 0.8, the "step" is basically 0.02 per vector, so we make a "cut" if the next-best candidate scores below 0.78)

Might be a good next step to simplify the API!
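
A rough sketch of that cut heuristic, as I read the blog post (my interpretation, not Weaviate's actual code):

```java
class Autocut {
  /**
   * Cut results when the next score falls below the extrapolated gradual
   * decrease (an interpretation of Weaviate's autocut, not its actual
   * implementation). Expects scores sorted in descending order; returns
   * the number of results to keep.
   */
  static int autocut(float[] scores, int minResults) {
    // Need at least two collected results to estimate the average "step"
    for (int i = Math.max(minResults, 2); i < scores.length; i++) {
      float step = (scores[0] - scores[i - 1]) / (i - 1); // average decrease so far
      if (scores[0] - scores[i] > step * i) {
        return i; // abnormal jump: cut before candidate i
      }
    }
    return scores.length; // no cut needed
  }
}
```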

benwtrent pushed a commit that referenced this issue Dec 11, 2023
### Description

Background in #12579

Add support for getting "all vectors within a radius" as opposed to getting the "topK closest vectors" in the current system

### Considerations

I've tried to keep this change minimal and non-invasive by not modifying any APIs and re-using existing HNSW graphs -- changing the graph traversal and result collection criteria to:
1. Visit all nodes (reachable from the entry node in the last level) that are within an outer "traversal" radius
2. Collect all nodes that are within an inner "result" radius

### Advantages

1. Queries that have a high number of "relevant" results will get all of those (not limited by `topK`)
2. Conversely, arbitrary queries where many results are not "relevant" will not waste time in getting all `topK` (when some of them will be removed later)
3. Results of HNSW searches need not be sorted - we can store them in a plain list as opposed to min-max heaps (saving on `heapify` calls). Merging results from segments is also cheaper: we just concatenate results, as opposed to calculating the index-level `topK`

On a higher level, finding `topK` results required HNSW searches to happen in `#rewrite` because of an interdependence of results between segments - we want to find the index-level `topK` from multiple segment-level results. This is kind of against Lucene's concept of segments being independently searchable sub-indexes?

Moreover, we needed explicit concurrency (#12160) to perform these searches in parallel. These shortcomings are naturally overcome with the new objective of finding "all vectors within a radius" - results are inherently independent of other segments (so we can move searches to a more fitting place?), as the sketch below illustrates
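
A minimal illustration of that merging difference, with hypothetical per-segment hit arrays:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.search.ScoreDoc;

class MergeSketch {
  /**
   * With a radius-based search, per-segment results (hypothetical arrays
   * here) merge by simple concatenation; no index-level topK selection
   * across segments is needed.
   */
  static List<ScoreDoc> mergeRadiusResults(List<ScoreDoc[]> perSegmentHits) {
    List<ScoreDoc> merged = new ArrayList<>();
    for (ScoreDoc[] hits : perSegmentHits) {
      merged.addAll(Arrays.asList(hits)); // independent of other segments
    }
    return merged;
  }
}
```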

### Caveats

I could not find much precedent for using HNSW graphs this way (or even for radius-based search, for that matter - please add links to existing work if someone is aware of any) and consequently marked all classes as `@lucene.experimental`

For now I have re-used lots of functionality from `AbstractKnnVectorQuery` to keep this minimal, but if the use-case is accepted more widely we can look into writing more suitable queries (as mentioned above briefly)
@kaivalnp (Contributor, Author) commented Mar 4, 2024

#12679 added a new experimental [Byte|Float]VectorSimilarityQuery to perform similarity-based vector searches, released with Lucene 9.10
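
A rough usage sketch (the field name, query vector, and threshold values are illustrative; see the Javadocs for the exact signatures):

```java
import java.io.IOException;
import org.apache.lucene.search.FloatVectorSimilarityQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

class SimilarityQueryUsage {
  // "vector", 0.75f, and 0.8f below are illustrative values, not defaults
  static TopDocs searchAboveSimilarity(IndexSearcher searcher, float[] queryVector)
      throws IOException {
    // The lower traversalSimilarity buffers graph exploration; the higher
    // resultSimilarity is the actual score cutoff for matches
    Query query =
        new FloatVectorSimilarityQuery(
            "vector", queryVector, /* traversalSimilarity */ 0.75f, /* resultSimilarity */ 0.8f);
    return searcher.search(query, 10);
  }
}
```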

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants