
[DISCUSS] Should there be a threshold-based vector search API? #12579

Closed
kaivalnp opened this issue Sep 20, 2023 · 6 comments

@kaivalnp (Contributor) commented Sep 20, 2023

Context

Almost all vector search algorithms focus on getting the topK highest-scoring results for a given query vector. This, however, may not be the best objective to pursue, because:

  • Some queries are better suited to vector search than others: indexed vectors may not be uniformly distributed, with some parts of the vector space denser than others. If the query lies in a sparse part of the space, the topK results will be low-scoring (and not relevant to the query). We may want some kind of score threshold, below which a vector is not considered a "match"?
  • Conversely, if the query lies in a dense part of the vector space, we may not want to limit vector search to topK when there are more (slightly lower-scoring) results still above the threshold. Considering a query and a vector a "match" if their similarity score is above the threshold, we may want ALL possible matches?

I found the same thing being discussed in #11946 as well. However, I'm not aware of any algorithms designed for this objective (or how common a use-case this may be)

Proposal

If this turns out to be a valid use-case, HNSW may not be the best algorithm for this new objective (it was designed for getting the K-Nearest Neighbors). At a high level, the current algorithm is (sketched in code after this list):

  • Maintain a queue of topK results (initially empty) and an unbounded queue of candidates (initially containing the entry node). Both of these are priority queues in descending order of similarity scores
  • Starting from the best (highest-scoring) candidate, insert it into the result queue if (1) there are fewer than topK results, or (2) it scores higher than an existing result (removing the least-scoring result in that case)
  • If a node is added as a result, consider all its neighbors as further candidates
  • The search stops when the best scoring candidate is worse than the least scoring result
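
To make the loop concrete, here is a minimal Java sketch of these steps (the `score` and `neighbors` parameters are illustrative stand-ins for Lucene's `HnswGraphSearcher` internals, not its real API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.function.IntToDoubleFunction;

class TopKGraphSearch {
  /**
   * Best-first topK search over the bottom layer, following the steps above.
   * The score and neighbors parameters are illustrative stand-ins for
   * Lucene's HnswGraphSearcher internals, not its real API.
   */
  static List<Integer> searchTopK(
      int entryNode, int topK, IntToDoubleFunction score, Map<Integer, int[]> neighbors) {
    Comparator<Integer> byScoreDesc =
        Comparator.comparingDouble((Integer n) -> score.applyAsDouble(n)).reversed();
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byScoreDesc); // best candidate first
    PriorityQueue<Integer> results = new PriorityQueue<>(byScoreDesc.reversed()); // worst result on top
    Set<Integer> visited = new HashSet<>();
    visited.add(entryNode);
    candidates.add(entryNode);

    while (!candidates.isEmpty()) {
      int best = candidates.poll();
      // Stop when the best candidate scores worse than the least-scoring result
      if (results.size() == topK
          && score.applyAsDouble(best) <= score.applyAsDouble(results.peek())) {
        break;
      }
      if (results.size() == topK) {
        results.poll(); // evict the least-scoring result
      }
      results.add(best); // the node became a result: explore its neighbors
      for (int n : neighbors.getOrDefault(best, new int[0])) {
        if (visited.add(n)) {
          candidates.add(n);
        }
      }
    }
    List<Integer> sorted = new ArrayList<>(results);
    sorted.sort(byScoreDesc);
    return sorted;
  }
}
```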

For the new objective, the underlying graphs created by HNSW may still be useful: each node is connected to its nearest (and diverse) neighbors, so we have some sense of direction / ordering on which node to explore next

Can we instead change the graph traversal (and result collection) criteria such that:

  • Consider a node as a candidate if it exceeds a lower, traversal-specific score threshold
  • Consider a candidate as a result if it exceeds the actual query-result similarity score threshold

The lower graph-traversal threshold acts as a buffer, retaining results whose path from the entry point passes through nodes below the actual threshold (it is treated as a search-time hyper-parameter, chosen according to the underlying vector space and queries)

The upper levels of the search remain the same (find the single best entry point for the next layer); this new criterion applies only to the lowest layer (which contains all nodes), to collect an unbounded number of results above the threshold

This also eliminates the need to maintain results and candidates as priority queues: we can keep them as simple lists (avoiding the overhead of heapify calls that maintain the min-max heap property) and sort them just once at the end, as in the sketch below
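
Putting the two thresholds together, a minimal sketch under the same illustrative stand-ins (a sketch of the proposal, not Lucene's actual implementation):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.IntToDoubleFunction;

class ThresholdGraphSearch {
  /**
   * Threshold-based traversal sketch using plain lists instead of heaps.
   * Assumes traversalThreshold <= resultThreshold; same illustrative
   * stand-ins as before, not Lucene's actual implementation.
   */
  static List<Integer> searchAboveThreshold(
      int entryNode,
      double traversalThreshold,
      double resultThreshold,
      IntToDoubleFunction score,
      Map<Integer, int[]> neighbors) {
    Deque<Integer> candidates = new ArrayDeque<>(); // no heap ordering required
    List<Integer> results = new ArrayList<>();
    Set<Integer> visited = new HashSet<>();
    visited.add(entryNode);
    candidates.add(entryNode);

    while (!candidates.isEmpty()) {
      int node = candidates.poll();
      double s = score.applyAsDouble(node);
      if (s < traversalThreshold) {
        continue; // below the traversal buffer: do not explore past this node
      }
      if (s >= resultThreshold) {
        results.add(node); // an actual "match" for the query
      }
      for (int n : neighbors.getOrDefault(node, new int[0])) {
        if (visited.add(n)) {
          candidates.add(n);
        }
      }
    }
    // Sort once at the end instead of maintaining heap order throughout
    results.sort(Comparator.comparingDouble((Integer n) -> score.applyAsDouble(n)).reversed());
    return results;
  }
}
```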

Looking for inputs on both the use-case and the tweaked algorithm.

@msokolov (Contributor)

Interesting - so this is kind of like a noisy radius search in high dimensions? It makes sense to me intuitively, since we don't generally expect searches to have the same number of results. If I search for asd89HH8!@ I may get only very low-quality results, perhaps none that would exceed the threshold / fall within the radius boundary, whereas if I search for something that shares terms in common with many documents, I would expect vector-based search to find them all. It's kind of weird from an IR perspective that we just pick some query-independent K. I wonder if there is some precedent for this in the semantic search literature?

I found one system that supports a threshold: https://typesense.org/docs/0.25.0/api/vector-search.html#distance-threshold but I don't think we support that yet in KnnVector*Query, do we? Perhaps we ought to lead with that in order to introduce the utility of radius searching, since you can simulate this with a threshold + large K, right?

@kaivalnp (Contributor, Author)

Thanks @msokolov, this nicely summarizes what I'm trying to say!

https://typesense.org/docs/0.25.0/api/vector-search.html#distance-threshold

I took a look, and it seems like they first perform kNN searches and then remove results below the threshold (like a post-filtering step, as you mentioned)

since you can simulate this with a threshold + large K

Yes, this would be the easiest way to find all (or a large number of) vectors within a radius using the existing Knn[Byte|Float]VectorQuery classes -- something like the sketch below
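
For example, something along these lines (the "vector" field name and the parameter values are made up for illustration):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

class RadiusViaTopK {
  /**
   * Approximate a radius search by over-fetching with a large K, then
   * post-filtering on score ("vector" and the threshold are illustrative;
   * this is not an exact radius search).
   */
  static List<ScoreDoc> radiusViaTopK(
      IndexSearcher searcher, float[] queryVector, float threshold, int largeK)
      throws IOException {
    TopDocs topDocs =
        searcher.search(new KnnFloatVectorQuery("vector", queryVector, largeK), largeK);
    List<ScoreDoc> withinRadius = new ArrayList<>();
    for (ScoreDoc sd : topDocs.scoreDocs) {
      if (sd.score >= threshold) { // keep only hits above the similarity threshold
        withinRadius.add(sd);
      }
    }
    return withinRadius;
  }
}
```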

Interestingly, the threshold-based search proposed above may not need any invasive changes thanks to the amazing KnnCollector API added in #12434, so I wonder if we can start there directly?

Opening a PR to discuss next steps.

@benwtrent (Member)

@kaivalnp yes, KnnCollector should be used for something like this :) Glad it's useful!

One of the tricky things I can see is that it's possible the bottom-layer entry point is so bad that it itself isn't within the traversal_similarity. Consequently, recall will suffer abysmally.

But, a quick overview of your PR shows that this type of logic fits pretty easily. I look forward to the performance and recall tests!

@benwtrent (Member)

@kaivalnp one other thing to think about is https://weaviate.io/blog/weaviate-1-20-release#autocut

I wonder if we could do something similar by dynamically adjusting the "traversal_similarity" to indicate that exploring more would be useless.

@kaivalnp (Contributor, Author)

one other thing to think about is https://weaviate.io/blog/weaviate-1-20-release#autocut

Interesting! They seem to make "cuts" when there is an abnormal jump in scores, where an "abnormal jump" means the score of the next candidate does not match the gradual decrease in scores so far (for example: if we have collected 10 vectors with max and min scores of 1.0 and 0.8, the "step" is basically 0.02 per vector, so we make a "cut" if the next-best candidate scores below 0.78)

Might be a good next step to simplify the API!
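
A rough sketch of that cut heuristic, as I read the blog post (my interpretation, not Weaviate's actual code):

```java
class Autocut {
  /**
   * Cut results when the next score falls below the extrapolated gradual
   * decrease (an interpretation of Weaviate's autocut, not its actual
   * implementation). Expects scores sorted in descending order; returns
   * the number of results to keep.
   */
  static int autocut(float[] scores, int minResults) {
    // Need at least two collected results to estimate the average "step"
    for (int i = Math.max(minResults, 2); i < scores.length; i++) {
      float step = (scores[0] - scores[i - 1]) / (i - 1); // average decrease so far
      if (scores[0] - scores[i] > step * i) {
        return i; // abnormal jump: cut before candidate i
      }
    }
    return scores.length; // no cut needed
  }
}
```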

benwtrent pushed a commit that referenced this issue Dec 11, 2023
### Description

Background in #12579

Add support for getting "all vectors within a radius" as opposed to getting the "topK closest vectors" in the current system

### Considerations

I've tried to keep this change minimal and non-invasive by not modifying any APIs and re-using existing HNSW graphs -- changing the graph traversal and result collection criteria to:
1. Visit all nodes (reachable from the entry node in the last level) that are within an outer "traversal" radius
2. Collect all nodes that are within an inner "result" radius

### Advantages

1. Queries that have a high number of "relevant" results will get all of those (not limited by `topK`)
2. Conversely, arbitrary queries where many results are not "relevant" will not waste time in getting all `topK` (when some of them will be removed later)
3. Results of HNSW searches need not be sorted - we can store them in a plain list as opposed to min-max heaps (saving on `heapify` calls). Merging results from segments is also cheaper: we just concatenate results, as opposed to calculating the index-level `topK`

On a higher level, finding `topK` results required HNSW searches to happen in `#rewrite` because of an interdependence of results between segments - we want to find the index-level `topK` from multiple segment-level results. This is kind of against Lucene's concept of segments being independently searchable sub-indexes?

Moreover, we needed explicit concurrency (#12160) to perform these searches in parallel. These shortcomings are naturally overcome with the new objective of finding "all vectors within a radius" - results are inherently independent of other segments (so we can move searches to a more fitting place?), as the sketch below illustrates
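
A minimal illustration of that merging difference, with hypothetical per-segment hit arrays:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.search.ScoreDoc;

class MergeSketch {
  /**
   * With a radius-based search, per-segment results (hypothetical arrays
   * here) merge by simple concatenation; no index-level topK selection
   * across segments is needed.
   */
  static List<ScoreDoc> mergeRadiusResults(List<ScoreDoc[]> perSegmentHits) {
    List<ScoreDoc> merged = new ArrayList<>();
    for (ScoreDoc[] hits : perSegmentHits) {
      merged.addAll(Arrays.asList(hits)); // independent of other segments
    }
    return merged;
  }
}
```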

### Caveats

I could not find much precedent for using HNSW graphs this way (or even for radius-based search, for that matter - please add links to existing work if someone is aware of any) and consequently marked all classes as `@lucene.experimental`

For now I have re-used lots of functionality from `AbstractKnnVectorQuery` to keep this minimal, but if the use-case is accepted more widely we can look into writing more suitable queries (as mentioned above briefly)
@kaivalnp (Contributor, Author) commented Mar 4, 2024

#12679 added a new experimental [Byte|Float]VectorSimilarityQuery to perform similarity-based vector searches, released with Lucene 9.10
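
A rough usage sketch (the field name, query vector, and threshold values are illustrative; see the Javadocs for the exact signatures):

```java
import java.io.IOException;
import org.apache.lucene.search.FloatVectorSimilarityQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

class SimilarityQueryUsage {
  // "vector", 0.75f, and 0.8f below are illustrative values, not defaults
  static TopDocs searchAboveSimilarity(IndexSearcher searcher, float[] queryVector)
      throws IOException {
    // The lower traversalSimilarity buffers graph exploration; the higher
    // resultSimilarity is the actual score cutoff for matches
    Query query =
        new FloatVectorSimilarityQuery(
            "vector", queryVector, /* traversalSimilarity */ 0.75f, /* resultSimilarity */ 0.8f);
    return searcher.search(query, 10);
  }
}
```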

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants