Adjust ToParentBlockJoin[Byte|Float]KnnVectorQuery to return highest score child doc ID by parent id #12510

Merged
merged 3 commits into from
Aug 16, 2023

Conversation

benwtrent
Member

While integrating, I discovered a frustrating bug :(

The current query returns parent IDs based on the score of the nearest child. However, it's difficult to invert that relationship (that is, to determine which child was actually the nearest during the search).

So, I changed the new ToParentBlockJoin[Byte|Float]KnnVectorQuery to return the nearest child ID instead of just that child's parent ID. The results are still diversified by parent ID.
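To make "diversified by parent" concrete, here is a minimal sketch of the idea: for each parent, keep only its highest-scoring child. This is an illustration, not the query's implementation. It applies the rule after the fact to a list of already scored children, and it assumes `java.util.BitSet` as a stand-in for the parent bit set. The class and method names are hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: keep only the best-scoring child per parent. */
class DiversifyByParentSketch {

  /**
   * Returns a map of parent doc ID to the doc ID of its highest-scoring child.
   * Assumes block-join layout: children are indexed immediately before their parent,
   * so the next set bit at or after a child is that child's parent.
   */
  static Map<Integer, Integer> bestChildPerParent(
      int[] childDocs, float[] scores, BitSet parentBits) {
    // parent doc ID -> index (into childDocs/scores) of the best hit seen so far
    Map<Integer, Integer> bestIndex = new HashMap<>();
    for (int i = 0; i < childDocs.length; i++) {
      int parent = parentBits.nextSetBit(childDocs[i]);
      Integer previous = bestIndex.get(parent);
      if (previous == null || scores[i] > scores[previous]) {
        bestIndex.put(parent, i);
      }
    }
    // Translate indices back to child doc IDs, sorted by parent for stable output.
    Map<Integer, Integer> result = new TreeMap<>();
    for (Map.Entry<Integer, Integer> e : bestIndex.entrySet()) {
      result.put(e.getKey(), childDocs[e.getValue()]);
    }
    return result;
  }

  public static void main(String[] args) {
    // Docs 0-2 are children of parent 3; docs 4-5 are children of parent 6.
    BitSet parentBits = new BitSet();
    parentBits.set(3);
    parentBits.set(6);
    int[] childDocs = {0, 1, 2, 4, 5};
    float[] scores = {0.2f, 0.9f, 0.5f, 0.7f, 0.3f};
    System.out.println(bestChildPerParent(childDocs, scores, parentBits)); // {3=1, 6=4}
  }
}
```

Each parent contributes at most one hit, and that hit is a child doc ID rather than the parent's own ID.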

Now it's easy to determine the nearest child vector, as that is what the query returns. Determining its parent is as simple as using the previously provided parent bit set.
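The parent lookup described above can be sketched as follows. This is a simplified illustration that uses `java.util.BitSet` as a stand-in for the parent bit set; Lucene's own `BitSet` also offers a `nextSetBit` method, but it works on per-segment doc IDs and signals "no more bits" differently. The class and method names are hypothetical.

```java
import java.util.BitSet;

/** Hypothetical sketch: map a child doc ID back to its parent in a block-joined index. */
class ParentLookupSketch {

  /**
   * Returns the parent doc ID for the given child, or -1 if no parent follows it.
   * Assumes block-join layout: children are indexed immediately before their parent,
   * so the parent is the first set bit at or after the child.
   */
  static int parentOf(BitSet parentBits, int childDocId) {
    return parentBits.nextSetBit(childDocId);
  }

  public static void main(String[] args) {
    // Docs 0-2 are children of parent 3; docs 4-5 are children of parent 6.
    BitSet parentBits = new BitSet();
    parentBits.set(3);
    parentBits.set(6);
    System.out.println(parentOf(parentBits, 1)); // prints 3
    System.out.println(parentOf(parentBits, 5)); // prints 6
  }
}
```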

I realize that this might make the name weird, and I am happy to change it. All the "join" names are confusing to me already.

Since this iterates on an unreleased query and is related to #12434, I am not adding a changelog entry.

Contributor

@jimczi jimczi left a comment


The change looks good, and I agree that the naming can be confusing.
Here are some possible alternatives:

  • DiversifyingKnn(Collector|VectorQuery)
  • DiversifyingChildrenKnn...
  • CollapsingKnn...
  • CollapsingChildren...

Naming is hard.

/** kNN byte vector query that joins matching children vector documents with their parent doc id. */
/**
* kNN byte vector query that joins matching children vector documents with their parent doc id. The
* top documents returned are the child document ids and the calculated scores.
Contributor


Maybe add an example on how to mix with root document queries? Something like:

ToParentBlockJoinByteKnnVectorQuery childQuery = ...
Query query = new ToParentBlockJoinQuery(childQuery, parentsFilter, ..)
...

?

@@ -38,6 +38,7 @@

/**
* kNN float vector query that joins matching children vector documents with their parent doc id.
* The top documents returned are the child document ids and the calculated scores.
*/
Contributor


Same here?

@benwtrent benwtrent merged commit 4174b52 into apache:main Aug 16, 2023
4 checks passed
@benwtrent benwtrent deleted the feature/fix-parent-block-join-query branch August 16, 2023 17:44
benwtrent added a commit that referenced this pull request Aug 16, 2023
…rn highest score child doc ID by parent id (#12510)

The current query returns parent IDs based on the score of the nearest child. However, it's difficult to invert that relationship (that is, to determine which child was actually the nearest during the search).

So, I renamed the new `ToParentBlockJoin[Byte|Float]KnnVectorQuery` to `DiversifyingChildren[Byte|Float]KnnVectorQuery`, and it now returns the nearest child ID instead of just that child's parent ID. The results are still diversified by parent ID.

Now it's easy to determine the nearest child vector, as that is what the query returns. Determining its parent is as simple as using the previously provided parent bit set.

Related to: #12434