
LUCENE-10408 Better encoding of doc Ids in vectors #649

Merged
merged 6 commits into apache:main from hnsw-graph-docs2
Feb 17, 2022

Conversation

mayya-sharipova
Contributor

@mayya-sharipova mayya-sharipova commented Feb 4, 2022

Better encoding of doc Ids in Lucene91HnswVectorsFormat
for a dense case where all docs have vectors.

Currently we write the doc IDs of all documents that have vectors
inefficiently. This improves their encoding: when all documents
have vectors, we don't write document IDs at all, but instead write
a single short value – a dense marker.

Better encoding of doc Ids in Lucene91HnswVectorsFormat

Currently we write the doc IDs of all documents that have vectors
inefficiently. This improves their encoding:
- when all documents have vectors, we don't write document IDs at all;
- when only some documents have vectors, we delta-encode the doc IDs.
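The two cases can be sketched as follows. This is an illustrative sketch only, not the actual Lucene91HnswVectorsFormat code: the `DENSE_MARKER` value, the count prefix, and the int-list output are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class DocIdEncoding {
  static final int DENSE_MARKER = -1; // hypothetical marker value

  // Returns the "encoded stream" as a list of ints for illustration.
  public static List<Integer> encode(int[] docIds, int maxDoc) {
    List<Integer> out = new ArrayList<>();
    if (docIds.length == maxDoc) {
      out.add(DENSE_MARKER);   // dense: a single marker, no doc IDs written
    } else {
      out.add(docIds.length);  // sparse: count, then deltas
      int prev = 0;
      for (int docId : docIds) {
        out.add(docId - prev); // gaps between sorted IDs are small ints
        prev = docId;
      }
    }
    return out;
  }
}
```

In the dense case the output is constant-size regardless of `maxDoc`; in the sparse case the deltas are small non-negative values that compress well.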
Member

@jtibshirani jtibshirani left a comment

Great that you are looking into these TODOs while it's easy to make changes to the format (before the 9.1 release). I am curious if you noticed any search performance improvement, or if the motivation is more about space/memory savings.

@jpountz
Contributor

jpountz commented Feb 7, 2022

Optimizing for the case when all docs have a value makes sense to me.

for a case when only certain documents have vectors, we do delta encoding of doc Ids.

In the past we rejected changes that would consist of having the data written in a compressed fashion on disk but still uncompressed in memory.

I wonder if it would be a better trade-off to keep ints uncompressed, but read them from disk directly instead of loading giant arrays in memory? Or possibly switch to something like DirectMonotonicReader if it doesn't slow down searches.

@mayya-sharipova
Contributor Author

mayya-sharipova commented Feb 8, 2022

@jtibshirani @jpountz Thanks for your feedback. I've tried to address this in 6bf1aea. I've decided to focus this PR only on optimizing the dense case, keeping the sparse case as it was before – uncompressed.

I wonder if it would be a better trade-off to keep ints uncompressed, but read them from disk directly instead of loading giant arrays in memory? Or possibly switch to something like DirectMonotonicReader if it doesn't slow down searches.

@jpountz Thank you for the suggestion, Adrien. I've put this as a TODO in the code to explore. I am also wondering: since we use binarySearch on the docIds array, would it still be acceptable to have this array on disk? Do we have a precedent for such a use case of reading from disk?

@jpountz
Contributor

jpountz commented Feb 9, 2022

I am also wondering since we use binarySearch on docIds array, would it still be acceptable to have this array on disk?

I think so. It's a bit like the terms index: the access pattern is random, but this data should be small enough that it would generally be in the FS cache.

@jpountz
Contributor

jpountz commented Feb 9, 2022

Of course if we can have a less random access pattern (e.g. maybe by using exponential search, having skip lists like postings or jump tables like doc values) it would be even better.
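A minimal sketch of the exponential (galloping) search idea mentioned here, assuming a plain sorted `int[]` rather than Lucene's on-disk structures: starting from a known position, probe 1, 2, 4, … steps ahead, then binary-search only the bracketed range. Sequential advances touch nearby pages, which is friendlier to the FS cache than a binary search over the whole array.

```java
import java.util.Arrays;

public class ExponentialSearch {
  // Returns the index of the first element >= target at or after `from`.
  // Probes exponentially growing steps, then binary-searches the bracket.
  public static int searchFrom(int[] sorted, int from, int target) {
    int bound = 1;
    while (from + bound < sorted.length && sorted[from + bound] < target) {
      bound *= 2;
    }
    int lo = from + bound / 2;                        // last probe below target
    int hi = Math.min(sorted.length, from + bound + 1); // first probe at/above it
    int idx = Arrays.binarySearch(sorted, lo, hi, target);
    return idx >= 0 ? idx : -(idx + 1); // insertion point when not present
  }
}
```

When advances are small (the common case for iterators), this does O(log d) work where d is the distance moved, instead of O(log n) over the whole array.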

Member

@jtibshirani jtibshirani left a comment

The new changes make sense to me (I'm not an expert in these encoding decisions though).

@jtibshirani
Member

Additional motivation for this PR: it could help with performance of exact search (in #656). When all docs have vectors, we can avoid a binary search in VectorValues#advance.
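The shortcut can be illustrated with a toy iterator (a hypothetical class, not the real VectorValues API): when every doc has a vector, ordinal and doc ID coincide, so advancing needs no lookup in a docIds array at all.

```java
public class DenseVectorValues {
  private final int size; // number of docs == number of vectors
  private int doc = -1;

  public DenseVectorValues(int size) {
    this.size = size;
  }

  // Advance to the first doc >= target. In the dense case this is just
  // the target itself (or exhaustion) -- no binary search required.
  public int advance(int target) {
    doc = target < size ? target : Integer.MAX_VALUE; // sentinel for NO_MORE_DOCS
    return doc;
  }

  // Identity mapping: the ordinal into the vector data is the doc ID.
  public int ordinal() {
    return doc;
  }
}
```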

@mayya-sharipova
Contributor Author

@jpountz @jtibshirani Thank you for your suggestions. Are you OK with keeping this PR as it is now, optimizing only the dense case, and exploring Adrien's suggestions for the sparse case in subsequent work?

Member

@jtibshirani jtibshirani left a comment

@mayya-sharipova that plan makes sense to me. This looks good to me, I just left some tiny last comments.

@@ -204,6 +204,8 @@ Optimizations
* LUCENE-10367: Optimize CoveringQuery for the case when the minimum number of
matching clauses is a constant. (LuYunCheng via Adrien Grand)

* LUCENE-10408: Better encoding of doc Ids in vectors (Mayya Sharipova, Julie Tibshirani, Adrien Grand)
Member


Thanks for including me! I'm also fine if you omit me when I'm a reviewer.

@mayya-sharipova mayya-sharipova merged commit f8c5408 into apache:main Feb 17, 2022
@mayya-sharipova mayya-sharipova deleted the hnsw-graph-docs2 branch February 17, 2022 10:34
mayya-sharipova added a commit that referenced this pull request Feb 17, 2022
Better encoding of doc Ids in Lucene91HnswVectorsFormat
for a dense case where all docs have vectors.

Currently we write the doc IDs of all documents that have vectors
inefficiently. This improves their encoding: when all documents
have vectors, we don't write document IDs at all, but instead write
a single short value – a dense marker.
wjp719 added a commit to wjp719/lucene that referenced this pull request Feb 23, 2022
* main:
  migrate to temurin (apache#697)
  LUCENE-10424: Optimize the "everything matches" case for count query in PointRangeQuery (apache#691)
  LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori
  Remove deprecated constructors in Nori (apache#695)
  LUCENE-10400: revise binary dictionaries' constructor in nori (apache#693)
  LUCENE-10408: Fix vector values iteration bug (apache#690)
  Temporarily mute TestKnnVectorQuery#testRandomWithFilter
  LUCENE-10382: Support filtering in KnnVectorQuery (apache#656)
  LUCENE-10084: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery when all docs have the field (apache#677)
  Add CHANGES entry for LUCENE-10398
  LUCENE-10398: Add static method for getting Terms from LeafReader (apache#678)
  LUCENE-10408 Better encoding of doc Ids in vectors (apache#649)
  LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count. (apache#685)
wjp719 added a commit to wjp719/lucene that referenced this pull request Feb 23, 2022
* main:
  LUCENE-10416: move changes entry to v10.0.0
  migrate to temurin (apache#697)
  LUCENE-10424: Optimize the "everything matches" case for count query in PointRangeQuery (apache#691)
  LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori
  Remove deprecated constructors in Nori (apache#695)
  LUCENE-10400: revise binary dictionaries' constructor in nori (apache#693)
  LUCENE-10408: Fix vector values iteration bug (apache#690)
  Temporarily mute TestKnnVectorQuery#testRandomWithFilter
  LUCENE-10382: Support filtering in KnnVectorQuery (apache#656)
  LUCENE-10084: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery when all docs have the field (apache#677)
  Add CHANGES entry for LUCENE-10398
  LUCENE-10398: Add static method for getting Terms from LeafReader (apache#678)
  LUCENE-10408 Better encoding of doc Ids in vectors (apache#649)
  LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count. (apache#685)
dantuzi pushed a commit to SeaseLtd/lucene that referenced this pull request Mar 10, 2022
Better encoding of doc Ids in Lucene91HnswVectorsFormat
for a dense case where all docs have vectors.

Currently we write the doc IDs of all documents that have vectors
inefficiently. This improves their encoding: when all documents
have vectors, we don't write document IDs at all, but instead write
a single short value – a dense marker.
@LuXugang
Member

LuXugang commented Apr 5, 2022

I wonder if it would be a better trade-off to keep ints uncompressed, but read them from disk directly instead of loading giant arrays in memory? Or possibly switch to something like DirectMonotonicReader if it doesn't slow down searches.

@jpountz, could we use IndexedDISI to store docIds and DirectMonotonicWriter to store the ordToDoc mapping, like doc values do?

@jpountz
Contributor

jpountz commented Apr 5, 2022

Yes, something like that sounds like a good fit to store the ordToDoc mapping indeed 👍
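For illustration, the core idea behind DirectMonotonicWriter (this sketch is not its real API) is to store a monotonic mapping like ordToDoc as small corrections against a linear model `first + avgSlope * ord`, so each entry needs only a few bits instead of a full int.

```java
public class MonotonicSketch {
  // Encode a monotonic int[] as per-ordinal corrections against the
  // average slope. For a near-linear mapping the corrections stay near zero.
  public static long[] encode(int[] ordToDoc) {
    int n = ordToDoc.length;
    float avgSlope = n <= 1 ? 0 : (float) (ordToDoc[n - 1] - ordToDoc[0]) / (n - 1);
    long[] corrections = new long[n];
    for (int ord = 0; ord < n; ord++) {
      long expected = ordToDoc[0] + (long) (avgSlope * ord);
      corrections[ord] = ordToDoc[ord] - expected;
    }
    return corrections;
  }

  // Random-access decode: the caller keeps first value and slope as metadata.
  public static int decode(long[] corrections, int first, float avgSlope, int ord) {
    return (int) (first + (long) (avgSlope * ord) + corrections[ord]);
  }
}
```

The real implementation additionally splits the input into blocks and bit-packs the corrections; this sketch only shows why a monotonic mapping compresses so well while staying randomly accessible.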
