-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-10408 Better encoding of doc Ids in vectors #649
LUCENE-10408 Better encoding of doc Ids in vectors #649
Conversation
Better encoding of doc Ids in Lucene91HnswVectorsFormat Currently we write doc Ids of all documents that have vectors not very efficiently. This improve their encoding by: - for a case when all documents have vectors, we don't write document IDs. - for a case when only certain documents have vectors, we do delta encoding of doc Ids.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great that you are looking into these TODOs while it's easy to make changes to the format (before the 9.1 release). I am curious if you noticed any search performance improvement, or if the motivation is more on space/ memory savings.
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
Outdated
Show resolved
Hide resolved
Optimizing for the case when all docs have a value makes sense to me.
In the past we rejected changes that would consist of having the data written in a compressed fashion on disk but still uncompressed in memory. I wonder if it would be a better trade-off to keep ints uncompressed, but read them from disk directly instead of loading giant arrays in memory? Or possibly switch to something like DirectMonotonicReader if it doesn't slow down searches. |
@jtibshirani @jpountz Thanks for your feedback. I've tried to address in 6bf1aea. I've decided to focus this PR only on optimizing the dense case and keeping the sparse case as was before – uncompressed way.
@jpountz Thank you for the suggestion, Adrien. I've put this as TODO in the code to explore. I am also wondering since we use binarySearch on docIds array, would it still be acceptable to have this array on disk? Do we have a precedent for such use case for reading from disk? |
I think so. It's a bit like the terms index, the access pattern is random, but this data should be small enough that it would generally be in the FS cache. |
Of course if we can have a less random access pattern (e.g. maybe by using exponential search, having skip lists like postings or jump tables like doc values) it would be even better. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new changes make sense to me (I'm not an expert in these encoding decisions though).
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
Outdated
Show resolved
Hide resolved
Additional motivation for this PR: it could help with performance of exact search (in #656). When all docs have vectors, we can avoid a binary search in |
@jpountz @jtibshirani Thank you for your suggestions. Are you ok if we keep this PR as it is now to optimize the dense case only, and explore Adrien's suggestions for a sparse case in subsequent work? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mayya-sharipova that plan makes sense to me. This looks good to me, I just left some tiny last comments.
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
Outdated
Show resolved
Hide resolved
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
Show resolved
Hide resolved
lucene/CHANGES.txt
Outdated
@@ -204,6 +204,8 @@ Optimizations | |||
* LUCENE-10367: Optimize CoveringQuery for the case when the minimum number of | |||
matching clauses is a constant. (LuYunCheng via Adrien Grand) | |||
|
|||
* LUCENE-10408 Better encoding of doc Ids in vectors (Mayya Sharipova, Julie Tibshirani, Adrien Grand) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for including me! I'm also fine if you omit me when I'm a reviewer.
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java
Show resolved
Hide resolved
Better encoding of doc Ids in Lucene91HnswVectorsFormat for a dense case where all docs have vectors. Currently we write doc Ids of all documents that have vectors not very efficiently. This improve their encoding by for a case when all documents have vectors, we don't write document IDs, but just write a single short value – a dense marker.
* main: migrate to temurin (apache#697) LUCENE-10424: Optimize the "everything matches" case for count query in PointRangeQuery (apache#691) LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori Remove deprecated constructors in Nori (apache#695) LUCENE-10400: revise binary dictionaries' constructor in nori (apache#693) LUCENE-10408: Fix vector values iteration bug (apache#690) Temporarily mute TestKnnVectorQuery#testRandomWithFilter LUCENE-10382: Support filtering in KnnVectorQuery (apache#656) LUCENE-10084: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery when all docs have the field (apache#677) Add CHANGES entry for LUCENE-10398 LUCENE-10398: Add static method for getting Terms from LeafReader (apache#678) LUCENE-10408 Better encoding of doc Ids in vectors (apache#649) LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count. (apache#685)
* main: LUCENE-10416: move changes entry to v10.0.0 migrate to temurin (apache#697) LUCENE-10424: Optimize the "everything matches" case for count query in PointRangeQuery (apache#691) LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori Remove deprecated constructors in Nori (apache#695) LUCENE-10400: revise binary dictionaries' constructor in nori (apache#693) LUCENE-10408: Fix vector values iteration bug (apache#690) Temporarily mute TestKnnVectorQuery#testRandomWithFilter LUCENE-10382: Support filtering in KnnVectorQuery (apache#656) LUCENE-10084: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery when all docs have the field (apache#677) Add CHANGES entry for LUCENE-10398 LUCENE-10398: Add static method for getting Terms from LeafReader (apache#678) LUCENE-10408 Better encoding of doc Ids in vectors (apache#649) LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count. (apache#685)
Better encoding of doc Ids in Lucene91HnswVectorsFormat for a dense case where all docs have vectors. Currently we write doc Ids of all documents that have vectors not very efficiently. This improve their encoding by for a case when all documents have vectors, we don't write document IDs, but just write a single short value – a dense marker.
@jpountz, could we use IndexedDISI to store docIds and DirectMonotonicWriter to store |
Yes, something like that sounds like a good fit to store the ordToDoc mapping indeed 👍 |
Better encoding of doc Ids in Lucene91HnswVectorsFormat
for a dense case where all docs have vectors.
Currently we write doc Ids of all documents that have vectors
not very efficiently.
This improve their encoding by for a case when all documents
have vectors, we don't write document IDs, but just write a
single short value – a dense marker.