LUCENE-10408 Better encoding of doc Ids in vectors #649

mayya-sharipova · 2022-02-04T21:56:02Z

Better encoding of doc Ids in Lucene91HnswVectorsFormat
for a dense case where all docs have vectors.

Currently we write the doc Ids of all documents that have vectors
not very efficiently.
This improves their encoding: when all documents
have vectors, we don't write document IDs, but just write a
single short value, a dense marker.
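
The dense-marker idea described above can be sketched roughly as follows. This is a minimal illustration in plain Java, not the code in Lucene91HnswVectorsWriter or Lucene91HnswVectorsReader: the class name, the marker values, the one-byte marker width, and the use of `ByteBuffer` are all assumptions made for the example. The description only states that a single short marker is written in the dense case.

```java
import java.nio.ByteBuffer;

/** Illustrative only: a hypothetical layout, not the Lucene91 on-disk format. */
final class DenseMarkerSketch {
  // Hypothetical marker values chosen for this example.
  static final byte DENSE = -1;
  static final byte SPARSE = 0;

  /** Encodes the sorted, unique doc IDs that have vectors for a segment with maxDoc docs. */
  static byte[] encode(int[] docIds, int maxDoc) {
    // Unique IDs in [0, maxDoc) with length == maxDoc means every doc has a vector.
    boolean dense = docIds.length == maxDoc;
    int size = Integer.BYTES + 1 + (dense ? 0 : docIds.length * Integer.BYTES);
    ByteBuffer out = ByteBuffer.allocate(size);
    out.putInt(docIds.length);
    if (dense) {
      // Ordinal == doc ID, so the marker alone is enough.
      out.put(DENSE);
    } else {
      out.put(SPARSE);
      for (int docId : docIds) {
        out.putInt(docId); // the sparse case stays uncompressed
      }
    }
    return out.array();
  }

  /** Decodes to an explicit doc ID array, or null when the field is dense. */
  static int[] decode(byte[] encoded) {
    ByteBuffer in = ByteBuffer.wrap(encoded);
    int count = in.getInt();
    if (in.get() == DENSE) {
      return null; // caller treats the ordinal as the doc ID
    }
    int[] docIds = new int[count];
    for (int i = 0; i < count; i++) {
      docIds[i] = in.getInt();
    }
    return docIds;
  }
}
```

Under these assumptions, a fully dense field costs a count plus one marker byte regardless of how many documents the segment holds.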

Better encoding of doc Ids in Lucene91HnswVectorsFormat

Currently we write doc Ids of all documents that have vectors not very efficiently. This improves their encoding by:
- for a case when all documents have vectors, we don't write document IDs.
- for a case when only certain documents have vectors, we do delta encoding of doc Ids.

jtibshirani

Great that you are looking into these TODOs while it's easy to make changes to the format (before the 9.1 release). I am curious if you noticed any search performance improvement, or if the motivation is more about space/memory savings.

lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java

lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java

jpountz · 2022-02-07T17:59:11Z

Optimizing for the case when all docs have a value makes sense to me.

for a case when only certain documents have vectors, we do delta encoding of doc Ids.

In the past we rejected changes that would consist of having the data written in a compressed fashion on disk but still uncompressed in memory.

I wonder if it would be a better trade-off to keep ints uncompressed, but read them from disk directly instead of loading giant arrays in memory? Or possibly switch to something like DirectMonotonicReader if it doesn't slow down searches.

mayya-sharipova · 2022-02-08T23:47:44Z

@jtibshirani @jpountz Thanks for your feedback. I've tried to address it in 6bf1aea. I've decided to focus this PR only on optimizing the dense case, keeping the sparse case as it was before (uncompressed).

I wonder if it would be a better trade-off to keep ints uncompressed, but read them from disk directly instead of loading giant arrays in memory? Or possibly switch to something like DirectMonotonicReader if it doesn't slow down searches.

@jpountz Thank you for the suggestion, Adrien. I've put this as a TODO in the code to explore. I am also wondering since we use binarySearch on docIds array, would it still be acceptable to have this array on disk? Do we have a precedent for reading from disk in such a use case?

jpountz · 2022-02-09T12:52:50Z

I am also wondering since we use binarySearch on docIds array, would it still be acceptable to have this array on disk?

I think so. It's a bit like the terms index: the access pattern is random, but this data should be small enough that it would generally be in the FS cache.

jpountz · 2022-02-09T12:54:00Z

Of course if we can have a less random access pattern (e.g. maybe by using exponential search, having skip lists like postings or jump tables like doc values) it would be even better.
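
As a rough illustration of the exponential search mentioned above (not code from this PR), the sketch below finds the first doc ID at or after a target by doubling a step from the current position and then binary searching only the bracketed range. The class and method names are hypothetical, and the doc IDs are held in a plain `int[]` for simplicity rather than read from disk.

```java
/** Illustrative only: exponential (galloping) search over a sorted doc ID array. */
final class ExponentialSearchSketch {

  /**
   * Returns the index of the first element >= target at or after {@code from},
   * or docIds.length if there is none. Assumes docIds is sorted ascending.
   */
  static int advance(int[] docIds, int from, int target) {
    if (from >= docIds.length) {
      return docIds.length;
    }
    if (docIds[from] >= target) {
      return from;
    }
    // Invariant: docIds[lo] < target.
    int lo = from;
    int step = 1;
    while (true) {
      // Use long arithmetic so lo + step cannot overflow.
      int hi = (int) Math.min((long) lo + step, docIds.length);
      if (hi == docIds.length || docIds[hi] >= target) {
        // The answer lies in [lo + 1, hi]; finish with a binary search.
        int left = lo + 1;
        int right = hi;
        while (left < right) {
          int mid = (left + right) >>> 1;
          if (docIds[mid] < target) {
            left = mid + 1;
          } else {
            right = mid;
          }
        }
        return left;
      }
      lo = hi;
      step <<= 1;
    }
  }
}
```

When the target is close to the current position, the probes stay near `from`, which is what makes the access pattern less random than a binary search over the whole array.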

jtibshirani

The new changes make sense to me (I'm not an expert in these encoding decisions though).

lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java

lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java

jtibshirani · 2022-02-10T23:42:50Z

Additional motivation for this PR: it could help with performance of exact search (in #656). When all docs have vectors, we can avoid a binary search in VectorValues#advance.
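
To make the point about advance concrete, here is a small sketch of how a dense field can skip the binary search. It is a simplified stand-in, not the actual VectorValues implementation: the class, its factory methods, and the convention of returning -1 when no vector remains are assumptions for illustration.

```java
import java.util.Arrays;

/** Illustrative only: maps a target doc ID to a vector ordinal. */
final class VectorOrdinalSketch {
  private final int[] docIds; // null when every doc has a vector
  private final int size;

  private VectorOrdinalSketch(int[] docIds, int size) {
    this.docIds = docIds;
    this.size = size;
  }

  static VectorOrdinalSketch dense(int maxDoc) {
    return new VectorOrdinalSketch(null, maxDoc);
  }

  /** docIds must be sorted ascending and unique. */
  static VectorOrdinalSketch sparse(int[] docIds) {
    return new VectorOrdinalSketch(docIds, docIds.length);
  }

  /** Returns the ordinal of the first vector whose doc ID is >= target (target >= 0), or -1. */
  int advanceOrd(int target) {
    if (docIds == null) {
      // Dense: ordinal == doc ID, so no search is needed.
      return target < size ? target : -1;
    }
    int idx = Arrays.binarySearch(docIds, target);
    if (idx < 0) {
      idx = -idx - 1; // insertion point = first element greater than target
    }
    return idx < size ? idx : -1;
  }

  /** Returns the doc ID that owns the given vector ordinal. */
  int ordToDoc(int ord) {
    return docIds == null ? ord : docIds[ord];
  }
}
```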

mayya-sharipova · 2022-02-13T18:11:48Z

@jpountz @jtibshirani Thank you for your suggestions. Are you ok if we keep this PR as it is now to optimize the dense case only, and explore Adrien's suggestions for a sparse case in subsequent work?

jtibshirani

@mayya-sharipova that plan makes sense to me. This looks good to me, I just left some tiny last comments.

lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java

lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java

lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java

jtibshirani · 2022-02-16T22:18:07Z

lucene/CHANGES.txt

@@ -204,6 +204,8 @@ Optimizations
 * LUCENE-10367: Optimize CoveringQuery for the case when the minimum number of
  matching clauses is a constant. (LuYunCheng via Adrien Grand)

+* LUCENE-10408 Better encoding of doc Ids in vectors (Mayya Sharipova, Julie Tibshirani, Adrien Grand)


Thanks for including me! I'm also fine if you omit me when I'm a reviewer.

lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java

LuXugang · 2022-04-05T09:31:24Z

I wonder if it would be a better trade-off to keep ints uncompressed, but read them from disk directly instead of loading giant arrays in memory? Or possibly switch to something like DirectMonotonicReader if it doesn't slow down searches.

@jpountz, could we use IndexedDISI to store docIds and DirectMonotonicWriter to store the ordToDoc mapping, as doc values do?

jpountz · 2022-04-05T12:26:05Z

Yes, something like that sounds like a good fit to store the ordToDoc mapping indeed 👍
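
The sketch below does not call IndexedDISI or DirectMonotonicWriter. It is a simplified, single-block illustration of the general idea behind monotonic encoding of an ordToDoc mapping: fit a line through the block, then store only each value's small non-negative offset from the line. The class name and layout are hypothetical, and the real Lucene classes differ in detail (multiple blocks, bit-packed storage on disk).

```java
/** Illustrative only: one block of a monotonic (slope + offsets) encoding. */
final class MonotonicBlockSketch {
  private final float avgSlope;
  private final long min;      // smallest residual, so stored offsets are >= 0
  private final long[] deltas; // per-value offset from the expected line

  /** sorted must be ascending, e.g. an ordToDoc mapping. */
  MonotonicBlockSketch(int[] sorted) {
    int n = sorted.length;
    avgSlope = n <= 1 ? 0f : (float) (sorted[n - 1] - sorted[0]) / (n - 1);
    long[] residuals = new long[n];
    long smallest = Long.MAX_VALUE;
    for (int i = 0; i < n; i++) {
      residuals[i] = sorted[i] - (long) (avgSlope * i);
      smallest = Math.min(smallest, residuals[i]);
    }
    min = n == 0 ? 0 : smallest;
    deltas = new long[n];
    for (int i = 0; i < n; i++) {
      deltas[i] = residuals[i] - min;
    }
  }

  /** Reconstructs the original value at ordinal i. */
  long get(int i) {
    return min + (long) (avgSlope * i) + deltas[i];
  }

  /** Bits needed per stored offset if the offsets were bit-packed. */
  int bitsPerValue() {
    long max = 0;
    for (long d : deltas) {
      max = Math.max(max, d);
    }
    return 64 - Long.numberOfLeadingZeros(max);
  }
}
```

For a mapping that is close to linear, the offsets need far fewer bits than a full 32-bit doc ID, which is why this kind of encoding suits an ordToDoc mapping.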

jtibshirani reviewed Feb 5, 2022

lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java

lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java

mayya-sharipova added 2 commits February 8, 2022 17:53

Merge remote-tracking branch 'upstream/main' into hnsw-graph-docs2

487eec0

Address Adrien's and Julie's comments

6bf1aea

jtibshirani reviewed Feb 10, 2022

Address Julie's feedback 2

ca9bfb2

jtibshirani approved these changes Feb 14, 2022

Address Julie's comments

47042f2

jtibshirani approved these changes Feb 16, 2022

Merge remote-tracking branch 'upstream/main' into hnsw-graph-docs2

0e0341a

mayya-sharipova merged commit f8c5408 into apache:main Feb 17, 2022

mayya-sharipova deleted the hnsw-graph-docs2 branch February 17, 2022 10:34
