Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9280: Collectors to skip noncompetitive documents #1351

Merged
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
0d4b2f5
Collectors to skip noncompetitive documents
mayya-sharipova Mar 13, 2020
6384b15
Address feedback1
mayya-sharipova Mar 18, 2020
d732d7e
Address Feedback2
mayya-sharipova Mar 19, 2020
209bc21
Adjust tests
mayya-sharipova Mar 20, 2020
95e1bc1
Address feedback
mayya-sharipova Mar 23, 2020
0e3c7da
Add docs and correct bugs
mayya-sharipova Mar 25, 2020
d7e9507
Make constructor of LongDocValuesPointSortField public
mayya-sharipova Mar 25, 2020
1154d4a
Adjust docs and tests
mayya-sharipova Mar 26, 2020
39379a7
Optimized sort of field without points
mayya-sharipova Mar 30, 2020
6c628f7
Address Adrien's feedback
mayya-sharipova Mar 31, 2020
eeb23c1
Add IterableFieldComparator
mayya-sharipova Apr 2, 2020
89d241e
Address Alan's comments
mayya-sharipova Apr 6, 2020
4448499
Address Alan's feedback2
mayya-sharipova Apr 9, 2020
719882e
Address Alan's feedback 3
mayya-sharipova Apr 11, 2020
c84fe5e
Address Alan's feedback 4
mayya-sharipova Apr 15, 2020
d7ef9b6
Collector returns comparator's iterator
mayya-sharipova Apr 15, 2020
24c94ff
Address Adrien's feedback
mayya-sharipova Apr 21, 2020
2fd9075
Set scorer for inner comparator
mayya-sharipova Apr 23, 2020
b8e138c
Correct Indent
mayya-sharipova Apr 24, 2020
7120424
Separate classes for comparator and leaf comparator
mayya-sharipova Apr 29, 2020
6c62fd0
Merge remote-tracking branch 'upstream/master' into comparator-set-mi…
mayya-sharipova Apr 29, 2020
1ab2a6e
Address Adrien's comments
mayya-sharipova May 26, 2020
f910030
Merge remote-tracking branch 'upstream/master' into comparator-set-mi…
mayya-sharipova May 26, 2020
99dd0c1
Address Adrien's feedback
mayya-sharipova Jun 16, 2020
8ebcff8
Merge remote-tracking branch 'upstream/master' into comparator-set-mi…
mayya-sharipova Jun 16, 2020
55f2940
Add contributors' names
mayya-sharipova Jun 23, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -115,7 +115,7 @@ public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float bo
return new ConstantScoreWeight(this, boost) {
@Override
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
if (scoreMode == ScoreMode.TOP_SCORES) {
if (scoreMode == ScoreMode.TOP_SCORES || scoreMode == ScoreMode.TOP_DOCS) {
return super.bulkScorer(context);
}
final BulkScorer innerScorer = innerWeight.bulkScorer(context);
Expand Down
12 changes: 12 additions & 0 deletions lucene/core/src/java/org/apache/lucene/search/FieldComparator.java
Original file line number Diff line number Diff line change
Expand Up @@ -928,4 +928,16 @@ public int compareTop(int doc) throws IOException {
@Override
public void setScorer(Scorable scorer) {}
}

/**
* A field comparator that can provide an iterator over competitive documents.
*/
public static abstract class IteratorSupplierComparator<T> extends FieldComparator<T> implements LeafFieldComparator {
abstract DocIdSetIterator iterator();

// This method is called from TopFieldCollector when already enough top hits have been collected.
// This method should be called every time the bottom entry is updated, and it informs the comparator
// to update its iterator to include possibly only docs that are "stronger" than the current bottom entry
abstract void updateIterator() throws IOException;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this ? We could update the iterator every time a bottom value is set ?

Copy link
Contributor Author

@mayya-sharipova mayya-sharipova Mar 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it is more straightforward to just update an iterator in setBottom function of a comparator.

But I was thinking it is better to have a special function for two reasons:

  1. After updating an iterator, in TopFieldCollector we need to change
    totalHitsRelation = TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO;

  2. we also need to check hitsThresholdChecker.isThresholdReached(), and passing not strictly related object hitsThresholdChecker to a comparator's constructor doesn't look nice to me.

Please let me know if you think otherwise

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 1. we could set the totalHitsRelation when we reach the total hits threshold in the TOP_DOCS mode ?
For 2. I wonder if we could pass the hitsThresholdChecker to the LeafFieldComparator like we do for the scorer ?
This way we can update the iterator internally when a new bottom is set or when compareBottom is called ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name seems to indicate that this is something that compares IteratorSuppliers, when in fact it is something that is a comparator that also supplies iterators. I'm not sure I understand yet where it fits, but given that, a better name might be IterableComparator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msokolov Thanks for the suggestion, naming is tough, addressed in 95e1bc1.

}
}
Original file line number Diff line number Diff line change
Expand Up @@ -53,4 +53,8 @@ public String toString() {
return name + "(" + in + ")";
}

@Override
public DocIdSetIterator iterator() {
return in.iterator();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've had endless discussions about whether or not to delegate in FilterXXX classes and I think that the consensus is that we should only delegate abstract methods. Since this one has a default implementation, let's not delegate and look for extensions of FilterCollector that should delegate it? (e.g. asserting collectors)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in 7120424

}
Original file line number Diff line number Diff line change
Expand Up @@ -93,4 +93,11 @@ public interface LeafCollector {
*/
void collect(int doc) throws IOException;

/*
jpountz marked this conversation as resolved.
Show resolved Hide resolved
* optionally returns an iterator over competitive documents
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document that the default is to return null which Lucene interprets as the collector doesn't filter any documents. It's probably worth making explicit as null iterators are elsewhere interpreted as matching no documents.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jpountz

It's probably worth making explicit as null iterators are elsewhere interpreted as matching no documents

What is the way to make this explicit?

*/
default DocIdSetIterator iterator() {
jpountz marked this conversation as resolved.
Show resolved Hide resolved
return null;
}
jpountz marked this conversation as resolved.
Show resolved Hide resolved

}
Original file line number Diff line number Diff line change
@@ -0,0 +1,195 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.lucene.search;

import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.util.DocIdSetBuilder;

import java.io.IOException;
import java.util.Arrays;

import static org.apache.lucene.search.FieldComparator.IteratorSupplierComparator;

public class LongDocValuesPointComparator extends IteratorSupplierComparator<Long> {
private final String field;
private final boolean reverse;
private final long missingValue;
private final long[] values;
private long bottom;
private long topValue;
boolean hasTopValue = false; // indicates that topValue for searchAfter is set
protected NumericDocValues docValues;
private DocIdSetIterator iterator;
private PointValues pointValues;
private int maxDoc;
private int maxDocVisited;
private int updateCounter = 0;

public LongDocValuesPointComparator(String field, int numHits, boolean reverse, Long missingValue) {
this.field = field;
this.reverse = reverse;
this.missingValue = missingValue != null ? missingValue : 0L;
this.values = new long[numHits];
}

private long getValueForDoc(int doc) throws IOException {
if (docValues.advanceExact(doc)) {
return docValues.longValue();
} else {
return missingValue;
}
}

@Override
public int compare(int slot1, int slot2) {
return Long.compare(values[slot1], values[slot2]);
}

@Override
public void setTopValue(Long value) {
topValue = value;
hasTopValue = true;
}

@Override
public Long value(int slot) {
return Long.valueOf(values[slot]);
}

@Override
public LeafFieldComparator getLeafComparator(LeafReaderContext context) throws IOException {
docValues = DocValues.getNumeric(context.reader(), field);
iterator = docValues;
pointValues = context.reader().getPointValues(field);
maxDoc = context.reader().maxDoc();
maxDocVisited = 0;
return this;
}

@Override
public void setBottom(int slot) throws IOException {
this.bottom = values[slot];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the iterator here ? We would need to check the total hits threshold so maybe pass the HitsThresholdChecker in the ctr somehow ?

}

@Override
public int compareBottom(int doc) throws IOException {
return Long.compare(bottom, getValueForDoc(doc));
}

@Override
public int compareTop(int doc) throws IOException {
return Long.compare(topValue, getValueForDoc(doc));
}

@Override
public void copy(int slot, int doc) throws IOException {
maxDocVisited = doc;
values[slot] = getValueForDoc(doc);
}

@Override
public void setScorer(Scorable scorer) throws IOException {}

public DocIdSetIterator iterator() {
return iterator;
}

// update its iterator to include possibly only docs that are "stronger" than the current bottom entry
public void updateIterator() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should throttle the checks here (if the bottom value changes frequently). In the distance feature query we start throttling after 256 calls, we should replicate here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 6384b15

updateCounter++;
if (updateCounter > 256 && (updateCounter & 0x1f) != 0x1f) { // Start sampling if we get called too much
return;
}

final byte[] maxValueAsBytes = reverse == false ? new byte[Long.BYTES] : hasTopValue ? new byte[Long.BYTES]: null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this initialization into the constructor, or is this not shareable and must be local storage? I think we call updateIterator in collect() right? If we can avoid object creation in an inner loop, that would be good. We could create both arrays unconditionally I think and set a boolean here to be used below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msokolov Thanks for the suggestion, indeed these values can be initialized in the comparator's constructor. As each topfieldcollector has its own comparator and processes segments sequentially, these values should be shareable. Addressed in 95e1bc1

final byte[] minValueAsBytes = reverse ? new byte[Long.BYTES] : hasTopValue ? new byte[Long.BYTES]: null;
if (reverse == false) {
LongPoint.encodeDimension(bottom, maxValueAsBytes, 0);
if (hasTopValue) {
LongPoint.encodeDimension(topValue, minValueAsBytes, 0);
}
} else {
LongPoint.encodeDimension(bottom, minValueAsBytes, 0);
if (hasTopValue) {
LongPoint.encodeDimension(topValue, maxValueAsBytes, 0);
}
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should also take the topValue into account here (searchAfter) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 6384b15

DocIdSetBuilder result = new DocIdSetBuilder(maxDoc);
PointValues.IntersectVisitor visitor = new PointValues.IntersectVisitor() {
DocIdSetBuilder.BulkAdder adder;
@Override
public void grow(int count) {
adder = result.grow(count);
}

@Override
public void visit(int docID) {
if (docID <= maxDocVisited) {
return; // Already visited or skipped
}
adder.add(docID);
}

@Override
public void visit(int docID, byte[] packedValue) {
if (docID <= maxDocVisited) {
return; // Already visited or skipped
}
if (maxValueAsBytes != null) {
// doc's value is too high
if (Arrays.compareUnsigned(packedValue, 0, Long.BYTES, maxValueAsBytes, 0, Long.BYTES) > 0) return;
}
if (minValueAsBytes != null) {
// doc's value is too low
if (Arrays.compareUnsigned(packedValue, 0, Long.BYTES, minValueAsBytes, 0, Long.BYTES) < 0) return;
}
adder.add(docID); // doc is competitive
}

@Override
public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
if (((maxValueAsBytes != null) &&
Arrays.compareUnsigned(minPackedValue, 0, Long.BYTES, maxValueAsBytes, 0, Long.BYTES) > 0) ||
((minValueAsBytes != null) &&
Arrays.compareUnsigned(maxPackedValue, 0, Long.BYTES, minValueAsBytes, 0, Long.BYTES) < 0)) {
return PointValues.Relation.CELL_OUTSIDE_QUERY;
}
if (((maxValueAsBytes != null) &&
Arrays.compareUnsigned(maxPackedValue, 0, Long.BYTES, maxValueAsBytes, 0, Long.BYTES) > 0) ||
((minValueAsBytes != null) &&
Arrays.compareUnsigned(minPackedValue, 0, Long.BYTES, minValueAsBytes, 0, Long.BYTES) < 0)) {
return PointValues.Relation.CELL_CROSSES_QUERY;
}
return PointValues.Relation.CELL_INSIDE_QUERY;
}
};
final long threshold = iterator.cost() >>> 3;
long estimatedNumberOfMatches = pointValues.estimatePointCount(visitor); // runs in O(log(numPoints))
if (estimatedNumberOfMatches >= threshold) {
// the new range is not selective enough to be worth materializing, it doesn't reduce number of docs at least 8x
return;
}
pointValues.intersect(visitor);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should update the iterator only if it allows to skip "lots" of documents, in distance feature query we set the threshold to a 8x reduction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 6384b15

this.iterator = result.build().iterator();
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.lucene.search;

/*
* Sort field for long values indexed both with doc values and points.
* Use this field if you want to skip collecting non-competitive documents,
* which in some cases can significantly speed up sort queries.
*/
public class LongDocValuesPointSortField extends SortField {

LongDocValuesPointSortField(String field) {
super(field, SortField.Type.CUSTOM);
}

LongDocValuesPointSortField(String field, boolean reverse) {
super(field, SortField.Type.CUSTOM, reverse);
}

@Override
public FieldComparator<?> getComparator(int numHits, int sortPos) {
if (sortPos == 0) {
return new LongDocValuesPointComparator(getField(), numHits, reverse, (Long) missingValue);
} else {
return new FieldComparator.LongComparator(numHits, getField(), (Long) missingValue);
}
}

@Override
public void setMissingValue(Object missingValue) {
if (missingValue != null && missingValue.getClass() != Long.class)
throw new IllegalArgumentException("Missing values for Type.LONG can only be of type java.lang.Long, but got " + missingValue.getClass());
this.missingValue = missingValue;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,7 @@ public boolean isCacheable(LeafReaderContext ctx) {

@Override
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
if (scoreMode == ScoreMode.TOP_SCORES) {
if ((scoreMode == ScoreMode.TOP_SCORES) || (scoreMode == ScoreMode.TOP_DOCS)) {
return super.bulkScorer(context);
}
final float score = score();
Expand Down
11 changes: 11 additions & 0 deletions lucene/core/src/java/org/apache/lucene/search/ScoreMode.java
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,17 @@ public boolean needsScores() {
public boolean needsScores() {
return true;
}
},

/**
* ScoreMode for collectors that can provide their own iterators,
* to optionally allow to skip for non-competitive docs
*/
TOP_DOCS {
@Override
public boolean needsScores() {
return false;
}
};

/**
Expand Down
Loading