-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-9280: Collectors to skip noncompetitive documents #1351
Changes from 4 commits
0d4b2f5
6384b15
d732d7e
209bc21
95e1bc1
0e3c7da
d7e9507
1154d4a
39379a7
6c628f7
eeb23c1
89d241e
4448499
719882e
c84fe5e
d7ef9b6
24c94ff
2fd9075
b8e138c
7120424
6c62fd0
1ab2a6e
f910030
99dd0c1
8ebcff8
55f2940
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -53,4 +53,8 @@ public String toString() { | |
return name + "(" + in + ")"; | ||
} | ||
|
||
@Override | ||
public DocIdSetIterator iterator() { | ||
return in.iterator(); | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We've had endless discussions about whether or not to delegate in FilterXXX classes and I think that the consensus is that we should only delegate abstract methods. Since this one has a default implementation, let's not delegate and look for extensions of FilterCollector that should delegate it? (e.g. asserting collectors) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. addressed in 7120424 |
||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -93,4 +93,11 @@ public interface LeafCollector { | |
*/ | ||
void collect(int doc) throws IOException; | ||
|
||
/* | ||
jpountz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
* optionally returns an iterator over competitive documents | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you document that the default is to return There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks @jpountz
What is the way to make this explicit? |
||
*/ | ||
default DocIdSetIterator iterator() { | ||
jpountz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
return null; | ||
} | ||
jpountz marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,195 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.lucene.search; | ||
|
||
import org.apache.lucene.document.LongPoint; | ||
import org.apache.lucene.index.DocValues; | ||
import org.apache.lucene.index.LeafReaderContext; | ||
import org.apache.lucene.index.NumericDocValues; | ||
import org.apache.lucene.index.PointValues; | ||
import org.apache.lucene.util.DocIdSetBuilder; | ||
|
||
import java.io.IOException; | ||
import java.util.Arrays; | ||
|
||
import static org.apache.lucene.search.FieldComparator.IteratorSupplierComparator; | ||
|
||
public class LongDocValuesPointComparator extends IteratorSupplierComparator<Long> { | ||
private final String field; | ||
private final boolean reverse; | ||
private final long missingValue; | ||
private final long[] values; | ||
private long bottom; | ||
private long topValue; | ||
boolean hasTopValue = false; // indicates that topValue for searchAfter is set | ||
protected NumericDocValues docValues; | ||
private DocIdSetIterator iterator; | ||
private PointValues pointValues; | ||
private int maxDoc; | ||
private int maxDocVisited; | ||
private int updateCounter = 0; | ||
|
||
public LongDocValuesPointComparator(String field, int numHits, boolean reverse, Long missingValue) { | ||
this.field = field; | ||
this.reverse = reverse; | ||
this.missingValue = missingValue != null ? missingValue : 0L; | ||
this.values = new long[numHits]; | ||
} | ||
|
||
private long getValueForDoc(int doc) throws IOException { | ||
if (docValues.advanceExact(doc)) { | ||
return docValues.longValue(); | ||
} else { | ||
return missingValue; | ||
} | ||
} | ||
|
||
@Override | ||
public int compare(int slot1, int slot2) { | ||
return Long.compare(values[slot1], values[slot2]); | ||
} | ||
|
||
@Override | ||
public void setTopValue(Long value) { | ||
topValue = value; | ||
hasTopValue = true; | ||
} | ||
|
||
@Override | ||
public Long value(int slot) { | ||
return Long.valueOf(values[slot]); | ||
} | ||
|
||
@Override | ||
public LeafFieldComparator getLeafComparator(LeafReaderContext context) throws IOException { | ||
docValues = DocValues.getNumeric(context.reader(), field); | ||
iterator = docValues; | ||
pointValues = context.reader().getPointValues(field); | ||
maxDoc = context.reader().maxDoc(); | ||
maxDocVisited = 0; | ||
return this; | ||
} | ||
|
||
@Override | ||
public void setBottom(int slot) throws IOException { | ||
this.bottom = values[slot]; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you update the iterator here ? We would need to check the total hits threshold so maybe pass the |
||
} | ||
|
||
@Override | ||
public int compareBottom(int doc) throws IOException { | ||
return Long.compare(bottom, getValueForDoc(doc)); | ||
} | ||
|
||
@Override | ||
public int compareTop(int doc) throws IOException { | ||
return Long.compare(topValue, getValueForDoc(doc)); | ||
} | ||
|
||
@Override | ||
public void copy(int slot, int doc) throws IOException { | ||
maxDocVisited = doc; | ||
values[slot] = getValueForDoc(doc); | ||
} | ||
|
||
@Override | ||
public void setScorer(Scorable scorer) throws IOException {} | ||
|
||
public DocIdSetIterator iterator() { | ||
return iterator; | ||
} | ||
|
||
// update its iterator to include possibly only docs that are "stronger" than the current bottom entry | ||
public void updateIterator() throws IOException { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should throttle the checks here (if the bottom value changes frequently). In the distance feature query we start throttling after 256 calls, we should replicate here ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Addressed in 6384b15 |
||
updateCounter++; | ||
if (updateCounter > 256 && (updateCounter & 0x1f) != 0x1f) { // Start sampling if we get called too much | ||
return; | ||
} | ||
|
||
final byte[] maxValueAsBytes = reverse == false ? new byte[Long.BYTES] : hasTopValue ? new byte[Long.BYTES]: null; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we move this initialization into the constructor, or is this not shareable and must be local storage? I think we call updateIterator in collect() right? If we can avoid object creation in an inner loop, that would be good. We could create both arrays unconditionally I think and set a boolean here to be used below? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
final byte[] minValueAsBytes = reverse ? new byte[Long.BYTES] : hasTopValue ? new byte[Long.BYTES]: null; | ||
if (reverse == false) { | ||
LongPoint.encodeDimension(bottom, maxValueAsBytes, 0); | ||
if (hasTopValue) { | ||
LongPoint.encodeDimension(topValue, minValueAsBytes, 0); | ||
} | ||
} else { | ||
LongPoint.encodeDimension(bottom, minValueAsBytes, 0); | ||
if (hasTopValue) { | ||
LongPoint.encodeDimension(topValue, maxValueAsBytes, 0); | ||
} | ||
}; | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you should also take the topValue into account here (searchAfter) ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Addressed in 6384b15 |
||
DocIdSetBuilder result = new DocIdSetBuilder(maxDoc); | ||
PointValues.IntersectVisitor visitor = new PointValues.IntersectVisitor() { | ||
DocIdSetBuilder.BulkAdder adder; | ||
@Override | ||
public void grow(int count) { | ||
adder = result.grow(count); | ||
} | ||
|
||
@Override | ||
public void visit(int docID) { | ||
if (docID <= maxDocVisited) { | ||
return; // Already visited or skipped | ||
} | ||
adder.add(docID); | ||
} | ||
|
||
@Override | ||
public void visit(int docID, byte[] packedValue) { | ||
if (docID <= maxDocVisited) { | ||
return; // Already visited or skipped | ||
} | ||
if (maxValueAsBytes != null) { | ||
// doc's value is too high | ||
if (Arrays.compareUnsigned(packedValue, 0, Long.BYTES, maxValueAsBytes, 0, Long.BYTES) > 0) return; | ||
} | ||
if (minValueAsBytes != null) { | ||
// doc's value is too low | ||
if (Arrays.compareUnsigned(packedValue, 0, Long.BYTES, minValueAsBytes, 0, Long.BYTES) < 0) return; | ||
} | ||
adder.add(docID); // doc is competitive | ||
} | ||
|
||
@Override | ||
public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) { | ||
if (((maxValueAsBytes != null) && | ||
Arrays.compareUnsigned(minPackedValue, 0, Long.BYTES, maxValueAsBytes, 0, Long.BYTES) > 0) || | ||
((minValueAsBytes != null) && | ||
Arrays.compareUnsigned(maxPackedValue, 0, Long.BYTES, minValueAsBytes, 0, Long.BYTES) < 0)) { | ||
return PointValues.Relation.CELL_OUTSIDE_QUERY; | ||
} | ||
if (((maxValueAsBytes != null) && | ||
Arrays.compareUnsigned(maxPackedValue, 0, Long.BYTES, maxValueAsBytes, 0, Long.BYTES) > 0) || | ||
((minValueAsBytes != null) && | ||
Arrays.compareUnsigned(minPackedValue, 0, Long.BYTES, minValueAsBytes, 0, Long.BYTES) < 0)) { | ||
return PointValues.Relation.CELL_CROSSES_QUERY; | ||
} | ||
return PointValues.Relation.CELL_INSIDE_QUERY; | ||
} | ||
}; | ||
final long threshold = iterator.cost() >>> 3; | ||
long estimatedNumberOfMatches = pointValues.estimatePointCount(visitor); // runs in O(log(numPoints)) | ||
if (estimatedNumberOfMatches >= threshold) { | ||
// the new range is not selective enough to be worth materializing, it doesn't reduce number of docs at least 8x | ||
return; | ||
} | ||
pointValues.intersect(visitor); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we should update the iterator only if it allows to skip "lots" of documents, in distance feature query we set the threshold to a 8x reduction. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Addressed in 6384b15 |
||
this.iterator = result.build().iterator(); | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.lucene.search; | ||
|
||
/* | ||
* Sort field for long values indexed both with doc values and points. | ||
* Use this field if you want to skip collecting non-competitive documents, | ||
* which in some cases can significantly speed up sort queries. | ||
*/ | ||
public class LongDocValuesPointSortField extends SortField { | ||
|
||
LongDocValuesPointSortField(String field) { | ||
super(field, SortField.Type.CUSTOM); | ||
} | ||
|
||
LongDocValuesPointSortField(String field, boolean reverse) { | ||
super(field, SortField.Type.CUSTOM, reverse); | ||
} | ||
|
||
@Override | ||
public FieldComparator<?> getComparator(int numHits, int sortPos) { | ||
if (sortPos == 0) { | ||
return new LongDocValuesPointComparator(getField(), numHits, reverse, (Long) missingValue); | ||
} else { | ||
return new FieldComparator.LongComparator(numHits, getField(), (Long) missingValue); | ||
} | ||
} | ||
|
||
@Override | ||
public void setMissingValue(Object missingValue) { | ||
if (missingValue != null && missingValue.getClass() != Long.class) | ||
throw new IllegalArgumentException("Missing values for Type.LONG can only be of type java.lang.Long, but got " + missingValue.getClass()); | ||
this.missingValue = missingValue; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this ? We could update the iterator every time a bottom value is set ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it is more straightforward to just update an iterator in
setBottom
function of a comparator.But I was thinking it is better to have a special function for two reasons:
After updating an iterator, in
TopFieldCollector
we need to changetotalHitsRelation = TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO;
we also need to check
hitsThresholdChecker.isThresholdReached()
, and passing not strictly related objecthitsThresholdChecker
to a comparator's constructor doesn't look nice to me.Please let me know if you think otherwise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 1. we could set the totalHitsRelation when we reach the total hits threshold in the TOP_DOCS mode ?
For 2. I wonder if we could pass the hitsThresholdChecker to the LeafFieldComparator like we do for the scorer ?
This way we can update the iterator internally when a new bottom is set or when
compareBottom
is called ?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name seems to indicate that this is something that compares IteratorSuppliers, when in fact it is something that is a comparator that also supplies iterators. I'm not sure I understand yet where it fits, but given that, a better name might be IterableComparator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msokolov Thanks for the suggestion, naming is tough, addressed in 95e1bc1.