-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-9280: Collectors to skip noncompetitive documents #1351
LUCENE-9280: Collectors to skip noncompetitive documents #1351
Conversation
Similar how scorers can update their iterators to skip non-competitive documents, collectors and comparators should also provide and update iterators that allow them to skip non-competive documents This could be useful if we want to sort by some field.
@jimczi I have created a draft PR for comparators and collectors to skip non-competitive docs. Can you please have a look at it and see if we are happy with this approach. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a great start @mayya-sharipova ! I left some comments to make this change less invasive but I really like the simplicity of the new long sort field.
It could also be nice to run some early benchmarks with luceneutil to show how useful this change can be for numeric sort ?
|
||
public static abstract class IteratorSupplierComparator<T> extends FieldComparator<T> implements LeafFieldComparator { | ||
abstract DocIdSetIterator iterator(); | ||
abstract void updateIterator() throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this ? We could update the iterator every time a bottom value is set ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it is more straightforward to just update an iterator in setBottom
function of a comparator.
But I was thinking it is better to have a special function for two reasons:
-
After updating an iterator, in
TopFieldCollector
we need to change
totalHitsRelation = TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO;
-
we also need to check
hitsThresholdChecker.isThresholdReached()
, and passing not strictly related objecthitsThresholdChecker
to a comparator's constructor doesn't look nice to me.
Please let me know if you think otherwise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 1. we could set the totalHitsRelation when we reach the total hits threshold in the TOP_DOCS mode ?
For 2. I wonder if we could pass the hitsThresholdChecker to the LeafFieldComparator like we do for the scorer ?
This way we can update the iterator internally when a new bottom is set or when compareBottom
is called ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name seems to indicate that this is something that compares IteratorSuppliers, when in fact it is something that is a comparator that also supplies iterators. I'm not sure I understand yet where it fits, but given that, a better name might be IterableComparator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return PointValues.Relation.CELL_CROSSES_QUERY; | ||
} | ||
}; | ||
pointValues.intersect(visitor); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should update the iterator only if it allows to skip "lots" of documents, in distance feature query we set the threshold to a 8x reduction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 6384b15
return iterator; | ||
} | ||
|
||
public void updateIterator() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should throttle the checks here (if the bottom value changes frequently). In the distance feature query we start throttling after 256 calls, we should replicate here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 6384b15
|
||
@Override | ||
public void setBottom(int slot) { | ||
this.bottom = values[slot]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the iterator here ? We would need to check the total hits threshold so maybe pass the HitsThresholdChecker
in the ctr somehow ?
} else { | ||
LongPoint.encodeDimension(bottom, minValueAsBytes, 0); | ||
}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should also take the topValue into account here (searchAfter) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 6384b15
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not think of any clever way to do this in IndexSearcher, I would appreciate your help if you can suggest any such way. I just redesigned DefaultBulkScorer to use a conjunction of a scorer's and collector's iterators.
I left some comments regarding the refactor but I like it better. I think you're right, the bulk scorer is a good entry point to handle the leaf collector iterator.
} | ||
} | ||
|
||
// conjunction iterator between scorer's iterator and collector's iterator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can replace this with ConjunctionDISI#intersectIterators
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (twoPhase == null) { | ||
while (currentDoc < end) { | ||
if (acceptDocs == null || acceptDocs.get(currentDoc)) { | ||
collector.collect(currentDoc); | ||
} | ||
currentDoc = iterator.nextDoc(); | ||
} | ||
return currentDoc; | ||
} else { | ||
final DocIdSetIterator approximation = twoPhase.approximation(); | ||
while (currentDoc < end) { | ||
if ((acceptDocs == null || acceptDocs.get(currentDoc)) && twoPhase.matches()) { | ||
collector.collect(currentDoc); | ||
} | ||
currentDoc = approximation.nextDoc(); | ||
while (currentDoc < end) { | ||
if ((acceptDocs == null || acceptDocs.get(currentDoc)) && (twoPhase == null || twoPhase.matches())) { | ||
collector.collect(currentDoc); | ||
} | ||
return currentDoc; | ||
currentDoc = iterator.nextDoc(); | ||
} | ||
return currentDoc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change is not required ? I see hotspot in the javadoc comment above so we shouldn't touch it if it's not required ;).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in d732d7e
doc = iterator.advance(min); | ||
} else { | ||
doc = twoPhase.approximation().advance(min); | ||
if (doc < min) scorerIterator.advance(min); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (doc < min) scorerIterator.advance(min); | |
if (doc < min) { | |
doc = combinedIterator.advance(min); | |
} |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in d732d7e
this.bottom = values[slot]; | ||
// can't use hitsThresholdChecker.isThresholdReached() as it uses > numHits, | ||
// while we want to update iterator as soon as threshold reaches numHits | ||
if (hitsThresholdChecker != null && (hitsThresholdChecker.getHitsThreshold() >= numHits)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi I am not very happy about this change because of 2 reasons:
- We can't use
hitsThresholdChecker.isThresholdReached
as it checks for greater than numHits, but we need to check starting with equal, as if there are no competitive docs latersetBottom
will not be called.
Do you know the reason whyhitsThresholdChecker.isThresholdReached
checks for greater than numHits and not greater or equal numHits? - totalHitsRelation may not end up to be set to TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO, as we set it only when we have later competitive hits.
I think it is better to have a previous implementation with a dedicated updateIterator
function called from TopFieldCollector
. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little fuzzy on my understanding of how you are making use of Points here, but I left a few micro-comments. I'll echo Jim's comment; it'd be great to see some results from luceneutil (or any reproducible benchmark) demonstrating this new idea.
|
||
public static abstract class IteratorSupplierComparator<T> extends FieldComparator<T> implements LeafFieldComparator { | ||
abstract DocIdSetIterator iterator(); | ||
abstract void updateIterator() throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name seems to indicate that this is something that compares IteratorSuppliers, when in fact it is something that is a comparator that also supplies iterators. I'm not sure I understand yet where it fits, but given that, a better name might be IterableComparator?
return; | ||
} | ||
|
||
final byte[] maxValueAsBytes = reverse == false ? new byte[Long.BYTES] : hasTopValue ? new byte[Long.BYTES]: null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this initialization into the constructor, or is this not shareable and must be local storage? I think we call updateIterator in collect() right? If we can avoid object creation in an inner loop, that would be good. We could create both arrays unconditionally I think and set a boolean here to be used below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have run some benchmarking using
Here the main point of comparison is wikimedium1m
wikimedium10m
For wikimedium1m TermDTSort using |
That 3x speedup is very nice! My experience with these benchmarks is they can be pretty noisy, maybe accounting for the regressions? I tend to increase comp.taskRepeatCount = 500. I'd also be interested to see how this optimization fares for higher values of topN - I think the default is 10, but you can edit in benchUtil.py. You did not sort the index right (eg: comp.newIndex('baseline', sourceData, facets=facets, indexSort='lastModNDV:long', addDVFields=True)? It would be interesting to see if this has the same impact for sorted index, large N, especially running with an executor (.competitor(...concurrentSearchers = True ). |
Update: these are wrong results. Please disregard them @msokolov Thank for suggesting additional benchmarks that we can use. First I will repeat the results from the previous round of benchmarking: topN=10, taskRepeatCount = 20, concurrentSearchers = False
topN=10, taskRepeatCount = 500, concurrentSearchers = False
This seemed to speed up all operations, and here the speedups for topN=500, taskRepeatCount = 20, concurrentSearchers = False
With increased topN=10, taskRepeatCount = 20, concurrentSearchers = True
With the concurrent searchers the speedups are also smaller up to 2x. This is expected as now segments are spread between several TopFieldCollects/Comparators and they don't exchange bottom values. As a follow-up on this PR, we can think how we can have a global bottom value similar how with indexSort='lastModNDV:long' topN=10, taskRepeatCount = 20, concurrentSearchers = False
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very compelling! I think you've addressed most of the outstanding comments: it seems like only the question about when to update the iterator remains (descussion below about moving it to setBottom). I'm not too concerned either way.
private int maxDoc; | ||
private int maxDocVisited; | ||
private int updateCounter = 0; | ||
private byte[] cmaxValueAsBytes = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these be final, and allocated only in the constructor? I think it might be clearer to add a boolean "hasTopValues" and set that in setTopValue, rather than use the existence of these byte[]? Then you could make these final and eliminate the local variables where they get copied below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Um I approved and then realized - there is still the mystery of the regressions you observed in the non-optimized cases. I think we should try to understand where that is coming from before committing this?
Ensure optimized sort works as expected (as long sort) of a field that is not indexed with points.
@msokolov Thank you for an additional review. I realized I ran benchmarks incorrectly, not indexing documents with docValues. Sorry, I am still learning lucene benchmarking tool. Please disregard the previous benchmarking results, I will be rerunning them. |
@mayya-sharipova sounds good - I'd also encourage you to post a PR with your modifications to luceneutil |
@msokolov Sorry again for reporting incorrect benchmarking results. Below are are my latest results, and I feel quite confident in their correctness. First about the benchmarking setup.
public class LongDocValuesPointSortField extends SortField {
public LongDocValuesPointSortField(String field) {
super(field, SortField.Type.LONG);
}
public LongDocValuesPointSortField(String field, boolean reverse) {
super(field, SortField.Type.LONG, reverse);
}
} So basically I was benchmarking a traditional long sort VS a long sort using a new field wikimedium10m: 10 millon docs, up to 2x speedups
wikimediumall: about 33 million docs, up to 3.5 x speedups
Please let me know if these results and methodology make sense. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API looks good to me: the additional ScoreMode
enum constants and the new LeafFieldComparator#iterator
method. I wonder whether we could make it easier to write implementations. I haven't spent much time thinking about it, but for instance would it be possible to wrap existing comparators to add the skipping functionality? Alternatively we could add the skipping logic to the existing comparators, but the fact that Lucene doesn't require that the same data be stored in indexes and doc values makes me a bit nervous about enabling it by default, and I'd like to avoid adding a new constructor argument.
lucene/core/src/java/org/apache/lucene/search/ConstantScoreQuery.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/search/LeafCollector.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/search/LeafCollector.java
Outdated
Show resolved
Hide resolved
@@ -93,4 +93,11 @@ | |||
*/ | |||
void collect(int doc) throws IOException; | |||
|
|||
/* | |||
* optionally returns an iterator over competitive documents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you document that the default is to return null
which Lucene interprets as the collector doesn't filter any documents. It's probably worth making explicit as null iterators are elsewhere interpreted as matching no documents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jpountz
It's probably worth making explicit as null iterators are elsewhere interpreted as matching no documents
What is the way to make this explicit?
@jpountz Thank you for the review.
Would it make sense for each numeric FieldComparator to add an extra class that would wrap a numeric comparator and provide additional methods for skipping logic (getting an iterator and updating an iterator)? |
Add a decorator for FieldComparatori to add a functionality to skip over non-competitive docs
@jpountz What do you think of this design in eeb23c1?
In this case, we would not need other classes that I previously introduced |
I like the idea of wrapping things up, and I think we may be able to take this further by pushing more of the logic into the comparator:
|
Sort optimization introduced in apache/lucene-solr#1351 depends on numeric fields being indexed both as doc_values and points. This PR does the following: - add a LongPoint field – lastModLP, last modified timestamp - add an IntPoint field – dayOfYearIP, day of the year of the last modified timestamp - add sort on the last modified timestamp to wikimedium.10M.nostopwords.tasks - don't fail a task if hitCounts don't match in benchUtil.py. As we don't collect all hits in the optimized runs, we don't expect hits total to match.
Backport for: LUCENE-9280: Collectors to skip noncompetitive documents (apache#1351) Similar how scorers can update their iterators to skip non-competitive documents, collectors and comparators should also provide and update iterators that allow them to skip non-competive documents. To enable sort optimization for numeric sort fields, the following needs to be done: 1) the field should be indexed with both doc_values and points, that must have the same field name and same data 2) SortField#setCanSkipNonCompetitiveDocs must be set 3) totalHitsThreshold should not be set to max value.
Backport for: LUCENE-9280: Collectors to skip noncompetitive documents (#1351) Similar how scorers can update their iterators to skip non-competitive documents, collectors and comparators should also provide and update iterators that allow them to skip non-competive documents. To enable sort optimization for numeric sort fields, the following needs to be done: 1) the field should be indexed with both doc_values and points, that must have the same field name and same data 2) SortField#setCanUsePoints must be set 3) totalHitsThreshold should not be set to max value.
Backport for: LUCENE-9280: Collectors to skip noncompetitive documents (apache#1351) Similar how scorers can update their iterators to skip non-competitive documents, collectors and comparators should also provide and update iterators that allow them to skip non-competive documents. To enable sort optimization for numeric sort fields, the following needs to be done: 1) the field should be indexed with both doc_values and points, that must have the same field name and same data 2) SortField#setCanUsePoints must be set 3) totalHitsThreshold should not be set to max value.
PR apache#1351 introduced a sort optimization where documents can be skipped. But iteration over competitive iterators was not properly organized, as they were not storing the current docID, and when competitive iterator was updated the current doc ID was lost. This patch fixed it. Relates to apache#1351
Sort optimization introduced in apache/lucene-solr#1351 depends on numeric fields being indexed both as doc_values and points. This PR does the following: - add a LongPoint field – lastModLP, last modified timestamp - add an IntPoint field – dayOfYearIP, day of the year of the last modified timestamp - add sort on the last modified timestamp to wikimedium.10M.nostopwords.tasks If we make a comparison with a run where sort optimization is not enabled, as hits count may differ for a task not to fail, `competition` in `localrun.py` should be modified to: ``` comp = competition.Competition(verifyCounts=False) ```
Sort optimization introduced in apache/lucene-solr#1351 depends on numeric fields being indexed both as doc_values and points. This PR does the following: - add a LongPoint field – lastModLP, last modified timestamp - add an IntPoint field – dayOfYearIP, day of the year of the last modified timestamp - add sort on the last modified timestamp to wikimedium.10M.nostopwords.tasks - don't fail a task if hitCounts don't match in benchUtil.py. As we don't collect all hits in the optimized runs, we don't expect hits total to match.
Currently, if search sort is equal to index sort, we have an early termination in TopFieldCollector. As we work to enhance comparators to provide skipping functionality (PR apache#1351), we would like to move this termination functionality on index sort from TopFieldCollector to comparators. This patch does the following: - Add method usesIndexSort to LeafFieldComparator - Make numeric comparators aware of index sort and early terminate on collecting all competitive hits - Move TermValComparator and TermOrdValComparator from FieldComparator to comparator package, for all comparators to be in the same package - Enhance TermValComparator to provide skipping functionality when index is sorted One item left for TODO for a following PR is to remove the logic of early termniation from TopFieldCollector. We can do that once we ensure that all BulkScorers are using iterators from collectors than can skip non-competitive docs. Relates to apache#1351
Currently, if search sort is equal to index sort, we have an early termination in TopFieldCollector. As we work to enhance comparators to provide skipping functionality (PR apache#1351), we would like to move this termination functionality on index sort from TopFieldCollector to comparators. This patch does the following: - Add method usesIndexSort to LeafFieldComparator - Make numeric comparators aware of index sort and early terminate on collecting all competitive hits - Move TermValComparator and TermOrdValComparator from FieldComparator to comparator package, for all comparators to be in the same package - Enhance TermOrdValComparator to provide skipping functionality when index is sorted One item left for TODO for a following PR is to remove the logic of early termination from TopFieldCollector. We can do that once we ensure that all BulkScorers are using iterators from collectors that can skip non-competitive docs. Relates to apache#1351
Disable sort optimization in comparators on index sort. Currently, if search sort is equal or a part of the index sort, we have an early termination in TopFieldCollector. But comparators are not aware of the index sort, and may run sort optimization even if the search sort is congruent with the index sort. This patch: - make leaf comparators aware that search sort is congruent with the index sort. - disables sort optimization in comparators in this case. - removes a private MultiComparatorLeafCollector class as the only class that extended that class was TopFieldLeafCollector that now incorporates the logic of the deleted class. Relates to apache#1351
Disable sort optimization in comparators on index sort. Currently, if search sort is equal or a part of the index sort, we have an early termination in TopFieldCollector. But comparators are not aware of the index sort, and may run sort optimization even if the search sort is congruent with the index sort. This patch: - adds `disableSkipping` method to `FieldComparator`, This method is called by `TopFieldCollector`, and currently called when the search sort is congruent with the index sort, but more conditions can be added. - disables sort optimization in comparators in this case. - removes a private `MultiComparatorLeafCollector` class, because the only class that extends `MultiComparatorLeafCollector` was `TopFieldLeafCollector`. The logic of the deleted `TopFieldLeafCollector` is added to `TopFieldLeafCollector`. Relates to #1351
Disable sort optimization in comparators on index sort. Currently, if search sort is equal or a part of the index sort, we have an early termination in TopFieldCollector. But comparators are not aware of the index sort, and may run sort optimization even if the search sort is congruent with the index sort. This patch: - adds `disableSkipping` method to `FieldComparator`, This method is called by `TopFieldCollector`, when the search sort is congruent with the index sort. It is also called when we can't use points for sort optimization. - disables sort optimization in comparators in these cases. Relates to #1351 Backport for #2075
Disable sort optimization in comparators on index sort. Currently, if search sort is equal or a part of the index sort, we have an early termination in TopFieldCollector. But comparators are not aware of the index sort, and may run sort optimization even if the search sort is congruent with the index sort. This patch: - adds `disableSkipping` method to `FieldComparator`, This method is called by `TopFieldCollector`, and currently called when the search sort is congruent with the index sort, but more conditions can be added. - disables sort optimization in comparators in this case. - removes a private `MultiComparatorLeafCollector` class, because the only class that extends `MultiComparatorLeafCollector` was `TopFieldLeafCollector`. The logic of the deleted `TopFieldLeafCollector` is added to `TopFieldLeafCollector`. Relates to apache#1351
Disable sort optimization in comparators on index sort. Currently, if search sort is equal or a part of the index sort, we have an early termination in TopFieldCollector. But comparators are not aware of the index sort, and may run sort optimization even if the search sort is congruent with the index sort. This patch: - adds `disableSkipping` method to `FieldComparator`, This method is called by `TopFieldCollector`, and currently called when the search sort is congruent with the index sort, but more conditions can be added. - disables sort optimization in comparators in this case. - removes a private `MultiComparatorLeafCollector` class, because the only class that extends `MultiComparatorLeafCollector` was `TopFieldLeafCollector`. The logic of the deleted `TopFieldLeafCollector` is added to `TopFieldLeafCollector`. Relates to apache#1351
Similar how scorers can update their iterators to skip non-competitive
documents, collectors and comparators should also provide and update
iterators that allow them to skip non-competive documents
This could be useful if we want to sort by some field.