LUCENE-9280: Collectors to skip noncompetitive documents #1351

mayya-sharipova · 2020-03-13T22:19:01Z

Similar to how scorers can update their iterators to skip non-competitive
documents, collectors and comparators should also provide and update
iterators that allow them to skip non-competitive documents.

This could be useful if we want to sort by some field.

mayya-sharipova · 2020-03-13T22:20:16Z

@jimczi I have created a draft PR for comparators and collectors to skip non-competitive docs. Can you please have a look at it and see whether we are happy with this approach?

jimczi

It's a great start @mayya-sharipova ! I left some comments to make this change less invasive but I really like the simplicity of the new long sort field.
It could also be nice to run some early benchmarks with luceneutil to show how useful this change can be for numeric sort ?

jimczi · 2020-03-17T11:48:35Z

lucene/core/src/java/org/apache/lucene/search/FieldComparator.java

+
+  public static abstract class IteratorSupplierComparator<T> extends FieldComparator<T> implements LeafFieldComparator {
+    abstract DocIdSetIterator iterator();
+    abstract void updateIterator() throws IOException;


Why do we need this ? We could update the iterator every time a bottom value is set ?

Indeed it is more straightforward to just update an iterator in setBottom function of a comparator.

But I was thinking it is better to have a special function for two reasons:

1. After updating an iterator, in TopFieldCollector we need to change
totalHitsRelation = TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO;

2. We also need to check hitsThresholdChecker.isThresholdReached(), and passing the not strictly related hitsThresholdChecker object to a comparator's constructor doesn't look nice to me.

Please let me know if you think otherwise

For 1. we could set the totalHitsRelation when we reach the total hits threshold in the TOP_DOCS mode ?
For 2. I wonder if we could pass the hitsThresholdChecker to the LeafFieldComparator like we do for the scorer ?
This way we can update the iterator internally when a new bottom is set or when compareBottom is called ?

The name seems to indicate that this is something that compares IteratorSuppliers, when in fact it is something that is a comparator that also supplies iterators. I'm not sure I understand yet where it fits, but given that, a better name might be IterableComparator?

@msokolov Thanks for the suggestion, naming is tough, addressed in 95e1bc1.

jimczi · 2020-03-17T11:52:04Z

lucene/core/src/java/org/apache/lucene/search/LongDocValuesPointComparator.java

+                return PointValues.Relation.CELL_CROSSES_QUERY;
+            }
+        };
+        pointValues.intersect(visitor);


We should update the iterator only if doing so lets us skip "lots" of documents; in the distance feature query we set the threshold to an 8x reduction.

Addressed in 6384b15
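
A minimal sketch of the kind of check requested above, assuming the comparator can obtain an estimate of how many documents the narrowed point range would match and the cost of the iterator currently in use. The class and method names are hypothetical and not part of Lucene; only the 8x factor comes from the review comment.

```java
// Hypothetical helper for illustration; not part of Lucene.
final class IteratorUpdatePolicy {
  private IteratorUpdatePolicy() {}

  /**
   * Returns true only when the narrowed range is estimated to match fewer
   * than 1/8 of the documents the current iterator would visit.
   */
  static boolean worthUpdating(long estimatedCompetitiveDocs, long currentIteratorCost) {
    long threshold = currentIteratorCost >>> 3; // one eighth of the current cost
    return estimatedCompetitiveDocs < threshold;
  }
}
```

With a current cost of 10,000 documents the threshold is 1,250, so an estimate of 1,000 would trigger an update while an estimate of 2,000 would not.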

jimczi · 2020-03-17T11:53:31Z

lucene/core/src/java/org/apache/lucene/search/LongDocValuesPointComparator.java

+        return iterator;
+    }
+
+    public void updateIterator() throws IOException {


We should throttle the checks here (if the bottom value changes frequently). In the distance feature query we start throttling after 256 calls, we should replicate here ?

Addressed in 6384b15
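
One way such throttling could look, as a sketch only. The 256-call figure comes from the comment above; sampling every 32nd call afterwards is an assumption made for illustration, and the class is hypothetical rather than code from the PR.

```java
// Hypothetical helper for illustration; not part of Lucene.
final class UpdateThrottle {
  private static final int FREE_CALLS = 256;   // figure taken from the review comment
  private static final int SAMPLE_MASK = 0x1f; // assumption: afterwards, act on every 32nd call

  private long calls = 0;

  /** Call once per bottom-value change; returns true when the expensive check should run. */
  boolean shouldCheck() {
    calls++;
    if (calls <= FREE_CALLS) {
      return true; // early on, every change is worth examining
    }
    return (calls & SAMPLE_MASK) == SAMPLE_MASK;
  }
}
```

A comparator would call shouldCheck() at the top of its update method and return immediately when it yields false.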

jimczi · 2020-03-17T11:55:21Z

lucene/core/src/java/org/apache/lucene/search/LongDocValuesPointComparator.java

+
+    @Override
+    public void setBottom(int slot) {
+        this.bottom = values[slot];


Can you update the iterator here ? We would need to check the total hits threshold so maybe pass the HitsThresholdChecker in the ctr somehow ?

jimczi · 2020-03-17T11:56:38Z

lucene/core/src/java/org/apache/lucene/search/LongDocValuesPointComparator.java

+        } else {
+            LongPoint.encodeDimension(bottom, minValueAsBytes, 0);
+        };
+


you should also take the topValue into account here (searchAfter) ?

Addressed in 6384b15

lucene/core/src/java/org/apache/lucene/search/Weight.java

jimczi

I could not think of any clever way to do this in IndexSearcher, I would appreciate your help if you can suggest any such way. I just redesigned DefaultBulkScorer to use a conjunction of a scorer's and collector's iterators.

I left some comments regarding the refactor but I like it better. I think you're right, the bulk scorer is a good entry point to handle the leaf collector iterator.

jimczi · 2020-03-18T22:19:48Z

lucene/core/src/java/org/apache/lucene/search/Weight.java

+      }
+    }
+
+    // conjunction iterator between scorer's iterator and collector's iterator


you can replace this with ConjunctionDISI#intersectIterators ?
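
For readers unfamiliar with what a conjunction of two iterators does, here is a simplified model of leapfrog intersection. It is not ConjunctionDISI itself: SortedDocs and intersect are hypothetical stand-ins over sorted int arrays, written only to show how the scorer's and the collector's iterators would be advanced against each other.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for illustration; these are not Lucene classes.
final class ConjunctionSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Minimal doc-ID iterator over a sorted array. */
  static final class SortedDocs {
    private final int[] docs;
    private int idx = -1;

    SortedDocs(int... docs) {
      this.docs = docs;
    }

    int docID() {
      if (idx < 0) {
        return -1; // not positioned yet
      }
      return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }

    /** Moves past the current position to the first doc >= target. */
    int advance(int target) {
      do {
        idx++;
      } while (idx < docs.length && docs[idx] < target);
      return docID();
    }
  }

  /** Collects the doc IDs present in both iterators by leapfrogging. */
  static List<Integer> intersect(SortedDocs a, SortedDocs b) {
    List<Integer> out = new ArrayList<>();
    int doc = a.advance(0);
    while (doc != NO_MORE_DOCS) {
      // Bring b up to a's position if it is behind.
      int other = b.docID() < doc ? b.advance(doc) : b.docID();
      if (other == doc) {
        out.add(doc);
        doc = a.advance(doc + 1);
      } else {
        // b is ahead of a, so a jumps to b's position.
        doc = a.advance(other);
      }
    }
    return out;
  }
}
```

Intersecting {1, 3, 5, 7, 9} with {3, 4, 7, 10} visits only the documents both sides agree on, which is how a restrictive collector iterator lets the scorer skip ahead.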

jimczi · 2020-03-18T22:30:31Z

lucene/core/src/java/org/apache/lucene/search/Weight.java

-      if (twoPhase == null) {
-        while (currentDoc < end) {
-          if (acceptDocs == null || acceptDocs.get(currentDoc)) {
-            collector.collect(currentDoc);
-          }
-          currentDoc = iterator.nextDoc();
-        }
-        return currentDoc;
-      } else {
-        final DocIdSetIterator approximation = twoPhase.approximation();
-        while (currentDoc < end) {
-          if ((acceptDocs == null || acceptDocs.get(currentDoc)) && twoPhase.matches()) {
-            collector.collect(currentDoc);
-          }
-          currentDoc = approximation.nextDoc();
+      while (currentDoc < end) {
+        if ((acceptDocs == null || acceptDocs.get(currentDoc)) && (twoPhase == null || twoPhase.matches())) {
+          collector.collect(currentDoc);
        }
-        return currentDoc;
+        currentDoc = iterator.nextDoc();
      }
+      return currentDoc;


this change is not required ? I see hotspot in the javadoc comment above so we shouldn't touch it if it's not required ;).

Addressed in d732d7e

jimczi · 2020-03-18T22:31:16Z

lucene/core/src/java/org/apache/lucene/search/Weight.java

-            doc = iterator.advance(min);
-          } else {
-            doc = twoPhase.approximation().advance(min);
+        if (doc < min) scorerIterator.advance(min);


Suggested change

-        if (doc < min) scorerIterator.advance(min);
+        if (doc < min) {
+          doc = combinedIterator.advance(min);
+        }

?

Addressed in d732d7e

mayya-sharipova · 2020-03-19T20:11:16Z

lucene/core/src/java/org/apache/lucene/search/LongDocValuesPointComparator.java

+        this.bottom = values[slot];
+        // can't use hitsThresholdChecker.isThresholdReached() as it uses > numHits,
+        // while we want to update iterator as soon as threshold reaches numHits
+        if (hitsThresholdChecker != null && (hitsThresholdChecker.getHitsThreshold() >= numHits)) {


@jimczi I am not very happy about this change, for two reasons:

1. We can't use hitsThresholdChecker.isThresholdReached as it checks for greater than numHits, but we need to check starting with equal, because if there are no competitive docs later, setBottom will not be called.
Do you know the reason why hitsThresholdChecker.isThresholdReached checks for greater than numHits and not greater than or equal to numHits?

2. totalHitsRelation may not end up being set to TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO, as we set it only when we have later competitive hits.

I think it is better to keep the previous implementation with a dedicated updateIterator function called from TopFieldCollector. WDYT?

msokolov

I'm a little fuzzy on my understanding of how you are making use of Points here, but I left a few micro-comments. I'll echo Jim's comment; it'd be great to see some results from luceneutil (or any reproducible benchmark) demonstrating this new idea.

msokolov · 2020-03-22T18:50:56Z

lucene/core/src/java/org/apache/lucene/search/FieldComparator.java

+
+  public static abstract class IteratorSupplierComparator<T> extends FieldComparator<T> implements LeafFieldComparator {
+    abstract DocIdSetIterator iterator();
+    abstract void updateIterator() throws IOException;


The name seems to indicate that this is something that compares IteratorSuppliers, when in fact it is something that is a comparator that also supplies iterators. I'm not sure I understand yet where it fits, but given that, a better name might be IterableComparator?

msokolov · 2020-03-22T18:52:52Z

lucene/core/src/java/org/apache/lucene/search/LongDocValuesPointComparator.java

+            return;
+        }
+
+        final byte[] maxValueAsBytes = reverse == false ? new byte[Long.BYTES] : hasTopValue ? new byte[Long.BYTES]: null;


Can we move this initialization into the constructor, or is this not shareable and must be local storage? I think we call updateIterator in collect() right? If we can avoid object creation in an inner loop, that would be good. We could create both arrays unconditionally I think and set a boolean here to be used below?

@msokolov Thanks for the suggestion, indeed these values can be initialized in the comparator's constructor. As each topfieldcollector has its own comparator and processes segments sequentially, these values should be shareable. Addressed in 95e1bc1

mayya-sharipova · 2020-03-26T01:17:55Z

I have run some benchmarking using luceneutil.
As the new sort optimization uses a new LongDocValuesPointSortField that is not present in luceneutil, I had to hack luceneutil as follows:

- I added a sort task on a long field, TermDateTimeSort, to wikimedium.1M.nostopwords.tasks. This task was present in wikinightly.tasks, but was not available for the wikimedium 1M and 10M tasks.
- I indexed the corresponding field lastModNDV as LongPoint as well. It was only indexed as NumericDocValuesField before, but for the sort optimization we need long values to be indexed both as docValues and as points.
- I modified SearchTask.java to have TopFieldCollector with totalHitsThreshold set to topK: final TopFieldCollector c = TopFieldCollector.create(s, topN, null, topN); Sort optimization only works when we set the total hits threshold.
- For the patch version, I modified the sort in TaskParser.java. Instead of lastModNDVSort = new Sort(new SortField("lastModNDV", SortField.Type.LONG)); I used the optimized sort: lastModNDVSort = new Sort(new LongDocValuesPointSortField("lastModNDV"));

Here the main point of comparison is TermDTSort, as it is the only sort on a long field. The other sorts are included to show whether they regress.

wikimedium1m

Task	baseline QPS	StdDev	my_modified_version QPS	StdDev
TermDTSort	507.20	(11.2%)	550.02	(16.1%)
HighTermMonthSort	550.06	(10.4%)	443.69	(16.1%)
HighTermDayOfYearSort	105.62	(24.9%)	91.93	(22.1%)

wikimedium10m

Task	baseline QPS	StdDev	my_modified_version QPS	StdDev
TermDTSort	147.64	(11.5%)	547.80	(6.6%)
HighTermMonthSort	147.85	(12.2%)	239.28	(7.3%)
HighTermDayOfYearSort	74.44	(7.7%)	42.56	(12.1%)

For wikimedium1m, using LongDocValuesPointSortField for TermDTSort doesn't seem to have much effect, probably because segments in this index are smaller and the optimization was skipped entirely on them.
For wikimedium10m, using LongDocValuesPointSortField for TermDTSort instead of the usual SortField.Type.LONG brings about a 3x speedup.
There are some regressions and speedups for the HighTermMonthSort and HighTermDayOfYearSort tasks; I don't know the reason, as they should not be affected.

msokolov · 2020-03-26T21:01:33Z

That 3x speedup is very nice! My experience with these benchmarks is that they can be pretty noisy, which may account for the regressions. I tend to increase the repeat count, e.g. comp.taskRepeatCount = 500. I'd also be interested to see how this optimization fares for higher values of topN; I think the default is 10, but you can edit it in benchUtil.py. You did not sort the index, right (e.g. comp.newIndex('baseline', sourceData, facets=facets, indexSort='lastModNDV:long', addDVFields=True))? It would be interesting to see if this has the same impact for a sorted index and large N, especially running with an executor (.competitor(...concurrentSearchers = True )).

mayya-sharipova · 2020-03-27T21:42:53Z

Update: these are wrong results. Please disregard them

@msokolov Thank you for suggesting additional benchmarks that we can use.
Below are the results on the dataset wikimedium10m.

First I will repeat the results from the previous round of benchmarking:

topN=10, taskRepeatCount = 20, concurrentSearchers = False

Task	baseline QPS	StdDev	my_modified_version QPS	StdDev
TermDTSort	147.64	(11.5%)	547.80	(6.6%)
HighTermMonthSort	147.85	(12.2%)	239.28	(7.3%)
HighTermDayOfYearSort	74.44	(7.7%)	42.56	(12.1%)

topN=10, taskRepeatCount = 500, concurrentSearchers = False

Task	baseline QPS	StdDev	my_modified_version QPS	StdDev
TermDTSort	184.60	(8.2%)	3046.19	(4.4%)
HighTermMonthSort	209.43	(6.5%)	253.90	(10.5%)
HighTermDayOfYearSort	130.97	(5.8%)	73.25	(11.8%)

This seemed to speed up all operations, and here the speedup for TermDTSort is even bigger: 16.5x. There also seems to be more regression for HighTermDayOfYearSort.

topN=500, taskRepeatCount = 20, concurrentSearchers = False

Task	baseline QPS	StdDev	my_modified_version QPS	StdDev
TermDTSort	210.24	(9.7%)	537.65	(6.7%)
HighTermMonthSort	116.02	(8.9%)	189.96	(13.5%)
HighTermDayOfYearSort	42.33	(7.6%)	67.93	(9.3%)

With increased topN the sort optimization yields a smaller speedup, up to 2x. This is expected, as the optimization can only kick in after topN docs have been collected.

topN=10, taskRepeatCount = 20, concurrentSearchers = True

Task	baseline QPS	StdDev	my_modified_version QPS	StdDev
TermDTSort	132.09	(14.3%)	287.93	(11.8%)
HighTermMonthSort	211.01	(12.2%)	116.46	(7.1%)
HighTermDayOfYearSort	72.28	(6.1%)	68.21	(11.4%)

With concurrent searchers the speedups are also smaller, up to 2x. This is expected, as segments are now spread between several TopFieldCollectors/comparators and they don't exchange bottom values. As a follow-up to this PR, we can think about how to have a global bottom value, similar to how MaxScoreAccumulator is used to set up a global competitive min score.
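
The follow-up idea of a shared bottom value could be sketched roughly as below. This is a hypothetical illustration, not an existing Lucene class: it assumes an ascending sort on long values and that each collector publishes its bottom only once its queue is full.

```java
import java.util.concurrent.atomic.LongAccumulator;

// Hypothetical sketch of the follow-up idea; not an existing Lucene class.
final class GlobalBottomSketch {
  // Ascending sort assumed: smaller values are more competitive, so the
  // tightest bound is the minimum of all published local bottoms.
  private final LongAccumulator minBottom = new LongAccumulator(Math::min, Long.MAX_VALUE);

  /** Called by a collector whose queue is full, with its current worst retained value. */
  void publish(long localBottom) {
    minBottom.accumulate(localBottom);
  }

  /** A value strictly greater than the shared bound cannot enter the global top N. */
  boolean isCompetitive(long value) {
    return value <= minBottom.get();
  }
}
```

The minimum is a safe bound because a collector with a full queue already holds topN documents at or below its own bottom, so any value above that bottom cannot reach the merged top N.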

with indexSort='lastModNDV:long' topN=10, taskRepeatCount = 20, concurrentSearchers = False

Task	baseline QPS	StdDev	my_modified_version QPS	StdDev
TermDTSort	321.75	(11.5%)	364.83	(7.8%)
HighTermMonthSort	205.20	(5.7%)	178.16	(7.8%)
HighTermDayOfYearSort	66.07	(12.0%)	58.84	(9.3%)

msokolov

This is very compelling! I think you've addressed most of the outstanding comments: it seems like only the question about when to update the iterator remains (discussion below about moving it to setBottom). I'm not too concerned either way.

msokolov · 2020-03-29T17:45:11Z

lucene/core/src/java/org/apache/lucene/search/LongDocValuesPointComparator.java

+    private int maxDoc;
+    private int maxDocVisited;
+    private int updateCounter = 0;
+    private byte[] cmaxValueAsBytes = null;


Can these be final, and allocated only in the constructor? I think it might be clearer to add a boolean "hasTopValues" and set that in setTopValue, rather than use the existence of these byte[]? Then you could make these final and eliminate the local variables where they get copied below

msokolov

Um I approved and then realized - there is still the mystery of the regressions you observed in the non-optimized cases. I think we should try to understand where that is coming from before committing this?

mayya-sharipova · 2020-03-30T14:43:35Z

@msokolov Thank you for the additional review. I realized I ran the benchmarks incorrectly, not indexing documents with docValues. Sorry, I am still learning the Lucene benchmarking tool. Please disregard the previous benchmarking results; I will rerun them.

msokolov · 2020-03-30T14:52:08Z

@mayya-sharipova sounds good - I'd also encourage you to post a PR with your modifications to luceneutil

mayya-sharipova · 2020-03-30T21:24:19Z

@msokolov Sorry again for reporting incorrect benchmarking results. Below are my latest results, and I feel quite confident in their correctness.

First about the benchmarking setup.

Here are the changes made to luceneutil
- The patch folder is checked out as this PR.
- The trunk folder is checked out as this PR as well, with a modification. As there is no LongDocValuesPointSortField in master, I can't benchmark sorting using this field on master. What I did in the trunk folder is delegate sorting to the traditional sorting on a long field, like this:

public class LongDocValuesPointSortField extends SortField {
    public LongDocValuesPointSortField(String field) {
        super(field, SortField.Type.LONG);
    }
    public LongDocValuesPointSortField(String field, boolean reverse) {
        super(field, SortField.Type.LONG, reverse);
    }
}

So basically I was benchmarking a traditional long sort VS a long sort using a new field LongDocValuesPointSortField.

wikimedium10m: 10 million docs, up to 2x speedups

 TaskQPS                     baseline   StdDevQPS     patch     StdDev    Pct diff
             TermDTSort       64.53      (6.4%)      155.29     (42.3%)  140.7% (  86% -  202%)
  HighTermDayOfYearSort       47.63      (5.4%)       50.47      (6.8%)    6.0% (  -5% -   19%)
       HighTermMonthSort      110.07     (7.3%)      121.13      (6.8%)   10.0% (  -3% -   26%)
WARNING: cat=TermDTSort: hit counts differ: 754451 vs 1669+

wikimediumall: about 33 million docs, up to 3.5 x speedups

 TaskQPS                     baseline   StdDevQPS     patch     StdDev    Pct diff
              TermDTSort       28.96      (4.3%)      108.45     (56.9%)  274.5% ( 204% -  350%)
   HighTermDayOfYearSort        9.69      (5.1%)        9.56      (6.1%)   -1.3% ( -11% -   10%)
       HighTermMonthSort       39.41      (4.7%)       47.99     (10.0%)   21.8% (   6% -   38%)
WARNING: cat=TermDTSort: hit counts differ: 1474717 vs 1070+

Please let me know if these results and methodology make sense.

jpountz

The API looks good to me: the additional ScoreMode enum constants and the new LeafFieldComparator#iterator method. I wonder whether we could make it easier to write implementations. I haven't spent much time thinking about it, but for instance would it be possible to wrap existing comparators to add the skipping functionality? Alternatively we could add the skipping logic to the existing comparators, but the fact that Lucene doesn't require that the same data be stored in indexes and doc values makes me a bit nervous about enabling it by default, and I'd like to avoid adding a new constructor argument.

lucene/core/src/java/org/apache/lucene/search/ConstantScoreQuery.java

lucene/core/src/java/org/apache/lucene/search/LeafCollector.java

jpountz · 2020-03-31T08:42:44Z

lucene/core/src/java/org/apache/lucene/search/LeafCollector.java

@@ -93,4 +93,11 @@
   */
  void collect(int doc) throws IOException;

+  /*
+   * optionally returns an iterator over competitive documents


Can you document that the default is to return null which Lucene interprets as the collector doesn't filter any documents. It's probably worth making explicit as null iterators are elsewhere interpreted as matching no documents.

Thanks @jpountz

It's probably worth making explicit as null iterators are elsewhere interpreted as matching no documents

What is the way to make this explicit?
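
One way to make this explicit is to spell it out in the Javadoc of the method itself. The sketch below uses simplified stand-in interfaces rather than the real org.apache.lucene.search types, and the method name competitiveIterator is an assumption made for illustration.

```java
import java.io.IOException;

// Simplified stand-in for a doc-ID iterator; not the Lucene class.
interface DocIterator {
  int nextDoc() throws IOException;
}

// Simplified stand-in for LeafCollector; the new method's name is an assumption.
interface LeafCollectorSketch {
  void collect(int doc) throws IOException;

  /**
   * Optionally returns an iterator over competitive documents.
   *
   * <p>The default is to return {@code null}, which means that this collector
   * does not filter any documents: every document matched by the query is
   * passed to {@link #collect(int)}. Note that this differs from other places
   * where a null iterator is interpreted as matching no documents.
   */
  default DocIterator competitiveIterator() throws IOException {
    return null;
  }
}
```

A caller would then treat a null return as "use the scorer's iterator unchanged" rather than as an empty result.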

mayya-sharipova · 2020-03-31T21:35:56Z

@jpountz Thank you for the review.

I wonder whether we could make it easier to write implementations. I haven't spent much time thinking about it, but for instance would it be possible to wrap existing comparators to add the skipping functionality? Alternatively we could add the skipping logic to the existing comparators, but the fact that Lucene doesn't require that the same data be stored in indexes and doc values makes me a bit nervous about enabling it by default, and I'd like to avoid adding a new constructor argument.

Would it make sense for each numeric FieldComparator to add an extra class that would wrap a numeric comparator and provide additional methods for skipping logic (getting an iterator and updating an iterator)?

mayya-sharipova · 2020-04-02T19:28:35Z

@jpountz What do you think of this design in eeb23c1?

IterableFieldComparator wraps a FieldComparator to provide skipping functionality. All numeric comparators are wrapped in corresponding iterable comparators.
SortField has a new method allowSkipNonCompetitveDocs that, if set, will use a comparator that provides skipping functionality.

In this case, we would not need the other classes that I previously introduced, LongDocValuesPointComparator and LongDocValuesPointSortField.
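
The wrapping idea can be illustrated with a small decorator. This is a heavily simplified, hypothetical model and not the IterableFieldComparator from eeb23c1: it assumes an ascending sort on long values and reduces the comparator to two methods.

```java
// Hypothetical, heavily simplified model; not the classes from the PR.
interface LongComparatorSketch {
  void setBottom(long bottom);

  int compareBottom(long value);
}

/** A plain comparator with no skipping logic. */
final class PlainLongComparator implements LongComparatorSketch {
  private long bottom;

  @Override
  public void setBottom(long bottom) {
    this.bottom = bottom;
  }

  @Override
  public int compareBottom(long value) {
    return Long.compare(bottom, value);
  }
}

/** Decorator: delegates all comparisons and additionally tracks the competitive bound. */
final class SkippingLongComparator implements LongComparatorSketch {
  private final LongComparatorSketch delegate;
  private long maxCompetitive = Long.MAX_VALUE; // ascending sort assumed

  SkippingLongComparator(LongComparatorSketch delegate) {
    this.delegate = delegate;
  }

  @Override
  public void setBottom(long bottom) {
    delegate.setBottom(bottom);
    maxCompetitive = bottom; // values above the bottom can no longer enter the queue
  }

  @Override
  public int compareBottom(long value) {
    return delegate.compareBottom(value);
  }

  /** Inclusive on purpose: ties are kept, which is conservative. */
  boolean isCompetitive(long value) {
    return value <= maxCompetitive;
  }
}
```

The existing comparator stays untouched, and the skipping state lives entirely in the wrapper, which is what makes it possible to opt in per sort field.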

romseygeek · 2020-04-03T13:37:26Z

I like the idea of wrapping things up, and I think we may be able to take this further by pushing more of the logic into the comparator:

add a wrapDocIdSetIterator(DocIdSetIterator in) method to LeafCollector that by default returns the passed-in iterator. This gets called in DefaultBulkScorer#score to wrap the iterator for a query.
add a wrapDocIdSetIterator(DocIdSetIterator in) method to FieldComparator that by default returns the passed-in iterator. TopFieldCollector delegates its wrapDocIdSetIterator method to this method on its first comparator. This allows us to completely contain the logic that combines a query's iterator with sorting shortcuts to the SortField and associated FieldComparator implementation.
Move the logic that checks whether or not to update the iterator into setBottom on the leaf comparator. I know this involves passing the HitsThresholdChecker into the leaf comparator constructor, but I think that's reasonable if the point of this API change is to make it possible for comparators to skip hits
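
The delegation described in these three points could look roughly like the sketch below. All types are simplified stand-ins written for illustration, not Lucene code; only the method name wrapDocIdSetIterator is taken from the suggestion, and the fixed predicate stands in for state that a real comparator would update in setBottom.

```java
import java.util.function.IntPredicate;

// Simplified stand-ins for illustration; not Lucene classes.
final class WrapSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  interface DocIdIterator {
    int nextDoc();
  }

  interface ComparatorSketch {
    /** Default: the comparator cannot skip anything, so the query's iterator is returned as-is. */
    default DocIdIterator wrapDocIdSetIterator(DocIdIterator in) {
      return in;
    }
  }

  /** Hypothetical comparator that knows which docs are still competitive. */
  static final class SkippingComparator implements ComparatorSketch {
    private final IntPredicate competitive;

    SkippingComparator(IntPredicate competitive) {
      this.competitive = competitive;
    }

    @Override
    public DocIdIterator wrapDocIdSetIterator(DocIdIterator in) {
      return () -> {
        int doc = in.nextDoc();
        while (doc != NO_MORE_DOCS && !competitive.test(doc)) {
          doc = in.nextDoc(); // skip non-competitive docs
        }
        return doc;
      };
    }
  }

  /** The collector only delegates to its first comparator. */
  static final class CollectorSketch {
    private final ComparatorSketch first;

    CollectorSketch(ComparatorSketch first) {
      this.first = first;
    }

    DocIdIterator wrapDocIdSetIterator(DocIdIterator in) {
      return first.wrapDocIdSetIterator(in);
    }
  }

  /** Iterator over a fixed sorted array, for demonstration. */
  static DocIdIterator over(int... docs) {
    int[] pos = {0};
    return () -> pos[0] < docs.length ? docs[pos[0]++] : NO_MORE_DOCS;
  }
}
```

With this shape the bulk scorer never needs to know whether skipping is in effect; it just iterates whatever the collector hands back.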

Sort optimization introduced in apache/lucene-solr#1351 depends on numeric fields being indexed both as doc_values and points. This PR does the following: - add a LongPoint field – lastModLP, last modified timestamp - add an IntPoint field – dayOfYearIP, day of the year of the last modified timestamp - add sort on the last modified timestamp to wikimedium.10M.nostopwords.tasks - don't fail a task if hitCounts don't match in benchUtil.py. As we don't collect all hits in the optimized runs, we don't expect hits total to match.

Backport for: LUCENE-9280: Collectors to skip noncompetitive documents (apache#1351) Similar to how scorers can update their iterators to skip non-competitive documents, collectors and comparators should also provide and update iterators that allow them to skip non-competitive documents. To enable sort optimization for numeric sort fields, the following needs to be done: 1) the field should be indexed with both doc_values and points, that must have the same field name and same data 2) SortField#setCanSkipNonCompetitiveDocs must be set 3) totalHitsThreshold should not be set to max value.

Backport for: LUCENE-9280: Collectors to skip noncompetitive documents (#1351) Similar to how scorers can update their iterators to skip non-competitive documents, collectors and comparators should also provide and update iterators that allow them to skip non-competitive documents. To enable sort optimization for numeric sort fields, the following needs to be done: 1) the field should be indexed with both doc_values and points, that must have the same field name and same data 2) SortField#setCanUsePoints must be set 3) totalHitsThreshold should not be set to max value.

PR #1351 introduced a sort optimization where documents can be skipped. But there was a bug when two-phase approximation was used, as we would advance it without advancing the overall conjunction iterator. This patch fixes it. Relates to #1351

PR apache#1351 introduced a sort optimization where documents can be skipped. But iteration over competitive iterators was not properly organized, as they were not storing the current docID, and when competitive iterator was updated the current doc ID was lost. This patch fixed it. Relates to apache#1351

Sort optimization introduced in apache/lucene-solr#1351 depends on numeric fields being indexed both as doc_values and points. This PR does the following: - add a LongPoint field – lastModLP, last modified timestamp - add an IntPoint field – dayOfYearIP, day of the year of the last modified timestamp - add sort on the last modified timestamp to wikimedium.10M.nostopwords.tasks If we make a comparison with a run where sort optimization is not enabled, as hits count may differ for a task not to fail, `competition` in `localrun.py` should be modified to: ``` comp = competition.Competition(verifyCounts=False) ```

Currently, if search sort is equal to index sort, we have an early termination in TopFieldCollector. As we work to enhance comparators to provide skipping functionality (PR apache#1351), we would like to move this termination functionality on index sort from TopFieldCollector to comparators. This patch does the following: - Add method usesIndexSort to LeafFieldComparator - Make numeric comparators aware of index sort and early terminate on collecting all competitive hits - Move TermValComparator and TermOrdValComparator from FieldComparator to comparator package, for all comparators to be in the same package - Enhance TermValComparator to provide skipping functionality when index is sorted One item left for TODO for a following PR is to remove the logic of early termination from TopFieldCollector. We can do that once we ensure that all BulkScorers are using iterators from collectors that can skip non-competitive docs. Relates to apache#1351

Currently, if search sort is equal to index sort, we have an early termination in TopFieldCollector. As we work to enhance comparators to provide skipping functionality (PR apache#1351), we would like to move this termination functionality on index sort from TopFieldCollector to comparators. This patch does the following: - Add method usesIndexSort to LeafFieldComparator - Make numeric comparators aware of index sort and early terminate on collecting all competitive hits - Move TermValComparator and TermOrdValComparator from FieldComparator to comparator package, for all comparators to be in the same package - Enhance TermOrdValComparator to provide skipping functionality when index is sorted One item left for TODO for a following PR is to remove the logic of early termination from TopFieldCollector. We can do that once we ensure that all BulkScorers are using iterators from collectors that can skip non-competitive docs. Relates to apache#1351

Disable sort optimization in comparators on index sort. Currently, if search sort is equal or a part of the index sort, we have an early termination in TopFieldCollector. But comparators are not aware of the index sort, and may run sort optimization even if the search sort is congruent with the index sort. This patch: - make leaf comparators aware that search sort is congruent with the index sort. - disables sort optimization in comparators in this case. - removes a private MultiComparatorLeafCollector class as the only class that extended that class was TopFieldLeafCollector that now incorporates the logic of the deleted class. Relates to apache#1351

Disable sort optimization in comparators on index sort. Currently, if search sort is equal or a part of the index sort, we have an early termination in TopFieldCollector. But comparators are not aware of the index sort, and may run sort optimization even if the search sort is congruent with the index sort. This patch: - adds `disableSkipping` method to `FieldComparator`. This method is called by `TopFieldCollector`, and currently called when the search sort is congruent with the index sort, but more conditions can be added. - disables sort optimization in comparators in this case. - removes a private `MultiComparatorLeafCollector` class, because the only class that extended `MultiComparatorLeafCollector` was `TopFieldLeafCollector`. The logic of the deleted `MultiComparatorLeafCollector` is added to `TopFieldLeafCollector`. Relates to #1351

Disable sort optimization in comparators on index sort. Currently, if search sort is equal or a part of the index sort, we have an early termination in TopFieldCollector. But comparators are not aware of the index sort, and may run sort optimization even if the search sort is congruent with the index sort. This patch: - adds `disableSkipping` method to `FieldComparator`. This method is called by `TopFieldCollector` when the search sort is congruent with the index sort. It is also called when we can't use points for sort optimization. - disables sort optimization in comparators in these cases. Relates to #1351 Backport for #2075

mayya-sharipova changed the title ~~Collectors to skip noncompetitive documents~~ LUCENE-9280: Collectors to skip noncompetitive documents Mar 16, 2020

jimczi reviewed Mar 17, 2020

Address feedback1

6384b15

jimczi reviewed Mar 18, 2020

Address Feedback2

d732d7e

mayya-sharipova commented Mar 19, 2020

Adjust tests

209bc21

msokolov reviewed Mar 22, 2020

mayya-sharipova added 3 commits March 23, 2020 14:57

Address feedback

95e1bc1

Add docs and correct bugs

0e3c7da

Make constructor of LongDocValuesPointSortField public

d7e9507

Adjust docs and tests

1154d4a

mayya-sharipova marked this pull request as ready for review March 26, 2020 20:05

msokolov approved these changes Mar 29, 2020

msokolov requested changes Mar 29, 2020

Optimized sort of field without points

39379a7

Ensure optimized sort works as expected (as long sort) of a field that is not indexed with points.

jpountz reviewed Mar 31, 2020

Address Adrien's feedback

6c628f7

Add IterableFieldComparator

eeb23c1

Add a decorator for FieldComparator that adds the ability to skip over non-competitive docs

Merge remote-tracking branch 'upstream/master' into comparator-set-mi…

8ebcff8

…n-competitive

mayya-sharipova mentioned this pull request Jun 23, 2020

Index points to benchmark sort optimization mikemccand/luceneutil#68

Merged

Add contributors' names

55f2940

mayya-sharipova merged commit b0333ab into apache:master Jun 23, 2020

mayya-sharipova mentioned this pull request Jun 24, 2020

LUCENE-9384: Backport for field sort optimization #1610

Merged

mayya-sharipova mentioned this pull request Oct 5, 2020

LUCENE-9555: Advance conjuction Iterator for two phase iteration #1943

Merged

mayya-sharipova mentioned this pull request Oct 6, 2020

LUCENE-9565 Fix competitive iteration #1952

Merged

mayya-sharipova mentioned this pull request Nov 4, 2020

LUCENE-9599 Make comparator aware of index sorting #2063

Closed

mayya-sharipova mentioned this pull request Nov 10, 2020

LUCENE-9599 Disable sort optim on index sort #2075

Merged

mayya-sharipova mentioned this pull request Dec 3, 2020

LUCENE-9599 Disable sort optim on index sort #2117

Merged

RS146BIJAY mentioned this pull request Apr 14, 2024

[RFC] Context Aware Segments opensearch-project/OpenSearch#13183

Open

RS146BIJAY mentioned this pull request May 20, 2024

Support for criteria based DWPT selection inside DocumentWriter apache/lucene#13387

Open

-        if (doc < min) scorerIterator.advance(min);
+        if (doc < min) {
+          doc = combinedIterator.advance(min);
+        }
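
The point of this diff fragment appears to be that the value returned by `advance` is now assigned back to `doc`, so the local variable stays in sync with the iterator's position. The loop around the fragment is not shown in the source, so the following is only a minimal sketch under assumptions: `ArrayDocIterator` and `AdvanceSketch` are hypothetical toy classes, not Lucene's `DocIdSetIterator`, although the `advance(target)` contract (move to the first doc at or beyond `target`) is modeled on it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy iterator over a sorted array of doc IDs (not Lucene's DocIdSetIterator).
class ArrayDocIterator {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;
  private final int[] docs;
  private int idx = -1;

  ArrayDocIterator(int... docs) {
    this.docs = docs;
  }

  // Current doc ID: -1 before iteration starts, NO_MORE_DOCS when exhausted.
  int docID() {
    if (idx < 0) return -1;
    return idx >= docs.length ? NO_MORE_DOCS : docs[idx];
  }

  int nextDoc() {
    idx++;
    return docID();
  }

  // Moves to the first doc >= target that lies beyond the current position.
  int advance(int target) {
    do {
      idx++;
    } while (idx < docs.length && docs[idx] < target);
    return docID();
  }
}

class AdvanceSketch {
  // Collects every doc >= min, keeping the local `doc` in sync with the iterator.
  static List<Integer> collectFrom(ArrayDocIterator it, int min) {
    List<Integer> collected = new ArrayList<>();
    int doc = it.docID();
    if (doc < min) {
      doc = it.advance(min); // assign the result; discarding it would leave `doc` stale
    }
    while (doc != ArrayDocIterator.NO_MORE_DOCS) {
      collected.add(doc);
      doc = it.nextDoc();
    }
    return collected;
  }
}
```

If the return value of `advance` were discarded, as in the removed line of the diff, the loop in this sketch would begin with a `doc` value that no longer matches where the iterator actually is.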

mayya-sharipova commented Mar 13, 2020

mayya-sharipova commented Mar 13, 2020

jimczi left a comment

mayya-sharipova Mar 18, 2020 • edited

jimczi left a comment

mayya-sharipova Mar 19, 2020 • edited

msokolov left a comment

mayya-sharipova commented Mar 26, 2020 • edited

msokolov commented Mar 26, 2020

mayya-sharipova commented Mar 27, 2020 • edited

msokolov left a comment

msokolov left a comment

mayya-sharipova commented Mar 30, 2020

msokolov commented Mar 30, 2020

mayya-sharipova commented Mar 30, 2020

jpountz left a comment

mayya-sharipova commented Mar 31, 2020

mayya-sharipova commented Apr 2, 2020

romseygeek commented Apr 3, 2020
