Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9384: Backport for field sort optimization #1610

Merged

Conversation

mayya-sharipova
Copy link
Contributor

Backport for: LUCENE-9280: Collectors to skip noncompetitive documents (#1351)

Similar how scorers can update their iterators to skip non-competitive
documents, collectors and comparators should also provide and update
iterators that allow them to skip non-competive documents.

To enable sort optimization for numeric sort fields,
the following needs to be done:

  1. the field should be indexed with both doc_values and points, that
    must have the same field name and same data
  2. SortField#setCanSkipNonCompetitiveDocs must be set
  3. totalHitsThreshold should not be set to max value.

Backport for: LUCENE-9280: Collectors to skip noncompetitive documents (apache#1351)

Similar how scorers can update their iterators to skip non-competitive
documents, collectors and comparators should also provide and update
iterators that allow them to skip non-competive documents.

To enable sort optimization for numeric sort fields,
the following needs to be done:
1) the field should be indexed with both doc_values and points, that
must have the same field name and same data
2) SortField#setCanSkipNonCompetitiveDocs must be set
3) totalHitsThreshold should not be set to max value.
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM, I left a comment regarding the name of the option. @romseygeek any idea on how we should expose the option in 8x ?

* that these fields have the same name.
* This allows to use sort optimization and skip non-competitive documents.
*/
public void setSkipNonCompetitiveDocs() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should make the option less ambiguous by naming it something like setCanUsePoints or something along those lines ? I am afraid that this could be misleading if we set this option on a sort field that use the _score or a keyword field for instance. We should maybe at least validate that the field is not null ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about setCanUsePointsForSort? I am also fine with setCanUsePoints.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpountz Do you have suggestions how we should expose sort optimization in 8.x? Are you ok with a proposal to have a method for SortField::setCanUsePointsForSort that when set true indicates that the same data is used for indexing points and docValues?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO setCanUsePoints is good enough since this is a "SortField" so it should be obvious that this is for sorting?

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks good to me. I wonder if updating equals/hashcode is right.

* that these fields have the same name.
* This allows to use sort optimization and skip non-competitive documents.
*/
public void setSkipNonCompetitiveDocs() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO setCanUsePoints is good enough since this is a "SortField" so it should be obvious that this is for sorting?

@@ -408,6 +428,7 @@ public boolean equals(Object o) {
&& other.reverse == this.reverse
&& Objects.equals(this.comparatorSource, other.comparatorSource)
&& Objects.equals(this.missingValue, other.missingValue)
&& other.skipNonCompetitiveDocs == this.skipNonCompetitiveDocs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should actually take it into account for equals/hashcode as this could disable index sorting optimizations, which check for equality of the index-time and search-time SortField instances.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpountz Thank you for your feedback. Alternatively we can add setCanUsePoints parameter to Sort object instead of SortField? What do you think?

@mayya-sharipova
Copy link
Contributor Author

@jimczi @jpountz Thanks for your feedback, I have addressed it:

  • renamed SortField#setSkipNonCompetitiveDocs to SortField#setCanUsePoints
  • remove canUsePoints field from SortField#equals and hashCode.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -428,7 +428,6 @@ public boolean equals(Object o) {
&& other.reverse == this.reverse
&& Objects.equals(this.comparatorSource, other.comparatorSource)
&& Objects.equals(this.missingValue, other.missingValue)
&& other.skipNonCompetitiveDocs == this.skipNonCompetitiveDocs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment to say that we're not taking it into account on purpose? I foresee that someone will file a PR to add it after running a code analysis tool that would complain that not all fields are taken into account for equals otherwise.

@mayya-sharipova mayya-sharipova merged commit 7b12849 into apache:branch_8x Jul 31, 2020
@mayya-sharipova mayya-sharipova deleted the sort-optimization-8.x branch July 31, 2020 18:13
gus-asf pushed a commit to gus-asf/lucene-solr that referenced this pull request Sep 4, 2020
Backport for: LUCENE-9280: Collectors to skip noncompetitive documents (apache#1351)

Similar how scorers can update their iterators to skip non-competitive
documents, collectors and comparators should also provide and update
iterators that allow them to skip non-competive documents.

To enable sort optimization for numeric sort fields,
the following needs to be done:
1) the field should be indexed with both doc_values and points, that
must have the same field name and same data
2) SortField#setCanUsePoints must be set
3) totalHitsThreshold should not be set to max value.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants