Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Require consistency between data-structures on a per-field basis [LUCENE-9334] #10374

Closed
asfimport opened this issue Apr 21, 2020 · 24 comments
Closed

Comments

@asfimport
Copy link

asfimport commented Apr 21, 2020

Follow-up of https://lists.apache.org/thread.html/r747de568afd7502008c45783b74cc3aeb31dab8aa60fcafaf65d5431%40%3Cdev.lucene.apache.org%3E.

We would like to start requiring consitency across data-structures on a per-field basis in order to make it easier to do the right thing by default: range queries can run faster if doc values are enabled, sorted queries can run faster if points by indexed, etc.

This would be a big change, so it should be rolled out in a major.

Strict validation is tricky to implement, but we should still implement best-effort validation:

  • Documents all use the same data-structures, e.g. it is illegal for a document to only enable points and another document to only enable doc values,
  • When possible, check whether values are consistent too.

Migrated from LUCENE-9334 by Adrien Grand (@jpountz), 2 votes, resolved Jul 16 2021
Linked issues:

Pull requests: apache/lucene-solr#2186

@asfimport
Copy link
Author

asfimport commented Apr 22, 2020

David Smiley (@dsmiley) (migrated from JIRA)

The scope of this seems to subsume #10120, so I'm linking.  No work was done there AFAIK.

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit d03662c in lucene's branch refs/heads/main from Mayya Sharipova
https://gitbox.apache.org/repos/asf?p=lucene.git;h=d03662c

LUCENE-9334 Consistency of field data structures

Require consistency between data-structures on a per-field basis

A field must be indexed with the same index options and data-structures across
all documents. Thus, for example, it is not allowed to have one document
where a certain field is indexed with doc values and points, and another document
where the same field is indexed only with points.
But it is allowed for a document not to have a certain field at all.

As a consequence of this, doc values updates are
only applicable for fields that are indexed with doc values only.

@asfimport
Copy link
Author

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

I noticed a test failure from this commit (I haven't had the chance to dig in):

./gradlew test --tests TestLucene50TermVectorsFormat.testMerge -Dtests.seed=C091C70F50381021 -Dtests.nightly=true -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=ml-IN -Dtests.timezone=America/Montevideo -Dtests.asserts=true -Dtests.file.encoding=UTF-8

@asfimport
Copy link
Author

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

@jtibshiraniThanks for the report on this test failure. I will investigate it. I've temporarily muted the test in #86

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit 49c7cc1 in lucene's branch refs/heads/main from Mayya Sharipova
https://gitbox.apache.org/repos/asf?p=lucene.git;h=49c7cc1

Fix test that modifies schema (#87)

LUCENE-9334 requires that docs have the same schema
across the whole schema.
This fixes the test that attempts to modify schema of "number" field
from DocValues and Points to just DocValues.

Relates to #11

@asfimport
Copy link
Author

asfimport commented Apr 20, 2021

Mike Drob (@madrob) (migrated from JIRA)

I think this is causing SOLR-15360, but I can't say for certain. If there's any chance that somebody can come over and help us understand a bit more, that would be much appreciated.

@asfimport
Copy link
Author

asfimport commented Apr 20, 2021

David Smiley (@dsmiley) (migrated from JIRA)

Mike, the issue you just filed is effectively a duplicate of SOLR-15356 which I spent time debugging for that one. Already solved :-). I sent a message to the dev list about this too the other day.

@asfimport
Copy link
Author

Mike Drob (@madrob) (migrated from JIRA)

Thanks David! I tried searching the dev list and for existing issues, but it looks like I started with the other end of the failing tests than you did. Thanks for being proactive!

@asfimport
Copy link
Author

Dawid Weiss (@dweiss) (migrated from JIRA)

TestPerFieldConsistency has been failing on and off recently and it seems relevant to this issue. Those failures are not really reproducible but they do look similar. Is this something known? Should it be annotated as awaiting a fix?

Build: https://jenkins.thetaphi.de/job/Lucene-main-Linux/30142/
Java: 64bit/jdk-15 -XX:-UseCompressedOops -XX:+UseSerialGC

1 tests failed.
FAILED:  org.apache.lucene.document.TestPerFieldConsistency.testDocWithMissingSchemaOptionsThrowsError

Error Message:
java.lang.AssertionError: expected:<4> but was:<0>

Stack Trace:
java.lang.AssertionError: expected:<4> but was:<0>
        at __randomizedtesting.SeedInfo.seed([589A8C303E4D61F4:20DDC33151E81C85]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.apache.lucene.document.TestPerFieldConsistency.testDocWithMissingSchemaOptionsThrowsError(TestPerFieldConsistency.java:172)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)

@asfimport
Copy link
Author

Ignacio Vera (@iverase) (migrated from JIRA)

I had a look and I see the problem with the test. We need to add an IndexWriterConfig with SerialMergeScheduler in order to reproduce the failures:

 IndexWriterConfig iwc = newIndexWriterConfig();
  // Else seeds may not reproduce:
  iwc.setMergeScheduler(new SerialMergeScheduler());

Adding that, the following seed reproduces the failure:

 ./gradlew cleanTest test --tests TestPerFieldConsistency -Dtests.seed=C40258ABF5E76DCB -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=te-IN -Dtests.timezone=SystemV/CST6CDT -Dtests.asserts=true -Dtests.file.encoding=UTF-8

@asfimport
Copy link
Author

Ignacio Vera (@iverase) (migrated from JIRA)

The test assumes there will be no merges in the background which is not true. Maybe an easy fix is to disable merges:

iwc.setMergePolicy(NoMergePolicy.INSTANCE); 

@asfimport
Copy link
Author

Dawid Weiss (@dweiss) (migrated from JIRA)

I think that's a good first step - I don't know this patch. @mayya-sharipova may have a better insight.

@asfimport
Copy link
Author

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

@dweiss Thanks for raising the failure, and thanks @iverase for investigation. @iverase Indeed, the test assumes no merging. I've created a fix in PR, and will merge it today.

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit b5a77de in lucene's branch refs/heads/main from Mayya Sharipova
https://gitbox.apache.org/repos/asf?p=lucene.git;h=b5a77de

Fix failures in TestPerFieldConsistency (#125)

This test assumes that there is no merging,
and was failing when there were merges.
This fixes the test but setting NoMergePolicy for
IndexWriter.

Relates to LUCENE-9334
Relates to #11

@asfimport
Copy link
Author

Adrien Grand (@jpountz) (migrated from JIRA)

Is my understanding correct that we still need to look into relaxing these checks for indexes created with version 8.x or earlier before resolving this issue?

@asfimport
Copy link
Author

asfimport commented Jul 2, 2021

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

@jpountz For old indices with inconsistent data structures, the current behaviour is following:

  1. We can read them
  2. We can't write new docs that introduce inconsistencies (e.g. a doc introduces a field indexed with points where previous this field was indexed only with docvalues).
  3. We can't do any writes, if segments of the index have inconsistent differences in schemas (e.g. in one segment a field is indexed with points, and in another segment with points and doc values).  This behaviour is similar to Disallow changing index options on the fly [LUCENE-8134] #9182 for old indices, where IndexWriter will not be opened if different segments have different index options for a field. 
  4. We can do writes (new docs and merges), if the segments have consistency, but individual docs are inconsistent (e.g. within a segment, one doc has a field indexed with points, and another doc has a field indexed with points and doc values).

I think #2 and #3 are desirable behaviours and we should keep them, as for new segments we would like to have guarantees of consistencies.

#4 is not ideal, we can do a check for consistency and refuse to do merges into a new segment if there are any inconsistencies between docs,  but this could be an expensive operation.  If we are ok with inconsistencies between docs carried over from old indices, then there is nothing left to be done for this issue.

What do you think?

@asfimport
Copy link
Author

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks @mayya-sharipova, this makes sense to me. I wasn't sure if we'd still allow reading old indices with inconsistencies, which felt important. #4 feels consistent with how we have dealt with e.g. broken offsets in the past, e.g. we started rejecting broken offsets in Lucene 7.0 but still allowed carrying them over through merges for indices created by 6.x. Then let's close this issue now?

@asfimport
Copy link
Author

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

@jpountz Thank for your feedback and clarifications, it makes sense.

I just need to modify a test org.apache.lucene.facet.taxonomy.directory.TestBackwardsCompatibility::testCreateNewTaxonomy that I disabled, as this test introduces new data structures, and I will be closing this issue.

@asfimport
Copy link
Author

asfimport commented Jul 16, 2021

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

Closing this issue, at the progress on taxonomy backwards compatibility test will be tracked in its own issue #10490

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit 7cb6960 in lucene's branch refs/heads/main from Gautam Worah
https://gitbox.apache.org/repos/asf?p=lucene.git;h=7cb6960

Category documents added in the Lucene 9.0 taxonomy index use a
BDV field with a different name

Using BDV fields with a different "$full_path_binary$" name
ensures that the earlier "$full_path$" StringField does not have the same name as the
BDV field and hence they don't violate the field type consistency check
(LUCENE-9334).

This commit also enables the back-compat check that was disabled
earlier.

@asfimport
Copy link
Author

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.0.0 release

@asfimport
Copy link
Author

asfimport commented Feb 17, 2022

ASF subversion and git services (migrated from JIRA)

Commit c132bbf in lucene's branch refs/heads/main from Vigya Sharma
https://gitbox.apache.org/repos/asf?p=lucene.git;h=c132bbf

#11122: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery when all docs have the field (#677)

Since all documents are required to use the same features (LUCENE-9334) we can
rewrite DocValuesFieldExistsQuery to a MatchAllDocsQuery whenever terms or
points have a docCount that is equal to maxDoc.

@asfimport
Copy link
Author

asfimport commented Feb 17, 2022

ASF subversion and git services (migrated from JIRA)

Commit a9532f3 in lucene's branch refs/heads/branch_9x from Vigya Sharma
https://gitbox.apache.org/repos/asf?p=lucene.git;h=a9532f3

#11122: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery when all docs have the field (#677)

Since all documents are required to use the same features (LUCENE-9334) we can
rewrite DocValuesFieldExistsQuery to a MatchAllDocsQuery whenever terms or
points have a docCount that is equal to maxDoc.

@ArtemLukanin-TomTom
Copy link

ArtemLukanin-TomTom commented Aug 5, 2023

In Lucene 8 it was possible to add stored and indexed fields separately to the document (because they are effectively written to different files). In Lucene 9 after this change it does not work any more. E.g.

doc.add(new Field("title", "Book", StoredField.TYPE);
doc.add(new Field("title", "book", StringField.TYPE_NOT_STORED);
indexWriter.addDocument(doc);
doc2.add(new Field("title", "Another Book", StoredField.TYPE);
doc2.add(new Field("title", "another", StringField.TYPE_NOT_STORED);
doc2.add(new Field("title", "book", StringField.TYPE_NOT_STORED);
indexWriter.addDocument(doc2);

Does it mean, that I can use only this code?:

doc.add(new Field("title", "Book", StringField.TYPE_STORED);
indexWriter.addDocument(doc);
doc2.add(new Field("title", "Another Book", StringField.TYPE_STORED);
indexWriter.addDocument(doc2);

In the last case I cannot control stored and index values separately, but have to use a common analyzer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants