Merge branch 'main' into feature/binary_search_in_bkd
* main:
  migrate to temurin (apache#697)
  LUCENE-10424: Optimize the "everything matches" case for count query in PointRangeQuery (apache#691)
  LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori
  Remove deprecated constructors in Nori (apache#695)
  LUCENE-10400: revise binary dictionaries' constructor in nori (apache#693)
  LUCENE-10408: Fix vector values iteration bug (apache#690)
  Temporarily mute TestKnnVectorQuery#testRandomWithFilter
  LUCENE-10382: Support filtering in KnnVectorQuery (apache#656)
  LUCENE-10084: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery when all docs have the field (apache#677)
  Add CHANGES entry for LUCENE-10398
  LUCENE-10398: Add static method for getting Terms from LeafReader (apache#678)
  LUCENE-10408 Better encoding of doc Ids in vectors (apache#649)
  LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count. (apache#685)
wjp719 committed Feb 22, 2022
2 parents 31548f7 + c7602a4 commit 53b96bf
Showing 82 changed files with 1,539 additions and 467 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/gradle-precommit.yml
@@ -73,7 +73,7 @@ jobs:
- name: Set up JDK
uses: actions/setup-java@v2
with:
distribution: 'adopt-hotspot'
distribution: 'temurin'
java-version: ${{ matrix.java }}
java-package: jdk

4 changes: 2 additions & 2 deletions NOTICE.txt
@@ -202,11 +202,11 @@ Nori Korean Morphological Analyzer - Apache Lucene Integration

This software includes a binary and/or source version of data from

mecab-ko-dic-2.0.3-20170922
mecab-ko-dic-2.1.1-20180720

which can be obtained from

https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.3-20170922.tar.gz
https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz

The floating point precision conversion in NumericUtils.Float16Converter is derived from work by
Jeroen van der Zijp, granted for use under the Apache license.
2 changes: 1 addition & 1 deletion gradle/generation/nori.gradle
@@ -54,7 +54,7 @@ configure(project(":lucene:analysis:nori")) {
dependsOn deleteDictionaryData
dependsOn sourceSets.main.runtimeClasspath

def dictionaryName = "mecab-ko-dic-2.0.3-20170922"
def dictionaryName = "mecab-ko-dic-2.1.1-20180720"
def dictionarySource = "https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/${dictionaryName}.tar.gz"
def dictionaryFile = file("${buildDir}/generate/${dictionaryName}.tar.gz")
def unpackedDir = file("${buildDir}/generate/${dictionaryName}")
23 changes: 20 additions & 3 deletions lucene/CHANGES.txt
@@ -15,7 +15,7 @@ API Changes
* LUCENE-10368: IntTaxonomyFacets has been made pkg-private and serves only as an internal
implementation detail of taxonomy-faceting. (Greg Miller)

* LUCENE-10400: Remove deprecated dictionary constructors in Kuromoji (Tomoko Uchida)
* LUCENE-10400: Remove deprecated dictionary constructors in Kuromoji and Nori (Tomoko Uchida)

New Features
---------------------
@@ -77,7 +77,7 @@ API Changes
* LUCENE-10368: IntTaxonomyFacets has been deprecated and is no longer a supported extension point
for user-created faceting implementations. (Greg Miller)

* LUCENE-10400: Add constructors that take external resource Paths to dictionary classes in Kuromoji:
* LUCENE-10400: Add constructors that take external resource Paths to dictionary classes in Kuromoji and Nori:
ConnectionCosts, TokenInfoDictionary, and UnknownDictionary. Old constructors that take resource scheme and
resource path in those classes are deprecated; these are replaced by the new constructors and are planned to be
removed in a future release. (Tomoko Uchida, Uwe Schindler, Mike Sokolov)
@@ -89,6 +89,8 @@ API Changes
* LUCENE-10420: Move functional interfaces in IOUtils to top-level interfaces.
(David Smiley, Uwe Schindler, Dawid Weiss, Tomoko Uchida)

* LUCENE-10398: Add static method for getting Terms from LeafReader. (Spike Liu)
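For illustration only (not part of this commit's diff): a minimal sketch of the new helper, assuming it is exposed as Terms.getTerms(LeafReader, String) and, unlike LeafReader#terms(String), never returns null.

// Sketch only: the helper's exact name and null-handling are assumptions based on the
// CHANGES entry above; everything else is standard Lucene API.
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

final class TermsHelperExample {
  static long countTerms(LeafReader reader, String field) throws IOException {
    Terms terms = Terms.getTerms(reader, field); // assumed non-null, so no null check needed
    TermsEnum it = terms.iterator();
    long count = 0;
    for (BytesRef term = it.next(); term != null; term = it.next()) {
      count++;
    }
    return count;
  }
}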

New Features
---------------------

@@ -154,6 +156,11 @@ New Features
the number of matching range docs when each doc has at most one point and the points are 1-dimensional.
(Gautam Worah, Ignacio Vera, Adrien Grand)

* LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count. (Ignacio Vera)
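To see why delegating Weight#count matters, here is a small sketch (illustration only, not from this diff): wrapping a query for scoring no longer has to block the wrapped query's fast count. The field name and values are placeholders.

import java.io.IOException;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

final class CountDelegationExample {
  static int countBoosted(IndexSearcher searcher) throws IOException {
    Query range = LongPoint.newRangeQuery("timestamp", 0L, 1_000L); // placeholder field/range
    // The wrapper now delegates Weight#count to the wrapped range query,
    // so IndexSearcher#count can still take the cheap counting path where possible.
    Query boosted = FunctionScoreQuery.boostByValue(range, DoubleValuesSource.constant(2.0));
    return searcher.count(boosted);
  }
}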

* LUCENE-10382: Add support for filtering in KnnVectorQuery. This allows for finding the
nearest k documents that also match a query. (Julie Tibshirani, Joel Bernstein)
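A quick usage sketch (illustration only, not part of this diff), assuming the filtering variant is exposed as a KnnVectorQuery constructor that takes a filter Query; the field names are placeholders.

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

final class FilteredKnnExample {
  // Finds the 10 nearest vectors in "embedding" among documents matching the filter.
  static TopDocs searchNearest(IndexSearcher searcher, float[] queryVector) throws IOException {
    Query filter = new TermQuery(new Term("category", "books")); // placeholder filter
    Query knn = new KnnVectorQuery("embedding", queryVector, 10, filter); // assumed constructor shape
    return searcher.search(knn, 10);
  }
}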

Improvements
---------------------

@@ -182,6 +189,9 @@ Improvements
* LUCENE-10371: Make IndexRearranger able to arrange segments in a determined order.
(Patrick Zhai)

* LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori.
(Uihyun Kim)

Optimizations
---------------------

@@ -220,6 +230,13 @@ Optimizations
* LUCENE-10412: More `Query#rewrite` optimizations for MatchNoDocsQuery.
(Adrien Grand)

* LUCENE-10408: Better encoding of doc IDs in vectors. (Mayya Sharipova, Julie Tibshirani, Adrien Grand)

* LUCENE-10424: Optimize the "everything matches" case for count query in PointRangeQuery. (Ignacio Vera, Lu Xugang)

* LUCENE-10084: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery whenever terms
or points have a docCount that is equal to maxDoc. (Vigya Sharma)
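A sketch of the rewrite condition described above (illustration only, not the actual Lucene implementation): a field-exists query over a segment where every document has the field matches everything.

import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.index.Terms;

final class FieldExistsRewriteSketch {
  static boolean allDocsHaveField(LeafReader reader, String field) throws IOException {
    Terms terms = reader.terms(field);
    if (terms != null && terms.getDocCount() == reader.maxDoc()) {
      return true; // every document has at least one term for this field
    }
    PointValues points = reader.getPointValues(field);
    return points != null && points.getDocCount() == reader.maxDoc();
  }
}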

Changes in runtime behavior
---------------------

@@ -614,7 +631,7 @@ Improvements
(David Smiley)

* LUCENE-10062: Switch taxonomy faceting to use numeric doc values for storing ordinals instead of binary doc values
with its own custom encoding. (Greg Miller)
with its own custom encoding. (Greg Miller)

Bug fixes
---------------------
BinaryDictionary.java (Nori analyzer)
@@ -18,30 +18,21 @@

import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.ko.POS;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.InputStreamDataInput;
import org.apache.lucene.util.IOUtils;
import org.apache.lucene.util.IOSupplier;
import org.apache.lucene.util.IntsRef;

/** Base class for a binary-encoded in-memory dictionary. */
public abstract class BinaryDictionary implements Dictionary {

/** Used to specify where (dictionary) resources get loaded from. */
public enum ResourceScheme {
CLASSPATH,
FILE
}

public static final String TARGETMAP_FILENAME_SUFFIX = "$targetMap.dat";
public static final String DICT_FILENAME_SUFFIX = "$buffer.dat";
public static final String POSDICT_FILENAME_SUFFIX = "$posDict.dat";
@@ -51,75 +42,36 @@ public enum ResourceScheme {
public static final String POSDICT_HEADER = "ko_dict_pos";
public static final int VERSION = 1;

private final ResourceScheme resourceScheme;
private final String resourcePath;
private final ByteBuffer buffer;
private final int[] targetMapOffsets, targetMap;
private final POS.Tag[] posDict;

protected BinaryDictionary() throws IOException {
this(ResourceScheme.CLASSPATH, null);
}

/**
* @param resourceScheme - scheme for loading resources (FILE or CLASSPATH).
* @param resourcePath - where to load resources (dictionaries) from. If null, with CLASSPATH
* scheme only, use this class's name as the path.
*/
protected BinaryDictionary(ResourceScheme resourceScheme, String resourcePath)
protected BinaryDictionary(
IOSupplier<InputStream> targetMapResource,
IOSupplier<InputStream> posResource,
IOSupplier<InputStream> dictResource)
throws IOException {
this.resourceScheme = resourceScheme;
if (resourcePath == null) {
if (resourceScheme != ResourceScheme.CLASSPATH) {
throw new IllegalArgumentException(
"resourcePath must be supplied with FILE resource scheme");
}
this.resourcePath = getClass().getSimpleName();
} else {
if (resourceScheme == ResourceScheme.CLASSPATH && !resourcePath.startsWith("/")) {
resourcePath = "/".concat(resourcePath);
}
this.resourcePath = resourcePath;
}
int[] targetMapOffsets, targetMap;
ByteBuffer buffer;
try (InputStream mapIS = new BufferedInputStream(getResource(TARGETMAP_FILENAME_SUFFIX));
InputStream posIS = new BufferedInputStream(getResource(POSDICT_FILENAME_SUFFIX));
// no buffering here, as we load in one large buffer
InputStream dictIS = getResource(DICT_FILENAME_SUFFIX)) {
try (InputStream mapIS = new BufferedInputStream(targetMapResource.get())) {
DataInput in = new InputStreamDataInput(mapIS);
CodecUtil.checkHeader(in, TARGETMAP_HEADER, VERSION, VERSION);
targetMap = new int[in.readVInt()];
targetMapOffsets = new int[in.readVInt()];
int accum = 0, sourceId = 0;
for (int ofs = 0; ofs < targetMap.length; ofs++) {
final int val = in.readVInt();
if ((val & 0x01) != 0) {
targetMapOffsets[sourceId] = ofs;
sourceId++;
}
accum += val >>> 1;
targetMap[ofs] = accum;
}
if (sourceId + 1 != targetMapOffsets.length)
throw new IOException(
"targetMap file format broken; targetMap.length="
+ targetMap.length
+ ", targetMapOffsets.length="
+ targetMapOffsets.length
+ ", sourceId="
+ sourceId);
targetMapOffsets[sourceId] = targetMap.length;
this.targetMap = new int[in.readVInt()];
this.targetMapOffsets = new int[in.readVInt()];
populateTargetMap(in, this.targetMap, this.targetMapOffsets);
}

in = new InputStreamDataInput(posIS);
try (InputStream posIS = new BufferedInputStream(posResource.get())) {
DataInput in = new InputStreamDataInput(posIS);
CodecUtil.checkHeader(in, POSDICT_HEADER, VERSION, VERSION);
int posSize = in.readVInt();
posDict = new POS.Tag[posSize];
this.posDict = new POS.Tag[posSize];
for (int j = 0; j < posSize; j++) {
posDict[j] = POS.resolveTag(in.readByte());
}
}

in = new InputStreamDataInput(dictIS);
// no buffering here, as we load in one large buffer
try (InputStream dictIS = dictResource.get()) {
DataInput in = new InputStreamDataInput(dictIS);
CodecUtil.checkHeader(in, DICT_HEADER, VERSION, VERSION);
final int size = in.readVInt();
final ByteBuffer tmpBuffer = ByteBuffer.allocateDirect(size);
@@ -128,48 +80,31 @@ protected BinaryDictionary(ResourceScheme resourceScheme, String resourcePath)
if (read != size) {
throw new EOFException("Cannot read whole dictionary");
}
buffer = tmpBuffer.asReadOnlyBuffer();
}

this.targetMap = targetMap;
this.targetMapOffsets = targetMapOffsets;
this.buffer = buffer;
}

protected final InputStream getResource(String suffix) throws IOException {
switch (resourceScheme) {
case CLASSPATH:
return getClassResource(resourcePath + suffix);
case FILE:
return Files.newInputStream(Paths.get(resourcePath + suffix));
default:
throw new IllegalStateException("unknown resource scheme " + resourceScheme);
this.buffer = tmpBuffer.asReadOnlyBuffer();
}
}

public static InputStream getResource(ResourceScheme scheme, String path) throws IOException {
switch (scheme) {
case CLASSPATH:
return getClassResource(path);
case FILE:
return Files.newInputStream(Paths.get(path));
default:
throw new IllegalStateException("unknown resource scheme " + scheme);
}
}

// util, reused by ConnectionCosts and CharacterDefinition
public static InputStream getClassResource(Class<?> clazz, String suffix) throws IOException {
final InputStream is = clazz.getResourceAsStream(clazz.getSimpleName() + suffix);
if (is == null) {
throw new FileNotFoundException(
"Not in classpath: " + clazz.getName().replace('.', '/') + suffix);
private static void populateTargetMap(DataInput in, int[] targetMap, int[] targetMapOffsets)
throws IOException {
int accum = 0, sourceId = 0;
for (int ofs = 0; ofs < targetMap.length; ofs++) {
final int val = in.readVInt();
if ((val & 0x01) != 0) {
targetMapOffsets[sourceId] = ofs;
sourceId++;
}
accum += val >>> 1;
targetMap[ofs] = accum;
}
return is;
}

private static InputStream getClassResource(String path) throws IOException {
return IOUtils.requireResourceNonNull(BinaryDictionary.class.getResourceAsStream(path), path);
if (sourceId + 1 != targetMapOffsets.length)
throw new IOException(
"targetMap file format broken; targetMap.length="
+ targetMap.length
+ ", targetMapOffsets.length="
+ targetMapOffsets.length
+ ", sourceId="
+ sourceId);
targetMapOffsets[sourceId] = targetMap.length;
}

public void lookupWordIds(int sourceId, IntsRef ref) {
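The refactor above replaces the ResourceScheme/resourcePath pair with three IOSupplier<InputStream> arguments. As an illustration only, a sketch of how such suppliers could be built from an external path or from the classpath; the helper class and the example base path are hypothetical, while the *_FILENAME_SUFFIX constants are the ones defined in the class above.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.util.IOSupplier;

final class DictionaryResourceSuppliers { // hypothetical helper, not part of this commit
  // e.g. basePath = /data/nori/TokenInfoDictionary, suffix = "$buffer.dat"
  static IOSupplier<InputStream> fromPath(Path basePath, String suffix) {
    return () -> Files.newInputStream(basePath.resolveSibling(basePath.getFileName() + suffix));
  }

  static IOSupplier<InputStream> fromClasspath(Class<?> clazz, String suffix) {
    // getResourceAsStream returns null when missing; real code should check for that
    return () -> clazz.getResourceAsStream(clazz.getSimpleName() + suffix);
  }
}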
CharacterDefinition.java (Nori analyzer)
@@ -73,11 +73,7 @@ enum CharacterClass {
public static final byte HANJANUMERIC = (byte) CharacterClass.HANJANUMERIC.ordinal();

private CharacterDefinition() throws IOException {
InputStream is = null;
boolean success = false;
try {
is = BinaryDictionary.getClassResource(getClass(), FILENAME_SUFFIX);
is = new BufferedInputStream(is);
try (InputStream is = new BufferedInputStream(getClassResource())) {
final DataInput in = new InputStreamDataInput(is);
CodecUtil.checkHeader(in, HEADER, VERSION, VERSION);
in.readBytes(characterCategoryMap, 0, characterCategoryMap.length);
@@ -86,16 +82,15 @@ private CharacterDefinition() throws IOException {
invokeMap[i] = (b & 0x01) != 0;
groupMap[i] = (b & 0x02) != 0;
}
success = true;
} finally {
if (success) {
IOUtils.close(is);
} else {
IOUtils.closeWhileHandlingException(is);
}
}
}

private static InputStream getClassResource() throws IOException {
final String resourcePath = CharacterDefinition.class.getSimpleName() + FILENAME_SUFFIX;
return IOUtils.requireResourceNonNull(
CharacterDefinition.class.getResourceAsStream(resourcePath), resourcePath);
}

public byte getCharacterClass(char c) {
return characterCategoryMap[c];
}
ConnectionCosts.java (Nori analyzer)
@@ -20,9 +20,13 @@
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.InputStreamDataInput;
import org.apache.lucene.util.IOSupplier;
import org.apache.lucene.util.IOUtils;

/** n-gram connection cost data */
public final class ConnectionCosts {
@@ -35,15 +39,21 @@ public final class ConnectionCosts {
private final int forwardSize;

/**
* @param scheme - scheme for loading resources (FILE or CLASSPATH).
* @param resourcePath - where to load resources from, without the ".dat" suffix
* Create a {@link ConnectionCosts} from an external resource path.
*
* @param connectionCostsFile where to load connection costs resource
* @throws IOException if resource was not found or broken
*/
public ConnectionCosts(BinaryDictionary.ResourceScheme scheme, String resourcePath)
throws IOException {
try (InputStream is =
new BufferedInputStream(
BinaryDictionary.getResource(
scheme, "/" + resourcePath.replace('.', '/') + FILENAME_SUFFIX))) {
public ConnectionCosts(Path connectionCostsFile) throws IOException {
this(() -> Files.newInputStream(connectionCostsFile));
}

private ConnectionCosts() throws IOException {
this(ConnectionCosts::getClassResource);
}

private ConnectionCosts(IOSupplier<InputStream> connectionCostResource) throws IOException {
try (InputStream is = new BufferedInputStream(connectionCostResource.get())) {
final DataInput in = new InputStreamDataInput(is);
CodecUtil.checkHeader(in, HEADER, VERSION, VERSION);
this.forwardSize = in.readVInt();
@@ -63,8 +73,10 @@ public ConnectionCosts(BinaryDictionary.ResourceScheme scheme, String resourcePath)
}
}

private ConnectionCosts() throws IOException {
this(BinaryDictionary.ResourceScheme.CLASSPATH, ConnectionCosts.class.getName());
private static InputStream getClassResource() throws IOException {
final String resourcePath = ConnectionCosts.class.getSimpleName() + FILENAME_SUFFIX;
return IOUtils.requireResourceNonNull(
ConnectionCosts.class.getResourceAsStream(resourcePath), resourcePath);
}

public int get(int forwardId, int backwardId) {
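Usage sketch for the new Path-based constructor shown above (illustration only): the file name passed in is an example, while the constructor and get(int, int) come from the diff.

import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.analysis.ko.dict.ConnectionCosts;

final class ConnectionCostsExample {
  static int cost(Path dictDir) throws IOException {
    // Example file name; point this at the generated connection-costs resource.
    ConnectionCosts costs = new ConnectionCosts(dictDir.resolve("ConnectionCosts.dat"));
    return costs.get(0, 0); // connection cost between two context IDs
  }
}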