Merge branch 'main' into feature/binary_search_in_bkd
* main:
  migrate to temurin (apache#697)
  LUCENE-10424: Optimize the "everything matches" case for count query in PointRangeQuery (apache#691)
  LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori
  Remove deprecated constructors in Nori (apache#695)
  LUCENE-10400: revise binary dictionaries' constructor in nori (apache#693)
  LUCENE-10408: Fix vector values iteration bug (apache#690)
  Temporarily mute TestKnnVectorQuery#testRandomWithFilter
  LUCENE-10382: Support filtering in KnnVectorQuery (apache#656)
  LUCENE-10084: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery when all docs have the field (apache#677)
  Add CHANGES entry for LUCENE-10398
  LUCENE-10398: Add static method for getting Terms from LeafReader (apache#678)
  LUCENE-10408 Better encoding of doc Ids in vectors (apache#649)
  LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count. (apache#685)
wjp719 committed Feb 22, 2022
2 parents 31548f7 + c7602a4 commit 53b96bf
Showing 82 changed files with 1,539 additions and 467 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/gradle-precommit.yml
@@ -73,7 +73,7 @@ jobs:
- name: Set up JDK
uses: actions/setup-java@v2
with:
distribution: 'adopt-hotspot'
distribution: 'temurin'
java-version: ${{ matrix.java }}
java-package: jdk

4 changes: 2 additions & 2 deletions NOTICE.txt
@@ -202,11 +202,11 @@ Nori Korean Morphological Analyzer - Apache Lucene Integration

This software includes a binary and/or source version of data from

mecab-ko-dic-2.0.3-20170922
mecab-ko-dic-2.1.1-20180720

which can be obtained from

https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.3-20170922.tar.gz
https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz

The floating point precision conversion in NumericUtils.Float16Converter is derived from work by
Jeroen van der Zijp, granted for use under the Apache license.
2 changes: 1 addition & 1 deletion gradle/generation/nori.gradle
@@ -54,7 +54,7 @@ configure(project(":lucene:analysis:nori")) {
dependsOn deleteDictionaryData
dependsOn sourceSets.main.runtimeClasspath

def dictionaryName = "mecab-ko-dic-2.0.3-20170922"
def dictionaryName = "mecab-ko-dic-2.1.1-20180720"
def dictionarySource = "https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/${dictionaryName}.tar.gz"
def dictionaryFile = file("${buildDir}/generate/${dictionaryName}.tar.gz")
def unpackedDir = file("${buildDir}/generate/${dictionaryName}")
23 changes: 20 additions & 3 deletions lucene/CHANGES.txt
@@ -15,7 +15,7 @@ API Changes
* LUCENE-10368: IntTaxonomyFacets has been made pkg-private and serves only as an internal
implementation detail of taxonomy-faceting. (Greg Miller)

* LUCENE-10400: Remove deprecated dictionary constructors in Kuromoji (Tomoko Uchida)
* LUCENE-10400: Remove deprecated dictionary constructors in Kuromoji and Nori (Tomoko Uchida)

New Features
---------------------
@@ -77,7 +77,7 @@ API Changes
* LUCENE-10368: IntTaxonomyFacets has been deprecated and is no longer a supported extension point
for user-created faceting implementations. (Greg Miller)

* LUCENE-10400: Add constructors that take external resource Paths to dictionary classes in Kuromoji:
* LUCENE-10400: Add constructors that take external resource Paths to dictionary classes in Kuromoji and Nori:
ConnectionCosts, TokenInfoDictionary, and UnknownDictionary. Old constructors that take resource scheme and
resource path in those classes are deprecated; these are replaced by the new constructors and are planned to be
removed in a future release. (Tomoko Uchida, Uwe Schindler, Mike Sokolov)
@@ -89,6 +89,8 @@ API Changes
* LUCENE-10420: Move functional interfaces in IOUtils to top-level interfaces.
(David Smiley, Uwe Schindler, Dawid Weiss, Tomoko Uchida)

* LUCENE-10398: Add static method for getting Terms from LeafReader. (Spike Liu)
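For illustration only (not part of this commit's diff): a minimal sketch of the new helper, assuming it is exposed as Terms.getTerms(LeafReader, String) and, unlike LeafReader#terms(String), never returns null.

// Sketch only: the helper's exact name and null-handling are assumptions based on the
// CHANGES entry above; everything else is standard Lucene API.
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

final class TermsHelperExample {
  static long countTerms(LeafReader reader, String field) throws IOException {
    Terms terms = Terms.getTerms(reader, field); // assumed non-null, so no null check needed
    TermsEnum it = terms.iterator();
    long count = 0;
    for (BytesRef term = it.next(); term != null; term = it.next()) {
      count++;
    }
    return count;
  }
}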

New Features
---------------------

@@ -154,6 +156,11 @@ New Features
the number of matching range docs when each doc has at most one point and the points are 1-dimensional.
(Gautam Worah, Ignacio Vera, Adrien Grand)

* LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count. (Ignacio Vera)
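To see why delegating Weight#count matters, here is a small sketch (illustration only, not from this diff): wrapping a query for scoring no longer has to block the wrapped query's fast count. The field name and values are placeholders.

import java.io.IOException;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

final class CountDelegationExample {
  static int countBoosted(IndexSearcher searcher) throws IOException {
    Query range = LongPoint.newRangeQuery("timestamp", 0L, 1_000L); // placeholder field/range
    // The wrapper now delegates Weight#count to the wrapped range query,
    // so IndexSearcher#count can still take the cheap counting path where possible.
    Query boosted = FunctionScoreQuery.boostByValue(range, DoubleValuesSource.constant(2.0));
    return searcher.count(boosted);
  }
}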

* LUCENE-10382: Add support for filtering in KnnVectorQuery. This allows for finding the
nearest k documents that also match a query. (Julie Tibshirani, Joel Bernstein)
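A quick usage sketch (illustration only, not part of this diff), assuming the filtering variant is exposed as a KnnVectorQuery constructor that takes a filter Query; the field names are placeholders.

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

final class FilteredKnnExample {
  // Finds the 10 nearest vectors in "embedding" among documents matching the filter.
  static TopDocs searchNearest(IndexSearcher searcher, float[] queryVector) throws IOException {
    Query filter = new TermQuery(new Term("category", "books")); // placeholder filter
    Query knn = new KnnVectorQuery("embedding", queryVector, 10, filter); // assumed constructor shape
    return searcher.search(knn, 10);
  }
}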

Improvements
---------------------

@@ -182,6 +189,9 @@ Improvements
* LUCENE-10371: Make IndexRearranger able to arrange segments in a determined order.
(Patrick Zhai)

* LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori.
(Uihyun Kim)

Optimizations
---------------------

@@ -220,6 +230,13 @@ Optimizations
* LUCENE-10412: More `Query#rewrite` optimizations for MatchNoDocsQuery.
(Adrien Grand)

* LUCENE-10408: Better encoding of doc IDs in vectors. (Mayya Sharipova, Julie Tibshirani, Adrien Grand)

* LUCENE-10424: Optimize the "everything matches" case for count query in PointRangeQuery. (Ignacio Vera, Lu Xugang)

* LUCENE-10084: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery whenever terms
or points have a docCount that is equal to maxDoc. (Vigya Sharma)
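A sketch of the rewrite condition described above (illustration only, not the actual Lucene implementation): a field-exists query over a segment where every document has the field matches everything.

import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.index.Terms;

final class FieldExistsRewriteSketch {
  static boolean allDocsHaveField(LeafReader reader, String field) throws IOException {
    Terms terms = reader.terms(field);
    if (terms != null && terms.getDocCount() == reader.maxDoc()) {
      return true; // every document has at least one term for this field
    }
    PointValues points = reader.getPointValues(field);
    return points != null && points.getDocCount() == reader.maxDoc();
  }
}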

Changes in runtime behavior
---------------------

@@ -614,7 +631,7 @@ Improvements
(David Smiley)

* LUCENE-10062: Switch taxonomy faceting to use numeric doc values for storing ordinals instead of binary doc values
with its own custom encoding. (Greg Miller)
with its own custom encoding. (Greg Miller)

Bug fixes
---------------------
BinaryDictionary.java (Nori analyzer)
@@ -18,30 +18,21 @@

import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.ko.POS;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.InputStreamDataInput;
import org.apache.lucene.util.IOUtils;
import org.apache.lucene.util.IOSupplier;
import org.apache.lucene.util.IntsRef;

/** Base class for a binary-encoded in-memory dictionary. */
public abstract class BinaryDictionary implements Dictionary {

/** Used to specify where (dictionary) resources get loaded from. */
public enum ResourceScheme {
CLASSPATH,
FILE
}

public static final String TARGETMAP_FILENAME_SUFFIX = "$targetMap.dat";
public static final String DICT_FILENAME_SUFFIX = "$buffer.dat";
public static final String POSDICT_FILENAME_SUFFIX = "$posDict.dat";
@@ -51,75 +42,36 @@ public enum ResourceScheme {
public static final String POSDICT_HEADER = "ko_dict_pos";
public static final int VERSION = 1;

private final ResourceScheme resourceScheme;
private final String resourcePath;
private final ByteBuffer buffer;
private final int[] targetMapOffsets, targetMap;
private final POS.Tag[] posDict;

protected BinaryDictionary() throws IOException {
this(ResourceScheme.CLASSPATH, null);
}

/**
* @param resourceScheme - scheme for loading resources (FILE or CLASSPATH).
* @param resourcePath - where to load resources (dictionaries) from. If null, with CLASSPATH
* scheme only, use this class's name as the path.
*/
protected BinaryDictionary(ResourceScheme resourceScheme, String resourcePath)
protected BinaryDictionary(
IOSupplier<InputStream> targetMapResource,
IOSupplier<InputStream> posResource,
IOSupplier<InputStream> dictResource)
throws IOException {
this.resourceScheme = resourceScheme;
if (resourcePath == null) {
if (resourceScheme != ResourceScheme.CLASSPATH) {
throw new IllegalArgumentException(
"resourcePath must be supplied with FILE resource scheme");
}
this.resourcePath = getClass().getSimpleName();
} else {
if (resourceScheme == ResourceScheme.CLASSPATH && !resourcePath.startsWith("/")) {
resourcePath = "/".concat(resourcePath);
}
this.resourcePath = resourcePath;
}
int[] targetMapOffsets, targetMap;
ByteBuffer buffer;
try (InputStream mapIS = new BufferedInputStream(getResource(TARGETMAP_FILENAME_SUFFIX));
InputStream posIS = new BufferedInputStream(getResource(POSDICT_FILENAME_SUFFIX));
// no buffering here, as we load in one large buffer
InputStream dictIS = getResource(DICT_FILENAME_SUFFIX)) {
try (InputStream mapIS = new BufferedInputStream(targetMapResource.get())) {
DataInput in = new InputStreamDataInput(mapIS);
CodecUtil.checkHeader(in, TARGETMAP_HEADER, VERSION, VERSION);
targetMap = new int[in.readVInt()];
targetMapOffsets = new int[in.readVInt()];
int accum = 0, sourceId = 0;
for (int ofs = 0; ofs < targetMap.length; ofs++) {
final int val = in.readVInt();
if ((val & 0x01) != 0) {
targetMapOffsets[sourceId] = ofs;
sourceId++;
}
accum += val >>> 1;
targetMap[ofs] = accum;
}
if (sourceId + 1 != targetMapOffsets.length)
throw new IOException(
"targetMap file format broken; targetMap.length="
+ targetMap.length
+ ", targetMapOffsets.length="
+ targetMapOffsets.length
+ ", sourceId="
+ sourceId);
targetMapOffsets[sourceId] = targetMap.length;
this.targetMap = new int[in.readVInt()];
this.targetMapOffsets = new int[in.readVInt()];
populateTargetMap(in, this.targetMap, this.targetMapOffsets);
}

in = new InputStreamDataInput(posIS);
try (InputStream posIS = new BufferedInputStream(posResource.get())) {
DataInput in = new InputStreamDataInput(posIS);
CodecUtil.checkHeader(in, POSDICT_HEADER, VERSION, VERSION);
int posSize = in.readVInt();
posDict = new POS.Tag[posSize];
this.posDict = new POS.Tag[posSize];
for (int j = 0; j < posSize; j++) {
posDict[j] = POS.resolveTag(in.readByte());
}
}

in = new InputStreamDataInput(dictIS);
// no buffering here, as we load in one large buffer
try (InputStream dictIS = dictResource.get()) {
DataInput in = new InputStreamDataInput(dictIS);
CodecUtil.checkHeader(in, DICT_HEADER, VERSION, VERSION);
final int size = in.readVInt();
final ByteBuffer tmpBuffer = ByteBuffer.allocateDirect(size);
@@ -128,48 +80,31 @@ protected BinaryDictionary(ResourceScheme resourceScheme, String resourcePath)
if (read != size) {
throw new EOFException("Cannot read whole dictionary");
}
buffer = tmpBuffer.asReadOnlyBuffer();
}

this.targetMap = targetMap;
this.targetMapOffsets = targetMapOffsets;
this.buffer = buffer;
}

protected final InputStream getResource(String suffix) throws IOException {
switch (resourceScheme) {
case CLASSPATH:
return getClassResource(resourcePath + suffix);
case FILE:
return Files.newInputStream(Paths.get(resourcePath + suffix));
default:
throw new IllegalStateException("unknown resource scheme " + resourceScheme);
this.buffer = tmpBuffer.asReadOnlyBuffer();
}
}

public static InputStream getResource(ResourceScheme scheme, String path) throws IOException {
switch (scheme) {
case CLASSPATH:
return getClassResource(path);
case FILE:
return Files.newInputStream(Paths.get(path));
default:
throw new IllegalStateException("unknown resource scheme " + scheme);
}
}

// util, reused by ConnectionCosts and CharacterDefinition
public static InputStream getClassResource(Class<?> clazz, String suffix) throws IOException {
final InputStream is = clazz.getResourceAsStream(clazz.getSimpleName() + suffix);
if (is == null) {
throw new FileNotFoundException(
"Not in classpath: " + clazz.getName().replace('.', '/') + suffix);
private static void populateTargetMap(DataInput in, int[] targetMap, int[] targetMapOffsets)
throws IOException {
int accum = 0, sourceId = 0;
for (int ofs = 0; ofs < targetMap.length; ofs++) {
final int val = in.readVInt();
if ((val & 0x01) != 0) {
targetMapOffsets[sourceId] = ofs;
sourceId++;
}
accum += val >>> 1;
targetMap[ofs] = accum;
}
return is;
}

private static InputStream getClassResource(String path) throws IOException {
return IOUtils.requireResourceNonNull(BinaryDictionary.class.getResourceAsStream(path), path);
if (sourceId + 1 != targetMapOffsets.length)
throw new IOException(
"targetMap file format broken; targetMap.length="
+ targetMap.length
+ ", targetMapOffsets.length="
+ targetMapOffsets.length
+ ", sourceId="
+ sourceId);
targetMapOffsets[sourceId] = targetMap.length;
}

public void lookupWordIds(int sourceId, IntsRef ref) {
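The refactor above replaces the ResourceScheme/resourcePath pair with three IOSupplier<InputStream> arguments. As an illustration only, a sketch of how such suppliers could be built from an external path or from the classpath; the helper class and the example base path are hypothetical, while the *_FILENAME_SUFFIX constants are the ones defined in the class above.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.util.IOSupplier;

final class DictionaryResourceSuppliers { // hypothetical helper, not part of this commit
  // e.g. basePath = /data/nori/TokenInfoDictionary, suffix = "$buffer.dat"
  static IOSupplier<InputStream> fromPath(Path basePath, String suffix) {
    return () -> Files.newInputStream(basePath.resolveSibling(basePath.getFileName() + suffix));
  }

  static IOSupplier<InputStream> fromClasspath(Class<?> clazz, String suffix) {
    // getResourceAsStream returns null when missing; real code should check for that
    return () -> clazz.getResourceAsStream(clazz.getSimpleName() + suffix);
  }
}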
CharacterDefinition.java (Nori analyzer)
@@ -73,11 +73,7 @@ enum CharacterClass {
public static final byte HANJANUMERIC = (byte) CharacterClass.HANJANUMERIC.ordinal();

private CharacterDefinition() throws IOException {
InputStream is = null;
boolean success = false;
try {
is = BinaryDictionary.getClassResource(getClass(), FILENAME_SUFFIX);
is = new BufferedInputStream(is);
try (InputStream is = new BufferedInputStream(getClassResource())) {
final DataInput in = new InputStreamDataInput(is);
CodecUtil.checkHeader(in, HEADER, VERSION, VERSION);
in.readBytes(characterCategoryMap, 0, characterCategoryMap.length);
@@ -86,16 +82,15 @@ private CharacterDefinition() throws IOException {
invokeMap[i] = (b & 0x01) != 0;
groupMap[i] = (b & 0x02) != 0;
}
success = true;
} finally {
if (success) {
IOUtils.close(is);
} else {
IOUtils.closeWhileHandlingException(is);
}
}
}

private static InputStream getClassResource() throws IOException {
final String resourcePath = CharacterDefinition.class.getSimpleName() + FILENAME_SUFFIX;
return IOUtils.requireResourceNonNull(
CharacterDefinition.class.getResourceAsStream(resourcePath), resourcePath);
}

public byte getCharacterClass(char c) {
return characterCategoryMap[c];
}
ConnectionCosts.java (Nori analyzer)
@@ -20,9 +20,13 @@
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.InputStreamDataInput;
import org.apache.lucene.util.IOSupplier;
import org.apache.lucene.util.IOUtils;

/** n-gram connection cost data */
public final class ConnectionCosts {
@@ -35,15 +39,21 @@ public final class ConnectionCosts {
private final int forwardSize;

/**
* @param scheme - scheme for loading resources (FILE or CLASSPATH).
* @param resourcePath - where to load resources from, without the ".dat" suffix
* Create a {@link ConnectionCosts} from an external resource path.
*
* @param connectionCostsFile where to load connection costs resource
* @throws IOException if resource was not found or broken
*/
public ConnectionCosts(BinaryDictionary.ResourceScheme scheme, String resourcePath)
throws IOException {
try (InputStream is =
new BufferedInputStream(
BinaryDictionary.getResource(
scheme, "/" + resourcePath.replace('.', '/') + FILENAME_SUFFIX))) {
public ConnectionCosts(Path connectionCostsFile) throws IOException {
this(() -> Files.newInputStream(connectionCostsFile));
}

private ConnectionCosts() throws IOException {
this(ConnectionCosts::getClassResource);
}

private ConnectionCosts(IOSupplier<InputStream> connectionCostResource) throws IOException {
try (InputStream is = new BufferedInputStream(connectionCostResource.get())) {
final DataInput in = new InputStreamDataInput(is);
CodecUtil.checkHeader(in, HEADER, VERSION, VERSION);
this.forwardSize = in.readVInt();
@@ -63,8 +73,10 @@ public ConnectionCosts(BinaryDictionary.ResourceScheme scheme, String resourcePath)
}
}

private ConnectionCosts() throws IOException {
this(BinaryDictionary.ResourceScheme.CLASSPATH, ConnectionCosts.class.getName());
private static InputStream getClassResource() throws IOException {
final String resourcePath = ConnectionCosts.class.getSimpleName() + FILENAME_SUFFIX;
return IOUtils.requireResourceNonNull(
ConnectionCosts.class.getResourceAsStream(resourcePath), resourcePath);
}

public int get(int forwardId, int backwardId) {
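Usage sketch for the new Path-based constructor shown above (illustration only): the file name passed in is an example, while the constructor and get(int, int) come from the diff.

import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.analysis.ko.dict.ConnectionCosts;

final class ConnectionCostsExample {
  static int cost(Path dictDir) throws IOException {
    // Example file name; point this at the generated connection-costs resource.
    ConnectionCosts costs = new ConnectionCosts(dictDir.resolve("ConnectionCosts.dat"));
    return costs.get(0, 0); // connection cost between two context IDs
  }
}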