Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10400: revise binary dictionaries' constructor in kuromoji #643

Merged
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
63ca8e7
rewrite BinaryDictionary constructor in kuromoji
mocobeta Feb 3, 2022
90c2b54
fix bugs in locating resources
mocobeta Feb 3, 2022
d4c89c1
add a Deprecated annotation
mocobeta Feb 3, 2022
cb1c549
use UncheckedIOException; add wrapper interface for input stream supp…
mocobeta Feb 5, 2022
7841817
take valid Path object in constructors
mocobeta Feb 5, 2022
873bb04
add license header
mocobeta Feb 5, 2022
9c2c971
retire abstract BinaryDictionary's two-args constructor
mocobeta Feb 5, 2022
db1494e
remove unused variables
mocobeta Feb 5, 2022
8bfc984
split up fat BinaryDictionary constructor
mocobeta Feb 5, 2022
bfb1cda
add final
mocobeta Feb 5, 2022
ecc4ee9
make initialization methods synchronized
mocobeta Feb 5, 2022
ea72ded
invoke new constructor from deprecated ctor
mocobeta Feb 5, 2022
bfd437c
remove synchronized from private static methods
mocobeta Feb 5, 2022
3cc46a3
add forRemoval and since elements to Deprecated
mocobeta Feb 5, 2022
12960dd
cleanup old methods
mocobeta Feb 5, 2022
d1986ea
introduce IOUtils.ResourceSupplier and rewrite ctors to use it
mocobeta Feb 5, 2022
bd5894b
fix typo
mocobeta Feb 5, 2022
d219d64
rename ResourceSupplier to IOSupplier
mocobeta Feb 5, 2022
4e96ed4
add note to javadocs
mocobeta Feb 5, 2022
72c9cab
Update github hunspell regression test to use JDK 17
gsmiller Feb 5, 2022
d769795
add changes entry
mocobeta Feb 6, 2022
785f3f3
Merge branch 'main' into jira/LUCENE-10400-cleanup-kuromoji-dictionary
mocobeta Feb 6, 2022
32aff3f
add forRemoval=true to ResourceScheme enum
mocobeta Feb 6, 2022
5c6a77a
simplify private method signatures
mocobeta Feb 6, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -18,14 +18,15 @@

import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Supplier;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.InputStreamDataInput;
Expand All @@ -36,6 +37,7 @@
public abstract class BinaryDictionary implements Dictionary {

/** Used to specify where (dictionary) resources get loaded from. */
@Deprecated
public enum ResourceScheme {
CLASSPATH,
FILE
Expand All @@ -58,15 +60,12 @@ public enum ResourceScheme {
private final String[] inflTypeDict;
private final String[] inflFormDict;

protected BinaryDictionary() throws IOException {
this(ResourceScheme.CLASSPATH, null);
}

/**
* @param resourceScheme - scheme for loading resources (FILE or CLASSPATH).
* @param resourcePath - where to load resources (dictionaries) from. If null, with CLASSPATH
* scheme only, use this class's name as the path.
*/
@Deprecated
protected BinaryDictionary(ResourceScheme resourceScheme, String resourcePath)
throws IOException {
this.resourceScheme = resourceScheme;
Expand Down Expand Up @@ -154,6 +153,98 @@ protected BinaryDictionary(ResourceScheme resourceScheme, String resourcePath)
this.buffer = buffer;
}

protected BinaryDictionary(
Supplier<InputStream> targetMapResource,
Supplier<InputStream> posResource,
Supplier<InputStream> dictResource)
throws IOException {
this.resourceScheme = null;
this.resourcePath = null;

int[] targetMapOffsets = null, targetMap = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to assign nulls to these? Could we assign directly to members below?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm now I see we basically just copy-paste the big ugly method that was here before. I wonder if it's possible to share the same implementation -- in other words can we make the old constructor call this one?

Copy link
Contributor Author

@mocobeta mocobeta Feb 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we can't call/delegate to other this() from the old constructor. A kind of chicken-egg problem is there - getClass() in it - this is needed to delegate to the newly added constructor that takes input streams, but it's a constructor so getClass() can't be invoked before this().

Instead, we can retire the old constructor and call this() or new super() in the implementation classes without changing public APIs. Anyway, it's not great to have the if-else for resource switching and the current super() call to delegate class resources loading; I think it is okay to remove the old constructor in the abstract BinaryDictionary right now?
9c2c971

String[] posDict = null;
String[] inflFormDict = null;
String[] inflTypeDict = null;
ByteBuffer buffer = null;
try (InputStream mapIS = new BufferedInputStream(targetMapResource.get());
InputStream posIS = new BufferedInputStream(posResource.get());
// no buffering here, as we load in one large buffer
InputStream dictIS = dictResource.get()) {
DataInput in = new InputStreamDataInput(mapIS);
CodecUtil.checkHeader(in, TARGETMAP_HEADER, VERSION, VERSION);
targetMap = new int[in.readVInt()];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we read the targetMap in a separate function for better readability/modularity?

targetMapOffsets = new int[in.readVInt()];
int accum = 0, sourceId = 0;
for (int ofs = 0; ofs < targetMap.length; ofs++) {
final int val = in.readVInt();
if ((val & 0x01) != 0) {
targetMapOffsets[sourceId] = ofs;
sourceId++;
}
accum += val >>> 1;
targetMap[ofs] = accum;
}
if (sourceId + 1 != targetMapOffsets.length)
throw new IOException(
"targetMap file format broken; targetMap.length="
+ targetMap.length
+ ", targetMapOffsets.length="
+ targetMapOffsets.length
+ ", sourceId="
+ sourceId);
targetMapOffsets[sourceId] = targetMap.length;

in = new InputStreamDataInput(posIS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here - can we split out into a separate function? I guess Java's handling of final makes it annoying to assign to these members in a function called from the constructor. Perhaps we can return a String[][] ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally all fields and variables should be final, so I agree. In addition, assigning null to variables first and then reassigning a real value is bad code pattern! They should be final and only once assigned. This is a relic from the old code, we should clean it up.
For the special case where you need to pass a value to super.ctor, you can use a static method. All other code should be moved to ctor only. This is especially important for multithreading, because the ctor is ensured to be atomic.

Copy link
Contributor Author

@mocobeta mocobeta Feb 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd do this.
8bfc984

and maybe add synchronized
ecc4ee9

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The synchronized is not needed if the method is private and only called from ctor.

Problematic are methods tha are public or public to subclasses or those which have sideeffects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The synchronized is not needed if the method is private and only called from ctor.

I will remove "synchronized" here again.

CodecUtil.checkHeader(in, POSDICT_HEADER, VERSION, VERSION);
int posSize = in.readVInt();
posDict = new String[posSize];
inflTypeDict = new String[posSize];
inflFormDict = new String[posSize];
for (int j = 0; j < posSize; j++) {
posDict[j] = in.readString();
inflTypeDict[j] = in.readString();
inflFormDict[j] = in.readString();
// this is how we encode null inflections
if (inflTypeDict[j].length() == 0) {
inflTypeDict[j] = null;
}
if (inflFormDict[j].length() == 0) {
inflFormDict[j] = null;
}
}

in = new InputStreamDataInput(dictIS);
CodecUtil.checkHeader(in, DICT_HEADER, VERSION, VERSION);
final int size = in.readVInt();
final ByteBuffer tmpBuffer = ByteBuffer.allocateDirect(size);
final ReadableByteChannel channel = Channels.newChannel(dictIS);
final int read = channel.read(tmpBuffer);
if (read != size) {
throw new EOFException("Cannot read whole dictionary");
}
buffer = tmpBuffer.asReadOnlyBuffer();
}

this.targetMap = targetMap;
this.targetMapOffsets = targetMapOffsets;
this.posDict = posDict;
this.inflTypeDict = inflTypeDict;
this.inflFormDict = inflFormDict;
this.buffer = buffer;
}

protected static Supplier<InputStream> openFileOrThrowRuntimeException(Path path)
throws RuntimeException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use IOExceptionUnchecked

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UncheckedIOException :-)

Copy link
Contributor Author

@mocobeta mocobeta Feb 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to throw UncheckedIOException. Also, I added a utility interface to wrap the try-catch clauses here and there.
cb1c549

return () -> {
try {
return Files.newInputStream(path);
} catch (IOException e) {
throw new RuntimeException(e);
}
};
}

@Deprecated
protected final InputStream getResource(String suffix) throws IOException {
switch (resourceScheme) {
case CLASSPATH:
Expand All @@ -165,6 +256,7 @@ protected final InputStream getResource(String suffix) throws IOException {
}
}

@Deprecated
public static final InputStream getResource(ResourceScheme scheme, String path)
throws IOException {
switch (scheme) {
Expand All @@ -177,17 +269,7 @@ public static final InputStream getResource(ResourceScheme scheme, String path)
}
}

// util, reused by ConnectionCosts and CharacterDefinition
public static final InputStream getClassResource(Class<?> clazz, String suffix)
throws IOException {
final InputStream is = clazz.getResourceAsStream(clazz.getSimpleName() + suffix);
if (is == null) {
throw new FileNotFoundException(
"Not in classpath: " + clazz.getName().replace('.', '/') + suffix);
}
return is;
}

@Deprecated
private static InputStream getClassResource(String path) throws IOException {
return IOUtils.requireResourceNonNull(BinaryDictionary.class.getResourceAsStream(path), path);
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.InputStreamDataInput;
import org.apache.lucene.util.IOUtils;

/** Character category data. */
public final class CharacterDefinition {
Expand Down Expand Up @@ -68,8 +69,11 @@ private enum CharacterClass {
public static final byte KANJINUMERIC = (byte) CharacterClass.KANJINUMERIC.ordinal();

private CharacterDefinition() throws IOException {
final String resourcePath = CharacterDefinition.class.getSimpleName() + FILENAME_SUFFIX;
try (InputStream is =
new BufferedInputStream(BinaryDictionary.getClassResource(getClass(), FILENAME_SUFFIX))) {
new BufferedInputStream(
IOUtils.requireResourceNonNull(
CharacterDefinition.class.getResourceAsStream(resourcePath), resourcePath))) {
final DataInput in = new InputStreamDataInput(is);
CodecUtil.checkHeader(in, HEADER, VERSION, VERSION);
in.readBytes(characterCategoryMap, 0, characterCategoryMap.length);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -20,9 +20,14 @@
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Supplier;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.InputStreamDataInput;
import org.apache.lucene.util.IOUtils;

/** n-gram connection cost data */
public final class ConnectionCosts {
Expand All @@ -37,7 +42,9 @@ public final class ConnectionCosts {
/**
* @param scheme - scheme for loading resources (FILE or CLASSPATH).
* @param path - where to load resources from, without the ".dat" suffix
* @deprecated replaced by {@link #ConnectionCosts(String)}
*/
@Deprecated
public ConnectionCosts(BinaryDictionary.ResourceScheme scheme, String path) throws IOException {
try (InputStream is =
new BufferedInputStream(
Expand All @@ -61,8 +68,63 @@ public ConnectionCosts(BinaryDictionary.ResourceScheme scheme, String path) thro
}
}

/**
* Create a {@link ConnectionCosts} from an external resource path.
*
* @param resourceLocation where to load resources from
* @throws IOException
*/
public ConnectionCosts(String resourceLocation) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should take Pathas Parameter, no String. According to Robert, Paths.get() should be put on forbidden list, as it makes testing with custom path impls impossible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the interfaces to take the valid Path objects.
7841817

this(openFileOrThrowRuntimeException(Paths.get(resourceLocation + FILENAME_SUFFIX)));
}

private ConnectionCosts() throws IOException {
this(BinaryDictionary.ResourceScheme.CLASSPATH, ConnectionCosts.class.getName());
this(getClassResourceOrThrowRuntimeException(FILENAME_SUFFIX));
}

private ConnectionCosts(Supplier<InputStream> connectionCostResource) throws IOException {
try (InputStream is = new BufferedInputStream(connectionCostResource.get())) {
final DataInput in = new InputStreamDataInput(is);
CodecUtil.checkHeader(in, HEADER, VERSION, VERSION);
forwardSize = in.readVInt();
int backwardSize = in.readVInt();
int size = forwardSize * backwardSize;

// copy the matrix into a direct byte buffer
final ByteBuffer tmpBuffer = ByteBuffer.allocateDirect(size * 2);
int accum = 0;
for (int j = 0; j < backwardSize; j++) {
for (int i = 0; i < forwardSize; i++) {
accum += in.readZInt();
tmpBuffer.putShort((short) accum);
}
}
buffer = tmpBuffer.asReadOnlyBuffer();
}
}

private static Supplier<InputStream> openFileOrThrowRuntimeException(Path path)
throws RuntimeException {
return () -> {
try {
return Files.newInputStream(path);
} catch (IOException e) {
throw new RuntimeException(e);
}
};
}

private static Supplier<InputStream> getClassResourceOrThrowRuntimeException(String suffix)
throws RuntimeException {
final String resourcePath = ConnectionCosts.class.getSimpleName() + suffix;
return () -> {
try {
return IOUtils.requireResourceNonNull(
ConnectionCosts.class.getResourceAsStream(resourcePath), resourcePath);
} catch (IOException e) {
throw new RuntimeException(e);
}
};
}

public int get(int forwardId, int backwardId) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -19,8 +19,11 @@
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Paths;
import java.util.function.Supplier;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.InputStreamDataInput;
import org.apache.lucene.util.IOUtils;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;

Expand All @@ -38,7 +41,9 @@ public final class TokenInfoDictionary extends BinaryDictionary {
* @param resourceScheme - scheme for loading resources (FILE or CLASSPATH).
* @param resourcePath - where to load resources (dictionaries) from. If null, with CLASSPATH
* scheme only, use this class's name as the path.
* @deprecated replaced by {@link #TokenInfoDictionary(String)}
*/
@Deprecated
public TokenInfoDictionary(ResourceScheme resourceScheme, String resourcePath)
throws IOException {
super(resourceScheme, resourcePath);
Expand All @@ -51,8 +56,55 @@ public TokenInfoDictionary(ResourceScheme resourceScheme, String resourcePath)
this.fst = new TokenInfoFST(fst, true);
}

/**
* Create a {@link TokenInfoDictionary} from an external resource path.
*
* @param resourceLocation where to load resources (dictionaries) from.
* @throws IOException
*/
public TokenInfoDictionary(String resourceLocation) throws IOException {
this(
openFileOrThrowRuntimeException(Paths.get(resourceLocation + TARGETMAP_FILENAME_SUFFIX)),
openFileOrThrowRuntimeException(Paths.get(resourceLocation + POSDICT_FILENAME_SUFFIX)),
openFileOrThrowRuntimeException(Paths.get(resourceLocation + DICT_FILENAME_SUFFIX)),
openFileOrThrowRuntimeException(Paths.get(resourceLocation + FST_FILENAME_SUFFIX)));
}

private TokenInfoDictionary() throws IOException {
this(ResourceScheme.CLASSPATH, null);
this(
getClassResourceOrThrowRuntimeException(TARGETMAP_FILENAME_SUFFIX),
getClassResourceOrThrowRuntimeException(POSDICT_FILENAME_SUFFIX),
getClassResourceOrThrowRuntimeException(DICT_FILENAME_SUFFIX),
getClassResourceOrThrowRuntimeException(FST_FILENAME_SUFFIX));
}

private TokenInfoDictionary(
Supplier<InputStream> targetMapResource,
Supplier<InputStream> posResource,
Supplier<InputStream> dictResource,
Supplier<InputStream> fstResource)
throws IOException {
super(targetMapResource, posResource, dictResource);
FST<Long> fst;
try (InputStream is = new BufferedInputStream(fstResource.get())) {
DataInput in = new InputStreamDataInput(is);
fst = new FST<>(in, in, PositiveIntOutputs.getSingleton());
}
// TODO: some way to configure?
this.fst = new TokenInfoFST(fst, true);
}

private static Supplier<InputStream> getClassResourceOrThrowRuntimeException(String suffix)
throws RuntimeException {
final String resourcePath = TokenInfoDictionary.class.getSimpleName() + suffix;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "simple name + suffix" naming convention is ubiquitous in kuromoji (and nori). I am wondering if this is an appropriate method to locate resources. I think this question could pop up again when we closely look at BinaryDictionaryWriter#write().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is at least fine for the "default loading" for our own resources. It should not be used for public resources (there we now only have constructors taking file streams, as it makes no sense for external files).

return () -> {
try {
return IOUtils.requireResourceNonNull(
TokenInfoDictionary.class.getResourceAsStream(resourcePath), resourcePath);
} catch (IOException e) {
throw new RuntimeException(e);
}
};
}

public TokenInfoFST getFST() {
Expand Down
Loading