LUCENE-10400: revise binary dictionaries' constructor in kuromoji #643
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,14 +18,15 @@ | |
|
||
import java.io.BufferedInputStream; | ||
import java.io.EOFException; | ||
import java.io.FileNotFoundException; | ||
import java.io.IOException; | ||
import java.io.InputStream; | ||
import java.nio.ByteBuffer; | ||
import java.nio.channels.Channels; | ||
import java.nio.channels.ReadableByteChannel; | ||
import java.nio.file.Files; | ||
import java.nio.file.Path; | ||
import java.nio.file.Paths; | ||
import java.util.function.Supplier; | ||
import org.apache.lucene.codecs.CodecUtil; | ||
import org.apache.lucene.store.DataInput; | ||
import org.apache.lucene.store.InputStreamDataInput; | ||
|
@@ -36,6 +37,7 @@ | |
public abstract class BinaryDictionary implements Dictionary { | ||
|
||
/** Used to specify where (dictionary) resources get loaded from. */ | ||
@Deprecated | ||
public enum ResourceScheme { | ||
CLASSPATH, | ||
FILE | ||
|
@@ -58,15 +60,12 @@ public enum ResourceScheme { | |
private final String[] inflTypeDict; | ||
private final String[] inflFormDict; | ||
|
||
protected BinaryDictionary() throws IOException { | ||
this(ResourceScheme.CLASSPATH, null); | ||
} | ||
|
||
/** | ||
* @param resourceScheme - scheme for loading resources (FILE or CLASSPATH). | ||
* @param resourcePath - where to load resources (dictionaries) from. If null, with CLASSPATH | ||
* scheme only, use this class's name as the path. | ||
*/ | ||
@Deprecated | ||
protected BinaryDictionary(ResourceScheme resourceScheme, String resourcePath) | ||
throws IOException { | ||
this.resourceScheme = resourceScheme; | ||
|
@@ -154,6 +153,98 @@ protected BinaryDictionary(ResourceScheme resourceScheme, String resourcePath) | |
this.buffer = buffer; | ||
} | ||
|
||
protected BinaryDictionary( | ||
Supplier<InputStream> targetMapResource, | ||
Supplier<InputStream> posResource, | ||
Supplier<InputStream> dictResource) | ||
throws IOException { | ||
this.resourceScheme = null; | ||
this.resourcePath = null; | ||
|
||
int[] targetMapOffsets = null, targetMap = null; | ||
String[] posDict = null; | ||
String[] inflFormDict = null; | ||
String[] inflTypeDict = null; | ||
ByteBuffer buffer = null; | ||
try (InputStream mapIS = new BufferedInputStream(targetMapResource.get()); | ||
InputStream posIS = new BufferedInputStream(posResource.get()); | ||
// no buffering here, as we load in one large buffer | ||
InputStream dictIS = dictResource.get()) { | ||
DataInput in = new InputStreamDataInput(mapIS); | ||
CodecUtil.checkHeader(in, TARGETMAP_HEADER, VERSION, VERSION); | ||
targetMap = new int[in.readVInt()]; | ||
Review comment: Can we read the targetMap in a separate function for better readability/modularity?
||
targetMapOffsets = new int[in.readVInt()]; | ||
int accum = 0, sourceId = 0; | ||
for (int ofs = 0; ofs < targetMap.length; ofs++) { | ||
final int val = in.readVInt(); | ||
if ((val & 0x01) != 0) { | ||
targetMapOffsets[sourceId] = ofs; | ||
sourceId++; | ||
} | ||
accum += val >>> 1; | ||
targetMap[ofs] = accum; | ||
} | ||
if (sourceId + 1 != targetMapOffsets.length) | ||
throw new IOException( | ||
"targetMap file format broken; targetMap.length=" | ||
+ targetMap.length | ||
+ ", targetMapOffsets.length=" | ||
+ targetMapOffsets.length | ||
+ ", sourceId=" | ||
+ sourceId); | ||
targetMapOffsets[sourceId] = targetMap.length; | ||
|
||
in = new InputStreamDataInput(posIS); | ||
Review comment: Same here: can we split this out into a separate function? I guess Java's handling of final makes it annoying to assign to these members in a function called from the constructor. Perhaps we can return a String[][]?
Review comment: Generally all fields and variables should be final, so I agree. In addition, assigning null to variables first and then reassigning a real value is a bad code pattern! They should be final and assigned only once. This is a relic from the old code; we should clean it up.
Review comment: The synchronized is not needed if the method is private and only called from the constructor. The problematic cases are methods that are public, visible to subclasses, or have side effects.
Review comment: I will remove "synchronized" here again.
||
CodecUtil.checkHeader(in, POSDICT_HEADER, VERSION, VERSION); | ||
int posSize = in.readVInt(); | ||
posDict = new String[posSize]; | ||
inflTypeDict = new String[posSize]; | ||
inflFormDict = new String[posSize]; | ||
for (int j = 0; j < posSize; j++) { | ||
posDict[j] = in.readString(); | ||
inflTypeDict[j] = in.readString(); | ||
inflFormDict[j] = in.readString(); | ||
// this is how we encode null inflections | ||
if (inflTypeDict[j].length() == 0) { | ||
inflTypeDict[j] = null; | ||
} | ||
if (inflFormDict[j].length() == 0) { | ||
inflFormDict[j] = null; | ||
} | ||
} | ||
|
||
in = new InputStreamDataInput(dictIS); | ||
CodecUtil.checkHeader(in, DICT_HEADER, VERSION, VERSION); | ||
final int size = in.readVInt(); | ||
final ByteBuffer tmpBuffer = ByteBuffer.allocateDirect(size); | ||
final ReadableByteChannel channel = Channels.newChannel(dictIS); | ||
final int read = channel.read(tmpBuffer); | ||
if (read != size) { | ||
throw new EOFException("Cannot read whole dictionary"); | ||
} | ||
buffer = tmpBuffer.asReadOnlyBuffer(); | ||
} | ||
|
||
this.targetMap = targetMap; | ||
this.targetMapOffsets = targetMapOffsets; | ||
this.posDict = posDict; | ||
this.inflTypeDict = inflTypeDict; | ||
this.inflFormDict = inflFormDict; | ||
this.buffer = buffer; | ||
} | ||
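The review suggestions on this constructor (move the parsing into a helper, keep the fields final, avoid null placeholders) could be sketched roughly as follows. This is a minimal illustration, not the PR's code: `PosDictSketch`, `readPosDict`, `emptyToNull`, and `demo` are hypothetical names, and the stream format uses plain `java.io.DataInputStream` calls (`readInt`, `readUTF`) as a stand-in for Lucene's `CodecUtil` header check and `DataInput` encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for BinaryDictionary; names and file format are illustrative only. */
class PosDictSketch {
  // Every field is final and assigned exactly once, with no null placeholders.
  private final String[] posDict;
  private final String[] inflTypeDict;
  private final String[] inflFormDict;

  /** The caller owns (and closes) the stream in this sketch. */
  PosDictSketch(InputStream posResource) throws IOException {
    final String[][] parsed = readPosDict(new DataInputStream(posResource));
    this.posDict = parsed[0];
    this.inflTypeDict = parsed[1];
    this.inflFormDict = parsed[2];
  }

  /** Static, so it cannot touch instance state; returns {pos, inflType, inflForm}. */
  private static String[][] readPosDict(DataInputStream in) throws IOException {
    final int posSize = in.readInt();
    final String[][] result = new String[3][posSize];
    for (int j = 0; j < posSize; j++) {
      result[0][j] = in.readUTF();
      result[1][j] = emptyToNull(in.readUTF());
      result[2][j] = emptyToNull(in.readUTF());
    }
    return result;
  }

  // An empty string encodes a null inflection, mirroring the constructor in the diff.
  private static String emptyToNull(String s) {
    return s.isEmpty() ? null : s;
  }

  String pos(int i) { return posDict[i]; }
  String inflType(int i) { return inflTypeDict[i]; }
  String inflForm(int i) { return inflFormDict[i]; }

  /** Builds a two-entry sample in memory and parses it back. */
  static PosDictSketch demo() {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(bytes)) {
        out.writeInt(2);
        out.writeUTF("noun"); out.writeUTF(""); out.writeUTF("");
        out.writeUTF("verb"); out.writeUTF("godan"); out.writeUTF("base");
      }
      return new PosDictSketch(new ByteArrayInputStream(bytes.toByteArray()));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because the helper is private and static, it has no access to partially constructed instance state, which is one reason it needs no synchronization when called only from the constructor.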
|
||
protected static Supplier<InputStream> openFileOrThrowRuntimeException(Path path) | ||
throws RuntimeException { | ||
Review comment: Let's use
Review comment: Changed to throw UncheckedIOException. Also, I added a utility interface to wrap the try-catch clauses here and there.
||
return () -> { | ||
try { | ||
return Files.newInputStream(path); | ||
} catch (IOException e) { | ||
throw new RuntimeException(e); | ||
} | ||
}; | ||
} | ||
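The reply on this method mentions switching to `UncheckedIOException` and adding a utility interface to wrap the repeated try-catch clauses. The page does not show that interface, so the following is only a guess at its general shape: `IOSupplier`, `unchecked`, and `openFile` are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

class UncheckedIOSketch {
  /** Hypothetical interface: like Supplier, but get() may throw IOException. */
  @FunctionalInterface
  interface IOSupplier<T> {
    T get() throws IOException;
  }

  /** Adapts an IOSupplier to a plain Supplier, rethrowing IOException unchecked. */
  static <T> Supplier<T> unchecked(IOSupplier<T> supplier) {
    return () -> {
      try {
        return supplier.get();
      } catch (IOException e) {
        // UncheckedIOException keeps the original IOException as its cause.
        throw new UncheckedIOException(e);
      }
    };
  }

  /** The file is opened lazily, when the returned supplier's get() is called. */
  static Supplier<InputStream> openFile(Path path) {
    return unchecked(() -> Files.newInputStream(path));
  }
}
```

With a wrapper like this, each resource-opening helper shrinks to a single line, and callers that need the original failure can unwrap it with `getCause()`.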
|
||
@Deprecated | ||
protected final InputStream getResource(String suffix) throws IOException { | ||
switch (resourceScheme) { | ||
case CLASSPATH: | ||
|
@@ -165,6 +256,7 @@ protected final InputStream getResource(String suffix) throws IOException { | |
} | ||
} | ||
|
||
@Deprecated | ||
public static final InputStream getResource(ResourceScheme scheme, String path) | ||
throws IOException { | ||
switch (scheme) { | ||
|
@@ -177,17 +269,7 @@ public static final InputStream getResource(ResourceScheme scheme, String path) | |
} | ||
} | ||
|
||
// util, reused by ConnectionCosts and CharacterDefinition | ||
public static final InputStream getClassResource(Class<?> clazz, String suffix) | ||
throws IOException { | ||
final InputStream is = clazz.getResourceAsStream(clazz.getSimpleName() + suffix); | ||
if (is == null) { | ||
throw new FileNotFoundException( | ||
"Not in classpath: " + clazz.getName().replace('.', '/') + suffix); | ||
} | ||
return is; | ||
} | ||
|
||
@Deprecated | ||
private static InputStream getClassResource(String path) throws IOException { | ||
return IOUtils.requireResourceNonNull(BinaryDictionary.class.getResourceAsStream(path), path); | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -20,9 +20,14 @@ | |
import java.io.IOException; | ||
import java.io.InputStream; | ||
import java.nio.ByteBuffer; | ||
import java.nio.file.Files; | ||
import java.nio.file.Path; | ||
import java.nio.file.Paths; | ||
import java.util.function.Supplier; | ||
import org.apache.lucene.codecs.CodecUtil; | ||
import org.apache.lucene.store.DataInput; | ||
import org.apache.lucene.store.InputStreamDataInput; | ||
import org.apache.lucene.util.IOUtils; | ||
|
||
/** n-gram connection cost data */ | ||
public final class ConnectionCosts { | ||
|
@@ -37,7 +42,9 @@ public final class ConnectionCosts { | |
/** | ||
* @param scheme - scheme for loading resources (FILE or CLASSPATH). | ||
* @param path - where to load resources from, without the ".dat" suffix | ||
* @deprecated replaced by {@link #ConnectionCosts(String)} | ||
*/ | ||
@Deprecated | ||
public ConnectionCosts(BinaryDictionary.ResourceScheme scheme, String path) throws IOException { | ||
try (InputStream is = | ||
new BufferedInputStream( | ||
|
@@ -61,8 +68,63 @@ public ConnectionCosts(BinaryDictionary.ResourceScheme scheme, String path) thro | |
} | ||
} | ||
|
||
/** | ||
* Create a {@link ConnectionCosts} from an external resource path. | ||
* | ||
* @param resourceLocation where to load resources from | ||
* @throws IOException | ||
*/ | ||
public ConnectionCosts(String resourceLocation) throws IOException { | ||
Review comment: This should take
Review comment: I changed the interfaces to take the valid Path objects.
||
this(openFileOrThrowRuntimeException(Paths.get(resourceLocation + FILENAME_SUFFIX))); | ||
} | ||
|
||
private ConnectionCosts() throws IOException { | ||
- this(BinaryDictionary.ResourceScheme.CLASSPATH, ConnectionCosts.class.getName()); | ||
+ this(getClassResourceOrThrowRuntimeException(FILENAME_SUFFIX)); | ||
} | ||
|
||
private ConnectionCosts(Supplier<InputStream> connectionCostResource) throws IOException { | ||
try (InputStream is = new BufferedInputStream(connectionCostResource.get())) { | ||
final DataInput in = new InputStreamDataInput(is); | ||
CodecUtil.checkHeader(in, HEADER, VERSION, VERSION); | ||
forwardSize = in.readVInt(); | ||
int backwardSize = in.readVInt(); | ||
int size = forwardSize * backwardSize; | ||
|
||
// copy the matrix into a direct byte buffer | ||
final ByteBuffer tmpBuffer = ByteBuffer.allocateDirect(size * 2); | ||
int accum = 0; | ||
for (int j = 0; j < backwardSize; j++) { | ||
for (int i = 0; i < forwardSize; i++) { | ||
accum += in.readZInt(); | ||
tmpBuffer.putShort((short) accum); | ||
} | ||
} | ||
buffer = tmpBuffer.asReadOnlyBuffer(); | ||
} | ||
} | ||
|
||
private static Supplier<InputStream> openFileOrThrowRuntimeException(Path path) | ||
throws RuntimeException { | ||
return () -> { | ||
try { | ||
return Files.newInputStream(path); | ||
} catch (IOException e) { | ||
throw new RuntimeException(e); | ||
} | ||
}; | ||
} | ||
|
||
private static Supplier<InputStream> getClassResourceOrThrowRuntimeException(String suffix) | ||
throws RuntimeException { | ||
final String resourcePath = ConnectionCosts.class.getSimpleName() + suffix; | ||
return () -> { | ||
try { | ||
return IOUtils.requireResourceNonNull( | ||
ConnectionCosts.class.getResourceAsStream(resourcePath), resourcePath); | ||
} catch (IOException e) { | ||
throw new RuntimeException(e); | ||
} | ||
}; | ||
} | ||
|
||
public int get(int forwardId, int backwardId) { | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,8 +19,11 @@ | |
import java.io.BufferedInputStream; | ||
import java.io.IOException; | ||
import java.io.InputStream; | ||
import java.nio.file.Paths; | ||
import java.util.function.Supplier; | ||
import org.apache.lucene.store.DataInput; | ||
import org.apache.lucene.store.InputStreamDataInput; | ||
import org.apache.lucene.util.IOUtils; | ||
import org.apache.lucene.util.fst.FST; | ||
import org.apache.lucene.util.fst.PositiveIntOutputs; | ||
|
||
|
@@ -38,7 +41,9 @@ public final class TokenInfoDictionary extends BinaryDictionary { | |
* @param resourceScheme - scheme for loading resources (FILE or CLASSPATH). | ||
* @param resourcePath - where to load resources (dictionaries) from. If null, with CLASSPATH | ||
* scheme only, use this class's name as the path. | ||
* @deprecated replaced by {@link #TokenInfoDictionary(String)} | ||
*/ | ||
@Deprecated | ||
public TokenInfoDictionary(ResourceScheme resourceScheme, String resourcePath) | ||
throws IOException { | ||
super(resourceScheme, resourcePath); | ||
|
@@ -51,8 +56,55 @@ public TokenInfoDictionary(ResourceScheme resourceScheme, String resourcePath) | |
this.fst = new TokenInfoFST(fst, true); | ||
} | ||
|
||
/** | ||
* Create a {@link TokenInfoDictionary} from an external resource path. | ||
* | ||
* @param resourceLocation where to load resources (dictionaries) from. | ||
* @throws IOException | ||
*/ | ||
public TokenInfoDictionary(String resourceLocation) throws IOException { | ||
this( | ||
openFileOrThrowRuntimeException(Paths.get(resourceLocation + TARGETMAP_FILENAME_SUFFIX)), | ||
openFileOrThrowRuntimeException(Paths.get(resourceLocation + POSDICT_FILENAME_SUFFIX)), | ||
openFileOrThrowRuntimeException(Paths.get(resourceLocation + DICT_FILENAME_SUFFIX)), | ||
openFileOrThrowRuntimeException(Paths.get(resourceLocation + FST_FILENAME_SUFFIX))); | ||
} | ||
|
||
private TokenInfoDictionary() throws IOException { | ||
- this(ResourceScheme.CLASSPATH, null); | ||
+ this( | ||
+ getClassResourceOrThrowRuntimeException(TARGETMAP_FILENAME_SUFFIX), | ||
+ getClassResourceOrThrowRuntimeException(POSDICT_FILENAME_SUFFIX), | ||
+ getClassResourceOrThrowRuntimeException(DICT_FILENAME_SUFFIX), | ||
+ getClassResourceOrThrowRuntimeException(FST_FILENAME_SUFFIX)); | ||
} | ||
|
||
private TokenInfoDictionary( | ||
Supplier<InputStream> targetMapResource, | ||
Supplier<InputStream> posResource, | ||
Supplier<InputStream> dictResource, | ||
Supplier<InputStream> fstResource) | ||
throws IOException { | ||
super(targetMapResource, posResource, dictResource); | ||
FST<Long> fst; | ||
try (InputStream is = new BufferedInputStream(fstResource.get())) { | ||
DataInput in = new InputStreamDataInput(is); | ||
fst = new FST<>(in, in, PositiveIntOutputs.getSingleton()); | ||
} | ||
// TODO: some way to configure? | ||
this.fst = new TokenInfoFST(fst, true); | ||
} | ||
|
||
private static Supplier<InputStream> getClassResourceOrThrowRuntimeException(String suffix) | ||
throws RuntimeException { | ||
final String resourcePath = TokenInfoDictionary.class.getSimpleName() + suffix; | ||
Review comment: This "simple name + suffix" naming convention is ubiquitous in kuromoji (and nori). I am wondering if this is an appropriate method to locate resources. I think this question could pop up again when we look closely at BinaryDictionaryWriter#write().
Review comment: I think this is at least fine for the "default loading" of our own resources. It should not be used for public resources (there we now only have constructors taking file streams, as it makes no sense for external files).
||
return () -> { | ||
try { | ||
return IOUtils.requireResourceNonNull( | ||
TokenInfoDictionary.class.getResourceAsStream(resourcePath), resourcePath); | ||
} catch (IOException e) { | ||
throw new RuntimeException(e); | ||
} | ||
}; | ||
} | ||
|
||
public TokenInfoFST getFST() { | ||
|
Review comment: Is it necessary to assign nulls to these? Could we assign directly to members below?
Review comment: Hmm, now I see we basically just copy-paste the big ugly method that was here before. I wonder if it's possible to share the same implementation; in other words, can we make the old constructor call this one?
Review comment: Unfortunately, we can't delegate to another this() from the old constructor. There is a chicken-and-egg problem: the old constructor relies on getClass(), which would be needed to delegate to the newly added constructor that takes input streams, but in a constructor getClass() can't be invoked before this().
Instead, we can retire the old constructor and call this() or the new super() in the implementation classes without changing public APIs. Anyway, it's not great to have the if-else for resource switching and the current super() call to delegate class resource loading; I think it is okay to remove the old constructor in the abstract BinaryDictionary right now? 9c2c971
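The constraint described in the last comment can be illustrated with a small sketch. Java does not allow `this` (and therefore `getClass()`) to be referenced in the arguments of an explicit `this(...)` or `super(...)` call, so a base-class constructor cannot compute a per-subclass resource name and then delegate. One way around it, along the lines the comment proposes, is for each concrete class to build the stream supplier from its own class literal in a static method. All names below (`BaseDictSketch`, `TokenDictSketch`, `resourceFor`) are hypothetical, and an in-memory byte array stands in for a real classpath resource.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

/** Hypothetical stand-in for the abstract BinaryDictionary. */
abstract class BaseDictSketch {
  private final String payload;

  // The base class only sees a stream supplier; it never needs getClass().
  protected BaseDictSketch(Supplier<InputStream> resource) throws IOException {
    try (InputStream is = resource.get()) {
      this.payload = new String(is.readAllBytes(), StandardCharsets.UTF_8);
    }
  }

  String payload() {
    return payload;
  }
}

/** Hypothetical concrete dictionary that locates its own resource. */
final class TokenDictSketch extends BaseDictSketch {
  TokenDictSketch() throws IOException {
    // Does not compile: super(resourceFor(getClass()));
    // `this` cannot be referenced before the superclass constructor has been called.
    super(resourceFor(TokenDictSketch.class));
  }

  // Static, so it is legal inside the super(...) argument list.
  private static Supplier<InputStream> resourceFor(Class<?> clazz) {
    // Stand-in for clazz.getResourceAsStream(clazz.getSimpleName() + ".dat").
    final byte[] bytes = (clazz.getSimpleName() + ".dat").getBytes(StandardCharsets.UTF_8);
    return () -> new ByteArrayInputStream(bytes);
  }
}
```

The trade-off is that each concrete class repeats a small amount of lookup code, but the base class no longer needs an if-else on a resource scheme.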