-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-10400: revise binary dictionaries' constructor in kuromoji #643
LUCENE-10400: revise binary dictionaries' constructor in kuromoji #643
Conversation
@uschindler @msokolov |
Hi, I am a bit busy, let me check later! Could take over weekend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not my full review, just something I already noticed last night:
* @param resourceLocation where to load resources from | ||
* @throws IOException | ||
*/ | ||
public ConnectionCosts(String resourceLocation) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should take Path
as Parameter, no String. According to Robert, Paths.get()
should be put on forbidden list, as it makes testing with custom path impls impossible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the interfaces to take the valid Path objects.
7841817
One thing that I wanted to add: We should add @deprecated(forRemoval=true), this makes builds very noisy if you use the deprecated API. By adding a since="9.1" tag we may also remove it earlier than 10.0. Of course the deprecated APIs should not merged into "main", but maybe it is easier to keep them in in first commit and rmove in a secondary PR afterwards only in main. |
Thanks, I was thinking the same thing. Mark as Deprecated the two-args constructors for removal on branch_9x, and remove them on main. |
} | ||
|
||
protected static Supplier<InputStream> openFileOrThrowRuntimeException(Path path) | ||
throws RuntimeException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use IOExceptionUnchecked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UncheckedIOException
:-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to throw UncheckedIOException. Also, I added a utility interface to wrap the try-catch clauses here and there.
cb1c549
InputStream dictIS = dictResource.get()) { | ||
DataInput in = new InputStreamDataInput(mapIS); | ||
CodecUtil.checkHeader(in, TARGETMAP_HEADER, VERSION, VERSION); | ||
targetMap = new int[in.readVInt()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we read the targetMap in a separate function for better readability/modularity?
this.resourceScheme = null; | ||
this.resourcePath = null; | ||
|
||
int[] targetMapOffsets = null, targetMap = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it necessary to assign nulls to these? Could we assign directly to members below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm now I see we basically just copy-paste the big ugly method that was here before. I wonder if it's possible to share the same implementation -- in other words can we make the old constructor call this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, we can't call/delegate to other this()
from the old constructor. A kind of chicken-egg problem is there - getClass()
in it - this is needed to delegate to the newly added constructor that takes input streams, but it's a constructor so getClass()
can't be invoked before this()
.
Instead, we can retire the old constructor and call this()
or new super()
in the implementation classes without changing public APIs. Anyway, it's not great to have the if-else for resource switching and the current super()
call to delegate class resources loading; I think it is okay to remove the old constructor in the abstract BinaryDictionary
right now?
9c2c971
+ sourceId); | ||
targetMapOffsets[sourceId] = targetMap.length; | ||
|
||
in = new InputStreamDataInput(posIS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here - can we split out into a separate function? I guess Java's handling of final makes it annoying to assign to these members in a function called from the constructor. Perhaps we can return a String[][] ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally all fields and variables should be final, so I agree. In addition, assigning null to variables first and then reassigning a real value is bad code pattern! They should be final and only once assigned. This is a relic from the old code, we should clean it up.
For the special case where you need to pass a value to super.ctor, you can use a static method. All other code should be moved to ctor only. This is especially important for multithreading, because the ctor is ensured to be atomic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The synchronized is not needed if the method is private and only called from ctor.
Problematic are methods tha are public or public to subclasses or those which have sideeffects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The synchronized is not needed if the method is private and only called from ctor.
I will remove "synchronized" here again.
Encountered an issue with
Do we have to change tests to get rid of the deprecated method calls? |
Hi, as we want to remove the methods anyway (at least in master), fixing the tests would be a good idea. If we need to maintain test in 9.x testing the old methods, a |
In general I like the idea here, but I would do one change: You already defined a new functional interface, but the public methods taking Supplyier. I am against uselessly wrapping exceptions just because the stupid old Java concept of "checked" exceptions. My proposal would be:
If you still want to go the wrapping, you may also have a look at https://twitter.com/thetaph1/status/1487397163718135812 |
I added Also, the suggested |
* @see java.util.function.Supplier | ||
*/ | ||
@FunctionalInterface | ||
public interface ResourceSupplier<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename this to IOSupplier<T>
to be in line with the other interfaces before. I wasn't aware that there were already interfaces like this.
@@ -528,7 +528,7 @@ public static void fsync(Path fileToSync, boolean isDir) throws IOException { | |||
} | |||
|
|||
/** | |||
* A resource supplier function that may throw on IOException | |||
* A resource supplier function that may throw an IOException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add a note that the consumer has to close the resource (eg., use try-with-resources) and the supplier can count on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a note here. 4e96ed4
This isn't exactly right - it is legal bytecode but it isn't exactly playing well with the language. The calling code may expect your interface to never throw a checked exception and thus only catch a selected subset of unchecked ones - a checked rethrown exception violates this semantics. To be a bit more concrete, imagine if the library code had this:
From the Java language point of view this either returns normally or sets the exception flag - it is not possible to write Java code that bypasses both processed and exceptionHappened. But the sneaky-rethrows hack does allow the supplier to go directly up the stack, up until an exception handler for a checked exception is found. This is where the problem is - it may leave the library code in inconsistent state. I agree the lack of any exception signature on streams is terrible but I prefer to have wrappers that rethrow an unchecked exception, with the original one as the cause - this is legal and leaves no space for surprises like the one above. |
Assuming this is close to being ready for merge, I'm going to prepare a changes entry for 9.1 (for deprecated APIs with removal.) Once this is backported I will open another pull request for main to remove the obsolete APIs with another changes entry (for removed APIs.) Then will need to repeat the same for Nori. |
Hi @dweiss: Let's not discuss this here, the whole thing does not apply to this PR. My recommendation was to add another functional interface to IOUtils (we already created them there). I would never design my APIs in a way like Java streams API was done, so having a separate functional interface for IO actions is a good idea and was applied by Tomoko to this PR. So no need for further work. The above tweet is more about working around problems with Java Streams API. We know for sure that Java Streams API internally handles exception in a correct way (using finally blocks), so the code example you posted does not apply there. To me, checked exceptions are a misconception of Java, but you can start flamewars about that. Another misconception was to let MethodHandle.invoke() throw Throwable. Thats another case where sneaky rethoring must be done to have nice and correct behaviour for the caller of the stuff you do internally with MethodHandles. But there it is no problem because you left Java Type System already! So inside methods that call MethodHandle.invoke(), it is recommended and perfectly fine to use rethrow, as long as you make sure the caller form public API gets correct exceptions (means you sneaky rethrow all excepions, but you declare the method to have the checked exceptions you know of). BTW, Java internally also rethrows like this in the lamda bootstraps (I rewrote it for Elasticsearch Painless language). So whenever you have lambdas or MethodReferences in your code, sneaky rethrowing is used in bootstraps extensive. In general: Every cast is a misuse of the Java language, and so is this one. What we do here is just casting the exception type. So it is a misuse of type system (like every cast is), but it is not a misuse of the language. Its perfectly legal to do it. If you try out my above Sneaky class and look (with eclipse) at the types introduced for the generics type arguments, it is funny how intelligent the Java compiler is: If you have a functional interface throwing I already talked too much, my problem with Sneaky is that if you use such type casting, the Java Compiler won't allow you to catch checked Exceptions anymore, that's the worst thing that happens - but Sneaky works great if you make sure that the method itsself calling your erased Java streams is redeclaring One last thing: The above try..catch also leads to surprises, if the method you call suddenly changes throws in a later version of the API and you did not recompile it. So try...catch blocks like this are a risky and errorprone/PMD warn you about such code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine now. I am glad that the horrible API is going away.
@@ -491,6 +491,7 @@ public void testUserDict3() throws Exception { | |||
} | |||
|
|||
// Make sure loading custom dictionaries from classpath works: | |||
@SuppressWarnings("removal") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test will be obsolete in main, so we can remove it completely. It does not test anything useful. We may need to adda test that really loads a resource from a file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am going to handle this in the following PR.
I know it can be written differently, but that's not the point, Uwe - I just showed an example of what can be lurking in the code you can't/ don't control (including the streams implementation). Like I said, I got burnt on this once and I tried not to abuse the type system's loopholes ever since. |
That's OK. I just wanted to say: When working with MethopdHandles the general rule, also internally followed by the JDK code is to rethrow the original exception. If you have a method Handle you are the only one who knows from the inspection of the type system which Exceptions it really throws. If you would me to enforce not use rethrow or type casting here, I'd like you to open many bug reports in Elasticsearch Painless and the JDK source code's implementations of StringConcatFactory and LambdaMetaFactory. Both of those just rethrow exceptions everywhere. When working with MethodHandles you have to accept that you have left the type system and have to implement the type system on your own -- sorry. There it does not help if you wrap with runtime exceptions, because then the caller no longer gets exceptions that were expected. The problem I see with RuntimeExceptions and wrapping them for nonsense is that the caller of your method doing this no longer sees original exceptions. So you have to find the balance. And my balance clearly says: Throw exactly what you get inside your complex method handles logic. It is only important to REDECLARE the Exceptions excpected in the public API you expose to outside (that's what JDK and Painess does). Internally it ignores type system (also for performance). |
private TokenInfoDictionary() throws IOException { | ||
this(ResourceScheme.CLASSPATH, null); | ||
private static InputStream getClassResource(String suffix) throws IOException { | ||
final String resourcePath = TokenInfoDictionary.class.getSimpleName() + suffix; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This "simple name + suffix" naming convention is ubiquitous in kuromoji (and nori). I am wondering if this is an appropriate method to locate resources. I think this question could pop up again when we closely look at BinaryDictionaryWriter#write().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is at least fine for the "default loading" for our own resources. It should not be used for public resources (there we now only have constructors taking file streams, as it makes no sense for external files).
Thanks for reviewing, I'm going to merge this. I will open another pull request to remove the obsolete methods on main. |
I opened #655. |
* | ||
* @see java.util.function.Supplier | ||
*/ | ||
@FunctionalInterface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What aboutorg.apache.lucene.util.IOSupplier
that already exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't notice there is o.a.l.util.IOSupplier
. It's confusing to have two interfaces with the same signature.
Also, there are other functional interfaces in o.a.l.util.IOUtils
corresponding to Java stdlib's Consumer and Function.
I think it might be better to move all functional interfaces in o.a.l.util.IOUtils
to top-level interfaces? @uschindler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was also not aware that this interface already existed. Because all others were in IOUtils, so this was inconsistent before.
IMHO, I agree with @mocobeta - Move all of them to IOUtils. Separate classes bloat the Javadocs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's open a separate issue and remove the top-level one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I slightly prefer to have top-level functional interfaces o.a.l.util.[IOSupplier|IOConsumer|IOFunction]
for visibility since we will use them more frequently in the future. And there are only two such existing functional interfaces in IOUtils
, so moving them to top-level would not be difficult - what do you think about it?
But I'm not so familiar with the convention in there, and am fine with having all of them under o.a.l.util.IOUtils
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems there is a bit of disagreement in the API design. I think we'd need to solve this inconsistency in the core util's API to proceed with LUCENE-10400, I will open an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, both is fine. Actually the nested ones have a naming problem: "IOUtils.IOSupplier" has two times IO, so either "IOSupplier" or "IOUtils.Supplier".
Excuse me, I wrote the previous comment before seeing your comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad that we have reached an agreement here. Those are public APIs, though they may be mostly used in lambdas, so I think an issue for this clean-up is needed anyway? I'll open an issue and a PR for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaning towards class-per-file convention as well. Too many nested (public) classes makes my head spin.
I think this depends on the use case. If the functional interface should be used by many classes throughout Lucene's API, I agree with that.
The ones in IOUtils were originally only meant to be part of method signatures inside IOUtils. In this case, it is better to hide them inside the IOUtils class, because nobody would refer to them in import statements! You call a method in IOUtils and use the functional interface implicitely, the name does not matter as you never import or use the interface name directly). But as soon as many APIs like kuromoji use the functional interface, then it is better to place them as top-level classes. Full agreement.
I think that happened with IOUtils: At beginning the interfaces were meant as method parameters to some methods only, but now they graduated to top-level citicens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, could anybody review #673?
We need to apply the same changes here to Nori. I opened #693. |
I drafted an idea to revise the dictionary constructors.
Main changes:
Supplier
s that returns InputStream.Each dictionary class now has two constructors:
Class.getResourceAsStream()
without delegation. This covers most use-cases.CharacterDefinition
doesn't have this public constructor since this loads only class resources.Also, each class (except for
CharacterDefinition
) has an internal auxiliary constructor that is invoked from the two entry-point constructors and populates instance fields from the given input streams (created from classpath or file path, or anything else).