Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10400: revise binary dictionaries' constructor in kuromoji #643

Merged

Conversation

mocobeta
Copy link
Contributor

@mocobeta mocobeta commented Feb 3, 2022

I drafted an idea to revise the dictionary constructors.

Main changes:

  1. always use Class.getResourceAsStream() to load class resources. (don't delegate it to the parent class)
  2. deprecate the constructors that switch the resource location (classpath/file path).
    • instead, break it down into two constructors - one for the classpath (non-arg constructor) and another (one-arg constructor) for the external file path.
    • the two constructors are exactly the same except for the way to locate the resource, I introduced an internal constructor that takes Suppliers that returns InputStream.

Each dictionary class now has two constructors:

  • A private no-arg constructor. This loads class resources by directly invoking Class.getResourceAsStream() without delegation. This covers most use-cases.
  • A public one-arg constructor. This loads external resources from the user's file system. For advanced users.
    • CharacterDefinition doesn't have this public constructor since this loads only class resources.

Also, each class (except for CharacterDefinition) has an internal auxiliary constructor that is invoked from the two entry-point constructors and populates instance fields from the given input streams (created from classpath or file path, or anything else).

@mocobeta mocobeta changed the title LUCENE-10400: revise to load dictionary resources in kuromoji LUCENE-10400: revise constructors to load dictionary resources in kuromoji Feb 4, 2022
@mocobeta
Copy link
Contributor Author

mocobeta commented Feb 4, 2022

@uschindler @msokolov
Would you take a look at the revised constructors' API design when you have some time? If this gains basic consensus, I will add tests for the newly introduced APIs.

@uschindler
Copy link
Contributor

Hi, I am a bit busy, let me check later! Could take over weekend.

Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not my full review, just something I already noticed last night:

* @param resourceLocation where to load resources from
* @throws IOException
*/
public ConnectionCosts(String resourceLocation) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should take Pathas Parameter, no String. According to Robert, Paths.get() should be put on forbidden list, as it makes testing with custom path impls impossible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the interfaces to take the valid Path objects.
7841817

@uschindler
Copy link
Contributor

One thing that I wanted to add: We should add @deprecated(forRemoval=true), this makes builds very noisy if you use the deprecated API. By adding a since="9.1" tag we may also remove it earlier than 10.0.

Of course the deprecated APIs should not merged into "main", but maybe it is easier to keep them in in first commit and rmove in a secondary PR afterwards only in main.

@mocobeta
Copy link
Contributor Author

mocobeta commented Feb 4, 2022

One thing that I wanted to add: We should add @deprecated(forRemoval=true), this makes builds very noisy if you use the deprecated API. By adding a since="9.1" tag we may also remove it earlier than 10.0.

Of course the deprecated APIs should not merged into "main", but maybe it is easier to keep them in in first commit and rmove in a secondary PR afterwards only in main.

Thanks, I was thinking the same thing. Mark as Deprecated the two-args constructors for removal on branch_9x, and remove them on main.

}

protected static Supplier<InputStream> openFileOrThrowRuntimeException(Path path)
throws RuntimeException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use IOExceptionUnchecked

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UncheckedIOException :-)

Copy link
Contributor Author

@mocobeta mocobeta Feb 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to throw UncheckedIOException. Also, I added a utility interface to wrap the try-catch clauses here and there.
cb1c549

InputStream dictIS = dictResource.get()) {
DataInput in = new InputStreamDataInput(mapIS);
CodecUtil.checkHeader(in, TARGETMAP_HEADER, VERSION, VERSION);
targetMap = new int[in.readVInt()];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we read the targetMap in a separate function for better readability/modularity?

this.resourceScheme = null;
this.resourcePath = null;

int[] targetMapOffsets = null, targetMap = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to assign nulls to these? Could we assign directly to members below?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm now I see we basically just copy-paste the big ugly method that was here before. I wonder if it's possible to share the same implementation -- in other words can we make the old constructor call this one?

Copy link
Contributor Author

@mocobeta mocobeta Feb 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we can't call/delegate to other this() from the old constructor. A kind of chicken-egg problem is there - getClass() in it - this is needed to delegate to the newly added constructor that takes input streams, but it's a constructor so getClass() can't be invoked before this().

Instead, we can retire the old constructor and call this() or new super() in the implementation classes without changing public APIs. Anyway, it's not great to have the if-else for resource switching and the current super() call to delegate class resources loading; I think it is okay to remove the old constructor in the abstract BinaryDictionary right now?
9c2c971

+ sourceId);
targetMapOffsets[sourceId] = targetMap.length;

in = new InputStreamDataInput(posIS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here - can we split out into a separate function? I guess Java's handling of final makes it annoying to assign to these members in a function called from the constructor. Perhaps we can return a String[][] ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally all fields and variables should be final, so I agree. In addition, assigning null to variables first and then reassigning a real value is bad code pattern! They should be final and only once assigned. This is a relic from the old code, we should clean it up.
For the special case where you need to pass a value to super.ctor, you can use a static method. All other code should be moved to ctor only. This is especially important for multithreading, because the ctor is ensured to be atomic.

Copy link
Contributor Author

@mocobeta mocobeta Feb 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd do this.
8bfc984

and maybe add synchronized
ecc4ee9

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The synchronized is not needed if the method is private and only called from ctor.

Problematic are methods tha are public or public to subclasses or those which have sideeffects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The synchronized is not needed if the method is private and only called from ctor.

I will remove "synchronized" here again.

@mocobeta
Copy link
Contributor Author

mocobeta commented Feb 5, 2022

Encountered an issue with @Deprecated(forRemoval = true). We can't call them even from test cases since @SuppressWarnings("deprecation") doesn't work when it is marked as "for removal", and then test compilation fails with warnings.

> Task :lucene:analysis:kuromoji:compileTestJava FAILED
/mnt/hdd/repo/lucene/lucene/analysis/kuromoji/src/test/org/apache/lucene/analysis/ja/dict/TestTokenInfoDictionary.java:81: warning: [removal] TokenInfoDictionary(ResourceScheme,String) in TokenInfoDictionary has been deprecated and marked for removal
    return new TokenInfoDictionary(ResourceScheme.FILE, dir.resolve(dictionaryPath).toString());

Do we have to change tests to get rid of the deprecated method calls?

@uschindler
Copy link
Contributor

Encountered an issue with @Deprecated(forRemoval = true). We can't call them even from test cases since @SuppressWarnings("deprecation") doesn't work when it is marked as "for removal", and then test compilation fails with warnings.

> Task :lucene:analysis:kuromoji:compileTestJava FAILED
/mnt/hdd/repo/lucene/lucene/analysis/kuromoji/src/test/org/apache/lucene/analysis/ja/dict/TestTokenInfoDictionary.java:81: warning: [removal] TokenInfoDictionary(ResourceScheme,String) in TokenInfoDictionary has been deprecated and marked for removal
    return new TokenInfoDictionary(ResourceScheme.FILE, dir.resolve(dictionaryPath).toString());

Do we have to change tests to get rid of the deprecated method calls?

Hi,

as we want to remove the methods anyway (at least in master), fixing the tests would be a good idea. If we need to maintain test in 9.x testing the old methods, a @SuppressWarnings("removal") would be applicable.

@uschindler
Copy link
Contributor

In general I like the idea here, but I would do one change: You already defined a new functional interface, but the public methods taking Supplyier. I am against uselessly wrapping exceptions just because the stupid old Java concept of "checked" exceptions.

My proposal would be:

  • Define a @FunctionalInterface public interface ResourceSupplier<T> { T get() throws IOException; } in Lucene's core module next to IOUtils (or maybe inside IOUtils). I say this, because we may use the supplier pattern now more often and when Nori is also changed, we would need the interface again.
  • The constructors take ResourceSupplier<InputStream>

If you still want to go the wrapping, you may also have a look at https://twitter.com/thetaph1/status/1487397163718135812
This shows that you can trick out Java compiler and convert functional interfaces (by casting them using generics) to allow them to throw checked exceptions. I had a similar problem like this in some large codebase using SQLException and many Function, Supplier, Consumer. The trick was to just make the exceptions invisible to the stupid javac compiler :-) - it is perfectly legal code and won't break as it follows the laguage spec. It is just casting Exceptions away.

@mocobeta
Copy link
Contributor Author

mocobeta commented Feb 5, 2022

I added @SuppressWarnings("removal") to tests that invoke deprecated methods for now - to keep the tests as-is on branch_9x. I will remove the obsolete methods and change tests in a separate patch (after backporting this).

Also, the suggested ResourceSupplier<T> interface works very well, thanks.
As for the wrapping checked exception trick (type erasure plays the main role there, maybe?), it's cool but seems too much for me.

* @see java.util.function.Supplier
*/
@FunctionalInterface
public interface ResourceSupplier<T> {
Copy link
Contributor

@uschindler uschindler Feb 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename this to IOSupplier<T> to be in line with the other interfaces before. I wasn't aware that there were already interfaces like this.

@@ -528,7 +528,7 @@ public static void fsync(Path fileToSync, boolean isDir) throws IOException {
}

/**
* A resource supplier function that may throw on IOException
* A resource supplier function that may throw an IOException
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a note that the consumer has to close the resource (eg., use try-with-resources) and the supplier can count on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a note here. 4e96ed4

@dweiss
Copy link
Contributor

dweiss commented Feb 5, 2022

The trick was to just make the exceptions invisible to the stupid javac compiler :-) - it is perfectly legal code and won't break as it follows the laguage spec. It is just casting Exceptions away.

This isn't exactly right - it is legal bytecode but it isn't exactly playing well with the language. The calling code may expect your interface to never throw a checked exception and thus only catch a selected subset of unchecked ones - a checked rethrown exception violates this semantics. To be a bit more concrete, imagine if the library code had this:

void foo(Supplier<T> t) {
  try {
    doSomething(t.get());
    processed = true;
  } catch (RuntimeException | Error e) {
    exceptionHappened = true;
  }
}

From the Java language point of view this either returns normally or sets the exception flag - it is not possible to write Java code that bypasses both processed and exceptionHappened. But the sneaky-rethrows hack does allow the supplier to go directly up the stack, up until an exception handler for a checked exception is found. This is where the problem is - it may leave the library code in inconsistent state.

I agree the lack of any exception signature on streams is terrible but I prefer to have wrappers that rethrow an unchecked exception, with the original one as the cause - this is legal and leaves no space for surprises like the one above.

@mocobeta mocobeta marked this pull request as ready for review February 6, 2022 04:37
@mocobeta
Copy link
Contributor Author

mocobeta commented Feb 6, 2022

Assuming this is close to being ready for merge, I'm going to prepare a changes entry for 9.1 (for deprecated APIs with removal.) Once this is backported I will open another pull request for main to remove the obsolete APIs with another changes entry (for removed APIs.) Then will need to repeat the same for Nori.
I will keep this open for one or two days for further review.

@uschindler
Copy link
Contributor

uschindler commented Feb 6, 2022

This isn't exactly right - it is legal bytecode but it isn't exactly playing well with the language.

Hi @dweiss: Let's not discuss this here, the whole thing does not apply to this PR. My recommendation was to add another functional interface to IOUtils (we already created them there).

I would never design my APIs in a way like Java streams API was done, so having a separate functional interface for IO actions is a good idea and was applied by Tomoko to this PR. So no need for further work.

The above tweet is more about working around problems with Java Streams API. We know for sure that Java Streams API internally handles exception in a correct way (using finally blocks), so the code example you posted does not apply there. To me, checked exceptions are a misconception of Java, but you can start flamewars about that. Another misconception was to let MethodHandle.invoke() throw Throwable. Thats another case where sneaky rethoring must be done to have nice and correct behaviour for the caller of the stuff you do internally with MethodHandles. But there it is no problem because you left Java Type System already! So inside methods that call MethodHandle.invoke(), it is recommended and perfectly fine to use rethrow, as long as you make sure the caller form public API gets correct exceptions (means you sneaky rethrow all excepions, but you declare the method to have the checked exceptions you know of). BTW, Java internally also rethrows like this in the lamda bootstraps (I rewrote it for Elasticsearch Painless language). So whenever you have lambdas or MethodReferences in your code, sneaky rethrowing is used in bootstraps extensive.

In general: Every cast is a misuse of the Java language, and so is this one. What we do here is just casting the exception type. So it is a misuse of type system (like every cast is), but it is not a misuse of the language. Its perfectly legal to do it. If you try out my above Sneaky class and look (with eclipse) at the types introduced for the generics type arguments, it is funny how intelligent the Java compiler is: If you have a functional interface throwing E extends Exception and the code does not throw any checked exception, Javac inserts RuntimeException into the type argument.

I already talked too much, my problem with Sneaky is that if you use such type casting, the Java Compiler won't allow you to catch checked Exceptions anymore, that's the worst thing that happens - but Sneaky works great if you make sure that the method itsself calling your erased Java streams is redeclaring throws IOException. The try...catch posted above is.... bad code anyways. Better use finally blocks to ensure something is executed for sure.

One last thing: The above try..catch also leads to surprises, if the method you call suddenly changes throws in a later version of the API and you did not recompile it. So try...catch blocks like this are a risky and errorprone/PMD warn you about such code.

Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine now. I am glad that the horrible API is going away.

@@ -491,6 +491,7 @@ public void testUserDict3() throws Exception {
}

// Make sure loading custom dictionaries from classpath works:
@SuppressWarnings("removal")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test will be obsolete in main, so we can remove it completely. It does not test anything useful. We may need to adda test that really loads a resource from a file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am going to handle this in the following PR.

@dweiss
Copy link
Contributor

dweiss commented Feb 6, 2022

The try...catch posted above is.... bad code anyways. Better use finally blocks to ensure something is executed for sure.

I know it can be written differently, but that's not the point, Uwe - I just showed an example of what can be lurking in the code you can't/ don't control (including the streams implementation). Like I said, I got burnt on this once and I tried not to abuse the type system's loopholes ever since.

@uschindler
Copy link
Contributor

uschindler commented Feb 6, 2022

The try...catch posted above is.... bad code anyways. Better use finally blocks to ensure something is executed for sure.

I know it can be written differently, but that's not the point, Uwe - I just showed an example of what can be lurking in the code you can't/ don't control (including the streams implementation). Like I said, I got burnt on this once and I tried not to abuse the type system's loopholes ever since.

That's OK. I just wanted to say: When working with MethopdHandles the general rule, also internally followed by the JDK code is to rethrow the original exception. If you have a method Handle you are the only one who knows from the inspection of the type system which Exceptions it really throws.

If you would me to enforce not use rethrow or type casting here, I'd like you to open many bug reports in Elasticsearch Painless and the JDK source code's implementations of StringConcatFactory and LambdaMetaFactory. Both of those just rethrow exceptions everywhere. When working with MethodHandles you have to accept that you have left the type system and have to implement the type system on your own -- sorry. There it does not help if you wrap with runtime exceptions, because then the caller no longer gets exceptions that were expected. The problem I see with RuntimeExceptions and wrapping them for nonsense is that the caller of your method doing this no longer sees original exceptions. So you have to find the balance. And my balance clearly says: Throw exactly what you get inside your complex method handles logic. It is only important to REDECLARE the Exceptions excpected in the public API you expose to outside (that's what JDK and Painess does). Internally it ignores type system (also for performance).

private TokenInfoDictionary() throws IOException {
this(ResourceScheme.CLASSPATH, null);
private static InputStream getClassResource(String suffix) throws IOException {
final String resourcePath = TokenInfoDictionary.class.getSimpleName() + suffix;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "simple name + suffix" naming convention is ubiquitous in kuromoji (and nori). I am wondering if this is an appropriate method to locate resources. I think this question could pop up again when we closely look at BinaryDictionaryWriter#write().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is at least fine for the "default loading" for our own resources. It should not be used for public resources (there we now only have constructors taking file streams, as it makes no sense for external files).

@mocobeta mocobeta changed the title LUCENE-10400: revise constructors to load dictionary resources in kuromoji LUCENE-10400: revise binary dictionaries' constructor in kuromoji Feb 7, 2022
@mocobeta
Copy link
Contributor Author

mocobeta commented Feb 7, 2022

Thanks for reviewing, I'm going to merge this. I will open another pull request to remove the obsolete methods on main.

@mocobeta mocobeta merged commit e7546c2 into apache:main Feb 7, 2022
@mocobeta mocobeta deleted the jira/LUCENE-10400-cleanup-kuromoji-dictionary branch February 7, 2022 10:31
@mocobeta
Copy link
Contributor Author

mocobeta commented Feb 7, 2022

I opened #655.
I think the diff would be obvious - will merge it tomorrow.

*
* @see java.util.function.Supplier
*/
@FunctionalInterface
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What aboutorg.apache.lucene.util.IOSupplier that already exists?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice there is o.a.l.util.IOSupplier. It's confusing to have two interfaces with the same signature.

Also, there are other functional interfaces in o.a.l.util.IOUtils corresponding to Java stdlib's Consumer and Function.
I think it might be better to move all functional interfaces in o.a.l.util.IOUtils to top-level interfaces? @uschindler

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also not aware that this interface already existed. Because all others were in IOUtils, so this was inconsistent before.

IMHO, I agree with @mocobeta - Move all of them to IOUtils. Separate classes bloat the Javadocs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's open a separate issue and remove the top-level one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I slightly prefer to have top-level functional interfaces o.a.l.util.[IOSupplier|IOConsumer|IOFunction] for visibility since we will use them more frequently in the future. And there are only two such existing functional interfaces in IOUtils, so moving them to top-level would not be difficult - what do you think about it?
But I'm not so familiar with the convention in there, and am fine with having all of them under o.a.l.util.IOUtils.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems there is a bit of disagreement in the API design. I think we'd need to solve this inconsistency in the core util's API to proceed with LUCENE-10400, I will open an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, both is fine. Actually the nested ones have a naming problem: "IOUtils.IOSupplier" has two times IO, so either "IOSupplier" or "IOUtils.Supplier".

Excuse me, I wrote the previous comment before seeing your comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad that we have reached an agreement here. Those are public APIs, though they may be mostly used in lambdas, so I think an issue for this clean-up is needed anyway? I'll open an issue and a PR for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning towards class-per-file convention as well. Too many nested (public) classes makes my head spin.

I think this depends on the use case. If the functional interface should be used by many classes throughout Lucene's API, I agree with that.

The ones in IOUtils were originally only meant to be part of method signatures inside IOUtils. In this case, it is better to hide them inside the IOUtils class, because nobody would refer to them in import statements! You call a method in IOUtils and use the functional interface implicitely, the name does not matter as you never import or use the interface name directly). But as soon as many APIs like kuromoji use the functional interface, then it is better to place them as top-level classes. Full agreement.

I think that happened with IOUtils: At beginning the interfaces were meant as method parameters to some methods only, but now they graduated to top-level citicens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, could anybody review #673?

@mocobeta
Copy link
Contributor Author

We need to apply the same changes here to Nori. I opened #693.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants