OCR integration #13313
Note that your PR will not be reviewed/accepted until you have gone through the mandatory checks in the description and marked each of them exactly in the format of
File pdfFile = pdfPath.toFile();
if (!pdfFile.exists()) {
    throw new OcrException("PDF file does not exist: " + pdfPath);
}
Use exceptions only for exceptional states. See https://github.com/HugoMatilla/Effective-JAVA-Summary?tab=readme-ov-file#57-use-exceptions-only-for-exceptional-conditions
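To illustrate the point, here is a minimal sketch in which a missing PDF is reported through the return value instead of an exception. The class `PdfInputCheck` and the method `findProblem` are hypothetical names chosen for this example and are not part of the PR.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper, not taken from the PR: a missing file is an expected
// situation, so it is reported as a value rather than thrown as an exception.
class PdfInputCheck {

    /** Returns a description of the problem, or an empty Optional if the file looks usable. */
    static Optional<String> findProblem(Path pdfPath) {
        if (!Files.exists(pdfPath)) {
            return Optional.of("PDF file does not exist: " + pdfPath);
        }
        if (!Files.isRegularFile(pdfPath)) {
            return Optional.of("Not a regular file: " + pdfPath);
        }
        return Optional.empty();
    }
}
```

The caller can then decide whether to show a dialog, log the message, or skip the file, without a try-catch around an expected case.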
JUnit tests of You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running how-to in the section Final build system checks in our setup guide.
 * Currently uses Tesseract with English language support.
 */
public OcrService() {
    if (Platform.isMac()) {
Adapted from https://github.com/nguyenq/tess4j/pull/240/files
 * This exception wraps lower-level OCR engine exceptions to provide
 * a consistent interface for error handling throughout JabRef.
 */
public class OcrException extends Exception {
I like the idea; the only drawback I see is that this may be hard to maintain or stay consistent with as the project grows or if external contributors wish to add something...
Best is to avoid exceptions being thrown at all, especially for expected possible stupid user behaviour. And if they are being thrown, keep them as informative as possible. The consistent interface is Exception at the end either way.
Backstory: This was based on my experience in OO - all the JStyle classes and the OO GUI class had a custom error wrapper, OOError. There was also OOResult and OOVoidResult. I still don't know much about them, but OOError, apart from just wrapping exceptions, also acted as an interface for localizing and displaying error messages. Unsurprisingly, when I was new, I found these hard to use. So I started by using native exceptions as thrown by the library methods during the CSL project.
It was convenient, and it worked, and people were fine with the inconsistency, so now 50% of OO uses that, 50% doesn't. I mentioned this as some free time refactoring in #11829, but never had the energy to change it.
public static OcrResult success(String text) {
    return new OcrResult(true, text, null);
}
Factory method passes null explicitly as a parameter. This violates the principle of not passing null to methods. Consider using Optional or restructuring to avoid null parameter.
public static OcrResult failure(String errorMessage) {
    return new OcrResult(false, null, errorMessage);
}
Factory method passes null explicitly as a parameter. This violates the principle of not passing null to methods. Consider using Optional or restructuring to avoid null parameter.
Optional should only be a return type
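One way to reconcile the two remarks above is to keep nullable fields private and expose them only through `Optional`-returning accessors. The following is a rough sketch under that assumption; `OcrResultSketch` is a hypothetical stand-in for the PR's `OcrResult`, not its actual code.

```java
import java.util.Objects;
import java.util.Optional;

// Hypothetical variant of OcrResult, written for illustration; not the PR's code.
final class OcrResultSketch {
    private final String text;          // null only when OCR failed
    private final String errorMessage;  // null only when OCR succeeded

    private OcrResultSketch(String text, String errorMessage) {
        this.text = text;
        this.errorMessage = errorMessage;
    }

    static OcrResultSketch success(String text) {
        return new OcrResultSketch(Objects.requireNonNull(text), null);
    }

    static OcrResultSketch failure(String errorMessage) {
        return new OcrResultSketch(null, Objects.requireNonNull(errorMessage));
    }

    boolean isSuccess() {
        return text != null;
    }

    // Optional appears only as a return type, never as a field or parameter.
    Optional<String> getText() {
        return Optional.ofNullable(text);
    }

    Optional<String> getErrorMessage() {
        return Optional.ofNullable(errorMessage);
    }
}
```

The private constructor still receives `null`, but that detail never leaves the class; callers only ever see the two factory methods and the `Optional` accessors.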
// The OCR engine instance
private final Tesseract tesseract;
The comment is trivial and can be derived directly from the code. It doesn't add any new information about the implementation or reasoning behind using Tesseract.
// For now, we'll use a relative path that works during development
tesseract.setDatapath("tessdata");
for (String path : possiblePaths) {
    File tessdata = new File(path, "tessdata");
Use java.nio with Path and Files.exists(...) https://docs.oracle.com/javase/tutorial/essential/io/legacy.html
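A possible shape for the lookup with `java.nio.file` is sketched below. `TessdataLocator` and `findTessdata` are hypothetical names, and the list of candidate directories is assumed to be supplied by the caller; the PR's actual search paths are not reproduced here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration; the candidate directories are an assumption.
class TessdataLocator {

    /** Returns the first existing "tessdata" directory below the given candidates, if any. */
    static Optional<Path> findTessdata(List<Path> candidates) {
        return candidates.stream()
                .map(base -> base.resolve("tessdata"))
                .filter(Files::isDirectory)
                .findFirst();
    }
}
```

Returning an empty `Optional` when nothing is found leaves it to the caller to show an error dialog or fall back to a default, rather than branching into a `throw`.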
private final String text;
private final String errorMessage;

private OcrResult(boolean success, String text, String errorMessage) {
Maybe this can be converted to a record? https://docs.oracle.com/en/java/javase/24/language/records.html
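As a rough idea of what the record conversion could look like, assuming the three fields stay as they are: `OcrResultRecord` is a hypothetical name used here to avoid confusion with the PR's class.

```java
// Hypothetical record version of the result type; field names follow the PR's constructor.
record OcrResultRecord(boolean success, String text, String errorMessage) {

    static OcrResultRecord success(String text) {
        return new OcrResultRecord(true, text, null);
    }

    static OcrResultRecord failure(String errorMessage) {
        return new OcrResultRecord(false, null, errorMessage);
    }
}
```

The record generates the accessors, `equals`, `hashCode`, and `toString`. One trade-off: a record's canonical constructor cannot be less accessible than the record itself, so the factory methods can no longer be the only way to create an instance.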
        Localization.lang("OCR failed"),
        exception.getMessage()
    );
})
Before this, add .showToUser(true); then it will be shown in the UI task list.
private OcrResult(boolean success, String text, String errorMessage) {
    this.success = success;
    this.text = text;
    this.errorMessage = errorMessage;
}

public static OcrResult success(String text) {
    return new OcrResult(true, text, null);
There should be a better way to think about the constructor parameters. Maybe have different constructors to avoid passing null (as both success and failure are very expected cases)?
The other idea is to model a sealed class hierarchy with success/failure as simple records, and then you can do pattern matching with switch ...
https://medium.com/@sandip.v.salunkhe/java-record-and-sealed-classes-features-to-enhance-modelling-data-35705c571f70
example at the bottom https://gamlor.info/posts-output/2022-06-01-java-advanced-enums/en/
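A minimal sketch of the sealed-hierarchy idea, assuming Java 21 or later for pattern matching in `switch`. The names `OcrOutcome`, `Success`, `Failure`, and `OcrOutcomeDemo` are hypothetical and chosen only for this illustration.

```java
// Hypothetical model: each case carries only the data it needs, so no null is passed.
sealed interface OcrOutcome permits OcrOutcome.Success, OcrOutcome.Failure {
    record Success(String text) implements OcrOutcome { }

    record Failure(String errorMessage) implements OcrOutcome { }
}

class OcrOutcomeDemo {

    // The switch is exhaustive over the sealed hierarchy, so no default branch is needed.
    static String describe(OcrOutcome outcome) {
        return switch (outcome) {
            case OcrOutcome.Success s -> "Recognized " + s.text().length() + " characters";
            case OcrOutcome.Failure f -> "OCR failed: " + f.errorMessage();
        };
    }
}
```

If a third case is added to the hierarchy later, every such `switch` stops compiling until it handles the new case, which helps keep callers consistent.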
} else {
    throw new OcrException("Could not find tessdata directory. Please set TESSDATA_PREFIX environment variable.");
Avoid logical branching to throw exceptions.
Can this be modeled with try-catch?
If not, and if this is an expected case, there should just be an error dialog shown.
Closes #13267
In about one to three sentences, describe the changes you have made: what, where, why, ...
Steps to test
Describe how reviewers can test this fix/feature. Ideally, think of how you would guide a beginner user of JabRef to try out your change.
You can add screenshots or videos (using Loom or by just adding .mp4 files).
Mandatory checks
CHANGELOG.md
described in a way that is understandable for the average user (if change is visible to the user)