Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add kraken OCR engine #89

Draft
wants to merge 27 commits into
base: main
Choose a base branch
from
Draft

Add kraken OCR engine #89

wants to merge 27 commits into from

Conversation

stweil
Copy link
Contributor

@stweil stweil commented Aug 19, 2023

No description provided.

@stweil stweil marked this pull request as draft August 19, 2023 11:13
@stweil
Copy link
Contributor Author

stweil commented Aug 19, 2023

This is a first implementation to support the Open Source kraken OCR engine.

I created this draft pull request to allow public review and comments although the implementation is still incomplete. OCR for cropped image still needs testing, and there is also currently no unit test code for the new engine.

@samwilson
Copy link
Member

Also, would you mind opening a Phabricator task for this work, so it can be tracked there? Thanks!

@stweil
Copy link
Contributor Author

stweil commented Aug 28, 2023

Also, would you mind opening a Phabricator task for this work, so it can be tracked there? Thanks!

Done, see https://phabricator.wikimedia.org/T345055. I also updated the PR here to solve a merge conflict.

@stweil stweil force-pushed the kraken branch 2 times, most recently from d773984 to a704816 Compare August 28, 2023 08:39
@stweil
Copy link
Contributor Author

stweil commented Aug 28, 2023

See also my test installation.

@stweil
Copy link
Contributor Author

stweil commented Sep 22, 2023

The test installation is meanwhile available on https://kraken-ocr.wmcloud.org/.

Both are not language specific, but support historic and current scripts
used by many European languages.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
Kraken is an Open Source OCR engine with trainable segmentation and
OCR models. It can work with printed and handwritten texts.

This initial implementation comes with two generic OCR models which
can be used on a wide range of German publications, but also with
other languages which are based on Latin script.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Cropping is also implemented now, but still untested.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
…test)

Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Segmentation models are currently only supported for kraken.
All other OCR engines return an empty list.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Copy link
Collaborator

@Parthiv-M Parthiv-M left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This review focuses on the following main things

  • Files related to Transkribus have also been modified, which I think is not ideal
  • A new Transkribus model has also been added, which should be removed
  • Normalise text has been added here for all engines (we would want that in a separate PR)
  • Renaming the newly added API route

Overall, we'd like to keep the non-Kraken changes out of the way before testing kraken once again

/**
* Get a list of available segmentation models for use with a specific OCR engine.
*
* @Route("/api/available_segmentation_models", name="apiSegmentationModels", methods={"GET"})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This route, since it is related to Kraken, should be named /api/kraken/available_segmentation_models

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having segmentation models is not kraken specific. All OCR processes require a segmentation step, and if that step uses AI, it also requires a model. That's why I did not use a route with "kraken" here. So even if it is currently only used for kraken, I'd suggest to use a generic route.

Comment on lines +289 to +293
"german-fraktur-19th-20th-century": {
"transkribus": {
"htr": 37738
}
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Transkribus model should not be added along with Kraken changes, it would be better to separate it out

@@ -118,9 +118,6 @@ public function getResult(
): EngineResult {
$this->checkImageUrl( $imageUrl );

$image = $this->getImage( $imageUrl, $crop );
$imageUrl = $image->getUrl();

$points = '';
if ( $crop ) {
$x = $crop['x'];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd prefer to isolate changes other than those related to Kraken to another PR!

Comment on lines +39 to +41
$this->transkribusEngine = $this->instantiateEngine( 'transkribus' );

$this->transkribusEngine = $this->instatiateEngine( 'transkribus' );
$this->krakenEngine = $this->instantiateEngine( 'kraken' );
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instantiate() fixed in #115

description: A web service for Tesseract, Google and Transkribus OCR engines.
version: 1.0.0
description: A web service for Kraken, Tesseract, Google and Transkribus OCR engines.
version: 1.4.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it has been bumped up to 1.4.4 now.

areas:
path_patterns:
- ^/api$
- ^/api/available_langs$
- ^/api/available_segmentation_models$
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need to change this in accordance with my comment on route path

@@ -1,6 +1,6 @@
{
"@metadata": {},
"title": "WikimediaOCR",
"title": "WikimediaOCR – Kraken Test",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should remain as WikimediaOCR

@@ -13,6 +13,7 @@
"regenerator-runtime": "^0.13.11",
"select2": "^4.0.13",
"select2-bootstrap-theme": "0.1.0-beta.10",
"stylelint": "^15.10.3",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stylelint can be removed from this PR as well

@@ -23,5 +24,6 @@
"watch": "encore dev --watch",
"build": "encore production --progress",
"test": "grunt test"
}
},
"dependencies": {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty entries should be removed

@@ -65,6 +72,7 @@ abstract class EngineBase {
'fro' => 'Franceis, François, Romanz (1400-1600)',
'ger-hd-m1' => 'Transkribus German handwriting M1',
'ger-15' => '15th-16th century German',
'german-fraktur-19th-20th-century' => 'German Fraktur 19th-20th century',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a Transkribus model and needs to be removed from this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants