Add new OCR parameter to normalize the result text #112

stweil · 2023-09-22T16:07:03Z

No description provided.

stweil · 2023-09-22T16:10:43Z

Example: Tesseract OCR with and without normalization.

The normalization works with any OCR engine. Because the cache always stores the original OCR text, it is possible to switch to normalized text without a new OCR run.
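A minimal sketch of why caching the original text makes that switch cheap. The function name `getOcrText`, the array used as a cache, and the one-entry replacement table are assumptions made for illustration, not the tool's actual code.

```php
<?php
/**
 * Hypothetical helper: the cache holds only the raw OCR text,
 * and normalization is applied when the text is read back.
 */
function getOcrText( array &$cache, string $key, callable $runOcr, bool $normalize ): string {
	if ( !array_key_exists( $key, $cache ) ) {
		// Only a cache miss triggers an OCR run; the raw result is stored.
		$cache[$key] = $runOcr();
	}
	$text = $cache[$key];
	// Illustrative table with a single entry (long s to s).
	return $normalize ? strtr( $text, [ "\u{017F}" => 's' ] ) : $text;
}

$cache = [];
$runs = 0;
// Stand-in for a real OCR engine call; counts how often it is invoked.
$runOcr = function () use ( &$runs ) {
	$runs++;
	return "Wa\u{017F}\u{017F}er";
};

$raw = getOcrText( $cache, 'page-1', $runOcr, false );
$normalized = getOcrText( $cache, 'page-1', $runOcr, true );
```

The second call reuses the cached raw text, so the stand-in OCR function runs only once.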

stweil · 2023-09-22T16:23:13Z

src/Engine/EngineResult.php

+	 * Normalize result by replacing some historic characters
+	 */
+	public function normalize() {
+		$this->text = strtr( $this->text, [


Some of these translations (and more) could be done with Normalizer::normalize( $this->text, Normalizer::FORM_KC ), but that causes a runtime conflict with the Symfony class, which is also called Normalizer.
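For readers who do not have the full diff at hand, here is a rough sketch of the strtr() approach. The replacement table below is an assumption chosen for illustration (the hunk above is truncated and does not show the real table), and `normalizeHistoricChars` is a hypothetical name.

```php
<?php
/**
 * Hypothetical stand-alone version of a strtr()-based normalization.
 * The table entries are examples only, not the ones from this pull request.
 */
function normalizeHistoricChars( string $text ): string {
	return strtr( $text, [
		"\u{017F}" => 's',  // LATIN SMALL LETTER LONG S
		"\u{FB01}" => 'fi', // LATIN SMALL LIGATURE FI
		"\u{FB02}" => 'fl', // LATIN SMALL LIGATURE FL
	] );
}

echo normalizeHistoricChars( "Wa\u{017F}\u{017F}er" ), "\n";
```

With an array argument, strtr() replaces only the listed sequences, so every character that should change has to be named explicitly in the table.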


samwilson

This looks like a good addition, but note that there have been various discussions over the years about how to normalize OCR output, and not always with much agreement. That is mainly because different Wikisources want to do things differently, and many already have gadgets in place for doing the exact replacements that they want.

For example, T278443 ("fix issue with lines being formatted incorrectly") and T250185 ("Make Wikisource-OCR handle paragraphs better").

I think there needs to be a way to make this configurable per-project, or perhaps retrieve a config from on-wiki (e.g. a normalize_config param could point to a JSON page's URL, where the actual replacement patterns are defined).
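One way the JSON idea could look, as a sketch only. The JSON layout (a top-level "replacements" object) and the function names are assumptions; fetching the page from the wiki is left out, so the JSON is given inline.

```php
<?php
/**
 * Hypothetical: turn a JSON config into a replacement table for strtr().
 * Returns an empty table if the JSON is invalid or has an unexpected shape.
 */
function replacementsFromJson( string $json ): array {
	$config = json_decode( $json, true );
	if ( !is_array( $config ) || !is_array( $config['replacements'] ?? null ) ) {
		return [];
	}
	$table = [];
	foreach ( $config['replacements'] as $from => $to ) {
		$from = (string)$from;
		// Skip empty search strings and non-string replacements.
		if ( $from !== '' && is_string( $to ) ) {
			$table[$from] = $to;
		}
	}
	return $table;
}

/** Apply the table; an empty table leaves the text unchanged. */
function applyReplacements( string $text, array $table ): string {
	return $table === [] ? $text : strtr( $text, $table );
}

// Assumed example of what a per-project JSON page might contain.
$json = '{"replacements": {"\u017f": "s", "\ufb01": "fi"}}';
echo applyReplacements( "Wa\u{017F}\u{017F}er", replacementsFromJson( $json ) ), "\n";
```

Keeping the table in data rather than in code would let each Wikisource choose its own replacements, or none at all.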

samwilson · 2023-09-23T00:19:24Z

src/Engine/EngineResult.php

+	 * Normalize result by replacing some historic characters
+	 */
+	public function normalize() {
+		$this->text = strtr( $this->text, [


but that causes a runtime conflict with the Symfony class which is also called Normalizer.

It should work fine as long as you write \Normalizer here, or add use Normalizer; at the top, so that the intl extension's class is used. The Symfony class is a polyfill for it, for when the intl extension isn't installed. If you use the former, then don't forget to add that extension to the requirements in composer.json.
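A sketch of the fully qualified form, assuming a namespaced file. The namespace `App\Engine` and the function name `normalizeNfkc` are hypothetical; `\Normalizer::normalize()` and `\Normalizer::FORM_KC` are the intl extension's API.

```php
<?php
namespace App\Engine; // hypothetical namespace, for illustration only

/**
 * Apply Unicode NFKC normalization using the global Normalizer class.
 * The leading backslash makes PHP resolve the class in the global
 * namespace rather than relative to the current one.
 */
function normalizeNfkc( string $text ): string {
	if ( !class_exists( \Normalizer::class ) ) {
		// Neither the intl extension nor a polyfill is available.
		return $text;
	}
	$normalized = \Normalizer::normalize( $text, \Normalizer::FORM_KC );
	// normalize() returns false on failure; fall back to the input.
	return $normalized === false ? $text : $normalized;
}
```

If the code relies on the extension itself, the usual way to declare that in composer.json is an `"ext-intl": "*"` entry under `require`.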



stweil marked this pull request as draft September 23, 2023 08:05

Add new OCR parameter to normalize the result text

744ad1c

Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil force-pushed the normalize branch from 4322273 to 744ad1c Compare October 12, 2023 11:20
