Add new OCR parameter to normalize the result text #112

Draft
wants to merge 1 commit into
base: main

1 change: 1 addition & 0 deletions i18n/en.json
@@ -32,6 +32,7 @@
"report-issue": "Report an issue",
"langs-placeholder": "Leave blank for automatic language detection.",
"langs-param-error": "The following {{PLURAL:$1|language is|languages are}} not supported by the OCR engine: $2",
"normalize-ocr-text": "Normalize the text from OCR",
"tesseract-options": "Tesseract options",
"tesseract-psm-label": "Page segmentation method",
"tesseract-psm-help": "Try \"Sparse text\" for better multi-column support.",
1 change: 1 addition & 0 deletions i18n/qqq.json
@@ -39,6 +39,7 @@
"report-issue": "Link text in the footer for the issue-reporting link.",
"langs-placeholder": "Placeholder text for the language input field.",
"langs-param-error": "Error message displayed when invalid language(s) are submitted.\n\nParameters:\n* $1 – number of invalid languages\n* $2 - the list of invalid languages\n\nOCR is a common abbreviation in English for \"Optical Characters Recognition\".",
"normalize-ocr-text": "Normalize the text from OCR (replaces long s and some other historic characters)",
"tesseract-options": "Heading for Tesseract-specific options.",
"tesseract-psm-label": "Form label for the Tesseract page segmentation mode.",
"tesseract-psm-help": "Help text for the Tesseract page segmentation mode option. 'Sparse text' refers to options, see messages:\n* {{msg-wm|Wikimedia-ocr-tesseract-psm-11}} and\n* {{msg-wm|Wikimedia-ocr-tesseract-psm-12}}.",
11 changes: 11 additions & 0 deletions src/Controller/OcrController.php
@@ -60,6 +60,7 @@ class OcrController extends AbstractController {
'image' => '',
'engine' => self::DEFAULT_ENGINE,
'langs' => [],
'normalize' => false,
'psm' => TesseractEngine::DEFAULT_PSM,
'crop' => [],
'line_id' => TranskribusEngine::DEFAULT_LINEID,
@@ -112,6 +113,7 @@ private function setup(): void {
}
static::$params['langs'] = $this->getLangs( $this->request );
static::$params['image_hosts'] = $this->engine->getImageHosts();
static::$params['normalize'] = $this->request->query->get( 'normalize' );
$crop = $this->request->query->get( 'crop' );
if ( !is_array( $crop ) ) {
$crop = [];
@@ -228,6 +230,12 @@ public function homeAction(): Response {
* @OA\JsonContent(type="array", @OA\Items(type="string"))
* )
* @OA\Parameter(
* name="normalize",
* in="query",
* description="Normalize OCR text.",
* @OA\Schema(type="boolean")
* )
* @OA\Parameter(
* name="psm",
* in="query",
* description="The Page Segmentation Mode for Tesseract.",
@@ -365,6 +373,9 @@ private function getResult( string $invalidLangsMode ): EngineResult {
if ( !$result instanceof EngineResult ) {
throw new Exception( 'Incorrect (possibly cached) result: ' . var_export( $result, true ) );
}
if ( static::$params['normalize'] ) {
$result->normalize();
}
return $result;
}
}
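
A note on the `normalize` query parameter read in `setup()` above: `query->get()` returns the raw string (or null), so a request such as `?normalize=false` would still pass the truthiness check in `getResult()`. The sketch below shows one way to parse the flag strictly. `parseBooleanFlag` is a hypothetical helper, not part of this pull request; Symfony's parameter bags also offer `getBoolean()`, which should serve the same purpose.

```php
<?php
declare( strict_types=1 );

/**
 * Hypothetical helper (not part of the pull request): turn a raw query-string
 * value into a strict boolean.
 */
function parseBooleanFlag( ?string $raw ): bool {
	// A missing parameter means the option is off.
	if ( $raw === null ) {
		return false;
	}
	// FILTER_VALIDATE_BOOLEAN accepts "1", "true", "on" and "yes" as true;
	// every other string is treated as false.
	return (bool)filter_var( $raw, FILTER_VALIDATE_BOOLEAN );
}

var_dump( (bool)'false' );               // bool(true): plain string truthiness
var_dump( parseBooleanFlag( 'false' ) ); // bool(false)
var_dump( parseBooleanFlag( '1' ) );     // bool(true), the value sent by the checkbox
```

With the checkbox value `1` used in the template, both approaches behave the same; the difference only shows for API callers that pass values like `false` or `no`.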
15 changes: 15 additions & 0 deletions src/Engine/EngineResult.php
@@ -35,4 +35,19 @@ public function getText(): string {
public function getWarnings(): array {
return $this->warnings;
}

/**
* Normalize result by replacing some historic characters
*/
public function normalize() {
$this->text = strtr( $this->text, [
Review comment by stweil (Contributor Author), Sep 22, 2023:

Some (and more) of these translations could be done with `Normalizer::normalize( $this->text, Normalizer::FORM_KC )`, but that causes a runtime conflict with the Symfony class, which is also called `Normalizer`.

Reply from a Member:

> but that causes a runtime conflict with the Symfony class which is also called Normalizer.

It should work fine as long as you use `\Normalizer` here, or add `use Normalizer;` at the top, so that the intl extension's class is used. The Symfony class is a polyfill for it, for when the intl extension isn't installed. If you use the former, then don't forget to add that extension to the requirements in composer.json.

'ſ' => 's',
'ꝛ' => 'r',
'ℳ' => 'M',
'aͤ' => 'ä',
'oͤ' => 'ö',
'uͤ' => 'ü',
'⸗' => '-',
] );
}
}
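
The suggestion from the review thread above could be combined with the explicit table roughly as follows. This is a minimal sketch, not the code of this pull request: `normalizeOcrText` is a hypothetical free function, the code points are written as escapes on the assumption that they match the characters in the diff, and the NFKC step is skipped when no global `Normalizer` class is available.

```php
<?php
declare( strict_types=1 );

/**
 * Hypothetical helper (not part of the pull request): apply NFKC where
 * possible, then the explicit replacements from the diff.
 */
function normalizeOcrText( string $text ): string {
	// The leading backslash selects the global Normalizer class, which comes
	// from the intl extension or from a polyfill when one is installed.
	if ( class_exists( \Normalizer::class ) ) {
		$nfkc = \Normalizer::normalize( $text, \Normalizer::FORM_KC );
		// normalize() returns false on failure; keep the input in that case.
		if ( is_string( $nfkc ) ) {
			$text = $nfkc;
		}
	}
	// The full table stays as a fallback and for characters NFKC leaves alone.
	return strtr( $text, [
		"\u{017F}" => 's',          // long s
		"\u{A75B}" => 'r',          // r rotunda
		"\u{2133}" => 'M',          // script capital M
		"a\u{0364}" => "\u{00E4}",  // a + combining small e
		"o\u{0364}" => "\u{00F6}",  // o + combining small e
		"u\u{0364}" => "\u{00FC}",  // u + combining small e
		"\u{2E17}" => '-',          // double oblique hyphen
	] );
}

echo normalizeOcrText( "Wa\u{017F}\u{017F}er" ), "\n"; // prints "Wasser"
```

NFKC also rewrites characters outside this table (ligatures, for example), so whether that wider effect is wanted for OCR output is a design choice for the pull request to settle.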
4 changes: 4 additions & 0 deletions templates/output.html.twig
@@ -56,6 +56,10 @@
</select>
{% include '_transkribus_help.html.twig' with {engine: engine} %}
</div>
<div class="form-group">
<input type="checkbox" id="normalize" name="normalize" value="1">
<label for="normalize">{{ msg('normalize-ocr-text') }}</label>
</div>
</fieldset>

{% include '_tesseract_options.html.twig' with {engine: engine} %}