implement AlternativeImage-based processing:

- base all processors on AlternativeImage
- make all processors create a parameter MetadataItem
- make all processors create output file names from the input files, and use .xml extension for PAGE
- introduce a `common` module along the lines of the ocropy wrapper (but without ocropy-specific segmentation), i.e. functions to be moved into core:
  - polygon_mask
  - rotate_polygon
  - image_from_page
  - image_from_region
  - image_from_line
  - image_from_word
  - image_from_glyph
  - save_image_file
  - bbox_from_points
  - points_from_bbox
  - xywh_from_bbox
  - bbox_from_xywh
  - points_from_polygon
- in crop:
  - set textord_tabfind_find_tables=0 (because with table detection, the hinge often gets confused with a table column)
  - if a Border already exists, warn that it will be overwritten
  - if TextRegions already exist, calculate their common extent and warn it will be ignored
  - use PSM.SPARSE_TEXT instead of PSM.AUTO (so no image regions creep into neighbouring pages)
  - ignore regions which are empty after binarization
  - ignore regions with tiny width or height (< 30px)
  - add a padding to the result on all sides (4px)
  - do not annotate a (wrong) Border if no regions have been found
- in deskew:
  - convert skewing angle from radians to degrees, and mind the direction (clockwise in PAGE, but mathematically positive in Pillow) and map to the numeric interval (-179,180)
  - add orientation (+90/180/270) to skewing angle
  - also rotate the raw image of the page/region (expand and fill with white) and store as file; reference in METS (under OCR-D-IMG-DESKEW) and in PAGE (as AlternativeImage, with appropriate comments)
  - annotate writing direction and textline order in PAGE too
  - use OSD (DetectOrientationScript) in addition to layout analysis (AnalyseLayout/Orientation), with confidence thresholds (>= 10): ensure that orientation is consistent between both (and in case of conflict, use the former), also annotate primary script; init appropriately (i.e. load "osd", use legacy OEM and AUTO_OSD)
  - on region level, process TableRegions as well
  - change default operation_level to region (because we still cannot annotate orientation on page level)
- in segment_region:
  - add parameter `find_tables` (default: true) to allow disabling table detection (textord_tabfind_find_tables=0), so they can be analysed into independent text/sep regions
  - add parameter `overwrite_regions` (default: true) to enable removal of any existing text regions
  - unconditionally remove any existing non-text regions and reading order groups
  - cover PT.VERTICAL_TEXT (as TextRegionType) and PT.TABLE (as TableRegionType)
  - use BlockPolygon (if present) to annotate polygon outline in Coords (but commented out, because the patch against tesserocr segfaults still awaits merge)
  - add parameter `crop_polygons` (default: false) to enable: retrieve the raw region image along the (internal) polygon outline, store image as file, and reference in METS (under OCR-D-IMG-CROP) and in PAGE (as AlternativeImage)
- in segment_line, add parameter `overwrite_lines` (default: true) to enable removal of any existing text lines
- in segment_word, add parameter `overwrite_words` (default: true) to enable removal of any existing words
- new processor binarize:
  - operate on page, region or line level
  - retrieve cropped, raw page/region/line image, then enter PSM.AUTO/SINGLE_BLOCK/SINGLE_LINE, and run layout analysis on the image, retrieve the binary image for RIL.BLOCK/TEXTLINE, store image as file, and reference in METS (under OCR-D-IMG-BIN) and in PAGE (as AlternativeImage)
- improve docstrings
- remove redundant locale workaround from config (already in __init__)
- version 0.2.2 → 0.2.3
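
The angle handling described for deskew (radians to degrees, flipping the direction, adding the orientation, and mapping into a bounded interval) can be sketched as follows. This is a minimal illustration, not the code from `deskew.py`, which is not shown in this part of the diff. The function names `normalize_angle` and `page_angle` are hypothetical, the input is assumed to be a counter-clockwise-positive skew in radians, and the target interval is read here as (-180, 180].

```python
import math


def normalize_angle(degrees):
    """Map an angle in degrees into the half-open interval (-180, 180]."""
    angle = degrees % 360  # Python's modulo yields a value in [0, 360)
    if angle > 180:
        angle -= 360
    return angle


def page_angle(skew_radians, orientation_degrees=0):
    """Combine a skew (radians) and an orientation (0/90/180/270 degrees).

    Assumption: the skew is mathematically positive (counter-clockwise),
    while the annotated angle counts clockwise as positive, hence the sign flip.
    """
    skew_degrees = math.degrees(skew_radians)
    return normalize_angle(-skew_degrees + orientation_degrees)
```

For example, `normalize_angle(270)` yields `-90`, so a 270 degree orientation is annotated as a quarter turn in the opposite direction.
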
OCR-D · Jun 28, 2019 · 5107f3f
1 parent b598dee
Showing 15 changed files with 1,527 additions and 340 deletions.
diff --git a/.pylintrc b/.pylintrc
@@ -5,13 +5,23 @@ ignored-modules=cv2,tesserocr
 [MESSAGES CONTROL]
 disable =
     ungrouped-imports,
-    fixme,
+#    fixme,
     bad-continuation,
     missing-docstring,
     no-self-use,
-    too-many-arguments,
     superfluous-parens,
     invalid-name,
     line-too-long,
+    too-many-arguments,
+    too-many-branches,
+    too-many-statements,
     too-many-locals,
     too-few-public-methods,
+    wrong-import-order,
+    duplicate-code
+
+# allow indented whitespace (as required by interpreter):
+no-space-check=empty-line
+
+# allow non-snake-case identifiers:
+good-names=n,i
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,26 @@ Versioned according to [Semantic Versioning](http://semver.org/).
 
 ## Unreleased
 
+## [0.2.3] - 2019-06-28
+
+Changed:
+  * Use basename of input file for output name
+  * Use .xml filename extension for PAGE output
+  * Warn about existing border or regions in `crop`
+  * Use `PSM.SPARSE_TEXT` without tables in `crop`
+  * Filter unreliable regions in `crop`
+  * Add padding around border in `crop`
+  * Delete existing regions in `segment_region`
+  * Cover vertical text and tables in `segment_region`
+  * Add parameter `find_tables` in `segment_region`
+  * Add parameter `crop_polygons` in `segment_region`
+  * Add parameter `overwrite_regions` in `segment_region`
+  * Add parameter `overwrite_lines` in `segment_line`
+  * Add parameter `overwrite_words` in `segment_word`
+  * Add page/region-level processor `deskew`
+  * Add page/region/line-level processor `binarize`
+  * Respect AlternativeImage on all levels
+
 ## [0.2.2] - 2019-05-20
 
 Changed:

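The `crop` entries above (filtering unreliable regions, then padding the border) could look roughly like the sketch below. It is an illustration under stated assumptions rather than the code in `crop.py`, which is not part of this excerpt: the function name `border_from_boxes` is hypothetical, boxes are assumed to be `(left, top, right, bottom)` tuples in pixels, and the check for regions that are empty after binarization is omitted. The 30px minimum size and 4px padding are taken from the commit message.

```python
MIN_SIZE = 30  # px; regions narrower or lower than this are ignored
PADDING = 4    # px; added on all sides of the resulting border


def border_from_boxes(boxes, image_width, image_height,
                      min_size=MIN_SIZE, padding=PADDING):
    """Return the padded common extent of all plausible boxes, or None."""
    kept = [(left, top, right, bottom)
            for (left, top, right, bottom) in boxes
            if right - left >= min_size and bottom - top >= min_size]
    if not kept:
        # no reliable regions: better to annotate no Border than a wrong one
        return None
    left = min(box[0] for box in kept)
    top = min(box[1] for box in kept)
    right = max(box[2] for box in kept)
    bottom = max(box[3] for box in kept)
    # pad on all sides, but stay inside the image
    return (max(0, left - padding), max(0, top - padding),
            min(image_width, right + padding),
            min(image_height, bottom + padding))
```

Returning `None` for the empty case mirrors the rule that no Border is annotated when no regions have been found.
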
diff --git a/ocrd_tesserocr/__init__.py b/ocrd_tesserocr/__init__.py
@@ -12,3 +12,4 @@
 from .segment_region import TesserocrSegmentRegion
 from .crop import TesserocrCrop
 from .deskew import TesserocrDeskew
+from .binarize import TesserocrBinarize
diff --git a/ocrd_tesserocr/binarize.py b/ocrd_tesserocr/binarize.py
@@ -0,0 +1,139 @@
+from __future__ import absolute_import
+
+import os.path
+from tesserocr import (
+    PyTessBaseAPI,
+    PSM, RIL
+)
+
+from ocrd_utils import (
+    getLogger, concat_padded,
+    MIMETYPE_PAGE
+)
+from ocrd_modelfactory import page_from_file
+from ocrd_models.ocrd_page import (
+    MetadataItemType,
+    LabelsType, LabelType,
+    AlternativeImageType,
+    TextRegionType,
+    to_xml
+)
+from ocrd import Processor
+
+from .config import TESSDATA_PREFIX, OCRD_TOOL
+from .common import (
+    image_from_page,
+    image_from_region,
+    image_from_line,
+    save_image_file,
+    membername
+)
+
+TOOL = 'ocrd-tesserocr-binarize'
+LOG = getLogger('processor.TesserocrBinarize')
+FILEGRP_IMG = 'OCR-D-IMG-BIN'
+
+class TesserocrBinarize(Processor):
+
+    def __init__(self, *args, **kwargs):
+        kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL]
+        kwargs['version'] = OCRD_TOOL['version']
+        super(TesserocrBinarize, self).__init__(*args, **kwargs)
+
+    def process(self):
+        """Performs binarization with Tesseract on the workspace.
+        
+        Open and deserialise PAGE input files and their respective images,
+        then iterate over the element hierarchy down to the requested level.
+        
+        Set up Tesseract to recognise the segment image's layout, and get
+        the binarized image. Create an image file, and reference it as
+        AlternativeImage in the element and as file with a fileGrp USE
+        equal to `OCR-D-IMG-BIN` in the workspace.
+        
+        Produce a new output file by serialising the resulting hierarchy.
+        """
+        oplevel = self.parameter['operation_level']
+        with PyTessBaseAPI(path=TESSDATA_PREFIX) as tessapi:
+            for n, input_file in enumerate(self.input_files):
+                file_id = input_file.ID.replace(self.input_file_grp, FILEGRP_IMG)
+                page_id = input_file.pageId or input_file.ID
+                LOG.info("INPUT FILE %i / %s", n, page_id)
+                pcgts = page_from_file(self.workspace.download_file(input_file))
+                metadata = pcgts.get_Metadata() # ensured by from_file()
+                metadata.add_MetadataItem(
+                    MetadataItemType(type_="processingStep",
+                                     name=self.ocrd_tool['steps'][0],
+                                     value=TOOL,
+                                     # FIXME: externalRef is invalid by pagecontent.xsd, but ocrd does not reflect this
+                                     # what we want here is `externalModel="ocrd-tool" externalId="parameters"`
+                                     Labels=[LabelsType(#externalRef="parameters",
+                                                        Label=[LabelType(type_=name,
+                                                                         value=self.parameter[name])
+                                                               for name in self.parameter.keys()])]))
+                page = pcgts.get_Page()
+                page_image = self.workspace.resolve_image_as_pil(page.imageFilename)
+                LOG.info("Binarizing on '%s' level in page '%s'", oplevel, page_id)
+
+                page_image, page_xywh = image_from_page(
+                    self.workspace, page, page_image, page_id)
+                if oplevel == 'page':
+                    tessapi.SetPageSegMode(PSM.AUTO)
+                    self._process_segment(tessapi, RIL.BLOCK, page, page_image, page_xywh,
+                                          "page '%s'" % page_id, input_file.pageId,
+                                          file_id)
+                else:
+                    regions = page.get_TextRegion() + page.get_TableRegion()
+                    if not regions:
+                        LOG.warning("Page '%s' contains no text regions", page_id)
+                    for region in regions:
+                        region_image, region_xywh = image_from_region(
+                            self.workspace, region, page_image, page_xywh)
+                        if oplevel == 'region':
+                            tessapi.SetPageSegMode(PSM.SINGLE_BLOCK)
+                            self._process_segment(tessapi, RIL.BLOCK, region, region_image, region_xywh,
+                                                  "region '%s'" % region.id, input_file.pageId,
+                                                  file_id + '_' + region.id)
+                        elif isinstance(region, TextRegionType):
+                            lines = region.get_TextLine()
+                            if not lines:
+                                LOG.warning("Page '%s' region '%s' contains no text lines",
+                                            page_id, region.id)
+                            for line in lines:
+                                line_image, line_xywh = image_from_line(
+                                    self.workspace, line, region_image, region_xywh)
+                                tessapi.SetPageSegMode(PSM.SINGLE_LINE)
+                                self._process_segment(tessapi, RIL.TEXTLINE, line, line_image, line_xywh,
+                                                      "line '%s'" % line.id, input_file.pageId,
+                                                      file_id + '_' + region.id + '_' + line.id)
+
+                # Use input_file's basename for the new file -
+                # this way the files retain the same basenames:
+                file_id = input_file.ID.replace(self.input_file_grp, self.output_file_grp)
+                if file_id == input_file.ID:
+                    file_id = concat_padded(self.output_file_grp, n)
+                self.workspace.add_file(
+                    ID=file_id,
+                    file_grp=self.output_file_grp,
+                    mimetype=MIMETYPE_PAGE,
+                    local_filename=os.path.join(self.output_file_grp,
+                                                file_id + '.xml'),
+                    content=to_xml(pcgts))
+
+    def _process_segment(self, tessapi, ril, segment, image, xywh, where, page_id, file_id):
+        tessapi.SetImage(image)
+        image_bin = None
+        layout = tessapi.AnalyseLayout()
+        if layout:
+            image_bin = layout.GetBinaryImage(ril)
+        if not image_bin:
+            LOG.error('Cannot binarize %s', where)
+            return
+        # update METS (add the image file):
+        file_path = save_image_file(self.workspace, image_bin,
+                                    file_id,
+                                    page_id=page_id,
+                                    file_grp=FILEGRP_IMG)
+        # update PAGE (reference the image file):
+        segment.add_AlternativeImage(AlternativeImageType(
+            filename=file_path, comments="binarized"))
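
The output naming in `process()` above replaces the input fileGrp name inside the input file ID, and falls back to a numbered ID when that replacement changes nothing. The standalone sketch below reproduces that logic outside the processor. `padded_id` is a hypothetical stand-in for `ocrd_utils.concat_padded`, whose exact padding format is not shown in the diff, and `output_file_id` and `output_path` are illustrative names.

```python
import os.path


def padded_id(base, n):
    # Hypothetical stand-in for ocrd_utils.concat_padded;
    # the real padding format may differ.
    return '%s_%04i' % (base, n)


def output_file_id(input_id, input_grp, output_grp, n):
    """Derive the output file ID from the input file ID."""
    file_id = input_id.replace(input_grp, output_grp)
    if file_id == input_id:
        # the input ID did not contain the fileGrp name (or both
        # fileGrps are identical), so fall back to a numbered ID
        file_id = padded_id(output_grp, n)
    return file_id


def output_path(output_grp, file_id):
    """PAGE output is stored under the output fileGrp with an .xml extension."""
    return os.path.join(output_grp, file_id + '.xml')
```

With an input ID of `OCR-D-SEG-LINE_0001`, input fileGrp `OCR-D-SEG-LINE` and output fileGrp `OCR-D-BIN`, this gives the ID `OCR-D-BIN_0001`, so files keep the same basename across fileGrps.
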
diff --git a/ocrd_tesserocr/cli.py b/ocrd_tesserocr/cli.py
@@ -7,6 +7,7 @@
 from ocrd_tesserocr.segment_word import TesserocrSegmentWord
 from ocrd_tesserocr.crop import TesserocrCrop
 from ocrd_tesserocr.deskew import TesserocrDeskew
+from ocrd_tesserocr.binarize import TesserocrBinarize
 
 @click.command()
 @ocrd_cli_options
@@ -37,3 +38,8 @@ def ocrd_tesserocr_crop(*args, **kwargs):
 @ocrd_cli_options
 def ocrd_tesserocr_deskew(*args, **kwargs):
     return ocrd_cli_wrap_processor(TesserocrDeskew, *args, **kwargs)
+
+@click.command()
+@ocrd_cli_options
+def ocrd_tesserocr_binarize(*args, **kwargs):
+    return ocrd_cli_wrap_processor(TesserocrBinarize, *args, **kwargs)