Skip to content

Commit

Permalink
implement AlternativeImage-based processing:
Browse files Browse the repository at this point in the history
- base all processors on AlternativeImage
- make all processors create a parameter MetadataItem
- make all processors create output file names
  from the input files, and use .xml extension for PAGE
- introduce a `common` module along the lines
  of the ocropy wrapper (but without
  ocropy-specific segmentation), i.e. functions
  to be moved into core:
  - polygon_mask
  - rotate_polygon
  - image_from_page
  - image_from_region
  - image_from_line
  - image_from_word
  - image_from_glyph
  - save_image_file
  - bbox_from_points
  - points_from_bbox
  - xywh_from_bbox
  - bbox_from_xywh
  - points_from_polygon
- in crop:
  - set textord_tabfind_find_tables=0 (because with
    table detection, the hinge often gets confused
    with a table column)
  - if a Border already exists, warn that it will
    be overwritten
  - if TextRegions already exist, calculate their
    common extent and warn it will be ignored
  - use PSM.SPARSE_TEXT instead of PSM.AUGO (so
    no images regions creep into neighbouring pages)
  - ignore regions which are empty after binarization
  - ignore regions with tiny width or height (< 30px)
  - add a padding to the result on all sides (4px)
  - do not annotate a (wrong) Border if no regions
    have been found
- in deskew:
  - convert skewing angle from radians to degrees,
    and mind the direction (clockwise in PAGE, but
    mathematically positive in Pillow) and map to
    the numeric interval (-179,180)
  - add orientation (+90/180/270) to skewing angle
  - also rotate the raw image of the page/region
    (expand and fill with white) and store as file;
    reference in METS (under OCR-D-IMG-DESKEW) and
    in PAGE (as AlternativeImage, with appropriate
    comments)
  - annotate writing direction and textline order
    in PAGE too
  - use OSD (DetectOrientationScript) in addition to
    layout analysis (AnalyseLayout/Orientation), with
    confidence thresholds (>= 10):
    ensure that orientation is consistent between both
    (and in case of conflict, use the former), also
    annotate primary script; init appropriately
    (i.e. load "osd", use legacy OEM and AUTO_OSD)
  - on region level, process TableRegions as well
  - change default operation_level to region (because
    we still cannot annotate orientation on page level)
- in segment_region:
  - add parameter `find_tables` (default: true) to allow
    disabling table detection (textord_tabfind_find_tables=0),
    so they can be analysed into independent text/sep regions
  - add parameter `overwrite_regions` (default: true) to
    allow enabling removal of any existing text regions
  - unconditionally remove any existing non-text regions
    and reading order groups
  - cover PT.VERTICAL_TEXT (as TextRegionType) and PT.TABLE
    (as TableRegionType)
  - use BlockPolygon (if present) to annotate polygon outline
    in Coords – but comment away, because patch against tesserocr
    segfaults awaits merge
  - add parameter `crop_polygons` (default: false) to enable:
    retrieve the raw region image along the (internal) polygon
    outline, store image as file, and reference in METS (under
    OCR-D-IMG-CROP) and in PAGE (as AlternativeImage)
- in segment_line, add parameter `overwrite_lines` (default: true) to
    allow enabling removal of any existing text lines
- in segment_word, add parameter `overwrite_words` (default: true) to
    allow enabling removal of any existing words
- new processor binarize:
  - operate on page, region or line level
  - retrieve cropped, raw page/region/line image, then enter
    PSM.AUTO/SINGLE_BLOCK/SINGLE_LINE, and run layout analysis
    on the image, retrieve the binary image for RIL.BLOCK/TEXTLINE
    store image as file, and reference in METS (under OCR-D-IMG-BIN)
    and in PAGE (as AlternativeImage)
- improve docstrings
- remove redundant locale workaround from config
  (already in __init__)
- version 0.2.2 → 0.2.3
  • Loading branch information
bertsky committed Jun 28, 2019
1 parent b598dee commit 5107f3f
Show file tree
Hide file tree
Showing 15 changed files with 1,527 additions and 340 deletions.
14 changes: 12 additions & 2 deletions .pylintrc
Original file line number Diff line number Diff line change
Expand Up @@ -5,13 +5,23 @@ ignored-modules=cv2,tesserocr
[MESSAGES CONTROL]
disable =
ungrouped-imports,
fixme,
# fixme,
bad-continuation,
missing-docstring,
no-self-use,
too-many-arguments,
superfluous-parens,
invalid-name,
line-too-long,
too-many-arguments,
too-many-branches,
too-many-statements,
too-many-locals,
too-few-public-methods,
wrong-import-order,
duplicate-code

# allow indented whitespace (as required by interpreter):
no-space-check=empty-line

# allow non-snake-case identifiers:
good-names=n,i
20 changes: 20 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,26 @@ Versioned according to [Semantic Versioning](http://semver.org/).

## Unreleased

## [0.2.3] - 2019-06-28

Changed:
* Use basename of input file for output name
* Use .xml filename extension for PAGE output
* Warn about existing border or regions in `crop`
* Use `PSM.SPARSE_TEXT` without tables in `crop`
* Filter unreliable regions in `crop`
* Add padding around border in `crop`
* Delete existing regions in `segment_region`
* Cover vertical text and tables in `segment_region`
* Add parameter `find_tables` in `segment_region`
* Add parameter `crop_polygons` in `segment_region`
* Add parameter `overwrite_regions` in `segment_region`
* Add parameter `overwrite_lines` in `segment_line`
* Add parameter `overwrite_words` in `segment_word`
* Add page/region-level processor `deskew`
* Add page/region/line-level processor `binarize`
* Respect AlternativeImage on all levels

## [0.2.2] - 2019-05-20

Changed:
Expand Down
1 change: 1 addition & 0 deletions ocrd_tesserocr/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,3 +12,4 @@
from .segment_region import TesserocrSegmentRegion
from .crop import TesserocrCrop
from .deskew import TesserocrDeskew
from .binarize import TesserocrBinarize
139 changes: 139 additions & 0 deletions ocrd_tesserocr/binarize.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,139 @@
from __future__ import absolute_import

import os.path
from tesserocr import (
PyTessBaseAPI,
PSM, RIL
)

from ocrd_utils import (
getLogger, concat_padded,
MIMETYPE_PAGE
)
from ocrd_modelfactory import page_from_file
from ocrd_models.ocrd_page import (
MetadataItemType,
LabelsType, LabelType,
AlternativeImageType,
TextRegionType,
to_xml
)
from ocrd import Processor

from .config import TESSDATA_PREFIX, OCRD_TOOL
from .common import (
image_from_page,
image_from_region,
image_from_line,
save_image_file,
membername
)

TOOL = 'ocrd-tesserocr-binarize'
LOG = getLogger('processor.TesserocrBinarize')
FILEGRP_IMG = 'OCR-D-IMG-BIN'

class TesserocrBinarize(Processor):

def __init__(self, *args, **kwargs):
kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL]
kwargs['version'] = OCRD_TOOL['version']
super(TesserocrBinarize, self).__init__(*args, **kwargs)

def process(self):
"""Performs binarization with Tesseract on the workspace.
Open and deserialise PAGE input files and their respective images,
then iterate over the element hierarchy down to the requested level.
Set up Tesseract to recognise the segment image's layout, and get
the binarized image. Create an image file, and reference it as
AlternativeImage in the element and as file with a fileGrp USE
equal `OCR-D-IMG-BIN` in the workspace.
Produce a new output file by serialising the resulting hierarchy.
"""
oplevel = self.parameter['operation_level']
with PyTessBaseAPI(path=TESSDATA_PREFIX) as tessapi:
for n, input_file in enumerate(self.input_files):
file_id = input_file.ID.replace(self.input_file_grp, FILEGRP_IMG)
page_id = input_file.pageId or input_file.ID
LOG.info("INPUT FILE %i / %s", n, page_id)
pcgts = page_from_file(self.workspace.download_file(input_file))
metadata = pcgts.get_Metadata() # ensured by from_file()
metadata.add_MetadataItem(
MetadataItemType(type_="processingStep",
name=self.ocrd_tool['steps'][0],
value=TOOL,
# FIXME: externalRef is invalid by pagecontent.xsd, but ocrd does not reflect this
# what we want here is `externalModel="ocrd-tool" externalId="parameters"`
Labels=[LabelsType(#externalRef="parameters",
Label=[LabelType(type_=name,
value=self.parameter[name])
for name in self.parameter.keys()])]))
page = pcgts.get_Page()
page_image = self.workspace.resolve_image_as_pil(page.imageFilename)
LOG.info("Binarizing on '%s' level in page '%s'", oplevel, page_id)

page_image, page_xywh = image_from_page(
self.workspace, page, page_image, page_id)
if oplevel == 'page':
tessapi.SetPageSegMode(PSM.AUTO)
self._process_segment(tessapi, RIL.BLOCK, page, page_image, page_xywh,
"page '%s'" % page_id, input_file.pageId,
file_id)
else:
regions = page.get_TextRegion() + page.get_TableRegion()
if not regions:
LOG.warning("Page '%s' contains no text regions", page_id)
for region in regions:
region_image, region_xywh = image_from_region(
self.workspace, region, page_image, page_xywh)
if oplevel == 'region':
tessapi.SetPageSegMode(PSM.SINGLE_BLOCK)
self._process_segment(tessapi, RIL.BLOCK, region, region_image, region_xywh,
"region '%s'" % region.id, input_file.pageId,
file_id + '_' + region.id)
elif isinstance(region, TextRegionType):
lines = region.get_TextLine()
if not lines:
LOG.warning("Page '%s' region '%s' contains no text lines",
page_id, region.id)
for line in lines:
line_image, line_xywh = image_from_line(
self.workspace, line, region_image, region_xywh)
tessapi.SetPageSegMode(PSM.SINGLE_LINE)
self._process_segment(tessapi, RIL.TEXTLINE, line, line_image, line_xywh,
"line '%s'" % line.id, input_file.pageId,
file_id + '_' + region.id + '_' + line.id)

# Use input_file's basename for the new file -
# this way the files retain the same basenames:
file_id = input_file.ID.replace(self.input_file_grp, self.output_file_grp)
if file_id == input_file.ID:
file_id = concat_padded(self.output_file_grp, n)
self.workspace.add_file(
ID=file_id,
file_grp=self.output_file_grp,
mimetype=MIMETYPE_PAGE,
local_filename=os.path.join(self.output_file_grp,
file_id + '.xml'),
content=to_xml(pcgts))

def _process_segment(self, tessapi, ril, segment, image, xywh, where, page_id, file_id):
tessapi.SetImage(image)
image_bin = None
layout = tessapi.AnalyseLayout()
if layout:
image_bin = layout.GetBinaryImage(ril)
if not image_bin:
LOG.error('Cannot binarize %s', where)
return
# update METS (add the image file):
file_path = save_image_file(self.workspace, image_bin,
file_id,
page_id=page_id,
file_grp=FILEGRP_IMG)
# update PAGE (reference the image file):
segment.add_AlternativeImage(AlternativeImageType(
filename=file_path, comments="binarized"))
6 changes: 6 additions & 0 deletions ocrd_tesserocr/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
from ocrd_tesserocr.segment_word import TesserocrSegmentWord
from ocrd_tesserocr.crop import TesserocrCrop
from ocrd_tesserocr.deskew import TesserocrDeskew
from ocrd_tesserocr.binarize import TesserocrBinarize

@click.command()
@ocrd_cli_options
Expand Down Expand Up @@ -37,3 +38,8 @@ def ocrd_tesserocr_crop(*args, **kwargs):
@ocrd_cli_options
def ocrd_tesserocr_deskew(*args, **kwargs):
return ocrd_cli_wrap_processor(TesserocrDeskew, *args, **kwargs)

@click.command()
@ocrd_cli_options
def ocrd_tesserocr_binarize(*args, **kwargs):
return ocrd_cli_wrap_processor(TesserocrBinarize, *args, **kwargs)
Loading

0 comments on commit 5107f3f

Please sign in to comment.