implement AlternativeImage-based processing:
- base all processors on AlternativeImage
- make all processors create a parameter MetadataItem
- make all processors create output file names from the input files, and use .xml extension for PAGE
- introduce a `common` module along the lines of the ocropy wrapper (but without ocropy-specific segmentation), i.e. functions to be moved into core:
  - polygon_mask
  - rotate_polygon
  - image_from_page
  - image_from_region
  - image_from_line
  - image_from_word
  - image_from_glyph
  - save_image_file
  - bbox_from_points
  - points_from_bbox
  - xywh_from_bbox
  - bbox_from_xywh
  - points_from_polygon
- in crop:
  - set textord_tabfind_find_tables=0 (because with table detection, the hinge often gets confused with a table column)
  - if a Border already exists, warn that it will be overwritten
  - if TextRegions already exist, calculate their common extent and warn that it will be ignored
  - use PSM.SPARSE_TEXT instead of PSM.AUTO (so no image regions creep into neighbouring pages)
  - ignore regions which are empty after binarization
  - ignore regions with tiny width or height (< 30px)
  - add a padding to the result on all sides (4px)
  - do not annotate a (wrong) Border if no regions have been found
- in deskew:
  - convert skewing angle from radians to degrees, mind the direction (clockwise in PAGE, but mathematically positive in Pillow), and map to the numeric interval (-179,180)
  - add orientation (+90/180/270) to skewing angle
  - also rotate the raw image of the page/region (expand and fill with white) and store as file; reference in METS (under OCR-D-IMG-DESKEW) and in PAGE (as AlternativeImage, with appropriate comments)
  - annotate writing direction and textline order in PAGE too
  - use OSD (DetectOrientationScript) in addition to layout analysis (AnalyseLayout/Orientation), with confidence thresholds (>= 10): ensure that orientation is consistent between both (and in case of conflict, use the former), also annotate primary script; init appropriately (i.e. load "osd", use legacy OEM and AUTO_OSD)
  - on region level, process TableRegions as well
  - change default operation_level to region (because we still cannot annotate orientation on page level)
- in segment_region:
  - add parameter `find_tables` (default: true) to allow disabling table detection (textord_tabfind_find_tables=0), so tables can be analysed into independent text/sep regions
  - add parameter `overwrite_regions` (default: true) to control whether any existing text regions are removed
  - unconditionally remove any existing non-text regions and reading order groups
  - cover PT.VERTICAL_TEXT (as TextRegionType) and PT.TABLE (as TableRegionType)
  - use BlockPolygon (if present) to annotate polygon outline in Coords; commented out for now, because the patch against tesserocr segfaults still awaits merge
  - add parameter `crop_polygons` (default: false) to enable: retrieve the raw region image along the (internal) polygon outline, store image as file, and reference in METS (under OCR-D-IMG-CROP) and in PAGE (as AlternativeImage)
- in segment_line, add parameter `overwrite_lines` (default: true) to control whether any existing text lines are removed
- in segment_word, add parameter `overwrite_words` (default: true) to control whether any existing words are removed
- new processor binarize:
  - operate on page, region or line level
  - retrieve cropped, raw page/region/line image, then enter PSM.AUTO/SINGLE_BLOCK/SINGLE_LINE, run layout analysis on the image, retrieve the binary image for RIL.BLOCK/TEXTLINE, store image as file, and reference in METS (under OCR-D-IMG-BIN) and in PAGE (as AlternativeImage)
- improve docstrings
- remove redundant locale workaround from config (already in __init__)
- version 0.2.2 → 0.2.3
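
For illustration only, the coordinate helpers named for the `common` module could look roughly like the sketch below. The function names come from the list above, but the signatures, return types and the dict keys `x`/`y`/`w`/`h` are assumptions and not necessarily what this commit implements. The sketch assumes PAGE-style points strings of the form `x1,y1 x2,y2 ...` with integer pixel coordinates.

```python
def bbox_from_points(points):
    """Return (x0, y0, x1, y1) enclosing a points string 'x1,y1 x2,y2 ...'."""
    xys = [tuple(int(v) for v in pair.split(',')) for pair in points.split()]
    xs = [x for x, _ in xys]
    ys = [y for _, y in xys]
    return min(xs), min(ys), max(xs), max(ys)


def points_from_bbox(x0, y0, x1, y1):
    """Return the four corners of a bounding box as a points string.

    Corner order (assumed): top-left, top-right, bottom-right, bottom-left.
    """
    return '%i,%i %i,%i %i,%i %i,%i' % (x0, y0, x1, y0, x1, y1, x0, y1)


def xywh_from_bbox(x0, y0, x1, y1):
    """Convert corner coordinates to an offset-plus-size dict (assumed keys)."""
    return {'x': x0, 'y': y0, 'w': x1 - x0, 'h': y1 - y0}


def bbox_from_xywh(xywh):
    """Convert an offset-plus-size dict back to corner coordinates."""
    return (xywh['x'], xywh['y'],
            xywh['x'] + xywh['w'], xywh['y'] + xywh['h'])


if __name__ == '__main__':
    # Round trip: points string -> bbox -> xywh -> bbox -> points string
    bbox = bbox_from_points('10,20 110,20 110,70 10,70')
    xywh = xywh_from_bbox(*bbox)
    print(bbox, xywh, points_from_bbox(*bbox_from_xywh(xywh)))
```

Keeping these conversions as small pure functions would make them easy to move into core later, as the list above anticipates.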