Add flexible character accuracy #47
Conversation
Awesome! I'll try to review it in the next few days.
Note that the flexible character accuracy in the current implementation takes minutes on real-world data (e.g. the ground truth provided in the test data folder). I will try to investigate, but I suspect that it is a mix of: …
Update: all confirmed, and the performance is now "reasonable".
So using … The reason is that each process has its own cache, and flexible character accuracy relies heavily on caching. So we can decide to continue with …
This is on the agenda for 2021-01, sorry for not doing it earlier :)
As OCR-D continues to support Python 3.5 until the end of this year, version-specific data structures have been implemented. Once support for Python 3.5 is dropped, the extra file can easily be removed.
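For illustration, a rough sketch of how such a version switch could look (the module and class names here are made up, not the ones from this PR):

```python
import sys

# Minimal sketch with hypothetical module names: keep the Python 3.5
# data structures in a separate file so that the whole file can be
# deleted once OCR-D drops 3.5 support.
if sys.version_info >= (3, 6):
    from .flexible_character_accuracy_ds import PartVersionSpecific
else:
    from .flexible_character_accuracy_ds_35 import PartVersionSpecific
```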
As the FCA implementation already knows the edit operations for each segment, we use a different sequence alignment method.
Temporarily switch to the C implementation of python-Levenshtein for the editops calculation. Also added some variables, caching, and type changes for performance gains.
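For reference, a minimal sketch of the editops call in python-Levenshtein (the example strings are arbitrary):

```python
import Levenshtein

# editops() is implemented in C and returns the minimal sequence of
# edit operations as (operation, source_pos, destination_pos) tuples,
# where operation is one of "replace", "insert" or "delete".
ops = Levenshtein.editops("flexible", "flexbile")
for op, src_pos, dst_pos in ops:
    print(op, src_pos, dst_pos)
```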
If there were enough CPU power, one could perhaps calculate editDist() of …
FCA uses a sliding window to compare a shorter string with a longer one, taking insertions and deletions into account.
So there is no need to extend the window by n characters.
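A rough sketch of the sliding-window idea (Levenshtein.distance stands in here for whatever distance function is actually used):

```python
import Levenshtein

def best_window_distance(needle: str, haystack: str) -> int:
    # Slide a window of len(needle) over the longer string and keep the
    # smallest edit distance. Because the edit distance itself accounts
    # for insertions and deletions, the window does not have to be
    # extended by extra characters.
    best = len(needle)  # upper bound: replace every character
    for offset in range(len(haystack) - len(needle) + 1):
        window = haystack[offset : offset + len(needle)]
        best = min(best, Levenshtein.distance(needle, window))
    return best
```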
Note that this pull request is still blocked by the performance of calculating the distance between two strings in dinglehopper (see #48 for details on that).
Or expressed in a more positive manner: I am looking forward to getting the necessary features into RapidFuzz and then getting this PR merged too 😄
This is a first draft for adding the flexible character accuracy as suggested by @cneud in #32.
There are still some open topics, so I opened this pull request as a draft so you may already comment on some issues.
Handling of coefficients
The algorithm uses a "range of coefficients for penalty calculation" (see Table 1 in the paper).
Should we make the coefficients configurable? If so, we might need a configuration file because handling 12 additional parameters on the command line is quite messy.
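To give an idea of the shape of such a configuration, a hypothetical sketch (the coefficient names and values are placeholders, not the ones from Table 1):

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Coefficients:
    # Hypothetical names; the paper defines the actual coefficients.
    edit_dist: float
    length_diff: float
    offset: float
    length: float

# The "range of coefficients" then becomes the cross product of the
# candidate values for each coefficient:
COEFFICIENT_SETS = [
    Coefficients(e, ld, o, l)
    for e, ld, o, l in product([15.0, 20.0, 25.0], [1.0], [1.0], [1.5, 2.0])
]
```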
The runs for each set of coefficients are also a good place for parallelization.
Should we include parallelization at this point, or is this a non-issue because the processors in OCR-D workflows are typically already busy doing other things?
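If we did parallelize, it could look roughly like this (run_with_coefficients is a hypothetical entry point, and note the per-process cache issue mentioned above):

```python
from concurrent.futures import ProcessPoolExecutor

def run_with_coefficients(coeffs):
    # Hypothetical: run the full FCA matching for one coefficient set
    # and return something comparable, e.g. an (accuracy, result) tuple.
    ...

def best_run(coefficient_sets, max_workers=4):
    # Caveat: every worker process gets its own cache, and FCA relies
    # heavily on caching, so the gains may be smaller than expected.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(run_with_coefficients, coefficient_sets)
    return max(results)
```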
Penalty and distance functions
The algorithm depends a lot on a penalty and a distance function.
From a library point of view, I would like them to be exchangeable with other functions.
But from an ocrd-processor/CLI perspective this is not really useful.
So should we make the distance and penalty functions configurable?
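From the library side, exchangeability could be as simple as accepting callables with sensible defaults (a sketch with illustrative names, not the actual signature):

```python
from typing import Callable

import Levenshtein

def default_distance(a: str, b: str) -> int:
    return Levenshtein.distance(a, b)

def default_penalty(*args) -> float:
    # Placeholder for the penalty function from the paper.
    return 0.0

def flexible_character_accuracy(
    gt: str,
    ocr: str,
    distance_fn: Callable[[str, str], int] = default_distance,
    penalty_fn: Callable[..., float] = default_penalty,
):
    # Library users can swap in their own functions; the
    # ocrd-processor/CLI would just stick to the defaults.
    ...
```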
Use of caching
Because of regularly splitting lines and repeating runs with different coefficients, the algorithm needs a lot of caching to speed up the process. The easiest way to do this with Python < 3.9 is to use the @lru_cache decorator.
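For example (a minimal sketch, not the actual cached functions from this PR):

```python
from functools import lru_cache

import Levenshtein

@lru_cache(maxsize=None)  # functools.cache only exists from Python 3.9 on
def cached_distance(a: str, b: str) -> int:
    # Splitting lines and re-running with different coefficients hits
    # the same string pairs over and over, so memoization pays off.
    return Levenshtein.distance(a, b)
```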
The performance could benefit from a custom-tailored cache, but it would also add more code that we have to maintain.
Performance
At the moment it takes several minutes to analyse real-world pairs of ground truth and OCR data.