
Add flexible character accuracy #47

Closed · wants to merge 21 commits

Conversation

@b2m (Contributor) commented Nov 11, 2020

This is a first draft for adding the flexible character accuracy measure suggested by @cneud in #32.

C. Clausner, S. Pletschacher, A. Antonacopoulos, "Flexible character accuracy measure for reading-order-independent evaluation", Pattern Recognition Letters, Volume 131, March 2020, pp. 390-397

There are still some open topics, so I opened this pull request as a draft so you can already comment on them.

Handling of coefficients

The algorithm uses a "range of coefficients for penalty calculation" (see Table 1 in the paper).

| Coefficient | Min | Max | Step |
| ----------- | --- | --- | ---- |
| minDist     | 15  | 30  | 5    |
| lengthDiff  | 0   | 23  | 3    |
| offset      | 0   | 3   | 1    |
| length      | 0   | 5   | 1    |

Should we make the coefficients configurable? If so, we might need a configuration file because handling 12 additional parameters on the command line is quite messy.
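For illustration, here is a minimal sketch of how these ranges could live in a single configuration structure and be expanded into all runs; the structure and names are hypothetical, not this PR's actual code:

```python
from itertools import product

# Hypothetical config mirroring Table 1; each range yields the values
# from Min up to the Max bound (inclusive) in the given Step.
COEFFICIENTS = {
    "min_dist": range(15, 31, 5),    # 15, 20, 25, 30
    "length_diff": range(0, 24, 3),  # 0, 3, ..., 21
    "offset": range(0, 4, 1),        # 0, 1, 2, 3
    "length": range(0, 6, 1),        # 0, 1, ..., 5
}

# One dict per run, covering every combination of coefficient values.
combinations = [
    dict(zip(COEFFICIENTS, values))
    for values in product(*COEFFICIENTS.values())
]
```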

The runs for each set of coefficients are also a good place for parallelization.
Should we include parallelization at this point, or is this a non-issue since the processors in OCR-D workflows are typically already busy doing other things?

Penalty and distance functions

The algorithm depends a lot on a penalty and a distance function.
From a library point of view, I would like them to be exchangeable with other functions.
But from an ocrd-processor/CLI perspective this is not really useful.

So should we make the distance and penalty functions configurable?
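As a sketch of what exchangeable functions could look like on the library side (all names here are hypothetical, not dinglehopper's actual API):

```python
from typing import Callable

def naive_distance(gt: str, ocr: str) -> int:
    # Stand-in for the real distance function (e.g. Levenshtein).
    return sum(a != b for a, b in zip(gt, ocr)) + abs(len(gt) - len(ocr))

def naive_penalty(length: int, offset: int, coefficients: dict) -> float:
    # Stand-in for the paper's penalty calculation.
    return offset / (length + 1)

def flexible_character_accuracy(
    gt: str,
    ocr: str,
    distance: Callable[[str, str], int] = naive_distance,
    penalty: Callable[[int, int, dict], float] = naive_penalty,
) -> float:
    ...  # stub: the core algorithm would call distance() and penalty()
```

Library users could then inject their own functions, while the CLI simply sticks to the defaults.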

Use of caching

Because the algorithm regularly splits lines and repeats runs with different coefficients, it needs a lot of caching to speed things up. The easiest way to do this with Python < 3.9 is the @lru_cache decorator.

The performance could benefit from a custom-tailored cache, but that would also add more code we have to maintain.
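For context, strings are hashable, so @lru_cache can memoize a distance function directly. A toy sketch (the recursive Levenshtein here is only for illustration, not the code in this PR):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def distance(a: str, b: str) -> int:
    # Naive recursive Levenshtein; the cache turns the exponential
    # recursion into O(len(a) * len(b)) work and, more importantly for
    # FCA, makes repeated comparisons of the same line pair free.
    if not a:
        return len(b)
    if not b:
        return len(a)
    cost = a[-1] != b[-1]
    return min(
        distance(a[:-1], b) + 1,          # deletion
        distance(a, b[:-1]) + 1,          # insertion
        distance(a[:-1], b[:-1]) + cost,  # substitution
    )
```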

Performance

At the moment it takes several minutes to analyse real-world pairs of ground truth and OCR data.

@mikegerber mikegerber self-assigned this Nov 11, 2020
@mikegerber (Member) commented:

Awesome! I'll try to review it in the next few days.

@b2m (Contributor, Author) commented Nov 12, 2020

Note that the flexible character accuracy in the current implementation will take minutes on real-world data (e.g. the ground truth provided in the tests data folder).

I will try to investigate, but I suspect that it is a mix of:

- original algorithm ✔️
- my implementation ✔️
- unfortunate caching ✔️
- editops calculation ✅
- Python ✔️

Update: all confirmed and now the performance is "reasonable".

@b2m (Contributor, Author) commented Nov 19, 2020

Based on my local benchmarking, multiprocessing on 10 CPUs is estimated to run in about half the time while using ~4 times more CPU cycles.

The reason is that each process has its own cache, and flexible character accuracy relies heavily on caching.

So we can decide to continue with multiprocessing and maybe a shared cache, or keep the code slower but simpler... (and maybe optimize in the future by running multiple documents in parallel).
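A sketch of the trade-off (compute_fca and coefficient_sets are hypothetical stand-ins): each worker process has its own memory, so an @lru_cache used inside the algorithm is also per-process, and line pairs cached by one worker are recomputed by the others; that is where the ~4x extra CPU cycles go.

```python
from multiprocessing import Pool

def compute_fca(coefficients: dict) -> float:
    # Hypothetical stand-in: run the full algorithm for one coefficient
    # set. Any @lru_cache used inside is local to the worker process.
    return 0.0  # placeholder score

coefficient_sets = [{"min_dist": d} for d in range(15, 31, 5)]

if __name__ == "__main__":
    with Pool(processes=10) as pool:
        scores = pool.map(compute_fca, coefficient_sets)
    print(max(scores))
```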

@mikegerber (Member) commented:

This is on the agenda for 2021-01, sorry for not doing it earlier :)

@jbarth-ubhd commented Feb 17, 2021

If there were enough CPU power, one could perhaps calculate editDist() for t2.length() ± n characters.

@b2m (Contributor, Author) commented Feb 17, 2021

> If there were enough CPU power, one could perhaps calculate editDist() for t2.length() ± n characters.

FCA uses a sliding window to compare a shorter string with a longer one, taking insertions and deletions into account.
For example, a is compared three times with cba:

a__   _a_   __a
cba   cba   cba

So there is no need to extend the window by n characters.
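A toy sketch of that sliding window (using a Hamming-style comparison per offset for brevity; the real FCA applies a proper edit distance per window):

```python
def window_distances(short: str, long: str):
    # Slide the shorter string across the longer one and compare at
    # each offset; the minimum is the best window match.
    for offset in range(len(long) - len(short) + 1):
        window = long[offset : offset + len(short)]
        yield offset, sum(a != b for a, b in zip(short, window))

print(list(window_distances("a", "cba")))  # [(0, 1), (1, 1), (2, 0)]
```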

@b2m (Contributor, Author) commented Apr 24, 2021

Note that this pull request is still blocked by the performance of calculating the distance between two strings in dinglehopper (see #48 for details on that).

@mikegerber (Member) commented:

> Note that this pull request is still blocked by the performance of calculating the distance between two strings in dinglehopper (see #48 for details on that).

Or, expressed in a more positive manner: I am looking forward to getting the necessary features into RapidFuzz and then getting this PR merged too 😄

@b2m b2m closed this Aug 13, 2021