Add flexible character accuracy #47
Conversation
Awesome! I'll try to review it in the next few days.
Note that the flexible character accuracy in the current implementation takes minutes on real-world data (e.g. the ground truth provided in the test data folder). I will try to investigate, but I suspect that it is a mix of: …
Update: all confirmed, and the performance is now "reasonable".
So using … The reason is that each process has its own cache, and flexible character accuracy relies heavily on caching. So we can decide to continue with …
This is on the agenda for 2021-01, sorry for not doing it earlier :)
As OCR-D continues to support Python 3.5 until the end of this year, version-specific data structures have been implemented. Once support for Python 3.5 is dropped, the extra file can easily be removed.
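For illustration, a rough sketch of how such a version switch could look (the module and class names here are made up, not the ones from this PR):

```python
import sys

# Minimal sketch with hypothetical module names: keep the Python 3.5
# data structures in a separate file so that the whole file can be
# deleted once OCR-D drops 3.5 support.
if sys.version_info >= (3, 6):
    from .flexible_character_accuracy_ds import PartVersionSpecific
else:
    from .flexible_character_accuracy_ds_35 import PartVersionSpecific
```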
As the FCA implementation already knows the edit operations for each segment, we use a different sequence alignment method.
Temporarily switch to the C implementation of python-Levenshtein for the editops calculation. Also added some variables, caching, and type changes for performance gains.
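For reference, a minimal sketch of the editops call in python-Levenshtein (the example strings are arbitrary):

```python
import Levenshtein

# editops() is implemented in C and returns the minimal sequence of
# edit operations as (operation, source_pos, destination_pos) tuples,
# where operation is one of "replace", "insert" or "delete".
ops = Levenshtein.editops("flexible", "flexbile")
for op, src_pos, dst_pos in ops:
    print(op, src_pos, dst_pos)
```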
If there were enough CPU power, one could perhaps calculate editDist() of …
FCA uses a sliding window to compare a shorter string with a longer one, taking insertions and deletions into account.
So there is no need to extend the window by n characters.
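A rough sketch of the sliding-window idea (Levenshtein.distance stands in here for whatever distance function is actually used):

```python
import Levenshtein

def best_window_distance(needle: str, haystack: str) -> int:
    # Slide a window of len(needle) over the longer string and keep the
    # smallest edit distance. Because the edit distance itself accounts
    # for insertions and deletions, the window does not have to be
    # extended by extra characters.
    best = len(needle)  # upper bound: replace every character
    for offset in range(len(haystack) - len(needle) + 1):
        window = haystack[offset : offset + len(needle)]
        best = min(best, Levenshtein.distance(needle, window))
    return best
```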
Note that this pull request is still blocked by the performance of calculating the distance between two strings in dinglehopper (see #48 for details on that).
Or expressed in a more positive manner: I am looking forward to getting the necessary features into RapidFuzz and then getting this PR merged too 😄
This is a first draft for adding the flexible character accuracy as suggested by @cneud in #32.
There are still some open topics, so I opened this pull request as a draft so you may already comment on some issues.
Handling of coefficients
The algorithm uses a "range of coefficients for penalty calculation" (see Table 1 in the paper).
Should we make the coefficients configurable? If so, we might need a configuration file because handling 12 additional parameters on the command line is quite messy.
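To give an idea of the shape of such a configuration, a hypothetical sketch (the coefficient names and values are placeholders, not the ones from Table 1):

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Coefficients:
    # Hypothetical names; the paper defines the actual coefficients.
    edit_dist: float
    length_diff: float
    offset: float
    length: float

# The "range of coefficients" then becomes the cross product of the
# candidate values for each coefficient:
COEFFICIENT_SETS = [
    Coefficients(e, ld, o, l)
    for e, ld, o, l in product([15.0, 20.0, 25.0], [1.0], [1.0], [1.5, 2.0])
]
```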
The runs for each set of coefficients are also a good place for parallelization.
Should we include parallelization at this point, or is this a non-issue because the processors in OCR-D workflows are typically already busy doing other things?
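If we did parallelize, it could look roughly like this (run_with_coefficients is a hypothetical entry point, and note the per-process cache issue mentioned above):

```python
from concurrent.futures import ProcessPoolExecutor

def run_with_coefficients(coeffs):
    # Hypothetical: run the full FCA matching for one coefficient set
    # and return something comparable, e.g. an (accuracy, result) tuple.
    ...

def best_run(coefficient_sets, max_workers=4):
    # Caveat: every worker process gets its own cache, and FCA relies
    # heavily on caching, so the gains may be smaller than expected.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(run_with_coefficients, coefficient_sets)
    return max(results)
```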
Penalty and distance functions
The algorithm depends a lot on a penalty and a distance function.
From a library point of view, I would like them to be exchangeable with other functions.
But from an ocrd-processor/CLI perspective this is not really useful.
So should we make the distance and penalty functions configurable?
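From the library side, exchangeability could be as simple as accepting callables with sensible defaults (a sketch with illustrative names, not the actual signature):

```python
from typing import Callable

import Levenshtein

def default_distance(a: str, b: str) -> int:
    return Levenshtein.distance(a, b)

def default_penalty(*args) -> float:
    # Placeholder for the penalty function from the paper.
    return 0.0

def flexible_character_accuracy(
    gt: str,
    ocr: str,
    distance_fn: Callable[[str, str], int] = default_distance,
    penalty_fn: Callable[..., float] = default_penalty,
):
    # Library users can swap in their own functions; the
    # ocrd-processor/CLI would just stick to the defaults.
    ...
```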
Use of caching
Because of regularly splitting lines and repeating runs with different coefficients, the algorithm needs a lot of caching to speed up the process. The easiest way to do this with Python < 3.9 is to use the @lru_cache decorator.
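For example (a minimal sketch, not the actual cached functions from this PR):

```python
from functools import lru_cache

import Levenshtein

@lru_cache(maxsize=None)  # functools.cache only exists from Python 3.9 on
def cached_distance(a: str, b: str) -> int:
    # Splitting lines and re-running with different coefficients hits
    # the same string pairs over and over, so memoization pays off.
    return Levenshtein.distance(a, b)
```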
The performance could benefit from a custom-tailored cache, but it would also add more code that we have to maintain.
Performance
At the moment it takes several minutes to analyse real-world pairs of ground truth and OCR data.