
Switch from custom Levenshtein to python-Levenshtein #48

Closed · wants to merge 4 commits

Conversation

@b2m (Contributor) commented Nov 16, 2020

As the distance and editops calculation is a performance bottleneck in this application, I replaced the custom Levenshtein implementation with the C implementation from the python-Levenshtein package.

We now also have separate entrypoints for texts with and without Unicode normalization. For example, when calculating the flexible character accuracy in #47, the normalization can be done more efficiently once during preprocessing.
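
For illustration, a minimal sketch of this split between normalized and raw entrypoints; the function names here are illustrative, not dinglehopper's actual API, while `Levenshtein.distance` and `unicodedata.normalize` are the real calls:

```python
import unicodedata

import Levenshtein  # the C extension from the python-Levenshtein package


def distance_normalized(s1: str, s2: str) -> int:
    # Entrypoint with Unicode normalization: normalize once here,
    # then delegate to the fast C implementation.
    return Levenshtein.distance(
        unicodedata.normalize("NFC", s1),
        unicodedata.normalize("NFC", s2),
    )


def distance_raw(s1: str, s2: str) -> int:
    # Entrypoint without normalization, for callers that already
    # normalized their texts once during preprocessing (e.g. #47).
    return Levenshtein.distance(s1, s2)
```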

Here are some benchmarks for the behaviour before and after this change using data already available in the repository:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| before brochrnx_73075507X | 3.359 ± 0.039 | 3.305 | 3.408 | 8.29 ± 0.13 |
| after brochrnx_73075507X | 0.405 ± 0.004 | 0.399 | 0.409 | 1.00 |
| before actevedef_718448162 CALAMARI | 34.410 ± 0.561 | 33.918 | 35.362 | 84.91 ± 1.61 |
| after actevedef_718448162 CALAMARI | 0.926 ± 0.010 | 0.911 | 0.935 | 2.28 ± 0.03 |
| before actevedef_718448162 TESS | 34.103 ± 0.305 | 33.685 | 34.529 | 84.16 ± 1.11 |
| after actevedef_718448162 TESS | 0.909 ± 0.008 | 0.899 | 0.921 | 2.24 ± 0.03 |

[Boxplot generated by hyperfine]

Note that with these changes we can only compare Unicode documents with a total maximum of 1 114 111 unique characters (or words). The limit comes from Python's chr function, but I suspect this will not be an issue now or in the near future.
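
The chr-based encoding looks roughly like this. A minimal sketch of the idea: the helper name transform_lists is taken from the review discussion below, its body here is an assumption, and only `Levenshtein.distance` is the real python-Levenshtein call:

```python
import Levenshtein  # python-Levenshtein's C extension


def transform_lists(seq1, seq2):
    # Map every distinct element (e.g. a word) to a unique one-character
    # string via chr(). chr() accepts code points 0 to 1114111, which is
    # where the limit on unique elements mentioned above comes from.
    # (A robust version would also skip the surrogate range.)
    mapping = {}

    def encode(seq):
        for token in seq:
            if token not in mapping:
                mapping[token] = chr(len(mapping))
        return "".join(mapping[token] for token in seq)

    return encode(seq1), encode(seq2)


a, b = transform_lists("the quick brown fox".split(), "the brown fox".split())
assert Levenshtein.distance(a, b) == 1  # one word was deleted
```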

@mikegerber (Member) left a comment


Thanks for the contribution! To be honest this would be a lot easier to review if you hadn't refactored all of the relevant tests.

@b2m (Contributor, Author) commented Nov 17, 2020

> Thanks for the contribution! To be honest this would be a lot easier to review if you hadn't refactored all of the relevant tests.

Yes... I refactored the tests (1st commit) to be able to test the original implementation and the new implementation in parallel (2nd commit), and then replaced the original implementation (3rd commit).

I then messed it up during the second commit because I overlooked some tests.

So you can still go back to the second commit (fa617e8) and see both implementations working in parallel.

@kba (Contributor) left a comment


Learned something new (https://pypi.org/project/multimethod/), well-tested and summarized. Great PR IMHO.

@mikegerber self-assigned this Nov 17, 2020
@mikegerber (Member) commented

I had a Gitter chat with @b2m: I'll 1. cherry-pick the test refactoring and 2. see if I can come up with a Cython implementation of the Wagner-Fischer algorithm, because I am not so super happy with the proposed transform_lists. If that is not working out, I'll try to be happy with the encode-list-elements-as-strings hack ;-)

(Best solution: Add a generic solution to python-Levenshtein)
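
For reference, the Wagner-Fischer recurrence mentioned above, as a generic pure-Python sketch that works on arbitrary sequences (lists of words included); a Cython version would follow the same two-row scheme. This is textbook code, not the implementation from this PR:

```python
def wagner_fischer(a, b) -> int:
    # Edit distance by the classic dynamic program, keeping only two
    # rows of the matrix: O(len(a) * len(b)) time, O(len(b)) memory.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


assert wagner_fischer("kitten", "sitting") == 3
assert wagner_fischer("a few words".split(), "a few other words".split()) == 1
```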

@mikegerber (Member) commented

There is another remark I should make: there are other sequence matching algorithms, and I am not sure yet whether I want them in here, because they do not necessarily output the optimal solution.

@b2m (Contributor, Author) commented Nov 17, 2020

> (Best solution: Add a generic solution to python-Levenshtein)

I agree that the best solution would be a contribution to the python-Levenshtein package,
but I lack the experience in C to do this at a production-grade level 😞.

@mikegerber (Member) commented

This is on the agenda for 2021-01, sorry for not doing it earlier :) Merry Christmas!

(I actually have a naive Cython implementation which is faster than the pure Python implementation. More on that in January!)

b2m added 4 commits February 16, 2021 11:26
As the distance and editops calculation is a performance bottleneck in
this application, we replaced the custom Levenshtein implementation with
the C implementation in the python-Levenshtein package.

We now also have separate entrypoints for texts with and without Unicode
normalization, because this can also be done more efficiently once during
preprocessing.
@maxbachmann (Contributor) commented

One thing to keep in mind is that python-Levenshtein is GPL licensed. When using an external Levenshtein distance implementation, it might be better to use a library with a more fitting license. I did a performance comparison in the documentation of my own library (RapidFuzz), https://maxbachmann.github.io/RapidFuzz/string_metric.html#levenshtein, and AFAIK there are multiple implementations that are more permissively licensed and faster than python-Levenshtein.
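
For reference, a sketch of the RapidFuzz call, assuming a current release (the module layout changed after this comment was written, and editops support was only added later, as the thread below shows):

```python
from rapidfuzz.distance import Levenshtein

assert Levenshtein.distance("kitten", "sitting") == 3
# Unlike python-Levenshtein (GPLv2), RapidFuzz is MIT licensed;
# newer releases also provide Levenshtein.editops.
```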

@mikegerber (Member) commented Apr 12, 2021

@maxbachmann Thanks, I will definitely check it out! Does your implementation allow for arbitrary sequences, or is it limited to str? If you know of any other implementations, please feel free to tell me :) (There are a lot of sequence alignment algorithms, but I don't know exactly how they behave, so I opted for the basic "optimal" algorithm, i.e. the shortest alignment.)

@mikegerber (Member) commented

@maxbachmann To elaborate a bit:

In dinglehopper, we need to align both sequences of Unicode characters and also sequences of words. (We need to align those, too, not only compute the edit distance.)

When aligning strings (sequences of Unicode characters), we also need to address a problem that comes up especially when dealing with OCR of historic prints: characters that cannot be represented by a single code point. For example, in "Schlym̃" there is a code point sequence/grapheme cluster LATIN SMALL LETTER M + COMBINING TILDE that cannot be normalized to a single code point. This needs to be treated as one "character" when aligning/calculating, one way or another.
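
One way to segment such strings into grapheme clusters before aligning. A sketch using the third-party regex package, whose \X pattern matches an extended grapheme cluster; this is an illustration, not necessarily how dinglehopper handles it:

```python
import regex  # third-party package; the stdlib re module has no \X

word = "Schlym\u0303"  # "Schlym̃": ends in m + COMBINING TILDE
assert len(word) == 7                    # 7 code points
clusters = regex.findall(r"\X", word)
assert len(clusters) == 6                # "m̃" is one grapheme cluster
# Aligning over `clusters` (a list of strings) instead of the raw
# string treats "m̃" as a single "character".
```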

@maxbachmann (Contributor) commented Apr 13, 2021

  1. At least right now, RapidFuzz only calculates the edit distance and does not have a feature to return the editops. It mostly does not exist because I did not have the need for it yet. An implementation in RapidFuzz would probably perform pretty much exactly as well as it does in python-Levenshtein, since it simply extracts one possible path through the Levenshtein matrix.
  2. Both python-Levenshtein and RapidFuzz currently only work with strings. By contrast, https://github.com/roy-ht/editdistance calculates the edit distance between two sequences whose elements only have to be hashable, which would work with a list of words as well. I opened rapidfuzz/RapidFuzz#100 (Support distance calculation for types other than strings) to discuss a potential implementation.
  3. Similar to python-Levenshtein, RapidFuzz works with single code points. So e.g. "Schlym̃", which is represented as 7 characters in Python, will be interpreted as 7 characters by RapidFuzz as well. I guess when adding the extension from 2) you could pass the clustered version into the algorithms, which would work as expected. (Especially when you use a string multiple times, it would make sense to pass an array.array('Q', ...) so the hashes do not have to be calculated multiple times; see the sketch below.)
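
A sketch of the pre-hashing idea from point 3, using Python's built-in hash; illustrative only, since a real implementation would have to consider hash collisions mapping distinct clusters to the same value:

```python
from array import array


def cluster_hashes(clusters):
    # hash() returns a signed integer; mask it into the unsigned
    # 64-bit range so it fits array type 'Q' (unsigned 8-byte ints).
    return array("Q", (hash(c) & 0xFFFFFFFFFFFFFFFF for c in clusters))


hashes = cluster_hashes(["S", "c", "h", "l", "y", "m\u0303"])
# `hashes` can be compared element-wise many times without
# re-hashing the underlying grapheme clusters.
```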

@mikegerber (Member) commented

Alright, I decided something:

  • I am going to merge this PR in the coming days, even though I am not happy with some design decisions, e.g. encoding arbitrary data as a string. The quality of the PR is exceptionally good; it's just that design that held me up.
  • This is provided it solves some of the performance problems we observed (@cneud mentioned 2 hours processing time for one newspaper page 🐌)
  • I might do some work on this when merging
  • If @maxbachmann's RapidFuzz provides (in the future) editops and the ability to process arbitrary sequences/array('Q'), we'll switch to that

@b2m (Contributor, Author) commented Apr 23, 2021

We still have the GPL problem @maxbachmann mentioned.

Dinglehopper is licensed as Apache 2 and therefore cannot use a GPL-licensed Python module (AFAIK).

I would consider this a showstopper for this PR, and I have not found a suitable replacement that offers the same functionality (edit operations + edit distance).

@maxbachmann (Contributor) commented

> If @maxbachmann's RapidFuzz provides (in the future) editops and the ability to process arbitrary sequences/array('Q'), we'll switch to that

I am currently working on this. Will let you know when it is added.

@mikegerber (Member) commented

@b2m I'll investigate this. I hate licensing problems :-) (My last review comment is weird; it's old and I don't know how to delete it :) )

@maxbachmann Awesome!

@mikegerber (Member) commented

While I am not a lawyer, I concluded that using the python-Levenshtein library (GPLv2) constitutes "linking" the library, and we are not allowed to do that without making dinglehopper GPL, too. So the python-Levenshtein library cannot be used, unfortunately. (Using GPL tools in e.g. Makefiles seems to be fine, though, as these tools are run as separate programs.)

I'll make my team aware of this problem. @cneud

@kba This might be of interest for the OCR-D community, too.

Side note about GPLv3: as I understand this page, we also can't use GPLv3-licensed libraries in our Apache-licensed software. (It would work the other way around, but AFAICT only with GPLv3, not GPLv2.)

@mikegerber mentioned this pull request Apr 24, 2021
@maxbachmann (Contributor) commented

> While I am not a lawyer, I concluded that using the python-Levenshtein library (GPLv2) constitutes "linking" the library, and we are not allowed to do that without making dinglehopper GPL

GPL and Python are pretty much an open topic, since I am not aware of any lawsuit on the topic. Generally GPLv2/3 applies when linking the library (both static and dynamic), but it is unclear what a Python import counts as, so it is unclear how the GPL has to be applied in Python. However, it is definitely simpler to use a library with a more permissive license, since that completely avoids these issues. As a note, FuzzyWuzzy, for example, was originally MIT licensed but became GPL licensed because they use python-Levenshtein and were unsure about this as well.

@mikegerber (Member) commented

> GPL and Python are pretty much an open topic, since I am not aware of any lawsuit on the topic. Generally GPLv2/3 applies when linking the library (both static and dynamic), but it is unclear what a Python import counts as.

I am almost convinced that importing a Python library is essentially dynamic linking, but I agree it is not 100% clear. However, I have no doubt in this case because of the shared object site-packages/Levenshtein/_levenshtein.cpython-38-x86_64-linux-gnu.so ;-)

I just checked all other (first-level) dependencies, and all are licensed under less strict licenses (MIT, Apache, BSD). So I think we will go down this route for this aspect of the software as well. (Not discussed with the team yet, but I doubt there is anyone on my team who wants to fist-fight the Free Software Foundation over this.)

@b2m (Contributor, Author) commented Apr 24, 2021

@mikegerber License checking could also become part of CI/CD. If this is of interest, feel free to open an issue and I will have a look into it. I usually did this for Java and with GitLab, but I guess the knowledge is transferable to CircleCI or GitHub Actions =)
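
To illustrate the idea, a minimal Python sketch over the installed environment using importlib.metadata; a real CI job would rather use a dedicated tool such as pip-licenses, and the substring check below is only a heuristic, not legal advice:

```python
from importlib.metadata import distributions

# Heuristic scan for GPL-licensed dependencies. Package metadata is
# often incomplete (classifiers are ignored here), so treat this as
# a first pass only.
for dist in distributions():
    license_ = dist.metadata.get("License") or "UNKNOWN"
    if "GPL" in license_ and "LGPL" not in license_:
        raise SystemExit(f"{dist.metadata['Name']} is licensed under {license_}")
```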

Should I then close this pull request and extract the necessary information from this thread into a separate ticket?

@kba (Contributor) commented Apr 25, 2021

> @mikegerber License checking could also become part of CI/CD. If this is of interest, feel free to open an issue and I will have a look into it. I usually did this for Java and with GitLab, but I guess the knowledge is transferable to CircleCI or GitHub Actions

I am fairly certain that we're not tainted by license incompatibilities anywhere, but it would be nice to be more than "fairly certain". So if you do set up something like this, let me know; it might be useful to do this as part of the CI/CD of ocrd_all.

@mikegerber (Member) commented

> @mikegerber License checking could also become part of CI/CD. If this is of interest, feel free to open an issue and I will have a look into it. I usually did this for Java and with GitLab, but I guess the knowledge is transferable to CircleCI or GitHub Actions =)

I've reopened #54 for this!

> Should I then close this pull request and extract the necessary information from this thread into a separate ticket?

Yes, that would be nice!

@b2m (Contributor, Author) commented Apr 26, 2021

Closing this pull request; I opened #56 to keep track of a more appropriate solution.

@b2m closed this Apr 26, 2021